ESMITHY.NET - SQLAlchemy Cascade Delete: Clarity through Examples

SQLAlchemy Cascade Delete: Clarity through Examples

By Eric — Sat 20 June 2020 — 11 minute read

Since there is overlapping "cascade delete" functionality supported by SQLAlchemy -- with the ORM handling some deletes and the database itself handling others -- it can be hard to know the right way to set it up. Here are some examples to help clarify how it all works.

Starting Example

Here we have a couple of models: A Project which can have several Tasks associated with it. For now, we don't have any kind of cascades set up.

from sqlalchemy import create_engine, Column, Integer, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship
from sqlalchemy.orm.session import sessionmaker

Base = declarative_base()


class Project(Base):
    __tablename__ = "project"
    id = Column(Integer, primary_key=True)
    tasks = relationship("Task", back_populates="project")


class Task(Base):
    __tablename__ = "task"
    id = Column(Integer, primary_key=True)
    project_id = Column(Integer, ForeignKey("project.id"))
    project = relationship("Project", back_populates="tasks")


engine = create_engine("postgresql:///cascade")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()

project = Project(tasks=[Task(), Task()])
session.add(project)
session.commit()
session.delete(project)
session.commit()

Note: To run this code, you'll need a database. I'm using PostgreSQL, so I first ran createdb cascade from the shell.

When the code deletes project, SQLAlchmemy's default behavior is to explicitly set task.project_id to None (or NULL) for each task that belongs to the project:

UPDATE task SET project_id=%(project_id)s WHERE task.id = %(task_id)s
({'project_id': None, 'task_id': 1}, {'project_id': None, 'task_id': 2})
DELETE FROM project WHERE project.id = %(id)s
{'id': 1}

If we don't want tasks unassociated with a project floating around, we'll require that a project_id be set by adding nullable=False:

    project_id = Column(Integer, ForeignKey("project.id"), nullable=False)

Note: If you're following along, remember that to change the schema, you'll want to re-create the database by running dropdb cascade; createdb cascade. Or you could put in a call to Base.metadata.drop_all(engine).

Running the updated code results in an exception:

psycopg2.errors.NotNullViolation: null value in column "project_id"
violates not-null constraint

This is a good thing. It means that the database is enforcing the integrity of the data by preventing a task without a project. So how do we delete a project?

You could do this (but don't):

for task in project.tasks:
    session.delete(task)
session.delete(project)
session.commit()

DELETE FROM task WHERE task.id = %(id)s
({'id': 3}, {'id': 4})
DELETE FROM project WHERE project.id = %(id)s
{'id': 2}
COMMIT

That works, but using a cascade removes the tasks consistently without the extra code.

ORM Cascades

The ORM can do the same task deletions automatically if we add a cascade="delete" to the right relationship. But we've got two relationships -- which is the right one? Answer: It goes on the tasks relationship in the Project model:

class Project(Base):
    __tablename__ = "project"
    id = Column(Integer, primary_key=True)
    tasks = relationship(
        "Task", back_populates="project", cascade="delete, merge, save-update"
    )

DELETE FROM task WHERE task.id = %(id)s
({'id': 1}, {'id': 2})
DELETE FROM project WHERE project.id = %(id)s
{'id': 1}
COMMIT

What if you accidentally put the cascade="delete" on the wrong relation? The cascade wouldn't work when deleting the Task object, so you'd get the NotNullViolation again when there are left-over tasks with no project.

Gotcha: merge, save-update

You probably noticed that I set the cascade parameter to "delete, merge, save-update" rather than just "delete". This is because the ORM has other cascade behaviors aside from "delete", and "merge, save-update" are the ones that are on by default. If you set cascade="delete", you're turning off some of the default behavior of the ORM.

Using "save-update" as an example, you may know that with SQLAlchemy you don't normally have to explicitly call session.add() on every single object to add it to the database. If you had to do that, it would look like this:

task1 = Task()
session.add(task1)
task2 = Task()
session.add(task2)
project = Project(tasks=[task1, task2])
session.add(project)
session.commit()

Instead, since there is a relationship from a project to its tasks, we can do the simpler:

project = Project(tasks=[Task(), Task()])
session.add(project)
session.commit()

It is "save-update" on the relationship that enables this. If you remove "save-update" from the cascade and run the second code block, the INSERTs won't happen and SQLAlchemy will emit a warning:

SAWarning: Object of type <Task> not in session, add operation along
'Project.tasks' will not proceed

You might also want to look at the shorthand "all" value for cascade (which, I'm assuming for some backward compatibility reason, isn't actually all the allowed values), since "all, delete-orphan" is a common value when you want to do cascade deletes.

You can read more about what all the values mean in the Cascades section of the documentation.

Database Cascades

To have the database delete the tasks itself, we can add ondelete="CASCADE" to Task.project_id foreign key. This changes the DDL of the foreign key, which we can see with a shell into the database:

cascade> \d task
+------------+---------+----------------------------------------------------+
| Column     | Type    | Modifiers                                          |
|------------+---------+----------------------------------------------------|
| id         | integer |  not null default nextval('task_id_seq'::regclass) |
| project_id | integer |  not null                                          |
+------------+---------+----------------------------------------------------+
Indexes:
    "task_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "task_project_id_fkey" FOREIGN KEY (project_id) REFERENCES project(id) ON DELETE CASCADE

Time: 0.020s

If we only add ondelete="CASCADE", however, we get another 'NotNullViolation' when running the code:

sqlalchemy.exc.IntegrityError: (psycopg2.errors.NotNullViolation) null value
in column "project_id" violates not-null constraint

This is a situation where some real confusion can set in, and what I and some of my colleagues have done in the past is flail around a while until we end up turning ORM-level cascades back on, declare victory and move on. The ondelete="CASCADE" might remain, but it is never used since the ORM will have done the deletions before the database gets a chance.

The integrity error happens because the ORM still sets the project_id of each task to None:

UPDATE task SET project_id=%(project_id)s WHERE task.id = %(task_id)s
({'project_id': None, 'task_id': 1}, {'project_id': None, 'task_id': 2})

The key is that, we need to additionally set passive_deletes=True on the Project.tasks relationship, which disables the ORM from loading each related task and setting project_id to None.

class Project(Base):
    __tablename__ = "project"
    id = Column(Integer, primary_key=True)
    tasks = relationship("Task", back_populates="project", passive_deletes=True)


class Task(Base):
    __tablename__ = "task"
    id = Column(Integer, primary_key=True)
    project_id = Column(
        Integer, ForeignKey("project.id", ondelete="CASCADE"), nullable=False
    )
    project = relationship("Project", back_populates="tasks")

Now the ORM only emits a delete for the project row:

DELETE FROM project WHERE project.id = %(id)s
{'id': 1}
COMMIT

The database took care of deleting the tasks itself:

cascade> SELECT * from task;
+------+--------------+
| id   | project_id   |
|------+--------------|
+------+--------------+
SELECT 0
Time: 0.020s

Which to Use?

The SQLAlchemy documentation says:

Database level ON DELETE cascade is vastly more efficient than that of SQLAlchemy.

Which is totally fine since, as the documentation continues to point out, taking advantage of database cascades within SQLAlchemy is pretty smooth.

Another reason to choose database cascades is that they still happen even if you want to break out into some raw SQL at some point.

Those reasons are compelling. Why did SQLAlchemy even bother to implement ORM cascades? A few reasons I thought of are:

Maybe you have an existing schema that you're not free to change for some reason, but you still want automatic cascade deletes in your code.
One of the models you'd like to cascade uses joined table inheritance and you want to avoid a "half deleted" object.
Some databases don't support (or don't support by default) FOREIGN KEY, and therefore ON DELETE CASCADE.
Maybe delete was added for completeness along with the other cascade behaviors that only make sense in the context of an ORM.
Other ORMs do it, so there is precedence.

Summary

Database-level cascade deletes are set on ForeignKey:

project_id = Column(
    Integer, ForeignKey("project.id", ondelete="CASCADE"), nullable=False
)

This says: ON DELETE of the row I'm referencing CASCADE and delete this row too. Remember passive_deletes=True!

ORM-level cascade deletes are set on relationship:

tasks = relationship(
    "Task", back_populates="project", cascade="all, delete-orphan"
)

This says: Cascade the deletion of this object to these related objects.

Here is the final code example:

from sqlalchemy import create_engine, Column, Integer, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship
from sqlalchemy.orm.session import sessionmaker

Base = declarative_base()


class Project(Base):
    __tablename__ = "project"
    id = Column(Integer, primary_key=True)
    tasks = relationship("Task", back_populates="project", passive_deletes=True)


class Task(Base):
    __tablename__ = "task"
    id = Column(Integer, primary_key=True)
    project_id = Column(
        Integer, ForeignKey("project.id", ondelete="CASCADE"), nullable=False
    )
    project = relationship("Project", back_populates="tasks")


engine = create_engine("postgresql:///cascade")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()

project = Project(tasks=[Task(), Task()])
session.add(project)
session.commit()
session.delete(project)
session.commit()