Databases: Soft Deletion

To delete, or not to delete, that is the question.

Introduction

The most important thing in the kitchen is the waste paper basket and it needs to be centrally located.

Donald Knuth

In this one, we're taking a look at an important database pattern: soft deletion. It's important because it's considered the norm in most codebases, not necessarily as a function of its merit.

If you're working on a large codebase, creating new tables and writing queries, this should be something you're familiar with. If you're not, or you want to improve on your existing knowledge, read on.

Theory

Soft deletion is typically used for one thing: retaining data after it's been deleted. Instead of doing a traditional database DELETE on a record in any given table, you still want to retain the data for further use.

As a result, you need to find another way to do this, usually by adding a new column to indicate deletion. This might come in the form of a boolean is_deleted column, or a timestamp such as deleted_at.

Implementation

Let's say we have a table called user. The easiest way to delete stuff in a relational database it to do just that: delete it.

Let's delete a specific user with an imaginary ID (represented by the $1 parameter).

-- Do a hard deletion.
DELETE FROM user
WHERE user_id = $1

After running this, the user is now gone forever. You could roll back to previous version, but you'll lose any new changes made during the time it takes you to do that. This can be really bad.

  • What if you just launched a big marketing campaign and you're getting a bunch of new activity?
    • You could lose the data of thousands of potential customers, destroy trust, confuse your existing users.
    • People don't think "they probably deleted something on accident". They think your product is flakey and untrustworthy.
  • What if you work at a large company, and your service handles thousands of write requests per minute?
    • In the thirty minutes it takes for you to re-deploy your backup, hundreds of thousands (even millions) of changes could be destroyed forever.
    • If you want to retain this lost data, your recovery strategy needs to incorporate some sort of data-stiching with rules for what data takes precedence.

Don't have a recovery strategy? You could lose millions of changes. For obvious reasons outside of the scope of this article, big companies have plenty of replicas and fallbacks. But, if a company's infrastructure prioritizes availability over consistency (think Instagram, for example), a SQL DELETE might erase this data permanently before their database replicas can achieve "eventual consistency". A soft-deletion doesn't help with consistency challenges, but I'd rather have a column mismatch than a missing record.

Enter soft deletion.

-- Do a soft deletion.
UPDATE user
SET deleted_at = NOW()
WHERE user_id = $1

Your origin database retains the record. If the deletion occurred by accident, this can easily be reversed with another update query. If a user accidentally deletes some data through your application, you can retrieve it. If a script goes awry and deletes a bunch of users, this would be easy enough to reverse if they were soft-deleted.

-- Reverse a soft deletion.
UPDATE user
SET deleted_at = NULL
WHERE user_id = $1

Considerations

However, some issues still remain. Don't we need to filter out these user records across all queries? In this case, the answer is a resounding yes. For each query you do that involves the table in question, user, you need to add WHERE deleted_at IS NULL to the end.

This becomes trickier when you have hundreds or thousands of different queries. The typical solution to this in most applications is to append all queries with this clause, ideally through an ORM or query builder instead of raw SQL (think human error). Engineers need to be extra careful to write sturdy tests for these queries, and avoid regressions at all costs.

In the vein of consistency, what happens if one of your database replicas contains old data? In most cases, especially if a deletion is user-initiated, a few seconds of latency is acceptable. But what if it's something that needs to be achieved quickly?

Extending the idea of delete_at, what if you had another column banned_at to indicate a user is banned from your service. What happens if one service pod reads from a database replica that hasn't been synced yet? Your banned user might still have access.

This is unlikely in most setups in the real world, and if you were writing logic this sensitive, your service would probably prioritize consistency over availability in this case. But, it's still something to chew on. Can you think of other examples where this might happen?

Conclusion

This is all to say: tradeoffs. Both solutions have pros and cons, and I think both solutions suffer from some of the same problems. Doing a full SQL delete doesn't solve distributed systems challenges, and soft-deletion bleeds complexity from your database into your application code.

I still think soft-deletion is here to stay, for better or worse.