Big Data and data removal - truncate or delete?

Versions: PostgreSQL 11

When I started to work with data on my very first PHP and Java projects, I used only the DELETE statement to remove data. When I switched to (big) data engineering, I found more efficient ways to handle this operation through TRUNCATE or DROP.

In this post I will focus on the DELETE and TRUNCATE operations applied to data removal. In the first 2 sections, I will explain how they're executed on top of PostgreSQL 11. In the last part, I will compare both operations, mostly from a performance point of view.

DELETE on PostgreSQL

To see how DELETE is executed, I will start a Docker container and create a simple empty table:

docker run --name delete_truncate_test -e POSTGRES_PASSWORD=root -e POSTGRES_USER=root -e POSTGRES_DB=delete_truncate_test -d postgres:11

# access the container
# access the container
docker exec -ti delete_truncate_test psql -d delete_truncate_test

CREATE TABLE users (
   user_id serial PRIMARY KEY,
   username VARCHAR (50) UNIQUE NOT NULL,
   password VARCHAR (255) NOT NULL,
   created_on TIMESTAMP NOT NULL,
   last_login TIMESTAMP
);

And let's check now what happens for a DELETE operation without conditions:

delete_truncate_test=# EXPLAIN ANALYZE DELETE FROM users;
                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Delete on users  (cost=0.00..10.80 rows=80 width=6) (actual time=0.004..0.004 rows=0 loops=1)
   ->  Seq Scan on users  (cost=0.00..10.80 rows=80 width=6) (actual time=0.003..0.003 rows=0 loops=1)
 Planning Time: 0.087 ms
 Execution Time: 0.028 ms

As you can see, the engine does a sequential scan on the table and then applies the delete on top of it. When I saw that, I was quite surprised. "Why does the planner scan the table when there is no *WHERE* clause?" was my question. At first, I thought that I had made a mistake, but a quick look at the documentation proved the contrary:

TRUNCATE quickly removes all rows from a set of tables. It has the same effect as an unqualified DELETE on each table, but since it does not actually scan the tables it is faster.

So yes, DELETE scans the tables. Even though I'm far from being competent at writing C code, I'll give it a try and analyze what happens when a row is deleted. If I understood the code organization correctly, the function responsible for deletes is stored in the nodeModifyTable.c file and is called ExecDelete. The first thing to notice is the word "tuple", used here to represent the manipulated row:

 *        When deleting from a table, tupleid identifies the tuple to
 *        delete and oldtuple is NULL.  When deleting from a view,
 *        oldtuple is passed to the INSTEAD OF triggers and identifies
 *        what to delete, and tupleid is invalid.  When deleting from a
 *        foreign table, tupleid is invalid; the FDW has to figure out
 *        which row to delete using data from the planSlot.  oldtuple is
 *        passed to foreign table triggers; it is NULL when the foreign
 *        table has no relevant triggers.

Why this row-by-row execution? I found an answer in different blog posts saying that DELETE returns the number of removed rows. Actually, we can also find this in the source code, just here:

# Explains es_processed meaning
 *        There is no return value, but output tuples (if any) are sent to
 *        the destination receiver specified in the QueryDesc; and the number
 *        of tuples processed at the top level can be found in
 *        estate->es_processed.

# es_processed incremented in ExecDelete
    if (canSetTag)
        (estate->es_processed)++;

    /* Tell caller that the delete actually happened. */
    if (tupleDeleted)
        *tupleDeleted = true;
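That es_processed counter is what psql echoes back in the command tag after the statement. A quick way to see it from the container (user_1 and user_2 are just sample rows I invented for the illustration):

delete_truncate_test=# INSERT INTO users (username, password, created_on)
    VALUES ('user_1', 'secret', now()), ('user_2', 'secret', now());
INSERT 0 2
delete_truncate_test=# DELETE FROM users;
DELETE 2

The "2" in the DELETE tag is exactly the number of tuples accumulated row by row during the scan.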

But the execution is conditioned, so does it really work? To prove to myself that DELETE works on a row basis, I went to the ExecModifyTable function that triggers ExecDelete:

     /*
      * Fetch rows from subplan(s), and execute the required table modification
      * for each row.
      */
    for (;;)
# ...
        switch (operation)
# ...
            case CMD_DELETE:
                slot = ExecDelete(node, tupleid, oldtuple, planSlot,
                                  &node->mt_epqstate, estate,
                                  true, node->canSetTag,
                                  false /* changingPart */ , NULL, NULL);
# ...


TRUNCATE on PostgreSQL

Let's see now what happens for a TRUNCATE operation. The implementation is defined inside ExecTruncate in the tablecmds.c file. Since there are different use cases, like the one involving cascading, I will keep things simple in this post and focus only on what happens when a table is truncated. The answer hides a little bit further in the file, just here:

# ExecuteTruncateGuts
     * OK, truncate each table.
    mySubid = GetCurrentSubTransactionId();

    foreach(cell, rels)

When you analyze the code, or at least the comments as I did, you will see what TRUNCATE is all about. One of the first things the engine does is create new empty storage for the truncated table, with a call to RelationSetNewRelfilenode. Later, if the RESTART IDENTITY option was specified, the operation resets the sequences owned by the table. After that, a single entry, common for all tables involved in the TRUNCATE, is written to the Write-Ahead Log. At the end of the transaction, the old physical storage is removed, as indicated in RelationSetNewRelfilenode:

      * Schedule unlinking of the old storage at transaction commit.

Let's see now if it really happens like that:

delete_truncate_test=# SELECT pg_relation_filepath('users');
(1 row)

delete_truncate_test=# TRUNCATE TABLE users;
TRUNCATE TABLE
delete_truncate_test=# SELECT pg_relation_filepath('users');
(1 row)

As expected, the physical storage path for the users table changed after executing the TRUNCATE.

DELETE and TRUNCATE comparison

You can see that TRUNCATE operates more at the "metadata" level whereas DELETE works at the "data" one. But that's not the only difference and I would like to summarize a few of them, independently of the database.

A metadata operation should always be faster than a data one, right? In fact, no, or at least that's what was shown in TRUNCATE vs DELETE: Efficiently Clearing Data from a Postgres Table. For tables with big volumes of data, TRUNCATE will very often (always?) perform much better than DELETE. On the other side, the quoted article points out that it's DELETE which will perform better for small tables, because the execution time of DELETE depends on the number of rows whereas for TRUNCATE it's roughly constant.
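If you want to reproduce the comparison yourself, here is a minimal sketch on the users table (the 1 million rows is an arbitrary volume and the timings will obviously vary with the hardware; \timing is a psql feature):

-- populate the table with some volume
INSERT INTO users (username, password, created_on)
SELECT 'user_' || i, 'secret', now()
FROM generate_series(1, 1000000) AS i;

\timing on
DELETE FROM users;     -- execution time grows with the number of rows

-- repopulate the table the same way, then compare with:
TRUNCATE TABLE users;  -- roughly constant time, whatever the volume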

Another important difference is that TRUNCATE cleans the whole table: you cannot use it to remove, for instance, only a subset of rows matching a conditional expression. That's not the case for DELETE, which supports a WHERE clause.
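For instance, removing only the inactive users is expressible with DELETE but has no TRUNCATE equivalent (the 1-year threshold is an arbitrary example):

-- removes only the users inactive for more than a year;
-- TRUNCATE cannot express such a condition
DELETE FROM users
WHERE last_login < now() - INTERVAL '1 year';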

Also, data removal has a different meaning for both operations. TRUNCATE really deletes everything and, with the RESTART IDENTITY option, even resets the auto-incremented sequence values! On the other hand, DELETE marks the rows as "for deletion" but the space is actually freed up only when a VACUUM operation runs (probably PostgreSQL and Redshift-specific).
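In PostgreSQL both behaviors can be triggered explicitly; the sequence reset is controlled by the RESTART IDENTITY option (plain TRUNCATE keeps the current sequence values), and the space held by deleted rows comes back only after a vacuum:

-- resets the sequences owned by the table, e.g. users.user_id
TRUNCATE TABLE users RESTART IDENTITY;

-- after a DELETE, the dead rows stay on disk until a vacuum reclaims them
DELETE FROM users;
VACUUM users;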

The next difference comes from the internal execution details related to the WAL. As mentioned before, TRUNCATE creates a single entry in the WAL whereas the overhead of DELETE depends on the number of deleted rows, leading to increased I/O and, therefore, a slower operation.

But after all, what can we do if we need to remove only a subset of rows? Well, there is a pattern which, in Redshift nomenclature, is called deep copy (CREATE TABLE x AS SELECT * FROM y). The idea is to create a new table and insert the selected rows into it from the source table. After the operation, the old table can be dropped and the new one renamed. Why can it work faster than DELETE? It's also related to the WAL I/O pressure which, for a deep copy, is much lower than for row-level operations like DELETE.
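Sketched on the users table, a deep copy could look like this (users_copy is a hypothetical name; note that CREATE TABLE AS copies neither indexes nor constraints, so they have to be recreated by hand):

-- copy only the rows to keep into a brand new table
CREATE TABLE users_copy AS
    SELECT * FROM users WHERE last_login >= now() - INTERVAL '1 year';

-- swap the tables
DROP TABLE users;
ALTER TABLE users_copy RENAME TO users;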

Before this article, I was a convinced user of TRUNCATE but I didn't know the details of why it performed better. Writing it helped me not only to understand them (I have a feeling that it's mostly about the WAL pressure?) but also gave me an example of when DELETE can perform better.