EXISTS operator in SQL

on waitingforcode.com

EXISTS operator in SQL

You want to become a data engineer and don't know where to start? I was like you 4 years ago when I started to learn the data. From that experience I prepared a 12-weeks course that will help you to become a data engineer. Join the class today! Join the class!
Years ago when I started to work as a software engineer, I was overusing IN/NOT IN operator. One day, one of my colleagues suggested me to replace it in some queries by EXISTS/NOT EXISTS. And it helped to improve the performances of these queries. If among you are some people like "me years ago", I prepared this short post introducing to EXISTS/NOT EXISTS operator by comparing it to IN/NOT IN one.

This post presents SQL EXISTS operator. The first part gives some basic information and explains the difference with IN. The second one makes zoom at the physical execution on PostgreSQL.

Not IN

EXISTS operator is sometimes confused with IN operator. And it's quite normal since both are used to check whether one attribute is included in a set. Both can also be negated the same way, with NOT keyword. But the similarities end here.

The real purpose of EXISTS is to check whether the subquery contains at least one record. For instance, the following query will return rows 1 and 2 from numbers table:

WITH numbers (nr) AS (
  SELECT 1 AS nr UNION ALL
  SELECT 2 AS nr UNION ALL
  SELECT 3 AS nr
), letters (letter, nr) AS (
  SELECT 'A' AS letter, 1 AS nr UNION ALL 
  SELECT 'B' AS letter, 2 AS nr 
)
SELECT * FROM numbers n WHERE nr EXISTS (SELECT nr FROM letters WHERE nr= n.nr);

Of course, you can interchangeably write it like SELECT * FROM numbers n WHERE n.nr IN (SELECT nr FROM letters). Both will return the same results. The difference is that the one with EXISTS can be faster because it will stop after getting the first row whereas IN will select all of them (if IN returns a lot of rows). However, please notice another thing. EXISTS uses correlated subquery so automatically it can be replaced with a LEFT JOIN with a NULL check (aka "frustrated join").

Execution plan

In this part I will start a new PostgreSQL container with docker run --name wfc_exists_test -e POSTGRES_PASSWORD=root -e POSTGRES_USER=root -e POSTGRES_DB=test_exists -d postgres:11 and access to the PostgreSQL console with docker exec -ti da4daa09d213 psql -d test_exists. In my dataset I will use the same tables and columns like in Minus/except operator in SQL. And before we go to the EXISTS tests, let's get all not bought products with NOT IN operator:

test_case_when=# EXPLAIN ANALYZE SELECT * FROM products WHERE id NOT IN (SELECT product_id FROM bought_products);
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on products  (cost=38.25..50.50 rows=90 width=422) (actual time=0.069..0.072 rows=5 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 2
   SubPlan 1
     ->  Seq Scan on bought_products  (cost=0.00..32.60 rows=2260 width=4) (actual time=0.006..0.009 rows=3 loops=1)
 Planning Time: 0.180 ms
 Execution Time: 0.130 ms
(7 rows)

As you can see, the plan, since any index isn't involved in the query, makes a sequential scan of bought_products before applying the filter. EXISTS, as you can see in the next snippet, behaves differently:

test_case_when=# EXPLAIN ANALYZE SELECT * FROM products WHERE NOT EXISTS (SELECT product_id FROM bought_products WHERE id = product_id);
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=0.15..47.80 rows=90 width=422) (actual time=0.038..0.045 rows=5 loops=1)
   ->  Seq Scan on products  (cost=0.00..11.80 rows=180 width=422) (actual time=0.012..0.014 rows=7 loops=1)
   ->  Index Only Scan using bought_products_pkey on bought_products  (cost=0.15..0.61 rows=11 width=4) (actual time=0.003..0.003 rows=0 loops=7)
         Index Cond: (product_id = products.id)
         Heap Fetches: 2
 Planning Time: 0.177 ms
 Execution Time: 0.080 ms
(7 rows)


The underlying query is executed as a nested loop anti join (learn more about them in Nested loop join and LEFT ANTI JOIN). By comparing both small examples, you can see the difference in the expected execution time (0.130 ms vs 0.080 ms) for slightly similar planning time. So if you have some queries using IN which can be replaced by EXISTS version, it's a good moment to check if they could execute faster.

Just to close the chapter, let's see now the plan for EXISTS:

test_case_when=# EXPLAIN ANALYZE SELECT * FROM products WHERE EXISTS (SELECT product_id FROM bought_products WHERE id = product_id);
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=0.15..47.80 rows=90 width=422) (actual time=0.030..0.046 rows=2 loops=1)
   ->  Seq Scan on products  (cost=0.00..11.80 rows=180 width=422) (actual time=0.011..0.013 rows=7 loops=1)
   ->  Index Only Scan using bought_products_pkey on bought_products  (cost=0.15..0.61 rows=11 width=4) (actual time=0.003..0.003 rows=0 loops=7)
         Index Cond: (product_id = products.id)
         Heap Fetches: 2
 Planning Time: 0.261 ms
 Execution Time: 0.084 ms
(7 rows)

As we could expect, instead of ANTI JOIN, we retrieve here SEMI JOIN. Is it true for a not correlated query like this one ?

test_case_when=# EXPLAIN ANALYZE SELECT * FROM products WHERE EXISTS (SELECT product_id FROM bought_products);
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.01..11.81 rows=180 width=422) (actual time=0.013..0.015 rows=7 loops=1)
   One-Time Filter: $0
   InitPlan 1 (returns $0)
     ->  Seq Scan on bought_products  (cost=0.00..32.60 rows=2260 width=0) (actual time=0.007..0.007 rows=1 loops=1)
   ->  Seq Scan on products  (cost=0.01..11.81 rows=180 width=422) (actual time=0.002..0.004 rows=7 loops=1)
 Planning Time: 0.077 ms
 Execution Time: 0.036 ms
(7 rows)

As you can see in the plan, not really. This time the subquery does a sequential scan but you can see that it's almost for free (0.007 ms) and it will return only 1 row since it's semantic of EXISTS. If you compare these statistics with the plan for NOT IN query, you will see a lot of similarities, probably because of a very small dataset. You should observe more differences for a bigger dataset, especially if your NOT IN subquery will return a lot of rows (EXISTS returns always 1 as you can see in the plan).

Even though it looks similar, EXISTS is different from IN clause. It stops when the subquery finds the first row. Switching from IN to EXISTS can be then a good hint if your query is slow and you want to speed it up. Even though, as you could read, EXISTS can be replaced at its turn by a LEFT JOIN called also "frustrated join" (see "Effective SQL: 61 Specific Ways to Write Better SQL").

Share on:

Share, like or comment this post on Twitter: