Correlated subqueries

on waitingforcode.com

Correlated subqueries

Even though the RDBMS is more and more completed (replaced?) by NoSQL solutions, it still remains an important piece of the data processing. It's even more true with the distributed databases as BigQuery supporting SQL standards so the correlated subqueries. But they're also implemented in other Big Data engines as Spark SQL or more classical ones as PostgreSQL.

This post explains the correlated subqueries. The first part of this post describes them from the bird eye view. The next one shows the code used to execute the examples. Two last sections focus on the correlated subquery execution in PostgreSQL. The execution plan will be analyzed in order to understand what happend underneath.

Correlated subquery defined

The correlated subquery is also known as a synchronized subquery. It's called correlated because it's related to the columns defined in the outer query. Naively we could think that because of this dependency, the correlated query is executed one by one for every row. However, it's not true since it's up to the query planner to decide how to execute given query.

To better understand what the correlated subquery is, let's make an analogy with the nested subquery. The most noticeable difference is about the dependency. With the nested subquery it's the outer query that depends on it. The dependency relation with correlated subquery is inverse - it's the subquery that depends on the outer query. To highlight the difference, nothing better than an example. First, the nested subquery:

 
correlated_subqueries=# SELECT * FROM articles WHERE title IN (SELECT title FROM articles WHERE views > 200);
 id | category |   title   | views
----+----------+-----------+-------
  1 |        1 | Article#1 |   220
  2 |        1 | Article#2 |  1450
  3 |        1 | Article#3 |   583
  4 |        2 | Article#4 |   220
(4 rows)

Now, let's execute a correlated subquery:

correlated_subqueries=# SELECT a.title, a.category, a.views, (SELECT MAX(views) AS max_views FROM articles WHERE category = a.category) - a.views AS views_difference FROM articles a;
   title   | category | views | views_difference
-----------+----------+-------+------------------
 Article#1 |        1 |   220 |             1230
 Article#2 |        1 |  1450 |                0
 Article#3 |        1 |   583 |              867
 Article#4 |        2 |   220 |                0
 Article#5 |        2 |   164 |               56
(5 rows)

As you can clearly notice, the correlated subquery retrieves the max value for given column by a "fake" join on outer query's. We'll explore the correlated subqueries through the next sections.

Setup

But before going into the queries, let's setup a small PostgreSQL from the official Docker's image:

FROM postgres:10.2 ENV POSTGRES_PASSWORD=root ENV POSTGRES_USER=root ENV POSTGRES_DB=correlated_subqueries COPY queries.sql /docker-entrypoint-initdb.d/queries.sql

The content of queries.sql looks like:

CREATE TABLE articles (
  id INT NOT NULL,
  category SMALLINT NOT NULL,
  title VARCHAR(150) NOT NULL,
  views INT NOT NULL,
  PRIMARY KEY(id)
);

CREATE TABLE tags (
  id INT NOT NULL,
  tag VARCHAR(150) NOT NULL,
  PRIMARY KEY(id)
);

CREATE TABLE tags_articles (
  tags_id INT NOT NULL,
  articles_id INT NOT NULL,
  views INT NOT NULL,
  PRIMARY KEY(tags_id, articles_id)
);

INSERT INTO articles (id, category, title, views) VALUES
(1, 1, 'Article#1', 220),
(2, 1, 'Article#2', 1450),
(3, 1, 'Article#3', 583),
(4, 2, 'Article#4', 220),
(5, 2, 'Article#5', 164);


INSERT INTO tags (id, tag) VALUES
(1, 'tag 1'), (2, 'tag 2'), (3, 'tag 3'), (4, 'tag 4');

INSERT INTO tags_articles (tags_id, articles_id, views) VALUES
(1, 1, 190),
(1, 2, 30),
(2, 1, 390),
(2, 2, 330),
(2, 3, 350),
(2, 4, 380),
(3, 1, 10),
(3, 2, 93),
(3, 3, 480),
(4, 1, 30),
(4, 4, 190),
(5, 2, 89),
(5, 3, 75)
;

We can next launch the image with docker build -t psql_correlated . and access to it in the another terminal tab thanks to docker exec -it 577618fa3067 bash .

Correlated subquery 1-n relationship example

After the setup we can start by executing the first correlated subquery for the 1-n relationship. In this example we want to select all tags belonging to each article and represent them as an array. There is the query with the results:

correlated_subqueries=# SELECT a.title, a.category, ARRAY(SELECT t.tag FROM tags t JOIN tags_articles ta ON ta.tags_id = t.id WHERE ta.articles_id = a.id) AS tags FROM articles a;
   title   | category |               tags                
-----------+----------+-----------------------------------
 Article#1 |        1 | {"tag 1","tag 2","tag 3","tag 4"}
 Article#2 |        1 | {"tag 1","tag 2","tag 3"}
 Article#3 |        1 | {"tag 2","tag 3"}
 Article#4 |        2 | {"tag 2","tag 4"}
 Article#5 |        2 | {}
(5 rows)

Now let's try to understand what happens during the execution thanks to the EXPLAIN command:

correlated_subqueries=# EXPLAIN ANALYZE SELECT a.title, a.category, ARRAY(SELECT t.tag FROM tags t JOIN tags_articles ta ON ta.tags_id = t.id WHERE ta.articles_id = a.id) AS tags FROM articles a; 
                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Seq Scan on articles a  (cost=0.00..10864.80 rows=220 width=352) (actual time=0.055..0.102 rows=5 loops=1)
   SubPlan 1
     ->  Hash Join  (cost=34.14..49.33 rows=10 width=318) (actual time=0.013..0.014 rows=2 loops=5)
           Hash Cond: (t.id = ta.tags_id)
           ->  Seq Scan on tags t  (cost=0.00..12.30 rows=230 width=322) (actual time=0.001..0.002 rows=4 loops=4)
           ->  Hash  (cost=34.01..34.01 rows=10 width=4) (actual time=0.009..0.009 rows=3 loops=5)
                 Buckets: 1024  Batches: 1  Memory Usage: 8kB
                 ->  Bitmap Heap Scan on tags_articles ta  (cost=23.45..34.01 rows=10 width=4) (actual time=0.004..0.005 rows=3 loops=5)
                       Recheck Cond: (articles_id = a.id)
                       Heap Blocks: exact=4
                       ->  Bitmap Index Scan on tags_articles_pkey  (cost=0.00..23.45 rows=10 width=0) (actual time=0.002..0.002 rows=3 loops=5)
                             Index Cond: (articles_id = a.id)
 Planning time: 0.210 ms
 Execution time: 0.151 ms
(10 rows)

The query plan starts with the Bitmap Index Scan applied on the rows matching the index condition tags_articles.articles_id = a.id. Later the Bitmap Heap Scan happens. The engine resolves by that way the rows matching the index condition and fetches them from the detected locations. At the end the read values are put into the hash table (Hash) and joined with the scanned tags (the whole represents the correlated subquery). At the end the articles from the outer query are joined with the values put into the hash table. We can notice the loops parameter that is the key to understand the number of executions of the subplan's Hash Join. According to it, the subquery will be executed once per outer row.

Correlated subquery with aggregations example

For the second case we want to discover the articles viewed more than average number of views:

correlated_subqueries=# SELECT title, views FROM articles a WHERE a.views > (
  SELECT AVG(views) AS avg_views_category FROM articles  WHERE category = a.category
  GROUP BY category 
);
   title   | views 
-----------+-------
 Article#2 |  1450
 Article#4 |   220
(2 rows)

Explain:

correlated_subqueries=# EXPLAIN ANALYZE SELECT title, views FROM articles a WHERE a.views > (
correlated_subqueries(#   SELECT AVG(views) AS avg_views_category FROM articles WHERE category = a.category
correlated_subqueries(#   GROUP BY category  
correlated_subqueries(# );
                                                   QUERY PLAN                   
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on articles a  (cost=0.00..2822.15 rows=73 width=322) (actual time=0.120..0.148 rows=2 loops=1)
   Filter: ((views)::numeric > (SubPlan 1))
   Rows Removed by Filter: 3
   SubPlan 1
     ->  GroupAggregate  (cost=0.00..12.77 rows=1 width=34) (actual time=0.013..0.013 rows=1 loops=5)
           Group Key: articles.category
           ->  Seq Scan on articles  (cost=0.00..12.75 rows=1 width=6) (actual time=0.002..0.004 rows=3 loops=5)
                 Filter: (category = a.category)
                 Rows Removed by Filter: 2
 Planning time: 1.202 ms
 Execution time: 0.287 ms
(11 rows)

The plan starts by a sequential scan on the articles and applies the filter on the views column by executing the correlated subquery. The subquery is a simple aggregation where only the rows with the category matching the outer article's category are kept and computed. Here too, the subplan is executed for each of selected rows (loops=5 on the 2nd sequential scan of the articles table).

In this post we discovered the idea of correlated subqueries. As told in the first section, the correlated subqueries differentiate from the classical subqueries in their dependency on the values defined in the outer query (hence correlated). The second section presented some code used in the 2 tests shown in the 3rd and 4th parts. As we could see in both cases with the value associated to the loops argument, the subqueries are executed once per outer row, so they'll be potentially slow with big datasets (unless the optimizer decides to use another plan).

Share, like or comment this post on Twitter: