SQL and intersect operation on waitingforcode.com

Versions: PostreSQL 10.2

Thanks to modern Big Data solutions like BigQuery or Apache Spark SQL, the knowledge of the advanced SQL concepts is important. After covering the operations like window functions or grouping sets, it's time to show another interesting SQL feature, the INTERSECT operator.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I'm currently writing one on that topic and the first chapters are already available in 👉 Early Release on the O'Reilly platform

I also help solve your data engineering problems 👉 contact@waitingforcode.com 📩

This post presents INTERSECT operator, implemented in most of the major databases, including the distributed ones. The first section presents the general information about it. The second part gives an example of the INTERSECT use. Finally, the last section focuses on the execution plan details in PostgreSQL.

Definition

If you have never used INTERSECT before, no worries. It's not the most complicated SQL operator. It does exactly the same thing as the sets intersection, i.e. it takes the elements that are present in all of the defined datasets. In SQL it's expressed similarly to the UNION operator. It means that we write several SELECT statements that are joined together with INTERSECT.

To use INTERSECT efficiently we need to remember 2 main rules. The first one is that the number of columns in intersected SELECT clauses must be the same. Moreover, they must be defined in the same order - INTERSECT is a position-based operator. Also, in order to find the intersection, the data types of selected columns must be compatible, i.e. the engine must be able to convert them implicitly. For instance, a TEXT and INT columns are not compatible and intersection won't be found:

WITH nr_letter (nr, letter) AS (
  SELECT 1 AS nr, 'A' AS letter UNION ALL
  SELECT 2 AS nr, 'B' AS letter
), letter_nr (letter, nr) AS (
  SELECT 'Bbis' AS letter, 2 AS nr
)
SELECT nr, letter FROM nr_letter
INTERSECT
SELECT letter, nr FROM letter_nr;
# fails with
ERROR:  INTERSECT types integer and text cannot be matched
LINE 9: SELECT letter, nr FROM letter_nr;

INTERSECT is supported not only in classical relational databases like PostgreSQL but also in modern data warehouse-oriented solutions, including Redshift, BigQuery or Azure Data Warehouse. As of this writing, among the popular Open Source solutions, MySQL doesn't support INTERSECT operator.

Example

We'll see the INTERSECT in action on the example of an e-commerce store where each user can add a product to its favorites and buy it later. Before executing the queries, let's start the Docker container:

docker run --name wfc_intersect_test -e POSTGRES_PASSWORD=root -e POSTGRES_USER=root -e POSTGRES_DB=test_intersect -d postgres:11

The dataset used in the tests looks like:

CREATE TABLE IF NOT EXISTS products (
  id INT NOT NULL,
  name VARCHAR(200) NOT NULL,
  PRIMARY KEY(id)
);

CREATE TABLE IF NOT EXISTS bought_products (
  product_id INT NOT NULL,
  user_id INT NOT NULL,
  PRIMARY KEY(product_id, user_id)
);

CREATE TABLE IF NOT EXISTS favorite_products (
  product_id INT NOT NULL,
  user_id INT NOT NULL,
  PRIMARY KEY(product_id, user_id)
);

INSERT INTO products (id, name) VALUES
(1, 'milk'),
(2, 'water'),
(3, 'coffee'),
(4, 'tea'),
(5, 'orange juice'),
(6, 'apple juice'),
(7, 'pineapple juice');

INSERT INTO bought_products (product_id, user_id) VALUES
(1, 10),
(1, 11),
(2, 10);

INSERT INTO favorite_products (product_id, user_id) VALUES
(1, 20),
(1, 10),
(1, 11),
(2, 10),
(4, 21),
(5, 22),
(6, 23);

After a quick psql --d test_intersect and the execution of the above queries, we can use the INTERSECT to get the bought products that are in the favorites list of each user:

SELECT intersection.product_id, intersection.user_id, p.name FROM (
  SELECT product_id, user_id FROM favorite_products
  INTERSECT
  SELECT product_id, user_id FROM bought_products
) AS intersection JOIN products p ON p.id = intersection.product_id;

As expected, the query returns 3 rows:

 product_id | user_id | name
------------+---------+-------
          1 |      11 | milk
          1 |      10 | milk
          2 |      10 | water
(3 rows)

We are not limited to the intersection of 2 sets. We can apply the intersection to more tables. In the next example, we want to know the products that were not only liked and bought but also commented:

CREATE TABLE IF NOT EXISTS commented_products (
  product_id INT NOT NULL,
  user_id INT NOT NULL,
  PRIMARY KEY(product_id, user_id)
);
INSERT INTO commented_products (product_id, user_id) VALUES
(1, 10),
(1, 11),
(6, 10);

SELECT intersection.product_id, intersection.user_id, p.name FROM (
  SELECT product_id, user_id FROM favorite_products
  INTERSECT
  SELECT product_id, user_id FROM bought_products
  INTERSECT
  SELECT product_id, user_id FROM commented_products
) AS intersection JOIN products p ON p.id = intersection.product_id;

As expected, the water doesn't appear because it was not commented:

 product_id | user_id | name
------------+---------+------
          1 |      11 | milk
          1 |      10 | milk

Execution plan

In the analysis about the execution plan we will ignore the JOIN for the sake of simplicity. Therefore, the analyzed query will be:

EXPLAIN ANALYZE  SELECT product_id, user_id FROM favorite_products
  INTERSECT
  SELECT product_id, user_id FROM bought_products;

The result returns something we've never seen in this blog before - a HashSetOp:

                                                           QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------
 HashSetOp Intersect  (cost=0.00..155.60 rows=226 width=12) (actual time=0.041..0.043 rows=3 loops=1)
   ->  Append  (cost=0.00..133.00 rows=4520 width=12) (actual time=0.009..0.033 rows=10 loops=1)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..55.20 rows=2260 width=12) (actual time=0.007..0.017 rows=7 loops=1)
               ->  Seq Scan on favorite_products  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.005..0.008 rows=7 loops=1)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..55.20 rows=2260 width=12) (actual time=0.003..0.007 rows=3 loops=1)
               ->  Seq Scan on bought_products  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.002..0.004 rows=3loops=1)
 Planning Time: 0.065 ms
 Execution Time: 0.077 ms
(8 rows)

The query starts with the sequential scans of both tables. The results are later appended to an in-memory hash table represented by HashSetOp. If you're interested, you can find the implementation details in Support hashing for duplicate-elimination in INTERSECT and EXCEPT queries Pull Request. If not, we can summarize the HashSetOp as an in-memory hash table where each of rows from intersected datasets is appended. Each entry in the table has a counter that is updated with every new scanned row. When the number of rows is the same as expected (2 in our example), the engine returns it in the final results.

As its name suggests, INTERSECT operator lets us find the common elements in 2 or more datasets. It follows similar requirements to the UNION operator. It's also supported in a lot of nowadays databases. One of INTERSECT use cases could be a query to figure out users actions, like it was shown in the second section with likes which could lead to orders and comments. The internal execution depends on the database. In this post I used a PostgreSQL instance. It implements INTERSECT as an in-memory hash table with the counter to figure out which rows should be returned to the user.