Minus/except operator in SQL

Versions: PostreSQL 10.2

Last time we've discovered the INTERSECT operator. To recall it quickly, it returns all rows that are defined in the combined datasets. Today we'll discover another operator, doing the opposite and called depending on the vendor: MINUS or EXCEPT.

A virtual conference at the intersection of Data and AI. This is not a conference for the hype. Its real users talking about real experiences.
- 40+ speakers with the likes of Hannes from Duck DB, Sol Rashidi, Joe Reis, Sadie St. Lawrence, Ryan Wolf from nvidia, Rebecca from lidl
- 12th September 2024
- Three simultaneous tracks
- Panels, Lighting Talks, Keynotes, Booth crawls, Roundtables and Entertainment.
- Topics include (ingestion, finops for data, data for inference (feature platforms), data for ML observability
- 100% virtual and 100% free

👉 Register here

Like the last time, the post will start by a short explanation of the operator. The second part will give an example of its use in PostgreSQL while the last section will focus on the internal details in PostgreSQL.

Definition

As the opposite for INTERSECT, MINUS (or EXCEPT) retrieves the rows that are present in the left dataset and not in the right one. For 2 datasets: [1, 2, 3] and [3, 4, 5], the operator will return [1, 2]. Looks similar? Yes, MINUS is often compared to a LEFT OUTER JOIN where the join key in the joined table is NULL:

WITH products AS (
    SELECT 1 AS id, 'milk' AS name
    UNION
    SELECT 2 AS id, 'water' AS name
), bought_products AS (
    SELECT 1 AS product_id, 30 AS price
)
SELECT products.* FROM products LEFT JOIN bought_products bp ON bp.product_id = products.id WHERE bp.product_id IS NULL;

The MINUS/EXCEPT operator is supported in the most of major databases: Oracle, PostgreSQL and SQL Server. It's also supported in modern Big Data warehousing solutions like AWS Redshift or GCP BigQuery. As you can notice, it's not implemented natively in MySQL. It can be simulated with above code though.

Exactly as INTERSECT, MINUS/EXCEPT must follow some rules. First, it's a position-based operator, hence the columns must be defined in the same order. Moreover, they must have the same or implicitly convertible type.

Example

It's not difficult to see that MINUS/EXCEPT is a great candidate for detecting the absence of data. We could use it for instance to retrieve all products that were commented but not bought by any customer. We could also use the operator to gather all users who haven't terminated the inscription process. We could use it in a JIRA-like application to check the tasks without assignment and so forth.

Here for the sake of simplicity we'll take the same example as the one used in the post about SQL and intersect operation:

docker run --name wfc_intersect_test -e POSTGRES_PASSWORD=root -e POSTGRES_USER=root -e POSTGRES_DB=test_intersect -d postgres:11

With the following dataset:

CREATE TABLE IF NOT EXISTS products (
  id INT NOT NULL,
  name VARCHAR(200) NOT NULL,
  PRIMARY KEY(id)
);

CREATE TABLE IF NOT EXISTS bought_products (
  product_id INT NOT NULL,
  user_id INT NOT NULL,
  PRIMARY KEY(product_id, user_id)
);

CREATE TABLE IF NOT EXISTS favorite_products (
  product_id INT NOT NULL,
  user_id INT NOT NULL,
  PRIMARY KEY(product_id, user_id)
);

INSERT INTO products (id, name) VALUES
(1, 'milk'),
(2, 'water'),
(3, 'coffee'),
(4, 'tea'),
(5, 'orange juice'),
(6, 'apple juice'),
(7, 'pineapple juice');

INSERT INTO bought_products (product_id, user_id) VALUES
(1, 10),
(1, 11),
(2, 10);

INSERT INTO favorite_products (product_id, user_id) VALUES
(1, 20),
(1, 10),
(1, 11),
(2, 10),
(4, 21),
(5, 22),
(6, 23);

CREATE TABLE IF NOT EXISTS commented_products (
  product_id INT NOT NULL,
  user_id INT NOT NULL,
  PRIMARY KEY(product_id, user_id)
);

INSERT INTO commented_products (product_id, user_id) VALUES
(1, 10),
(1, 11),
(6, 10);

We can for instance start by looking for the products from the favorite lists which haven't been bought yet by each user:

SELECT except_dataset.product_id, except_dataset.user_id, p.name FROM (
  SELECT product_id, user_id FROM favorite_products
  EXCEPT
  SELECT product_id, user_id FROM bought_products
) AS except_dataset JOIN products p ON p.id = except_dataset.product_id;
 product_id | user_id |     name
------------+---------+--------------
          5 |      22 | orange juice
          6 |      23 | apple juice
          4 |      21 | tea
          1 |      20 | milk
(4 rows)

Like for INTERESECT, we can apply MINUS/EXCEPT on more than 2 datasets. Here, we'll use it to discover the favorite and not bought products which haven't been commented by the user:

# Let's add a line just to have the result different than in the previous case
INSERT INTO commented_products (product_id, user_id) VALUES (5, 22);

SELECT except_dataset.product_id, except_dataset.user_id, p.name FROM (
  SELECT product_id, user_id FROM favorite_products
  EXCEPT
  SELECT product_id, user_id FROM bought_products
  EXCEPT
  SELECT product_id, user_id FROM commented_products
) AS except_dataset JOIN products p ON p.id = except_dataset.product_id;
 product_id | user_id |    name
------------+---------+-------------
          6 |      23 | apple juice
          4 |      21 | tea
          1 |      20 | milk
(3 rows)

The result contains almost the same rows as the first query. The difference consists on just added comment for orange juice product.

Aside from using multiple MINUS/EXCEPT operators, we can also combine them with INTERSECT. Our example will get all favorite and not bought products which were commented, independently on the user:

SELECT p.id, p.name FROM (SELECT except_dataset.product_id FROM (
  SELECT product_id FROM favorite_products
  EXCEPT
  SELECT product_id FROM bought_products
) AS except_dataset   
INTERSECT
SELECT product_id FROM commented_products) AS commented_favorite_not_bought
JOIN products p ON p.id = commented_favorite_not_bought.product_id;
 id |    name
----+-------------
  6 | apple juice
(1 row)

Let's verify query by query whether the result is correct:

SELECT product_id FROM favorite_products
EXCEPT
SELECT product_id FROM bought_products
 product_id
------------
          5
          4
          6
(3 rows)

Three products (4, 5, 6) were added to favorite lists without being bought. Now, let's take all commented products:

SELECT product_id FROM commented_products
 product_id
------------
          1
          1
          6
(3 rows)

Only 2 products where commented: 1 and 6. Therefore, the intersection between (4, 5, 6) and (1, 6) returns only one id - 6.

Query plan

It's good moment now to focus on the query plan. Let's take a simple MINUS/EXCEPT query:

EXPLAIN ANALYZE (SELECT product_id FROM favorite_products
EXCEPT
SELECT product_id FROM bought_products)
                                                           QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------
---
 HashSetOp Except  (cost=0.00..144.30 rows=200 width=8) (actual time=0.043..0.045 rows=3 loops=1)
   ->  Append  (cost=0.00..133.00 rows=4520 width=8) (actual time=0.011..0.035 rows=10 loops=1)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..55.20 rows=2260 width=8) (actual time=0.010..0.019 rows=7 loops=1)
               ->  Seq Scan on favorite_products  (cost=0.00..32.60 rows=2260 width=4) (actual time=0.008..0.012 rows=7 loops=
1)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..55.20 rows=2260 width=8) (actual time=0.003..0.007 rows=3 loops=1)
               ->  Seq Scan on bought_products  (cost=0.00..32.60 rows=2260 width=4) (actual time=0.002..0.004 rows=3 loops=1)
 Planning Time: 0.056 ms
 Execution Time: 0.082 ms
(8 rows)

Internally EXCEPT uses the same operation as the INTERSECT - HashSetOp. It creates an in-memory hash table with counters. Only the resolution rule is different. EXCEPT checks whether the counter of the "left" dataset is bigger than 0 and the one of the "right" dataset equal to 0. If it's the case, it returns the row:

case SETOPCMD_EXCEPT:
    if (pergroup->numLeft > 0 && pergroup->numRight == 0)
        setopstate->numOutput = 1;
    else
        setopstate->numOutput = 0;
    break;
# ...
if (setopstate->numOutput > 0)
{
    setopstate->numOutput--;
    return ExecStoreMinimalTuple(entry->shared.firstTuple,
                                 resultTupleSlot,
                                 false);
}

However, sometimes the EXCEPT operator may be slower than its corresponding LEFT OUTER JOIN version. Some of the external links are provided in "Read also" section. Thus, in a case of any performance issues you could always compare both solutions - probably one of them will always perform better than another.

MINUS/EXCEPT operator is the opposite of the INTERSECT action, described previously in this blog. It takes all elements from the "left" dataset that are not present in the "right" dataset. It's then a great candidate for the detection of the not existent data, as for instance not assigned tasks, favorite and not bought products, and so forth. Internally, at least for PostgreSQL, it uses the same structure as INTERSECT, namely HashSetOp.