Window functions in SQL

Versions: PostreSQL 10.2

Window functions are one of another SQL features that we'll probably discover during the work with data-oriented application. They can be also used in more classical programs though.

Looking for a better data engineering position and skills?

You have been working as a data engineer but feel stuck? You don't have any new challenges and are still writing the same jobs all over again? You have now different options. You can try to look for a new job, now or later, or learn from the others! "Become a Better Data Engineer" initiative is one of these places where you can find online learning resources where the theory meets the practice. They will help you prepare maybe for the next job, or at least, improve your current skillset without looking for something else.

👉 I'm interested in improving my data engineering skillset

See you there, Bartosz

This post describes the feature of window functions in PostgreSQL database. The first section explains the general idea behind them. The second part describes the example used in our tests. The next sections list one-by-one the methods related to the window functions and shows how they can be used against some real-world-like examples.

Window functions defined

The window functions can be explained to a programmer very easily thanks to the similarities between maps data structures (Python dictionaries) and for loops. Translated to a pseudo-code it could look like:

rows = Map(KeyType, ValueType)
for key, values in rows:
   for value in values:
     // …

That is, the window function groups the rows by some keys and iterates over them by retaining rows separate identities. As you'll discover in the 3rd section, the rows are linked through window-specific functions allowing us to access either previous iterated value, the first iterated value or even the distance from the beginning. We tell that the iterated values are scoped to the window frame.

The use of window function consists on 2 keywords: OVER and PARTITION BY. For instance, if we'd like to iterate over the values grouped by the zip_code column and compute an average age of its residents, we'd use the following instruction: ...AVG(age) OVER (PARTITION BY zip_code) FROM residents_list. As you can see, there is no magic and the definition is pretty straightforward.

But what happens if we'd like to use the same window in more than 1 place ? In order to not repeat ourselves we can use the feature called named window functions. Thanks to it we can define one window at the end of the SQL statement and reference it in different places before the definition. Such window is specified with WINDOW $name instruction. The parts using this window can still reference it with OVER keyword. The difference is that PARTITION BY is replaced by the value associated to $name variable. For instance, if we'd like to reuse a window function for our average age problem, we'd write the following: ...AVG(age) OVER zip_code_window ...WINDOW zip_code AS (PARTITION BY zip_code).

Code used in the examples

Below you can find the query creating the data used in the tests. We'll use the same Docker's PostgreSQL image as in the case of the post about correlated subqueries:

CREATE TYPE user_actions AS ENUM ('left_click', 'hover', 'select', 'scroll', 'inactive',

CREATE TABLE actions (
  user_id INT NOT NULL,
  action user_actions,
  page VARCHAR(100) NOT NULL,
  action_time TIMESTAMP NOT NULL,
  PRIMARY KEY(user_id, action, action_time)

INSERT INTO actions (user_id, action, page, action_time) VALUES
(1, 'inactive', 'index.html', TIMESTAMP '2018-01-01 15:00:00'),
(1, 'hover', 'index.html', TIMESTAMP '2018-01-01 15:00:45'),
(1, 'left_click', 'index.html', TIMESTAMP '2018-01-01 15:01:15'),
(1, 'select', 'index.html', TIMESTAMP '2018-01-01 15:01:20'),
(1, 'right_click', 'index.html', TIMESTAMP '2018-01-01 15:01:30'),
(1, 'scroll', 'index.html', TIMESTAMP '2018-01-01 15:02:15'),
(1, 'left_click', 'index.html', TIMESTAMP '2018-01-01 15:02:30'),
(1, 'inactive', 'about_us.html', TIMESTAMP '2018-01-01 15:03:15'),
(1, 'scroll', 'about_us.html', TIMESTAMP '2018-01-01 15:05:00'),

(2, 'right_click', 'about_us.html', TIMESTAMP '2018-01-01 15:03:00'),
(2, 'right_click', 'about_us.html', TIMESTAMP '2018-01-01 15:03:20'),
(2, 'inactive', 'about_us.html', TIMESTAMP '2018-01-01 15:03:30'),
(2, 'inactive', 'about_us.html', TIMESTAMP '2018-01-01 15:03:40'),
(2, 'inactive', 'about_us.html', TIMESTAMP '2018-01-01 15:03:50'),
(2, 'inactive', 'about_us.html', TIMESTAMP '2018-01-01 15:04:00'),
(2, 'left_click', 'about_us.html', TIMESTAMP '2018-01-01 15:04:05'),
(2, 'left_click', 'about_us.html', TIMESTAMP '2018-01-01 15:04:15');

As you can see, in the next section's examples we'll try to write some sessionization-like queries making an insight on user activity on our website.

Window function examples

Because of the readability reasons, the examples are defined in the following list where each bullet represents one of available functions:

The window functions are quite powerful SQL feature that can be used in all kind of tasks needing some ordered sequencing. The first part shown how to define them in SQL. As we could see, the operators like OVER and PARTITION BY are the keys of the definition. We also discovered that one window can be reused thanks to the named window functions. Next, after discovering the tested dataset in the 2nd part, the 3rd one shown some of methods provided with window functions. We could learn there that window functions made possible the reading of previous and next values relative to currently read row.