Guess what? My time-consuming learning mode based on reading the documentation paid off again! This time on Azure: while reading about Stream Analytics windows, I discovered that I had missed some of them in the past. And since today is the day of the cloud, I will check whether the same types of windows exist in the AWS and GCP streaming services and, if not, what the differences are.
Please notice one thing, though. GCP doesn't have a pure stream analytics service like AWS Kinesis Data Analytics or Azure Stream Analytics. However, you can achieve similar results by writing streaming Dataflow pipelines. That's why I will consider Dataflow as the streaming service on GCP, even though technically it's a bit different from the services present on AWS and Azure. But enough talking, let's see what types of windows we can find in these cloud services!
Tumbling window
That's the most popular one because it exists on all 3 analyzed services. A tumbling window is a non-overlapping window with a fixed duration. The windows are sequential, meaning that the service creates a new window at the end of the previous one.
The window emits the accumulated results at the end of the interval.
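To make the assignment rule concrete, here is a minimal sketch in plain Python rather than in any of the cloud services. The `tumbling_windows` helper, the `(timestamp, value)` event shape, and the half-open `[start, end)` boundaries are all assumptions made for the illustration.

```python
from collections import defaultdict


def tumbling_windows(events, size):
    """Group (timestamp, value) pairs into fixed, non-overlapping windows."""
    windows = defaultdict(list)
    for timestamp, value in events:
        # Each event belongs to exactly one window: the one whose start is
        # the closest multiple of `size` at or before the event timestamp.
        window_start = (timestamp // size) * size
        windows[(window_start, window_start + size)].append(value)
    return dict(sorted(windows.items()))


events = [(0, "a"), (1, "b"), (4, "c"), (5, "d"), (9, "e")]
tumbling = tumbling_windows(events, size=5)
print(tumbling)
```

With a size of 5, the first three events land in the window (0, 5) and the last two in (5, 10).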
Sliding window
A tumbling window doesn't overlap, meaning that an item is present in only one window. But the windows world also has overlapping versions, and the first is called a sliding window. The sliding window, exactly like the tumbling window, has a duration property. However, unlike the tumbling window, it emits results whenever a new item appears.
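The "emit on every new item" behavior can be sketched in plain Python as follows. This is a simplification under stated assumptions: the `sliding_window` helper is hypothetical, events are `(timestamp, value)` pairs already sorted by time, and a result is produced only when an event arrives (a real service may also emit when an event leaves the window).

```python
def sliding_window(events, duration):
    """For each arriving event, emit the values seen in the last `duration` time units."""
    emitted = []
    for index, (timestamp, _) in enumerate(events):
        # Look back over the events received so far and keep the ones that
        # fall inside (timestamp - duration, timestamp].
        in_window = [
            value
            for seen_at, value in events[: index + 1]
            if timestamp - duration < seen_at <= timestamp
        ]
        emitted.append((timestamp, in_window))
    return emitted


events = [(1, "a"), (2, "b"), (5, "c"), (6, "d")]
sliding = sliding_window(events, duration=3)
for result in sliding:
    print(result)
```

Every input event triggers one output, and the same value ("a" or "c" here) shows up in several consecutive results.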
Hopping window
I would call it a special type of sliding window with 2 differences. It emits results at the end of the window, and it creates each new window after the time defined in a "hop" interval parameter.
For example, with a "hop" interval of 1 second, a new window is created every second, so 3 consecutive windows can cover the same input events, meaning that some of them are present in different windows.
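Hopping window is not present under that name on every cloud provider. I found it only in Azure Stream Analytics, although the Dataflow documentation describes hopping windows as what Apache Beam calls sliding windows (a window size plus a period).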
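The hopping behavior described above can be sketched in plain Python. The `hopping_windows` helper, the `(timestamp, value)` event shape, and the half-open `[start, end)` boundaries are assumptions for the illustration, not the semantics of a specific service.

```python
def hopping_windows(events, size, hop):
    """Assign each (timestamp, value) pair to every window of length `size`
    that starts at a multiple of `hop` and covers the timestamp."""
    windows = {}
    for timestamp, value in events:
        # Start from the latest window opened at or before the event...
        start = (timestamp // hop) * hop
        # ...and walk back hop by hop while the window still covers the event.
        while start > timestamp - size:
            windows.setdefault((start, start + size), []).append(value)
            start -= hop
    return dict(sorted(windows.items()))


events = [(10, "a"), (11, "b"), (12, "c")]
hopping = hopping_windows(events, size=3, hop=1)
for window, values in hopping.items():
    print(window, values)
```

With a 3-second size and a 1-second hop, each event belongs to 3 windows, and the window (10, 13) contains all three events.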
Session window and stagger window
It's probably the most complicated one on the list: the session window, with the stagger window as its counterpart on Kinesis Data Analytics. This window works on an element basis. Put another way, every window contains records associated with one entity, like a user navigation session. So it's not a purely time-based window, even though time is still present here! A session window uses the temporal concept of a "gap duration", "session gap" or "timeout" (the name depends on the service) to define how long the window for a specific item can remain open without receiving new elements. If this delay passes, the window terminates.
In Azure Stream Analytics, you can also define a maximum duration, a kind of TTL, for the session window. For example, if an item "A" gets new observations every minute and its session window has a gap duration of 2 minutes, the window will never end. However, if you set the maximum duration to 10 minutes, no session window will be longer than that period.
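The interplay between the gap and the maximum duration can be sketched in plain Python for a single entity. The `session_windows` helper and its parameters are hypothetical, and the way the maximum duration cuts a session is a simplification, not the exact Azure Stream Analytics algorithm.

```python
def session_windows(timestamps, gap, max_duration=None):
    """Split the timestamps of one entity into sessions.

    A session closes when no event arrives within `gap`, or (assumption)
    when the next event would make it reach `max_duration`.
    """
    sessions = []
    current = []
    for ts in sorted(timestamps):
        if current:
            gap_expired = ts - current[-1] > gap
            too_long = max_duration is not None and ts - current[0] >= max_duration
            if gap_expired or too_long:
                sessions.append(current)
                current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions


# Item "A" gets one observation every minute for 25 minutes.
observations = list(range(25))
unbounded = session_windows(observations, gap=2)
bounded = session_windows(observations, gap=2, max_duration=10)
print(len(unbounded), [len(session) for session in bounded])
```

Without the maximum duration, the 2-minute gap never expires and everything ends up in a single session. With a 10-minute limit, the same input is split into sessions of 10, 10, and 5 observations.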
A similar concept to the session window is the stagger window, present in Kinesis Data Analytics SQL. It also opens a new window when the first element of a group arrives, but instead of relying on a gap duration, it defines a fixed duration for the window, a little bit like the maximum duration of the session window in Azure Stream Analytics. That's why its visual representation is the same.
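The difference from the session window is easier to see in code. Below is a minimal plain Python sketch, assuming `(timestamp, key)` events sorted by time and a hypothetical `stagger_windows` helper that counts events per key; it ignores the other details of the Kinesis Data Analytics SQL syntax.

```python
def stagger_windows(events, duration):
    """Count events per key in windows opened by the first event of each key.

    Returns (key, window_start, count) tuples ordered by window start.
    """
    open_windows = {}  # key -> (window_start, count)
    closed = []
    for timestamp, key in events:
        # The window has a fixed duration: once it has elapsed, the window
        # closes, no matter how recently the last event arrived.
        if key in open_windows and timestamp >= open_windows[key][0] + duration:
            start, count = open_windows.pop(key)
            closed.append((key, start, count))
        start, count = open_windows.get(key, (timestamp, 0))
        open_windows[key] = (start, count + 1)
    # Flush the windows still open at the end of the input.
    closed.extend((key, start, count) for key, (start, count) in open_windows.items())
    return sorted(closed, key=lambda window: (window[1], window[0]))


events = [(0, "A"), (2, "B"), (30, "A"), (59, "A"), (61, "B"), (65, "A")]
stagger = stagger_windows(events, duration=60)
print(stagger)
```

The window of "A" opens at 0 and the window of "B" at 2, so they are not aligned on a shared boundary. The event of "A" at 65 arrives after the 60-unit duration and starts a new window, even though the previous event arrived only 6 units earlier.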
Snapshot window
The last one is specific to Azure Stream Analytics. A snapshot window captures all events occurring at the same time. In other words, you don't need to specify a window definition with a duration and a trigger period. Instead, you can simply select a timestamp attribute of the data source and consider it as a "window" group generator.
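In plain Python, the idea boils down to grouping by the timestamp itself. The `snapshot_windows` helper and the `(timestamp, value)` event shape below are assumptions for the sketch.

```python
from itertools import groupby


def snapshot_windows(events):
    """Group (timestamp, value) pairs that share exactly the same timestamp."""
    # groupby only merges adjacent items, hence the sort on the timestamp first.
    ordered = sorted(events, key=lambda event: event[0])
    return {
        timestamp: [value for _, value in group]
        for timestamp, group in groupby(ordered, key=lambda event: event[0])
    }


events = [(10, "a"), (10, "b"), (11, "c"), (10, "d")]
snapshots = snapshot_windows(events)
print(snapshots)
```

There is no duration or hop parameter: the three events stamped 10 form one group and the event stamped 11 forms another.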
Before writing this article, I knew only some of the presented window types. That's why I was delighted to discover that there are other interesting types in addition to the tumbling, sliding, and session windows!
Read also about Windows to the clouds here:
- Kinesis Data Analytics Windowed Queries
- Dataflow streaming pipelines
- Introduction to Stream Analytics windowing functions
Snapshot has different meanings in the data world. But did you know it's also a type of window? In the new blog post, I prepared a list of windows you can use in cloud streaming services: https://t.co/G58j267uMV
— Bartosz Konieczny (@waitingforcode) July 18, 2021
