Guess what? My time-consuming learning mode based on reading the documentation paid off again! This time it happened on Azure: while reading about Stream Analytics windows, I discovered that I had missed some of them in the past. And since today is the day of the cloud, I will check whether the same types of windows exist on AWS and GCP streaming services and, if not, what the differences are.
Please note one thing, though. GCP doesn't have a pure stream analytics service like AWS Kinesis Data Analytics or Azure Stream Analytics. However, you can achieve similar things by writing streaming Dataflow pipelines. That's the reason why I will consider Dataflow as a streaming service on GCP, even though technically it's a bit different from the services present on AWS and Azure. But enough talking, let's see what types of windows we can find in these cloud services!
The tumbling window is the most popular one because it exists on all 3 analyzed services. A tumbling window is a non-overlapping window with a fixed duration. The windows are sequential, meaning that the service creates a new window at the end of the previous one.
The window emits the accumulated results at the end of the interval.
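To make the definition more concrete, here is a minimal Python sketch of the tumbling window assignment. It is not how any of the cloud services implements it; the `tumbling_windows` function, the integer timestamps, and the half-open `[start, end)` boundaries are my assumptions for illustration.

```python
from collections import defaultdict


def tumbling_windows(events, size):
    """Group (timestamp, value) pairs into fixed, non-overlapping windows.

    Assumption: windows are half-open intervals [start, start + size)
    aligned to multiples of `size`.
    """
    windows = defaultdict(list)
    for timestamp, value in events:
        # Every event falls into exactly one window.
        start = (timestamp // size) * size
        windows[(start, start + size)].append(value)
    return dict(sorted(windows.items()))


tumbling_events = [(1, "a"), (4, "b"), (5, "c"), (9, "d"), (12, "e")]
tumbling_result = tumbling_windows(tumbling_events, size=5)
# {(0, 5): ['a', 'b'], (5, 10): ['c', 'd'], (10, 15): ['e']}
```

Each event lands in exactly one window, which is the property that distinguishes the tumbling window from the overlapping ones.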
Tumbling windows don't overlap, meaning that an item is present in only one window. But in the windows world, we also have overlapping versions, and the first is called a sliding window. The sliding window, exactly like the tumbling window, has a duration property. However, unlike the tumbling window, it emits results whenever a new item appears.
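The behavior described above can be sketched in a few lines of Python. This is a simplified model under my own assumptions: events arrive ordered by timestamp, the window covers the range `(timestamp - duration, timestamp]`, and the `sliding_window_emissions` name is hypothetical.

```python
from collections import deque


def sliding_window_emissions(events, duration):
    """Emit the window content every time a new event arrives.

    Assumption: `events` is ordered by timestamp and `duration` is positive.
    """
    window = deque()
    emitted = []
    for timestamp, value in events:
        window.append((timestamp, value))
        # Evict events that are older than the window duration.
        while window[0][0] <= timestamp - duration:
            window.popleft()
        emitted.append((timestamp, [v for _, v in window]))
    return emitted


sliding_events = [(1, "a"), (2, "b"), (6, "c"), (7, "d")]
sliding_result = sliding_window_emissions(sliding_events, duration=5)
# [(1, ['a']), (2, ['a', 'b']), (6, ['b', 'c']), (7, ['c', 'd'])]
```

The same event ("b", for example) shows up in several emitted results, which illustrates the overlapping character of this window.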
I would call the hopping window a special type of sliding window with 2 differences. It emits results at the end of the window, and it creates a new window every time the period defined in the "hop" interval parameter elapses:
As you can see, we created 3 windows, one every second (the "hop" interval), meaning that some of the input events were present in several windows.
The hopping window is not present on every cloud provider. I found it only in Azure Stream Analytics.
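A small Python sketch may help to see why an event can belong to several hopping windows. The `hopping_windows` function and the half-open `[start, start + size)` boundaries are assumptions made for the example, not the Azure Stream Analytics semantics taken verbatim.

```python
from collections import defaultdict


def hopping_windows(events, size, hop):
    """Assign each (timestamp, value) pair to every window that contains it.

    Assumption: a new window [start, start + size) opens every `hop` units,
    aligned to multiples of `hop`.
    """
    windows = defaultdict(list)
    for timestamp, value in events:
        # Latest window start at or before the event.
        start = (timestamp // hop) * hop
        # Walk back through all earlier windows that still cover the event.
        while start > timestamp - size:
            windows[(start, start + size)].append(value)
            start -= hop
    return dict(sorted(windows.items()))


hopping_events = [(1, "a"), (2, "b")]
hopping_result = hopping_windows(hopping_events, size=2, hop=1)
# {(0, 2): ['a'], (1, 3): ['a', 'b'], (2, 4): ['b']}
```

With a 2-second window and a 1-second hop, the 2 events end up in 3 windows, and each event is present in 2 of them. Setting `hop` equal to `size` gives back the tumbling window.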
Session window and stagger window
The session window, which has a close relative on Kinesis Data Analytics called the stagger window, is probably the most complicated one on the list. This window works on a per-element basis. Put another way, every window contains records associated with one entity, like a user navigation session. As you can see, it's not a purely time-based window, even though time is still present here! A session window uses a temporal concept of a "gap duration", "session gap" or "timeout" (the name depends on the service) to define how long the window for a specific item can remain open without receiving new elements. If this delay passes, the window terminates.
In Azure Stream Analytics, you can also define a TTL value (maximum duration) for the session window. For example, if an item "A" gets new observations every minute and its session window has 2 minutes of gap duration, it will never end. However, if you set the maximum duration to 10 minutes, no session window will last longer than that period.
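The interplay between the gap duration and the maximum duration can be sketched as follows. This is a simplified model and not the exact Azure Stream Analytics algorithm; the `session_windows` function, its parameters, and the use of minutes as integer timestamps are assumptions for illustration.

```python
def session_windows(events, gap, max_duration=None):
    """Build per-key sessions from (key, timestamp) pairs.

    Assumption: events are ordered by timestamp within each key.
    Returns a sorted list of (key, first_event_time, last_event_time).
    """
    open_sessions = {}  # key -> [start, last_seen]
    closed = []
    for key, timestamp in events:
        session = open_sessions.get(key)
        if session is not None:
            start, last_seen = session
            gap_expired = timestamp - last_seen > gap
            too_long = max_duration is not None and timestamp - start >= max_duration
            if gap_expired or too_long:
                # The current session ends; the event will open a new one.
                closed.append((key, start, last_seen))
                session = None
        if session is None:
            open_sessions[key] = [timestamp, timestamp]
        else:
            session[1] = timestamp
    # Flush the sessions still open at the end of the input.
    closed.extend((key, start, last) for key, (start, last) in open_sessions.items())
    return sorted(closed)


# "A" gets a new observation every minute, "B" only twice.
session_events = [("A", minute) for minute in range(13)] + [("B", 3), ("B", 8)]
unbounded = session_windows(session_events, gap=2)
bounded = session_windows(session_events, gap=2, max_duration=10)
```

Without the maximum duration, the session of "A" spans the whole input (`("A", 0, 12)`). With `max_duration=10`, it is cut into `("A", 0, 9)` and `("A", 10, 12)`, while "B" gets 2 separate sessions in both cases because the 5-minute silence exceeds the gap.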
A similar concept to the session window is the stagger window, present in Kinesis Data Analytics SQL. It also opens a new window when the first element of a group arrives, but instead of allowing a gap duration, it defines a fixed duration for the window, a little bit like the maximum duration of the session window in Azure Stream Analytics. That's why its visual representation is the same.
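To compare it with the session window, here is a rough Python sketch of the stagger window logic. It only models the "open on first element, close after a fixed duration" idea described above; the `stagger_windows` function and the `(key, start, end, count)` output shape are hypothetical and do not mirror the Kinesis Data Analytics SQL syntax.

```python
def stagger_windows(events, duration):
    """Count events per key in windows opened by the first event of a group.

    Assumption: events are (key, timestamp) pairs ordered by timestamp
    within each key. Returns a sorted list of (key, start, end, count).
    """
    open_windows = {}  # key -> [start, count]
    closed = []
    for key, timestamp in events:
        window = open_windows.get(key)
        if window is not None and timestamp >= window[0] + duration:
            # The fixed duration elapsed, so the window closes.
            closed.append((key, window[0], window[0] + duration, window[1]))
            window = None
        if window is None:
            # The first event of the group opens a new window.
            open_windows[key] = [timestamp, 1]
        else:
            window[1] += 1
    for key, (start, count) in open_windows.items():
        closed.append((key, start, start + duration, count))
    return sorted(closed)


stagger_events = [("A", 5), ("A", 20), ("A", 64), ("A", 70), ("A", 130), ("B", 30)]
stagger_result = stagger_windows(stagger_events, duration=60)
```

Notice that the windows start at 5, 70, and 130 for "A", so they follow the data instead of the wall-clock boundaries a tumbling window would use.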
The last one, the snapshot window, is specific to Azure Stream Analytics. A snapshot window captures all events occurring at the same time. In other words, you don't need to specify a window definition with the duration and trigger period. Instead, you can simply select a timestamp attribute of the data source and consider it as a "window" group generator.
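In Python terms, the snapshot window boils down to a group-by on the timestamp. The sketch below assumes integer timestamps, and the `snapshot_windows` name is made up for the example.

```python
from collections import defaultdict


def snapshot_windows(events):
    """Group (timestamp, value) pairs that share exactly the same timestamp."""
    groups = defaultdict(list)
    for timestamp, value in events:
        # The timestamp itself acts as the window identifier.
        groups[timestamp].append(value)
    return dict(sorted(groups.items()))


snapshot_events = [(10, "a"), (10, "b"), (11, "c"), (13, "d"), (13, "e")]
snapshot_result = snapshot_windows(snapshot_events)
# {10: ['a', 'b'], 11: ['c'], 13: ['d', 'e']}
```

There is no duration or hop to configure here, which matches the description above: the data decides where each "window" is.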
Before writing this article, I knew only some of the presented window types. That's why I was delighted to discover that there are other interesting types in addition to the tumbling, sliding, and session windows!