When I was learning about watermarks in Apache Flink, I saw that it takes the smallest event times, instead of the biggest ones as Apache Spark Structured Streaming does. That puzzled me... How is it possible the pipeline doesn't go back in time? The answer came when I reread the Streaming Systems book. There was one keyword I had missed that clarified everything.
Before I reveal the keyword, let's see a situation where a streaming job could move back in time if the watermark were simply the smallest event time.
For the sake of the example, let's suppose a sessionization pipeline where the current watermark is 10:00; therefore, all sessions prior to this time have already been closed and emitted downstream. Now a late event for 08:00 arrives in the system. As your imaginary system takes the smallest event time ever seen, the event for 08:00 would move the watermark back to 08:00. Doing so would involve restoring all states closed after 08:00 and emitting them again once the watermark reaches 10:00 anew. Of course, things can get nasty. If, after receiving the event for 08:00, you got another one for 06:00 but from the day before, you would need to restore everything prior to that... Needless to say, it would be a nightmare involving not only keeping the state for a long time but also reconciling the old state with the new one.
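To make the problem concrete, here is a minimal, framework-free Scala sketch of that broken design; `naiveWatermark` is a made-up function illustrating the flawed idea, not any real API:

```scala
import java.time.LocalTime

def t(s: String): LocalTime = LocalTime.parse(s) // "10:00" -> 10:00

// Hypothetical, deliberately broken design: the watermark is simply the
// smallest event time of the latest batch, with no monotonicity guarantee.
def naiveWatermark(latestBatchEventTimes: Seq[LocalTime]): LocalTime =
  latestBatchEventTimes.min

println(naiveWatermark(Seq(t("10:00"), t("10:05")))) // 10:00
println(naiveWatermark(Seq(t("08:00"), t("10:06")))) // 08:00 - the watermark went back!
```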
As you can imagine, it would have a serious impact on the hardware resources and the state lifecycle management. Besides, it would also confuse your downstream consumers. After all, they consider all data prior to 10:00 to be complete, according to the watermark.
That's why the watermark never goes back. And when a value never decreases, in computer science we say it's monotonically increasing. That's the keyword I missed when I oversimplified watermarks in streaming systems as the outcome of a simple MAX function.
Well, technically I was not wrong to rely on the MAX. Apache Spark Structured Streaming takes the MAX event time from all partitions and promotes it to the new watermark if the value is bigger than the current one. Apache Flink does the same, but only at the subtask level, producing what you can consider a local watermark. The public watermark, i.e. the one emitted downstream, is the MIN among all local watermarks.
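A minimal sketch of these two levels in plain Scala (framework-free; `advanceLocal` and `taskWatermark` are my own illustrative names, not Flink's or Spark's APIs):

```scala
import java.time.LocalTime

// Local (subtask-level) watermark: advanced with MAX, so it can only move forward.
def advanceLocal(current: LocalTime, newEvents: Seq[LocalTime]): LocalTime =
  (current +: newEvents).max

// Public (task-level) watermark: the MIN of all local watermarks, i.e. the
// stream is only as complete as its slowest subtask.
def taskWatermark(localWatermarks: Seq[LocalTime]): LocalTime =
  localWatermarks.min
```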
Let's analyze it in a table, assuming the allowed lateness for the watermark is 0. To keep it simple, the current watermark at both the task and subtask levels is 10:30. As you will notice, subtask 1 advances slower than subtask 2, which results in a slower advance of the public watermark. Only the last execution leads to a change of its value. Besides, thanks to the combination of the MAX function at the subtask level and the MIN function at the task level, the public watermark is monotonically increasing, i.e. it never goes back:
| Subtask 1 new event times | Subtask 1 new watermark | Subtask 2 new event times | Subtask 2 new watermark | New task watermark |
|---|---|---|---|---|
| 10:20, 10:29 | MAX(10:30, 10:29) = 10:30 | 10:30, 10:31 | MAX(10:30, 10:31) = 10:31 | MIN(10:30, 10:31) = 10:30 |
| 10:28, 10:29 | MAX(10:30, 10:29) = 10:30 | 10:31, 10:35 | MAX(10:31, 10:35) = 10:35 | MIN(10:30, 10:35) = 10:30 |
| 10:28, 10:30 | MAX(10:30, 10:30) = 10:30 | 10:36, 10:40 | MAX(10:35, 10:40) = 10:40 | MIN(10:30, 10:40) = 10:30 |
| 10:34, 10:36 | MAX(10:30, 10:36) = 10:36 | 10:39, 10:45 | MAX(10:40, 10:45) = 10:45 | MIN(10:36, 10:45) = 10:36 |
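Replaying the table with the sketch from above (still the same illustrative helpers, not real Flink code) confirms that the public watermark changes only on the last execution and never regresses:

```scala
import java.time.LocalTime

def t(s: String): LocalTime = LocalTime.parse(s) // "10:30" -> 10:30

// Same helpers as in the previous snippet.
def advanceLocal(current: LocalTime, newEvents: Seq[LocalTime]): LocalTime =
  (current +: newEvents).max
def taskWatermark(localWatermarks: Seq[LocalTime]): LocalTime =
  localWatermarks.min

var local1 = t("10:30") // subtask 1 local watermark
var local2 = t("10:30") // subtask 2 local watermark

// The four executions from the table: (subtask 1 events, subtask 2 events).
val executions = Seq(
  (Seq(t("10:20"), t("10:29")), Seq(t("10:30"), t("10:31"))),
  (Seq(t("10:28"), t("10:29")), Seq(t("10:31"), t("10:35"))),
  (Seq(t("10:28"), t("10:30")), Seq(t("10:36"), t("10:40"))),
  (Seq(t("10:34"), t("10:36")), Seq(t("10:39"), t("10:45")))
)

executions.foreach { case (subtask1Events, subtask2Events) =>
  local1 = advanceLocal(local1, subtask1Events)
  local2 = advanceLocal(local2, subtask2Events)
  println(taskWatermark(Seq(local1, local2))) // 10:30, 10:30, 10:30, 10:36
}
```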
The MAX-MIN approach brings a risk that idle or highly skewed subtasks prevent the whole stream from moving on and emitting the watermark-conditioned results. Apache Flink addresses these issues with watermark alignment and idleness detection but, since they're much more technical than monotonicity, I'll try to explain them in future blog posts!