Anatomy of a Structured Streaming job

Apache Spark Structured Streaming relies on the micro-batch pattern, which evaluates the same query in each execution. That's only the high-level view, though. Under the hood, many other interesting things happen.
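
To make the micro-batch idea concrete, here is a minimal plain-Python sketch. It is not Spark code; `run_micro_batches` and the sample batches are hypothetical names and data used only to show the same query being applied to each new chunk of input.

```python
from typing import Callable, Iterable, List


def run_micro_batches(
    source: Iterable[List[int]], query: Callable[[List[int]], int]
) -> List[int]:
    """Apply the same query to every micro-batch and keep one result per batch."""
    results = []
    for batch in source:
        # The query never changes between executions; only the input does.
        results.append(query(batch))
    return results


# Hypothetical input: three micro-batches of integers.
batches = [[1, 2, 3], [4, 5], [6]]
batch_sums = run_micro_batches(batches, query=sum)
print(batch_sums)
```

A real engine also tracks offsets, state, and checkpoints between executions, which this sketch leaves out on purpose.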

Min rate limits for Apache Kafka

I bet you know it already. You can limit the max throughput for Apache Spark Structured Streaming jobs for popular data sources such as Apache Kafka, Delta Lake, or raw files. Did you know that you can also control the lower limit, at least for Apache Kafka?
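
As a rough sketch, the upper and lower limits could be passed to the Kafka source as reader options. The option names `minOffsetsPerTrigger`, `maxOffsetsPerTrigger`, and `maxTriggerDelay` are the ones I know from the Spark Kafka integration guide; the helper `kafka_rate_limit_options`, its sanity check, and the sample values are assumptions made for illustration.

```python
def kafka_rate_limit_options(min_offsets: int, max_offsets: int, max_delay: str) -> dict:
    """Build the rate-limit options for a Kafka streaming reader (hypothetical helper)."""
    # This guard is my own sanity check, not something taken from Spark.
    if min_offsets > max_offsets:
        raise ValueError("min_offsets must not exceed max_offsets")
    return {
        # Lower limit: wait until at least this many offsets are available...
        "minOffsetsPerTrigger": str(min_offsets),
        # ...but never read more than this many offsets in one micro-batch.
        "maxOffsetsPerTrigger": str(max_offsets),
        # Longest time a micro-batch may be delayed while waiting for the minimum.
        "maxTriggerDelay": max_delay,
    }


options = kafka_rate_limit_options(1000, 10000, "5m")
print(options)

# Usage, assuming an existing SparkSession `spark` and the Kafka connector:
# spark.readStream.format("kafka").options(**options) ...
```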

What's new on the cloud for data engineers - part 12 (10.2023-02.2024)

It's time for another part of "What's new on the cloud for data engineers". Let's see what happened in the last 5 months.

Table file formats - streaming writer: Delta Lake

In the previous blog post of the series, we discovered the streaming reader. However, an end-to-end streaming Delta Lake pipeline also requires a writer, which will be our focus today.

Apache Flink and cluster components deep dive

Previously, you could read about the transformation of a user job definition into an executable stream graph. Since this explanation was relatively high-level, I decided to take a deep dive into the final step, executing the code.

Static enrichment dataset with Delta Lake

Data enrichment is one of the common data engineering tasks. It's relatively easy to implement with static datasets because of the data availability. However, this apparently easy task can become a nightmare if used with inappropriate technologies.

Table file formats - streaming reader: Delta Lake

Even though I'm into streaming these days, I haven't really covered streaming in Delta Lake yet. I only briefly blogged about Change Data Feed but completely missed the fundamentals. Hopefully, this and the next blog posts will change this!

Files streaming is quite a challenge

It's technically possible to process files in a continuous way from a streaming job. However, if you expect a latency-sensitive job, this will always be slower than processing data directly from a streaming broker. Why?

Stream processing models

If you're interested in stream processing, I bet your thinking is technology-based. That's not wrong; after all, the ability to use a tool gives you and me a job. However, in the long term it's better to reason in terms of patterns or models. Being aware of the more general picture helps you assimilate new tools.

2023 retrospective on waitingforcode.com

This is one of my favorite blog posts, the yearly retrospective. Every year I summarize what happened in the past 12 months and share with you my future plans. It's time for the 2023 Edition!

Streamhouse, the next house to move into?

I must admit it: if you want to catch my attention, you can use some keywords. One of them is "stream". Knowing that, the topic of my new blog post shouldn't surprise you.

Order is king for the performance

Even though data processing frameworks and data stores nowadays have smart query planners, they don't take away our responsibility to design the job logic correctly.

Data+AI Summit 2023, retrospective part 2

One week later than initially announced, but here it is, the second part of the Data+AI Summit 2023 retrospective. I don't know how, but I managed to include some streaming-related talks here too!

Vertical autoscaling for data processing on the cloud

"Vertical scaling" has caught my attention a few times already while I was reading about cloud updates. I've always considered horizontal scaling the single true scaling policy for elastic data processing pipelines. Have I been wrong?

Accumulators and reliability

In March I wrote a blog post showing how to use accumulators to track the application of each filter statement. It turns out the solution may not be perfect, as Aravind mentioned in one of the comments. I bet you already have an idea, but if not, keep reading. Everything will be clear in the end!
