Data processing articles

Looking for something else? Check the categories of Data processing:

Apache Beam Apache Flink Apache Spark Apache Spark GraphFrames Apache Spark GraphX Apache Spark SQL Apache Spark Streaming Apache Spark Structured Streaming PySpark

If not, below you can find all articles belonging to Data processing.

StreamingQueryListener, from states to questions

Apache Spark leverages the observer design pattern for the framework-to-code communication. One of the consumers' implementations is StreamingQueryListener.

Continue Reading β†’

Processing time trigger, to be or not to be?

That's the question. The lack of the processing time trigger means more a reactive micro-batch triggering but it cannot be considered as the single true best practice. Let's see why.

Continue Reading β†’

Apache Flink and the input data reading

I'm writing this unexpected blog post because I got stuck with watermarks and checkpoints and felt that I was missing some basics. Even though this introduction is a bit negative, the exploration for the data reading enabled my other discoveries.

Continue Reading β†’

Anatomy of a Structured Streaming job

Apache Spark Structured Streaming relies on the micro-batch pattern which evaluates the same query in each execution. That's only a high level vision, though. Under-the-hood, there are many other interesting things that happen.

Continue Reading β†’

Min rate limits for Apache Kafka

I bet you know it already. You can limit the max throughput for Apache Spark Structured Streaming jobs for popular data sources such as Apache Kafka, Delta Lake, or raw files. Have you known that you can also control the lower limit, at least for Apache Kafka?

Continue Reading β†’

Apache Flink and cluster components deep dive

Previously you could read about transformation of a user job definition into an executable stream graph. Since this explanation was relatively high-level, I decided to deep dive into the final step executing the code.

Continue Reading β†’

Static enrichment dataset with Delta Lake

Data enrichment is one of common data engineering tasks. It's relatively easy to implement with static datasets because of the data availability. However, this apparently easy task can become a nightmare if used with inappropriate technologies.

Continue Reading β†’

Accumulators and reliability

In March I wrote a blog showing how to use accumulators to know the application of each filter statement. Turns out, the solution may not be perfect as mentioned by Aravind in one of the comments. I bet you already have an idea but if not, keep reading. Everything will be clear in the end!

Continue Reading β†’

Apache Flink - anatomy of a job

Have you written your first successful Apache Flink job and are still wondering the high-level API translates into the executable details? I did and decided to answer the question in the new blog post.

Continue Reading β†’

What's new in Apache Spark 3.5.0 - watermark propagation

Watermark, or rather multiple watermarks management, has been a thorn in the side of Apache Spark Structured Streaming. It has improved in the previous release (3.4.0) but still had some room for improvement. Well, it did have because the 3.5.0 release brought a serious fix for the multiple watermarks scenario.

Continue Reading β†’

What's new in Apache Spark 3.5.0 - Structured Streaming

It's time to start the series covering Apache Spark 3.5.0 features. As the first topic I'm going to cover Structured Streaming which has got a lot of RocksDB improvements and some major API changes.

Continue Reading β†’

Watermark and input data filtering in Apache Spark Structured Streaming

I've already written about watermarks in a few places in the blog but despite that, I still find things to refresh. One of them is the watermark used to filter out the late data, which will be the topic of this blog post.

Continue Reading β†’

Making applyInPandasWithState less painful

Do not get the title wrong! Having applyInPandasWithState in the PySpark API is huge! However, due to Python duck typing, some operations are more difficult and more risky to express in the code than in the strongly typed Scala API.

Continue Reading β†’

Arbitrary stateful processing in PySpark with applyInPandasWithState

It's always a huge pleasure to see the PySpark API covering more and more Scala API features. Starting from Apache Spark 3.4.0 you can even write arbitrary stateful processing jobs! But since the API is a little bit different than the one available on the Scala side, I wanted to take a deeper look.

Continue Reading β†’

Apache Flink best practices - Flink Forward lessons learned

I won't hide it, I'm still a fresher in the Apache Flink world and despite my past streaming experiences with Apache Spark Structured Streaming and GCP Dataflow, I need to learn. And to learn a new tool or concept, there is nothing better than watching some conference talks!

Continue Reading β†’

_spark_metadata in Apache Spark Structured Streaming issue is no more!

There are probably not that many people working today on the flat files with Structured Streaming than 5 years ago thanks to the table file formats. However, if you are in this group and are still generating CSVs or JSONs with the streaming sink, brace yourself, the memory problems are coming if you don't take action!

Continue Reading β†’

The first state in Apache Spark Structured Streaming arbitrary stateful processing

When you wrote your first arbitrary stateful processing pipelines, the state expiration is maybe the first tricky point you had to deal with. Why is that? After all, it's just about setting the timeout, doesn't it? Most of the time, yes, but there is an exception.

Continue Reading β†’

State expiration in stream-to-stream joins with event time range condition

You certainly know it, the watermark (aka GC Watermark) is responsible for cleaning state store in Apache Spark Structured Streaming. But you may not know that it's not the single time-based condition. There is a different one involved in the stream-to-stream joins.

Continue Reading β†’

How to initialize state in Apache Spark Structured Streaming stateful jobs?

Starting from Apache Spark 3.2.0 is now possible to load an initial state of the arbitrary stateful pipelines. Even though the feature is easy to implement, it hides some interesting implementation details!

Continue Reading β†’

Multiple queries running in Apache Spark Structured Streaming

That's often a dilemma, whether we should put multiple sinks working on the same data source in the same or in different Apache Spark Structured Streaming applications? Both solutions may be valid depending on your use case but let's focus here on the former one including multiple sinks together.

Continue Reading β†’