Data processing articles

Home Data processing

Looking for something else? Check the categories of Data processing:

Apache Beam Apache Flink Apache Spark Apache Spark GraphFrames Apache Spark GraphX Apache Spark SQL Apache Spark Streaming Apache Spark Structured Streaming PySpark

If not, below you can find all articles belonging to Data processing.

February 21, 2024 • Apache Spark Structured Streaming

Min rate limits for Apache Kafka

I bet you know it already. You can limit the max throughput for Apache Spark Structured Streaming jobs for popular data sources such as Apache Kafka, Delta Lake, or raw files. Have you known that you can also control the lower limit, at least for Apache Kafka?

Continue Reading →

January 31, 2024 • Apache Flink

Apache Flink and cluster components deep dive

Previously you could read about transformation of a user job definition into an executable stream graph. Since this explanation was relatively high-level, I decided to deep dive into the final step executing the code.

Continue Reading →

January 24, 2024 • Apache Spark Structured Streaming

Static enrichment dataset with Delta Lake

Data enrichment is one of common data engineering tasks. It's relatively easy to implement with static datasets because of the data availability. However, this apparently easy task can become a nightmare if used with inappropriate technologies.

Continue Reading →

November 29, 2023 • Apache Spark Structured Streaming

Accumulators and reliability

In March I wrote a blog showing how to use accumulators to know the application of each filter statement. Turns out, the solution may not be perfect as mentioned by Aravind in one of the comments. I bet you already have an idea but if not, keep reading. Everything will be clear in the end!

Continue Reading →

November 15, 2023 • Apache Flink

Apache Flink - anatomy of a job

Have you written your first successful Apache Flink job and are still wondering the high-level API translates into the executable details? I did and decided to answer the question in the new blog post.

Continue Reading →

November 1, 2023 • Apache Spark Structured Streaming

What's new in Apache Spark 3.5.0 - watermark propagation

Watermark, or rather multiple watermarks management, has been a thorn in the side of Apache Spark Structured Streaming. It has improved in the previous release (3.4.0) but still had some room for improvement. Well, it did have because the 3.5.0 release brought a serious fix for the multiple watermarks scenario.

Continue Reading →

October 25, 2023 • Apache Spark Structured Streaming

What's new in Apache Spark 3.5.0 - Structured Streaming

It's time to start the series covering Apache Spark 3.5.0 features. As the first topic I'm going to cover Structured Streaming which has got a lot of RocksDB improvements and some major API changes.

Continue Reading →

October 18, 2023 • Apache Spark Structured Streaming

Watermark and input data filtering in Apache Spark Structured Streaming

I've already written about watermarks in a few places in the blog but despite that, I still find things to refresh. One of them is the watermark used to filter out the late data, which will be the topic of this blog post.

Continue Reading →

October 4, 2023 • Apache Spark Structured Streaming

Making applyInPandasWithState less painful

Do not get the title wrong! Having applyInPandasWithState in the PySpark API is huge! However, due to Python duck typing, some operations are more difficult and more risky to express in the code than in the strongly typed Scala API.

Continue Reading →

September 27, 2023 • Apache Spark Structured Streaming

Arbitrary stateful processing in PySpark with applyInPandasWithState

It's always a huge pleasure to see the PySpark API covering more and more Scala API features. Starting from Apache Spark 3.4.0 you can even write arbitrary stateful processing jobs! But since the API is a little bit different than the one available on the Scala side, I wanted to take a deeper look.

Continue Reading →

September 15, 2023 • Apache Flink

Apache Flink best practices - Flink Forward lessons learned

I won't hide it, I'm still a fresher in the Apache Flink world and despite my past streaming experiences with Apache Spark Structured Streaming and GCP Dataflow, I need to learn. And to learn a new tool or concept, there is nothing better than watching some conference talks!

Continue Reading →

August 10, 2023 • Apache Spark Structured Streaming

_spark_metadata in Apache Spark Structured Streaming issue is no more!

There are probably not that many people working today on the flat files with Structured Streaming than 5 years ago thanks to the table file formats. However, if you are in this group and are still generating CSVs or JSONs with the streaming sink, brace yourself, the memory problems are coming if you don't take action!

Continue Reading →

August 2, 2023 • Apache Spark Structured Streaming

The first state in Apache Spark Structured Streaming arbitrary stateful processing

When you wrote your first arbitrary stateful processing pipelines, the state expiration is maybe the first tricky point you had to deal with. Why is that? After all, it's just about setting the timeout, doesn't it? Most of the time, yes, but there is an exception.

Continue Reading →

July 25, 2023 • Apache Spark Structured Streaming

State expiration in stream-to-stream joins with event time range condition

You certainly know it, the watermark (aka GC Watermark) is responsible for cleaning state store in Apache Spark Structured Streaming. But you may not know that it's not the single time-based condition. There is a different one involved in the stream-to-stream joins.

Continue Reading →

July 21, 2023 • Apache Spark Structured Streaming

How to initialize state in Apache Spark Structured Streaming stateful jobs?

Starting from Apache Spark 3.2.0 is now possible to load an initial state of the arbitrary stateful pipelines. Even though the feature is easy to implement, it hides some interesting implementation details!

Continue Reading →

July 7, 2023 • Apache Spark Structured Streaming

Multiple queries running in Apache Spark Structured Streaming

That's often a dilemma, whether we should put multiple sinks working on the same data source in the same or in different Apache Spark Structured Streaming applications? Both solutions may be valid depending on your use case but let's focus here on the former one including multiple sinks together.

Continue Reading →

June 30, 2023 • Apache Flink

Yes, I'm learning Apache Flink - beginner's problems

Surprised? You shouldn't. I've always been eager to learn, including 5 years ago when for the first time, I left my Apache Spark comfort zone to explore Apache Beam. Since then I had a chance to write some Dataflow streaming pipelines to fully appreciate this technology and work on AWS, GCP, and Azure. But there is some excitement for learning-from scratch I miss.

Continue Reading →

June 23, 2023 • Apache Spark

What's new in Apache Spark 3.4.0 - shuffle changes

Shuffle is a permanent point in the What's new in Apache Spark series. Why? It's often one the most time consuming part of the jobs and knowing the improvement simply helps writing better pipelines.

Continue Reading →

June 15, 2023 • Apache Spark

What's new in Apache Spark 3.4.0 - Spark Connect

Spark Connect is probably the most expected feature in Apache Spark 3.4.0. It was announced in the Data+AI Summit 2022 keynotes and has a lot of coverage in social media right now. I'll try to add my small contribution to this by showing some implementation details.

Continue Reading →

May 31, 2023 • Apache Spark Structured Streaming

What's new in Apache Spark 3.4.0 - Structured Streaming

The asynchronous progress tracking and correctness issue fixes presented in the previous blog posts are not the single new feature in Apache Spark Structured Streaming 3.4.0. There are many others but to keep the blog post readable, I'll focus here only on 3 of them.

Continue Reading →

⟵ Previous
1
2
3
4
5
6
7
8
Next ⟶