Data processing articles

Looking for something else? Check the categories of Data processing:

Apache Beam, Apache Flink, Apache Spark, Apache Spark GraphFrames, Apache Spark GraphX, Apache Spark SQL, Apache Spark Streaming, Apache Spark Structured Streaming, PySpark

Otherwise, you can find all articles belonging to Data processing below.

Structured Streaming and Apache Kafka Schema Registry

The topic of this post was brought to me by Luan Carvalho, who shared an Open Source project connecting Apache Spark to Apache Kafka Schema Registry. Initially, I wanted to focus exclusively on the project, but along the way I discovered some other interesting points.

Continue Reading →

Join hints in Apache Spark SQL

With the Adaptive Query Execution module, you may have the feeling that Apache Spark will optimize the job for you. In part, yes, because it can optimize the job based on runtime parameters you don't necessarily know. However, you can also master the execution yourself, and hints are one of these mastery tools.
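
To give you a first taste, here is a minimal sketch of two ways of passing a broadcast hint; the local SparkSession setup and the table and column names are made up for the illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-hints").master("local[*]").getOrCreate()
import spark.implicits._

val orders = Seq((1, "open"), (2, "closed")).toDF("customer_id", "status")
val customers = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")

// Dataset API: put the hint on the side you want to broadcast
val joined = orders.join(customers.hint("broadcast"), $"customer_id" === $"id")

// SQL API: the same hint expressed as a comment-like annotation
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")
val joinedSql = spark.sql(
  "SELECT /*+ BROADCAST(customers) */ * FROM orders JOIN customers ON customer_id = id")
```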

Continue Reading →

Collect action and determinism

Even though nowadays RDD tends to be a low-level abstraction and we should use the SQL API, some of its methods are still used under the hood by the framework. During one of my explorations I wanted to analyze the task responsible for listing the files to process. Initially, my thought was "oh, it uses collect() and the results will be different every time". However, after looking at the code I saw how wrong I was!
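
To illustrate the starting point of the analysis, here is a small experiment you can run locally; it shows that collect() concatenates the per-partition results in partition-index order, so the output order is stable for a deterministically partitioned RDD.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("collect-order").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// collect() runs a job returning one array per partition and concatenates
// them in partition-index order, hence the stable output across runs
val rdd = sc.parallelize(1 to 8, numSlices = 4)
println(rdd.collect().mkString(", ")) // always prints: 1, 2, 3, 4, 5, 6, 7, 8
```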

Continue Reading →

Arbitrary stateful processing: update and put dependency

At first glance, the update operation in an arbitrary stateful application looks just like another map's put function. However, it has an impact on what happens later with the state store. In this blog post, you will see an example that can eventually help you reduce the I/O pressure of the updates.
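
As a reminder of the API in question, here is a minimal, hypothetical per-key counter; the state type and the business logic are made up, only GroupState.update comes from the real API.

```scala
import org.apache.spark.sql.streaming.GroupState

// update() replaces the state value and flags it as modified - it's this
// flag that later translates into a put() against the state store
def countPerKey(key: String, values: Iterator[String],
                state: GroupState[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + values.size
  state.update(newCount)
  (key, newCount)
}

// plugged in with something like:
// events.groupByKey(extractKey).mapGroupsWithState(countPerKey _)
```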

Continue Reading →

Does maxOffsetsPerTrigger guarantee idempotent processing?

If you've used the Apache Kafka source in Structured Streaming, you undoubtedly noticed a property called maxOffsetsPerTrigger. According to the documentation, it sets a "limit on maximum number of offsets processed per trigger interval". My initial reaction to this property was, "Cool! We can enforce idempotent processing". I was not wrong, but this blog post will show you that I wasn't entirely right either!
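
For reference, this is where the property fits in a Kafka source definition; the broker address and topic name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rate-limited-kafka").getOrCreate()

// maxOffsetsPerTrigger caps the number of offsets read in a single trigger,
// divided across the topic's partitions
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", "10000")
  .load()
```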

Continue Reading →

Apache Kafka transactional writer with foreach sink, is it possible?

Even though Apache Kafka supports transactional producers, they're not present in the Apache Spark Kafka sink. Despite that, is it possible to implement a transactional producer in Apache Spark Structured Streaming? You will find out at the end of this article.
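
To set the scene, here is one possible sketch of such a producer wrapped in a ForeachWriter. It's only an illustration of the idea, not an exactly-once recipe: the transactional.id scheme, topic name and broker address are made up, and retries can still produce duplicates.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// one Kafka transaction per partition and epoch
class TransactionalKafkaWriter extends ForeachWriter[String] {
  private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("transactional.id", s"stream-writer-$partitionId-$epochId")
    producer = new KafkaProducer[String, String](props)
    producer.initTransactions()
    producer.beginTransaction()
    true
  }

  override def process(value: String): Unit =
    producer.send(new ProducerRecord[String, String]("output-topic", value))

  override def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) producer.commitTransaction()
    else producer.abortTransaction()
    producer.close()
  }
}
```

It would be plugged in with something like stream.writeStream.foreach(new TransactionalKafkaWriter()).start().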

Continue Reading →

Shuffle writers: UnsafeShuffleWriter

It's the last part of the shuffle writers series. The picture so far, composed of SortShuffleWriter and BypassMergeSortShuffleWriter, will be completed today with UnsafeShuffleWriter.

Continue Reading →

Shuffle writers: BypassMergeSortShuffleWriter

In the previous blog post we discovered the SortShuffleWriter. However, the SortShuffleManager's first choice is BypassMergeSortShuffleWriter, presented in this article.
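
If you want to observe this writer in action, its eligibility conditions can be recalled with a small, assumed setup: no map-side aggregation and a number of reduce partitions at most spark.shuffle.sort.bypassMergeThreshold.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bypass-writer-demo")
  .master("local[*]")
  // the bypass writer is only eligible up to this many reduce partitions
  // (200 is the default value)
  .config("spark.shuffle.sort.bypassMergeThreshold", "200")
  .getOrCreate()

// groupByKey has no map-side combine, so with few reduce partitions the
// shuffle below can take the bypass path
val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
rdd.groupByKey(numPartitions = 8).count()
```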

Continue Reading →

Shuffle writers: SortShuffleWriter

In the beginning I thought that the mappers sent shuffle files to the reducers. After understanding that it was the opposite, I thought that a part of the shuffle data was kept in memory for performance purposes... Once I corrected all these misbeliefs about shuffle, I noted a few points to explore. One of them is the shuffle writers, which I will present in the next 3 blog posts.

Continue Reading →

Under-the-hood: repartition

Previously we discovered what happens when you coalesce a dataset. To recall, it doesn't involve a shuffle operation. It's thus the opposite of the repartition operation, which is a first-class shuffle citizen.
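
A quick way to see the difference is to compare the physical plans; a minimal sketch, assuming a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repartition-demo").master("local[*]").getOrCreate()

val df = spark.range(0, 1000, 1, numPartitions = 8)

// coalesce only merges existing partitions - no shuffle involved
val narrowed = df.coalesce(2)
// repartition redistributes the rows with a full shuffle
val reshuffled = df.repartition(16)

// only the second plan contains an Exchange node
narrowed.explain()
reshuffled.explain()
```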

Continue Reading →

State store metrics

The state store is a critical part of any stateful Structured Streaming application. It's important to know what happens when your business logic and input data interact with it. State store metrics will provide you with some key insights into this interaction. If you don't know them yet, no worries, they're the topic of this blog post!
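
Before diving in, here is where you can find these metrics programmatically; the logging logic is a hypothetical example built on the StreamingQueryProgress API.

```scala
import org.apache.spark.sql.streaming.StreamingQuery

// every micro-batch progress exposes the state store numbers through
// stateOperators; lastProgress can be null before the first batch completes
def logStateMetrics(query: StreamingQuery): Unit = {
  Option(query.lastProgress).foreach { progress =>
    progress.stateOperators.foreach { operator =>
      println(s"state rows total=${operator.numRowsTotal}, " +
        s"updated=${operator.numRowsUpdated}, memory=${operator.memoryUsedBytes} bytes")
    }
  }
}
```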

Continue Reading →

Checkpoint file manager - FileSystem and FileContext

If you read my blog posts, you certainly noticed that I very often get lost on the internet. Fortunately, very often it helps me write blog posts. But the internet is not the only place where I get lost. It also happens to me with Apache Spark code, and one of my most recent confusions was about the FileSystem and FileContext classes.
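
To show the two protagonists side by side, here is a minimal sketch with the Hadoop API; the local paths are made up.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileContext, FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

val conf = new Configuration()

// the older API: instances are cached per scheme, authority and user
val fs = FileSystem.get(new URI("file:///"), conf)
fs.mkdirs(new Path("/tmp/checkpoint-fs"))

// the newer API: a facade over AbstractFileSystem, exposing e.g. rename
// with overwrite semantics, which matters for checkpoint file managers
val fc = FileContext.getFileContext(new URI("file:///"), conf)
fc.mkdir(new Path("/tmp/checkpoint-fc"), FsPermission.getDirDefault, true)
```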

Continue Reading →

What's new in Apache Spark 3.1.1 - new built-in functions

Every Apache Spark release brings not only completely new components but also new native functions. The 3.1.1 release is not an exception, and it also comes with some new built-in functions!

Continue Reading →

What's new in Apache Spark 3.1 - JDBC (WIP) and DataSource V2 API

Even though the change I will describe in this blog post is still in progress, it's worth attention, especially since I missed the DataSource V2 evolution in my previous blog posts.

Continue Reading →

What's new in Apache Spark 3.1 - Kubernetes Generally Available!

After several months spent as an "experimental" feature in Apache Spark, Kubernetes was officially promoted to a Generally Available scheduler in the 3.1 release! In this blog post, we'll discover the last changes made before this promotion.

Continue Reading →

What's new in Apache Spark 3.1 - nodes decommissioning

I have a feeling that a lot of things related to scalability happened in the 3.1 release. The General Availability of Kubernetes that I will cover next week is only one of them. The second one is node decommissioning!
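
As a teaser, enabling the feature boils down to a few flags. The names below are the ones I found for the 3.1 release; double-check them against the documentation of your exact version before relying on this sketch.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("decommissioning-demo")
  .config("spark.decommission.enabled", "true")
  // also migrate the cached RDD and shuffle blocks away from the leaving node
  .config("spark.storage.decommission.enabled", "true")
  .config("spark.storage.decommission.rddBlocks.enabled", "true")
  .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
  .getOrCreate()
```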

Continue Reading →

What's new in Apache Spark 3.1 - predicate pushdown for JSON, CSV and Apache Avro

Predicate pushdown is a data processing technique that takes user-defined filters and executes them while reading the data. Apache Spark already supported it for Apache Parquet and RDBMS. Starting from Apache Spark 3.1.1, you can also use it for the Apache Avro, JSON and CSV formats!
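
In practice it looks like any other filtered read; a sketch under the assumption of a JSON dataset with an explicit schema (the path is hypothetical, and the flag name is the one I recall from the SQLConf entries, so verify it for your version):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder().appName("pushdown-demo").getOrCreate()
import spark.implicits._

// the feature is governed by per-format flags such as this one
spark.conf.set("spark.sql.json.filterPushdown.enabled", "true")

// with an explicit schema, the filter can be evaluated while parsing each
// record instead of after materializing the whole row
val schema = new StructType().add("id", LongType).add("status", StringType)
spark.read.schema(schema).json("/tmp/events").filter($"status" === "active").show()
```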

Continue Reading →

What's new in Apache Spark 3.1 - Project Zen

I mentioned it very briefly in the first blog post ever about PySpark. Thanks to the Project Zen initiative, the Python part of Apache Spark will become more Pythonic and user-friendly. How? Let's check that in this blog post!

Continue Reading →

What's new in Apache Spark 3.1 - Structured Streaming

Aside from the joins presented in the previous blog post, Structured Streaming also got a few other interesting new features that I will present here.

Continue Reading →

What's new in Apache Spark 3.1 - streaming joins

In the previous blog post, you discovered what changed for joins in Apache Spark 3.1. If you remember the summary sentence, those were not the only join changes in this new release. Apart from them, you can also do a bit more with Structured Streaming joins!
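
To give a concrete flavor, here is a sketch of a stream-stream join using one of the newly supported join types; the two rate streams and the time-range condition are made up for the illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("stream-joins").getOrCreate()
import spark.implicits._

val impressions = spark.readStream.format("rate").load()
  .select($"value".as("imp_ad_id"), $"timestamp".as("impression_time"))
  .withWatermark("impression_time", "10 minutes")

val clicks = spark.readStream.format("rate").load()
  .select($"value".as("click_ad_id"), $"timestamp".as("click_time"))
  .withWatermark("click_time", "10 minutes")

// since 3.1, stream-stream joins also accept full outer (and left semi) types
val joined = impressions.join(clicks,
  expr("""imp_ad_id = click_ad_id AND
          click_time BETWEEN impression_time AND impression_time + interval 5 minutes"""),
  "full_outer")
```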

Continue Reading →