Data processing articles

Looking for something else? Check the categories of Data processing:

Apache Beam, Apache Flink, Apache Spark, Apache Spark GraphFrames, Apache Spark GraphX, Apache Spark SQL, Apache Spark Streaming, Apache Spark Structured Streaming, PySpark

Otherwise, below you can find all the articles belonging to Data processing.

Kubernetes concepts for Apache Spark

I had the idea for this blog post when I was preparing the "What's new in Apache Spark..." series. At that time, I was writing about Kubernetes in the context of Apache Spark but needed to "google" a lot of things along the way - mostly the Kubernetes API terms.

Continue Reading →

Distinct vs group by key difference

I've heard the opinion that using DISTINCT can have a negative impact on big data workloads, and that queries with GROUP BY are more performant. Is it true for Apache Spark SQL?
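
To make the comparison concrete, here is a minimal, illustrative sketch of the two query shapes in question (the table and column names are made up for the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(100).selectExpr("id % 10 AS category").createOrReplaceTempView("events")

# Both queries return the same unique values; the interesting part is
# comparing their physical plans.
distinct_query = spark.sql("SELECT DISTINCT category FROM events")
group_by_query = spark.sql("SELECT category FROM events GROUP BY category")

distinct_query.explain()
group_by_query.explain()
```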

Continue Reading →

What's new in Apache Spark 3.2.0 - miscellaneous changes

My Apache Spark 3.2.0 series is coming to an end. Today I'll focus on the miscellaneous changes - all the improvements I couldn't categorize in the previous blog posts.

Continue Reading →

What's new in Apache Spark 3.2.0 - Apache Parquet and Apache Avro improvements

I still have 2 topics remaining in my "What's new..." backlog. I'd like to share the first of them with you today and see what has changed for the Apache Parquet and Apache Avro data sources.

Continue Reading →

What's new in Apache Spark 3.2.0 - performance optimizations

Apache Spark 3.0 extended the static execution engine with a runtime optimization engine called Adaptive Query Execution. It has changed a lot since that very first release, including in the most recent version! But AQE is not the only performance improvement in 3.2.0, and I hope you'll see that in the blog post!
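
As a quick reminder of the feature the post builds on, AQE and its main sub-optimizations are driven by configuration flags; a minimal sketch (AQE is enabled by default as of 3.2.0, the explicit values are only for illustration):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Turn on Adaptive Query Execution (runtime re-optimization of the plan).
         .config("spark.sql.adaptive.enabled", "true")
         # Coalesce small shuffle partitions at runtime.
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         # Mitigate skewed joins by splitting oversized partitions.
         .config("spark.sql.adaptive.skewJoin.enabled", "true")
         .getOrCreate())
```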

Continue Reading →

What's new in Apache Spark 3.2.0 - PySpark and Pandas

Project Zen is an initiative to make PySpark more Pythonic and improve the Python programming experience. Apache Spark 3.2.0 takes the next step in this direction by bringing the pandas API to PySpark!
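
For readers who haven't tried it yet, a minimal sketch of the pandas API on Spark shipped with this release:

```python
# The pandas API on Spark lives in the pyspark.pandas module since Spark 3.2.0.
import pyspark.pandas as ps

# A pandas-like DataFrame backed by a distributed Spark DataFrame.
psdf = ps.DataFrame({"city": ["Paris", "Warsaw", "Paris"], "visits": [3, 5, 2]})

# Familiar pandas-style operations...
print(psdf.groupby("city").sum())

# ...with an escape hatch back to the regular Spark DataFrame API.
spark_df = psdf.to_spark()
spark_df.show()
```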

Continue Reading →

What's new in Apache Spark 3.2.0 - Data Source V2

Even though Data Source V2 has been present in the API for a while, every release brings something new to it. This time is no different, and we'll see what changed in this blog post!

Continue Reading →

What's new in Apache Spark 3.2.0 - push-based shuffle

In previous Apache Spark releases you could see many shuffle evolutions, such as shuffle file tracking or the pluggable storage interface. Things don't change in 3.2.0, which comes with push-based shuffle.
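
To give an idea of how the feature is switched on, a hedged sketch of the client-side configuration (push-based shuffle in 3.2.0 also relies on the external shuffle service being set up for merging on the server side, and initially targets YARN deployments):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Client-side flag asking executors to push shuffle blocks to remote
         # shuffle services so they can be merged per reducer.
         .config("spark.shuffle.push.enabled", "true")
         # The external shuffle service must be enabled for the merge to happen.
         .config("spark.shuffle.service.enabled", "true")
         .getOrCreate())
```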

Continue Reading →

What's new in Apache Spark 3.2.0 - SQL changes

Apache Spark SQL evolves, and with each new release it gets closer to the ANSI standard. The 3.2.0 release is no different, and you can find many ANSI-related changes in it. But not only those - hopefully you'll discover all of this in this blog post, which has an unusual form because this time I won't focus on the implementation details.
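
As a small illustration of that ANSI direction, the dedicated flag makes Spark fail on operations that would otherwise silently return NULL or overflow (a sketch, not an exhaustive list of affected operations):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # With ANSI mode on, invalid casts and arithmetic overflows raise errors
         # instead of silently producing NULLs or wrapped values.
         .config("spark.sql.ansi.enabled", "true")
         .getOrCreate())

# Fails at execution time under ANSI mode; with the flag off it would return NULL.
spark.sql("SELECT CAST('not a number' AS INT)").show()
```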

Continue Reading →

What's new in Apache Spark 3.2.0 - Structured Streaming

After the previous blog posts focusing on 2 specific Structured Streaming features, it's time to complete them with a list of the other changes made in the 3.2.0 release!

Continue Reading →

What's new in Apache Spark 3.2.0 - session windows

Initially I wanted to include session windows in the blog post about Structured Streaming changes, but I changed my mind when I saw how many things they involve!
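
For context, the user-facing entry point is the session_window function added in 3.2.0, which groups events arriving within a gap duration of each other; the data and column names below are only an illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, session_window

spark = SparkSession.builder.getOrCreate()

# A small batch DataFrame standing in for a stream of user events.
events = spark.createDataFrame(
    [("user1", "2021-10-01 10:00:00"), ("user1", "2021-10-01 10:03:00")],
    ["user_id", "event_time"],
).selectExpr("user_id", "CAST(event_time AS TIMESTAMP) AS event_time")

# A session window closes after 5 minutes of inactivity per user.
sessions = (events
            .groupBy(session_window("event_time", "5 minutes"), "user_id")
            .agg(count("*").alias("events_in_session")))
sessions.show(truncate=False)
```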

Continue Reading →

What's new in Apache Spark 3.2.0 - RocksDB state store

It's big news for Apache Spark Structured Streaming users: RocksDB is now available as a state store backend in vanilla Apache Spark!
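
For the curious, switching to the new backend boils down to a single configuration entry (a sketch; the provider class is the one shipped with Spark 3.2.0):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Replace the default JVM-heap state store with RocksDB, which keeps
         # streaming state on local disk instead of the executor memory.
         .config(
             "spark.sql.streaming.stateStore.providerClass",
             "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
         )
         .getOrCreate())
```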

Continue Reading →

Stage level scheduling

The idea of writing this blog post came to me when I was analyzing the Kubernetes changes in Apache Spark 3.1.1. Starting from this version we can use stage-level scheduling, which so far was only available for YARN. Even though it's probably a very low-level feature, it intrigued me enough to write a few words here!
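
To make the term concrete, the user-facing part of the feature is the ResourceProfile API attached to an RDD; a minimal sketch (actually applying the profile at runtime requires a supported cluster manager with dynamic allocation enabled):

```python
from pyspark.sql import SparkSession
from pyspark.resource import (ExecutorResourceRequests, ResourceProfileBuilder,
                              TaskResourceRequests)

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Ask for different resources for the stages computing this RDD, e.g. bigger
# executors for a memory-hungry step than for the surrounding ETL stages.
executor_reqs = ExecutorResourceRequests().cores(4).memory("8g")
task_reqs = TaskResourceRequests().cpus(2)
profile = ResourceProfileBuilder().require(executor_reqs).require(task_reqs).build

rdd = sc.parallelize(range(100)).map(lambda x: x * 2).withResources(profile)
```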

Continue Reading →

Iterators in Apache Spark

I had this "aha moment" while I was preparing the blog posts about the shuffle readers. Apache Spark uses iterators a lot! In this blog post you will see the places where I've encountered them over the last months.

Continue Reading →

Shuffle reading in Apache Spark SQL - wrapping iterators and beyond

It's time for the 2nd blog post about the shuffle readers. Recently, we discovered how Apache Spark fetches the shuffle blocks from local and remote hosts. Today, I would like to share with you the wrapping iterators. Sounds mysterious? It won't be if we start by looking at the iterators participating in the processing of shuffle block files.

Continue Reading →

Shuffle reading in Apache Spark SQL

So far I've covered the writing part of the shuffle files. You've learned about 3 different shuffle writers, but what happens with the files they generate? Who reads them, and how? Is the reading an in-memory operation? I will try to answer these and some other questions in this blog post.

Continue Reading →

Apache Spark can be eagerly evaluated too - Commands

Some time ago I participated in an interesting meetup about the MERGE operation in Delta Lake (link in the Further reading section). Jacek Laskowski presented the operation's internals and asked an interesting question about the difference between commands and execs. Since I didn't know the answer right away, I decided to explore the concept of commands in this blog post.
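
A simple way to observe the eager behavior discussed here: SQL commands run as soon as spark.sql() is called, while regular queries stay lazy until an action; a hedged sketch with a made-up table name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A command: CREATE TABLE is executed eagerly, inside the spark.sql() call itself.
spark.sql("CREATE TABLE IF NOT EXISTS demo_numbers (id INT) USING parquet")

# A query: nothing is computed yet, only the logical plan is built...
numbers = spark.sql("SELECT id FROM demo_numbers WHERE id > 5")

# ...the physical execution happens only when an action is triggered.
numbers.show()
```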

Continue Reading →

Structured Streaming and Apache Kafka Schema Registry

The topic of this post was brought to me by Luan Carvalho, who shared an Open Source project connecting Apache Spark to the Apache Kafka Schema Registry. Initially, I wanted to focus exclusively on the project, but along the way I discovered some other interesting points.

Continue Reading →

Join hints in Apache Spark SQL

With the Adaptive Query Execution module, you can get the feeling that Apache Spark will optimize the job for you. In part, yes, because it can optimize the job based on runtime parameters you don't necessarily know. However, you can also steer the execution yourself, and one of the tools for that is hints.
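
To show what a hint looks like in practice, both the DataFrame and the SQL syntax accept one; the tables and columns below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.range(1000).withColumnRenamed("id", "order_id")
small_dim = spark.range(10).withColumnRenamed("id", "order_id")

# DataFrame API: ask the planner to broadcast the smaller side of the join.
orders.join(small_dim.hint("broadcast"), "order_id").explain()

# SQL API: the same request expressed as a hint comment.
orders.createOrReplaceTempView("orders")
small_dim.createOrReplaceTempView("small_dim")
spark.sql("""
  SELECT /*+ BROADCAST(small_dim) */ *
  FROM orders JOIN small_dim USING (order_id)
""").explain()
```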

Continue Reading →

Collect action and determinism

Even though nowadays RDD tends to be a low-level abstraction and we should use the SQL API, some of its methods are still used under the hood by the framework. During one of my explorations I wanted to analyze the task responsible for listing the files to process. Initially, my thought was "oh, it uses collect() and the results will be different every time". However, after looking at the code I saw how wrong I was!

Continue Reading →