Data processing articles

Looking for something else? Check the categories of Data processing:

Apache Beam Apache Spark Apache Spark GraphFrames Apache Spark GraphX Apache Spark SQL Apache Spark Streaming Apache Spark Structured Streaming PySpark

If not, below you can find all articles belonging to Data processing.

Shuffle in Apache Spark, back to the basics

If you are a newcomer in the distributed world, someone certainly told you that shuffle is bad and will slow down your processing. But what does it mean? What happens when this infamous shuffle exists in your code? In this article you should find some answers for the shuffle in Apache Spark.

Continue Reading →

Partition-wise joins and Apache Spark SQL

Apache Spark has this great capacity to optimize joins of bucketed tables but does it work on partitions as well? No, and to understand why, I invite you to read the following sections of this blog post ?

Continue Reading →

Data+AI Summit follow-up: drop duplicates and state management

Another stateful operation requiring the state store is drop duplicates. You can use it to deduplicate your streaming data before pushing it to the sink.

Continue Reading →

Drop is a...select

Have you ever wondered what is the relationship between drop and select operations in Apache Spark SQL? If not, I will shed some light on them in this short blog post.

Continue Reading →

Data+AI Summit follow-up: global limit and state management

It's the second follow-up Data+AI Summit post but the first one focusing on the stateful operations and their interaction with the state store.

Continue Reading →

Data+AI follow-up: StateStoreRDD - building block for stateful processing

The main Apache Spark component enabling stateful processing is StateStoreRDD. It creates a partition-based state store instance but also triggers state-based computation.

Continue Reading →

What's new in Apache Spark 3.0 - Kubernetes

I believe Kubernetes is the next big step in the framework after proposing Catalyst Optimizer, modernizing streaming processing with Structured Streaming, and introducing Adaptive Query Execution. Especially that Apache Spark 3 brings a lot of changes in this part!

Continue Reading →

Broadcasting in Structured Streaming

Some time ago @ArunJijo36 mentioned me on Twitter with a question about broadcasting in Structured Streaming. If, like me at this time, you don't know what happens, I think that this article will be good for you 👊

Continue Reading →

What's new in Apache Spark 3.0 - GPU-aware scheduling

GPU-awareness was one of the topics I postponed the most in my Apache Spark 3.0 exploration. But its time has come and in this blog post you will discover what changed in the version 3 of the framework regarding the GPU-based computation.

Continue Reading →

File source and its internals

Few months ago, before the Apache Spark 3.0 features series, you probably noticed a short series about files processing in Structured Streaming. If you enjoyed it, here is a complementary note presenting the file data source :)

Continue Reading →

What's new in Apache Spark 3 - Structured Streaming

Apache Kafka changes in Apache Spark 3.0 was one of the first topics covered in the "what's new" series. Even though there were a lot of changes related to the Kafka source and sink, they're not the single ones in Structured Streaming.

Continue Reading →

What's new in Apache Spark 3.0 - UI changes

Apart from data processing-related changes, Apache Spark 3.0 also brings some changes at the UI level. The interface is supposed to be more intuitive and should help you understand processing logic better!

Continue Reading →

What's new in Apache Spark 3.0 - dynamic partition pruning

There are stories like this, the stories that remain in the backlog for a very long time, and finally, they get implemented. That's exactly what happened with the Dynamic Partition Pruning feature added, after almost 4 years in the backlog, to Apache Spark 3.

Continue Reading →

Broadcast join - complementary notes for local shuffle reader

The Local shuffle reader presented in one of the previous posts might have introduced some doubt in the way how the broadcast join is working. If it's the case, this blog post should shed some light on it. If not, it can give you more in-depth details than the ones introducing this type of join a few years ago.

Continue Reading →

What's new in Apache Spark 3.0 - predicate pushdown support for nested fields

If you noticed that some filter expressions weren't pushed down to your Apache Parquet files, the situation should change in Apache Spark 3.0. The new release supports this feature called nested data predicate pushdown.

Continue Reading →

What's new in Apache Spark 3.0 - delete, update and merge API support

All the operations from the title are natively available in relational databases but doing them with distributed data processing systems is not obvious. Starting from 3.0, Apache Spark gives a possibility to implement them in the data sources.

Continue Reading →

What's new in Apache Spark 3.0 - demote broadcast hash join

It's the last part of the series about the Adaptive Query Execution in Apache Spark SQL. So far you learned about the physical plan optimizations. But they're not alone and you will see that in this blog post.

Continue Reading →

What's new in Apache Spark 3.0 - local shuffle reader

So far you learned about skew optimization and coalesce shuffle partition optimizations made by the Adaptive Query Execution engine. But they're not the single ones and the next one you will discover is also related to the shuffle.

Continue Reading →

What's new in Apache Spark 3.0 - reuse adaptive subquery

Apart from big and complex changes in the Adaptive Query Execution like skews or partitions coalescing, there are also some others, less complex. Although their smaller complexity, it doesn't mean they are not important. Especially when one of these changes offers a reuse of the subqueries.

Continue Reading →

What's new in Apache Spark 3.0 - join skew optimization

Shuffle partitions coalesce is not the single optimization introduced with the Adaptive Query Execution. Another one, addressing maybe one of the most disliked issues in data processing, is joins skew optimization that you will discover in this blog post.

Continue Reading →