Apache Spark SQL articles

Shuffle reading in Apache Spark SQL - wrapping iterators and beyond

It's time for the 2nd blog post about the shuffle readers. Recently, we discovered how Apache Spark fetches the shuffle blocks from local and remote hosts. Today, I would like to share with you the wrapping iterators. Sounds mysterious? It won't be if we start by looking at the iterators participating in the processing of shuffle block files.

Continue Reading →

Shuffle reading in Apache Spark SQL

So far I've covered the writing part of the shuffle files. You've learned about 3 different shuffle writers, but what happens with their generated files? Who and how reads them? Is the reading an in-memory operation? I will try to answer this and some other questions in this blog post.

Continue Reading →

Apache Spark can be eagerly evaluated too - Commands

Some time ago I participated in an interesting meetup about the MERGE operation in Delta Lake (link in the Further reading section). Jacek Laskowski presented the operation internals and asked an interesting question about the difference between commands and execs. Since I didn't know the answer right away, I decided to explore the commands concepts in this blog post.

Continue Reading →

Join hints in Apache Spark SQL

With the Adaptive Query Execution module, you can have a feeling that Apache Spark will optimize the job for you. In part, yes, because it'll be able to optimize the job based on the runtime parameters you don't necessarily know. However, you also can master the execution, and ones of these mastery tools are hints.

Continue Reading →

Under-the-hood: repartition

Previously we discovered what happens when you coalesce a dataset. To recall, it doesn't involve shuffle operation. It's then the opposite of a repartition operation which is a first class shuffle citizen.

Continue Reading →

What's new in Apache Spark 3.1.1 - new built-in functions

Every Apache Spark release brings not only completely new components but also new native functions. The 3.1.1 is not an exception and it also comes with some new built-in functions!

Continue Reading →

What's new in Apache Spark 3.1 - JDBC (WIP) and DataSource V2 API

Even though the change I will describe in this blog post is still in progress, it's worth attention, especially that I missed the DataSource V2 evolution in my previous blog posts.

Continue Reading →

What's new in Apache Spark 3.1 - predicate pushdown for JSON, CSV and Apache Avro

Predicate pushdown is a data processing technique taking user-defined filters and executing them while reading the data. Apache Spark already supported it for Apache Parquet and RDBMS. Starting from Apache Spark 3.1.1, you can also use them for Apache Avro, JSON and CSV formats!

Continue Reading →

What's new in Apache Spark 3.1 - join evolutions

I have waited for writing this blog post since the Data+AI Summit 2020, where Cheng Su presented the ongoing effort to improve shuffle and stream-to-stream joins in Apache Spark 3.1. And in this blog post, I will start by sharing what changed for the joins in the new release of the framework!

Continue Reading →

Under-the-hood: coalesce

That's probably one of the most common questions you may have heard in preliminary job interviews. What's the difference between coalesce and repartition? Many answers exist, but instead of repeating them, I will try to dig a bit deeper in this blog post and see how the coalesce works.

Continue Reading →

Stack operation in Apache Spark SQL

Pivot operation presented 2 weeks ago transforms some cells into columns. The reverse one is called stack and it's time to see how it works!

Continue Reading →

Spark SQL pivot table

If you came to data engineering after having a BI career, you certainly know what the pivot is. It was not my case and was quite amazed by this operation that transforms values from rows into columns. If you want to understand how it's possible, this article will present some internals of pivoting data in Apache Spark.

Continue Reading →

Apache Spark performance tips - look at your code

Very often you will find Apache Spark performance tips related to the hardware (memory, GC) or the configuration parameters (shuffle partitions number, broadcast join threshold). But they're not the single ones you can implement. Moreover, IMO, you should start by the ones presented in this article and optimize your pipeline code before going into more complicated hardware and configuration tuning.

Continue Reading →

Partition-wise joins and Apache Spark SQL

Apache Spark has this great capacity to optimize joins of bucketed tables but does it work on partitions as well? No, and to understand why, I invite you to read the following sections of this blog post ?

Continue Reading →

Drop is a...select

Have you ever wondered what is the relationship between drop and select operations in Apache Spark SQL? If not, I will shed some light on them in this short blog post.

Continue Reading →

What's new in Apache Spark 3.0 - dynamic partition pruning

There are stories like this, the stories that remain in the backlog for a very long time, and finally, they get implemented. That's exactly what happened with the Dynamic Partition Pruning feature added, after almost 4 years in the backlog, to Apache Spark 3.

Continue Reading →

Broadcast join - complementary notes for local shuffle reader

The Local shuffle reader presented in one of the previous posts might have introduced some doubt in the way how the broadcast join is working. If it's the case, this blog post should shed some light on it. If not, it can give you more in-depth details than the ones introducing this type of join a few years ago.

Continue Reading →

What's new in Apache Spark 3.0 - predicate pushdown support for nested fields

If you noticed that some filter expressions weren't pushed down to your Apache Parquet files, the situation should change in Apache Spark 3.0. The new release supports this feature called nested data predicate pushdown.

Continue Reading →

What's new in Apache Spark 3.0 - delete, update and merge API support

All the operations from the title are natively available in relational databases but doing them with distributed data processing systems is not obvious. Starting from 3.0, Apache Spark gives a possibility to implement them in the data sources.

Continue Reading →

What's new in Apache Spark 3.0 - demote broadcast hash join

It's the last part of the series about the Adaptive Query Execution in Apache Spark SQL. So far you learned about the physical plan optimizations. But they're not alone and you will see that in this blog post.

Continue Reading →