Articles about Apache Spark SQL on waitingforcode.com - articles for the pleasure of learning and discovery

December 20, 2021 • Apache Spark SQL

What's new in Apache Spark 3.2.0 - Apache Parquet and Apache Avro improvements

I still have 2 topics remaining in my "What's new..." backlog. I'd like to share the first of them with you today, and see what changed for Apache Parquet and Apache Avro data sources.

Continue Reading →

December 11, 2021 • Apache Spark SQL

What's new in Apache Spark 3.2.0 - performance optimizations

Apache Spark 3.0 extended the static execution engine with a runtime optimization engine called Adaptive Query Execution. It has changed a lot since the very first release and so even in the most recent version! But AQE is not a single performance improvement and I hope you'll see this in the blog post!

Continue Reading →

November 27, 2021 • Apache Spark SQL

What's new in Apache Spark 3.2.0 - Data Source V2

Even though Data Source V2 is present in the API for a while, every release brings something new to it. This time too and we'll see what through this blog post!

Continue Reading →

November 13, 2021 • Apache Spark SQL

What's new in Apache Spark 3.2.0 - SQL changes

Apache Spark SQL evolves and with each new release, it gets closer to the ANSI standard. The 3.2.0 release is not different and you can find many ANSI-related changes. But not only and hopefully, you'll discover all this in this blog post which has an unusual form because this time, I won't focus on the implementation details.

Continue Reading →

August 21, 2021 • Apache Spark SQL

Shuffle reading in Apache Spark SQL - wrapping iterators and beyond

It's time for the 2nd blog post about the shuffle readers. Recently, we discovered how Apache Spark fetches the shuffle blocks from local and remote hosts. Today, I would like to share with you the wrapping iterators. Sounds mysterious? It won't be if we start by looking at the iterators participating in the processing of shuffle block files.

Continue Reading →

August 14, 2021 • Apache Spark SQL

Shuffle reading in Apache Spark SQL

So far I've covered the writing part of the shuffle files. You've learned about 3 different shuffle writers, but what happens with their generated files? Who and how reads them? Is the reading an in-memory operation? I will try to answer this and some other questions in this blog post.

Continue Reading →

August 7, 2021 • Apache Spark SQL

Apache Spark can be eagerly evaluated too - Commands

Some time ago I participated in an interesting meetup about the MERGE operation in Delta Lake (link in the Further reading section). Jacek Laskowski presented the operation internals and asked an interesting question about the difference between commands and execs. Since I didn't know the answer right away, I decided to explore the commands concepts in this blog post.

Continue Reading →

July 24, 2021 • Apache Spark SQL

Join hints in Apache Spark SQL

With the Adaptive Query Execution module, you can have a feeling that Apache Spark will optimize the job for you. In part, yes, because it'll be able to optimize the job based on the runtime parameters you don't necessarily know. However, you also can master the execution, and ones of these mastery tools are hints.

Continue Reading →

May 29, 2021 • Apache Spark SQL

Under-the-hood: repartition

Previously we discovered what happens when you coalesce a dataset. To recall, it doesn't involve shuffle operation. It's then the opposite of a repartition operation which is a first class shuffle citizen.

Continue Reading →

May 8, 2021 • Apache Spark SQL

What's new in Apache Spark 3.1.1 - new built-in functions

Every Apache Spark release brings not only completely new components but also new native functions. The 3.1.1 is not an exception and it also comes with some new built-in functions!

Continue Reading →

May 1, 2021 • Apache Spark SQL

What's new in Apache Spark 3.1 - JDBC (WIP) and DataSource V2 API

Even though the change I will describe in this blog post is still in progress, it's worth attention, especially that I missed the DataSource V2 evolution in my previous blog posts.

Continue Reading →

April 10, 2021 • Apache Spark SQL

What's new in Apache Spark 3.1 - predicate pushdown for JSON, CSV and Apache Avro

Predicate pushdown is a data processing technique taking user-defined filters and executing them while reading the data. Apache Spark already supported it for Apache Parquet and RDBMS. Starting from Apache Spark 3.1.1, you can also use them for Apache Avro, JSON and CSV formats!

Continue Reading →

March 14, 2021 • Apache Spark SQL

What's new in Apache Spark 3.1 - join evolutions

I have waited for writing this blog post since the Data+AI Summit 2020, where Cheng Su presented the ongoing effort to improve shuffle and stream-to-stream joins in Apache Spark 3.1. And in this blog post, I will start by sharing what changed for the joins in the new release of the framework!

Continue Reading →

March 6, 2021 • Apache Spark SQL

Under-the-hood: coalesce

That's probably one of the most common questions you may have heard in preliminary job interviews. What's the difference between coalesce and repartition? Many answers exist, but instead of repeating them, I will try to dig a bit deeper in this blog post and see how the coalesce works.

Continue Reading →

February 27, 2021 • Apache Spark SQL

Stack operation in Apache Spark SQL

Pivot operation presented 2 weeks ago transforms some cells into columns. The reverse one is called stack and it's time to see how it works!

Continue Reading →

February 13, 2021 • Apache Spark SQL

Spark SQL pivot table

If you came to data engineering after having a BI career, you certainly know what the pivot is. It was not my case and was quite amazed by this operation that transforms values from rows into columns. If you want to understand how it's possible, this article will present some internals of pivoting data in Apache Spark.

Continue Reading →

February 6, 2021 • Apache Spark SQL

Apache Spark performance tips - look at your code

Very often you will find Apache Spark performance tips related to the hardware (memory, GC) or the configuration parameters (shuffle partitions number, broadcast join threshold). But they're not the single ones you can implement. Moreover, IMO, you should start by the ones presented in this article and optimize your pipeline code before going into more complicated hardware and configuration tuning.

Continue Reading →

November 29, 2020 • Apache Spark SQL

Partition-wise joins and Apache Spark SQL

Apache Spark has this great capacity to optimize joins of bucketed tables but does it work on partitions as well? No, and to understand why, I invite you to read the following sections of this blog post ?

Continue Reading →

November 15, 2020 • Apache Spark SQL

Drop is a...select

Have you ever wondered what is the relationship between drop and select operations in Apache Spark SQL? If not, I will shed some light on them in this short blog post.

Continue Reading →

October 3, 2020 • Apache Spark SQL

What's new in Apache Spark 3.0 - dynamic partition pruning

There are stories like this, the stories that remain in the backlog for a very long time, and finally, they get implemented. That's exactly what happened with the Dynamic Partition Pruning feature added, after almost 4 years in the backlog, to Apache Spark 3.

Continue Reading →

Apache Spark SQL articles