Apache Spark SQL articles

What's new in Apache Spark 3.3.0 - Data Source V2

After a break for the Data+AI Summit retrospective, it's time to return to Apache Spark 3.3.0 and see what changed for the DataSource V2 API.

Continue Reading →

What's new in Apache Spark 3.3 - new functions

New Apache SQL functions are a regular position in my "What's new in Apache Spark..." series. Let's see what has changed in the most recent (3.3.0) release!

Continue Reading →

What's new in Apache Spark 3.3 - joins

Joins are probably the most popular operation for combining datasets and Apache Spark supports multiple types of them already! In the new release, the framework got 2 new strategies, the storage-partitioned and row-level runtime filters.

Continue Reading →

Radix and Tim sort

The topic of this blog post is not new because the discussed sort algorithms are there from Apache Spark 2. But it happens that I've never had a chance to present them and today I'll try to do it now.

Continue Reading →

Tables and Apache Spark

If you're like me and haven't had an opportunity to work with Spark on Hive, you're probably as confused as I had been about the tables. Hopefully, after reading this blog post you will understand that concept better!

Continue Reading →

Pluggable Catalog API

Despite working with Apache Spark for a while, I still have some undiscovered components. One of them crossed my path while I was writing the first blog post from the ACID file formats series. The lucky one is the Catalog API.

Continue Reading →

Beware of .withColumn

The .withColumn function is apparently an inoffensive operation, just a way to add or change a column. True, but also hides some points that can even lead to the memory issues and we'll see them in this blog post.

Continue Reading →

Distinct vs group by key difference

I've heard an opinion that using DISTINCT can have a negative impact on big data workloads, and that the queries with GROUP BY were more performant. Is it true for Apache Spark SQL?

Continue Reading →

What's new in Apache Spark 3.2.0 - Apache Parquet and Apache Avro improvements

I still have 2 topics remaining in my "What's new..." backlog. I'd like to share the first of them with you today, and see what changed for Apache Parquet and Apache Avro data sources.

Continue Reading →

What's new in Apache Spark 3.2.0 - performance optimizations

Apache Spark 3.0 extended the static execution engine with a runtime optimization engine called Adaptive Query Execution. It has changed a lot since the very first release and so even in the most recent version! But AQE is not a single performance improvement and I hope you'll see this in the blog post!

Continue Reading →

What's new in Apache Spark 3.2.0 - Data Source V2

Even though Data Source V2 is present in the API for a while, every release brings something new to it. This time too and we'll see what through this blog post!

Continue Reading →

What's new in Apache Spark 3.2.0 - SQL changes

Apache Spark SQL evolves and with each new release, it gets closer to the ANSI standard. The 3.2.0 release is not different and you can find many ANSI-related changes. But not only and hopefully, you'll discover all this in this blog post which has an unusual form because this time, I won't focus on the implementation details.

Continue Reading →

Shuffle reading in Apache Spark SQL - wrapping iterators and beyond

It's time for the 2nd blog post about the shuffle readers. Recently, we discovered how Apache Spark fetches the shuffle blocks from local and remote hosts. Today, I would like to share with you the wrapping iterators. Sounds mysterious? It won't be if we start by looking at the iterators participating in the processing of shuffle block files.

Continue Reading →

Shuffle reading in Apache Spark SQL

So far I've covered the writing part of the shuffle files. You've learned about 3 different shuffle writers, but what happens with their generated files? Who and how reads them? Is the reading an in-memory operation? I will try to answer this and some other questions in this blog post.

Continue Reading →

Apache Spark can be eagerly evaluated too - Commands

Some time ago I participated in an interesting meetup about the MERGE operation in Delta Lake (link in the Further reading section). Jacek Laskowski presented the operation internals and asked an interesting question about the difference between commands and execs. Since I didn't know the answer right away, I decided to explore the commands concepts in this blog post.

Continue Reading →

Join hints in Apache Spark SQL

With the Adaptive Query Execution module, you can have a feeling that Apache Spark will optimize the job for you. In part, yes, because it'll be able to optimize the job based on the runtime parameters you don't necessarily know. However, you also can master the execution, and ones of these mastery tools are hints.

Continue Reading →

Under-the-hood: repartition

Previously we discovered what happens when you coalesce a dataset. To recall, it doesn't involve shuffle operation. It's then the opposite of a repartition operation which is a first class shuffle citizen.

Continue Reading →

What's new in Apache Spark 3.1.1 - new built-in functions

Every Apache Spark release brings not only completely new components but also new native functions. The 3.1.1 is not an exception and it also comes with some new built-in functions!

Continue Reading →

What's new in Apache Spark 3.1 - JDBC (WIP) and DataSource V2 API

Even though the change I will describe in this blog post is still in progress, it's worth attention, especially that I missed the DataSource V2 evolution in my previous blog posts.

Continue Reading →

What's new in Apache Spark 3.1 - predicate pushdown for JSON, CSV and Apache Avro

Predicate pushdown is a data processing technique taking user-defined filters and executing them while reading the data. Apache Spark already supported it for Apache Parquet and RDBMS. Starting from Apache Spark 3.1.1, you can also use them for Apache Avro, JSON and CSV formats!

Continue Reading →