Apache Spark 3.0.0 features articles

October 31, 2020 • Apache Spark

What's new in Apache Spark 3.0 - Kubernetes

I believe Kubernetes support is the next big step for the framework, after the introduction of the Catalyst Optimizer, the modernization of stream processing with Structured Streaming, and the arrival of Adaptive Query Execution. All the more so because Apache Spark 3 brings a lot of changes in this area!

Continue Reading →

October 24, 2020 • Apache Spark

What's new in Apache Spark 3.0 - GPU-aware scheduling

GPU-awareness was one of the topics I postponed the most in my Apache Spark 3.0 exploration. But its time has come, and in this blog post you will discover what changed in version 3 of the framework regarding GPU-based computation.

Continue Reading →

October 17, 2020 • Apache Spark Structured Streaming

What's new in Apache Spark 3 - Structured Streaming

The Apache Kafka changes in Apache Spark 3.0 were one of the first topics covered in the "what's new" series. Even though there were a lot of changes related to the Kafka source and sink, they're not the only ones in Structured Streaming.

Continue Reading →

October 10, 2020 • Apache Spark

What's new in Apache Spark 3.0 - UI changes

Apart from data processing-related changes, Apache Spark 3.0 also brings some changes at the UI level. The interface is supposed to be more intuitive and should help you understand processing logic better!

Continue Reading →

October 3, 2020 • Apache Spark SQL

What's new in Apache Spark 3.0 - dynamic partition pruning

Some stories remain in the backlog for a very long time before they finally get implemented. That's exactly what happened with the Dynamic Partition Pruning feature, added to Apache Spark 3 after almost 4 years in the backlog.

Continue Reading →

September 19, 2020 • Apache Spark SQL

What's new in Apache Spark 3.0 - predicate pushdown support for nested fields

If you noticed that some filter expressions on nested fields weren't pushed down to your Apache Parquet files, the situation should change in Apache Spark 3.0. The new release adds a feature called nested data predicate pushdown.

Continue Reading →

September 12, 2020 • Apache Spark SQL

What's new in Apache Spark 3.0 - delete, update and merge API support

All the operations from the title are natively available in relational databases, but performing them in distributed data processing systems is not straightforward. Starting with 3.0, Apache Spark makes it possible to implement them in data sources.

Continue Reading →

September 5, 2020 • Apache Spark SQL

What's new in Apache Spark 3.0 - demote broadcast hash join

This is the last part of the series about Adaptive Query Execution in Apache Spark SQL. So far you have learned about the physical plan optimizations, but they're not the only ones, as you will see in this blog post.

Continue Reading →

August 29, 2020 • Apache Spark SQL

What's new in Apache Spark 3.0 - local shuffle reader

So far you have learned about the skew and shuffle partition coalesce optimizations made by the Adaptive Query Execution engine. But they're not the only ones, and the next one you will discover is also related to the shuffle.

Continue Reading →

August 22, 2020 • Apache Spark SQL

What's new in Apache Spark 3.0 - reuse adaptive subquery

Apart from the big and complex changes in Adaptive Query Execution, like skew handling or partition coalescing, there are also some simpler ones. Their lower complexity doesn't mean they are unimportant, especially when one of them enables the reuse of subqueries.

Continue Reading →

August 15, 2020 • Apache Spark SQL

What's new in Apache Spark 3.0 - join skew optimization

Shuffle partition coalescing is not the only optimization introduced with Adaptive Query Execution. Another one, addressing perhaps one of the most disliked issues in data processing, is the join skew optimization that you will discover in this blog post.

Continue Reading →

August 8, 2020 • Apache Spark SQL

What's new in Apache Spark 3.0 - shuffle partitions coalesce

In my previous blog post you could learn about the Adaptive Query Execution improvement added to Apache Spark 3.0. At that point, you learned only about the general execution flow of adaptive queries. Today it's time to see one of the optimizations that can happen during that flow: shuffle partition coalescing.

Continue Reading →

August 1, 2020 • Apache Spark

What's new in Apache Spark 3.0 - shuffle service changes

One of the components that makes Apache Spark hard to scale is the shuffle. Fortunately, the community is well on its way to overcoming this limitation, and the new release of the framework brings some important improvements in this area.

Continue Reading →

July 25, 2020 • Apache Spark SQL

What's new in Apache Spark 3.0 - Adaptive Query Execution

A query adapting to the data characteristics discovered one by one at runtime? Yes, in Apache Spark 3.0 it's possible thanks to Adaptive Query Execution!

Continue Reading →

July 18, 2020 • Apache Spark SQL

What's new in Apache Spark 3.0 - PostgreSQL feature parity

Apart from date and time management, another big feature of Apache Spark 3.0 is the work on PostgreSQL feature parity, which is the topic of my new article in the series.

Continue Reading →

July 11, 2020 • Apache Spark Structured Streaming

What's new in Apache Spark 3.0 - Apache Kafka integration improvements

After the previous presentations of the new date time and SQL function features in Apache Spark 3.0, it's time to see what's new on the streaming side, in the Structured Streaming module and, more precisely, in its Apache Kafka integration.

Continue Reading →

July 4, 2020 • Apache Spark SQL

What's new in Apache Spark 3.0 - binary data source

I remember my first days with Apache Spark and the analysis of the available RDD data sources. Since then, I have used a lot of them, except the binary data source, which is newly implemented in Apache Spark SQL in release 3.0.

Continue Reading →

June 27, 2020 • Apache Spark SQL

What's new in Apache Spark 3.0 - new SQL functions

After date time management, it's time to see another important feature of Apache Spark 3.0: the new SQL functions.

Continue Reading →

June 20, 2020 • Apache Spark SQL

What's new in Apache Spark 3.0 - Proleptic Calendar and date time management

When I was writing my blog post about datetime conversion in Apache Spark 2.4, I wanted to check something on Apache Spark's GitHub. To my surprise, the code had nothing in common with the code I was analyzing locally. And that's how I discovered the first change in Apache Spark 3.0, the first of several that I will cover in a new series, "What's new in Apache Spark 3.0".

Continue Reading →
