Apache Spark SQL articles

What's new in Apache Spark 3.0 - Adaptive Query Execution

A query adapting to the data characteristics discovered one-by-one at runtime? Yes, in Apache Spark 3.0 it's possible thanks to the Adaptive Query Execution!

Continue Reading →

What's new in Apache Spark 3.0 - PostgreSQL feature parity

Apart from the date and time management, another big feature of Apache Spark 3.0 is the work on the PostgreSQL feature parity, that will be the topic of my new article from the series.

Continue Reading →

What's new in Apache Spark 3.0 - binary data source

I remember my first days with Apache Spark and the analysis of available RDD data sources. Since then, I have used a lot of them, except the binary data which is a new implemented part in Apache Spark SQL in the release 3.0.

Continue Reading →

What's new in Apache Spark 3.0 - new SQL functions

After date time management, it's time to see another important feature of Apache Spark 3.0, he new SQL functions.

Continue Reading →

Ignoring files issues in Apache Spark SQL

I have to consider myself as a lucky guy since I've never had to deal with incorrectly formatted files. However, that's not the case of everyone. Hopefully, Apache Spark comes with few configuration options to manage that.

Continue Reading →

What's new in Apache Spark 3.0 - Proleptic Calendar and date time management

When I was writing my blog post about datetime conversion in Apache Spark 2.4, I wanted to check something on Apache Spark's Github. To my surprise, the code had nothing in common with the code I was analyzing locally. And that's how I discovered the first change in Apache Spark 3.0. The first among few others that I will cover in a new series "What's new in Apache Spark 3.0".

Continue Reading →

Unions in Apache Spark SQL

You have 2 different datasets and want to process them as a single unit? Maybe you have some legacy data that you need to process alongside the brand new dataset? JOIN is not an option because the goal is to build a single processing unit and not combine the rows. UNION operation can be a good fit for that.

Continue Reading →

Apache Spark SQL partitionBy - shuffle or not to shuffle?

I remember my first time with partitionBy method. I was reading data from an Apache Kafka topic and writing it into hourly-based partitioned directories. To my surprise, Apache Spark was generating always 1 file and my first thought... oh, it's shuffling the data. But I was wrong and in this post will explain why.

Continue Reading →

Idempotent file generation in Apache Spark SQL

Some time ago I was thinking how to partition the data and ensure that we can reprocess it easily. Overwrite mode was not an option since the data of one partition could be generated by 2 different batch executions. That's why I started to think about implementing an idempotent file output generator and, therefore, discover file sink internals in practice.

Continue Reading →

Apache Spark's _SUCESS anatomy

_SUCCESS file generated by Apache Spark SQL when you successfully generate a dataset, is often a big question for newcomers. Why does the framework need this file? How is it generated? I will cover these aspects in this article.

Continue Reading →

sortWithinPartitions in Apache Spark SQL

Few weeks ago when I was preparing a talk for one local meetup, I wanted to list the most common operations we can do with Spark for the newcomers. And I found one I haven't used before, namely sortWithinPartitions.

Continue Reading →

Reorder JOIN optimizer - star schema

I didn't know that join reordering is quite interesting, though complex, topic in Apache Spark SQL. The queries not only can be transformed into the ones using JOIN ... ON clauses. They can also be reordered accordingly to the star schema which we'll try to see in this post.

Continue Reading →

Reorder JOIN optimizer - cost-based optimization

In my previous post I explained how Apache Spark can reorder JOINs based on the logical plan. Today I'll focus on another aspect of reordering which uses cost estimation for the proposed plans.

Continue Reading →

Reorder JOIN optimizer

One of the reasons why I like my blogging activity is that from time to time the exchange is bidirectional. It happens mostly on Github but also on the comments under the post and I appreciate the situation when I don't know the answer and must dig a little to explain it in a blog post :) I wrote this one thanks to bithw1 issue created on my Spark playground repository (thank you for another interesting question btw :)).

Continue Reading →

Schema case sensitivity for JSON source in Apache Spark SQL

On the one hand, I appreciate JSON for its flexibility but also from the other one, I hate it for exactly the same thing. It's particularly painful when you work on a project without good data governance. The most popular pain is an inconsistent field type - Spark can manage that by getting the most common type. Unfortunately, it's a little bit trickier for less common problems, for instance when a same field has different case sensitivity.

Continue Reading →

Apache Spark and line-based data sources

Under one of my posts I got an interesting question about ignoring maxPartitionBytes configuration entry by Apache Spark for text-based data sources. In this post I will try to answer it.

Continue Reading →

Implicit datetime conversion in Apache Spark SQL

If you've ever wondered why when you write "2019-05-10T20:00", Apache Spark considers it as a timestamp field? The fact of defining it as a TimestampType is one of the reasons, but another question here is, how Apache Spark does the conversion from a string into the timestamp type? I will give you some hints in this blog post.

Continue Reading →

Local deduplication or dropDuplicates?

One of the points I wanted to cover during my talk but for which I haven't enough time, was the dilemma about using a local deduplication or Apache Spark's dropDuplicates method to not integrate duplicated logs. That will be the topic of this post.

Continue Reading →

DataFrame or Dataset to solve sessionization problem?

When I was preparing the demo code for my talk about sessionization at Spark AI Summit 2019 in Amsterdam, I wrote my first version of code with DataFrame abstraction. I hadn't type safety but the data manipulation was quite clear thanks to the mapping. Later, I tried to rewrite the code with Dataset and I got type safety but sacrificed a little bit of clarity. Let me deep delve into that in this post.

Continue Reading →

FileAlreadyExistsException at task retry on EMR

The exceptions are our daily pain but the exceptions hard to explain are more than that. I faced one of them one day when I was integrating Apache Spark SQL on EMR.

Continue Reading →