One of the points I wanted to cover during my talk, but for which I didn't have enough time, was the dilemma between using a local deduplication and Apache Spark's dropDuplicates method to avoid integrating duplicated logs. That will be the topic of this post.
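To make the Spark side of that dilemma concrete, here is a minimal sketch with invented column names and log values, showing how dropDuplicates keeps a single row per combination of the listed columns:

```scala
import org.apache.spark.sql.SparkSession

object DropDuplicatesExample extends App {
  val spark = SparkSession.builder()
    .appName("dropDuplicates example").master("local[*]").getOrCreate()
  import spark.implicits._

  // hypothetical log entries; the first one was delivered twice
  val logs = Seq(
    ("user1", "home.html", "2019-10-01 10:00:00"),
    ("user1", "home.html", "2019-10-01 10:00:00"),
    ("user2", "cart.html", "2019-10-01 10:05:00")
  ).toDF("user_id", "page", "event_time")

  // keeps one row per (user_id, page, event_time) combination
  logs.dropDuplicates("user_id", "page", "event_time").show(false)
}
```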
When I was preparing the demo code for my talk about sessionization at Spark AI Summit 2019 in Amsterdam, I wrote the first version of the code with the DataFrame abstraction. I didn't have type safety but the data manipulation was quite clear thanks to the mapping. Later, I tried to rewrite the code with Dataset and I got type safety but sacrificed a little bit of clarity. Let me delve deeper into that in this post.
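To illustrate the trade-off with a simplified sketch (the Visit model below is hypothetical, not the code from the talk): the DataFrame version resolves columns by name at runtime, so a typo surfaces only during execution, while the Dataset version is checked by the compiler:

```scala
import org.apache.spark.sql.SparkSession

// hypothetical visit model, not the one used in the talk
case class Visit(userId: String, page: String)

object DataFrameVsDatasetExample extends App {
  val spark = SparkSession.builder()
    .appName("DataFrame vs Dataset").master("local[*]").getOrCreate()
  import spark.implicits._

  val visits = Seq(Visit("user1", "home.html"), Visit("user2", "cart.html"))

  // DataFrame: the column is resolved by name at runtime
  val userIdsFromDataFrame = visits.toDF()
    .map(row => row.getAs[String]("userId"))

  // Dataset: the field access is verified at compile time
  val userIdsFromDataset = visits.toDS()
    .map(visit => visit.userId)

  userIdsFromDataFrame.show(false)
  userIdsFromDataset.show(false)
}
```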
Exceptions are our daily pain, but exceptions that are hard to explain are even worse. I faced one of them the day I was integrating Apache Spark SQL on EMR.
This new post about Apache Spark SQL will give some hands-on use cases of date functions.
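As a teaser, here is a small example with an invented orders dataset using a few of those functions (to_date, date_add, datediff); the column names and the 7-day delivery delay are assumptions made for the illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, current_date, date_add, datediff, to_date}

object DateFunctionsExample extends App {
  val spark = SparkSession.builder()
    .appName("date functions").master("local[*]").getOrCreate()
  import spark.implicits._

  val orders = Seq(("order1", "2019-10-01"), ("order2", "2019-10-15"))
    .toDF("order_id", "creation_date")

  orders
    .withColumn("creation_date", to_date(col("creation_date"), "yyyy-MM-dd"))
    .withColumn("expected_delivery", date_add(col("creation_date"), 7)) // creation date + 7 days
    .withColumn("age_in_days", datediff(current_date(), col("creation_date")))
    .show(false)
}
```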
I wanted to write this post right after the one about aggregation modes, but I didn't. Before explaining the different aggregation strategies, I prefer to clarify the aggregation internals. It should help you better understand the next part.
At the end of 2018 I published a post about code generation in Apache Spark SQL where I answered the questions about who, when, how and what. But I omitted the "why", and cozos created an issue on my Github asking to complete the article. That's something I will try to do here.
There are 2 popular ways to come to the data engineering field. Either you were a software engineer fascinated by the data domain and its problems (as I was), or you evolved from a BI developer. The big advantage of the latter path is that these people spent a lot of time writing SQL queries, so their knowledge of SQL functions is much better than that of people from the first category. This post is written by an engineer who came to data from software and who discovered that aggregations are not only about simple arithmetic values but also about distributions and collections.
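To show what I mean, here is a minimal sketch with made-up visit data, where the aggregation produces collections (collect_list, collect_set) next to a classic arithmetic value:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, collect_list, collect_set}

object CollectionAggregationsExample extends App {
  val spark = SparkSession.builder()
    .appName("collection aggregations").master("local[*]").getOrCreate()
  import spark.implicits._

  val visits = Seq(
    ("user1", "home.html", 10), ("user1", "cart.html", 25),
    ("user2", "home.html", 5), ("user2", "home.html", 8)
  ).toDF("user_id", "page", "duration_seconds")

  visits.groupBy("user_id").agg(
    avg("duration_seconds").as("avg_duration"),      // classic arithmetic aggregate
    collect_list("page").as("visited_pages"),         // array keeping duplicates
    collect_set("page").as("distinct_visited_pages")  // deduplicated array
  ).show(false)
}
```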
Partitioning is the most popular method to divide a dataset into smaller parts. It's important to know that it can be complemented by another technique called bucketing.
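As a quick illustration of how the two techniques combine, here is a sketch with an invented orders dataset; the table name, schema and number of buckets are assumptions made for the example:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitioningAndBucketingExample extends App {
  val spark = SparkSession.builder()
    .appName("partitioning and bucketing").master("local[*]").getOrCreate()
  import spark.implicits._

  val orders = Seq(
    ("order1", "user1", "2019-10-01"), ("order2", "user2", "2019-10-01"),
    ("order3", "user1", "2019-10-02")
  ).toDF("order_id", "user_id", "creation_date")

  orders.write
    .mode(SaveMode.Overwrite)
    .partitionBy("creation_date") // one directory per creation date
    .bucketBy(4, "user_id")       // 4 buckets per partition, hashed on user_id
    .sortBy("user_id")            // optionally sorts the rows inside every bucket
    // bucketing metadata has to live in the catalog, hence saveAsTable instead of save
    .saveAsTable("bucketed_orders")
}
```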
When I was preparing my talk about Apache Spark customization, I wanted to talk about User Defined Types. After some digging, I saw that there are some UDTs in the source code and one of them is VectorUDT. And it led me to the topic of this post, which is vectorization.
When I was writing the posts about Apache Spark SQL customization through extensions, I found a method to define custom catalog listeners. Since it was my first contact with this feature, I decided to explore it before playing with it.
Last time I presented ANTLR and how Apache Spark SQL uses it to convert textual SQL expressions into internal classes. In this post I will write a custom parser.
I started the series about Apache Spark SQL customization from the last parts of query execution, which are logical and physical plans. But you must know that before the framework generates these plans, it must first parse the query.
Last time I presented the basics of code generation in the physical plans of Apache Spark SQL. This time I will try to write a physical plan executing a UNION operation as a JOIN, without code generation.
I'm very happy when readers comment on my posts or tweets. A lot of those discussions become the topics of new posts. That's the case here, where I try to figure out whether the Apache Spark SQL Avro source is compatible with other applications using this serialization format.
In my previous post, I explained how to implement a custom physical plan execution. However, that first version didn't use generated code, which is also an interesting option for customizing Apache Spark. And it's the feature that I will cover in this post.
If you follow the Apache Spark SQL category on my blog, you can see a lot of posts about customizing this framework. After the recently published posts about custom logical rules, it's time to explore another part: planner strategies.
Several weeks ago, when I was checking new "apache-spark" tagged questions on StackOverflow, I found one that caught my attention. The author was saying that the randomSplit method doesn't divide the dataset equally and that, after merging the splits back, the number of lines was different. Even though I wasn't able to answer at that moment, I decided to investigate this function and find the possible reasons for that error.
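For context, here is a minimal sketch of the method in question, on an artificial dataset; the comments only summarize the behavior investigated in the post:

```scala
import org.apache.spark.sql.SparkSession

object RandomSplitExample extends App {
  val spark = SparkSession.builder()
    .appName("randomSplit example").master("local[*]").getOrCreate()
  import spark.implicits._

  val numbers = (1 to 1000).toDF("number")

  // the weights are normalized, so 0.8/0.2 means roughly 80% and 20% of the rows
  val Array(training, test) = numbers.randomSplit(Array(0.8, 0.2), seed = 1L)

  // the split is probabilistic, so the sizes are only approximately 800 and 200;
  // the splits may also be recomputed between actions, which is where the
  // surprises described in the question can come from
  println(s"training=${training.count()}, test=${test.count()}, total=${numbers.count()}")
}
```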
Some time ago I was involved in a discussion about testing Apache Spark SQL code. In this post, I would like to share my observations about this topic.
In the previous post about Apache Spark SQL custom optimizations, I presented a rule transforming the UNION operator into a JOIN. At that time I only wrote a simple version working with 2 datasets. In this post, I will share its improved version.
The most popular partitioning strategy divides the dataset by a hash computed from one or more values of the record. However, other partitioning strategies exist as well, and one of them is range partitioning, implemented in Apache Spark SQL with the repartitionByRange method described in this post.
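As a preview, here is a small sketch on an artificial dataset showing that repartitionByRange assigns contiguous ranges of the sort key to every partition:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min, spark_partition_id}

object RepartitionByRangeExample extends App {
  val spark = SparkSession.builder()
    .appName("repartitionByRange example").master("local[*]").getOrCreate()
  import spark.implicits._

  val numbers = (1 to 100).toDF("number")

  // every target partition receives a contiguous interval of "number" values
  numbers.repartitionByRange(4, $"number")
    .withColumn("partition_id", spark_partition_id())
    .groupBy("partition_id")
    .agg(min("number").as("min_number"), max("number").as("max_number"))
    .orderBy("partition_id")
    .show(false)
}
```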