Apache Spark SQL articles

Range partitioning in Apache Spark SQL

The most popular partitioning strategy divides the dataset by a hash computed from one or more values of the record. Other partitioning strategies exist as well, though. One of them is range partitioning, implemented in Apache Spark SQL with the repartitionByRange method and described in this post.

Writing Apache Spark SQL custom logical optimization - the first version

Last time I wrote about the different hints present in RDBMS and Hive. Today it's time to implement one of them.

Writing Apache Spark SQL custom logical optimization - unsupported optimization hints

After the 2 previous posts dedicated to custom optimization in Apache Spark SQL, it's a good moment to start writing the code. As Jacek Laskowski suggested on Twitter (link in Read more), I will try to implement one extra optimization hint. But first things first: let's start with the definition of hints.

Regression tests with Apache Spark SQL joins

Regressions are one of the risks of our profession. Fortunately, we can limit that risk thanks to different testing strategies. One of them is regression testing, which we can use to check whether the modified data processing logic introduced regressions, simply by comparing two datasets.

Writing Apache Spark SQL custom logical optimization - API

In one of my previous posts I presented how to add a custom optimization to Apache Spark SQL. It was not a good moment to delve deeply into the topic because of its complexity. That's why I will try to do a better job here by showing the API of native optimizations.

Apache Spark SQL and types resolution in semi-structured data

One of the goals of data governance is to ensure data consistency across different producers. Unfortunately, very often this holds only in theory, especially when the data format is schemaless. That's why data exploration is an important step in defining a data pipeline. In this post I wanted to do a small exercise and check how Apache Spark SQL behaves with inconsistent data.

Apache Spark 2.4.0 features - EXCEPT ALL and INTERSECT ALL

Apache Spark 2.4.0 brought a lot of internal changes but also some new features exposed to end users, such as the higher-order functions already presented. In this post, I will present another new feature, or rather 2 actually, because I will talk about 2 new SQL set operators.

Monotonically increasing id function in Apache Spark SQL

Some time ago I was wondering whether Apache Spark supports auto-incremented values, which are so hard to implement in distributed environments. After some research, I almost found what I was looking for - monotonically increasing id. In this post, I will try to explain why "almost".

Introduction to custom optimization in Apache Spark SQL

In November 2018 bithw1 pointed out to me a feature that I hadn't used yet in Apache Spark - custom optimization. After some months devoted to learning Apache Spark GraphX, I finally found a moment to explore it. This post begins a new series about Apache Spark customization and it covers the basics, i.e. the 2 available methods to add custom optimizations.

Apache Spark 2.4.0 features - Avro data source

Apache Avro became one of the serialization standards, among other reasons because of its use in Apache Kafka's schema registry. Previously, to work with Avro files in Apache Spark, we needed Databricks' external package. That is no longer the case starting from the 2.4.0 release, where Avro became a first-class data source.

Apache Spark 2.4.0 features - array and higher-order functions

The series about the features introduced in Apache Spark 2.4.0 continues. Today's post will cover higher-order functions that you may know from elsewhere.

Apache Spark 2.4.0 features - bucket pruning

This post begins a new series dedicated to Apache Spark 2.4.0 features. The first topic covered will be bucket pruning.

Defining schemas in Apache Spark SQL with builder design pattern

Schemas are one of the key parts of Apache Spark SQL and what distinguishes it from the old RDD-based API. When we deal with data coming from a structured data source, such as a relational database or a schema-based file format, we can let the framework resolve the schema for us. But things get complicated when we're working with semi-structured data such as JSON and we must define the schema by hand.

The who, when, how and what of Apache Spark SQL code generation

The code generated by Apache Spark for all the queries defined with higher-level concepts such as SQL queries is the key to understanding the performance of the processing logic. This post, started after a discussion on my GitHub, tries to explain some of the basics of the code generation workflow.

DataFrame and file bigger than available memory

Some weeks ago I wrote a post about the impact of files with long lines on RDD-based processing. The post revealed how difficult it is to process such files because of OOM errors. In this post I wanted to check how this applies to Datasets.

Multiple SparkSession for one SparkContext

Some months ago bithw1 posted an interesting question on my GitHub about multiple SparkSessions sharing the same SparkContext. If you have similar questions, feel free to ask - maybe it will give birth to a more detailed post that adds some value for the community. This post, at least, tries to do so by answering the question.

Apache Spark SQL, Hive and insertInto command

Some time ago on my GitHub bithw1 pointed out an interesting behavior of the Hive integration in Apache Spark SQL. Without delving too much into details now, I can say that the behavior concerned a DataFrame schema that was not respected. Our quick exchange ended with an explanation, but it also encouraged me to go into much more detail to understand the hows and whys.

Grouping sets in Apache Spark SQL

Apache Spark SQL provides advanced analytics features that we can find in more classical OLAP-based workloads. Below I'll explain one of them.

Custom projection pushdown in Apache Spark SQL for JSON columns

Most RDBMS are able to store JSON documents in columns of a JSON-like type. One of them is PostgreSQL, which can keep JSON documents in one of 2 column types (JSON or JSONB) and natively supports querying JSON document attributes. As we'll see below, with a little bit of effort we can implement similar behavior in Apache Spark SQL.

Dealing with nested data in Apache Spark SQL

Nested data structures are very useful for data denormalization in Big Data. They avoid the joins that we would otherwise need across several related and fully normalized datasets. But processing such data structures is not always simple. Fortunately, Apache Spark SQL provides different utility functions that help to work with them.
