Apache Spark SQL articles

November 14, 2018 • Apache Spark SQL

Apache Spark SQL, Hive and insertInto command

Some time ago, on my GitHub, bithw1 pointed out an interesting behavior of the Hive integration in Apache Spark SQL. Without going into too much detail here, the behavior concerned a DataFrame schema that was not respected. Our quick exchange ended with an explanation, but it also encouraged me to dig deeper to understand the hows and whys.

November 7, 2018 • Apache Spark SQL

Grouping sets in Apache Spark SQL

Apache Spark SQL provides advanced analytics features usually found in more classical OLAP workloads. Below I'll explain one of them.

October 3, 2018 • Apache Spark SQL

Custom projection pushdown in Apache Spark SQL for JSON columns

Most RDBMSs can store JSON documents in columns of a JSON-like type. One of them is PostgreSQL, which can keep JSON in one of two column types (JSON or JSONB) and natively supports querying JSON document attributes. As we'll see below, with a little effort we can implement similar behavior in Apache Spark SQL.

September 23, 2018 • Apache Spark SQL

Dealing with nested data in Apache Spark SQL

Nested data structures are very useful for denormalizing data in Big Data workloads. They avoid the joins that we would otherwise need across several related, fully normalized datasets. But processing such structures is not always simple. Fortunately, Apache Spark SQL provides various utility functions that help to work with them.

September 23, 2018 • Apache Spark SQL

Apache Spark and window functions

One of the previous posts in the SQL category presented window functions, which can be used to compute values over groups of rows. These analytic functions are also available in Apache Spark SQL.

August 18, 2018 • Apache Spark SQL

RDBMS options in Apache Spark SQL

Some recent posts covered important Spark SQL options for RDBMS: partitioning and write modes. However, they're not the only ones available for this type of data store.

August 11, 2018 • Apache Spark SQL

SaveMode.Overwrite trap with RDBMS in Apache Spark SQL

Some months ago I presented save modes in Spark SQL. However, that post was limited to their use with files. I was quite surprised to observe some specific behavior with RDBMS sinks, especially for SaveMode.Overwrite.

June 24, 2018 • Apache Spark SQL

Correlated scalar subqueries in Apache Spark SQL

Some weeks ago I presented correlated scalar subqueries using PostgreSQL as an example. However, they can also be found in Big Data processing systems such as BigQuery or Apache Spark SQL.

June 15, 2018 • Apache Spark SQL

Nested loop join in Apache Spark SQL

In programming, simple is often synonymous with understandable and maintainable. However, it doesn't always mean efficient. One example of this is the nested loop join, which is also present in Apache Spark SQL.

December 3, 2017 • Apache Spark SQL

Spark SQL Cost-Based Optimizer

Prior to the Spark 2.2.0 release, data processing was based on a set of heuristic rules that ignored the characteristics of the data. But the 2.2.0 release brought a tool well known from the RDBMS world: a Cost-Based Optimizer.

October 29, 2017 • Apache Spark SQL

Save modes in Spark SQL

A DataFrame can be both loaded and saved. And Spark SQL provides, as for many other aspects, different strategies to deal with data persistence.

October 22, 2017 • Apache Spark SQL

Spark SQL operator optimizations - part 2

It's time to continue the exploration of operator optimizations of logical plans in Spark SQL. After the first part describing optimizations from A to L, this post covers the remaining letters.

October 14, 2017 • Apache Spark SQL

Spark SQL operator optimizations - part 1

Predicate pushdown is one of the most popular optimizations in Spark SQL. But it's not the only one; the main list is defined in the org.apache.spark.sql.catalyst.optimizer.Optimizer abstract class.

October 8, 2017 • Apache Spark SQL

User Defined Aggregate Functions

User Defined Functions are not the only way to extend Spark SQL. A second option is offered by User Defined Aggregate Functions.

October 1, 2017 • Apache Spark SQL

Spark SQL statistics

Spark SQL has a lot of "hidden" features that make it an efficient processing tool. One of them is statistics.

September 24, 2017 • Apache Spark SQL

Predicate pushdown in Spark SQL

The optimizer in Spark SQL helps to improve the performance of processing pipelines. One of its techniques is predicate pushdown.

September 3, 2017 • Apache Spark SQL

Chain of responsibility design pattern in Spark SQL UDF

The chain of responsibility design pattern is one of my favorite alternatives for avoiding too many nested calls. Some days ago I was wondering if it could be used instead of nested calls of multiple UDFs applied at the column level in Spark SQL. And the answer was yes.

August 26, 2017 • Apache Spark SQL

User Defined Functions

User Defined Types (UDT), described in one of the previous posts, aren't the only customization possibility in Apache Spark SQL. Another possibility is User Defined Functions (UDF).

August 20, 2017 • Apache Spark SQL

Fetchsize in Spark SQL

Reading from an RDBMS in Spark SQL is based on classic JDBC drivers. Thus it supports some of their options, such as fetchsize, described in the sections below.

August 12, 2017 • Apache Spark SQL

Sort-merge join in Spark SQL

After discovering two methods used to join DataFrames, broadcast and hashing, it's time to talk about the third possibility: sort-merge join.
