Apache Avro has become one of the serialization standards, among other reasons because of its use in Apache Kafka's schema registry. Previously, to work with Avro files in Apache Spark we needed Databricks' external package. This is no longer the case since the 2.4.0 release, where Avro became a first-class data source.
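To give a quick idea, here is a minimal sketch of reading and writing Avro files with the built-in data source (the paths are only examples and in 2.4.0 the spark-avro module still has to be added to the classpath, e.g. with --packages org.apache.spark:spark-avro_2.11:2.4.0):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Avro data source")
  .master("local[*]")
  .getOrCreate()

// read Avro files with the built-in "avro" format - no Databricks package needed anymore
val orders = spark.read.format("avro").load("/tmp/orders_avro")

// write the DataFrame back as Avro
orders.write.format("avro").save("/tmp/orders_avro_copy")
```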
The series about the features introduced in Apache Spark 2.4.0 continues. Today's post covers higher-order functions, which you may already know from elsewhere.
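As a teaser, here is a minimal sketch of a few of them applied to an array column through the SQL API (the values and aliases are purely illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Higher-order functions")
  .master("local[*]")
  .getOrCreate()

// transform, filter and aggregate take a lambda and apply it to every array element
spark.sql(
  """SELECT
    |  transform(array(1, 2, 3), x -> x * 10) AS multiplied,
    |  filter(array(1, 2, 3), x -> x > 1) AS greater_than_one,
    |  aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x) AS summed""".stripMargin
).show(false)
```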
This post begins a new series dedicated to Apache Spark 2.4.0 features. The first topic covered will be bucket pruning.
Schemas are one of the key parts of Apache Spark SQL and one of its distinguishing points from the old RDD-based API. When we deal with data coming from a structured data source, such as a relational database or a schema-based file format, we can let the framework resolve the schema for us. But things get more complicated when we're working with semi-structured data such as JSON, where we must define the schema by hand.
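To illustrate the last point, a minimal sketch of a schema declared by hand for a JSON input (the fields and the path are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("Manual schema")
  .master("local[*]")
  .getOrCreate()

// the schema we would otherwise ask Spark to infer with an extra pass over the data
val userSchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("login", StringType, nullable = true),
  StructField("address", StructType(Seq(
    StructField("city", StringType, nullable = true),
    StructField("country", StringType, nullable = true)
  )), nullable = true)
))

val users = spark.read.schema(userSchema).json("/tmp/users.json")
users.printSchema()
```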
The code generated by Apache Spark for queries defined with higher-level concepts, such as SQL queries, is key to understanding the performance of the processing logic. This post, started after a discussion on my Github, tries to explain some basics of the code generation workflow.
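If you want to look at this generated code yourself, here is a minimal sketch using the debug helpers shipped with Spark (the query itself is arbitrary):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._  // brings debugCodegen() into scope

val spark = SparkSession.builder()
  .appName("Codegen inspection")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val query = (1 to 100).toDF("nr")
  .filter($"nr" > 50)
  .selectExpr("nr * 2 AS doubled")

// prints the Java code produced by whole-stage code generation for this query
query.debugCodegen()
```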
Some weeks ago I wrote a post about the impact of files with long lines on RDD-based processing. The post revealed how difficult processing such files is because of OOM errors. In this post I wanted to check how this applies to Datasets.
Some months ago bithw1 posted an interesting question on my Github about multiple SparkSessions sharing the same SparkContext. If you have similar questions, feel free to ask - maybe it will give birth to a more detailed post adding some more value to the community. This post, at least, tries to do so by answering the question.
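To set the scene, a minimal sketch showing how 2 sessions created in the same application share one SparkContext:

```scala
import org.apache.spark.sql.SparkSession

val firstSession = SparkSession.builder()
  .appName("Multiple sessions")
  .master("local[*]")
  .getOrCreate()

// newSession() gives an isolated SQL configuration, UDFs and temporary views...
val secondSession = firstSession.newSession()

// ...but both sessions are backed by the very same SparkContext
assert(firstSession.sparkContext == secondSession.sparkContext)
```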
Some time ago on my Github bithw1 pointed out an interesting behavior of the Hive integration in Apache Spark SQL. Without delving too much into details now, I can say that the behavior concerned a DataFrame schema that was not respected. Our quick exchange ended with an explanation, but it also encouraged me to dig much deeper to understand the hows and whys.
Apache Spark SQL provides advanced analytics features that we can find in more classical OLAP-based workloads. Below I'll explain one of them.
Most RDBMS are able to store JSON documents in columns of a JSON-like type. One of them is PostgreSQL, which can keep JSON in one of two column types (JSON or JSONB) and natively enables querying JSON document attributes. As we'll see below, with a little bit of effort we can implement similar behavior in Apache Spark SQL.
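As a first taste, a minimal sketch based on the built-in JSON functions (the document and its attributes are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, get_json_object}
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder()
  .appName("JSON attributes")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// a column storing raw JSON documents, much like a PostgreSQL JSON/JSONB column
val documents = Seq("""{"user": "bartosz", "address": {"city": "Paris"}}""").toDF("body")

documents.select(
  // path-based extraction, similar to PostgreSQL's ->> operator
  get_json_object($"body", "$.address.city").as("city"),
  // or a full parse into a struct, driven by an explicit schema
  from_json($"body", new StructType().add("user", StringType)).getField("user").as("user")
).show(false)
```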
Nested data structures are very useful for data denormalization in Big Data. They avoid the joins we would otherwise need for several related and fully normalized datasets. But processing such data structures is not always simple. Fortunately, Apache Spark SQL provides different utility functions that help to work with them.
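To give an example of what we'll manipulate, a minimal sketch with an array of structs and 2 of these helpers (the dataset is invented):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder()
  .appName("Nested structures")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// a denormalized order keeping its items nested as an array of structs
val orders = Seq((1L, Seq(("ssd", 2), ("ram", 4)))).toDF("order_id", "items")

orders
  .select($"order_id", explode($"items").as("item"))  // one output row per array element
  .select($"order_id", $"item._1".as("product"), $"item._2".as("quantity"))  // dot access into the struct
  .show(false)
```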
One of the previous posts in the SQL category presented window functions, which can be used to compute values per group of rows. These analytic functions are also available in Apache Spark SQL.
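A minimal sketch of one of them, ranking rows inside a partition without collapsing them the way GROUP BY would (the dataset is invented):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

val spark = SparkSession.builder()
  .appName("Window functions")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val salaries = Seq(
  ("team_a", "alice", 3000), ("team_a", "bob", 4000), ("team_b", "carol", 3500)
).toDF("team", "name", "salary")

// the window defines the groups (partitions) and the ordering used by rank()
val byTeam = Window.partitionBy($"team").orderBy($"salary".desc)
salaries.withColumn("rank_in_team", rank().over(byTeam)).show(false)
```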
Some recent posts covered important Spark SQL options for RDBMS: partitioning and write modes. However, they're not the only ones available for this data store.
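A quick sketch of some of these extra JDBC options (the connection URL, the table and the chosen values are only examples):

```scala
import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("JDBC options")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val users = Seq((1L, "user_1"), (2L, "user_2")).toDF("id", "login")

users.write
  .mode(SaveMode.Append)
  .option("batchsize", "1000")                    // number of rows sent per JDBC batch
  .option("isolationLevel", "READ_COMMITTED")     // transaction isolation level used for the writes
  .option("createTableOptions", "ENGINE=InnoDB")  // appended to CREATE TABLE when Spark creates the table
  .jdbc("jdbc:mysql://localhost:3306/spark_tests", "users", new Properties())
```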
Some months ago I presented save modes in Spark SQL. However, that post was limited to their use with files. I was quite surprised to observe some specific behavior of them with RDBMS sinks, especially for SaveMode.Overwrite.
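To show what the rest of the post digs into, a minimal sketch of an overwrite to a JDBC sink (the connection URL and table are hypothetical):

```scala
import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("Overwrite on JDBC")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val users = Seq((1L, "user_1")).toDF("id", "login")

// with SaveMode.Overwrite the target table is dropped and recreated by default;
// truncate=true keeps the existing table definition and only removes its rows
users.write
  .mode(SaveMode.Overwrite)
  .option("truncate", "true")
  .jdbc("jdbc:postgresql://localhost:5432/spark_tests", "users", new Properties())
```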
Some weeks ago I presented correlated scalar subqueries with the example of PostgreSQL. However, they can also be found in Big Data processing systems such as BigQuery or Apache Spark SQL.
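A minimal sketch of such a subquery in Spark SQL (the tables are created on the fly just for the example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Correlated scalar subquery")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

Seq((1L, "alice"), (2L, "bob")).toDF("id", "customer").createOrReplaceTempView("customers")
Seq((1L, 100.0), (1L, 250.0), (2L, 80.0)).toDF("customer_id", "amount").createOrReplaceTempView("orders")

// the subquery is correlated with the outer row and must return a single value per customer
spark.sql(
  """SELECT c.customer,
    |  (SELECT MAX(o.amount) FROM orders o WHERE o.customer_id = c.id) AS biggest_order
    |FROM customers c""".stripMargin
).show(false)
```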
In programming, simple is often synonymous with understandable and maintainable. However, it doesn't always mean efficient. One example of this thesis is the nested loop join, which is also present in Apache Spark SQL.
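To see it appear, a minimal sketch with a non-equi join condition, which rules out the hash- and sort-based strategies (the datasets are invented):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Nested loop join")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val events = Seq((1L, 10), (2L, 25)).toDF("event_id", "value")
val thresholds = Seq(("low", 0, 20), ("high", 20, 50)).toDF("label", "from", "to")

// without an equality predicate Spark falls back to a nested loop based join
// (BroadcastNestedLoopJoin here) - check the physical plan printed by explain()
val joined = events.join(thresholds, $"value" >= $"from" && $"value" < $"to")
joined.explain()
joined.show(false)
```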
Prior to the Spark 2.2.0 release, query optimization was based on a set of heuristic rules that ignored the characteristics of the data. But this release brought a tool well known from the RDBMS world: a Cost-Based Optimizer.
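As a taste of what will follow, a minimal sketch of enabling it and feeding it with statistics (the orders table and its columns are assumed to already exist in the catalog):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Cost-Based Optimizer")
  .master("local[*]")
  .config("spark.sql.cbo.enabled", "true")              // the CBO is disabled by default
  .config("spark.sql.cbo.joinReorder.enabled", "true")  // let it reorder joins from the collected statistics
  .getOrCreate()

// without statistics the optimizer has no costs to compare;
// the orders table and its columns are hypothetical
spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS customer_id, amount")
```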
A DataFrame can be either loaded or saved. And Spark SQL provides, as for a lot of other points, different strategies to deal with data persistence.
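A minimal sketch of the generic load and save paths (formats, options and paths are only examples):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("Load and save")
  .master("local[*]")
  .getOrCreate()

// generic load: pick a format and the options it understands
val visits = spark.read
  .format("json")
  .option("multiLine", "false")
  .load("/tmp/visits.json")

// generic save: the SaveMode decides what happens when the target already exists
visits.write
  .format("parquet")
  .mode(SaveMode.ErrorIfExists)
  .save("/tmp/visits_parquet")
```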
It's time to continue the exploration of operator optimizations of logical plans in Spark SQL. After the first part describing optimizations from A to L, this post covers the remaining letters.
Predicate pushdown is one of the most popular optimizations in Spark SQL. But it's not the only one, and the main list of them is defined in the org.apache.spark.sql.catalyst.optimizer.Optimizer abstract class.
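Before going through that list, a minimal sketch showing the pushdown at work on a Parquet source (the path and data are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Predicate pushdown")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

(1 to 1000).map(nr => (nr, s"user_$nr")).toDF("id", "login")
  .write.mode("overwrite").parquet("/tmp/users_parquet")

// the filter is pushed down to the Parquet reader - look for the "PushedFilters"
// entry in the physical plan instead of a filter applied on every read row
spark.read.parquet("/tmp/users_parquet")
  .filter($"id" > 500)
  .explain(true)
```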