Data processing articles

Looking for something else? Check the categories of Data processing:

Apache Beam Apache Spark Apache Spark GraphFrames Apache Spark GraphX Apache Spark SQL Apache Spark Streaming Apache Spark Structured Streaming PySpark

If not, below you can find all articles belonging to Data processing.

Data transformations in Apache Beam

Transformation are intrinsic part of each data processing framework. Apache Beam is not an exception and it also provides some of build-in transformations that can be freely extended with appropriated structures.

Continue Reading →

PCollection - data representation in Apache Beam

One of the problems with data processing frameworks released in the past few years was the use of different abstractions for batch and streaming tasks. Apache Beam is an exception of this rule because it proposes a uniform data representation called PCollection.

Continue Reading →

Spark SQL Cost-Based Optimizer

Prior to Spark 2.2.0 release, the data processing was based on a set of heuristic rules ignoring the typology of the data. But the most recent release brought a tool well known from the RDBMS world that is a Cost-Based Optimizer.

Continue Reading →

Spark failure detection - heartbeats

One of problems in distributed computing is the failure detection. How a master node can know that some of its workers went down just a minute ? A popular and quite simple solution uses heartbeats sent at regular interval by the workers. Spark also implements this technique.

Continue Reading →

Spark data locality

If you've ever analyzed Spark UI, you've certainly seen the part of Locality level in the table with tasks. Even if this concept is less exposed than the topics as shuffle, it remains quite important in efficient data processing.

Continue Reading →

Save modes in Spark SQL

DataFrame can either be loaded and saved. And Spark SQL provides, as for a lot other points, different strategies to deal with data persistence.

Continue Reading →

Spark SQL operator optimizations - part 2

It's time to continue the exploration of operator optimizations of logic plans in Spark SQL. After the first part describing optimizations from A to L, this post covers remaining letters.

Continue Reading →

Spark SQL operator optimizations - part 1

Pushdown predicate is one of the most popular optimizations in Spark SQL. But it's not the single one and their main list is defined in org.apache.spark.sql.catalyst.optimizer.Optimizer abstract class.

Continue Reading →

User Defined Aggregate Functions

User Defined Functions are not the single way to extend Spark SQL. The second solution is offered by User Defined Aggregate Functions.

Continue Reading →

Spark SQL statistics

Spark SQL has a lot of "hidden" features making it an efficient processing tool. One of them are statistics.

Continue Reading →

Predicate pushdown in Spark SQL

The optimizer in Spark SQL helps to improve the performance of processing pipelines. One of its techniques is predicate pushdown.

Continue Reading →

org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start() explained

The error quoted in the title of this post is quite common when you want to copy conception logic from Spark DStream/RDD to Spark structured streaming. This post makes some insight on it.

Continue Reading →

Analyzing Structured Streaming Kafka integration - Kafka source

Spark 2.2.0 brought the change of structured streaming state. Between 2.0 and 2.2.0 it was marked as "alpha". But the last version changed this status to General Availability. It's so a good moment to start to play with this new feature - even if some basics have already been covered in the post about structured streaming. This time we'll go deeper and analyze the integration with Apache Kafka that will be helpful to

Continue Reading →

Partitioning internals in Spark

In October I published the post about Partitioning in Spark. It was an introduction to the partitioning part, mainly focused on basic information, as partitioners and partitioning transformations (coalesce and repartition). This time it's a good moment to take other partition points up.

Continue Reading →

Chain of responsibility design pattern in Spark SQL UDF

Chain of responsibility design pattern is one of my favorite's alternatives to avoid too many nested calls. Some days ago I was wondering if it could be used instead of nested calls of multiple UDFs applied in column level in Spark SQL. And the response was affirmative.

Continue Reading →

Spak UI meaning - common parts

Spark UI is a good method to track jobs execution and detect performances issues. But the multiple parts of the UI, some of them depending on used Spark library, can scare at first glance. This post tries to explain all necessary points to understand better the common parts of Spark UI.

Continue Reading →

User Defined Functions

User Defined Types (UDT) described in one of previous posts aren't the single customization possibility in Apache Spark SQL. The other possibility are User Defined Functions (UDF).

Continue Reading →

Fetchsize in Spark SQL

Spark SQL reading from RDBMS is based on classic JDBC drivers. Thus it supports some of their options, as fetchsize described in sections below.

Continue Reading →

Speculative execution in Spark

Spark has a lot of interesting features and one of them is the speculative execution of tasks.

Continue Reading →

Sort-merge join in Spark SQL

After discovering two methods used to join DataFrames, broadcast and hashing, it's time to talk about the third possibility - sort-merge join.

Continue Reading →