Data processing articles

Looking for something else? Check the categories of Data processing:

Apache Beam Apache Spark Apache Spark GraphFrames Apache Spark GraphX Apache Spark SQL Apache Spark Streaming Apache Spark Structured Streaming

If not, below you can find all articles belonging to Data processing.

Writing custom optimization in Apache Spark SQL - parser

I started the series about Apache Spark SQL customization from the last parts of query execution, which are logical and physical plans. But you must know that before the framework generates these plans, it must first parse the query.

Continue Reading →

Writing custom optimization in Apache Spark SQL - Union rewriter MVP version

Last time I presented you the basics of code generation in physical plans of Apache Spark SQL. This time I will try to write a physical plan executing UNION operation as a JOIN without code generation.

Continue Reading →

Apache Avro and Apache Spark compatibility

I'm very happy when the readers comment on my posts or tweets. A lot of such discussions are the topics of posts. It's the case of this one where I try to figure out whether Apache Spark SQL Avro source is compatible with other applications using this serialization format.

Continue Reading →

Writing custom optimization in Apache Spark SQL - generated code

In my previous post, I explained how to implement a custom physical plan execution. However, this first version didn't use generated code which is also an interesting option to customize Apache Spark. And it's also the feature that I will cover in this post.

Continue Reading →

Apache Spark Structured Streaming and Apache Kafka offsets management

Some time ago I got 3 interesting questions about the implementation of Apache Kafka connector in Apache Spark Structured Streaming. I will answer them in this post.

Continue Reading →

Writing custom optimization in Apache Spark SQL - physical plan

If you follow Apache Spark SQL category on my blog, you can see a lot of posts about customizing this framework. After recently published custom logical rules, it's time to explore another part which is planner strategy.

Continue Reading →

randomSplit implementation in Apache Spark SQL

Several weeks ago when I was checking new "apache-spark" tagged questions on StackOverflow I found one that caught my attention. The author was saying that randomSplit method doesn't divide the dataset equally and after merging back, the number of lines was different. Even though I wasn't able to answer at that moment, I decided to investigate this function and find possible reasons for that error.

Continue Reading →

Validating JSON with Apache Spark and Cerberus

In one of recent Meetups I heard that one of the most difficult data engineering tasks is ensuring good data quality. I'm more than agree with that statement and that's the reason why in this post I will share one of solutions to detect data issues with PySpark (my first PySpark code !) and Python library called Cerberus.

Continue Reading →

Apache Spark SQL and unit tests

Some time ago I was involved in a discussion about testing Apache Spark SQL code. In this post, I would like to share my observations about this topic.

Continue Reading →

Writing Apache Spark SQL custom logical optimization - improved code and summary

In the previous post about Apache Spark SQL custom optimizations I presented a rule transforming UNION operator into JOIN. At this time I only did a simple version working only with 2 datasets. In this post, I will share its improved version.

Continue Reading →

Range partitioning in Apache Spark SQL

The most popular partitioning strategy divides the dataset by the hash computed from one or more values of the record. However other partitioning strategies exist as well and one of them is range partitioning implemented in Apache Spark SQL with repartitionByRange method, described in this post.

Continue Reading →

Writing Apache Spark SQL custom logical optimization - the first version

Last time I wrote about different hints present in RDBMS and Hive. Today it's the moment to implement one of them.

Continue Reading →

Writing Apache Spark SQL custom logical optimization - unsupported optimization hints

After 2 previous posts dedicated to custom optimization in Apache Spark SQL, it's a good moment to start to write the code. As Jacek Laskowski suggested on Twitter (link in Read more), I will try to implement one extra optimization hint. But first things first and let's start with hints definition.

Continue Reading →

Regression tests with Apache Spark SQL joins

Regressions are one of the risks of our profession. Fortunately, we can limit the risk thanks to different testing strategies. One of them are regression tests that we can use to check whether the modified data processing logic didn't introduce the regressions simply by comparing two datasets.

Continue Reading →

FAIR jobs scheduling in Apache Spark

During my exploration of Apache Spark configuration options, I found an entry called spark.scheduler.mode. After looking for its possible values, I ended up with a pretty intriguing concept called FAIR scheduling that I will detail in this post.

Continue Reading →

Writing Apache Spark SQL custom logical optimization - API

In one of my previous posts I presented how to add a custom optimization to Apache Spark SQL. It was not a good moment to deep delve into the topic because of its complexity. That's why I will try to do a better job here by showing the API of native optimizations.

Continue Reading →

Apache Spark SQL and types resolution in semi-structured data

One of data governance goals is to ensure data consistency across different producers. Unfortunately, very often it's only a theory and especially when the data format is schemaless. It's why the data exploration is an important step in the process of data pipeline definition. In this post I wanted to do a small exercise and check how Apache Spark SQL behaves with inconsistent data.

Continue Reading →

Bzip2 compression in Apache Spark

Compression has a lot of benefits in the data context. It reduces the size of stored data, so you will save some space and also have less data to transfer across the network in the case of a data processing pipeline. And if you use Bzip2, you can process the compressed data in parallel. In this post, I will try to explain how does it happen.

Continue Reading →

Apache Spark 2.4.0 features - EXCEPT ALL and INTERSECT ALL

Apache Spark 2.4.0 brought a lot of internal changes but also some new features exposed to the end users, as already presented high-order functions. In this post, I will present another new feature, or rather 2 actually, because I will talk about 2 new SQL functions.

Continue Reading →

Monotonically increasing id function in Apache Spark SQL

Some time ago I was thinking whether Apache Spark provides the support for auto-incremented values, so hard to implement in distributed environments After some research, I almost found what I was looking for - monotonically increasing id. In this post, I will try to explain why "almost".

Continue Reading →