Data processing articles

Looking for something else? Check the categories of Data processing:

Apache Beam Apache Spark Apache Spark GraphFrames Apache Spark GraphX Apache Spark SQL Apache Spark Streaming Apache Spark Structured Streaming PySpark

If not, below you can find all articles belonging to Data processing.

Writing Apache Spark SQL custom logical optimization - unsupported optimization hints

After 2 previous posts dedicated to custom optimization in Apache Spark SQL, it's a good moment to start to write the code. As Jacek Laskowski suggested on Twitter (link in Read more), I will try to implement one extra optimization hint. But first things first and let's start with hints definition.

Continue Reading →

Regression tests with Apache Spark SQL joins

Regressions are one of the risks of our profession. Fortunately, we can limit the risk thanks to different testing strategies. One of them are regression tests that we can use to check whether the modified data processing logic didn't introduce the regressions simply by comparing two datasets.

Continue Reading →

FAIR jobs scheduling in Apache Spark

During my exploration of Apache Spark configuration options, I found an entry called spark.scheduler.mode. After looking for its possible values, I ended up with a pretty intriguing concept called FAIR scheduling that I will detail in this post.

Continue Reading →

Writing Apache Spark SQL custom logical optimization - API

In one of my previous posts I presented how to add a custom optimization to Apache Spark SQL. It was not a good moment to deep delve into the topic because of its complexity. That's why I will try to do a better job here by showing the API of native optimizations.

Continue Reading →

Apache Spark SQL and types resolution in semi-structured data

One of data governance goals is to ensure data consistency across different producers. Unfortunately, very often it's only a theory and especially when the data format is schemaless. It's why the data exploration is an important step in the process of data pipeline definition. In this post I wanted to do a small exercise and check how Apache Spark SQL behaves with inconsistent data.

Continue Reading →

Bzip2 compression in Apache Spark

Compression has a lot of benefits in the data context. It reduces the size of stored data, so you will save some space and also have less data to transfer across the network in the case of a data processing pipeline. And if you use Bzip2, you can process the compressed data in parallel. In this post, I will try to explain how does it happen.

Continue Reading →

Apache Spark 2.4.0 features - EXCEPT ALL and INTERSECT ALL

Apache Spark 2.4.0 brought a lot of internal changes but also some new features exposed to the end users, as already presented high-order functions. In this post, I will present another new feature, or rather 2 actually, because I will talk about 2 new SQL functions.

Continue Reading →

Monotonically increasing id function in Apache Spark SQL

Some time ago I was thinking whether Apache Spark provides the support for auto-incremented values, so hard to implement in distributed environments After some research, I almost found what I was looking for - monotonically increasing id. In this post, I will try to explain why "almost".

Continue Reading →

Introduction to custom optimization in Apache Spark SQL

In November 2018 bithw1 pointed out to me a feature that I haven't used yet in Apache Spark - custom optimization. After some months consacred to learning Apache Spark GraphX, I finally found a moment to explore it. This post begins a new series about Apache Spark customization and it covers the basics, i.e. the 2 available methods to add the custom optimizations.

Continue Reading →

Motifs finding in GraphFrames

In the previous post in GraphFrames category I mentioned the motifs finding feature and promised to write a separate post about it. After several weeks dedicated to the Apache Spark 2.4.0 features, I finally managed to find some time to explore the motifs finding in GraphFrames.

Continue Reading →

Initializing state in Structured Streaming

Some time ago I was asked by Sunil whether it was possible to load the initial state in Apache Spark Structured Streaming like in DStream-based API. Since the response was not obvious, I decided to investigate and share the findings through this post.

Continue Reading →

Memory and Apache Spark classes

In previous posts about memory in Apache Spark, I've been exploring memory behavior of Apache Spark when the input files are much bigger than the allocated memory. After that it's a good moment to sum up that in the post dedicated to classes involved in memory using tasks.

Continue Reading →

Visualizing Apache Spark GraphX data processing with websockets and cytoscape.js

For a long time, I've wanted to make a small real-time data visualization application with the use of websockets and some fancy JavaScript visualization framework. And the moment went when I was preparing the execution schemas to illustrate distributed graph algorithms covered in Graph algorithms in distributed world - part 1 post. I used there static images combined together but it was quite painful. Because of that, I decided to check whether it's possible to do in a more programmatic way.

Continue Reading →

Apache Spark 2.4.0 features - foreachBatch

When I first heard about the foreachBatch feature, I thought that it was the implementation of foreachPartition in the Structured Streaming module. However, after some analysis I saw how I was wrong because this new feature addresses other but also important problems. You will find more .

Continue Reading →

Apache Spark 2.4.0 features - barrier execution mode

Data-driven systems continuously change. We moved from static, batch-oriented daily processing jobs to real-time streaming-based pipelines running all the time. Nowadays, the workflows have more and more AI compontents. Apache Spark tries to stay in the movement and in the new release proposes the implementation of the barrier execution mode as a new way to schedule tasks.

Continue Reading →

Creating graphs in GraphFrames

The Project Tungsten revolutionized Apache Spark ecosystem. Thanks to the new row-based data structure the jobs became more performant and easier to create. This revolution first affected the batch processing and later the streaming one. As of writing the following article, the graph processing is still not impacted but hopefully GraphFrames project can change this.

Continue Reading →

Apache Spark 2.4.0 features - Avro data source

Apache Avro became one of the serialization standards, among others because of its use in Apache Kafka's schema registry. Previously to work with Avro files with Apache Spark we needed Databrick's external package. But it's no longer the case starting from 2.4.0 release where Avro became first-class citizen data source.

Continue Reading →

Iterative algorithms with Pregel on Apache Spark GraphX

One of important characteristics of distributed graph processing which makes it different from classical Map/Reduce approach is the iterative nature of many algorithms. Pregel is one of the computation models that supports such kind of processing very well, while Apache Spark GraphX comes with its own Pregel implementation.

Continue Reading →

Apache Spark 2.4.0 features - array and higher-order functions

The series about the features introduced in Apache Spark 2.4.0 continues. Today's post will cover higher-order functions that you may know from elsewhere.

Continue Reading →

Loading and saving graphs in Apache Spark GraphX

Until now we've been working only with in-memory graphs. However, Apache Spark GraphX provides a much more convenient and prod-ready methods to load and save them. And this post will try to show them.

Continue Reading →