Data processing articles

Looking for something else? Check the Data processing categories:

Apache Beam Apache Flink Apache Spark Apache Spark GraphFrames Apache Spark GraphX Apache Spark SQL Apache Spark Streaming Apache Spark Structured Streaming PySpark

Otherwise, below you can find all the articles belonging to Data processing.

Memory and Apache Spark classes

In previous posts about memory in Apache Spark, I explored how Apache Spark behaves when the input files are much bigger than the allocated memory. It's now a good moment to sum that up in a post dedicated to the classes involved in memory-consuming tasks.

Continue Reading →

Visualizing Apache Spark GraphX data processing with websockets and cytoscape.js

For a long time, I've wanted to build a small real-time data visualization application with websockets and some fancy JavaScript visualization framework. The moment came when I was preparing the execution schemas illustrating the distributed graph algorithms covered in the Graph algorithms in distributed world - part 1 post. There I combined static images, which was quite painful. Because of that, I decided to check whether it's possible to do it in a more programmatic way.

Continue Reading →

Apache Spark 2.4.0 features - foreachBatch

When I first heard about the foreachBatch feature, I thought it was the implementation of foreachPartition in the Structured Streaming module. However, after some analysis, I saw how wrong I was, because this new feature addresses other but equally important problems. You will find more in this post.
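
As a first illustration, here is a minimal sketch of the feature; the rate test source and the output path are chosen just for the example:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ForeachBatchExample extends App {
  val spark = SparkSession.builder()
    .appName("foreachBatch demo").master("local[*]").getOrCreate()

  // The built-in rate source generates (timestamp, value) test rows
  val numbers = spark.readStream.format("rate").load()

  val query = numbers.writeStream
    .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
      // Each micro-batch is exposed as a plain DataFrame, so we can reuse
      // batch-only sinks or write the same batch to several outputs
      batchDf.write.mode("append").json(s"/tmp/foreach_batch_demo/${batchId}")
    }
    .start()
  query.awaitTermination(30000)
}
```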

Continue Reading →

Apache Spark 2.4.0 features - barrier execution mode

Data-driven systems change continuously. We moved from static, batch-oriented daily processing jobs to real-time streaming pipelines running all the time. Nowadays, workflows have more and more AI components. Apache Spark tries to keep up with this movement and, in the new release, proposes the barrier execution mode as a new way to schedule tasks.
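
As a teaser, a minimal sketch of the new API could look like the following; the computation itself is made up:

```scala
import org.apache.spark.{BarrierTaskContext, SparkConf, SparkContext}

object BarrierModeExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("barrier demo").setMaster("local[4]"))

  // barrier() asks the scheduler for all-or-nothing task launch: either
  // all 4 tasks of the stage start together, or none of them does
  val doubled = sc.parallelize(1 to 100, 4).barrier()
    .mapPartitions { partition =>
      val context = BarrierTaskContext.get()
      context.barrier() // global synchronization point across all tasks
      partition.map(_ * 2)
    }
  doubled.collect().foreach(println)
}
```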

Continue Reading →

Creating graphs in GraphFrames

The Project Tungsten revolutionized the Apache Spark ecosystem. Thanks to the new row-based data structure, jobs became more performant and easier to create. This revolution first affected batch processing and later streaming. As of writing this article, graph processing is still not impacted, but hopefully the GraphFrames project can change that.
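
For instance, a minimal graph creation with GraphFrames could look like this sketch; the vertices and edges are invented for the example:

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object GraphFramesCreationExample extends App {
  val spark = SparkSession.builder()
    .appName("GraphFrames demo").master("local[*]").getOrCreate()
  import spark.implicits._

  // GraphFrames builds graphs from plain DataFrames: vertices need an "id"
  // column, edges need "src" and "dst" columns
  val people = Seq((1L, "user1"), (2L, "user2"), (3L, "user3")).toDF("id", "name")
  val follows = Seq((1L, 2L), (2L, 3L)).toDF("src", "dst")

  val graph = GraphFrame(people, follows)
  graph.inDegrees.show()
}
```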

Continue Reading →

Apache Spark 2.4.0 features - Avro data source

Apache Avro became one of the serialization standards, among others because of its use in Apache Kafka's schema registry. Previously, to work with Avro files in Apache Spark, we needed Databricks' external package. But that's no longer the case starting with the 2.4.0 release, where Avro became a first-class-citizen data source.
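
A minimal sketch of the native source could look like this; the paths are made up for the example:

```scala
import org.apache.spark.sql.SparkSession

object AvroSourceExample extends App {
  val spark = SparkSession.builder()
    .appName("Avro demo").master("local[*]").getOrCreate()
  import spark.implicits._

  val users = Seq(("user1", 20), ("user2", 30)).toDF("login", "age")

  // Since 2.4.0 the "avro" format is resolved natively; the spark-avro
  // module still has to be on the classpath as a dependency
  users.write.format("avro").save("/tmp/avro_demo/users")
  spark.read.format("avro").load("/tmp/avro_demo/users").show()
}
```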

Continue Reading →

Iterative algorithms with Pregel on Apache Spark GraphX

One of the important characteristics of distributed graph processing that makes it different from the classical Map/Reduce approach is the iterative nature of many algorithms. Pregel is one of the computation models that supports this kind of processing very well, and Apache Spark GraphX comes with its own Pregel implementation.
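
To give a first taste, here is a minimal Pregel sketch that propagates the maximum vertex value through a small, made-up graph:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object PregelExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("Pregel demo").setMaster("local[*]"))

  val vertices = sc.parallelize(Seq((1L, 1), (2L, 5), (3L, 3)))
  val edges = sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0), Edge(3L, 1L, 0)))
  val graph = Graph(vertices, edges)

  // pregel takes an initial message plus 3 functions: the vertex program,
  // the message sender and the message merger; here they propagate the
  // maximum vertex value through the whole graph
  val maxGraph = graph.pregel(Int.MinValue)(
    (_, value, message) => math.max(value, message),
    triplet =>
      if (triplet.srcAttr > triplet.dstAttr) Iterator((triplet.dstId, triplet.srcAttr))
      else Iterator.empty,
    (msg1, msg2) => math.max(msg1, msg2)
  )
  maxGraph.vertices.collect().foreach(println)
}
```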

Continue Reading →

Apache Spark 2.4.0 features - array and higher-order functions

The series about the features introduced in Apache Spark 2.4.0 continues. Today's post will cover the array and higher-order functions that you may already know from elsewhere.
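
As an appetizer, a small sketch with two of them could look like this; the table and data are invented for the example:

```scala
import org.apache.spark.sql.SparkSession

object HigherOrderFunctionsExample extends App {
  val spark = SparkSession.builder()
    .appName("higher-order functions demo").master("local[*]").getOrCreate()
  import spark.implicits._

  Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5))).toDF("id", "numbers")
    .createOrReplaceTempView("measures")

  // transform maps a lambda over every array element; aggregate folds the
  // array into a single value - both arrived with 2.4.0
  spark.sql(
    """SELECT id,
      |  transform(numbers, n -> n * 2) AS doubled,
      |  aggregate(numbers, 0, (acc, n) -> acc + n) AS summed
      |FROM measures""".stripMargin).show()
}
```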

Continue Reading →

Loading and saving graphs in Apache Spark GraphX

Until now we've been working only with in-memory graphs. However, Apache Spark GraphX provides much more convenient and production-ready methods to load and save them. This post will try to show them.
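
As a quick preview, a sketch of loading and persisting a graph could look like this; the paths are made up, and saving as object files is just one possible approach:

```scala
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.{SparkConf, SparkContext}

object GraphLoadingExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("graph loading demo").setMaster("local[*]"))

  // edgeListFile expects one "sourceId destinationId" pair per line
  val graph = GraphLoader.edgeListFile(sc, "/tmp/graph_demo/edges.txt")
  println(s"vertices=${graph.vertices.count()} edges=${graph.edges.count()}")

  // There is no single save method for a whole graph; one option is to
  // persist both composing RDDs and rebuild the graph from them later
  graph.vertices.saveAsObjectFile("/tmp/graph_demo/saved_vertices")
  graph.edges.saveAsObjectFile("/tmp/graph_demo/saved_edges")
}
```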

Continue Reading →

Apache Spark 2.4.0 features - watermark configuration

The series about Apache Spark 2.4.0 features continues. After last week's discovery of bucket pruning, it's time to switch to the Structured Streaming module and see one of its major evolutions.
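
As a preview, a minimal sketch combining a watermark definition with the new multiple-watermark policy property could look like this; the windowing logic is made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object WatermarkExample extends App {
  val spark = SparkSession.builder()
    .appName("watermark demo").master("local[*]")
    // new in 2.4.0: how the global watermark is computed when a query
    // combines several watermarked sources - "min" (default) or "max"
    .config("spark.sql.streaming.multipleWatermarkPolicy", "min")
    .getOrCreate()
  import spark.implicits._

  val counts = spark.readStream.format("rate").load()
    .withWatermark("timestamp", "10 minutes") // tolerate 10 min of lateness
    .groupBy(window($"timestamp", "5 minutes"))
    .count()

  val query = counts.writeStream.outputMode("update").format("console").start()
  query.awaitTermination(30000)
}
```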

Continue Reading →

Edge partitioning strategies

Previously we learned about the representations of vertices and edges in Apache Spark GraphX. At that time, to avoid introducing too many new concepts at once, we deliberately omitted edge partitioning. Luckily, a new week has come and lets us discuss it.
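
As a quick preview, applying one of the built-in strategies could look like this sketch; the graph is made up for the example:

```scala
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}
import org.apache.spark.{SparkConf, SparkContext}

object EdgePartitioningExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("edge partitioning demo").setMaster("local[4]"))

  val vertices = sc.parallelize((1L to 6L).map(id => (id, s"v${id}")))
  val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1),
    Edge(3L, 4L, 1), Edge(4L, 5L, 1), Edge(5L, 6L, 1), Edge(6L, 1L, 1)))

  // GraphX ships 4 built-in strategies (RandomVertexCut,
  // CanonicalRandomVertexCut, EdgePartition1D, EdgePartition2D); the 2D one
  // bounds vertex replication to about 2 * sqrt(numPartitions) copies
  val graph = Graph(vertices, edges)
    .partitionBy(PartitionStrategy.EdgePartition2D, 4)
  println(graph.edges.getNumPartitions)
}
```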

Continue Reading →

Apache Spark 2.4.0 features - bucket pruning

This post begins a new series dedicated to Apache Spark 2.4.0 features. The first topic covered will be bucket pruning.
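
As a quick preview, a bucketed table that can benefit from the pruning could look like this sketch; the table and data are invented for the example:

```scala
import org.apache.spark.sql.SparkSession

object BucketPruningExample extends App {
  val spark = SparkSession.builder()
    .appName("bucket pruning demo").master("local[*]").getOrCreate()
  import spark.implicits._

  // Rows are hashed on "id" into 10 buckets at write time
  (1 to 1000).map(nr => (nr, s"user${nr}")).toDF("id", "login")
    .write.bucketBy(10, "id").saveAsTable("bucketed_users")

  // With bucket pruning, this equality filter on the bucketing column lets
  // Spark read only the bucket that can contain id = 300, not all 10
  spark.table("bucketed_users").filter($"id" === 300).explain(true)
}
```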

Continue Reading →

Defining schemas in Apache Spark SQL with builder design pattern

Schemas are one of the key parts of Apache Spark SQL and one of its distinguishing points from the old RDD-based API. When we deal with data coming from a structured data source, such as a relational database or schema-based file formats, we can let the framework resolve the schema for us. But things get complicated when we're working with semi-structured data such as JSON, and we must define the schema by hand.
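
Before going further, a hand-defined schema could look like this sketch; the fields and the input path are made up, and StructType's own chaining API already behaves like a builder:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object SchemaDefinitionExample extends App {
  val spark = SparkSession.builder()
    .appName("schema demo").master("local[*]").getOrCreate()

  // StructType.add chains like a builder - every call returns a new,
  // extended StructType
  val userSchema = new StructType()
    .add("login", StringType, nullable = false)
    .add("age", IntegerType, nullable = true)
    .add("address", new StructType()
      .add("city", StringType)
      .add("zip", StringType))

  spark.read.schema(userSchema).json("/tmp/schema_demo/users.json").printSchema()
}
```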

Continue Reading →

Edge representation in Apache Spark GraphX

After last week's discovery of VertexRDD, we still have one graph-composing item to explain - EdgeRDD. After all, a graph is about relationships, and this RDD guarantees the links between vertices.
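
As a quick illustration, a minimal sketch manipulating the EdgeRDD could look like this; the graph is made up for the example:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object EdgeRDDExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("EdgeRDD demo").setMaster("local[*]"))

  val users = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C")))
  // An Edge carries a source id, a destination id and an attribute
  val links = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

  val graph = Graph(users, links)
  // graph.edges is the EdgeRDD; mapValues changes the attributes without
  // touching the partitioned edge structure
  graph.edges.mapValues(edge => edge.attr.toUpperCase).collect().foreach(println)
}
```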

Continue Reading →

The who, when, how and what of Apache Spark SQL code generation

The code generated by Apache Spark for queries defined with higher-level concepts, such as SQL, is key to understanding processing performance. This post, started after a discussion on my GitHub, tries to explain some basics of the code generation workflow.
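
One handy entry point is the debug helpers shipped with Spark SQL; a minimal sketch to display the generated code could look like this (the query is made up for the example):

```scala
import org.apache.spark.sql.SparkSession
// brings the debugCodegen() helper into scope
import org.apache.spark.sql.execution.debug._

object CodeGenerationExample extends App {
  val spark = SparkSession.builder()
    .appName("codegen demo").master("local[*]").getOrCreate()
  import spark.implicits._

  // Prints the Java classes produced by whole-stage code generation
  // for this filtering query
  (1 to 100).toDF("nr").filter($"nr" > 50).debugCodegen()
}
```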

Continue Reading →

Vertex representation in Apache Spark GraphX

After last week's global overview of the graph representation in the GraphX module, it's time to go a little deeper and analyze the two main components of graphs: vertices and edges. We'll begin here with the former.
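
To set the scene, a minimal sketch manipulating the VertexRDD could look like this; the graph is made up:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexRDD}
import org.apache.spark.{SparkConf, SparkContext}

object VertexRDDExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("VertexRDD demo").setMaster("local[*]"))

  // Vertices are (VertexId, attribute) pairs where the ids are unique Longs
  val users = sc.parallelize(Seq((1L, "user1"), (2L, "user2"), (3L, "user3")))
  val follows = sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(3L, 1L, 0)))

  val graph = Graph(users, follows)
  // graph.vertices is a VertexRDD: an RDD[(VertexId, VD)] indexed by vertex
  // id, which speeds up joins and attribute-only transformations
  val upperCased: VertexRDD[String] = graph.vertices.mapValues(_.toUpperCase)
  upperCased.collect().foreach(println)
}
```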

Continue Reading →

DataFrame and file bigger than available memory

Some weeks ago I wrote a post about the impact of files with long lines on RDD-based processing. The post revealed the difficulty of processing such files because of OOM errors. In this post I wanted to check how that applies to Datasets.

Continue Reading →

GraphX and fault-tolerance

Bad things happen in distributed data processing, and it's better if we're prepared for them. To prevent such issues, Apache Spark is able to recompute a failed partition, but also to store a computation snapshot as a checkpoint. Both properties apply to the GraphX module's fault-tolerance mechanism.
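
As a quick preview, the checkpointing part could look like this sketch; the directory and the graph are made up for the example:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object GraphCheckpointExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("checkpoint demo").setMaster("local[*]"))
  sc.setCheckpointDir("/tmp/graphx_demo/checkpoints")

  val graph = Graph(
    sc.parallelize(Seq((1L, "A"), (2L, "B"))),
    sc.parallelize(Seq(Edge(1L, 2L, "link"))))

  // checkpoint() writes both composing RDDs to the checkpoint directory and
  // truncates their lineage, which caps the recomputation cost on failure
  graph.checkpoint()
  graph.vertices.count() // an action is needed to materialize the checkpoint
  println(graph.isCheckpointed)
}
```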

Continue Reading →

Apache Spark and off-heap memory

With data-intensive applications, such as streaming ones, bad memory management can add long GC pauses. Luckily, we can reduce this impact by writing memory-optimized code and using storage outside the heap, called off-heap.
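
As a quick preview, enabling off-heap storage could look like this sketch; the 1g size is arbitrary:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object OffHeapExample extends App {
  // Off-heap storage is disabled by default; both properties go together
  val spark = SparkSession.builder()
    .appName("off-heap demo").master("local[*]")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "1g")
    .getOrCreate()
  import spark.implicits._

  // Blocks cached with the OFF_HEAP level live outside the JVM heap,
  // so the garbage collector never scans them
  val numbers = (1 to 1000000).toDF("nr").persist(StorageLevel.OFF_HEAP)
  println(numbers.count())
}
```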

Continue Reading →

Graphs representation in Apache Spark GraphX

Apache Spark uses a common data abstraction for all its higher-level data structures. This implementation rule is no different for GraphX, which is represented by a set of specialized versions of RDDs.
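
As a quick illustration, the composing RDDs can be seen in this minimal sketch; the graph is made up:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object GraphRepresentationExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("graph representation demo").setMaster("local[*]"))

  val graph = Graph(
    sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C"))),
    sc.parallelize(Seq(Edge(1L, 2L, "x"), Edge(2L, 3L, "y"))))

  // Under the hood a Graph is just 2 specialized RDDs plus a derived view
  // joining them: the triplets of (source vertex, edge, destination vertex)
  println(graph.vertices.getClass.getSimpleName)
  println(graph.edges.getClass.getSimpleName)
  graph.triplets.collect().foreach(println)
}
```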

Continue Reading →