Data processing articles

Looking for something else? Check the Data processing categories:

Apache Beam Apache Flink Apache Spark Apache Spark GraphFrames Apache Spark GraphX Apache Spark SQL Apache Spark Streaming Apache Spark Structured Streaming PySpark

Otherwise, below you can find all the articles belonging to Data processing.

Memory and Apache Spark classes

In previous posts about memory in Apache Spark, I explored how Apache Spark behaves when the input files are much bigger than the allocated memory. It's now a good moment to sum that up in a post dedicated to the classes involved in memory-consuming tasks.

Continue Reading →

Visualizing Apache Spark GraphX data processing with websockets and cytoscape.js

For a long time, I've wanted to build a small real-time data visualization application with websockets and some fancy JavaScript visualization framework. The moment came when I was preparing the execution schemas illustrating the distributed graph algorithms covered in the Graph algorithms in distributed world - part 1 post. There I combined static images, which was quite painful. Because of that, I decided to check whether it's possible to do it in a more programmatic way.

Continue Reading →

Apache Spark 2.4.0 features - foreachBatch

When I first heard about the foreachBatch feature, I thought it was the implementation of foreachPartition in the Structured Streaming module. However, after some analysis, I saw how wrong I was, because this new feature addresses other but equally important problems. You will find more in this post.
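
As a first illustration, here is a minimal sketch of the feature; the rate test source and the output path are chosen just for the example:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ForeachBatchExample extends App {
  val spark = SparkSession.builder()
    .appName("foreachBatch demo").master("local[*]").getOrCreate()

  // The built-in rate source generates (timestamp, value) test rows
  val numbers = spark.readStream.format("rate").load()

  val query = numbers.writeStream
    .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
      // Each micro-batch is exposed as a plain DataFrame, so we can reuse
      // batch-only sinks or write the same batch to several outputs
      batchDf.write.mode("append").json(s"/tmp/foreach_batch_demo/${batchId}")
    }
    .start()
  query.awaitTermination(30000)
}
```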

Continue Reading →

Apache Spark 2.4.0 features - barrier execution mode

Data-driven systems change continuously. We moved from static, batch-oriented daily processing jobs to real-time streaming pipelines running all the time. Nowadays, workflows have more and more AI components. Apache Spark tries to keep up with this movement and, in the new release, proposes the barrier execution mode as a new way to schedule tasks.
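
As a teaser, a minimal sketch of the new API could look like the following; the computation itself is made up:

```scala
import org.apache.spark.{BarrierTaskContext, SparkConf, SparkContext}

object BarrierModeExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("barrier demo").setMaster("local[4]"))

  // barrier() asks the scheduler for all-or-nothing task launch: either
  // all 4 tasks of the stage start together, or none of them does
  val doubled = sc.parallelize(1 to 100, 4).barrier()
    .mapPartitions { partition =>
      val context = BarrierTaskContext.get()
      context.barrier() // global synchronization point across all tasks
      partition.map(_ * 2)
    }
  doubled.collect().foreach(println)
}
```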

Continue Reading →

Creating graphs in GraphFrames

The Project Tungsten revolutionized the Apache Spark ecosystem. Thanks to the new row-based data structure, jobs became more performant and easier to create. This revolution first affected batch processing and later streaming. As of writing this article, graph processing is still not impacted, but hopefully the GraphFrames project can change that.
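
For instance, a minimal graph creation with GraphFrames could look like this sketch; the vertices and edges are invented for the example:

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object GraphFramesCreationExample extends App {
  val spark = SparkSession.builder()
    .appName("GraphFrames demo").master("local[*]").getOrCreate()
  import spark.implicits._

  // GraphFrames builds graphs from plain DataFrames: vertices need an "id"
  // column, edges need "src" and "dst" columns
  val people = Seq((1L, "user1"), (2L, "user2"), (3L, "user3")).toDF("id", "name")
  val follows = Seq((1L, 2L), (2L, 3L)).toDF("src", "dst")

  val graph = GraphFrame(people, follows)
  graph.inDegrees.show()
}
```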

Continue Reading →

Apache Spark 2.4.0 features - Avro data source

Apache Avro became one of the serialization standards, among others because of its use in Apache Kafka's schema registry. Previously, to work with Avro files in Apache Spark, we needed Databricks' external package. But that's no longer the case starting with the 2.4.0 release, where Avro became a first-class-citizen data source.
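
A minimal sketch of the native source could look like this; the paths are made up for the example:

```scala
import org.apache.spark.sql.SparkSession

object AvroSourceExample extends App {
  val spark = SparkSession.builder()
    .appName("Avro demo").master("local[*]").getOrCreate()
  import spark.implicits._

  val users = Seq(("user1", 20), ("user2", 30)).toDF("login", "age")

  // Since 2.4.0 the "avro" format is resolved natively; the spark-avro
  // module still has to be on the classpath as a dependency
  users.write.format("avro").save("/tmp/avro_demo/users")
  spark.read.format("avro").load("/tmp/avro_demo/users").show()
}
```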

Continue Reading →

Iterative algorithms with Pregel on Apache Spark GraphX

One of the important characteristics of distributed graph processing that makes it different from the classical Map/Reduce approach is the iterative nature of many algorithms. Pregel is one of the computation models that supports this kind of processing very well, and Apache Spark GraphX comes with its own Pregel implementation.
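
To give a first taste, here is a minimal Pregel sketch that propagates the maximum vertex value through a small, made-up graph:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object PregelExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("Pregel demo").setMaster("local[*]"))

  val vertices = sc.parallelize(Seq((1L, 1), (2L, 5), (3L, 3)))
  val edges = sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0), Edge(3L, 1L, 0)))
  val graph = Graph(vertices, edges)

  // pregel takes an initial message plus 3 functions: the vertex program,
  // the message sender and the message merger; here they propagate the
  // maximum vertex value through the whole graph
  val maxGraph = graph.pregel(Int.MinValue)(
    (_, value, message) => math.max(value, message),
    triplet =>
      if (triplet.srcAttr > triplet.dstAttr) Iterator((triplet.dstId, triplet.srcAttr))
      else Iterator.empty,
    (msg1, msg2) => math.max(msg1, msg2)
  )
  maxGraph.vertices.collect().foreach(println)
}
```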

Continue Reading →

Apache Spark 2.4.0 features - array and higher-order functions

The series about the features introduced in Apache Spark 2.4.0 continues. Today's post will cover the array and higher-order functions that you may already know from elsewhere.
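
As an appetizer, a small sketch with two of them could look like this; the table and data are invented for the example:

```scala
import org.apache.spark.sql.SparkSession

object HigherOrderFunctionsExample extends App {
  val spark = SparkSession.builder()
    .appName("higher-order functions demo").master("local[*]").getOrCreate()
  import spark.implicits._

  Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5))).toDF("id", "numbers")
    .createOrReplaceTempView("measures")

  // transform maps a lambda over every array element; aggregate folds the
  // array into a single value - both arrived with 2.4.0
  spark.sql(
    """SELECT id,
      |  transform(numbers, n -> n * 2) AS doubled,
      |  aggregate(numbers, 0, (acc, n) -> acc + n) AS summed
      |FROM measures""".stripMargin).show()
}
```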

Continue Reading →

Loading and saving graphs in Apache Spark GraphX

Until now we've been working only with in-memory graphs. However, Apache Spark GraphX provides much more convenient and production-ready methods to load and save them. This post will try to show them.
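
As a quick preview, a sketch of loading and persisting a graph could look like this; the paths are made up, and saving as object files is just one possible approach:

```scala
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.{SparkConf, SparkContext}

object GraphLoadingExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("graph loading demo").setMaster("local[*]"))

  // edgeListFile expects one "sourceId destinationId" pair per line
  val graph = GraphLoader.edgeListFile(sc, "/tmp/graph_demo/edges.txt")
  println(s"vertices=${graph.vertices.count()} edges=${graph.edges.count()}")

  // There is no single save method for a whole graph; one option is to
  // persist both composing RDDs and rebuild the graph from them later
  graph.vertices.saveAsObjectFile("/tmp/graph_demo/saved_vertices")
  graph.edges.saveAsObjectFile("/tmp/graph_demo/saved_edges")
}
```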

Continue Reading →

Apache Spark 2.4.0 features - watermark configuration

The series about Apache Spark 2.4.0 features continues. After last week's discovery of bucket pruning, it's time to switch to the Structured Streaming module and see one of its major evolutions.
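
As a preview, a minimal sketch combining a watermark definition with the new multiple-watermark policy property could look like this; the windowing logic is made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object WatermarkExample extends App {
  val spark = SparkSession.builder()
    .appName("watermark demo").master("local[*]")
    // new in 2.4.0: how the global watermark is computed when a query
    // combines several watermarked sources - "min" (default) or "max"
    .config("spark.sql.streaming.multipleWatermarkPolicy", "min")
    .getOrCreate()
  import spark.implicits._

  val counts = spark.readStream.format("rate").load()
    .withWatermark("timestamp", "10 minutes") // tolerate 10 min of lateness
    .groupBy(window($"timestamp", "5 minutes"))
    .count()

  val query = counts.writeStream.outputMode("update").format("console").start()
  query.awaitTermination(30000)
}
```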

Continue Reading →

Edge partitioning strategies

Previously we learned about the representations of vertices and edges in Apache Spark GraphX. At that time, to avoid introducing too many new concepts at once, we deliberately omitted edge partitioning. Luckily, a new week has come and lets us discuss it.
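
As a quick preview, applying one of the built-in strategies could look like this sketch; the graph is made up for the example:

```scala
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}
import org.apache.spark.{SparkConf, SparkContext}

object EdgePartitioningExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("edge partitioning demo").setMaster("local[4]"))

  val vertices = sc.parallelize((1L to 6L).map(id => (id, s"v${id}")))
  val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1),
    Edge(3L, 4L, 1), Edge(4L, 5L, 1), Edge(5L, 6L, 1), Edge(6L, 1L, 1)))

  // GraphX ships 4 built-in strategies (RandomVertexCut,
  // CanonicalRandomVertexCut, EdgePartition1D, EdgePartition2D); the 2D one
  // bounds vertex replication to about 2 * sqrt(numPartitions) copies
  val graph = Graph(vertices, edges)
    .partitionBy(PartitionStrategy.EdgePartition2D, 4)
  println(graph.edges.getNumPartitions)
}
```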

Continue Reading →

Apache Spark 2.4.0 features - bucket pruning

This post begins a new series dedicated to Apache Spark 2.4.0 features. The first topic covered will be bucket pruning.
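
As a quick preview, a bucketed table that can benefit from the pruning could look like this sketch; the table and data are invented for the example:

```scala
import org.apache.spark.sql.SparkSession

object BucketPruningExample extends App {
  val spark = SparkSession.builder()
    .appName("bucket pruning demo").master("local[*]").getOrCreate()
  import spark.implicits._

  // Rows are hashed on "id" into 10 buckets at write time
  (1 to 1000).map(nr => (nr, s"user${nr}")).toDF("id", "login")
    .write.bucketBy(10, "id").saveAsTable("bucketed_users")

  // With bucket pruning, this equality filter on the bucketing column lets
  // Spark read only the bucket that can contain id = 300, not all 10
  spark.table("bucketed_users").filter($"id" === 300).explain(true)
}
```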

Continue Reading →

Defining schemas in Apache Spark SQL with builder design pattern

Schemas are one of the key parts of Apache Spark SQL and one of its distinguishing points from the old RDD-based API. When we deal with data coming from a structured data source, such as a relational database or schema-based file formats, we can let the framework resolve the schema for us. But things get complicated when we're working with semi-structured data such as JSON, and we must define the schema by hand.
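
Before going further, a hand-defined schema could look like this sketch; the fields and the input path are made up, and StructType's own chaining API already behaves like a builder:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object SchemaDefinitionExample extends App {
  val spark = SparkSession.builder()
    .appName("schema demo").master("local[*]").getOrCreate()

  // StructType.add chains like a builder - every call returns a new,
  // extended StructType
  val userSchema = new StructType()
    .add("login", StringType, nullable = false)
    .add("age", IntegerType, nullable = true)
    .add("address", new StructType()
      .add("city", StringType)
      .add("zip", StringType))

  spark.read.schema(userSchema).json("/tmp/schema_demo/users.json").printSchema()
}
```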

Continue Reading →

Edge representation in Apache Spark GraphX

After last week's discovery of VertexRDD, we still have one graph-composing item to explain - EdgeRDD. After all, a graph is about relationships, and this RDD guarantees the links between vertices.
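
As a quick illustration, a minimal sketch manipulating the EdgeRDD could look like this; the graph is made up for the example:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object EdgeRDDExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("EdgeRDD demo").setMaster("local[*]"))

  val users = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C")))
  // An Edge carries a source id, a destination id and an attribute
  val links = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

  val graph = Graph(users, links)
  // graph.edges is the EdgeRDD; mapValues changes the attributes without
  // touching the partitioned edge structure
  graph.edges.mapValues(edge => edge.attr.toUpperCase).collect().foreach(println)
}
```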

Continue Reading →

The who, when, how and what of Apache Spark SQL code generation

The code generated by Apache Spark for queries defined with higher-level concepts, such as SQL, is key to understanding processing performance. This post, started after a discussion on my GitHub, tries to explain some basics of the code generation workflow.
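
One handy entry point is the debug helpers shipped with Spark SQL; a minimal sketch to display the generated code could look like this (the query is made up for the example):

```scala
import org.apache.spark.sql.SparkSession
// brings the debugCodegen() helper into scope
import org.apache.spark.sql.execution.debug._

object CodeGenerationExample extends App {
  val spark = SparkSession.builder()
    .appName("codegen demo").master("local[*]").getOrCreate()
  import spark.implicits._

  // Prints the Java classes produced by whole-stage code generation
  // for this filtering query
  (1 to 100).toDF("nr").filter($"nr" > 50).debugCodegen()
}
```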

Continue Reading →

Vertex representation in Apache Spark GraphX

After last week's global overview of the graph representation in the GraphX module, it's time to go a little deeper and analyze the two main components of graphs: vertices and edges. We'll begin here with the former.
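
To set the scene, a minimal sketch manipulating the VertexRDD could look like this; the graph is made up:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexRDD}
import org.apache.spark.{SparkConf, SparkContext}

object VertexRDDExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("VertexRDD demo").setMaster("local[*]"))

  // Vertices are (VertexId, attribute) pairs where the ids are unique Longs
  val users = sc.parallelize(Seq((1L, "user1"), (2L, "user2"), (3L, "user3")))
  val follows = sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(3L, 1L, 0)))

  val graph = Graph(users, follows)
  // graph.vertices is a VertexRDD: an RDD[(VertexId, VD)] indexed by vertex
  // id, which speeds up joins and attribute-only transformations
  val upperCased: VertexRDD[String] = graph.vertices.mapValues(_.toUpperCase)
  upperCased.collect().foreach(println)
}
```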

Continue Reading →

DataFrame and file bigger than available memory

Some weeks ago I wrote a post about the impact of files with long lines on RDD-based processing. The post revealed the difficulty of processing such files because of OOM errors. In this post I wanted to check how that applies to Datasets.

Continue Reading →

GraphX and fault-tolerance

Bad things happen in distributed data processing, and it's better if we're prepared for them. To prevent such issues, Apache Spark is able to recompute a failed partition, but also to store a computation snapshot as a checkpoint. Both properties apply to the GraphX module's fault-tolerance mechanism.
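
As a quick preview, the checkpointing part could look like this sketch; the directory and the graph are made up for the example:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object GraphCheckpointExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("checkpoint demo").setMaster("local[*]"))
  sc.setCheckpointDir("/tmp/graphx_demo/checkpoints")

  val graph = Graph(
    sc.parallelize(Seq((1L, "A"), (2L, "B"))),
    sc.parallelize(Seq(Edge(1L, 2L, "link"))))

  // checkpoint() writes both composing RDDs to the checkpoint directory and
  // truncates their lineage, which caps the recomputation cost on failure
  graph.checkpoint()
  graph.vertices.count() // an action is needed to materialize the checkpoint
  println(graph.isCheckpointed)
}
```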

Continue Reading →

Apache Spark and off-heap memory

With data-intensive applications, such as streaming ones, bad memory management can add long GC pauses. Luckily, we can reduce this impact by writing memory-optimized code and using storage outside the heap, called off-heap.
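
As a quick preview, enabling off-heap storage could look like this sketch; the 1g size is arbitrary:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object OffHeapExample extends App {
  // Off-heap storage is disabled by default; both properties go together
  val spark = SparkSession.builder()
    .appName("off-heap demo").master("local[*]")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "1g")
    .getOrCreate()
  import spark.implicits._

  // Blocks cached with the OFF_HEAP level live outside the JVM heap,
  // so the garbage collector never scans them
  val numbers = (1 to 1000000).toDF("nr").persist(StorageLevel.OFF_HEAP)
  println(numbers.count())
}
```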

Continue Reading →

Graphs representation in Apache Spark GraphX

Apache Spark uses a common data abstraction for all its higher-level data structures. This implementation rule is no different for GraphX, which is represented by a set of specialized versions of RDDs.
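
As a quick illustration, the composing RDDs can be seen in this minimal sketch; the graph is made up:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object GraphRepresentationExample extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("graph representation demo").setMaster("local[*]"))

  val graph = Graph(
    sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C"))),
    sc.parallelize(Seq(Edge(1L, 2L, "x"), Edge(2L, 3L, "y"))))

  // Under the hood a Graph is just 2 specialized RDDs plus a derived view
  // joining them: the triplets of (source vertex, edge, destination vertex)
  println(graph.vertices.getClass.getSimpleName)
  println(graph.edges.getClass.getSimpleName)
  graph.triplets.collect().foreach(println)
}
```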

Continue Reading →