Data processing articles

August 5, 2017 • Apache Spark SQL

Broadcast join in Spark SQL

Joining DataFrames can be a performance-sensitive task. After all, it involves matching data from two data sources and keeping the matched results in a single place. The first strategy that comes to mind is the shuffle join. However, it's not the only strategy implemented in Spark SQL. For some specific use cases, another type called broadcast join can be preferable.
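
To make the difference with a shuffle join more concrete, here is a minimal pure-Python sketch of the idea behind a broadcast join. It is not Spark's implementation: the function name, the list-of-lists model of partitions and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large side against a full copy of the small side."""
    # Build the lookup table once; in a cluster this is the structure
    # that would be shipped ("broadcast") to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    for partition in large_partitions:
        # Each partition is probed locally, so rows of the large side
        # never have to move between partitions (no shuffle).
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined


# Hypothetical data: two partitions of orders and a small customers table.
orders = [
    [{"customer_id": 1, "amount": 10}, {"customer_id": 2, "amount": 5}],
    [{"customer_id": 1, "amount": 7}, {"customer_id": 3, "amount": 2}],
]
customers = [{"customer_id": 1, "name": "Ann"}, {"customer_id": 2, "name": "Bob"}]

result = broadcast_hash_join(orders, customers, "customer_id")
```

The sketch only pays off when one side is small enough to be copied everywhere, which is the trade-off that the spark.sql.autoBroadcastJoinThreshold setting governs in Spark SQL.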

August 5, 2017 • Apache Spark SQL

Join types in Spark SQL

Spark SQL mirrors as many relational database concepts as possible. One of them is the join, which can be defined in one of 7 forms.

July 30, 2017 • Apache Spark

Dynamic resource allocation in Spark

Defining a universal workload and allocating the corresponding resources is always difficult. Even if the expected resources support the load most of the time, there will always be some period of the year when data activity grows (e.g. Black Friday). One of Spark's mechanisms that helps prevent processing failures in such situations is dynamic resource allocation.

July 22, 2017 • Apache Spark SQL

Partitioning RDBMS data in Spark SQL

Without an explicit partitioning definition, Spark SQL won't partition data read from an RDBMS, i.e. all rows will be processed by one executor. This is not optimal, since Spark was designed for parallel and distributed processing.
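
Spark's JDBC reader accepts the partitionColumn, lowerBound, upperBound and numPartitions options to split such a read. The sketch below is a simplified model of how a numeric range could be turned into one WHERE clause per partition; the function name is hypothetical and the exact boundary arithmetic used by Spark may differ.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Return one SQL predicate per partition for a numeric column."""
    if num_partitions < 2:
        # Single partition: no filter, one query reads every row.
        return []
    stride = (upper - lower) // num_partitions
    if stride < 1:
        raise ValueError("range is too small for the requested number of partitions")

    predicates = []
    for i in range(num_partitions):
        start = lower + i * stride
        end = start + stride
        if i == 0:
            # The first partition also takes NULLs and values below the lower bound.
            predicates.append(f"{column} < {end} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # The last partition is open-ended, so values above the upper bound are kept.
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {end}")
    return predicates


predicates = partition_predicates("id", 0, 100, 4)
```

In this model the bounds only decide how the rows are distributed: the first and last predicates are open-ended, so no row is filtered out.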

July 22, 2017 • Apache Spark SQL

Loading data from RDBMS

Structured data processing is taking up more and more space in the Apache Spark project. Structured Streaming is one proof of that. But how does Spark SQL work and, in particular, how does it load data from structured data sources such as an RDBMS?

July 15, 2017 • Apache Spark

Collecting a part of data to the driver with RDD toLocalIterator

The golden rule when dealing with a lot of data is to avoid bringing all of it onto a single node. Doing so can quickly lead to OOM errors. Spark is no exception to this rule. But when moving data to the driver is mandatory, Spark provides one solution that can reduce the number of objects brought there at once: the toLocalIterator method.
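
The idea can be modelled in a few lines of plain Python. This is only a sketch: the fetch_partition callback and the in-memory partitions list are assumptions standing in for the cluster, not Spark APIs.

```python
def to_local_iterator(fetch_partition, num_partitions):
    """Model of toLocalIterator: fetch one partition at a time, on demand."""
    for index in range(num_partitions):
        # Only this partition is held in driver memory while it is consumed.
        for row in fetch_partition(index):
            yield row


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
fetched = []  # records which partitions were actually requested


def fetch_partition(index):
    # Hypothetical stand-in for asking an executor for one partition.
    fetched.append(index)
    return partitions[index]


rows = to_local_iterator(fetch_partition, len(partitions))
first_four = [next(rows) for _ in range(4)]
```

After reading four rows, only the first two partitions have been requested. Collecting everything at once would instead materialize all three partitions on the driver.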

July 15, 2017 • Apache Spark

Shading as solution for dependency hell in Spark

Using Spark in an AWS environment can sometimes be problematic, especially when the dependency hell problem appears. Fortunately, it can be resolved pretty easily with shading.

July 9, 2017 • Apache Spark

Apache Spark blocks explained

In Spark, blocks are everywhere. They represent broadcast objects, they support the intermediate steps of the shuffle process, and they're used to store temporary files. But they're very often overlooked at the beginning in favor of more prominent concepts such as transformations and actions, even though neither would be possible without blocks.

July 9, 2017 • Apache Spark

Failed tasks resubmit

A lot of things are automated in Spark: metadata and data checkpointing and task distribution, to name only a few. Another one, not mentioned very often, is the automatic retry of failed tasks.
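
As a rough illustration of the retry behaviour, the sketch below reruns a task until it succeeds or has failed a fixed number of times. The default of 4 mirrors the default of Spark's spark.task.maxFailures property; the function names and the flaky task are hypothetical, and a real scheduler does much more (for example rescheduling on another executor).

```python
def run_task(task, max_failures=4):
    """Run task(attempt) until it succeeds or has failed max_failures times."""
    last_error = None
    for attempt in range(max_failures):
        try:
            return task(attempt)
        except Exception as error:  # a real scheduler would inspect the failure reason
            last_error = error
    # Every allowed attempt failed: give up and surface the last failure.
    raise RuntimeError(f"task failed {max_failures} times") from last_error


def flaky(attempt):
    # Hypothetical task that fails on its first two attempts.
    if attempt < 2:
        raise IOError("executor lost")
    return f"succeeded on attempt {attempt}"


outcome = run_task(flaky)
```

With max_failures set to 4, a task gets one initial attempt and up to three retries before the whole run is reported as failed.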

July 2, 2017 • Apache Spark Streaming

Graceful shutdown explained

Spark has different methods to reduce data loss, including during stream processing. It offers the well-known checkpointing but also a less obvious operation invoked when processing stops: graceful shutdown.

July 2, 2017 • Apache Spark

JARs split personality problem

Making errors often helps to make progress. That was my case with spark-submit and a pair of local and remote JARs. They helped me to understand the role of the driver, closures, serialization and some configuration properties.

June 25, 2017 • Apache Spark

Dockerize Spark on YARN - lessons learned

Even if a lot of Docker containers already exist for Apache Spark, it's always a good exercise to build one on your own. It can help you understand some new concepts as well as improve your skills in building Docker images.

June 17, 2017 • Apache Spark

Zoom at broadcast variables

Broadcast variables send an object to each executor only once. They can easily be used to reduce network transfer, which makes them valuable in distributed computing.

June 11, 2017 • Apache Spark Streaming

Stateful transformations with mapWithState

The updateStateByKey function, explained in the post about Stateful transformations in Spark Streaming, is not the only solution provided by Spark Streaming to deal with state. Another one, much more optimized, is mapWithState.
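
One reason for the optimization is that updateStateByKey invokes the update function for every key held in state on each batch, while mapWithState only touches the keys that received new data. The pure-Python model below illustrates that difference; the function names and the dict-based state are assumptions, and features such as state timeouts are left out.

```python
def update_state_by_key(state, batch, update):
    """Model of updateStateByKey: the update function runs for every known key."""
    touched = 0
    for key in set(state) | set(batch):
        state[key] = update(batch.get(key, []), state.get(key))
        touched += 1
    return touched


def map_with_state(state, batch, update):
    """Model of mapWithState: the function runs only for keys present in the batch."""
    touched = 0
    for key, values in batch.items():
        state[key] = update(values, state.get(key))
        touched += 1
    return touched


def running_sum(new_values, current):
    # Add the values of the current batch to the previous total (0 if none).
    return (current or 0) + sum(new_values)


batch = {"a": [1, 2]}
state_full = {"a": 10, "b": 20, "c": 30}
state_delta = dict(state_full)

touched_full = update_state_by_key(state_full, batch, running_sum)
touched_delta = map_with_state(state_delta, batch, running_sum)
```

Both calls end with the same state, but the first one visits three keys and the second only one. With millions of mostly idle keys, that gap dominates the cost of each batch.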

June 11, 2017 • Apache Spark

Spark's Singleton to be or not to be dilemma

Some time ago I was wondering why an object created once in the driver is recreated on executors with every new stage, even if this object is sent through a broadcast variable. After some code digging, an answer related to Java serialization emerged.

June 5, 2017 • Apache Spark

Serialization issues - part 2

A previous post (Serialization issues - part 1) presented some solutions for serialization problems. This post is its continuation.

June 5, 2017 • Apache Spark

Serialization issues - part 1

Issues with non-serializable objects are perhaps the most painful ones when we start to work with Spark. Fortunately, there are several solutions to them.

May 29, 2017 • Apache Spark

Deployment modes and master URLs in Spark

Spark has 2 deployment modes that can be controlled in a fine-grained way thanks to the master URL property.
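
To give an idea of what a master URL can express, the sketch below classifies a few of the documented forms (local, local[N], local[*], spark://, mesos:// and yarn). The describe_master helper and its return values are hypothetical; Spark parses these strings internally and supports more variants than shown here.

```python
import re


def describe_master(master):
    """Classify a Spark master URL string (subset of the documented forms)."""
    if master == "local":
        return ("local", 1)  # one worker thread
    match = re.fullmatch(r"local\[(\d+|\*)\]", master)
    if match:
        threads = match.group(1)
        return ("local", "all cores" if threads == "*" else int(threads))
    match = re.fullmatch(r"(spark|mesos)://(.+)", master)
    if match:
        # Cluster manager kind plus its host:port address.
        return (match.group(1), match.group(2))
    if master == "yarn":
        # The cluster location comes from the Hadoop configuration, not the URL.
        return ("yarn", None)
    raise ValueError(f"unrecognized master URL: {master}")


examples = [describe_master(url) for url in ("local[4]", "local[*]", "spark://host:7077", "yarn")]
```

The deployment mode itself (client or cluster) is a separate choice, passed to spark-submit with --deploy-mode.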

May 29, 2017 • Apache Spark Streaming

Metadata checkpoint

One of the previous posts talked about checkpoint types in Spark Streaming. This one focuses on one of them: the metadata checkpoint.

May 21, 2017 • Apache Spark SQL

Schema projection

Even if it's always better to make things explicit, in programming we often have the possibility to let the computer guess. Spark SQL also has this level of intelligence, for example during schema resolution.
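
The kind of guessing involved can be illustrated with a small pure-Python sketch that infers field types from sample records. It is not Spark SQL's algorithm: the infer_schema function and its widening rules (int plus float becomes float, any other conflict becomes str) are assumptions chosen for illustration.

```python
def infer_schema(records):
    """Infer a field -> type-name mapping from a list of dict records."""
    schema = {}
    for record in records:
        for field, value in record.items():
            if value is None:
                # Nulls carry no type information; just remember the field exists.
                schema.setdefault(field, None)
                continue
            seen = type(value).__name__
            known = schema.get(field)
            if known is None or known == seen:
                schema[field] = seen
            elif {known, seen} == {"int", "float"}:
                schema[field] = "float"  # widen numeric types
            else:
                schema[field] = "str"  # fall back to string on conflict
    # Fields that only ever held nulls default to string.
    return {field: kind or "str" for field, kind in schema.items()}


# Hypothetical sample records.
rows = [
    {"id": 1, "price": 10, "label": None},
    {"id": 2, "price": 9.5, "label": "sale"},
]
schema = infer_schema(rows)
```

The sketch has to read every record before it can answer, which hints at why declaring a schema explicitly is usually cheaper than letting it be inferred.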
