Collecting a part of data to the driver with RDD toLocalIterator

The golden rule when you deal with a lot of data is to avoid bringing all of it onto a single node, since doing so can quickly lead to OOM errors. Spark is no exception to this rule. But when such a move is mandatory, Spark provides one solution that reduces the number of objects brought to the driver at once: the toLocalIterator method.
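
As a rough illustration of the idea, here is a plain-Python sketch that does not use Spark at all. It models partitions as lists, and `collect_all` and `to_local_iterator` are hypothetical helpers, not Spark APIs. The first materializes every record at once, while the second hands over one partition at a time, so the peak amount of data held on the "driver" is bounded by the largest partition.

```python
from typing import Iterator, List

# Hypothetical stand-in for an RDD: a list of partitions, each a list of records.
partitions: List[List[int]] = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Sizes of the chunks handed to the "driver", recorded to compare peak usage.
fetched_sizes: List[int] = []


def collect_all(parts: List[List[int]]) -> List[int]:
    """Model of a collect-like call: every record is materialized at once."""
    result = [record for part in parts for record in part]
    fetched_sizes.append(len(result))
    return result


def to_local_iterator(parts: List[List[int]]) -> Iterator[int]:
    """Model of a toLocalIterator-like call: one partition is fetched at a time."""
    for part in parts:
        fetched_sizes.append(len(part))  # only this partition is held at this point
        yield from part


all_records = collect_all(partitions)
collect_peak = max(fetched_sizes)

fetched_sizes.clear()
# list() is used here only to compare results; real code would consume lazily.
streamed_records = list(to_local_iterator(partitions))
iterator_peak = max(fetched_sizes)
```

In this toy model both approaches return the same records, but the largest single chunk is the whole dataset in the first case and only the biggest partition in the second.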

Shading as solution for dependency hell in Spark

Using Spark in an AWS environment can sometimes be problematic, especially when the dependency hell problem appears. Fortunately, it can be resolved pretty easily with shading.

Apache Spark blocks explained

In Spark, blocks are everywhere. They represent broadcast objects, support the intermediate steps of the shuffle process, and store temporary files. But they're very often disregarded at the beginning in favor of more prominent concepts such as transformations and actions, even though neither of them would be possible without blocks.

Failed tasks resubmit

A lot of things are automated in Spark: metadata and data checkpointing and task distribution, to name only a few. Another one, not mentioned very often, is the automatic retry of failed tasks.

Graceful shutdown explained

Spark has different methods to reduce data loss, including during stream processing. It offers the well-known checkpointing but also a less obvious operation invoked when processing stops: graceful shutdown.

JARs split personality problem

Making errors often helps to progress. That was my case with spark-submit and a local/remote JAR pair. They helped me to understand the role of the driver, closures, serialization and some configuration properties.

Dockerize Spark on YARN - lessons learned

Even if a lot of Docker images exist for Apache Spark, it's always a good exercise to build one on your own. It can help to understand some new concepts as well as improve your skills in building Docker images.

Zoom at broadcast variables

Broadcast variables send an object to each executor only once. They can easily be used to reduce network transfer, which makes them precious in distributed computing.

Stateful transformations with mapWithState

The updateStateByKey function, explained in the post about stateful transformations in Spark Streaming, is not the only solution provided by Spark Streaming to deal with state. Another one, much more optimized, is mapWithState.

Spark's Singleton to be or not to be dilemma

Some time ago I was wondering why an object created once in the driver is recreated on executors with every new stage, even if this object is sent through a broadcast variable. After some code digging, an answer related to Java serialization appeared.

Serialization issues - part 2

One of the previous posts (Serialization issues - part 1) presented some solutions for serialization problems. This post is its continuation.

Serialization issues - part 1

Issues with non-serializable objects are maybe the most painful ones when we start to work with Spark. But fortunately there are several solutions to them.

Deployment modes and master URLs in Spark

Spark has 2 deployment modes that can be controlled in a fine-grained way thanks to the master URL property.

Metadata checkpoint

One of the previous posts talked about checkpoint types in Spark Streaming. This one focuses on one of them: the metadata checkpoint.

Schema projection

Even if it's always better to be explicit, in programming we often have the possibility to let the computer guess. Spark SQL also has this level of intelligence, for example during schema resolution.

Code execution on driver and executors

Keeping in mind which parts of Spark code are executed on the driver and which ones on the workers is important and can help to avoid some annoying errors, such as the ones related to serialization.

Tree aggregations in Spark

Like every library, Spark has methods that are used more often than others. Among the frequently used ones we could certainly list map or filter. On the other side, among the less popular transformations, we could place the tree-like methods that are presented in this post.
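
To give a feel for what "tree-like" means here, the following is a minimal plain-Python sketch, not Spark code. The helper `tree_reduce` and its `fan_in` parameter are hypothetical names chosen for illustration. Instead of sending every partial result to a single place, partial results are merged in small groups, level by level, until one value remains.

```python
from typing import Callable, List


def tree_reduce(partials: List[int],
                combine: Callable[[int, int], int],
                fan_in: int = 2) -> int:
    """Merge partial results level by level, fan_in values at a time."""
    if not partials:
        raise ValueError("nothing to reduce")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(partials)
    while len(level) > 1:
        next_level = []
        for start in range(0, len(level), fan_in):
            group = level[start:start + fan_in]
            merged = group[0]
            for value in group[1:]:
                merged = combine(merged, value)
            next_level.append(merged)
        level = next_level  # one level of the tree is done
    return level[0]


# Each inner list models one partition; its sum is the partition's partial result.
partition_sums = [sum(p) for p in [[1, 2], [3, 4], [5, 6], [7, 8]]]
total = tree_reduce(partition_sums, lambda a, b: a + b)
```

With four partitions and a fan-in of 2, the final merge step only has to combine two intermediate values rather than four.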

isEmpty() trap in Spark

In general, Spark's actions reflect logic implemented in a lot of equivalent methods in programming languages. As an example we can consider isEmpty(), which in Spark only checks for the existence of 1 element, similarly to isEmpty() in Java's List. But it can often lead to trouble, especially when more than 1 action is invoked.
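
One plausible source of such trouble, stated here as an assumption rather than as the post's exact point, is that every action on an uncached dataset triggers its own computation. The plain-Python sketch below models this with a hypothetical `LazyDataset` class (not a Spark API) that counts how many times the source is evaluated.

```python
from typing import Callable, Iterable, Iterator, List

_MISSING = object()  # sentinel marking "no element found"


class LazyDataset:
    """Hypothetical model of an uncached lazy dataset: each action re-runs the source."""

    def __init__(self, source: Callable[[], Iterable[int]]) -> None:
        self._source = source
        self.evaluations = 0  # how many times the source was computed

    def _compute(self) -> Iterator[int]:
        self.evaluations += 1
        return iter(self._source())

    def is_empty(self) -> bool:
        # Only the first element is needed, but the source is still triggered.
        return next(self._compute(), _MISSING) is _MISSING

    def collect(self) -> List[int]:
        return list(self._compute())


dataset = LazyDataset(lambda: [x * 2 for x in range(3)])
empty = dataset.is_empty()                      # first action
records = [] if empty else dataset.collect()    # second action
```

In this model the emptiness check followed by a second action evaluates the source twice, which hints at why combining isEmpty() with other actions deserves attention.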

Testing strategies in Spark

After writing a post about testing Spark applications, I decided to take a look at the Spark project's own tests and see which patterns they use to verify framework features.

Testing Spark applications

It's difficult to contest the importance of testing in programming. Tests help to avoid regressions (a lot of regressions) and also to better understand the developed code. Spark (and other data processing frameworks, by the way) is not an exception to this rule. But, obviously, testing applications working in distributed mode is trickier than testing standalone programs.

