Apache Spark articles

JARs split personality problem

Often making errors helps to progress. It was my case with spark-submit and local/remote JAR pair. They helped me to understand the role of driver, closures, serialization and some configuration properties.

Continue Reading →

Dockerize Spark on YARN - lessons learned

Even if a lot of Docker containers exist for Apache Spark, it's always a good exercise to make one in your own. It can help to understand some new concepts as well as improve skills of building Docker images.

Continue Reading →

Zoom at broadcast variables

Broadcast variables send object to executors only once and can be easily used to reduce network transfer and thus are precious in terms of distributed computing.

Continue Reading →

Spark's Singleton to be or not to be dilemma

Some time ago I was wondering why an object created once in the driver is recreated every time with new stage on executors - even if this object is sent through a broadcast variable. After some code digging, the response related to Java serialization appeared.

Continue Reading →

Serialization issues - part 2

Some of previous posts (Serialization issues - part 1) presented some of solutions for serialization problems. This post is its continuation.

Continue Reading →

Serialization issues - part 1

Issues with not serializable objects are maybe the most painful when we start to work with Spark. But hopefully there are several solutions to them.

Continue Reading →

Deployment modes and master URLs in Spark

Spark has 2 deployment modes that can be controlled in fine-grained way thanks to master URL property.

Continue Reading →

Code execution on driver and executors

Keeping in mind which parts of Spark code are executed on driver and which ones on workers is important and can help to avoid some of annoying errors, as the ones related to serialization.

Continue Reading →

Tree aggregations in Spark

As every library, Spark has methods than are used more often than the others. As often used methods we could certainly define map or filter. In the other side of less popular transformations we could place, among others, tree-like methods that will be presented in this post.

Continue Reading →

isEmpty() trap in Spark

In general Spark's actions reflects logic implemented in a lot of equivalent methods in programming languages. As an example we can consider isEmpty() that in Spark checks the existence of only 1 element and similarly in Java's List. But it can often lead to troubles, especially when more than 1 action is invoked.

Continue Reading →

Testing strategies in Spark

After writing a post about testing Spark applications, I decided to take a look at Spark project tests and see which patterns they use to verify framework features.

Continue Reading →

Testing Spark applications

It's difficult to contest the importance of testing in programming. Tests help to avoid regressions (a lot of regressions) and also to better understand developed code. Spark (and other data processing frameworks by the way) is not an exception of this rule. But, obviously, testing applications working in distributed mode is more tricky than in the case of standalone programs.

Continue Reading →

Jobs, stages and tasks

Every distributed computation is divided in small parts called jobs, stages and tasks. It's useful to know them especially during monitoring because it helps to detect bottlenecks.

Continue Reading →

Configuration of Spark architecture members

Often a misconfiguration is the reason of all kinds of issues - performance, security or functional. Spark isn't an exception for this rule and it's the reason why this article focuses on configuration properties available for driver and executors.

Continue Reading →

Spark shuffle - complementary notes

This small post is the complement for previous article describing big lines of shuffle. It focuses more in details on writing part.

Continue Reading →

Memory management in Spark

Memory management in Spark went through some changes. In the first versions, the allocation had a fix size. Only the 1.6 release changed it to more dynamic behavior. This change will be the main topic of the post.

Continue Reading →

Shuffling in Spark

As already told in one of previous posts about Spark, shuffle is a process which moves data between nodes. It's orchestrated by a specific manager and it will be the topic of this post.

Continue Reading →

Cache in Spark

Cache is an appreciable tool when we have a greedy computation generating a lot of data. Spark also uses this feature to better handle the case of RDD which generation is heavy (for example necessities database connection or data retrieval from external web services).

Continue Reading →

Checkpointing in Spark

Checkpointing is, alongside caching, a method allowing to make a RDD persist. But there are some subtle differences between cache and checkpoint.

Continue Reading →

Serialization in Spark

Serialization frameworks are intrinsic part of Big Data systems. Spark is not an exception for this rule and it offers some different possibilities to manage serialization.

Continue Reading →