Apache Spark articles

on waitingforcode.com

Memory management in Spark

Memory management in Spark went through some changes. In the first versions, the allocation had a fixed size. Only with the 1.6 release did it become more dynamic. This change will be the main topic of the post. Continue Reading →

Shuffling in Spark

As mentioned in one of the previous posts about Spark, shuffle is a process that moves data between nodes. It's orchestrated by a dedicated manager, which will be the topic of this post. Continue Reading →

Cache in Spark

Cache is a valuable tool when an expensive computation generates a lot of data. Spark also uses it to better handle RDDs that are costly to generate (for example, those that need a database connection or data retrieval from external web services). Continue Reading →

Per-partition operations in Spark

Spark was developed to work on large amounts of data. If "large" means millions of items and one or several costly operations are done for every item, it will quickly lead to performance problems. It's one of the reasons why Spark offers operations executed once per partition. Continue Reading →
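The cost difference described above can be illustrated with a minimal plain-Python sketch. It does not use the Spark API: the partitions are ordinary lists, and `CostlyResource`, `per_item` and `per_partition` are hypothetical names introduced only for this illustration. It assumes the costly operation is the creation of a resource such as a database connection.

```python
from typing import List


class CostlyResource:
    """Hypothetical stand-in for an expensive resource, e.g. a database connection."""

    opened = 0  # counts how many times the resource was created

    def __init__(self) -> None:
        CostlyResource.opened += 1

    def enrich(self, item: int) -> int:
        # Placeholder for the real work done with the resource.
        return item * 2


def per_item(partitions: List[List[int]]) -> List[List[int]]:
    # A new resource is created for every single item.
    return [[CostlyResource().enrich(x) for x in part] for part in partitions]


def per_partition(partitions: List[List[int]]) -> List[List[int]]:
    # The resource is created once per partition and reused for all its items.
    result = []
    for part in partitions:
        resource = CostlyResource()
        result.append([resource.enrich(x) for x in part])
    return result


# Assumed sample data: 10 items spread over 3 partitions.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]

CostlyResource.opened = 0
item_result = per_item(partitions)
opened_per_item = CostlyResource.opened

CostlyResource.opened = 0
partition_result = per_partition(partitions)
opened_per_partition = CostlyResource.opened

print(opened_per_item, opened_per_partition)
```

Both variants produce the same values, but the number of resource creations follows the number of items in the first case and the number of partitions in the second.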

Shared variables in Spark

Spark has an interesting concept of variables shared among all distributed computations. One kind of such objects is called broadcast variables. But it's not the only way to share objects in Spark. The second one is accumulators. Continue Reading →

Directed Acyclic Graph in Spark

As we already know, the RDD is the main data concept of Spark. It's created either explicitly or implicitly, through computations called transformations and actions. These computations are all organized as a graph and scheduled by Spark's components. This graph is called a DAG (Directed Acyclic Graph) and it's the main topic of this post. Continue Reading →

Transformations in Spark

One of the ways to generate a new RDD consists in applying transformations to already existing RDDs. But transformations not only make new RDDs, they also give meaning to the whole data processing. Continue Reading →