Apache Spark articles

on waitingforcode.com

Cache in Spark

Cache is an appreciable tool when we have a greedy computation generating a lot of data. Spark also uses this feature to better handle the case of RDD which generation is heavy (for example necessities database connection or data retrieval from external web services). Continue Reading →

Per-partition operations in Spark

Spark was developed to work on big amount of data. If big means millions of items. For every item one or several costly operations are done, it'll lead quick to performance problems. It's one of the reasons why Spark proposes operations executed once per partition. Continue Reading →

Shared variables in Spark

Spark has an interesting concept of shared variables among all distributed computations. This special kind of objects is called broadcast variables. But it's not the single possibility to share objects in Spark. The second one are accumulators. Continue Reading →

Directed Acyclic Graph in Spark

As we already know, RDD is the main data concept of Spark. It's created either explicitly or implicitly, through computations called transformations and actions. But these computations are all organized as a graph and scheduled by Spark's components. This graph is called DAG and it's the main topic of this post. Continue Reading →

Transformations in Spark

One of methods generating new RDD consists on applying transformations on already existent RDDs. But transformations not only makes new RDDs but also gives a sense to all data processing. Continue Reading →