Apache Spark articles

Shuffling in Spark

As mentioned in one of the previous posts about Spark, shuffle is the process that moves data between nodes. It's orchestrated by a dedicated manager, and it's the topic of this post.
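
As a quick illustration, a wide transformation such as groupByKey forces all values sharing a key onto the same partition, which triggers a shuffle. A minimal sketch, assuming a local SparkContext (names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("shuffle").setMaster("local[2]"))

  // groupByKey is a wide transformation: all values sharing a key must
  // end up in the same partition, so Spark moves them across the network
  val grouped = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    .groupByKey()

  grouped.collect().foreach(println)
  sc.stop()
}
```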

Continue Reading →

Cache in Spark

Caching is a valuable tool when a greedy computation generates a lot of data. Spark uses this feature to better handle RDDs whose generation is expensive (for example, requiring a database connection or data retrieval from external web services).
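
A minimal sketch of the idea, assuming a local SparkContext; expensiveLookup is a hypothetical stand-in for a database or web service call:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("cache").setMaster("local[2]"))

  // stand-in for a costly call, e.g. a database or web service lookup
  def expensiveLookup(id: Int): String = { Thread.sleep(10); s"row-$id" }

  val rows = sc.parallelize(1 to 100).map(expensiveLookup).cache()

  rows.count()                          // first action materializes and caches the RDD
  rows.filter(_.endsWith("0")).count()  // reuses the cached data, no second lookup
  sc.stop()
}
```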

Continue Reading →

Checkpointing in Spark

Checkpointing is, alongside caching, a way to persist an RDD. But there are some subtle differences between caching and checkpointing.
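
A minimal sketch, assuming a local SparkContext and an illustrative checkpoint directory:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("checkpoint").setMaster("local[2]"))
  sc.setCheckpointDir("/tmp/spark-checkpoints") // illustrative path

  val numbers = sc.parallelize(1 to 100).map(_ * 2)
  numbers.checkpoint() // marks the RDD for checkpointing
  numbers.count()      // the action writes the data to the checkpoint directory

  // unlike cache(), checkpointing truncates the lineage
  println(numbers.toDebugString)
  sc.stop()
}
```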

Continue Reading →

Serialization in Spark

Serialization frameworks are an intrinsic part of Big Data systems. Spark is no exception to this rule, and it offers several options for managing serialization.
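
For instance, switching from the default Java serialization to Kryo is a matter of configuration. A minimal sketch, assuming a local SparkContext; User is an illustrative class:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SerializationSketch extends App {
  case class User(id: Int, login: String) // illustrative class to register

  val conf = new SparkConf()
    .setAppName("serialization").setMaster("local[2]")
    // switch from the default Java serialization to Kryo
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(classOf[User]))

  val sc = new SparkContext(conf)
  sc.parallelize(Seq(User(1, "a"), User(2, "b"))).collect().foreach(println)
  sc.stop()
}
```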

Continue Reading →

Per-partition operations in Spark

Spark was developed to work on big amounts of data. If big means millions of items, and one or several costly operations are done for every item, performance problems will quickly follow. It's one of the reasons why Spark offers operations executed once per partition.
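
A minimal sketch with mapPartitions, assuming a local SparkContext; lookup stands in for work that needs a costly resource:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PerPartitionSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("per-partition").setMaster("local[2]"))

  val results = sc.parallelize(1 to 10, numSlices = 2).mapPartitions { items =>
    // stand-in for opening a costly resource (e.g. a database connection)
    // once per partition instead of once per element
    def lookup(i: Int): String = s"value-$i"
    items.map(lookup)
  }

  results.collect().foreach(println)
  sc.stop()
}
```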

Continue Reading →

Shared variables in Spark

Spark has an interesting concept of variables shared among all distributed computations. The first kind of these special objects is called broadcast variables. But they're not the only way to share objects in Spark; the second one is accumulators.
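
A minimal sketch of both, assuming a local SparkContext and Spark 2.x's longAccumulator API; the country map is illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("shared").setMaster("local[2]"))

  // broadcast variable: read-only copy shipped once to every executor
  val countryNames = sc.broadcast(Map("pl" -> "Poland", "fr" -> "France"))
  // accumulator: write-only counter aggregated back on the driver
  val unknownCodes = sc.longAccumulator("unknown codes")

  val resolved = sc.parallelize(Seq("pl", "fr", "xx")).map { code =>
    val name = countryNames.value.get(code)
    if (name.isEmpty) unknownCodes.add(1L)
    name.getOrElse("unknown")
  }

  resolved.collect().foreach(println)
  println(s"unknown codes: ${unknownCodes.value}")
  sc.stop()
}
```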

Continue Reading →

Partitioning in Spark

Partitioning is quite a common concept in distributed data systems. Spark is no exception, and it also offers operations related to partitions.
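
A minimal sketch with partitionBy and a HashPartitioner, assuming a local SparkContext:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitioningSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("partitioning").setMaster("local[2]"))

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

  // partitionBy redistributes the pairs so that equal keys
  // always land in the same partition
  val partitioned = pairs.partitionBy(new HashPartitioner(2))

  println(s"partitions: ${partitioned.getNumPartitions}")
  sc.stop()
}
```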

Continue Reading →

Spark architecture members

Knowing Spark's API is not the only useful thing. It's also important to know when, and by whom, programs are executed.

Continue Reading →

Directed Acyclic Graph in Spark

As we already know, the RDD is Spark's main data abstraction. It's created either explicitly or implicitly, through computations called transformations and actions. But these computations are all organized as a graph and scheduled by Spark's components. This graph is called the DAG, and it's the main topic of this post.
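
The lineage behind this graph can be inspected with toDebugString. A minimal sketch, assuming a local SparkContext:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DagSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("dag").setMaster("local[2]"))

  val lineage = sc.parallelize(1 to 10)
    .map(_ * 2)
    .filter(_ > 5)

  // toDebugString prints the RDD's lineage, i.e. the chain of parent
  // RDDs that Spark's scheduler turns into the stages of the DAG
  println(lineage.toDebugString)
  sc.stop()
}
```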

Continue Reading →

Actions in Spark

In Spark, actions produce the final results of operations on RDDs. Without them, transformations are meaningless and difficult for applications to use.
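
A minimal sketch contrasting a lazy transformation with the actions that actually trigger it, assuming a local SparkContext:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionsSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("actions").setMaster("local[2]"))

  val doubled = sc.parallelize(1 to 5).map(_ * 2) // lazy: nothing runs yet

  // actions trigger the computation and return a result to the driver
  println(doubled.count())          // 5
  println(doubled.collect().toList) // List(2, 4, 6, 8, 10)
  sc.stop()
}
```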

Continue Reading →

Transformations in Spark

One of the ways to generate a new RDD consists of applying transformations to already existing RDDs. But transformations not only make new RDDs, they also give meaning to the whole data processing.
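
A minimal sketch of chained transformations, assuming a local SparkContext:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationsSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("transformations").setMaster("local[2]"))

  // each transformation lazily produces a new RDD from an existing one
  val words = sc.parallelize(Seq("spark", "rdd", "transformation"))
  val lengths = words.map(_.length)         // RDD[String] -> RDD[Int]
  val longWords = lengths.filter(_ > 4)     // keeps only lengths above 4

  println(longWords.collect().toList)
  sc.stop()
}
```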

Continue Reading →

Data representation in Spark - RDD

The first post about Spark internals concerns the Resilient Distributed Dataset (RDD), the abstraction used to represent processed data.
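
A minimal sketch of creating an RDD explicitly with parallelize, assuming a local SparkContext:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("rdd").setMaster("local[2]"))

  // an RDD is an immutable, partitioned collection distributed over the
  // cluster; parallelize is the simplest way to create one explicitly
  val numbers = sc.parallelize(1 to 100, numSlices = 4)

  println(s"partitions: ${numbers.getNumPartitions}, sum: ${numbers.sum()}")
  sc.stop()
}
```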

Continue Reading →