The first post about Spark internals concerns the Resilient Distributed Dataset (RDD), the core abstraction Spark uses to represent the data being processed.
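As a minimal sketch of that abstraction (assuming a local Spark setup; the object name and sample data are purely illustrative), the snippet below shows that an RDD is built from a source collection, that transformations such as `map` and `filter` are lazy, and that only an action like `collect` triggers the computation of the whole lineage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions
    val sc = new SparkContext(new SparkConf().setAppName("rdd-example").setMaster("local[*]"))

    // An RDD is an immutable, partitioned collection distributed across executors
    val numbers = sc.parallelize(1 to 10, numSlices = 2)
    val squares = numbers.map(n => n * n)        // lazy transformation, nothing runs yet
    val evenSquares = squares.filter(_ % 2 == 0) // still lazy

    // The action forces the evaluation of the whole lineage
    println(evenSquares.collect().mkString(", "))

    sc.stop()
  }
}
```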
In batch-oriented Spark, the RDD is the core data abstraction. In Spark Streaming, RDDs are still present under the hood, but the programmer is exposed to another data type: the DStream.
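A minimal Spark Streaming sketch, assuming a text source listening on localhost:9999 (for example started with `nc -lk 9999`); the batch interval and names are illustrative. The programmer manipulates a DStream, while each micro-batch is still executed as a regular RDD job.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-example").setMaster("local[2]")
    // A DStream is a sequence of RDDs, one per micro-batch (here every 5 seconds)
    val ssc = new StreamingContext(conf, Seconds(5))

    // Assumed source: a plain text socket stream on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Each micro-batch of word counts is computed as an ordinary RDD under the hood
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```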
Spark 2.0 brought several changes at the API level. One of them was the merge of DataFrame into Dataset. Thanks to that, the third separate data abstraction, still present in 1.6, was finally removed.
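A small sketch of what that merge means in practice, with a hypothetical `User` case class: since 2.0, `DataFrame` is just a type alias for `Dataset[Row]`, and a typed view of the same data is obtained with `.as[T]`.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical case class used only for this illustration
case class User(id: Long, name: String)

object DataFrameDatasetMerge {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-ds-merge").master("local[*]").getOrCreate()
    import spark.implicits._

    // Since Spark 2.0, DataFrame is only a type alias for Dataset[Row]
    val users: DataFrame = Seq((1L, "alice"), (2L, "bob")).toDF("id", "name")
    val sameThing: Dataset[Row] = users // no conversion needed, it is the same type

    // A typed view of the same data, checked at compile time
    val typedUsers: Dataset[User] = users.as[User]
    typedUsers.filter(_.name.startsWith("a")).show()

    spark.stop()
  }
}
```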
One of the problems with data processing frameworks released in the past few years was the use of different abstractions for batch and streaming tasks. Apache Beam is an exception to this rule because it proposes a uniform data representation called PCollection.
Apache Spark builds all of its higher-level data structures on a common abstraction. This implementation rule is no different for GraphX, where a graph is represented by a set of specialized versions of RDDs.
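A minimal GraphX sketch (the vertex and edge data are made up for the illustration) showing that a Graph is assembled from two plain RDDs, which GraphX then stores as specialized VertexRDD and EdgeRDD instances.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphxExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-example").setMaster("local[*]"))

    // Vertices and edges start as plain RDDs; GraphX wraps them into specialized versions
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

    // The resulting Graph exposes them as a VertexRDD and an EdgeRDD, two RDD subclasses
    val graph = Graph(vertices, edges)
    println(s"vertices: ${graph.vertices.count()}, edges: ${graph.edges.count()}")

    sc.stop()
  }
}
```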
After last week's global overview of graph representation in the GraphX module, it's time to go a bit deeper and analyze the two main components of a graph: vertices and edges. We'll begin here with the former.
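A short sketch focused on the vertex side, again with illustrative data: `graph.vertices` is a VertexRDD, an RDD of (VertexId, attribute) pairs, and operations like `mapValues` reuse its internal index and only transform the attached attribute.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexRDD}

object VertexRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("vertexrdd-example").setMaster("local[*]"))

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    val graph = Graph(vertices, edges)

    // graph.vertices is a VertexRDD[String]: an RDD[(VertexId, String)] indexed by vertex id
    val vertexRdd: VertexRDD[String] = graph.vertices

    // mapValues keeps the index and only transforms the vertex attribute
    val upperCased: VertexRDD[String] = vertexRdd.mapValues(name => name.toUpperCase)
    upperCased.collect().foreach { case (id, name) => println(s"$id -> $name") }

    sc.stop()
  }
}
```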
After last week's discovery of VertexRDD, we still have one graph-composing item to explain: EdgeRDD. After all, a graph is about relationships, and this RDD guarantees the links between vertices.
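And the edge counterpart, with made-up data once more: `graph.edges` is an EdgeRDD, an RDD of Edge objects that each carry the source vertex id, the destination vertex id, and an attribute describing the relationship.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, EdgeRDD, Graph}

object EdgeRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("edgerdd-example").setMaster("local[*]"))

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "likes")))
    val graph = Graph(vertices, edges)

    // graph.edges is an EdgeRDD[String]: an RDD[Edge[String]] storing srcId, dstId and attr
    val edgeRdd: EdgeRDD[String] = graph.edges

    // Each Edge carries the identifiers of the two vertices it connects
    edgeRdd.collect().foreach { e => println(s"${e.srcId} -${e.attr}-> ${e.dstId}") }

    sc.stop()
  }
}
```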