After last week's discovery of VertexRDD we have still one graph-composing item to explain - EdgeRDD. After all, the graph is about the relationships this RDD guarantees the links between vertices.
After last week's global overview of graph representation in GraphX module, it's time to go a little bit deeper and analyze the 2 main components of graphs: vertices and edges. We'll begin here with the former ones.
Apache Spark uses a common data abstraction for all its higher level data structures. This implementation rule isn't different for GraphX represented by the sets of specialized versions of RDDs.
One of the problems with data processing frameworks released in the past few years was the use of different abstractions for batch and streaming tasks. Apache Beam is an exception of this rule because it proposes a uniform data representation called PCollection.
Spark 2.0 brought some changes at API level. One of them was the merge of DataFrame with Dataset. Thanks to that the 3rd data abstraction, present yet in 1.6, was finally removed.
In Spark batch-oriented, RDD was a data abstraction. In Spark Streaming RDDs are still present but for the programmer another data type is exposed - DStream.
The first post about Spark internals concerns Resilient Distributed Dataset (RDD), an abstraction used to represent processed data.