The first post about Spark internals concerns the Resilient Distributed Dataset (RDD), the core abstraction Spark uses to represent the data being processed.
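As a minimal sketch of that abstraction (assuming a local Spark setup; the object name and sample data are purely illustrative), the snippet below shows that an RDD is built from a source collection, that transformations such as `map` and `filter` are lazy, and that only an action like `collect` triggers the computation of the whole lineage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions
    val sc = new SparkContext(new SparkConf().setAppName("rdd-example").setMaster("local[*]"))

    // An RDD is an immutable, partitioned collection distributed across executors
    val numbers = sc.parallelize(1 to 10, numSlices = 2)
    val squares = numbers.map(n => n * n)        // lazy transformation, nothing runs yet
    val evenSquares = squares.filter(_ % 2 == 0) // still lazy

    // The action forces the evaluation of the whole lineage
    println(evenSquares.collect().mkString(", "))

    sc.stop()
  }
}
```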
In batch-oriented Spark, the RDD is the core data abstraction. In Spark Streaming, RDDs are still present under the hood, but the programmer is exposed to another data type: the DStream.
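A minimal Spark Streaming sketch, assuming a text source listening on localhost:9999 (for example started with `nc -lk 9999`); the batch interval and names are illustrative. The programmer manipulates a DStream, while each micro-batch is still executed as a regular RDD job.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-example").setMaster("local[2]")
    // A DStream is a sequence of RDDs, one per micro-batch (here every 5 seconds)
    val ssc = new StreamingContext(conf, Seconds(5))

    // Assumed source: a plain text socket stream on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Each micro-batch of word counts is computed as an ordinary RDD under the hood
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```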
Spark 2.0 brought several changes at the API level. One of them was the merge of DataFrame into Dataset. Thanks to that, the third separate data abstraction, still present in 1.6, was finally removed.
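A small sketch of what that merge means in practice, with a hypothetical `User` case class: since 2.0, `DataFrame` is just a type alias for `Dataset[Row]`, and a typed view of the same data is obtained with `.as[T]`.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical case class used only for this illustration
case class User(id: Long, name: String)

object DataFrameDatasetMerge {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-ds-merge").master("local[*]").getOrCreate()
    import spark.implicits._

    // Since Spark 2.0, DataFrame is only a type alias for Dataset[Row]
    val users: DataFrame = Seq((1L, "alice"), (2L, "bob")).toDF("id", "name")
    val sameThing: Dataset[Row] = users // no conversion needed, it is the same type

    // A typed view of the same data, checked at compile time
    val typedUsers: Dataset[User] = users.as[User]
    typedUsers.filter(_.name.startsWith("a")).show()

    spark.stop()
  }
}
```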
One of the problems with data processing frameworks released in the past few years was the use of different abstractions for batch and streaming tasks. Apache Beam is an exception to this rule because it proposes a uniform data representation called PCollection.
Apache Spark builds all of its higher-level data structures on a common abstraction. This implementation rule is no different for GraphX, where a graph is represented by a set of specialized versions of RDDs.
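A minimal GraphX sketch (the vertex and edge data are made up for the illustration) showing that a Graph is assembled from two plain RDDs, which GraphX then stores as specialized VertexRDD and EdgeRDD instances.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphxExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-example").setMaster("local[*]"))

    // Vertices and edges start as plain RDDs; GraphX wraps them into specialized versions
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

    // The resulting Graph exposes them as a VertexRDD and an EdgeRDD, two RDD subclasses
    val graph = Graph(vertices, edges)
    println(s"vertices: ${graph.vertices.count()}, edges: ${graph.edges.count()}")

    sc.stop()
  }
}
```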
After last week's global overview of graph representation in the GraphX module, it's time to go a bit deeper and analyze the two main components of a graph: vertices and edges. We'll begin here with the former.
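A short sketch focused on the vertex side, again with illustrative data: `graph.vertices` is a VertexRDD, an RDD of (VertexId, attribute) pairs, and operations like `mapValues` reuse its internal index and only transform the attached attribute.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexRDD}

object VertexRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("vertexrdd-example").setMaster("local[*]"))

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    val graph = Graph(vertices, edges)

    // graph.vertices is a VertexRDD[String]: an RDD[(VertexId, String)] indexed by vertex id
    val vertexRdd: VertexRDD[String] = graph.vertices

    // mapValues keeps the index and only transforms the vertex attribute
    val upperCased: VertexRDD[String] = vertexRdd.mapValues(name => name.toUpperCase)
    upperCased.collect().foreach { case (id, name) => println(s"$id -> $name") }

    sc.stop()
  }
}
```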
After last week's discovery of VertexRDD, we still have one graph-composing item to explain: EdgeRDD. After all, a graph is about relationships, and this RDD guarantees the links between vertices.
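And the edge counterpart, with made-up data once more: `graph.edges` is an EdgeRDD, an RDD of Edge objects that each carry the source vertex id, the destination vertex id, and an attribute describing the relationship.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, EdgeRDD, Graph}

object EdgeRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("edgerdd-example").setMaster("local[*]"))

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "likes")))
    val graph = Graph(vertices, edges)

    // graph.edges is an EdgeRDD[String]: an RDD[Edge[String]] storing srcId, dstId and attr
    val edgeRdd: EdgeRDD[String] = graph.edges

    // Each Edge carries the identifiers of the two vertices it connects
    edgeRdd.collect().foreach { e => println(s"${e.srcId} -${e.attr}-> ${e.dstId}") }

    sc.stop()
  }
}
```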