Apache Spark GraphX articles

Visualizing Apache Spark GraphX data processing with websockets and cytoscape.js

For a long time, I've wanted to build a small real-time data visualization application using websockets and some fancy JavaScript visualization framework. The moment came when I was preparing the execution schemas to illustrate the distributed graph algorithms covered in the "Graph algorithms in distributed world - part 1" post. There I used static images combined together, but it was quite painful. Because of that, I decided to check whether it's possible to do the same in a more programmatic way.


Iterative algorithms with Pregel on Apache Spark GraphX

One of the important characteristics of distributed graph processing that sets it apart from the classical Map/Reduce approach is the iterative nature of many algorithms. Pregel is one of the computation models that supports this kind of processing very well, and Apache Spark GraphX comes with its own Pregel implementation.
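
To make the iterative idea concrete, here is a minimal plain-Python sketch of a Pregel-style loop computing single-source shortest paths. It is only a single-machine toy under stated assumptions, not the GraphX Pregel API: the function name `pregel_sssp`, the `(source, destination, weight)` edge tuples and the sample graph are all hypothetical and chosen for illustration.

```python
import math


def pregel_sssp(vertices, edges, source, max_supersteps=20):
    """Toy single-machine Pregel-style loop for shortest-path distances.

    Hypothetical helper for illustration only; it does not use Spark.
    """
    # Vertex state: best known distance from the source vertex.
    dist = {v: math.inf for v in vertices}
    # Superstep 0: only the source vertex receives an initial message.
    inbox = {source: [0.0]}
    supersteps = 0
    while inbox and supersteps < max_supersteps:
        outbox = {}
        for v, messages in inbox.items():
            best = min(messages)  # merge all messages sent to this vertex
            if best < dist[v]:  # vertex program: keep the smaller distance
                dist[v] = best
                # Send messages along outgoing edges only if the state changed.
                for src, dst, weight in edges:
                    if src == v:
                        outbox.setdefault(dst, []).append(best + weight)
        # Barrier: messages produced now are delivered in the next superstep.
        inbox = outbox
        supersteps += 1
    return dist, supersteps


# Assumed sample graph: (source, destination, weight) tuples.
vertices = [1, 2, 3, 4]
edges = [(1, 2, 1.0), (2, 3, 2.0), (1, 3, 5.0), (3, 4, 1.0)]
distances, rounds = pregel_sssp(vertices, edges, source=1)
print(distances, rounds)
```

The loop stops as soon as no vertex sends a message, which is the property that makes this model a natural fit for iterative algorithms: each pass only touches the vertices whose state may still change.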


Loading and saving graphs in Apache Spark GraphX

Until now we've worked only with in-memory graphs. However, Apache Spark GraphX provides much more convenient, production-ready methods to load and save them. This post shows them.


Edge partitioning strategies

Previously we learned about the vertex and edge representations in Apache Spark GraphX. At that time, to avoid introducing too many new concepts at once, we deliberately left out edge partitioning. Luckily, a new week has come and it lets us discuss it.


Edge representation in Apache Spark GraphX

After last week's discovery of VertexRDD, we still have one graph-composing item to explain: EdgeRDD. After all, a graph is about relationships, and this RDD represents the links between vertices.


Vertex representation in Apache Spark GraphX

After last week's global overview of graph representation in the GraphX module, it's time to go a little deeper and analyze the 2 main components of graphs: vertices and edges. We'll begin here with the former.


GraphX and fault-tolerance

Bad things happen in distributed data processing, and it's better to be prepared for them. To guard against such issues, Apache Spark is able to recompute a failed partition and also to store a computation snapshot as a checkpoint. Both properties apply to the GraphX module's fault-tolerance mechanism.


Graphs representation in Apache Spark GraphX

Apache Spark uses a common data abstraction for all its higher-level data structures. GraphX is no exception to this rule: it is represented by sets of specialized versions of RDDs.


Introduction to Apache Spark GraphX

Every time we learn a new topic, it's important to start from the basics. We couldn't learn a new language without knowing the order of subjects and verbs in a sentence. The same rule applies to Apache Spark's GraphX module, which will be covered in this category. So before going into details, we'll focus on its basics.
