Graph processing frameworks survey

Versions: GraphX 2.4.0, Gelly 1.6.1, Giraph 1.2.0

The series about graph processing continues. Today it's time to analyze some major graph processing frameworks and choose the one that I'll present in more detail in upcoming posts.

This article covers 3 main graph processing frameworks: Apache Spark GraphX, Apache Flink's Gelly library, and the Apache Giraph project. Their features are listed in the first section. The next part, exactly as in Choosing time-series database for study, explains which framework I chose as my graph processing learning project.

GraphX, Gelly and Giraph comparison

Bipartite graph

A bipartite graph is a graph whose vertices can be split into 2 disjoint sets so that every edge connects a vertex of one set to a vertex of the other set. No edge links two vertices of the same set.
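To make the definition more concrete, here is a minimal sketch in plain Python, not tied to any of the compared frameworks. It checks whether an undirected graph is bipartite by trying to color it with 2 colors during a breadth-first traversal. The function name `is_bipartite`, the adjacency-dictionary representation and the sample user/item data are all assumptions made for illustration.

```python
from collections import deque


def is_bipartite(adjacency):
    """Try to 2-color an undirected graph given as {vertex: [neighbors]}.

    Returns (True, coloring) when the graph is bipartite, where coloring maps
    every vertex to 0 or 1 (its set), and (False, None) otherwise.
    """
    color = {}
    # Loop over all vertices so that disconnected components are covered too.
    for start in adjacency:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            vertex = queue.popleft()
            for neighbor in adjacency.get(vertex, ()):
                if neighbor not in color:
                    # Put the neighbor in the opposite set.
                    color[neighbor] = 1 - color[vertex]
                    queue.append(neighbor)
                elif color[neighbor] == color[vertex]:
                    # An edge inside one set: the graph is not bipartite.
                    return False, None
    return True, color


# Hypothetical sample data: users on one side, items on the other.
users_items = {
    "user1": ["itemA", "itemB"],
    "user2": ["itemB"],
    "itemA": ["user1"],
    "itemB": ["user1", "user2"],
}

# A triangle can never be split into 2 such sets.
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}

bipartite, sets = is_bipartite(users_items)
not_bipartite, _ = is_bipartite(triangle)
```

In the user/item sample, both users end up in one set and both items in the other, whereas the triangle fails the check because it contains an odd cycle.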

Learning project choice

During this quick and, I admit, superficial analysis, my first reflex was to try Gelly. Its roadmap, quoted previously in this post, impressed me. However, after some deeper research, I found that not many points of that roadmap had been implemented in the official version after 3 years. What about Giraph? The project seems to have proven itself on really big real-world graphs (Facebook). However, I was a little bit scared by the lack of activity in the repository.

Thus, I naturally turned to GraphX. Even though I had the impression that it's a little bit less supported by Databricks and the community than the Structured Streaming and SQL modules, I really appreciated the GraphFrames initiative, which can be a game changer in this field. Moreover, GraphX fits pretty well with my blog, where Apache Spark, apart from some posts about Apache Beam, is the main data processing framework described.

To sum up, the next graph-related posts will be mostly about the Apache Spark GraphX module. But please keep in mind that it's only a subjective choice and the two other compared frameworks would be valid solutions too.