The series about graph processing continues. Today it's time to analyze some major graph processing frameworks and choose the one that I'll present in more detail in the upcoming posts.
This article covers 3 main graph processing frameworks: Apache Spark GraphX, Apache Flink's Gelly library, and the Apache Giraph project. The first section lists the features of all of them. The next part, exactly as in Choosing time-series database for study, explains which framework I chose for my graph processing learning project.
GraphX, Gelly and Giraph comparison
- data representation - GraphX and Gelly use the same data representation as their underlying frameworks. GraphX is based on Spark's RDDs while Gelly is based on Flink's DataSet structures. Giraph uses its own data structures, such as the Vertex and Edge classes. The properties in the graph are in turn represented with Hadoop's *Writable objects.
- data source - Giraph, since it's built on top of Hadoop, is able to process data from HDFS, HBase or Hive. Moreover, according to "Practical graph analytics with Apache Giraph", it also supports data from Cassandra. Regarding Gelly and GraphX, the available sources mention their support for specifically formatted files. However, even though I didn't find any precise examples, we can also assume that they support any data source supported by Apache Flink and Apache Spark.
- data manipulation - since both Gelly and GraphX are part of a data processing framework, they directly inherit a lot of its features, such as transformation functions (map, filter,...) or broadcast variables. Aside from these general processing functions, both frameworks also have pure graph operations, such as mapping or grouping vertices and edges. Since Giraph is built independently of any other data processing library, it uses its own functions. It obviously provides an implementation of the BSP approach but also some reducing methods such as combiners or aggregators.
- Open-Source - all 3 projects are Open Source but it doesn't mean the same thing for each of them. GraphX is a part of Apache Spark and thus its community is supported by Databricks. Gelly is a part of Apache Flink so its community works together with the data Artisans company. Finally, Giraph is a pure product of the community. Even though it's not officially owned by anybody, Facebook actively participated in its development and contributed important performance gains.
- programming model - the most flexible seems to be Gelly, which supports the 3 computation models covered in the post about graph computation model. GraphX, in addition to neighbor-related methods, also provides a variant of Pregel where messages can only be sent to the direct neighbors. Giraph in turn is purely Pregel-based and extends the model with master computation, sharded aggregators and edge-oriented input.
- partitioning - it was already discussed in the post about graph partitioning. Just to recall the conclusions, GraphX wins this comparison category since it provides hash-based partitioning together with 1-dimension and 2-dimension ones. Gelly and Giraph only support the hash-based one.
- static and dynamic graphs - in theory all analyzed frameworks handle graph mutations. GraphX does it by creating a new graph at every change of the graph structure. Gelly exposes more convenient methods to deal with graph changes - addVertex, addEdge or their corresponding removals. Giraph follows the path of Apache Flink's module and also provides pretty meaningful methods to mutate the graph.
- batch and streaming - none of the analyzed frameworks provides support for streaming data sources. The concept of stream graph processing was pushed the furthest in the case of Gelly, for which an experimental feature called Gelly streaming was proposed. However, the project still lives as an independent initiative that doesn't seem to be kept up to date.
- algorithms - in Giraph the algorithms are represented by org.apache.giraph.graph.Computation implementations. Among them, we can find PageRank, weighted PageRank, connected components, strongly connected components or max value in the graph. This list is extended by community initiatives such as the algorithms proposed in the giraph-algorithms project. GraphX has similar built-in algorithms: PageRank, personalized PageRank, triangle count, shortest paths, connected components, strongly connected components, label propagation or SVD++. Gelly has a slightly longer list of implemented algorithms. It provides solutions for label propagation, community detection, connected components, single source shortest paths, triangles, clustering, PageRank or Hyperlink-Induced Topic Search.
- community activity - once upon a time Facebook was heavily invested in Giraph development. One of the proofs is the paper "Scaling Apache Giraph to a trillion edges" published in 2013. Gelly also had a serious community in the past. But instead of a worldwide company, it mobilized researchers. The streaming graph processing project quoted previously is one of the proofs of that. Another one is an ambitious roadmap that planned to add new algorithms, different programming paradigms and partitioning algorithms. The last discussed framework, GraphX, is supported by Databricks but it also has an active community whose most visible initiative is the project called GraphFrames. It is to GraphX what Structured Streaming was to DStream-based streaming. In other words, it uses optimized DataFrame structures to represent and process graphs.
- users - finding the users of the frameworks on their official websites was not easy. That's why I used a job search on indeed for this point. After searching for the "Apache Giraph" keyword, I found among others offers from Facebook, Nielsen (marketing cloud company), Object Computing (software engineering company) and LiveRamp (identity resolution). I had less luck with Gelly because the engine returned 0 offers. I extended the search to Google with the "job" + "Apache Flink" + "Gelly" keywords but it also gave nothing. My impression is that Gelly was used very often in scientific papers. For GraphX I had more luck and found offers from companies like UC Berkeley Extension (the professional and continuing education division of the University of California), Payette Group (for IoT and security threats detection), Qualys (SaaS provider) or UnitedHealth Group (healthcare company). Please notice that the search was done at the end of September 2018 and the situation could have changed in the meantime.
- books - I was actively looking for a source about Gelly but I didn't find any book dedicated to this framework. Some books talk about it in one chapter but compared to Giraph and GraphX it's not enough. GraphX has its own book published by Manning, "GraphX in Action". A reference for Giraph is "Practical graph analytics with Apache Giraph". Even though a book doesn't guarantee a better learning process than a solid and well-illustrated documentation, it's often the best way to start using a new framework.
- distinction points - without going too deep into details we can notice some features that are present only in one of the described frameworks. Gelly offers native support for bipartite graphs. Apache Flink's module also brings a rich set of generators for different graph topologies: circulant, complete, star, cube, path and so forth. GraphX has fewer generators but on the other side it has better support for partitioning.
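To make the Pregel model and the combiners from the list above a little more concrete, here is a minimal plain Python sketch of the "max value in the graph" algorithm. It's only an illustration of the vertex-centric, superstep-based idea and not the API of Giraph, GraphX or Gelly. The function name `pregel_max_value`, its parameters and the sample graph are hypothetical, and the sketch assumes an undirected graph whose edges only reference declared vertices.

```python
from collections import defaultdict


def pregel_max_value(vertices, edges, max_supersteps=30):
    """Propagate the maximum vertex value with Pregel-like supersteps.

    vertices: dict of vertex id -> initial numeric value (hypothetical input format)
    edges: iterable of (source, target) pairs, treated here as undirected
    """
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)

    values = dict(vertices)
    # In the first superstep every vertex is active.
    active = set(values)
    supersteps = 0
    while active and supersteps < max_supersteps:
        # Messaging phase: active vertices send their value to direct neighbors.
        # The inbox acts as a "max" combiner, keeping one message per target.
        inbox = {}
        for vertex in active:
            for neighbor in neighbors[vertex]:
                if neighbor not in inbox or values[vertex] > inbox[neighbor]:
                    inbox[neighbor] = values[vertex]
        # Compute phase, run after all messages of the superstep were delivered:
        # a vertex stays active only if the received message changed its value.
        active = set()
        for vertex, message in inbox.items():
            if message > values[vertex]:
                values[vertex] = message
                active.add(vertex)
        supersteps += 1
    return values, supersteps


if __name__ == "__main__":
    # Chain 1 - 2 - 3 - 4 plus the isolated vertex 5
    sample_vertices = {1: 3, 2: 6, 3: 2, 4: 1, 5: 9}
    sample_edges = [(1, 2), (2, 3), (3, 4)]
    final_values, executed_supersteps = pregel_max_value(sample_vertices, sample_edges)
    print(final_values, executed_supersteps)
```

In this sample the vertices of the chain converge to 6, the biggest value of their connected component, while the isolated vertex keeps its 9 because it never receives any message. The computation stops as soon as no vertex changes its value, which plays the role of the "vote to halt" from the Pregel model.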
Bipartite graph
A bipartite graph is a graph composed of 2 disjoint sets of vertices. Every edge connects a vertex from one set to a vertex from the other set, so there are no edges between the vertices of the same set.
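The definition can be checked with a classical 2-coloring done with a breadth-first traversal. The snippet below is a small plain Python sketch that doesn't rely on Gelly's bipartite graph support. The function name `bipartite_sets` and the users/items sample are made up for the illustration.

```python
from collections import defaultdict, deque


def bipartite_sets(edges):
    """Return the 2 vertex sets of a bipartite graph, or None if the graph isn't bipartite.

    edges: iterable of (source, target) pairs, treated as undirected
    """
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)

    side = {}
    for start in list(neighbors):
        if start in side:
            continue
        # Every not yet visited component starts on side 0.
        side[start] = 0
        queue = deque([start])
        while queue:
            vertex = queue.popleft()
            for neighbor in neighbors[vertex]:
                if neighbor not in side:
                    # A neighbor always goes to the opposite side.
                    side[neighbor] = 1 - side[vertex]
                    queue.append(neighbor)
                elif side[neighbor] == side[vertex]:
                    # An edge inside one set breaks the bipartite property.
                    return None
    left = {vertex for vertex, value in side.items() if value == 0}
    right = {vertex for vertex, value in side.items() if value == 1}
    return left, right


if __name__ == "__main__":
    purchases = [("alice", "book"), ("alice", "film"), ("bob", "book")]
    print(bipartite_sets(purchases))
    triangle = [(1, 2), (2, 3), (3, 1)]
    print(bipartite_sets(triangle))
```

The purchases graph is split into users and items, whereas the triangle returns None because a cycle of odd length can't be divided into 2 sets without an internal edge. When the graph has several connected components, the side assigned to each of them is arbitrary.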
Learning project choice
During this quick and, I admit, superficial analysis, my first reflex was to try Gelly. It impressed me with the roadmap quoted previously in this post. However, after some deeper research, I found that not many points of that roadmap were implemented in the official version after 3 years. What about Giraph? The project seems to have proven itself on really big real-world graphs (Facebook). However, I was a little bit scared by the lack of activity in the repository.
Thus, I naturally turned to GraphX. Even though I had the impression that it's a little bit less supported by Databricks and the community than the Structured Streaming and SQL modules, I really appreciated the GraphFrames initiative that can be a game changer in this field. Moreover, GraphX fits pretty well in the environment of my blog where Apache Spark, except for some posts about Apache Beam, is the main described data processing framework.
To sum up, the next graph-related posts will be mostly about the Apache Spark GraphX module. But please keep in mind that it's only a subjective choice and the two other compared frameworks would be valid solutions too.