Graphs and data processing

In this blog I've covered the topics about relational databases, key-value stores, search engines or log systems. There are still some storage systems deserving some learning effort and one of them are graphs considered here in the context of data processing.

A virtual conference at the intersection of Data and AI. This is not a conference for the hype. Its real users talking about real experiences.
- 40+ speakers with the likes of Hannes from Duck DB, Sol Rashidi, Joe Reis, Sadie St. Lawrence, Ryan Wolf from nvidia, Rebecca from lidl
- 12th September 2024
- Three simultaneous tracks
- Panels, Lighting Talks, Keynotes, Booth crawls, Roundtables and Entertainment.
- Topics include (ingestion, finops for data, data for inference (feature platforms), data for ML observability
- 100% virtual and 100% free

👉 Register here

This post initializes new category devoted to graphs and their processing at scale. It starts by explaining some global concepts. Its first section recalls shortly graph data structures and its components. The second section focuses on the storage of graphs while the last one on their processing.

Graphs

Before talking about graphs in the context of data processing, it's important to recall some basic concepts about them:

Graph data distribution

One of main graph problems in Big Data context is partitioning. A lot of real-world graphs "suffer" from Power Law rule. Shortly speaking, this law tells that some vertices have more important impact on the graph than the others. This impact translates by degree of these vertices. The following curve shows that:

As you can see, the first vertex has X percentage points more than the second. In its turn, the second has X percentage points more than the third. The third also has the same difference with the fourth and so forth.

Another problem with graph partitioning is related to dynamic graphs. Partitioning a static graph is much simpler than doing that for a graph constantly changing. The distribution that seems right at current time may become wrong a week later when for instance new hotspots appear.

To mitigate both of them, the scientific word works hard and proposes a lot of different distribution algorithms. In graph context we often consider partitions as cuts. From that 2 different distribution approaches exist, applied either on edges or on vertices. The first approach is edge-cut and the name suggests, it cuts the graph by edges. In result of this the graph vertices are assigned to different partitions. It brings a risk to allocate connected vertices on different machines and thus the need of network communication to get the edge. To reduce this risk the adjacent vertices are often placed on the same partition. But since we talk about assigning vertices, we can immediately find out that it would probably lead to uneven distribution because of Power Law. It's why another strategy, vertex-cut, uses different idea. It cuts graph by vertices and distributes edges. The following image sums up these 2 strategies:

Graphs processing

The reflection about graph processing technique should start with MapReduce approach, widely adopted by most of Big Data frameworks. But it's not well suited to graphs because a lot of graph algorithms are iterative. In the case of MapReduce it would mean the necessity to pass graph state at every computation stage, complex joins and also a lot of I/O work increasing with each new iteration. It's why more efficient processing strategies exists and 2 of them gained a lot of popularity nowadays.

The first approach is called vertex-centric or "Think like a vertex". It puts the vertices in the center of computation. The program is divided in iterations called supersteps. In each superstep an user defined processing function applies on all graph vertices. The iteration stops when all vertices execute the function. Some of vertices can later propagate the result of their computation to other vertices in the form of message.

Another strategy is known as graph-centric or, as "Think like a graph" slogan. It's a response to one of vertex-centric programs problem about a big number of iterations. Sometimes 2 vertices located on the same partition need multiple iterations before they're finally able to communicate together. The communication in vertex-centric paradigm is often limited to a vertex and its neighbours. Graph-centric computation model works differently. It splits a whole graph on smaller subgraphs. The processing is executed directly on the subgraphs and the results computed by boundary vertices (= vertices present in different subgraphs) are exchanged at each iteration. These 2 approaches will be detailed more in the next articles devoted to distributed graphs.

The goal of this post was to provide a global overview of graphs in the context of Big Data. But it started with a section devoted to general concepts of graph as a data structure. It'll help to better understand the data processing examples given in next articles. Two next parts described more the context of graph as a data tool. One of them shown partitioning strategies and proven that sharding graphs was not an easy task because of Power Law. The next part given a small insight on data processing paradigms oriented around vertex and subgraph ideas. But as told, it was only an introduction to gather the same level of knowledge before the next posts talking about graphs in the context of data processing more in details.