As told many times in previous posts, one of the most challenging tasks in distributed graph processing is the partitioning. Connected nature of the graph components makes the partitioning hard. Hopefully, the researchers continue to propose the solutions.

### Looking for a better data engineering position and skills?

You have been working as a data engineer but feel
stuck? You don't have any new challenges and are still writing the same
jobs all over again? You have now different options. You can try to look for a new job, now
or later, or learn from the others! "**Become a Better Data Engineer**"
initiative is one of these places where you can find online **learning resources
where the theory meets the practice**.
They will help you prepare maybe for the next job, or at least, improve your current
skillset without looking for something else.

đź‘‰ I'm interested in improving my data engineering skillset

The article is composed of 3 sections. The first one recalls why it's difficult to partition a graph. The second one shows some of the partitioning strategies brought by researchers. The final section presents the partitioning in 3 Open-Source graph processing frameworks: Gelly, GraphX and Giraph.

## Difficult graph partitioning

Graph partitioning has the same objectives as for the case of classical data processing based on files, NoSQL data stores or relational databases. Its main goal is to provide an efficient way to parallelize the computation. Who tells efficient, he automatically invokes evenly distributed data. In addition to that property, graph partitioning has another requirement - minimizing cut size. It means that ideally, each partition will contain an independent subgraph. In result, all subgraphs compose the whole graph. Such optimization means that there won't be any edge connecting vertices on different partitions.

However achieving this it's not an easy task because most of the real-world graphs are dense (= a lot of edges). Moreover, they often suffer from the Power law that makes edge-cut minimization even more difficult. Another graph property making its distribution difficult is its dynamic nature. Nobody can guarantee that current partitioning will fit for the graph state in a week or even a few days. This is particularly visible in social media graphs where in one day some not popular person can gain a very big number of relationships.

It's why graph partitioning is classified as an NP-hard problem. Finding the optimal solution is very difficult and often impossible but hopefully, a lot of good solutions with approximated results exist. In this solutions often we need to define an acceptable level of graph imbalance. Aside from solvability, this problem also has another difficulty. Even though we succeed to partition the graph for one specific use case, nothing tells that the same approach will work for other graphs. It's because of relationship nature between vertices that most of the time is different between graphs.

From that we can list 3 main principles that a good partitioning algorithm should guarantee:

- balance - the partitions should be balanced, i.e. they should have similar load.
- edge-cut - a good partitioning should minimize the number of edges connecting vertices from different partitions.
- incremental - as already told, graphs change - especially social media graphs popular nowadays. In order to keep good partitioning, they should be able to execute incrementally and rebalance as new vertices and edges are coming. Incremental character means often the need to move already partitioned vertices in order to guarantee the respect of above 2 points in changing environment.

## Partitioning approaches

Graphs can be partitioned with different strategies. Two basic ones are **vertex-cut** and **edge-cut** introduced in the post graphs and data processing. The partitioners for this category are divided into 2 groups: online and offline. The former one uses only local information as vertex and its neighbors and it's easily distributed. Offline partitioner requires information about the whole graph before doing the partitioning. It may require multiple iterations before reaching optimal partitions. Among concrete partitioning algorithms for both groups, we can distinguish random (hash-based), hash space segmented (completes the random assignment and tends to minimize replication needs) or greedy (minimizes replication factor at each step). Vertex and edge-based approaches are iterative and at each execution, steps try to minimize edge-cuts by keeping all partitions balanced. The following animation shows partitioning for a simple graph:

Among other different scientific partitioning proposals we can find ones based on **streaming**. Some of them are able to distribute the graph in one pass. To achieve that they use different heuristics, such as hashing, assigning vertex to the smallest partition, weights (triangles for social networks, random) or â€śbalance big" (high degree vertices are used to attract vertices with lower degrees). However, all the heuristics don't fit equally to all types of graphs.

Other partitioning methods use graph algorithms. One of them is **graph coarsening** where a big graph is transformed into much smaller subgraphs of related vertices through multiple steps called levels. Another algorithm we could use in partitioning is **Label Propagation Algorithm**. It works similarly to the connected components algorithm described in Graph algorithms in distributed world - part 1 post, because it begins by assigning random ids for all vertices. The difference is that the vertex doesn't take the minimal id among the sent values but the id most frequently met in its neighbors. Later the algorithm uses a central component providing constraints to satisfy, as partitions balance. One of the LPA-based partitioning strategies is called Spinner.

## Partitioning in real-world

A theoretical point of view is always important but how does it fit real-world use cases? How Gelly, GraphX or Giraph implement partitioning in their data processing workflow? Apache Flink's **Gelly** is based on DataStream abstraction and theoretically, it's able to use any of provided partitioning strategies. However in practice, at least after analyzing test code for graph algorithms (PageRank, connected components) we can observe the exclusive use of hash-based partitioning where vertices are randomly assigned to the workers.

Apache Spark's **GraphX** library proposes more evolved strategies. We can choose among 3 possibilities:

- random vertex partitioning - computes partition by hashing source and destination vertex ids
- canonical vertex partitioning - similar to above except the edges between the same vertices are on the same partition. It can happen that two vertices are connected with the edges of, for instance, different directions.
- 1-dimension edge partitioning - ensures that the outgoing edges of given vertex are always on the same partition.
- 2-dimensions edge partitioning - uses edge adjacency matrix and divides it up into tiles. It works well only when the number of partitions is a perfect square (4, 9, 16, 25, ...)

Regarding to **Apache Giraph**, it uses 2 strategies. The first one is based on the distribution of vertex ids hashes across the partitions. The second one uses vertex ids directly and assigns them to the partitions. In such case, the workers are responsible for equally sized ranges of ids.

Partitioning a graph is a difficult task. Its 2 main goals are balanced partitions and edge-cut minimization. However, both are difficult and often impossible to achieve at the same time. Since some approximation solutions exist. But as we saw, despite an important mobilization of the scientific community, the proposed solutions aren't always implemented by Open-Source graph processing frameworks. The analyzed solutions (GraphX, Gelly, and Giraph) use similar approaches to deal with partitions: hash-based and dimensional edge partitioning.