Graph storage

Until now we've discovered exclusively the concepts devoted to computing distributed graphs. But the compute part can't go without storage. And since for the latter in the context of graph we can't talk about the storage, it requires its own detailed explanation.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I'm currently writing one on that topic and the first chapters are already available in 👉 Early Release on the O'Reilly platform

I also help solve your data engineering problems 👉 contact@waitingforcode.com 📩

Three sections compose this post. The first one tries to highlight the difference between graph processing and graph database. The second part provides an insight into different graph data storage techniques. The final section gives some examples of graph databases implementing the concepts from previous section.

Graph database vs graph processing

In the recent posts we've focused mainly on the graph processing. For most of the frameworks implementing these concepts, the data is stored in the framework-specific format, partitioned and loaded into workers memory before processing.

In the other side, a graph database is a storage system saving the data with graph structures such as vertices or edges. If we should compare both graph concepts to data processing concept we've learned until now, we could consider graph processing as Apache Spark and graph database as mutable sources for the data processing logic.

Graph database is a category of NoSQL databases. We can describe it with 4 components: storage, querying facility, scalability and transaction processing. Since I'll describe the storage in the remaining part of this post, I'll discuss here only 3 last points:

Storing graph data

Graph data has specific, often continuous structure. Vertices are linked together through edges and at first glance, we could think that it's represented as one big file for each subgraph. However, it's not the case and graph can be represented with the help of different data structures:

Native or not native graph databases

In the blog post "Graph Databases for Beginners: Native vs. Non-Native Graph Technology" quoted in Read also section, Neo4j proposes a distinction for technologies storing graphs. They classify graph storages as being native and not native. The first approach is characterized by "graph-first" citizenship. Concretely it means that the graph data is stored as a...graph. Not-native graph technology is the one which delegates the storage to 3rd party storages as key-value stores or relational databases.

The proposed differentiation point is the use of index-free adjacency by the storage technology. Accordingly to the post, this structure is the most natural way to represent graph structures in the storage part.

Above list contains only some of the main storage structures. There are also different engines able to store graph data:

Graph databases - examples

Until now we've discovered only theoretical concepts about the storage of graph data. But to not let us hungry, let's try to find the implementations of these ideas. Maybe the most famous implementation of native graph storage is Neo4j. It stores data in files where each of them is responsible for different graph property (vertices, edges, labels, attributes …). The vertices have a fixed size of 9 bytes and thanks to that they can be accessed faster. The beginning of given vertex is computed as 9 * vertex index. The same fixed-size rule concerns edges represented as a doubly linked list. To guarantee traversal performances, each edge stores the pointers for previous and next edges of the source and destination vertex.

An example of storage using NoSQL technologies is JanusGraph. This Open Source project stores graph data as an adjacency list with all edges (not only the pointer) in one of the provided stores: Apache Cassandra, Apache HBase or Google Bigtable. Another graph database using NoSQL is DGraph that is based on Badger - a key-value store implemented in Go. Badger is used as an inverted index to store all outgoing edges of vertices sharing the same predicate.

Regarding to multi-model database we can quote OrientDB as an example. It comes with 2 APIs: one for document-oriented storage and another for graph-oriented storage. The latter is built on top of the former one. As JanusGraph, it stores the graph in the form of an adjacency list. But the list is completed with document database capabilities to store physical vertices.

An example of graph storage from the family of relational databases is SQL Server 2017. It stores graph data in 2 tables: one for vertices and another for edges.It supports traversals through MATCH statement.

The article showed how the graph data can be stored with different structures. The simplest data representation are matrices with values representing vertices connectedness. More complicated ones are linked list and among them a doubly linked list implemented in Neo4j that by the way provides a classification of graph databases on native and not native. Everything that shows clearly that distributing graph storage, mainly because of its connected character, is quite challenging. As illustrated in the 3rd part of the post, horizontal scalability is most of the time provided with NoSQL key-value stores, used in JanusGraph and DGraph projects.