Akka and Distributed Data module

Versions: Scala 2.12.3, Akka Distributed Data 2.5.11

Recently we discovered a lot of topics related to the eventual consistency (ACID 2.0, CRDT, CALM...). Some of these posts had the code examples using the Akka distributed data module that will be covered below.

Looking for a better data engineering position and skills?

You have been working as a data engineer but feel stuck? You don't have any new challenges and are still writing the same jobs all over again? You have now different options. You can try to look for a new job, now or later, or learn from the others! "Become a Better Data Engineer" initiative is one of these places where you can find online learning resources where the theory meets the practice. They will help you prepare maybe for the next job, or at least, improve your current skillset without looking for something else.

๐Ÿ‘‰ I'm interested in improving my data engineering skillset

See you there, Bartosz

The post is divided in 3 main parts. The first one explains the context of this module. The second part focuses on the API details. The last section shows some simple use cases of the Akka distributed module with the goal to give a pragmatic input about the CRDT and not to recall the Conflict-free Replicated Data Types (CRDT) specificity presented in previous posts.

The distributed data and Akka

The Akka Distributed Data module is a replicated in-memory data store providing common CRDT implementations as counters, maps, sets or registers. The data stored in the store is shared between Akka Cluster nodes and may be replicated asynchronously throughout Gossip protocol or direct replication.

The module benefits from the advantages provided by the CRDT. Thanks to the monotonic merge function any conflicted state occured for instance because of a network partition can be resolved. Because of the link with the CRDT, the Distributed Data is also related to the eventual consistency. Thus, it provides both high read and write availability, and low latency. In the counterpart it may sometimes return an out-of-date value that will be updated thought after the system regains responsivness.

API details

As told, Distributed Data module already provides some of common CRDT types. However we aren't limited to them. To extend the list of available options we can define our own CRDT type through the implementation of the akka.cluster.ddata.ReplicatedData trait. The implementation must provide the logic for merge(that: T): T that is the function used to resolve the value for given entry. It's important to emphasize that this function must be monotonic (function either entirely nonincreasing or nondecreasing). Another important requirement is the immutability - every modification should return a new instance. An example of the immutability can be PNCounter's copy method called at increment and decrement operations:

private def copy(increments: GCounter = this.increments, decrements: GCounter = this.decrements): PNCounter =
    new PNCounter(increments, decrements)

It exists also a special type of CRDT called delta-CRDT. It's a special kind of CRDT represented by the implementations of akka.cluster.ddata.DeltaReplicatedData trait. Its specificity comes from the fact that it replicates only the modified elements. For instance if 2 new values are added to a set, only the information about the addition of 2 new elements is dispatched to the replicas and instead of the whole set.

But the CRDT type alone wouldn't be enough. It must be integrated with Akka Cluster. The interaction with the distributed data is done by the akka.cluster.ddata.Replicator actor. It must be started on each node of the cluster or on the group of nodes having a specific role. These actor can receive one of the following types of messages: