Consensus problem and the Raft consensus algorithm

For many years the Paxos protocol was one of the most serious solutions to the consensus problem. However, in 2013 Diego Ongaro and John Ousterhout from Stanford University proposed an alternative called Raft.

Throughout this post we'll learn more about the Raft algorithm. The first section explains the consensus problem, while the second presents the algorithm's main concepts and vocabulary. The 3 remaining parts cover the 3 main problems that Raft tries to solve.

Consensus problem

The consensus problem is about how several nodes (agents) behave in order to keep their individual states globally consistent. Simply speaking, it defines what they must do to agree on a given value, even when something like a network partition happens. One of the goals of consensus algorithms is to ensure that the whole distributed system can progress (e.g. handle data change operations) even if some of the nodes are unavailable for some period of time, and that, once they're back alive, the state of all nodes (available + previously unavailable) converges in a deterministic manner.

To solve this problem the consensus algorithms are often defined around 3 main topics:

- leader election - choosing the single node responsible for coordinating the writes
- log replication - propagating the accepted changes to all the other nodes
- safety - guaranteeing that the nodes never diverge on values they already agreed on

One of the most often quoted algorithms is Paxos. It'll be covered in one of the further posts. It's mentioned here only to introduce another solution, Raft, created in response to some of Paxos's drawbacks which, according to "In Search of an Understandable Consensus Algorithm (Extended Version)", are its complexity and the lack of a good foundation for building practical implementations.

Raft explained

To simplify both understanding and concrete implementations, Raft solves the 3 problems listed above as independent tasks. Moreover, the data is written sequentially to an append-only log, in a constrained order. The communication between the nodes is based on RPC calls.

The following components describe the Raft algorithm:

- leader - the single node accepting the clients' requests and replicating the data to the other nodes; it regularly sends heartbeat messages to signal that it's still alive
- follower - a passive node answering the requests coming from the leader and the candidates
- candidate - a follower that, after not hearing from the leader within the election timeout, asks the other nodes to elect it as the new leader
- term - a monotonically increasing number acting as a logical clock; every term starts with a leader election
- append-only log (commit log) - the sequential structure storing the commands issued by the clients, together with the term of their creation
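To make this vocabulary more tangible, here is a minimal Go sketch of the corresponding data structures. The field names follow the Raft paper, but the types themselves (a hypothetical raft package) are only illustrative and don't come from any specific library:

```go
package raft

// State in which a node can be at a given moment.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// LogEntry is a single record of the append-only log: the term in which
// it was created and the command issued by the client.
type LogEntry struct {
	Term    int
	Command []byte
}

// RequestVoteArgs is sent by a candidate asking for votes. LastLogIndex
// and LastLogTerm describe how up-to-date the candidate's log is.
type RequestVoteArgs struct {
	Term         int
	CandidateID  int
	LastLogIndex int
	LastLogTerm  int
}

// AppendEntriesArgs carries replicated entries from the leader; with an
// empty Entries slice it also serves as the heartbeat. LeaderCommit is
// the highest index the leader knows to be committed.
type AppendEntriesArgs struct {
	Term         int
	LeaderID     int
	PrevLogIndex int
	PrevLogTerm  int
	Entries      []LogEntry
	LeaderCommit int
}
```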

Despite arriving later than the other solutions, Raft has already been adopted in some data-oriented projects. One of them is NATS Streaming.

Raft and leader election

When the cluster is up for the first time, the term on all nodes is set to 0 and there is no leader. The leader election process starts once the election timeout expires on one of the servers. Another condition triggering an election is leader downtime: if a follower doesn't receive a heartbeat message within some delay, it considers the leader as down and starts a new election after the election timeout.

After reaching the timeout the follower becomes a candidate: it increments the term, votes for itself and sends a RequestVote message to the remaining nodes. What happens with this message on the voter's side? The message carries the information about the candidate's commit log state (the index and the term of its last entry). If that state is less up-to-date than the voter's own log, the voter doesn't grant its vote to the candidate.
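The voting rule can be expressed in a few lines. The sketch below extends the hypothetical raft package started earlier with a minimal Node type; it illustrates the comparison described above and is not a complete implementation:

```go
// Node keeps only the fields needed here; a minimal, hypothetical model
// rather than a full Raft server.
type Node struct {
	currentTerm int
	votedFor    int // -1 when no vote was cast in currentTerm
	commitIndex int
	log         []LogEntry
}

func (n *Node) lastLogIndex() int { return len(n.log) - 1 }

func (n *Node) lastLogTerm() int {
	if len(n.log) == 0 {
		return 0
	}
	return n.log[len(n.log)-1].Term
}

// handleRequestVote grants the vote only if the candidate's term is not
// stale, the node hasn't already voted for someone else in this term,
// and the candidate's log is at least as up-to-date as the voter's.
func (n *Node) handleRequestVote(args RequestVoteArgs) bool {
	if args.Term < n.currentTerm {
		return false
	}
	if args.Term > n.currentTerm {
		// A higher term means our own information is stale: adopt it
		// and forget any vote cast in the previous term.
		n.currentTerm = args.Term
		n.votedFor = -1
	}
	if n.votedFor != -1 && n.votedFor != args.CandidateID {
		return false
	}
	// "More up-to-date" compares the last entries: first by term,
	// then by index.
	if args.LastLogTerm != n.lastLogTerm() {
		if args.LastLogTerm < n.lastLogTerm() {
			return false
		}
	} else if args.LastLogIndex < n.lastLogIndex() {
		return false
	}
	n.votedFor = args.CandidateID
	return true
}
```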

The candidate becomes the new leader only when it receives the votes of a majority of the nodes. However, winning is not the only possible outcome. Another one is when a different candidate wins the election; this situation is possible because, as we saw in the previous paragraph, one candidate can be more eligible than another thanks to its fresher data. The last possible outcome is called a split vote and it occurs when none of the candidates receives the majority. In this situation the election process restarts with the same mechanism.

It's important to note here that the election timeout is a random value, different on each node. This randomness reduces the risk of an endless sequence of split votes.
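A sketch of how this randomization is usually implemented, assuming the 150-300 ms range suggested in the Raft paper (the exact range is a tuning choice, not something mandated by the algorithm):

```go
package raft

import (
	"math/rand"
	"time"
)

// randomElectionTimeout draws a different timeout on every call, in the
// range [150 ms, 300 ms). Because every node draws its own value, two
// followers rarely become candidates at exactly the same moment, which
// is what would otherwise make split votes repeat forever.
func randomElectionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}
```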

Raft and replication

The data replication happens in a few simple steps. First, the client sends a request that is either addressed directly to the leader or redirected to it by a follower. Next, the leader appends the change to its commit log and sends an AppendEntries request to all followers to replicate the entry. When a majority of the cluster (the leader included) confirms the write, the leader considers the entry as committed and applies the change to its persistent storage. Committing an entry automatically commits all preceding uncommitted entries. The followers commit the entry in their turn thanks to the leader's commit index (the highest committed entry index) carried by the AppendEntries message, which is also sent during the heartbeat process.
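As a small illustration of the majority rule, the helper below (continuing the same hypothetical sketch) computes the new commit index from matchIndex, the highest log index known to be replicated on each server, leader included; the RPC transport itself is left out:

```go
// advanceCommitIndex returns the highest index stored on a majority of
// the cluster. Committing that index implicitly commits every preceding
// entry as well, as described above.
func advanceCommitIndex(matchIndex []int, commitIndex int) int {
	for next := commitIndex + 1; ; next++ {
		replicas := 0
		for _, m := range matchIndex {
			if m >= next {
				replicas++
			}
		}
		if replicas*2 <= len(matchIndex) {
			// No majority stores the entry at `next`: the previous
			// index is the highest one that can be committed.
			return next - 1
		}
	}
}
```

For example, in a 5-node cluster with matchIndex = [4, 4, 3, 2, 1] and commitIndex = 2, the function returns 3: index 3 is stored on 3 of the 5 servers, while index 4 is stored only on 2.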

An entry in the append-only log is composed of 2 properties: the term representing its creation time and the persisted command issued by the client. 2 entries located in different logs but having the same index and term satisfy these 2 requirements of the append-only log design:

- they store the same command
- the logs are identical in all the entries preceding them

Both properties define the Log Matching Property rule.

If something goes wrong with the network and a follower's log diverges from the leader's (it misses some entries or contains extra conflicting ones), the leader is always in charge of handling the inconsistency. In such a situation the follower with conflicting data rolls back to the last point where it has the same entries as the leader, deletes the values after this place and takes the entries from the leader's log.
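Continuing the same sketch, a follower-side handler could look like the code below (0-based indices, with -1 meaning "no previous entry"). It's a deliberately simplified assumption-based illustration, but it shows the rollback described above: reject the call when the previous entry doesn't match, otherwise drop the conflicting suffix and adopt the leader's entries:

```go
// handleAppendEntries performs the consistency check on the entry that
// precedes the new ones, replaces any conflicting suffix with the
// leader's entries, and advances the local commit index up to what the
// leader already committed.
func (n *Node) handleAppendEntries(args AppendEntriesArgs) bool {
	if args.Term < n.currentTerm {
		return false // message from a stale leader
	}
	// The entry just before the new ones must exist and carry the same
	// term; otherwise the leader backs up and resends older entries.
	if args.PrevLogIndex >= 0 {
		if args.PrevLogIndex >= len(n.log) || n.log[args.PrevLogIndex].Term != args.PrevLogTerm {
			return false
		}
	}
	// Roll back to the matching point and take the leader's entries.
	n.log = append(n.log[:args.PrevLogIndex+1], args.Entries...)

	// Commit what the leader already committed; this is how the commit
	// index carried by heartbeats lets the followers commit in their turn.
	if args.LeaderCommit > n.commitIndex {
		n.commitIndex = args.LeaderCommit
		if last := len(n.log) - 1; last < n.commitIndex {
			n.commitIndex = last
		}
	}
	return true
}
```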

Raft and safety

So far the Raft explanation was easy. After all, we considered that the network worked well. But sometimes (often?) things can go wrong, and Raft is still able to restore a consistent state and thus provide the safety guarantees to the client through the following properties:

- election safety - at most one leader can be elected in a given term
- leader append-only - a leader never overwrites or deletes the entries of its log; it only appends new ones
- log matching - if 2 logs contain an entry with the same index and term, they're identical in all entries up to this index
- leader completeness - if an entry is committed in a given term, it'll be present in the logs of the leaders of all higher terms
- state machine safety - if a node applied the entry at a given index, no other node will ever apply a different entry at the same index

Consensus guarantees a consistent cluster state, but it's not an easy task to achieve, especially because of network failures. Hence several algorithms exist to solve the problem. One of them is Paxos but, as mentioned in the first section, it's often considered too complex. This explains one of the reasons for Raft to exist: an algorithm built around terms, where the nodes can be in one of 3 states: follower, candidate or leader. Thanks to its processing guarantees, this algorithm provides leader election, consistent data replication and safety.