During long years the Paxos protocol was one of most serious solutions for consensus problems. However in 2013 Diego Ongaro and John Ousterhout from Stanford University proposed an alternative called Raft.
Throughout this post we'll learn more about the Raft algorithm. Firstly the consensus problem will be explained and only the second section will talk about this algorithm by explaining its main concepts and vocabulary. Three next parts will present 3 main problems that Raft tries to solve.
The consensus problem is about how several nodes (agents) behave in order to keep their individual states globally consistent. Simply speaking, what they need to do to agree on a given value if a network partition happens. One of the goals of the consensus algorithms is to ensure that the whole distributed system can progress (e.g. handle data change operations) even if some of nodes are unavailable during some period of time and that, once back alive, the state of all nodes (available + unavailable) is consistent in a deterministic manner.
To solve this problem the consensus algorithms are often defined around 3 main topics:
- leader election - how to select one node that will handle the writes and act as a leader ?
- replication - how to replicate to make the state of all nodes consistent ?
- safety - how to provide correct results to client requests, even in the case of network failures, packet loss, duplication and reordering ?
One of often quoted algorithms is Paxos. It'll be covered in one of further posts. It's quoted here only to help us to introduce another solution, Raft - created because of some of Paxos's drawbacks that according to "In Search of an Understandable Consensus Algorithm (Extended Version)" are: the complexity and the lack of foundations for building practical implementations.
To simplify the understanding and the concrete implementation, Raft divides the solutions for the 3 problems listed above as independent tasks. Moreover the data is written to an append-only log sequentially and in a constrained order. The communication between the nodes is based on the RPC calls.
The following components describe the Raft algorithm:
- server - the nodes in the distributed cluster. Each of them has a state storing the information from the last term (explained later), prefered candidate for the leader election and the log entries with the executed commands. Moreover, the server stores the indexes to the last committed command and to the last applied command to the node's state (e.g. a real physical update of one entry in the key-value store).
- server role - a server can be in one of 3 states:
- follower - the server in this state is passive. It only answers to the requests from leaders and candidates. It also dispatches any client request to the leader.
- candidate - it's a candidate to become a new leader.
- leader - it's responsible for handling client request and replicating them on the followers. Only 1 node can be the leader in given term.
- term - the time unit of arbitrary length represented as incremented integer. Every time when new leader is elected, the term value increases. The term also helps to detect an obsolete information held by node. It may happen during the network partition when the writes occur only on one side of the network. Once the network problems finished, the nodes will transist to consistent state thanks to the term comparison (higher value always wins).
- RPC - the communication between the nodes is based on the RPC calls:
- AppendEntries - this communicate has 2 purposes: leader sends it to replicate the entries on the followers and also as a heartbeat message to maintain its leadership. If a follower doesn't receive such message during the time specified in election timeout it considers that there is no leader and it starts new leader election process.
- RequestVote - it's used in the leader election process and is sent by the candidates to start new election
The RPC calls are issued infinetely, until the leader receives the response from the followers or candidate. So even if a follower executes the RPC instruction and crashes before sending the response, it'll receive once again the same request. And it isn't dangerous because the RPCs are idempotent.
Any message having the term smaller than the current term of the node is ignored.
Despite a late arrival regarding to the other solutions, Raft was already adopted in some data-oriented projects. One of them is NATS Streaming.
Raft and leader election
When the cluster is up for the first time, the term on all nodes is set to 0 and there is no leader. The leader election process starts once the election timeout expired on one of the servers. Another condition starting the election is the leader downtime. If a follower doesn't receive the heartbeat message within some delay it'll consider the leader as down and start new election after the election timeout.
After reaching the timeout the follower becomes the candidate: it increments the term, votes for itself and sends RequestVote message to the remaining nodes. What happens with this message on the voter's side ? The message stores the information about the candidate's commit log state. And if the state is less up-to-date that the state of the voter, the voter doesn't give its vote for given candidate.
The candidate becomes the new leader only when it receives the majority of votes from other nodes. However it's not a single possible issue. Another one is when another candidate wins the election. This situation is possible because, as we saw in the previous paragraph, one of candidates can be more eligible than the other because of its data freshness. Another possible issue is called split vote and it occurs when none of the candidates received the majority. In this situation the election process restarts in the same mechanism.
It's important to note here that the election timeout is a random value for all the nodes. It should reduce the risk of infinite split vote situation.
Raft and replication
The data replication happens in some simple steps. First, the client sends the request that is directly addressed to the leader or redirected to it by the follower. Next the leader appends the change message to its commit log and sends the AppendEntries request to all followers to replicate the entry. When the majority of followers confirm writing, the leader applies the change on its persistent storage. Such applied entry is called committed. This commit automatically commits all preceding uncommitted entries. The followers commit the entry in their turn thanks to the highest stored entry index set in the AppendEntries message that is also sent during the heartbeat process.
An entry in the append-only log is composed of 2 properties: the term representing the creation time and the persisted command issued by the client. 2 entries in different logs with the same index and term satisfy these 2 requirements of append-only log design:
- in such case both entries store the same command
- the logs are identical in all entries preceding these 2
Both properties define the Log Matching Property rule.
If something goes wrong with the network and the leader misses some entries, it'll always be in charge of handling the inconsistencies. In such situation the follower with conflicting data will rollback to the situation where it has the same entries as the leader, delete the values after this place and take the entries from the leader's log.
Raft and safety
So far the Raft explanation was easy. After all, we considered that the network worked good. But sometimes (often?) the things can go wrong and Raft is able to restore the consistent state and thus provide the safety guarantees to the client throughout the following points:
- leader election - as already told, the followers vote for the candidate only if the latter one is up-to-date. It brings the consistency between the stored entries (since they're replied on the followers in the case of differences with the leader). It also ensures no need of an additional mechanism to transfer missing entries.
- Log Matching Property - the leader creates always 1 immutable entry in given index within given term. Moreover when something goes wrong and the logs differ between the leader and a follower, the follower overwrite its local entries with the ones of the leader. These 2 techniques guarantee the respect of the Log Matching Properties and thus the consistency between the append-only logs on different nodes.
- timing independance - the Raft consensus must produce correct results even if some events happened more quickly than expected. For instance, if the RPC communication takes longer than the election timeout, none of the candidates will win the election and no leader will exist. And without the leader Raft can't work properly. To not allow such situation Raft advises to set the election timeout value as much bigger than the time needed to communicate between nodes and much smaller than the average time between failures of a single server. It should guarantee reliable distribution of heartbeat messages (= always some time to deliver the message) and steady progress of the system (= always some time to chose new leader).
The consensus guarantess a consistent cluster state but it's not an easy task to achieve, especially because of the network failures. Hence several algorithms exist to solve the problem. One of them is Paxos but as mentioned in the first sectin, it's often estimated as too complex. It explaines one of the reasons-to-be of Raft that is composed of terms where the nodes can be in one of 3 states: follower, candidate or leader. Because of some processing guarantees this algorithm provides as well leader election facility, consistent data replication and safety.