Raft

Paxos is a great algorithm, but it turns out to be nontrivial to apply in practice. Google described the challenges of implementing a fault-tolerant replicated state machine on top of the Paxos algorithm[1]. Google's engineers even had to create a new language to describe the state machine and then translate it into C++ code. What makes things even worse is that Paxos leaves out many parts that practical systems need, such as membership changes and leader election.

Raft[2] is a consensus algorithm designed specifically for understandability. It follows the general idea of Multi-Paxos:

  1. Elect a leader.
  2. The leader appends log entries to followers and helps lagging followers catch up.
  3. Once a log entry has been replicated to a majority of servers, it can be marked as committed. At that point the change can be applied to the replicated state machine and the server can respond to the client (see the sketch below).
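To make these steps concrete, here is a minimal Go sketch of steps 2 and 3 from the leader's point of view. All names are illustrative, and the replicate and apply callbacks stand in for the real RPC and state-machine layers; this is not the API of any actual Raft library.

```
package raft

import "errors"

// LogEntry is a single replicated log entry: a term, an index and an opaque payload.
type LogEntry struct {
	Term    int
	Index   int
	Payload []byte
}

// leader bundles just enough state for this sketch.
type leader struct {
	currentTerm int
	commitIndex int
	log         []LogEntry
	followers   []string
	replicate   func(follower string, e LogEntry) bool // placeholder for the AppendEntries RPC
	apply       func(e LogEntry)                       // placeholder for the state machine
}

// propose walks through steps 2 and 3: append locally, replicate to
// followers, then commit and apply once a majority holds the entry.
func (l *leader) propose(payload []byte) error {
	entry := LogEntry{Term: l.currentTerm, Index: len(l.log) + 1, Payload: payload}
	l.log = append(l.log, entry)

	acks := 1 // the leader's own copy counts toward the majority
	for _, f := range l.followers {
		if l.replicate(f, entry) {
			acks++
		}
	}

	clusterSize := len(l.followers) + 1
	if acks*2 <= clusterSize {
		return errors.New("entry not yet replicated to a majority")
	}
	l.commitIndex = entry.Index // committed: safe to apply and answer the client
	l.apply(entry)
	return nil
}
```

A real implementation keeps retrying slow followers in the background rather than failing the proposal, but the majority rule for committing is the same.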

Algorithm Details

The general framework mentioned above describes how the servers in the cluster serve as a replicated state machine. When a leader has been elected, an epoch, or term, begins, and the cluster repeats steps 2 and 3 for as long as that leader stays in power. The main contribution of Raft is that it proposes five properties: election safety, leader append-only, log matching, leader completeness, and state machine safety, which together lead to an atomic broadcast protocol. The key to understanding the Raft algorithm is to understand how these five properties are maintained.

Raft Basics

A term in Raft is quite similar to an epoch in ZooKeeper or Chubby. Time is divided into terms, and each term begins with a leader election. A server in Raft plays one of three roles: leader, follower, or candidate. There is at most one leader per term, and only the leader can send append-log requests to followers. Candidate is the state a follower enters when it loses contact with the leader. Each server keeps a sequence of log entries. A log entry consists of three fields: term, index, and payload, where the payload can be any byte array. A log entry can be marked as committed once it has been replicated on a majority of servers.
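The roles and the log entry format map naturally onto a few small types. The following Go sketch uses illustrative names rather than the fields of any particular implementation.

```
package raft

// Role a server plays at any point in time.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// LogEntry is the unit of replication: a term, an index and an opaque payload.
type LogEntry struct {
	Term    int
	Index   int
	Payload []byte
}

// Server holds the minimal per-server state described above.
type Server struct {
	currentTerm int
	role        Role
	log         []LogEntry
	commitIndex int // highest index known to be replicated on a majority
}
```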

Election Safety

This means that for each term, at most one leader exists. To maintain this property, Raft requires that in a given term a server can vote for only one candidate, and a server can become leader only if it receives votes from a majority of servers. Since two different candidates cannot both gather majorities of such single votes in the same term, two leaders can never coexist within one term.
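A sketch of how a server might enforce the one-vote-per-term rule when handling a vote request; the type and field names are assumptions made for illustration.

```
package raft

// voter holds just enough state for the one-vote-per-term rule: the current
// term and who we voted for in it ("" means no vote yet).
type voter struct {
	currentTerm int
	votedFor    string
}

// handleRequestVote grants a vote only if this server has not already voted
// for a different candidate in the candidate's term.
func (v *voter) handleRequestVote(candidateID string, candidateTerm int) bool {
	if candidateTerm < v.currentTerm {
		return false // stale candidate from an old term
	}
	if candidateTerm > v.currentTerm {
		v.currentTerm = candidateTerm // newer term: any previous vote is void
		v.votedFor = ""
	}
	if v.votedFor == "" || v.votedFor == candidateID {
		v.votedFor = candidateID // at most one vote per term
		return true
	}
	return false
}
```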

Leader Append-only

This property requires that a leader only appends entries to its own log and never overwrites or deletes them.

Log Matching

This property requires that if log entries in two different servers have the same index and term, then all earlier log entries also match. To ensure this, a leader forces each follower's log to match its own before the follower appends any new entries: an append request carries the index and term of the entry immediately preceding the new ones, and the follower rejects the request if its own log does not contain that entry.
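A follower-side sketch of that consistency check, assuming log indices start at 1 and that the leader retries with an earlier prevIndex after a rejection; the function and parameter names are made up for illustration.

```
package raft

type LogEntry struct {
	Term    int
	Index   int
	Payload []byte
}

// appendEntries keeps the log matching property. prevIndex and prevTerm
// describe the entry in the leader's log immediately before the new
// entries (prevIndex 0 means "start of log").
func appendEntries(log []LogEntry, prevIndex, prevTerm int, entries []LogEntry) ([]LogEntry, bool) {
	// Reject unless our log contains an entry at prevIndex with prevTerm;
	// the leader will retry with an earlier prevIndex until the logs agree.
	if prevIndex > 0 {
		if prevIndex > len(log) || log[prevIndex-1].Term != prevTerm {
			return log, false
		}
	}
	// Drop anything after the match point and append the leader's entries,
	// so the follower's log becomes a prefix of the leader's log.
	log = append(log[:prevIndex], entries...)
	return log, true
}
```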

Leader Completeness

This property says that if a log entry is committed, it will be present in the logs of all future leaders. To ensure this, Raft adds a restriction to the leader election process on top of the requirements mentioned above: a server only votes for a candidate whose log is at least as up-to-date as its own. One log is more up-to-date than another if its last log entry is more up-to-date. The comparison of two log entries can be described in the following pseudocode:

```
// entry1 is less up-to-date than entry2 when:
if (term1 == term2) index1 < index2 else term1 < term2
```

If all these properties hold, leader completeness is also fulfilled. The intuition is as follows: a committed entry is stored on a majority of servers, and a future leader needs votes from a majority of servers, so at least one voter holds the entry; since that voter only votes for candidates whose logs are at least as up-to-date as its own, the new leader must hold the entry as well. For a formal proof, refer to the paper[2].
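Combined with the one-vote-per-term rule above, the restriction can be sketched as a small predicate over the last entries of the candidate's log and the voter's log; the names are illustrative.

```
package raft

// grantVoteByLog applies the election restriction from this section: the vote
// is granted only if the candidate's last entry is at least as up-to-date as
// this server's last entry.
func grantVoteByLog(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm // a later term is always more up-to-date
	}
	return candLastIndex >= myLastIndex // same term: the longer log wins
}
```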

State Machine Safety

A log entry can be applied to the state machine only after it has been committed, so it is straightforward to derive this property from the leader completeness property.
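A minimal sketch of how a server might enforce this, assuming a lastApplied counter that trails commitIndex; the names are illustrative.

```
package raft

type LogEntry struct {
	Term    int
	Index   int
	Payload []byte
}

// applyCommitted hands only committed entries (index <= commitIndex) to the
// state machine, in log order. lastApplied tracks how far we have applied;
// the apply callback stands in for whatever the state machine does.
func applyCommitted(log []LogEntry, lastApplied, commitIndex int, apply func(LogEntry)) int {
	for lastApplied < commitIndex {
		lastApplied++
		apply(log[lastApplied-1]) // log index i lives at slice position i-1
	}
	return lastApplied
}
```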

Others

What makes Raft even more practical is that it also covers cluster membership changes. This is quite important since machines break and sometimes never come back. Log compaction through snapshots is another mandatory technique for practical systems: it not only reduces disk usage but also shortens recovery time.

References

  1. Tushar Chandra, Robert Griesemer and Joshua Redstone. Paxos Made Live. 2007
  2. Diego Ongaro and John Ousterhout. In Search of an Understandable Consensus Algorithm. 2014