Paxos
Paxos consensus algorithm was introduced by Lamport in last century with an example of parliament on a fictional island[1] and then explained in plain English[2]. It's famous for a reputation as being difficult to understand. Since it has been proven to be robust in asynchronous network, it has been actively studied in last two decades in spite of its complexity. Other consensus algorithms we will talk about, e.g. raft and zab, are more or less inspired by the ideas behind paxos. Chubby uses it as underlying consensus protocol for its replicated state machine.
Basic Paxos
Basically paxos uses replications to achieve fault tolerance and reach consensus by limiting the proposed values. There are two roles in paxos algorithm: proposer and accepter. And each proposal is a tuple composed of two components: a sequence number and a value. The algorithm executes in two phases:
Phase one
- The proposer picks a sequence number n and sends a prepare message to all accepters.
- When an accepter receives a prepare message, it replies with a promise not to accept any value with sequence number less than n and the accepted value with highest number if has any.
Phase two
- When the proposer received replies from a majority of accepters, it picks value with highest sequence number and send a commit message with the value chosen and sequence number n to all accepters. If no value was chosen, the proposer can chose any value.
- When the accepter receives a commit message, the accepter will accept this message unless it has received any prepare message with higher sequence number.
That is one iteration of paxos protocol, or a paxos instance. For a formal reasoning of why the algorithm can reach consensus, you need to resort to Lamport's papers[1,2]. I just want to say some intuitions behind this algorithm. For example, we have two proposers p1 and p2, and three acceptors a1, a2 and a3. If p1 proposes some value v with sequence number n and accepted by a majority of accepters, we say that this value is chosen, then any proposal with higher sequence number has value v. That's to say, if p1's proposeal (n, v) has been accepted by a1 and a2, then a1 and a2 may accept proposals in the future, but paxos guarantees that proposal accepted by them all have value v. Why? According to the protocol, if a proposer wants to issue a proposal, it at first has to receives replies from a majority of accepters. This majority definitely contains at least one of a1 and a2. Without lose of generality, we assume that p2 receives reply from a1 and a3. a1's reply contains value (n, v), and we are not sure what a3's reply may contain. If a3 has not accepted any proposal or accepted any proposal with sequence number less than n, its reply will not affect value of p2's proposal, which will be v. Is it possible that a3 has accepted any proposal with sequence nubmer higher than n? No. If a3 has accepted a proposal with sequence number higher than n, it must has accepted a prepare message with sequence number higher than n. Since this proposal has been issued, that means a majority has received a prepare message with sequence number higher than n, that's say at least one of a1 and a2 has received that. If you have read step 2 of phase one, you will find that it contradicts with the assumption that a1 and a2 has accepted proposal (n, v).
Sequence number
Paxos requires that sequence number of each proposal must be unique. But paxos itself does not care about how this number is generated. The way to ensure that each sequence number is beyond this book's scope.
Fault tolerance
As you see, paxos only requires that a majority of accepters can communicate with proposer, that means this algorithm can tolerate with at most (n-1)/2 accepters fail at the same time.
Multi paxos
Practical systems need to reach consensus on a series of values, for example the redo log of operations. Though paxos allows multi proposers to issue proposals at the same, it makes paxos take at least 3 messages for each pair of proposer and accepter to finish a paxos instance. If there's only one proposer, or a leader, who can issue proposal, then pahse one can be omitted. Chubby took this approach to implement a replicated state machine. Other algorithms, e.g. raft and zab, are just variants of this approach with some complement.
References
- Leslie Lamport. The part-time parliament. 1998
- Leslie Lamport. Paxos Made Simple. 2001