Consensus Algorithms: From 2PC to Raft

The problem of consensus typically arises in the context of the replicated state machine, a general approach to a variety of fault-tolerance problems. For example, the Google File System (GFS) and Bigtable both use a coarse-grained locking service called Chubby for leader election, group membership management, and metadata storage. HDFS and HBase, the open source counterparts of GFS and Bigtable respectively, both use a coordination service called ZooKeeper for similar purposes.

At the heart of Chubby is Paxos, the most famous consensus algorithm, introduced by Lamport. ZooKeeper uses an atomic broadcast protocol called Zab, which can be seen as a variation of Paxos, to broadcast changes to processes running on different machines.

So why do we need consensus algorithms? If we want to replicate changes across different processes, it seems enough to just send them in order. The problem with this simple approach is that we may fail to deliver changes to some peers for many reasons, e.g. a process crash or a network failure. If we wait for every replica to acknowledge each change, we lose availability. To achieve high availability, we cannot wait for all processes to apply a change before making progress, but we still need a mechanism to ensure that all processes apply changes in the same order. That's where consensus algorithms come in.
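To make the ordering requirement concrete, here is a minimal sketch in Go (the `Command` and `StateMachine` types are hypothetical illustrations, not taken from any of the systems above). The point is simply that if a consensus algorithm hands every replica the same ordered log of commands, replicas that apply that log deterministically end up in the same state.

```go
package main

import "fmt"

// Command is a single change to replicate, e.g. a key/value write.
// (Hypothetical type, for illustration only.)
type Command struct {
	Key, Value string
}

// StateMachine applies commands in log order; if every replica applies
// the same sequence deterministically, all replicas reach the same state.
type StateMachine struct {
	data map[string]string
}

func NewStateMachine() *StateMachine {
	return &StateMachine{data: make(map[string]string)}
}

// Apply executes one command deterministically.
func (sm *StateMachine) Apply(c Command) {
	sm.data[c.Key] = c.Value
}

func main() {
	// The consensus algorithm's job is to produce this same ordered log
	// on every replica, even when some replicas or messages fail.
	log := []Command{
		{"x", "1"},
		{"y", "2"},
		{"x", "3"},
	}

	// Two replicas applying the identical log reach the identical state.
	a, b := NewStateMachine(), NewStateMachine()
	for _, c := range log {
		a.Apply(c)
		b.Apply(c)
	}
	fmt.Println(a.data) // map[x:3 y:2]
	fmt.Println(b.data) // map[x:3 y:2]
}
```

The hard part, of course, is producing that identical log in the presence of crashes and lost messages, and that is exactly what the algorithms in the following chapters do.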

This mini-book is not an exhaustive survey of consensus problems; I'll just cover some of the most famous algorithms. The rest of the book is structured as follows: Chapter 1 covers two simple algorithms for distributed consensus, two-phase commit (2PC) and three-phase commit (3PC). Chapter 2 covers Paxos, the most famous consensus algorithm. Chapter 3 covers Raft, a recently proposed consensus algorithm designed for understandability; in this chapter I'll also introduce logmaster, an open source implementation of the Raft algorithm that I developed. Chapter 4 covers Chubby and ZooKeeper, two replicated state machines built on Paxos and Zab respectively.