At YLD we are building some new customer-facing products, and every now and then we need to decide which persistent storage mechanism to use for a particular service. We are fans of, and have used, several: LevelDB, Redis, CouchDB, PostgreSQL, Riak and others. The problem with most of these (with the honourable exception of Riak) is that they have a poor availability story: you run the service on one big box and hope it never goes down or has to be restarted. Especially in the world of cloud computing, where instances can vanish and never be seen again, making your entire product depend on one machine is not only scary, but foolish.
One alternative is replication with automatic fail-over: the main server, the master, replicates to one or more slaves. Once the client detects that the master is unreachable, it starts using a slave. While this can work for read requests, it seldom works automatically for write requests, since recovering from a network partition in which two nodes have both acted as master is difficult, if not impossible.
Another alternative that might get you better availability is to distribute your data across a series of partitions (normally called "shards"), spreading those across machines. This way, one machine becoming unreachable need not affect your entire application. Coupled with replication and client fail-over, this can get you closer to the desired availability level. But it comes at a high price: the fail-over and sharding logic lives in the client, making implementations complicated and non-standard. Also, not having all the data under the same engine may make things like atomic transactions impossible in some desirable cases.
In the Dynamo world (of which Riak is a member), this problem is addressed with eventual consistency. All nodes are created equal, and data is spread across them uniformly as it is written. If you can't write to one node, you write to another. This means that, at any given time, two nodes can hold divergent versions of the same document, and merging can happen later. The merging can be left to the client, or default to "latest write wins" (whatever "latest" means: timing is always a difficult problem in distributed systems).
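To make the trade-off concrete, here is what "latest write wins" amounts to; the document shapes and timestamps are invented sample data:

```javascript
// Two replicas hold divergent versions of the same document.
// "Latest write wins" keeps the version with the highest timestamp,
// silently discarding the other write.
function latestWriteWins(a, b) {
  return a.timestamp >= b.timestamp ? a : b;
}

const v1 = { value: { name: 'Ana' },  timestamp: 1000 };
const v2 = { value: { name: 'Anna' }, timestamp: 1005 };

const winner = latestWriteWins(v1, v2); // v2 survives; v1's write is lost
```

The lost update in the comment is exactly the risk: if the two replicas' clocks disagree, "latest" may not be the write the user made last.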
The Dynamo way has, for me, one big problem: it's not designed for range queries. If you have to list, say, all client purchases for the last year, sorted by date, you have to query every available node: since you don't know the keys of the records you're looking for, any node may hold relevant data. Also, you can't stream the results as they come in from the nodes, because you have to return a deduplicated and sorted collection. This means the client has to buffer the results while deduplicating them and, only once the responses from all the involved nodes have finished, sort the collection and reply. This is far from ideal.
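The buffering this forces on the client can be sketched like this (the node responses are invented sample data, with one record returned by two nodes):

```javascript
// Responses from each node for the same range query; entries may overlap
// because replicas of the same record live on multiple nodes.
const nodeResponses = [
  [{ key: '2014-01-03', id: 'p1' }, { key: '2014-02-10', id: 'p2' }],
  [{ key: '2014-02-10', id: 'p2' }, { key: '2014-03-01', id: 'p3' }]
];

// The client must buffer everything, deduplicate by key, and sort
// before it can reply. No streaming is possible.
function mergeRangeResults(responses) {
  const byKey = new Map();
  for (const rows of responses) {
    for (const row of rows) byKey.set(row.key, row);
  }
  return [...byKey.values()].sort((a, b) => (a.key < b.key ? -1 : 1));
}

const merged = mergeRangeResults(nodeResponses); // 3 records, sorted by date
```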
On top of this, the possibility of having multiple versions of a record can be hard to manage at the application layer, introducing a lot of undesired complexity.
OK, relax now. Breathe in. Breathe out. Listen to some nice music. And imagine this:
Wouldn't it be wonderful to have a system that provides high availability and is also a single source of truth? A system that behaved as if you were dealing with only one node?
Enter the Paxos world. "Paxos" has become a Computer Science dirty word, used as a synonym for "distributed consensus". You can think of distributed consensus as a distributed lock: as long as a majority of nodes agree on the same version of the truth, you can trust that that is the truth. When you send a command to a node, this node asks for consensus from the other nodes. If a majority of the nodes agrees, you can rest assured that the command will be applied to all the nodes.
This allows you to have nice stuff that you used to have in centralised databases: sequential IDs, one version of each record, transactions, etc., without sacrificing availability. As long as you have a majority of nodes available, you can still write.
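The majority rule is simple to state in code; this is a generic sketch, not tied to any particular consensus implementation:

```javascript
// A cluster of n nodes can make progress while a majority,
// floor(n/2) + 1, is reachable. Any two majorities overlap in at
// least one node, which is what prevents two conflicting truths.
function quorum(clusterSize) {
  return Math.floor(clusterSize / 2) + 1;
}

function canWrite(clusterSize, reachableNodes) {
  return reachableNodes >= quorum(clusterSize);
}

canWrite(5, 3); // true:  3 of 5 is a majority
canWrite(5, 2); // false: the minority side of a partition cannot write
```

This is also why consensus clusters usually have an odd number of nodes: a 4-node cluster tolerates the same single failure as a 3-node one, but needs a larger quorum.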
Cool, right? The problem with Paxos (and the main reason it has become a dirty word) is that the implementation details are extremely complex. Two of the reasons for this complexity are that 1) it tries to avoid Byzantine faults (faults due to malfunctioning or malicious nodes) and 2) any node can be the leader of a given transaction.
A simplified version of Paxos named Raft was recently created by Diego Ongaro during his PhD at Stanford University. His paper describes a much simpler algorithm that has been proven to provide the following characteristics (and I quote):
- it ensures safety under all non-Byzantine conditions, including network delays, partitions and packet loss, duplication and reordering.
- a system is fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients.
- they do not depend on timing to ensure the consistency of the logs.
- in a common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls, which means that a minority of slow servers does not impact the overall system performance.
The main differences from Paxos are (and I quote):
- Strong leader: Raft uses a stronger form of leadership. For example, log entries only flow from the leader to other servers.
- Leader election: Raft uses randomised timers to elect leaders.
- Membership changes: Raft's mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions.
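The randomised election timers from the second point are one of Raft's simplest ideas, and can be sketched in a few lines (the 150–300ms range is the one the paper suggests):

```javascript
// Each follower picks a random election timeout, so it is unlikely
// that two followers time out at the same instant and split the vote.
function electionTimeout(minMs = 150, maxMs = 300) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs));
}

// A follower that hears nothing from a leader for this long
// becomes a candidate and starts an election for a new term.
const timeout = electionTimeout();
```

Because the timeouts are staggered, usually only one node stands for election at a time, and elections converge quickly.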
The main point of departure from Paxos is, to me, the presence of a strong leader, which means that, during a whole term, one node in the cluster does the heavy lifting, probably making the load profile uneven across nodes.
But, by giving up some requirements, Raft remains safe to use and, very importantly, is much easier to implement.
The Node Story
Having read the Raft paper some 10 times, I set out to build a working version for Node.js, using LevelDB as the persistence layer. The goal is to have an in-process database that reads locally, much like levelup, and that communicates with the cluster for writes.
Currently I'm at the stage where the tests for some intricate corner cases are green, but I know I still have a long way to go. If you're curious or want to help out, head over to github.com/pgte/skiff and try it out!