
Call Me Maybe: FoundationDB vs. Jepsen

Mar 25, 2014 - by Jennifer Rullmann

TL;DR:

Results of running Jepsen against FoundationDB

Developer Aphyr made waves in the database community last summer when he released a series of blog posts titled "Call Me Maybe" that revealed fault tolerance problems in several databases, including MongoDB and Riak. Aphyr’s open-source durability testing tool, Jepsen, found that in the face of a network partition, some databases will acknowledge writes but then lose data. That’s pretty upsetting when you expect your database to keep the data it says it stored.

The team here at FoundationDB asked Aphyr to test FoundationDB, but he’s moved on to other projects. FoundationDB is a multi-model database with amazing performance, scalability, ACID transactions, and fault tolerance. Since I’m new to the company (you can read my first post here), I thought running Jepsen against FoundationDB myself would be a good way to learn about our database.

This blog post is about what I learned from running Jepsen against FoundationDB, while sitting in a room with the developers that built the database. I’ll discuss how Jepsen works in general, how FoundationDB should respond in theory, and then use an internal logging tool to show you how FoundationDB actually behaves when Jepsen tests it. You can expect lots of great detail, but if you’re just interested in the results you can skip to the end.

The Plan

Let’s start by reviewing what Jepsen does, and how we expect FoundationDB to respond.

THE JEPSEN TEST

By default, Jepsen expects a cluster of five nodes:

Five nodes with one master

Most of the databases that Jepsen tests have a master node that accepts writes and copies data to the rest of the cluster, which is reflected in the drawing.

When the test begins, Jepsen spins up a bunch of concurrent threads. Each thread both attempts to write to the database and keeps an in-memory record of whether the database acknowledged the write or not.
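
Jepsen itself is written in Clojure, but the bookkeeping each writer thread does boils down to something like this rough Python sketch (the names and structure are illustrative, not Jepsen’s actual code):

import threading

# Illustrative sketch: each writer thread attempts writes and records
# whether the database acknowledged each one.
acknowledged = set()    # writes the database said it stored
unacknowledged = set()  # writes that errored out or timed out
lock = threading.Lock()

def writer(values, do_write):
    """do_write(value) returns True if the database acknowledged the write."""
    for value in values:
        try:
            ok = do_write(value)
        except Exception:
            ok = False
        with lock:
            (acknowledged if ok else unacknowledged).add(value)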

The master is partitioned from the majority

In the middle of doing these writes, Jepsen simulates a network failure, partitioning the database cluster and separating the master node from the majority of the cluster.

The various databases Aphyr has tested respond to this situation in different ways - you can review them in detail here. Most, like MongoDB and Redis, stop accepting writes while they elect a new master on the majority side of the partition. Ideally the cluster looks like this after the roles have changed:

A new master is selected on the majority side of the partition

Some databases, like Riak, will continue to accept writes on both sides of the partition, and then attempt to merge the data after the partition is healed.

After a while, Jepsen heals the partition by restoring network connectivity. At this point both the old and the new master can see the entire cluster. Some of the databases that Aphyr tested lost data at this point, as they tried to reconcile the transaction logs of the two masters.

When all the writes have finished, Jepsen compares the writes that were acknowledged by the database to the actual data in the database. Any discrepancies mean that the database was inconsistent. Worse, any lost data means that you can’t rely on the database to durably persist your data.
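
Continuing that sketch, the final check is just set arithmetic over the recorded outcomes and whatever can actually be read back from the database afterwards (again a rough illustration, not Jepsen’s code):

def check(acknowledged, unacknowledged, stored):
    # Acknowledged writes missing from the database: data loss.
    lost = acknowledged - stored
    # Writes the database never acknowledged but stored anyway.
    found = stored & unacknowledged
    print("acknowledged writes lost:", len(lost))
    print("unacknowledged writes found:", len(found))
    return lost, found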

FOUNDATIONDB’S EXPECTED BEHAVIOR

Let’s talk about how FoundationDB is expected to behave in the face of the network partition that Jepsen is creating.

Many distributed databases have the concept of roles that a node can take - such as primary and secondary. FoundationDB has roles too, which include master, transaction log, and cluster controller. There are more roles, but these are the most important players in Jepsen’s test.

The cluster controller is responsible for handing out roles to the other nodes in the cluster. The master node accepts writes, sending acknowledged responses to clients when a write has been durably persisted to all the transaction log nodes. A transaction log node is responsible for ensuring that writes are durably persisted to disk.

A five machine cluster with the recommended configuration can look like this:

A five node FoundationDB cluster

This cluster has one master, one cluster controller, and three transaction log nodes. When a write request is received by the master, the master will send the data to the three transaction log nodes. Only when all three have durably persisted the data to disk does the master acknowledge the write. At this point, you may be worried that the master is a single point of failure for the cluster - a concern in many distributed databases. The key difference for FoundationDB is that the master stores no state, so any node can take on its role at any moment. The cluster controller (also stateless) keeps an eye on the nodes, and will reassign roles - including the master role - if a node is unable to perform its duties. Similarly, if the cluster controller is unable to perform its duties, a new one will be elected by a quorum of coordinator nodes (another role) using a Paxos-like algorithm. This means there’s no single point of failure in the system.
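
A toy model of that acknowledgement rule - not FoundationDB’s actual implementation, just the invariant described above - looks like this:

class TransactionLog:
    # Stand-in for a transaction log process in this toy model.
    def __init__(self, reachable=True):
        self.reachable = reachable
        self.disk = []

    def persist(self, data):
        if not self.reachable:
            return False        # partitioned away: no durable acknowledgement
        self.disk.append(data)
        return True

def commit(data, tlogs):
    # The master acknowledges a write only after every transaction log
    # reports it durable, so an acknowledged write can never exist solely
    # on the minority side of a partition.
    acks = [tlog.persist(data) for tlog in tlogs]
    return all(acks)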

When Jepsen partitions the network it will separate two nodes from the other three. I’ve configured the cluster to ensure that the master will be on the minority side of the partition, to make the test more interesting.


The master and one transaction log machine are separated from the majority

A couple of things should happen when the partition occurs:

  • The master should detect that it can no longer reach all three of the transaction log nodes, and stop accepting write requests
  • The cluster controller should detect that it can no longer reach the master, and select a new one on the majority side of the partition
  • The cluster controller should select new transaction log nodes (this always happens when a new master is chosen)
  • The new master should start accepting writes

After all of this role shuffling the cluster should look like this:

The minority nodes are offline, while the remaining nodes take on additional roles

Note that some nodes have more than one role - that’s very typical of a cluster of any size.

If all goes well, this process means that no acknowledged writes will be lost.

When Jepsen heals the partition later in the test the database should begin utilizing the entire cluster again. By following a similar process to the above, it shouldn’t lose any data.

MAGNESIUM

I’ll be using our internally built tool, Magnesium, during the test to show you exactly what FoundationDB is doing. Magnesium uses the log files produced by each node to graph writes per second and role changes (among other things).

Here’s an example screenshot of Magnesium:


Example screenshot

It’s showing a few things graphed over time:

  • TransactionsGraph, showing total transactions per second
  • ServerFailures, showing the IP addresses of servers the cluster thinks are down
  • Roles, showing roles by machine
  • AllEvents, a table of events, filtered down to just those that happened on 192.168.0.27 (the master)

Each graph uses time as its x-axis, at identical scales. This makes it easy to correlate interesting events. The table of events has a Time column as well, and when we select a row - as I’ve done in the screenshot - Magnesium marks the corresponding timestamp on each graph with a blue vertical line. The line helps us relate all of the information provided.

The Actual Test

We’ll start by installing FoundationDB:

$ salticid foundationdb.setup

This will download the server onto the five nodes and configure the first node as the master. It will also set up triple replication and utilize all five nodes as coordinators.
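
If you were configuring the cluster by hand instead of with salticid, roughly the same settings can be applied from fdbcli (shown for illustration; coordinators auto lets the cluster pick coordinators, whereas listing all five addresses explicitly makes every node a coordinator):

$ fdbcli
fdb> configure triple
fdb> coordinators auto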

Since working with FoundationDB requires a client and a server, I’ve installed the client on the machine that I’m running Jepsen with, and then downloaded the cluster file from one of the servers:

$ scp n1:/etc/foundation/fdb.cluster fdb.cluster

Now the client on my machine will use the database cluster - here’s the FoundationDB cluster status:


I’ve written a simple application which writes a key/value pair for each write request. It’s nearly identical to the Postgres application.
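
The test application is Clojure, like the rest of Jepsen, but the write it performs maps to something like this in FoundationDB’s Python bindings (the API version and key layout are illustrative):

import fdb

fdb.api_version(200)          # illustrative; match the version your client ships with
db = fdb.open('fdb.cluster')  # the cluster file copied down from the server

@fdb.transactional
def add_element(tr, n):
    # One unique key per write request, matching this first test's workload.
    tr[fdb.tuple.pack(('jepsen', n))] = b''

add_element(db, 42)           # the decorator supplies the transaction and retry loop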

Jepsen will use that application when I run it:

$ lein run foundationdb -n 6000

Let’s peek behind the curtains with Magnesium to see what’s going on in the database. You can see in the transactions per second graph when Jepsen starts hammering away with its writes:

Transactions per second increases as Jepsen starts

After running for a bit, Jepsen partitions the cluster, separating the master from the majority. Suddenly unable to sync to all three transaction log machines, the master stops accepting write requests and demotes itself:


The cluster responds to the partition

We’re using the FoundationDB API’s automatic transaction retry loop, so Jepsen shows increased latency instead of unacknowledged writes (the colored bars show time to complete the write request):


We retry the failed transactions until the cluster begins accepting writes again
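
That retry loop is what the language bindings generate for you; written out by hand in the Python bindings it looks roughly like this (continuing from the sketch above), which is why a failed attempt shows up as extra latency rather than an error:

def write_with_retry(db, key, value):
    tr = db.create_transaction()
    while True:
        try:
            tr[key] = value
            tr.commit().wait()
            return              # acknowledged
        except fdb.FDBError as e:
            # on_error waits with backoff and resets the transaction if the
            # error is retryable; otherwise it re-raises.
            tr.on_error(e).wait()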

While the client keeps retrying write requests, the cluster controller chooses a new master and transaction log nodes. In the below screenshot the blue line indicates when the old master demotes itself, and the yellow line indicates when a new master is chosen. Between these two lines no write requests are accepted by the cluster. I’m also showing some additional information in the mouse hover of the roles graph, where you can see the cluster controller has selected 192.168.0.63 as the new master:


Writes resume once a new master and transaction log machines are selected

After a few seconds the cluster has finished its role juggling and the new master starts accepting write requests. Jepsen sees the latency decrease:


Writes are no longer rejected

Things are once again running smoothly in the cluster. A little bit later, Jepsen heals the network partition. Some of the other databases tested encountered the classic split brain problem at this point, as they tried to reconcile their transaction logs. Because FoundationDB’s old master stopped accepting writes as soon as it couldn’t properly sync its data, it doesn’t have this issue. Seeing all the nodes come online, the cluster juggles the roles again:


Cluster switches to using the original master when it rejoins the majority

And Jepsen sees a bounce in latency:


Writes are stopped while the original master is promoted, then resume

FoundationDB is changing roles again because the old master, which I had configured the database to prefer, has become available again. In clusters where the roles are automatically chosen - which is how FoundationDB works by default - you wouldn’t necessarily see the roles change when a partition is healed.

After a while Jepsen finishes the test and prints the results:


And we’re all done!

Results and Limitations

As you can see in the screenshots above, FoundationDB had 0% data loss. Awesome!

As Aphyr pointed out when I sent him a preview of this post, though, that workload wasn’t much of a test of ACID, since each element was stored as its own key/value pair. So I added another test in which every thread writes to the same key/value pair, so that every transaction could potentially conflict. I scaled back the writes per second to give my poor laptop a chance to churn through the test:

$ lein run foundationdb-append -n 2000 -r 2
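
The append workload boils down to a read-modify-write on one shared key; in the Python bindings it would look roughly like this (the actual test, again, is Clojure):

@fdb.transactional
def append_element(tr, n):
    # Every thread reads and rewrites the same key, so concurrent
    # transactions conflict with each other and have to retry.
    key = fdb.tuple.pack(('jepsen', 'all'))
    if tr[key].present():
        elements = list(fdb.tuple.unpack(tr[key]))
    else:
        elements = []
    elements.append(n)
    tr[key] = fdb.tuple.pack(tuple(elements))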

Latencies get significantly worse, especially after the partition, as all the threads attempt to read and write to the same key:

The massive conflicts cause an increase in latency

And the results:

All acknowledged writes were safely stored

Even when all the threads write to the same key, and in the face of a network partition, FoundationDB remembers all 2,000 writes.

You might have noticed that the results always have zero unacknowledged writes, even though a partition occurred. That’s because my test is using the FoundationDB API’s automatic transaction retry loop. Let’s see what it looks like when I don’t retry failed transactions:

$ lein run foundationdb-append-noretry -n 2000 -r 2

We see a bunch of uncommitted transaction errors in the log file:

FoundationDB throws a 'not committed' error when a conflict occurs

These errors occur when two or more threads attempt to concurrently write to the same key. When we used the automatic transaction retry loop, FoundationDB’s API was catching this error and retrying it for us, until the write eventually succeeded.

And the results:

All acknowledged writes were safely stored

As you can see, FoundationDB didn’t lose any data - but it also acknowledged very few writes. That’s because many of the threads conflicted with each other by concurrently reading and writing to the same key. FoundationDB provides isolation - the I in ACID - by rejecting a transaction at commit time when another transaction has concurrently written to a key it read. Typically an application would handle these rejections and retry the transaction - or simply use FoundationDB’s automatic transaction retry loop.

Another interesting result of this test is the one unacknowledged write. This means that Jepsen found data that the database didn't confirm was stored. This happened during the partition, as the client lost contact with the component of the database it was talking to. The FoundationDB API will return an unknown result in that case, so you can simply retry the transaction. In fact, FoundationDB's automatic transaction retry loop will do that for you, which is why we didn't see any unacknowledged writes in the other two tests.
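
Skipping the retry loop amounts to committing once and surfacing whatever error comes back. A rough sketch, reusing append_element from above (the error codes in the comment are my reading of the API and should be treated as an assumption):

def append_once(db, n):
    # Passing a Transaction (rather than the Database) to a
    # @fdb.transactional function runs it once, without the retry loop.
    tr = db.create_transaction()
    try:
        append_element(tr, n)
        tr.commit().wait()
        return True             # acknowledged
    except fdb.FDBError as e:
        # A conflict surfaces as 'not committed' (code 1020); a commit
        # interrupted by the partition surfaces as 'commit unknown result'
        # (code 1021), which is what produced the one unacknowledged write
        # mentioned above.
        return False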

TEST ENVIRONMENT

All of these tests were run on a laptop with a 4 core i5 processor at 2.30 GHz, with 8 GB of RAM. The FoundationDB servers ran on the same laptop in LXC containers.

SUMMARY OF JEPSEN TEST RESULTS

I thought it was interesting to look at all the database results from Aphyr’s previous tests in one place, but it’s important to keep a couple of things in mind. First, these tests were run last summer, and many of these databases have been updated since then. Second, although they report the same statistics, what they measure internally is very different. You can see below that some databases have multiple entries, because Aphyr tried tuning them to different failure modes for his blog discussion. Finally, as Aphyr mentioned to me, you can get wildly varying results from run to run.

Database                                                   | Acknowledged Writes Lost | Unacknowledged Writes Found | Loss Rate
Postgres                                                   | 0                        | 2                           | 0.00%
Redis with custom sentinel                                 | 1126                     | 0                           | 56.36%
Redis redux                                                | 91                       | 90                          | 45.27%
MongoDB with unsafe write concern                          | 2381                     | 0                           | 41.77%
MongoDB with safe write concern                            | 2208                     | 0                           | 37.42%
MongoDB with replicas write concern                        | 1927                     | 0                           | 33.84%
MongoDB with majority write concern                        | 2                        | 3                           | 0.02%
Riak with last-write-wins (healthy cluster, no partitions) | 1434                     | 0                           | 71.7%
Riak with sloppy quorum                                    | 1815                     | 6                           | 91.44%
Riak with strict quorum                                    | 1807                     | 6                           | 91.68%
Riak with CRDT                                             | 0                        | 0                           | 0.00%
Zookeeper                                                  | 0                        | 0                           | 0.00%
NuoDB (did not acknowledge writes during partition)        | 0                        | --                          | 0.00%
Kafka                                                      | 520                      | 1                           | 52.69%
Cassandra with lock                                        | 285                      | 0                           | 28.25%
Cassandra with CQL                                         | 0                        | 0                           | 0.00%
Cassandra with counters                                    | 0                        | 221                         | 0.00%
Cassandra with isolation                                   | 58                       | 101                         | 0.59%
Cassandra with lightweight transactions                    | 3                        | 1                           | 0.36%
FoundationDB                                               | 0                        | 0                           | 0.00%


I was pretty pumped about FoundationDB’s performance. But, as I was told when I excitedly showed the rest of the engineering team the results, Jepsen is a very limited kind of test by our standards. There are many types of failure scenarios we test - like disk problems and far more diabolical network partitions - that it doesn’t cover. Our own internal tests are much more comprehensive, so the team wasn’t at all surprised that Jepsen didn’t find problems. It seems that Aphyr agrees.

If you’re interested, you can learn more about our tests here.

But Really, What’s Wrong with FoundationDB?

So Jepsen didn’t uncover problems with FoundationDB - that just means we have to work harder to find them! Like all software, FoundationDB isn’t perfect, and there are always things it can do better.

It’s not likely that you could get FoundationDB to drop writes with Jepsen, since the database always chooses consistency over availability and wraps all operations in serializable transactions. But there may be times when FoundationDB should be available but isn’t. Cases like that might be uncovered by asymmetric partitions - where one node cannot see another node but can be seen by it. For example:

Asymmetric partition

In this situation, we can reason that the database should be up, but it may not be.
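
Jepsen creates its partitions with iptables, and a one-way rule like the one below (run on n1 only, with a placeholder for the other node's address) approximates that kind of asymmetry: n1 silently drops everything the other node sends, while the other node still accepts traffic from n1:

$ ssh n1 "iptables -A INPUT -s <other-node-ip> -j DROP"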

Separate from testing that FoundationDB does what we say it does - distributed ACID transactions - is the question of whether its design is a good choice for a particular application's needs. You can find lots of information to help inform that decision in our Known Limitations page.

What’s Next?

As a new engineer at FoundationDB, running Jepsen was an opportunity to learn about network partitions and how FoundationDB handles them. But beyond that, I think the existence of Jepsen, and the popularity of Aphyr's blog posts, are indicators that engineers are demanding more of database vendors than they have in the past. They're looking beyond the benchmarks created in vendor labs, and asking what happens when databases encounter the network partitions, frozen drives, and power outages that occur in real environments. It's an exciting time to join a team that makes fault tolerance a central part of their product design and testing strategies.

Interested in running Jepsen against FoundationDB yourself? You can check out our branch of Jepsen with support for FoundationDB here.
