A few weeks ago we posted our Fault Tolerance Demo Video and it got a lot of interest, making it to the top of Hacker News for most of a day. Since people found the demo cluster video interesting, we figured we’d show off an actual fault-tolerance testing tool we use here at FoundationDB - the worst, least reliable database cluster we could put together, aptly named Quicksand.
In the video below, one of our founders, Nick Lavezzo, shows off the Quicksand cluster and explains how we use it to continuously test FoundationDB in the real world.
A Quick Overview:
a cluster made up of eight consumer-grade machines
each running a different brand of SSD and a different flavor of Linux
connected via nine network switches to a “Quicksand Manager” server
The system is wired using the network topology above. The topology is deliberately unusual: it lets the Quicksand Manager server cause many different types of failures during its test cycle simply by powering off outlets on the power strips. These power interruptions are triggered at random, so a practically limitless number of failure scenarios can occur; a sketch of how the manager might drive the outlets follows the list. Some common scenarios:
Powering off any of the database servers.
Powering off any of the “top” network switches - this causes network unavailability for each of the servers connected to it.
Powering off any of the “bottom” network switches - this creates a partition: the servers connected to the same top switch can still communicate with each other, but not with the rest of the cluster.
Combinations of the above.
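To make this concrete, here’s a minimal sketch of how a manager process might drive those outlets. Everything here is hypothetical: real networked power strips speak SNMP, HTTP, or telnet rather than the made-up one-line protocol below, and the outlet wiring and the 4/5 split of the nine switches are illustrative.

```python
import random
import socket

class PowerStrip:
    """Hypothetical driver for a network-controllable power strip.
    The one-line text protocol below is a made-up stand-in for
    whatever a real PDU actually speaks (SNMP, HTTP, telnet)."""

    def __init__(self, host, port=2300):
        self.host = host
        self.port = port

    def set_outlet(self, outlet, on):
        cmd = "outlet {} {}\n".format(outlet, "on" if on else "off")
        with socket.create_connection((self.host, self.port)) as conn:
            conn.sendall(cmd.encode())

# Hypothetical wiring: (component name, outlet number) for everything
# in the topology, simplified onto a single strip.
strip = PowerStrip("pdu-1.quicksand.example")
SERVERS         = [("server-%d" % i, i) for i in range(1, 9)]              # outlets 1-8
TOP_SWITCHES    = [("top-switch-%d" % i, 8 + i) for i in range(1, 5)]      # outlets 9-12
BOTTOM_SWITCHES = [("bottom-switch-%d" % i, 12 + i) for i in range(1, 6)]  # outlets 13-17

def inject_random_failure():
    """Cut power to a random component, matching the failure
    categories listed above."""
    kind = random.choice(["server", "top_switch", "bottom_switch", "combination"])
    if kind == "combination":
        targets = random.sample(SERVERS + TOP_SWITCHES + BOTTOM_SWITCHES, 2)
    else:
        pool = {"server": SERVERS,
                "top_switch": TOP_SWITCHES,
                "bottom_switch": BOTTOM_SWITCHES}[kind]
        targets = [random.choice(pool)]
    for name, outlet in targets:
        strip.set_outlet(outlet, on=False)
    return targets  # so the caller can restore power after a randomized delay
```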
The Quicksand Testing Cycle
During each testing cycle, the Quicksand Manager server:
Ensures all machines are turned on.
Performs a clean install of FoundationDB on each machine, and configures them all as one cluster.
Starts pushing client read/write requests to the database for 10 minutes, simulating a real-life workload.
Randomly “perturbs” the Quicksand cluster machines, for 10 minutes.
Shuts down the cluster, then collects and analyzes the log files. (A sketch of the full cycle follows.)
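Here’s what one cycle might look like in code. All of the helper functions are hypothetical stand-ins for the steps above, and we’re assuming the perturbations run concurrently with the 10-minute workload (as the next section describes).

```python
import random
import threading
import time

CLUSTER_MACHINES = ["server-%d" % i for i in range(1, 9)]

# Hypothetical stubs for the cluster operations described above.
def power_on_everything(): pass        # step 1: restore every PDU outlet
def clean_install(machine): pass       # step 2: fresh FoundationDB install
def configure_cluster(machines): pass  #         join all machines as one cluster
def perturb_randomly(): pass           # step 4: see the next section
def shut_down_cluster(): pass          # step 5
def collect_and_analyze_logs(machines): pass

def run_workload(stop):
    """Step 3: simulated client traffic. A real workload would issue
    reads and writes through a FoundationDB client binding."""
    while not stop.is_set():
        time.sleep(0.1)

def run_test_cycle():
    power_on_everything()
    for machine in CLUSTER_MACHINES:
        clean_install(machine)
    configure_cluster(CLUSTER_MACHINES)

    stop = threading.Event()
    workload = threading.Thread(target=run_workload, args=(stop,))
    workload.start()

    deadline = time.time() + 10 * 60       # the 10-minute test window
    while time.time() < deadline:
        perturb_randomly()
        time.sleep(random.uniform(5, 60))  # hypothetical pacing

    stop.set()
    workload.join()
    shut_down_cluster()
    collect_and_analyze_logs(CLUSTER_MACHINES)
```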
What does “perturb” mean in this context? During each 10-minute test run, the Quicksand Manager server randomly does any of the following to the cluster:
Power a server machine off and on (randomized duration).
Power a network switch off and on (randomized duration).
Freeze the fdbserver process.
Directly allocate all free disk space on a server (making the server run out of disk space).
Quicksand performs several of these perturbations, chosen at random, during each testing run; the process-freeze and disk-fill perturbations are sketched below.
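The power perturbations reuse the outlet control sketched earlier; the process freeze and disk exhaustion map onto standard Linux mechanisms. Here’s a sketch, assuming it runs on the affected server itself (the manager would invoke it remotely), with the ballast-file path chosen purely for illustration.

```python
import os
import signal
import subprocess
import time

def fdbserver_pid():
    """Find the fdbserver process on this machine via pidof."""
    return int(subprocess.check_output(["pidof", "fdbserver"]).split()[0])

def freeze_process(pid, seconds):
    """Freeze a process with SIGSTOP, then resume it with SIGCONT.
    SIGSTOP cannot be caught or ignored, so the process simply stops
    being scheduled until it is resumed."""
    os.kill(pid, signal.SIGSTOP)
    time.sleep(seconds)
    os.kill(pid, signal.SIGCONT)

BALLAST = "/var/lib/foundationdb/ballast"  # illustrative path on the data disk

def fill_disk(path=BALLAST):
    """Consume all free space on the filesystem containing `path` by
    writing zeros until the kernel reports the disk is full (ENOSPC)."""
    block = b"\0" * (1 << 20)              # one megabyte at a time
    with open(path, "wb", buffering=0) as f:
        try:
            while True:
                f.write(block)
        except OSError:                    # ENOSPC: the disk is now full
            pass

def release_disk(path=BALLAST):
    """Delete the ballast file to give the space back after the run."""
    os.remove(path)
```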
Analysis & Validation
After each 10-minute testing run, the Quicksand Manager server analyzes its log files and the contents of the FoundationDB cluster to validate that FoundationDB kept its guarantees in two primary areas:
Availability - Based upon its knowledge of which components of the system were unperturbed at each point in the testing cycle, the Quicksand Manager server builds a model of when the FoundationDB database should have been available to process client requests. The actual availability of the database during the testing cycle is compared to this expected availability.
ACID Compliance - Based upon its knowledge of the transactions that were acknowledged by the FoundationDB database as “committed”, the Quicksand Manager server builds a model of what each key and value in FoundationDB should be. This model is compared to the actual contents of the database. (Both checks are sketched below.)
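In pseudocode, the two checks amount to diffing reality against a model. The log formats and the `read_all_kv` callable are assumptions we’re making for the sketch, and the ACID check is simplified: a real checker must also account for transactions whose commit was in flight when a failure hit, since their outcome is legitimately unknown to the client.

```python
def expected_database(committed_transactions):
    """Replay, in commit order, every write the cluster acknowledged as
    committed to build a model of what the database must now hold.
    Each transaction is assumed to be a dict of key -> value, with
    None meaning the key was cleared."""
    model = {}
    for txn in committed_transactions:
        for key, value in txn.items():
            if value is None:
                model.pop(key, None)
            else:
                model[key] = value
    return model

def check_acid(committed_transactions, read_all_kv):
    """Compare the model against the database's actual post-test
    contents, fetched by the hypothetical read_all_kv() callable."""
    model = expected_database(committed_transactions)
    actual = dict(read_all_kv())
    assert actual == model, "ACID violation: database diverged from model"

def check_availability(expected_up_intervals, request_log):
    """expected_up_intervals: [(start, end)] spans in which enough of the
    cluster was unperturbed that FoundationDB should have been available.
    request_log: [(timestamp, succeeded)] entries from the test workload."""
    def should_be_up(t):
        return any(start <= t <= end for start, end in expected_up_intervals)
    for timestamp, succeeded in request_log:
        if should_be_up(timestamp):
            assert succeeded, "Availability violation at %s" % timestamp
```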
Results Thus Far
Let’s get to what matters - what the results of our testing with Quicksand have been:
ZooKeeper Deemed Unreliable - In our first-ever test run of Quicksand, about a year ago, a bug in ZooKeeper (which we were using for cluster coordination) caused the cluster to become permanently unavailable. We hunted down the bug that caused the failure, but realized that we didn’t want such a critical component to go untested by our simulation test runs. So…
We built our own in-house Paxos cluster coordinator in Flow, which lets us subject it, along with the rest of our code base, to tens of thousands of simulated tests per night. Since deploying this new coordinator, we have not encountered any bugs with it in the real world.
No violations of our ACID guarantees have been encountered.
No violations of our Availability guarantees have been encountered since replacing ZooKeeper.
Overall, these results have validated, in our minds, the usefulness of our deterministic simulation testing. Since the only bugs we encounter in simulation now occur roughly once in 5 million runs (or are bugs in freshly committed code that are quickly identified), it makes sense that we would not have encountered any bugs in the real world. This gives us, and hopefully our users, great peace of mind about the strength of FoundationDB’s guarantees and the overall quality of its engineering.