Cassandra 2.0 was released a few days ago, and its headline feature is a compare-and-swap (CAS) operation. I’m happy to see that Cassandra has added CAS, which is a very useful capability, and that makes the world a little better. And the database world needs all the “better” it can get. I’m less happy to see the feature referred to as “lightweight transactions”, because that label is misleading and makes the world a little more confusing. And the database world does not need any more confusion.
Compare-and-swap is not a general transaction facility. At bottom, transactions give an application the ability to do multiple operations “together”, without interference from other concurrent access to the database. Compare-and-swap provides a special case of this: the ability to do two operations “together”, one of which is a read from a row (or “partition”, in Cassandra’s terms) and the other a write to the same row. Being a special case of a transaction is not very exciting in and of itself, because transactions are so general: just about anything is a special case of a transaction.
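To make the special case concrete, here is a minimal in-process sketch of compare-and-swap as exactly that two-operation transaction: one read and one conditional write, made atomic together. (Illustrative only; the store, its names, and the cas signature are invented for this post, not Cassandra’s or anyone else’s API.)

```python
import threading

class TinyStore:
    """Illustrative key-value store whose only atomic primitive is CAS."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def cas(self, key, expected, new):
        # A read and a write "together": the one special case of a
        # transaction that compare-and-swap gives you.
        with self._lock:
            if self._data.get(key) != expected:
                return False  # someone else got there first
            self._data[key] = new
            return True

store = TinyStore()
assert store.cas("x", None, 1)   # x was absent, so this succeeds
assert store.cas("x", 1, 2)      # x was 1, so this succeeds
assert not store.cas("x", 1, 3)  # x is now 2, so this fails
```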
Transactions are useful for a huge range of purposes: maintaining and validating invariants, things that are always true of the data in the database (like a friend of a user always having that user as a friend, a product that’s present in an order always being present in the collection of products, or a user having a unique username and e-mail address); maintaining useful data structures like scalable and consistent indexes; implementing one data model in terms of another; moving a record from one collection to another. Of these examples, uniqueness checks are the only one that’s obviously implementable with compare-and-swap alone, and even those come with caveats and complications if there is more than one uniqueness constraint on the same object.
Implement the following function correctly and scalably with Cassandra’s CAS operations. Don’t forget that the client can fail…
Users = directory.create_or_open(db, "users")
Emails = directory.create_or_open(db, "user_emails")

@fdb.transactional
def addUser(tr, username, email, name):
    # If this exact record already exists, a retry succeeds silently.
    if tr[Users[username]] != fdb.tuple.pack((email, name)):
        if tr[Users[username]] != None:
            raise Exception("Username already in use")
        if tr[Emails[email]] != None:
            raise Exception("Email already in use")
        tr[Users[username]] = fdb.tuple.pack((email, name))
        tr[Emails[email]] = username
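To see why the client-failure caveat bites, here is an illustrative in-process sketch (invented names, not Cassandra’s actual API) of the naive approach: two independent single-row CAS operations. Each CAS is atomic on its own row, but nothing makes the pair atomic, so a crash between them strands a half-written user.

```python
store = {}

def row_cas(key, expected, new):
    # Atomic per key only, like a per-row/partition compare-and-swap.
    if store.get(key) != expected:
        return False
    store[key] = new
    return True

def add_user_naive(username, email, name):
    if not row_cas(("user", username), None, (email, name)):
        raise Exception("Username already in use")
    # If the client crashes right here, the username row exists but the
    # email row does not; no retry of the whole function can tell an
    # orphaned half-write apart from a genuinely taken username.
    if not row_cas(("email", email), None, username):
        raise Exception("Email already in use")

add_user_naive("alice", "a@example.com", "Alice")
```

With a real transaction, the two writes commit or abort together, so this window simply does not exist.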
Transactions also make your application correct by default under concurrency: if your application’s transactions work right in any sequential order, then with a serializable database they also work under any concurrent execution. Some applications may need to relax serializable consistency for performance in specific cases, but it’s much better to do that, and deal with the much more challenging requirements of getting concurrency right with even slightly weaker guarantees, only where you have actual predicted or measured performance problems. Compare-and-swap doesn’t provide that same peace of mind, even when it is implemented efficiently enough that you can use it ubiquitously (as Cassandra does not currently do). Think about how much harder it is to write concurrent shared-memory software using only CAS-like features (http://en.wikipedia.org/wiki/Non-blocking_algorithm) than to write it using synchronization primitives (or even better, with transactional memory, if you are lucky enough to use an environment that supports it).
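As a small taste of that difficulty, here is an illustrative sketch (in-process Python, invented names) of the retry loop that CAS-only concurrency pushes into every writer. With a transaction, or even a lock, the loop and its interference reasoning disappear from the application entirely.

```python
import threading

class Cell:
    """A single mutable value whose only write primitive is CAS."""
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()  # stands in for hardware/DB atomicity

    def cas(self, expected, new):
        with self._lock:
            if self.value != expected:
                return False
            self.value = new
            return True

def increment(cell):
    # The application owns the read-check-retry loop, and must reason
    # about interference at every step; a transaction would hide all this.
    while True:
        old = cell.value
        if cell.cas(old, old + 1):
            return

cell = Cell(0)
workers = [threading.Thread(target=lambda: [increment(cell) for _ in range(1000)])
           for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
assert cell.value == 4000  # no lost updates, but every writer paid for retries
```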
None of this says that CAS isn’t a very useful database feature, especially in the context of Cassandra’s data model and update resolution semantics! But I’m concerned (and I’m afraid DataStax is hoping) that many less sophisticated users will say “oh, OK, it has transactions” and not realize how much they are giving up relative to a truly transactional database until they actually get burned. Maybe this is a lost cause - there are other vendors marketing exactly the same CAS feature as “ACID transactions” - but I think we should use the term “transaction” only for the real thing.
It’s also worth talking about another issue: the effective fault tolerance that your application enjoys is usually no better than the fault tolerance of the weakest operation it depends on. So an application built on top of Cassandra can, in theory, be fully available in a minority partition, as long as it doesn’t need to do anything that requires quorum consistency or more. But if your application uses CAS for any common or important operation, it has effectively chosen CAP Consistency over CAP Availability and will not be available in minority partitions. You are getting, potentially, the worst of both worlds - most of your application (and the users thereof) has to deal with all the concurrency and consistency problems of an eventually consistent database, yet you get no better availability than you could get with a fully transactional database. It’s much better, in my opinion, to start with a tool that gives you strong guarantees by default, and then relax them selectively where there is a clear performance benefit and you have done the careful analysis needed to prevent bugs.
In fact, the fault tolerance of Paxos replication (the implementation of Cassandra’s CAS feature) is not the best you can do in practice. (This criticism applies to other systems that use Paxos replication for database contents, including those, like Google Spanner, that actually provide transactions.) Paxos maintains consistency during partitions by requiring a majority of replicas to be available, which means that a Paxos replica set with N replicas can only tolerate (N-1)/2 failed or partitioned replicas. So, for example, in a three-datacenter setup with triple replication, the loss of one datacenter plus one machine in another datacenter is likely to make the database unavailable. FoundationDB uses a Paxos variant for coordination, but stores only a negligible amount of metadata with Paxos replication, and it is cheap to keep more replicas of this tiny coordination state to handle more failures. Actual database contents are replicated using a variation of “good old” synchronous replication, and so can remain available through N-1 simultaneous failures of N replicas. That’s twice as many failures, and it is much faster, too.
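The quorum arithmetic above can be sketched in a couple of lines (illustrative only; the function names are mine):

```python
def paxos_tolerated_failures(n):
    # Paxos needs a majority of the n replicas, so it survives losing the rest.
    return (n - 1) // 2

def sync_replication_tolerated_failures(n):
    # Synchronous replication of the data itself can limp along on one survivor.
    return n - 1

# Triple replication, as in the three-datacenter example:
assert paxos_tolerated_failures(3) == 1             # datacenter + one machine is too many
assert sync_replication_tolerated_failures(3) == 2  # twice as many failures survived
```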
(Disclaimer: I’m not an expert on Cassandra or its implementation, and so it’s possible I’ve said something wrong about it. If so, please let me know.)