The Google F1 database paper just became generally available. It’s the database that runs AdWords, so I suspect Google kind of cares about it. Google F1 bears an amazing resemblance to our FoundationDB SQL layer. Reminder: FoundationDB is a scalable, fault-tolerant, and high-performing NoSQL database with true ACID transactions. Both leverage an underlying scalable and transactional storage substrate to allow unprecedented scalability at both the storage and SQL processing levels. Both leverage hierarchical schemas, arranging tables into a hierarchy that is mirrored in storage by interleaving rows of parents with rows of children. Both systems make the tradeoff that higher latency is acceptable for gaining almost unlimited bandwidth and scale. A query may require more round-trips and take a (bounded) longer time, but the overall throughput of the system, measured in number of queries that can run in parallel, is much higher.
The fact that Google has built and is using F1 for their AdWords business demonstrates how the architectural approach works for even the most demanding of applications. However, F1 is only available to Google’s internal teams, while our open-source FoundationDB SQL Layer is intended to have a somewhat broader audience.
Below are seven of the strongest similarities; All quotes come directly from the above mentioned Google F1 paper. Read on and make up your own mind.
1. Built on a strong storage substrate:
“F1 is built on top of Spanner , which provides extremely scalable data storage, synchronous replication, and strong consistency and ordering properties. F1 inherits those features from Spanner and adds several more… Spanner handles lower-level storage issues like persistence, caching, replication, fault tolerance, data sharding and movement, location lookups, and transactions.”
2. Stateless SQL:
Because all state is in the consistent storage substrate, both SQL engines retain no state of their own. As a result SQL nodes can be easily added or removed, scaling overall SQL capacity. This allows very strong scalability and load balancing.
“F1 servers are mostly stateless, allowing a client to communicate with a different F1 server for each request.”
3. Insisting on truly transactional systems:
Both systems are strongly consistent and support ACID properties even for complex multi-object transactions. A lot has been written about why transactions are important. To anyone who is still confused I’d simply suggest an exercise of trying to write code to maintain an index, in a system that is not strongly consistent, while many users are changing the data.
“We also have a lot of experience with eventual consistency systems at Google. In all such systems, we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date. We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level. Full transactional consistency is one of the most important properties of F1.”
4. Limited transaction duration:
In order to avoid the need to retain full logging history, both systems limit transaction duration to 5-10 seconds. In FoundationDB this may be increased over time but the intent is to explicitly disallow transactions that change data to run for long periods of time. Like Google we believe that this limit should suffice for most application types.
“Spanner provides a global safe timestamp, below which no in-flight or future transaction can possibly commit. The global safe timestamp typically lags current time by 5-10 seconds.”
5. Hierarchical schema - a relational schema with an explicit table hierarchy:
Both systems rely on clustering related rows from different tables. In the FoundationDB SQL Layer we call the concept Table-Groups. In essence, Table-Groups define and store nested data. Table-Groups organize a set of user-defined tables into a hierarchy that reflect an object structure, e.g. a CUSTOMERS table will be the parent of ORDERS table, which will be the parent of the ITEMS table. This Table-Group definition is then reflected in the physical organization of data. Related rows from the tables are stored in a hierarchy, interleaved according to the Table-Group structure, e.g. a CUSTOMER row is stored followed by its first ORDER row, followed by that order’s ITEMS, followed by the next ORDER and ITEMS for that customer, thereby reflecting the hierarchy of the tables.
“At the logical level, F1 has a relational schema similar to that of a traditional RDBMS... Logically, tables in the F1 schema can be organized into a hierarchy.”
“Physically, F1 stores each child table clustered with and interleaved within the rows from its parent table. Tables from the logical schema cannot be arbitrarily interleaved: the child table must have a foreign key to its parent table as a prefix of its primary key... All of these properties of a hierarchical schema help mitigate the latency effects of having remote data... This clustering was critical to achieving acceptable latency.”
Source for images: http://research.google.com/pubs/pub38125.html
Taking advantage of this hierarchical row storage requires operators that are aware of this collocation…
“This data model allows us to efficiently join a parent table and a descendant table by their shared primary key prefix… F1 uses a merge-join-like algorithm which we call cluster join. The cluster join operator only needs to buffer one row from each table, and returns the joined results in a streaming fashion as the Spanner input data is received.”
Our SQL Layer takes this much further. Many more operators become ‘Table-Group’ or ‘Clustering’ aware. Our GroupScan operator is analogous to cluster-join. Other operators can leverage clustering information in indexes too, for example, the Index Intersection that can intersect indexes from any number of tables in a single Table-Group.
6. NoSQL and SQL interfaces live in harmony:
This is a direct result of the hierarchical schema that keeps an object in storage as a series of clustered rows. It is straightforward then to provide other interfaces to this data. The Google paper does not go into much detail on their API. The FoundationDB SQL Layer provides a RESTful API to the data as JSON objects in a fully consistent manner.
“F1 supports a NoSQL key/value based interface that allows for fast and simple programmatic access to rows. Read requests can include any set of tables, requesting specific columns and key ranges for each. Write requests specify inserts, updates, and deletes by primary key, with any new column values, for any set of tables. This interface is used by the ORM layer under the hood.”
7. Batch pipelining query execution for parallel execution and latency hiding:
Traditional relational databases operate on local data and optimize for disk access which was traditionally the high contention point. As a result, traditional execution engines approach query processing as a one row at a time affair. This is true for virtually all major RDBMSs today. A very high-bandwidth, over-the-network backend changes the picture completely. Each request may take longer but a very large number of requests can be serviced in parallel. Both F1 and FoundationDB SQL Layer employ similar techniques to batch many parallel requests to the backend amortizing latency and driving performance.
“…remote data accesses involve highly variable network latency. In contrast, traditional database systems generally perform processing on the same machine that hosts their data, and they mostly optimize to reduce the number of disk seeks and disk accesses. Network latency and disk latency are fundamentally different in two ways. First, network latency can be mitigated by batching or pipelining data accesses. F1 uses extensive batching to mitigate network latency. Secondly, disk latency is generally caused by contention for a single limited resource, the actual disk hardware. This severely limits the usefulness of sending out multiple data accesses at the same time. In contrast, F1’s network based storage is typically distributed over many disks, … so scheduling multiple data accesses in parallel often results in near-linear speedup until the underlying storage system is truly overloaded.”
So what are we really saying? That we've got an F1/Spanner like combination for the masses?
I guess so.