Bigtable is a widely applicable, scalable, distributed storage system for managing structured data at scales from small to very large, with high performance and availability. Many Google products, such as Google Analytics, Google Finance, Personalized Search, and Google Earth, use Bigtable for workloads ranging from throughput-oriented batch jobs to latency-sensitive serving of data to end users. While Bigtable shares many implementation strategies with other databases, it provides a simpler data model that gives clients dynamic control over data layout, format, and locality properties.
Bigtable is a sparse, distributed, persistent, multidimensional sorted map. The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
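A minimal sketch of this data model, using an invented Python dictionary keyed by (row, column, timestamp), illustrates the shape of the map (this is my own illustration, not Google's code):

```python
# Sketch of Bigtable's logical model: a sorted map from
# (row key, column key, timestamp) to an uninterpreted byte string.
table = {}

def put(row, col, ts, value):
    table[(row, col, ts)] = value

def get(row, col):
    """Return the newest version stored for (row, col), or None."""
    versions = sorted(
        (ts for (r, c, ts) in table if r == row and c == col),
        reverse=True,  # versions are kept in decreasing timestamp order
    )
    return table[(row, col, versions[0])] if versions else None

put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
put("com.cnn.www", "anchor:cnnsi.com", 12, b"CNN Sports")
```

A `get` without an explicit timestamp returns the most recent version, matching the decreasing-timestamp ordering described below.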
Row keys are arbitrary strings. Every read or write of data under a single row key is atomic. The total row range of a table is dynamically partitioned into contiguous subranges called tablets, which are the unit of distribution and load balancing.
Column keys are grouped into a small number of rarely changing column families, which must be created before data can be stored under any column key in the family. A column key follows the syntax family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. Access control and both disk and memory accounting are performed at the column-family level.
Each cell is timestamped either by Bigtable or by the application, and the multiple versions of a cell's data are stored in decreasing timestamp order. Two per-column-family settings determine automatic garbage collection: keep only the last n versions, or keep only versions written within the last n seconds, minutes, hours, etc.
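The two garbage-collection settings can be sketched as simple filters over a cell's version list (a hedged illustration, not the actual implementation):

```python
# Sketch of the two per-column-family garbage-collection settings:
# keep the last n versions, or keep versions newer than a time horizon.
def gc_last_n(versions, n):
    """versions: list of (timestamp, value) pairs; keep the n newest."""
    return sorted(versions, key=lambda v: v[0], reverse=True)[:n]

def gc_max_age(versions, now, max_age):
    """Keep only versions written within the last max_age time units."""
    return [v for v in versions if now - v[0] <= max_age]
```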
The following figure shows a single row of a table. The row key is "com.cnn.www"; there are two column families, "contents" and "anchor"; the "anchor" family holds two columns; and t3, t5, t6, etc. are timestamps identifying different versions of the same data.
The Bigtable API provides functions to create and delete tables and column families; to change cluster, table, and column family metadata such as access control rights; and to access individual values or iterate over and filter data by column name across multiple column families. It supports single-row transactions for atomic read-modify-write sequences on a single row key. It does not support general transactions across row keys, but it provides a client interface for batching writes across row keys. Sawzall scripts, developed at Google, provide data transformation, filtering, and summarization capabilities, but they are not allowed to write back to Bigtable. MapReduce wrappers allow Bigtable to be used both as an input source and as an output target for MapReduce jobs.
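A client-side mutation might look like the following sketch, loosely modeled on the paper's C++ example; the Python names `Table`, `RowMutation`, and `apply` are my own inventions, not a real client API:

```python
# Hypothetical client-library sketch: all operations batched in one
# RowMutation commit atomically against a single row.
class RowMutation:
    def __init__(self, row):
        self.row, self.ops = row, []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column))

class Table:
    def __init__(self):
        self.rows = {}

    def apply(self, mutation):
        # every op in the mutation is applied to one row as a unit
        row = self.rows.setdefault(mutation.row, {})
        for op, col, *val in mutation.ops:
            if op == "set":
                row[col] = val[0]
            else:
                row.pop(col, None)

t = Table()
m = RowMutation("com.cnn.www")
m.set("anchor:www.c-span.org", "CNN")
m.delete("anchor:www.abc.com")
t.apply(m)
```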
- The distributed Google File System (GFS) stores Bigtable logs and data files on a cluster of machines that also run a wide variety of other distributed applications.
- A cluster management system schedules jobs, manages resources, monitors machine health, and deals with failures.
- The Google SSTable (Sorted String Table) file format, a persistent, ordered, immutable map from keys to values, is used to store Bigtable data.
- Chubby, a highly available and persistent distributed lock service, provides a namespace of directories and small files that can be used as locks. Bigtable uses Chubby for:
- ensuring that there is at most one active master at any time
- storing the bootstrap location of Bigtable data
- discovering tablet servers
- storing Bigtable schema information (the column families of each table)
- storing access control lists
The Bigtable implementation has three major components:
- Client library: interfaces between the application and the cluster of tablet servers
- Master server: assigns tablets to tablet servers; monitors tablet server health and manages the addition and removal of tablet servers; handles schema changes such as table and column family creation; and manages garbage collection of files in GFS. It does not mediate between clients and tablet servers.
- Cluster of tablet servers: each tablet server manages a set of tablets, handles read and write requests directly from clients (clients do not rely on the master server for tablet locations), and splits tablets that have grown too large.
Tablet location information is managed in a three-level hierarchy analogous to a B+ tree and is cached by client libraries as they access it. The first level is a Chubby file that stores the location of the root tablet. At the second level, the root tablet contains the locations of all tablets of a special METADATA table; the root tablet is treated specially and is never split, ensuring the hierarchy never exceeds three levels. At the third level, each METADATA tablet contains the locations of a set of user tablets.
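The three-level lookup can be sketched as three successive range lookups; the dictionaries and key conventions below are invented for illustration:

```python
import bisect

def pick_range(ranges, key):
    """ranges: list of (end_key, value) sorted by end_key; return the
    value of the first range whose end_key >= key."""
    ends = [end for end, _ in ranges]
    return ranges[bisect.bisect_left(ends, key)][1]

def locate(row_key, chubby, root_tablet, metadata):
    """Resolve the tablet server holding row_key via the 3-level hierarchy."""
    root_server = chubby["root_tablet_location"]                  # level 1: Chubby file
    meta_tablet = pick_range(root_tablet[root_server], row_key)   # level 2: root tablet
    return pick_range(metadata[meta_tablet], row_key)             # level 3: METADATA tablet

# Invented example topology ('~' stands in for a last-range sentinel key).
chubby = {"root_tablet_location": "ts-root"}
root = {"ts-root": [("m", "meta-1"), ("~", "meta-2")]}
metadata = {
    "meta-1": [("f", "ts-4"), ("m", "ts-5")],
    "meta-2": [("t", "ts-6"), ("~", "ts-7")],
}
```

In practice clients cache these locations and only walk the hierarchy again on a cache miss or a stale entry.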
Each table begins as a single tablet, and as it grows, tablet servers split it into multiple tablets. Each tablet is assigned to one tablet server by the master server. For this assignment, the master keeps track of the set of live tablet servers and the current assignment of tablets to them, and sends a tablet load request to a tablet server that has enough room.
Each tablet server holds an exclusive lock on a uniquely named file in a Chubby directory. When a tablet server terminates gracefully (e.g., when the cluster management system takes its machine down), it tries to release its lock so that the master can reassign its tablets more quickly.
The master monitors the health of tablet servers and reassigns a server's tablets once that server has lost its lock. It begins the reassignment by acquiring the tablet server's Chubby lock and deleting its file, ensuring that the server can never serve again; it then moves all of that server's tablets into the set of unassigned tablets, from which they are reassigned to tablet servers that have enough room.
When the master is started by the cluster management system, it goes through the following routine:
- Acquire unique master lock on Chubby
- Scan Chubby directory to discover live tablet servers
- Contact each live tablet server to find out its current tablet assignments
- Scan the METADATA table and compare it with the assignments from the previous step to detect unassigned tablets, adding them to the set of unassigned tablets so they become eligible for assignment. (If the METADATA tablets themselves have not yet been assigned, the master first adds the root tablet to the set of unassigned tablets to ensure they get assigned.)
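The reconciliation in the last step amounts to a set difference; a sketch with invented data structures:

```python
# Sketch of the master's startup reconciliation: tablets present in the
# METADATA scan but not reported by any live tablet server are unassigned.
def find_unassigned(metadata_tablets, live_assignments):
    """metadata_tablets: set of tablet ids from the METADATA scan;
    live_assignments: {server: set of tablet ids} from live servers."""
    assigned = set()
    for tablets in live_assignments.values():
        assigned |= tablets
    return metadata_tablets - assigned
```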
The master keeps track of the creation and deletion of tables and the merging of two tablets into one; both follow the normal assignment process of being added to the set of unassigned tablets. A tablet split is a special case because it is initiated by a tablet server: during a split, the tablet server records the new tablet's information in the METADATA table and notifies the master, which updates its view of the tablet set accordingly.
A tablet's persistent state is stored in GFS as shown below: updates are committed to a commit log that stores redo records, the most recent updates are held in a sorted in-memory buffer called the memtable, and older updates are stored in a sequence of SSTables.
Write operation on a tablet server
- Check that the request is well-formed and that the sender is authorized (by verifying against a list of permitted writers read from a Chubby file)
- Append a redo record for the mutation to the commit log
- Insert the updated content into the memtable
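The last two steps can be sketched as follows; the `TabletServer` class and its in-memory structures are invented for illustration (in reality the commit log lives in GFS):

```python
# Write-path sketch: log the redo record first, then apply the mutation
# to the in-memory memtable.
class TabletServer:
    def __init__(self):
        self.commit_log = []   # redo records; persisted to GFS in reality
        self.memtable = {}     # sorted in-memory buffer of recent writes

    def write(self, row, column, value):
        self.commit_log.append((row, column, value))  # 1. append redo record
        self.memtable[(row, column)] = value          # 2. apply to memtable
```

Logging before applying is what makes the memtable recoverable: after a crash, replaying the log rebuilds it.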
Read operation on a tablet server
- Check that the request is well-formed and that the sender is authorized
- Execute the read on a merged view of the memtable and the tablet's SSTables; since both are lexicographically sorted, the merged view can be formed efficiently
- Return the data

(The tablet must already be loaded for reads to be served. When a tablet server loads a tablet, it retrieves the tablet's metadata from the METADATA table, including the list of SSTables that comprise the tablet and a set of redo points into the commit log, reads the SSTable indices into memory, and reconstructs the memtable by applying the redo records.)
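Reads are served from a merged view of the memtable and the SSTables, with the memtable's newer writes shadowing older values; a sketch using plain dictionaries as stand-ins:

```python
# Read-path sketch: consult the memtable first, then the SSTables from
# newest to oldest, so the most recent write for a key wins.
def read(memtable, sstables, key):
    """memtable: dict of recent writes; sstables: list of dicts, newest first."""
    if key in memtable:
        return memtable[key]
    for sst in sstables:
        if key in sst:
            return sst[key]
    return None
```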
As write operations execute, the memtable grows. Three kinds of compaction keep its size bounded and limit the number of SSTables: a minor compaction freezes the memtable when it reaches a threshold size, converts it to an SSTable, and persists it in GFS; a merging compaction merges a few SSTables and the memtable into a single new SSTable; a major compaction rewrites all SSTables into exactly one SSTable.
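A minor compaction can be sketched as freezing the current memtable into a new immutable SSTable and starting fresh (in reality the frozen table is written to GFS, not kept in a Python list):

```python
# Minor-compaction sketch: snapshot the memtable as a new SSTable
# (newest first) and hand back an empty memtable for incoming writes.
def minor_compaction(memtable, sstables):
    frozen = dict(memtable)          # immutable snapshot of recent writes
    return {}, [frozen] + sstables   # fresh memtable, SSTable list grows
```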
Several refinements were made to achieve high performance, availability, and reliability.
- Clients group column families into locality groups, and a separate SSTable is generated for each locality group in each tablet. A locality group can also be configured to be stored in memory, avoiding disk reads.
- Compression of SSTable blocks is an optional per-locality-group setting, and clients can choose the compression format.
- Two levels of caching improve read performance: the Scan Cache is a higher-level cache of the key-value pairs returned by the SSTable interface, while the Block Cache is a lower-level cache of SSTable blocks read from GFS.
- Bloom filters can be created for the SSTables of a particular locality group. They allow the server to ask whether an SSTable might contain any data for a specified row/column pair, sharply reducing disk seeks for reads of non-existent rows or columns.
- A single commit log file is used per tablet server instead of one per tablet, which provides a significant performance benefit during normal operation. To keep GFS write hiccups from adding latency, each tablet server has two log-writing threads, each with its own log file; only one is active at a time, and when the active one performs poorly, logging switches to the other. Sequence numbers on log entries allow recovery to work from both files.
- When the master initiates reassignment of a tablet from a source tablet server to a target, the source server performs a minor compaction (reducing the amount of commit-log state that must be replayed), stops serving the tablet, and hands it back to the master for reassignment.
- Because every part of the system except the SSTable caches and the memtable is immutable, access to SSTables requires no synchronization. Copy-on-write is employed on the memtable so that reads and writes can proceed in parallel.
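The Bloom filter refinement can be sketched with a standard textbook Bloom filter; the parameters and hashing scheme below are my own choices, not Bigtable's:

```python
# Bloom-filter sketch: k hash probes into a bit array answer "might this
# SSTable contain data for the given row/column?" with no false negatives.
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _probes(self, key):
        # derive k independent probe positions from salted SHA-256 hashes
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._probes(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._probes(key))
```

A negative answer is definitive, so most reads for absent rows or columns skip the disk entirely; a positive answer may occasionally be a false positive, which only costs an unnecessary SSTable lookup.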
In a Bigtable cluster with N tablet servers, the following benchmarks were run to measure performance and scalability as N varied.
- Sequential reads and writes
- Random reads and writes
- Random reads(mem) : column families configured to be stored in memory
- Scan: reads made through the Bigtable API for scanning over all values in a row range
The following figures show two views of the performance of these benchmarks when reading and writing 1000-byte values to Bigtable.
- Random reads are slower than most other operations: each read transfers a 64 KB SSTable block from GFS over the network, of which only a single 1000-byte value is used. Bigtable recommends a smaller block size, typically 8 KB, for applications that perform many random reads.
- Random reads from memory are much faster as they avoid fetching SSTable blocks from GFS.
- Random and sequential writes perform better than random reads because a write only appends to the commit log and the memtable and does not need to fetch SSTable blocks from GFS. There is no significant difference between the two kinds of writes, as both are recorded in the same commit log and memtable.
- Sequential reads perform better than random reads as every 64KB block fetched from GFS is cached and used before attempting to fetch the next block.
- Scans are even faster, as RPC overhead is amortized: a single RPC through the Bigtable scan API returns a large sequence of values.
- Aggregate throughput increases dramatically, by over a factor of 100, for every benchmark as the number of tablet servers grows; the scaling, however, is far from linear.
- The random read benchmark shows the worst scaling: the transfer of a 64 KB block for every 1000-byte read saturates the shared network links in GFS.
Google Analytics helps webmasters analyze website traffic patterns by storing click data. The raw click table (~200 TB) maintains a row for each end-user session. The row name is a tuple of the website's name and the time the session was created, which ensures that each session is stored in a single row and that sessions for the same website are contiguous and sorted chronologically. This table compresses to 14% of its original size. The summary table (~20 TB) stores various predefined summaries for each website; it is updated by scheduled MapReduce jobs that read from the raw click table, and it compresses to 29% of its original size.
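The locality property of this row-naming scheme follows directly from Bigtable's sorted row order; a sketch (the `#` separator and zero-padding are my own assumptions, not the actual key format):

```python
# Row-key sketch: (website, session start time) keys keep each site's
# sessions contiguous and chronologically ordered once rows are sorted.
def row_key(site, start_ts):
    return f"{site}#{start_ts:010d}"   # zero-pad so numeric order == string order

keys = sorted([
    row_key("cnn.com", 500),
    row_key("bbc.co.uk", 100),
    row_key("cnn.com", 42),
])
# all cnn.com sessions are adjacent, with the earlier session first
```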
Google Earth provides access to high-resolution satellite imagery of the world's surface. The system uses one table for preprocessing data and a different set of tables for serving client data. The preprocessing table contains 70 terabytes of data and is therefore served from disk; it relies on MapReduce over Bigtable to transform the data. Each row in the imagery table corresponds to a single geographic location, and rows are named so that adjacent locations are stored next to each other; the table contains a column family to keep track of the data sources for each location. The serving system uses one table to index data stored in GFS. This table is relatively small (~500 GB) but must serve tens of thousands of queries per second per datacenter with low latency, so it is hosted across hundreds of tablet servers with in-memory column families.
Personalized Search personalizes search results based on a user's previous actions across Google products. Each user's data is stored in a single row whose row name is the user id. A separate column family is reserved for each type of action, and the timestamp of each value records when the user performed the action. User profiles are generated by MapReduce jobs over Bigtable, and the data is replicated across several clusters to increase availability and reduce latency.
- Large distributed systems are vulnerable to many types of failures, such as memory and network corruption, large clock skew, and bugs in other systems (e.g., Chubby).
- It is very important to delay adding new features until it is clear how they will be used. For example, general-purpose transactions were not implemented until some application direly needed them, which never happened: most applications turned out to require only single-row transactions.
- It is important to have proper system-level monitoring to detect and fix problems such as lock contention on tablet data structures and slow writes to GFS.
- The most important lesson is the value of simple design when building a very large system: it avoids spending enormous amounts of time debugging system behavior.
Applications that use Bigtable have been observed to benefit from its performance, high availability, and scalability. Its unusual interface compared to traditional databases and its lack of general-purpose transactions have not been a hindrance, given that many Google products successfully build on Bigtable. Google has gained significant advantages from building its own storage solution: full control and flexibility, and the ability to remove bottlenecks and inefficiencies as they arise.
The original paper is linked here.