Thursday, September 20, 2012

Google Megastore Paper Overview

Introduction



Interactive online services place huge demands on storage systems, with a wide range of conflicting requirements such as scalability, low latency, consistency, high availability, and agility in development. Megastore addresses this problem by taking a middle ground, blending the scalability of NoSQL with the convenience of a traditional RDBMS.

Data is partitioned into entity groups (the paper assumes that most Internet services offer a natural way to partition their data, which makes this viable), and each group is synchronously replicated across data centers with the help of the Paxos algorithm. Each data center can house many entity groups. Full ACID semantics are provided within an entity group, but only limited consistency guarantees are offered across them. Operations that span entity groups either use the expensive two-phase commit protocol or leverage Megastore's efficient asynchronous message queues. Synchronous, fault-tolerant replication increases availability, while independently partitioned entity groups, each acting as a small database stored in a NoSQL datastore (Bigtable), provide high scalability.

This architecture is shown below:



Data Model


Megastore defines a data model that lies between the abstract tuples of an RDBMS and the concrete row-column storage of NoSQL. The data model is declared in a schema; each schema contains a set of tables, each table contains a set of entities, and each entity contains a set of properties. The primary key consists of a sequence of properties, and child tables declare foreign keys referencing their parent (root) table. An entity group consists of a root entity along with all entities in child tables that reference it.
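As a concrete illustration, here is a minimal Python sketch of such an entity group, loosely modeled on the photo-sharing example in the paper; the class and field names are illustrative and not Megastore's actual schema language.

```python
from dataclasses import dataclass

@dataclass
class User:            # root table
    user_id: int       # primary key
    name: str

@dataclass
class Photo:           # child table; its foreign key references the root table
    user_id: int       # foreign key to User -> determines entity group membership
    photo_id: int      # (user_id, photo_id) is the primary key
    time: int
    url: str

# All Photo entities that share a user_id belong to that User's entity group,
# so a transaction over a user and their photos stays within one group.
```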

Each entity is mapped to a single Bigtable row, and the row key is formed by concatenating the primary key values. This stores the data within an entity group contiguously and hierarchically (the root entity followed by its child entities). The column name is a concatenation of the table name and the property name, which allows multiple Megastore tables to share a single Bigtable.

There are two classes of indexes: local indexes, which index data within a single entity group and are therefore updated atomically with it, and global indexes, which span entity groups and are therefore not guaranteed to reflect recent updates. Index structures are stored in Bigtable, with each index entry stored as a row. The row key is constructed from the indexed property values concatenated with the primary key of the indexed entity.
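The following sketch illustrates how these keys might be laid out; the delimiter and string encoding are simplifications of the order-preserving encoding a real implementation would need.

```python
def entity_row_key(*primary_key_values):
    """Concatenate primary key values, so child entities sort directly
    after their root entity in the Bigtable key space."""
    return "/".join(str(v) for v in primary_key_values)

def column_name(table, prop):
    """Column name = table name + property name, which lets several
    Megastore tables share one Bigtable."""
    return f"{table}.{prop}"

def index_row_key(indexed_values, primary_key_values):
    """Index entry row key = indexed property values followed by the
    primary key of the indexed entity."""
    return "/".join(str(v) for v in (*indexed_values, *primary_key_values))

# Root User(101) and its Photo(101, 500) land in adjacent, ordered rows:
entity_row_key(101)                        # '101'
entity_row_key(101, 500)                   # '101/500'
column_name("Photo", "time")               # 'Photo.time'
index_row_key(["2012-09-20"], [101, 500])  # '2012-09-20/101/500'
```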

Transaction and Concurrency control


Each entity group functions as a mini-database that provides serializable ACID semantics and has its own write-ahead log. Bigtable's versioned (timestamped) cells provide a natural way to support MVCC, so readers and writers do not block each other.

Megastore provides current, snapshot, and inconsistent reads. A write transaction begins with a current read to determine the next available log position, then appends its mutations to the log using Paxos. Transactions across entity groups can either use queues (eventual consistency) or the two-phase commit protocol (strong consistency).
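A rough sketch of that write path is shown below, using hypothetical helper methods on an entity-group handle (`current_read`, `snapshot`, `paxos_propose`, `apply`); it is a simplification of the lifecycle, not the actual API.

```python
class ConflictError(Exception):
    pass

def write_transaction(entity_group, mutate):
    # 1. Current read: find the timestamp and log position of the last
    #    fully-applied transaction in this entity group.
    timestamp, last_position = entity_group.current_read()

    # 2. Application logic: read at that timestamp and gather the writes
    #    into a log entry instead of mutating Bigtable directly.
    log_entry = mutate(entity_group.snapshot(timestamp))

    # 3. Commit: use Paxos to append the entry at the next log position.
    #    If another writer claimed that position first, abort.
    if not entity_group.paxos_propose(last_position + 1, log_entry):
        raise ConflictError("another write won this log position")

    # 4. Apply: write the entry's mutations to entities and indexes in
    #    Bigtable (each replica can do this lazily).
    entity_group.apply(log_entry)
```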

Implementation of Paxos between distributed datacenters


The Paxos algorithm is a way to reach consensus among a group of replicas on a single value. The original algorithm is ill-suited for high-latency network links because it demands multiple rounds of communication. Megastore makes a few optimizations that make it practical in this setting.

Local reads at any replica at any time ("fast reads") are made possible by a coordinator service running alongside each replica. The coordinator tracks whether each entity group is up-to-date at its replica and stores enough state to serve local reads; the write algorithm keeps this state conservative. Megastore also assumes that writes to an entity group are likely to come from the same region, and exploits this by letting the Paxos leader from the previous round remain the leader ("fast writes"), avoiding the prepare stage of the Paxos algorithm.
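A minimal sketch of the per-replica coordinator state implied here, with hypothetical method names:

```python
class Coordinator:
    """Tracks, for one replica, which entity groups it has fully observed."""

    def __init__(self):
        self._up_to_date = set()   # entity groups that are safe to read locally

    def is_up_to_date(self, entity_group):
        # A fast read is a single in-memory lookup at the local replica.
        return entity_group in self._up_to_date

    def invalidate(self, entity_group):
        # Writers call this at replicas that may miss a write, so the
        # default answer stays conservative.
        self._up_to_date.discard(entity_group)

    def validate(self, entity_group):
        # Called once the local replica has applied all committed writes.
        self._up_to_date.add(entity_group)
```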

There are two other replica types: witness replicas, which participate in Paxos voting but do not store the data, and read-only replicas, which are non-voting replicas that contain full snapshots of the data. Each application server uses a client library. To minimize round-trip WAN latency, the library submits remote Paxos operations to stateless intermediary replication servers that communicate with their local Bigtables. This architecture is shown below.



Read algorithm

  1. Query Local: Query the local coordinator to determine if the entity group is up-to-date locally
  2. Find Position: Determine the highest possibly-committed log position and select a replica that has applied through that log position.
    1. (Local read) If step 1 indicates that the local replica is up-to-date, read the highest accepted log position and timestamp from the local replica. 
    2. (Majority read) If the local replica is not up-to-date (or if step 1 or step 2.1 times out), read from a majority of replicas to find the maximum log position that any replica has seen, and pick a replica to read from. We select the most responsive or up-to-date replica, not always the local replica. 
  3. Catchup: As soon as a replica is selected, catch it up to the maximum known log position as follows: 
    1.  For each log position in which the selected replica does not know the consensus value, read the value from another replica. For any log positions without a known-committed value available, invoke Paxos to propose a no-op write. Paxos will drive a majority of replicas to converge on a single value -- either the no-op or a previously proposed write.
    2. Sequentially apply the consensus value of all unapplied log positions to advance the replica's state to the distributed consensus state
    3. In the event of failure, retry on another replica.
  4. Validate: If the local replica was selected and was not previously up-to-date, send the coordinator a validate message asserting that the (entity group; replica) pair reflects all committed writes. Do not wait for a reply -- if the request fails, the next read will retry. 
  5. Query Data: Read the selected replica using the timestamp of the selected log position. If the selected replica becomes unavailable, pick an alternate replica, perform catchup, and read from it instead. The results of a single large query may be assembled transparently from multiple replicas.
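Expressed as code, the read path might look like the sketch below; the replica and coordinator objects and their methods are hypothetical stand-ins, and timeouts and retries are omitted.

```python
def current_read(entity_group, local, replicas, coordinator):
    # 1. Query Local: is the local replica up-to-date for this entity group?
    local_ok = coordinator.is_up_to_date(entity_group)

    # 2. Find Position.
    if local_ok:
        # 2.1 Local read: highest accepted log position at the local replica.
        position, selected = local.highest_applied(entity_group), local
    else:
        # 2.2 Majority read: maximum log position any replica has seen,
        #     then pick the most up-to-date (or most responsive) replica.
        seen = {r: r.highest_seen(entity_group) for r in replicas}
        position, selected = max(seen.values()), max(seen, key=seen.get)

    # 3. Catchup: fill in unknown log positions (proposing no-ops if needed)
    #    and apply them so the selected replica reaches the consensus state.
    selected.catchup(entity_group, position)

    # 4. Validate: if a stale local replica was brought current, tell the
    #    coordinator asynchronously; a lost message just means a retry later.
    if selected is local and not local_ok:
        coordinator.validate_async(entity_group)

    # 5. Query Data: read at the timestamp of the selected log position.
    return selected.read(entity_group, selected.timestamp_at(entity_group, position))
```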



Write algorithm

  1. Accept Leader: Ask the leader to accept the value as proposal number zero. If successful, skip to step 3.
  2. Prepare: Run the Paxos Prepare phase at all replicas with a higher proposal number than any seen so far at this log position. Replace the value being written with the highest-numbered proposal discovered, if any.
  3. Accept: Ask remaining replicas to accept the value. If this fails on a majority of replicas, return to step 2 after a randomized backoff.
  4. Invalidate: Invalidate the coordinator at all full replicas that did not accept the value.
  5. Apply: Apply the value's mutations at as many replicas as possible. If the chosen value differs from that originally proposed, return a conflict error.
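The same steps, sketched in Python with hypothetical replica and coordinator objects (`prepare`, `accept`, `apply`, and friends are stand-ins, not real APIs):

```python
import random
import time

class ConflictError(Exception):
    pass

def prepare_round(replicas, position):
    # Step 2: Prepare at all replicas with a proposal number higher than any
    # seen so far at this position; return that number plus the
    # highest-numbered accepted value discovered, if any.
    number = max(r.highest_proposal_seen(position) for r in replicas) + 1
    promises = [r.prepare(position, number) for r in replicas]  # (num, value) or None
    accepted = [p for p in promises if p is not None]
    return number, (max(accepted, key=lambda p: p[0])[1] if accepted else None)

def paxos_write(entity_group, position, value, leader, replicas, coordinators):
    proposed, number = value, 0

    # 1. Accept Leader: proposal number zero at the leader (the fast-write
    #    path); on failure, fall back to a full Prepare round.
    if not leader.accept(position, number, proposed):
        number, existing = prepare_round(replicas, position)
        if existing is not None:
            proposed = existing

    while True:
        # 3. Accept: ask the remaining replicas to accept the value.
        acks = [r.accept(position, number, proposed) for r in replicas]
        if sum(acks) > len(replicas) // 2:
            break
        time.sleep(random.uniform(0.01, 0.1))                  # randomized backoff
        number, existing = prepare_round(replicas, position)   # back to step 2
        if existing is not None:
            proposed = existing

    # 4. Invalidate: coordinators of full replicas that did not accept.
    for replica, accepted in zip(replicas, acks):
        if not accepted:
            coordinators[replica].invalidate(entity_group)

    # 5. Apply: apply the mutations at as many replicas as possible, and
    #    report a conflict if a different value won this log position.
    for r in replicas:
        r.apply(position, proposed)
    if proposed is not value:
        raise ConflictError("a different value won this log position")
```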


Known problems


Megastore may appear unavailable during a single replica failure (of Bigtable or the coordinator), but the paper claims that in practice this is not a common problem, since the coordinator is a simple process with no external dependencies or persistent storage, and is more stable than a Bigtable server. Nevertheless, network and host failures can still make a coordinator unavailable. 

Coordinators handle network partitions by using an out-of-band protocol to identify when other coordinators are up. Each coordinator obtains specific Chubby locks at startup, and when it loses a majority of them, it reverts to a conservative state that marks all entity groups as out-of-date. There is a brief write outage when a datacenter containing live coordinators becomes unavailable. In an asymmetric network partition, a coordinator may hold its Chubby locks yet be unreachable; in that case a manual step is necessary to disable the partially isolated coordinator.

A variety of race conditions are possible between validate messages for earlier writes and invalidate messages for later writes; log position numbers help mitigate these. Another race, between a crash that follows an invalidate by a writer at position n and a validate at some position m < n, is mitigated by using a unique epoch number for each incarnation of the coordinator: validates are only allowed to modify the coordinator state if the epoch remains unchanged since the most recent read of the coordinator.
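One way these guards might be wired into the coordinator sketch from earlier (the exact bookkeeping here is an assumption for illustration, not the paper's implementation):

```python
class GuardedCoordinator:
    def __init__(self):
        self.epoch = 0                    # new epoch for every incarnation
        self._invalidated_through = {}    # entity group -> highest invalidated log position
        self._up_to_date = set()

    def restart(self):
        # After a crash the coordinator comes back conservative, in a new epoch.
        self.epoch += 1
        self._up_to_date.clear()

    def invalidate(self, entity_group, position):
        self._up_to_date.discard(entity_group)
        prev = self._invalidated_through.get(entity_group, -1)
        self._invalidated_through[entity_group] = max(prev, position)

    def validate(self, entity_group, position, observed_epoch):
        # Drop the validate if the coordinator has restarted since the caller
        # read it, or if a later write has already invalidated this group.
        if observed_epoch != self.epoch:
            return
        if position <= self._invalidated_through.get(entity_group, -1):
            return
        self._up_to_date.add(entity_group)
```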

When several application servers initiate writes to the same entity group and log position simultaneously, the likelihood of conflict increases. Megastore limits the write rate to a few writes per second per entity group to keep conflict rates in check. To overcome this limitation and increase write throughput, the paper suggests sharding data aggressively into many entity groups.

Megastore's performance degrades when a particular full replica becomes unavailable. A few techniques mitigate this: disabling clients at the affected replica and rerouting them to other replicas, disabling the replica's coordinator to minimize the impact on write latency, or disabling the replica entirely. 


Production Metrics


Megastore has been deployed within Google for several years. Observed average read latencies are tens of milliseconds, depending on the amount of data, which shows that most reads are local. Observed average write latencies are in the 100-400 millisecond range, depending on the distance between datacenters, the size of the data being written, and the number of full replicas.

Conclusion


Megastore's latency penalty for synchronous replication across widely distributed replicas is more than offset by the convenience of a single system image. The more than 100 applications running on Megastore, and its use as an infrastructure service by higher layers, are evidence of its ease of use, generality, and power. It demonstrates the viability of a middle ground in feature set and replication consistency.

Link to original paper here
