Tuesday, April 23, 2013

Memcached Internals

I have been working on memcached for about two semesters now, and as part of my university work I had to put up this presentation. I may expand this into a full blog post when I find more time.

Notes on "Scaling Memcache at Facebook"

Overview


Facebook experiences a read-dominated workload with data coming from disparate sources, which necessitates a flexible caching strategy. Memcached, a single-node in-memory hash table, is used as the basic building block for memcache, an efficient distributed hash table implementation. Memcache is used as a demand-filled look-aside cache, where updates to the backing store result in deletion of the corresponding keys from the cache.
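
The look-aside pattern is simple enough to sketch. The following is hypothetical Python, with plain dictionaries standing in for memcached and the database; the point is that writes delete the cached key rather than updating it, so the next read demand-fills the cache.

    # Demand-filled look-aside cache: a minimal sketch (dicts stand in
    # for memcached and the backing database).
    cache = {}
    db = {"user:1": "alice"}

    def read(key):
        value = cache.get(key)        # 1. try the cache first
        if value is None:             # 2. on a miss, go to the backing store
            value = db.get(key)
            cache[key] = value        # 3. demand-fill the cache
        return value

    def write(key, value):
        db[key] = value               # 1. update the backing store
        cache.pop(key, None)          # 2. delete, don't update, the cached copy

    print(read("user:1"))             # miss -> fills cache -> "alice"
    write("user:1", "bob")
    print(read("user:1"))             # miss again after the delete -> "bob"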

Two major design goals:
  1. Any change must impact a user-facing or operational issue; optimisations with limited scope are rarely considered.
  2. Tolerate exposing slightly stale data in exchange for insulating the backend store from excessive load.
A memcache cluster houses many memcached servers, across which data is distributed using consistent hashing. In this setup, it is possible for one web server to communicate with all memcached servers in a short period of time, risking incast congestion (see footnote 1 for a longer explanation).
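
Consistent hashing is what keeps key redistribution small when servers are added or removed. Below is a minimal, hypothetical Python ring with virtual nodes; the paper does not spell out Facebook's exact construction, this is just the standard technique.

    import bisect
    import hashlib

    class ConsistentHashRing:
        """Maps keys to servers via a hash ring with virtual nodes."""
        def __init__(self, servers, vnodes=100):
            self.points = []                      # sorted (hash, server) pairs
            for server in servers:
                for i in range(vnodes):
                    self.points.append((self._hash(f"{server}#{i}"), server))
            self.points.sort()

        @staticmethod
        def _hash(s):
            return int(hashlib.md5(s.encode()).hexdigest(), 16)

        def server_for(self, key):
            # first ring point clockwise of the key's hash, wrapping around
            i = bisect.bisect(self.points, (self._hash(key),)) % len(self.points)
            return self.points[i][1]

    ring = ConsistentHashRing(["mc1", "mc2", "mc3"])
    print(ring.server_for("user:1"))   # e.g. "mc2"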


In a cluster: Latency and Load


Optimisations are made in the memcache client, which runs on each web server and communicates with the memcached servers. A directed acyclic graph representing the dependencies between data items is used to minimize the number of round trips. The client logic can run either embedded in the web server or as a standalone proxy (mcrouter); either way it presents the memcached server interface and routes requests to servers.
  • Reducing Latency
Get requests are issued over UDP (any packet drop or error is treated as a cache miss). Set and delete requests go over TCP, as they need to be reliable. Memcache clients use a sliding-window mechanism to control the rate of outstanding requests to memcached servers and avoid incast congestion.
  • Reducing Load
    • Leases 
      • The lease mechanism solves stale sets (caused by concurrent cache operations; the Gumball technique addresses this case in a different way) and thundering herds (heavy read and write activity on a single key, leading to alternating get and delete operations on it). The mechanism is sketched after this section's list.
      • Stale sets are avoided by having memcached instances hand out short-lived 64-bit tokens on cache misses. The token must be presented when the updated data is ready to be inserted into the cache; the memcached instance remembers whether the key was deleted during the lifetime of the lease, and if so, the set is rejected.
      • Limiting the rate at which lease tokens are issued (thereby signaling newer threads to back off and retry their get operations shortly) solves thundering herds.
    • Memcache pools
      • Negative interference happens when low-churn but expensive-to-recompute keys are evicted by high-churn keys. Separate pools of memcached servers, each housing keys of similar churn, are maintained to mitigate this interference.
    • Replication within pools
      • Keys are replicated within a pool, instead of being partitioned further, when the application fetches many keys simultaneously, the entire dataset fits in main memory, and the request rate exceeds a single server's capacity.
  • Handling failures
Gutter pools of memcached servers are maintained to temporarily house keys belonging to memcached instances that experience small outages. Keys are short-lived on gutter machines. This shields the backing store from a surge in traffic due to minor failures or network incidents.
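
To make the lease mechanism concrete, here is a hypothetical server-side sketch: the server hands out a token on a miss, a delete invalidates outstanding tokens, a set is accepted only with a still-valid token, and token issuance is rate-limited (the paper mentions once every ten seconds per key) so concurrent missing readers back off instead of stampeding the database. The class and method shapes are illustrative, not memcached's actual interface.

    import secrets
    import time

    class LeasingCache:
        RATE = 10.0                      # seconds between lease tokens per key

        def __init__(self):
            self.data = {}
            self.leases = {}             # key -> (token, issued_at)

        def get(self, key):
            if key in self.data:
                return ("hit", self.data[key])
            _, issued = self.leases.get(key, (None, 0.0))
            if time.time() - issued < self.RATE:
                return ("wait", None)    # another client holds the lease: retry soon
            token = secrets.token_hex(8) # 64-bit token
            self.leases[key] = (token, time.time())
            return ("miss", token)       # caller fetches from the DB, then calls set()

        def set(self, key, value, token):
            lease = self.leases.get(key)
            if lease is None or lease[0] != token:
                return False             # stale set: a delete raced with this fill
            self.data[key] = value
            del self.leases[key]
            return True

        def delete(self, key):
            self.data.pop(key, None)
            self.leases.pop(key, None)   # invalidates any outstanding lease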


In a region: Replication


  • Regional Invalidations
The storage cluster (housing the backing store) and multiple frontend clusters (each housing web servers and memcache) are combined into a region, allowing for a smaller failure domain and a tractable network configuration. mcsqueal, an invalidation daemon, keeps data in sync by intercepting committed SQL statements in the storage cluster and invalidating the corresponding keys in the frontend clusters. It batches invalidations to avoid overwhelming the network boundary between the storage cluster and the frontend clusters (a batching sketch follows this list). Alternatively, web servers may choose the simpler route of invalidating keys themselves, but this approach is less effective than mcsqueal.
  • Regional pools
Keys that are accessed infrequently are placed in a common set of memcached servers shared across multiple frontend clusters. This is called a regional pool, and it reduces the number of replicas of these key-value pairs.
  • Cold cluster warmup
Cold clusters (frontend clusters with empty caches) take advantage of data replication between frontend clusters, fetching values from a warm cluster instead of the backing store. This alleviates load on the backing store in scenarios such as the addition of a new cluster, the failure of an existing cluster, or scheduled maintenance.
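
The batching idea behind mcsqueal can be sketched as follows. This is hypothetical Python: invalidations extracted from the storage cluster's commit stream are buffered and flushed in batches, so the storage-to-frontend network boundary sees fewer, larger packets. The real daemon tails the MySQL commit log and routes through mcrouter instances; send_to_frontends below is a stand-in for that routing layer.

    BATCH_SIZE = 64                      # illustrative batch size

    def send_to_frontends(keys):
        # stand-in for routing one packet of invalidations via mcrouter
        print(f"invalidate {len(keys)} keys in all frontend clusters")

    def invalidation_daemon(commit_stream):
        batch = []
        for statement in commit_stream:
            # the real mcsqueal extracts keys embedded in committed SQL deletes
            if statement.startswith("DELETE "):
                batch.append(statement.split(" ", 1)[1])
            if len(batch) >= BATCH_SIZE:
                send_to_frontends(batch)
                batch = []
        if batch:
            send_to_frontends(batch)     # flush the remainder

    invalidation_daemon(f"DELETE user:{i}" for i in range(150))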


Across regions: Consistency


Replication across regions happens between backing stores, and there is generally a delay in transmitting replication commands from the master to a slave. This can lead to inconsistent data being set in the frontend clusters of a slave region. To maintain consistency across master-slave regions, a remote-marker mechanism is used to minimize the probability of reading stale data. A remote marker is set in the slave region when a value is updated there (the write itself goes to the master region). On a subsequent cache miss, the presence of the marker redirects the get to the master region, while its absence allows the local slave region to serve it. This trades extra cross-region latency for avoiding stale reads.
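
The remote-marker protocol fits in a few lines. Everything below is a hypothetical stand-in (dicts for the regional cache and the two databases), but the order of operations, set the marker, write to the master, delete the local copy, is the one the paper describes.

    regional_cache = {}                  # memcache in the slave region
    master_db = {}                       # backing store in the master region
    replica_db = {}                      # local replica; lags the master

    def marker(key):
        return "remote:" + key

    def update(key, value):
        regional_cache[marker(key)] = True   # 1. set the remote marker
        master_db[key] = value               # 2. write to the master region
        regional_cache.pop(key, None)        # 3. delete the local cached copy

    def read(key):
        if key in regional_cache:
            return regional_cache[key]
        # on a miss, the marker means the local replica may still be stale
        if marker(key) in regional_cache:
            value = master_db.get(key)       # pay cross-region latency instead
        else:
            value = replica_db.get(key)
        regional_cache[key] = value
        return value

    # the marker is cleared once the replicated invalidation reaches the
    # slave region (omitted here)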


Single server improvements


  • Performance Optimisations
    1. Automatic expansion of the hash table to avoid O(n) lookup times.
    2. Multithreading, initially protected by a global lock; moving to fine-grained locking tripled the peak get rate in their experiments.
    3. A dedicated UDP port per thread to reduce contention during communication. UDP outperformed TCP by 13% for single gets and by 8% for 10-key multigets.
  • Adaptive Slab Allocator
In contrast to the open source slab rebalancing implementation, Facebook implemented a strategy that balances the age of items across slab classes (effectively turning the per-class LRUs into a single global LRU) rather than balancing eviction rates.
  • The transient item cache
The existing LRU holds on to short-lived keys for too long (memcached's lazy eviction waits for keys to reach the end of the LRU before evicting them), which wastes memory. The transient item cache is a circular buffer in which each entry holds the items expiring in a particular upcoming second; every second, the head of the buffer is advanced and its items are evicted. A sketch follows this list.
  • Software upgrades
Software upgrades usually mean wiping the whole cache and repopulating it. By keeping cached data in System V shared memory regions, the data can remain live across software upgrades.
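
The transient item cache is a small data structure and worth sketching. Below is a hypothetical Python version: a circular buffer of per-second buckets, where each tick advances the head and evicts whatever expires in that second, so short-lived items free memory promptly instead of draining slowly through the LRU.

    class TransientItemCache:
        def __init__(self, horizon=60):
            self.horizon = horizon
            self.buckets = [set() for _ in range(horizon)]  # one bucket per second
            self.now = 0

        def add(self, key, ttl):
            if 0 < ttl <= self.horizon:                     # short-lived items only
                self.buckets[(self.now + ttl) % self.horizon].add(key)

        def tick(self, evict):
            # advance one second and evict everything expiring now
            self.now = (self.now + 1) % self.horizon
            for key in self.buckets[self.now]:
                evict(key)
            self.buckets[self.now].clear()

    tic = TransientItemCache()
    tic.add("session:42", ttl=2)
    tic.tick(evict=print)   # nothing expires yet
    tic.tick(evict=print)   # prints session:42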


Conclusion


Learnings
  • Separating cache and persistent storage simplifies scaling.
  • Monitoring, debugging and operational efficiencies are as important as performance.
  • Stateful components are more complex to manage than stateless ones.
  • The ability to gradually roll out and roll back features across parts of the infrastructure is important.
  • Simplicity is vital.

Footnotes:

  1. Incast congestion (all-to-all communication): a web server may communicate with many (or all) memcached servers to retrieve data and get overwhelmed (and begin dropping packets) when responses from many (or all) of those servers arrive at the same time.

References

  1. Nishtala, et al., Scaling Memcache at Facebook, 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13)
  2. https://www.usenix.org/conference/nsdi13/scaling-memcache-facebook