Tuesday, April 23, 2013

Memcached Internals

I have been working on memcached for about two semesters now, and as part of my university work, I had to put together this presentation. I may expand this into a full blog post when I find more time.

Notes on "Scaling Memcache at Facebook"

Overview


Facebook experiences a read-dominated workload with data coming from disparate sources, which necessitates a flexible caching strategy. Memcached, a single-node in-memory hash table, is used as the basic building block for memcache, an efficient distributed hash table implementation. Memcache is used as a demand-filled look-aside cache, where data updates result in the deletion of the corresponding keys from the cache.
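
The look-aside pattern can be sketched in a few lines of Python; the cache and db handles here are hypothetical stand-ins for a memcached client and the backing store:

    def read(key):
        value = cache.get(key)
        if value is None:              # cache miss
            value = db.fetch(key)      # fall back to the backing store
            cache.set(key, value)      # demand-fill the cache
        return value

    def write(key, value):
        db.update(key, value)          # update the authoritative store first
        cache.delete(key)              # invalidate; the next read demand-fills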

Two major design goals:
  1. Any change must address a user-facing or operational issue; optimisations with limited scope are rarely considered.
  2. Be tolerant of exposing slightly stale data in exchange for insulating the backing store.
A memcache cluster houses many memcached servers, across which data is distributed using consistent hashing. In this scenario, it is possible for one web server to communicate with all memcached servers in a short period of time (incast congestion; see footnote 1 for a longer explanation).
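
Below is a minimal Python sketch of consistent hashing with virtual nodes; the server names and vnode count are made up, and memcache's actual hash and routing details differ. The key property is that adding or removing a server only remaps roughly 1/N of the keys:

    import bisect
    import hashlib

    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, servers, vnodes=100):
            # Each server gets many points on the ring to even out load.
            self._ring = sorted((_hash('%s#%d' % (s, i)), s)
                                for s in servers for i in range(vnodes))
            self._points = [h for h, _ in self._ring]

        def server_for(self, key):
            # Walk clockwise to the first point at or after the key's hash.
            i = bisect.bisect(self._points, _hash(key)) % len(self._ring)
            return self._ring[i][1]

    ring = ConsistentHashRing(['mc1:11211', 'mc2:11211', 'mc3:11211'])
    print(ring.server_for('user:42:profile'))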


In a cluster: Latency and Load


Optimisations are made to the memcache client, which runs on the web server and communicates with the memcached servers. A directed acyclic graph representing the dependencies between data items is used to maximize the number of items that can be fetched concurrently, minimizing the number of round trips. The memcache client can run either embedded in the web server or as a standalone proxy (mcrouter); in both modes it presents the memcached server interface and routes requests to the appropriate servers.
  • Reducing Latency
Get requests are sent over UDP (any packet drop or error is treated as a cache miss), while set and delete requests are sent over TCP since they must be reliable. Memcache clients use a sliding-window mechanism, similar to TCP's congestion control, to limit the number of outstanding requests to memcached servers and avoid incast congestion; a sketch of such a window appears after this list.
  • Reducing Load
    • Leases 
      • The lease mechanism is used to solve stale sets (caused by concurrent cache operations; the Gumball technique solves this case in a different way) and thundering herds (heavy read and write activity on a key, leading to alternating get and delete operations on that key). A sketch of the lease protocol appears after this list.
      • Stale sets are avoided by having the memcached instance generate a short-lived 64-bit token on a cache miss. This token must be presented when the updated value is ready to be inserted into the cache; the memcached instance remembers whether the key was deleted during the lifetime of the lease, and if so, the set is rejected.
      • Thundering herds are solved by limiting the rate at which lease tokens are issued, thereby signaling newer threads to back off and retry their get operations shortly.
    • Memcache pools
      • Negative interference happens when low-churn but expensive-to-recompute keys are evicted by high-churn keys. Different pools of memcached servers (each housing keys with similar churn rates) are maintained to mitigate negative interference.
    • Replication within pools
      • Keys are replicated within a pool, instead of being further partitioned, when the application fetches many keys simultaneously, the entire dataset fits completely in main memory, and the request rate is higher than a single server's capacity.
  • Handling failures
Gutter pools of memcached servers are maintained; they temporarily take over keys belonging to memcached instances that experience small outages. These keys are short-lived on the gutter machines. This insulates the backing store from traffic surges due to minor failures or network incidents.
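
As noted under Reducing Latency above, clients pace their outstanding requests with a sliding window. A minimal Python sketch follows; the exact growth and shrink policies here are assumptions, in the spirit of the TCP-style congestion control the paper likens the mechanism to:

    class SlidingWindow:
        def __init__(self, initial=2, maximum=64):
            self.window = initial       # max outstanding requests allowed
            self.maximum = maximum
            self.outstanding = 0

        def try_acquire(self):
            if self.outstanding >= self.window:
                return False            # caller queues the request for later
            self.outstanding += 1
            return True

        def on_success(self):
            self.outstanding -= 1
            self.window = min(self.window + 1, self.maximum)   # grow additively

        def on_failure(self):           # e.g. a dropped UDP response
            self.outstanding -= 1
            self.window = max(self.window // 2, 1)             # shrink multiplicatively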
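
The lease protocol from the Reducing Load bullet can be sketched as below. This is a toy single-process model of the server-side logic, not memcached's implementation; the 10-second token rate is the paper's default of one token per key every 10 seconds:

    import secrets
    import time

    class LeasingCache:
        def __init__(self, lease_interval=10):
            self.data = {}
            self.leases = {}                        # key -> (token, issued_at)
            self.lease_interval = lease_interval    # one token per key per 10s

        def get(self, key):
            if key in self.data:
                return self.data[key], None
            token, issued = self.leases.get(key, (None, 0.0))
            if token is None or time.time() - issued >= self.lease_interval:
                token = secrets.randbits(64)        # short-lived 64-bit token
                self.leases[key] = (token, time.time())
                return None, token                  # miss: holder may fill the value
            return None, None                       # miss: back off, retry get shortly

        def set_with_lease(self, key, value, token):
            lease = self.leases.get(key)
            if lease is None or lease[0] != token:
                return False                        # a delete raced in: stale set rejected
            self.data[key] = value
            del self.leases[key]
            return True

        def delete(self, key):
            self.data.pop(key, None)
            self.leases.pop(key, None)              # invalidates any outstanding lease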


In a region: Replication


  • Regional Invalidations
The storage cluster (housing the backing store) and multiple frontend clusters (each housing web servers and memcache) are combined into a region, allowing for a smaller failure domain and a tractable network configuration. mcsqueal, an invalidation daemon, keeps data in sync by intercepting committed SQL statements in the storage cluster and invalidating the corresponding keys in the frontend clusters. It batches invalidations to avoid overwhelming the network boundary between the storage cluster and the frontend clusters. Alternatively, web servers may choose the simpler approach of invalidating the keys themselves, but this approach is less effective than mcsqueal.
  • Regional pools
Keys that are not accessed very frequently are placed in a common set of memcached servers shared across multiple frontend clusters. This is called a regional pool, and it reduces the number of replicas of these key-value pairs.
  • Cold cluster warmup
Cold clusters (frontend clusters with empty caches) take advantage of data replication between frontend clusters, fetching values from a warm cluster's cache rather than the backing store. This alleviates load on the backing store in scenarios such as the addition of a new cluster, the failure of an existing cluster, or scheduled maintenance.


Across regions: Consistency


Replication across regions happens between the backing stores, and there is generally a delay in transmitting replication commands from the master to its slaves. This can lead to inconsistent data being set in the frontend clusters of a slave region. To maintain consistency across master-slave regions, a remote-marker mechanism is used to minimize the probability of reading stale data. A remote marker is set in the frontend cluster when the fresh value lives across the region; the presence of the marker on a get redirects the request to the master region instead of the local slave, avoiding stale reads.
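
A minimal sketch of this flow, assuming hypothetical regional_pool, local_cache, master_db, and replica_db handles; the marker TTL is an assumption chosen to outlast typical replication lag:

    MARKER_TTL = 60  # seconds; assumed to exceed typical replication delay

    def write(key, value):
        regional_pool.set('remote:' + key, 1, expire=MARKER_TTL)  # 1. set marker
        master_db.update(key, value)    # 2. write to the master region's store
        local_cache.delete(key)         # 3. invalidate the local cached copy

    def read(key):
        value = local_cache.get(key)
        if value is not None:
            return value
        if regional_pool.get('remote:' + key):   # marker present: replica may be stale
            return master_db.fetch(key)          # so redirect to the master region
        return replica_db.fetch(key)             # otherwise the local replica is safe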


Single server improvements


  • Performance Optimisations
    1. Automatic expansion of the hash table to avoid O(n) lookup times.
    2. Multithreading the server using a global lock; moving to fine-grained locking tripled the peak get rate in their experiments.
    3. A separate UDP port per thread to reduce contention during communication. UDP outperformed TCP by 13% for single gets and by 8% for 10-key multigets.
  • Adaptive Slab Allocator
In contrast to the open-source implementation of slab rebalancing, Facebook implemented a strategy that balances the age of items across slab classes (effectively converting the per-class LRUs into a single global LRU) rather than balancing eviction rates.
  • The transient item cache
The existing LRU holds on to short-lived keys for too long (memcached's lazy eviction waits for a key to reach the end of the LRU before evicting it), which wastes memory. The transient item cache is a circular buffer whose entries index items at per-second granularity by the imminent second in which they will expire; every second, the head of the buffer is cleared and those items are evicted. A toy version appears below.
  • Software upgrades
Software upgrades usually mean clearing the whole cache and repopulating it. By keeping cached data in System V shared memory regions, the data can remain live across software upgrades.
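
A toy Python version of the transient item cache; the one-hour bound on what counts as short-lived is an assumption:

    import time

    BUCKETS = 3600   # one bucket per second; items with TTL < BUCKETS qualify

    class TransientItemCache:
        def __init__(self):
            self.buckets = [set() for _ in range(BUCKETS)]
            self.head = int(time.time())
            self.store = {}

        def set(self, key, value, ttl):
            self.store[key] = value
            if ttl < BUCKETS:   # short-lived items are indexed by expiry second
                self.buckets[(self.head + ttl) % BUCKETS].add(key)

        def tick(self):
            # Called once per second: evict everything expiring this second,
            # instead of waiting for the items to age out of the LRU.
            bucket = self.buckets[self.head % BUCKETS]
            for key in bucket:
                self.store.pop(key, None)
            bucket.clear()
            self.head += 1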


Conclusion


Learnings
  • Separating cache and persistence storage simplifies scaling.
  • Monitoring, debugging, and operational efficiency are as important as performance.
  • Stateful components are more complex to manage than stateless ones.
  • The ability to gradually roll out and roll back features across parts of the infrastructure is important.
  • Simplicity is vital.

Footnotes:

  1. Incast congestion (all-to-all communication): a web server may communicate with many (or all) memcached servers to fetch data, and it can get overwhelmed (and begin dropping packets) when responses from many (or all) of those servers arrive at the same time.

References

  1. R. Nishtala et al. Scaling Memcache at Facebook. 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI '13), 2013.
  2. https://www.usenix.org/conference/nsdi13/scaling-memcache-facebook

Tuesday, January 15, 2013

USC CSCI 685(Advanced Topics in Database systems) - Fall 2012 Course review

I am writing this post as an expanded version of many of my responses on Facebook forums to fellow university-mates at USC reviewing this course.

TL;DR: Skip to the last two paragraphs.

Advanced Topics in Database Systems, CSCI 685, taught by Prof. Shahram Ghandeharizadeh, is a rigorous course structured in a seminar format (more like a book club that meets twice a week; you will be expected to participate actively), focusing on reading research papers that represent influential milestones in shaping the current marketplace of database technologies. Papers including, but not limited to, RAID [1], Beyond Relational Databases [2], Bigtable [3], Column-Stores vs. Row-Stores [4], The Gamma Database Machine Project [5], and R-Trees [6] were discussed. There was variety in the topics, aimed at giving us an overall knowledge of these systems. Each class opened with students commenting on the paper for that day, followed by a lecture/discussion of the paper facilitated by the professor. I would highly encourage reading the papers before class to make the best use of the sessions.

Assignments were related to, and built on top of, BG, a benchmarking tool to evaluate the performance of a data store for interactive social networking actions and sessions [7]. Each student was assigned a datastore (from a list that included MySQL, DB2, Riak, Cassandra, CouchDB, VoltDB, etc.), and the assignment involved two parts: writing DB interfaces to communicate with the datastore and store in it the data representing the social network domain that BG deals with, and gathering benchmark metrics by running different workloads. The work was predominantly written in Java. Assignment questions were discussed in BG's Google group. Prof. Ghandeharizadeh was very accommodating when the class wasn't making as quick progress as he had expected, and he restructured the assignment to adjust the pace and make sure everybody was up to speed. I think this was possible because of the small class size (15 students); I would imagine the assignment would have been simpler to begin with in a larger class. An extra-credit assignment on Berkeley DB was announced, but once the class started working on the project, it was hard to take up this optional assignment. Assignments were completed by the first half of the semester.

We were free to propose our own ideas for the class project, and we were also given interesting ideas to pick from. Projects were done in groups of two, and each group worked on a separate topic, in Java and/or C depending on the topic. A few of the topics included benchmarking the vertical and horizontal scaling of different SQL and NoSQL datastores, implementing Gumball [8] on memcached, and implementing elasticity on memcached. There were in-class presentations at important milestones. The projects were meant to be very intense, and their scope evolved over time as the students became comfortable with the problem at hand. The professor pointed us to relevant literature when necessary and facilitated discussions to get us thinking about the solution. This occupied most of the second half of the semester.

The two closed-book exams were pretty easy to crack with participative attendance during the paper lectures and the exam review lectures. Extra credit for the midterm was given for reviewing a few PhD students' papers and providing feedback on them.

One of the questions I was asked several times by my fellow university-mates was how useful the course would be from an industry perspective. The variety of papers, combined with the theoretical nature of the discussions, exposes you at a high level to the different kinds of problems that experts have encountered so far and the types of solutions that have been explored for them. Although this is not as directly applicable as the learnings from the assignments or projects, I feel it equips students with a high-level knowledge of a wide variety of database technologies (and of generic topics such as indexing techniques, scalable systems, etc.) that can be applied whenever they encounter similar problems while building systems.

Another popular question is how CSCI 685 differs from CSCI 599 - NewSQL Database Management Systems. CSCI 685 is broad and covers papers from a longer span of time, ranging from classical to contemporary systems. CSCI 599 - NewSQL is narrower in scope and covers extensively the recent hype around NoSQL and the more recent NewSQL products that have been bubbling up rapidly.

Hit me up with any questions you may have. Good luck.

References

  1. D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). ACM SIGMOD, 1988.
  2. M. Seltzer. Beyond Relational Databases. Communications of the ACM, July 2008, Vol. 51, No. 7.
  3. F. Chang et al. Bigtable: A Distributed Storage System for Structured Data. OSDI, 2006.
  4. D. J. Abadi et al. Column-Stores vs. Row-Stores: How Different Are They Really? ACM SIGMOD, 2008.
  5. D. DeWitt et al. The Gamma Database Machine Project. IEEE Transactions on Knowledge and Data Engineering, Vol. 2, 1990.
  6. A. Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. ACM SIGMOD, 1984.
  7. S. Barahmand and S. Ghandeharizadeh. BG: A Benchmark to Evaluate Interactive Social Networking Actions. CIDR, 2013.
  8. S. Ghandeharizadeh and J. Yap. Gumball: A Race Condition Prevention Technique for Cache Augmented SQL Database Management Systems. Database Laboratory Technical Report 2012-01, Computer Science Department, USC.