Distributing a Cache

Lithium flame

Project Li

Why?

A cache is too large for one machine and management overhead is causing CPU and memory bottlenecks.

Current design hitting physical limits.

What?

One strategy is to optimize the cache to reduce management overhead. Assume this has been done and physical machine limits are still being hit. Another strategy is to distribute the cache.

Distributed cache.

Modulo N Hashing

A simple algorithm to distribute items among a set of nodes is Modulo N Hashing.

“A wise man can learn more from a foolish question than a fool can learn from a wise answer.”

Bruce Lee

Consider these questions before using this algorithm.

  1. How may we make the cache resilient to faulty nodes?
  2. How may we redistribute cache items when new nodes are added?
  3. How may we ensure the cache items are uniformly distributed?

When a node becomes unavailable all the items owned by that node will become unavailable. Replication can be used to ensure the items are available on a replica. In the example below each node replicates the data from two other nodes. If any node goes down the data is still available on two other nodes.

Replicating items for fault tolerance.

The price for fault tolerance is space. The amount of space saved is proportional to the amount of replication used. If there are n nodes and m replicas per node, then the proportion of the cache distributed per node is m/n. The more replication the less the cache is distributed.

The price for fault tolerance.

When new nodes are added the items must be redistributed to share the load with the new nodes. If using Modulo N Hashing and the node count changes, then the item locations will change. In the worst case all nodes will need to redistribute their items when a node is added or removed. This can lead to Thundering Herd Problem and cause further service degradation.

Items being redistributed when a node is added.

If node count is not expected to change, then Modulo N Hashing with replication is sufficient to distribute a cache with some degree of fault tolerance. Extra complexity would be needed to communicate to nodes the count has changed. This algorithm has the benefit of not needing nodes to communicate.

In the example above the item hashes were uniformly distributed. A poor hash function will cause non-uniform distribution. Non-uniform distribution will cause some nodes to receive more load than others. Using cryptographic hash functions will provide good distribution at the cost of speed. Use a hash function that is both fast and provides uniform distribution. MurmurHash, xxHash, MetroHash or SipHash1–3 are all good alternatives.

Even using a good hash function may not guarantee a uniform distribution if the data is not uniform. A Hash Ring with Virtual Nodes can be used to increase the uniformity of cache item distribution. One algorithm is called Consistent Hashing.

Consistent Hashing

Consistent hashing with replication factors 1 and 2.

The idea behind Consistent Hashing is to distribute the nodes and cache items around a ring. This is done by computing the hash of the item and node keys and sorting them. All item hashes less than or equal to a node hash and greater than the previous node hash will be owned by the node with the larger hash. This is called a Hash Ring.

Notice in the above example that Node 0 has more items than other nodes when the Hash Ring has Replication Factor of 1. The cache item distribution can be made more uniform by adding Virtual Nodes to the Hash Ring. After adding virtual nodes to the hash ring Item 3 has been moved to Node 2 and the entire distribution is more uniform.

Hash Ring implementation.

Consistent Hashing Also Mitigates the Fault Tolerance and Thundering Herd Problems

When a node is removed the items owned by that node are redistributed to the node with the next greatest hash. Only the items belonging to the faulty node need to be redistributed, and only the node with the next greatest hash needs to fetch the new items. When a node is added the portion of the ring having hash values less than or equal to the node hash and greater than the previous node hash will be moved to the new node.

In both cases of addition and removal only a portion of the nodes are affected. This reduces the chance of all nodes fetching from a shared resource and causing a Thundering Herd that brings down the shared resource. This is unlike the Modulo N algorithm where all nodes and items may be affected. A trade off is increased complexity by requiring node count changes to be communicated so that nodes can update their hash rings.

How May We Compare Modulo N to Consistent Hashing?

The goal is to have a uniformly distributed cache. One way to measure this is to measure the standard deviation of the number of items owned by each node. The lower the better. The cost for a low standard deviation is paid in terms of availability, CPU, and memory.

Compare Standard Deviation of Items Per Node

Works Cited

Consistent Hashing: Algorithmic Tradeoffs

https://medium.com/@dgryski/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8

Your hash function should be fast. This tends to rule out cryptographic ones like SHA-1 or MD5. Yes they are well distributed but they are also too expensive to compute — there are much cheaper options available. Something like MurmurHash is good, but there are slightly better ones out there now. Non-cryptographic hash functions like xxHash, MetroHash or SipHash1–3 are all good replacements.

First, the load distribution across the nodes can still be uneven. With 100 replicas (“vnodes”) per server, the standard deviation of load is about 10%. The 99% confidence interval for bucket sizes is 0.76 to 1.28 of the average load (i.e., total keys / number of servers). This sort of variability makes capacity planning tricky. Increasing the number of replicas to 1000 points per server reduces the standard deviation to ~3.2%, and a much smaller 99% confidence interval of 0.92 to 1.09.

This comes with significant memory cost. For 1000 nodes, this is 4MB of data, with O(log n) searches (for n=1e6) all of which are processor cache misses even with nothing else competing for the cache.

In 2014, Google released the paper “A Fast, Minimal Memory, Consistent Hash Algorithm” known as “Jump Hash”. The algorithm was actually included in the 2011 release of the Guava libraries and indicates it was ported from the C++ code base.

Jump Hash addresses the two disadvantages of ring hashes: it has no memory overhead and virtually perfect key distribution. (The standard deviation of buckets is 0.000000764%, giving a 99% confidence interval of 0.99999998 to 1.00000002).

Jump Hash is also fast. The loop executes O(ln n) times, faster by a constant amount than the O(log n) binary search for Ring Hash, and made faster even still by the fact that the computation is done entirely in a few registers and doesn’t pay the overhead of cache misses.

Jump Hash looks great. It’s fast and splits the load evenly. What’s the catch? The main limitation is that it only returns an integer in the range0..numBuckets-1. It doesn’t support arbitrary bucket names. (With ring hash, even if two different instances receive their server lists in a different order, the resulting key mapping will still be the same.) A better way to think of Jump Hash is as providing a shard number, not a server name. Secondly, you can only properly add and remove nodes at the upper end of the range. This means it doesn’t support arbitrary node removal. You can’t use it for distributing keys among a set of memcached instances where one of them might crash — there’s no way to remove the crashed node from the list of possible destinations.

Another paper from Google “Multi-Probe Consistent Hashing” (2015) attempts to address this. MPCH provides O(n) space (one entry per node), and O(1) addition and removal of nodes. The catch? Lookups get slower.

The basic idea is that instead of hashing the nodes multiple times and bloating the memory usage, the nodes are hashed only once but the key is hashed ktimes on lookup and the closest node over all queries is returned. The value of k is determined by the desired variance. For a peak-to-mean-ratio of 1.05 (meaning that the most heavily loaded node is at most 5% higher than the average), k is 21. With a tricky data structure you can get the total lookup cost from O(k log n) down to just O(k). My implementation uses the tricky data structure.

As a point of comparison, to have the equivalent peak-to-mean ratio of 1.05 for Ring Hash, you need 700 ln n replicas per node. For 100 nodes, this translates into more than a megabyte of memory.

Replication is using the consistent hash to choose secondary (or more) nodes for a given key. This can be either to protect against node failure, or simply as a second node to query to reduce tail latency. Some strategies use full node replication (i.e, having two full copies of each server), while others replicate keys across the servers.

Tyler McMullen – Load Balancing is Impossible


“Predictive Load-Balancing: Unfair but Faster & more Robust” by Steve Gury


Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web

https://www.akamai.com/us/en/multimedia/documents/technical-publication/consistent-hashing-and-random-trees-distributed-caching-protocols-for-relieving-hot-spots-on-the-world-wide-web-technical-publication.pdf

Consistent Hashing

http://www.tom-e-white.com/2007/11/consistent-hashing.html

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Exit mobile version