We’re coming up on the tenth anniversary of LMDB, and I’ve been thinking back to the bad old days of cache tuning that we struggled with, before LMDB’s advent. As noted in the LMDB design doc tuning caches used to be a major pain point in administering OpenLDAP. It was a constant juggling act due to the presence of at least 3 different layers of caching being in effect, each with different space and time characteristics. While we’ve been free of this burden for nearly 10 years now, we gained a lot of hard-won knowledge about caching in the years before that. Most of that knowledge is no longer necessary for operating OpenLDAP, but there’s still a lot of software in use today that tries to manage their own caches, and the lessons we learned are still valuable in those other contexts.
Who/What is LRU?
The slapd-ldbm, slapd-bdb, and slapd-hdb backends all used their own application-level entry caches to optimize read performance. These caches were essential because the underlying DB libraries (gdbm, bdb, ndbm, etc.) performed quite slowly on their own. The same cache management policy was used on each, LRU: Least Recently Used. This is one of the simplest to understand and simplest to implement, and it’s widely documented.
The typical implementation in C uses a doubly-linked list. Cached elements are added at one end of the list when they’re used, and dropped from the other end of the list when the cache size limit is reached. Any cached elements that are referenced again are pulled out of their spot in the middle of the list and added back onto the end.
It turns out that this data structure performs very poorly with higher levels of concurrency; the head and tail pointers of the list must be mutex protected, as well as the pointers within each cache element. And every read operation in the server actually expands to a write operation in the cache, pulling an entry out of its old position and tacking it on to its new position.
Logically it is the same strategy as LRU, but the implementation uses a circularly linked list. Instead of popping pages out of their position and tacking them onto the end, there is a clock-hand pointer that points to the current head/tail, and a Referenced flag for each entry. When an entry is referenced, the hand is just advanced to the next entry in the circle. This approach drastically reduces the locking overhead in reading and updating the cache, removing a major bottleneck in multi-threaded operation.
Important Lesson #1: If you’re using plain LRU in your code, stop. Use CLOCK instead.
Cold Hard Cache
Aside from classical LRU being terrible for multi-threaded performance, it has some pathological cases where it cannot offer any performance benefit at all. In particular, if the cache has a maximum size of N entries, and a sequence of operations arrive that touch M > N entries, the cache effectiveness goes to zero.
For example, given a cache with max size 4, and a request that accesses 5 entries A, B, C, D, and E, the cache will grow as:
A A B A B C A B C D B C D E
At every step an entry is added to the cache. Since there are no repeats in the sequence, none of the cached entries are reused while executing the sequence. At the last step, because the cache is full, entry A is evicted to make room for entry E to be added.
If the operation accessing this sequence of 5 entries is repeated, the cache contents will become
C D E A D E A B E A B C A B C D B C D E
None of the cache’s contents will ever be reused to service the request, because the next entry being requested is always the one that was just evicted a moment before. In these situations the cache is just a waste of time and memory.
This is a major weakness in LRU and its related cache management schemes. One of the solutions proposed to this problem is the Clock-Pro algorithm. We experimented with this back in 2007 but with poor results.
One of the difficulties in working with these experimental algorithms is that they tend to be developed in the context of managing virtual memory pages in a demand-paged operating system kernel. A kernel has many low level high performance facilities at its disposal that have no equivalent in user-level application space, or whose usage is more expensive in user-land. In the case of Clock-Pro, it may have worked in kernel space but adapting it to user level was a very poor fit.
Eventually a simpler solution was arrived at: just stop adding entries to the cache if the number of entries accessed in the request exceeds the maximum size of the cache. This allowed the cache effectiveness to approach 100% (every entry in the cache gets reused) and was a huge performance boost for these pathological operations. To illustrate with the same cache and operations as in the previous example, the cache grows almost the same as before:
A A B A B C A B C D A B C D
But when we process entry E, we discard it instead of caching it. Then when the sequence is repeated, entries A, B, C, and D are already in the cache and are processed at basically zero cost. We simply pay the cost of fetching entry E again. In this example, the repeated operation runs 5x faster than the uncached operation.
This simple optimization is one that works because we have higher level knowledge about the request being performed. It’s an advantage that we gave up, in adopting LMDB and leaving all cache management to the OS. (Advantages like these are reasons why most DB engineers claim that DBs can manage caches better than OSs can.) In practice, memory management in systems like Linux has advanced beyond simple LRU, with features adopted from Clock-Pro and other designs.
So at least on Linux, cache performance hasn’t been an issue with LMDB. It’s noticeably worse on Windows, as is memory management in general on Windows, but obviously Windows is not a high performance platform to begin with.
Important Lesson #2: Exotic solutions aren’t always all they’re cracked up to be; take a step back and you may see a simpler option.
Despite the weaknesses of LRU in some rare corner cases, it still performs well in general. This is especially true for hierarchically structured data, due to the nature of these structures. For example, given a directory tree like this:
dc=com ..dc=example ....ou=people ......cn=Alice ......cn=Bob ......cn=Carol
A lookup for cn=Carol,ou=people,dc=example,dc=com will touch these entries and cache them:
dc=com dc=example ou=people cn=Carol
Assuming the cache size limit is 4, then a lookup for cn=Bob,ou=people,dc=example,dc=com will touch these entries and cache them:
dc=com dc=example ou=people cn=Bob
The LRU policy will naturally evict the cn=Carol entry, while preserving the other entries. So with tree structured data you will always get the maximum possible effectiveness from an LRU-based cache. This same principle applies to the B+trees used internally in LMDB. Because tree traversals always start at a tree root, the tree root will always be hot in the cache, and frequently used branches will also stay hot in the cache, with only the leaves being cold. So the natural access pattern of tree-oriented data structures inherently maximizes the effectiveness of system caches, without any additional effort required. This is one reason why OpenLDAP is still the world’s fastest, most efficient distributed database today, while requiring no cache tuning.
Important Lesson #3: Getting your data model right is more than half the battle.
A lot of intensive work is still ongoing in designing optimal caching algorithms. Even if you have a great design, realizing it in a good implementation can still be a difficult challenge. (Looking at the commit history of the old caching code in OpenLDAP gives a hint of just how much effort it takes to get it right.) For locally stored data the best cache management strategy these days is “let the OS do it for you.”