INTRODUCTION
The Lightning Memory-mapped Database (LMDB) is designed around the virtual memory facilities found in modern operating systems, Multi-version Concurrency Control (MVCC), and Single-Level Store (SLS) concepts. This design is quite different from that of more traditional databases and, in operation, it can mimic behaviors that system administrators have been trained to recognize as signs of trouble. With LMDB, though, these behaviors are normal, but they nonetheless lead to the following questions:
Why does an LMDB database file grow to sizes that are sometimes substantially larger than expected?
Why doesn’t the database file shrink when entries are deleted from the database?
Why does my LDAP server’s VM usage skyrocket when an LMDB database is in use?
Why does the apparent memory usage of an LMDB process go so high?
Why do LMDB database files in multi-master or replica servers often have different sizes if they are supposed to be identical copies of each other?
These are all good questions, so let’s get to it.
LMDB DESIGN
GENERAL POINTS

One of the basic underpinnings of LMDB is the memory-mapped file facility, as implemented in modern Linux, UNIX, and Windows operating systems. Here is a brief introduction from the Wikipedia article on memory-mapped files at https://en.wikipedia.org/wiki/Memory-mapped_file:
“A memory-mapped file is a segment of virtual memory which has been assigned a direct byte-for-byte correlation with some portion of a file or file-like resource. This resource is typically a file that is physically present on-disk, but can also be a device, shared memory object, or other resource that the operating system can reference through a file descriptor. Once present, this correlation between the file and the memory space permits applications to treat the mapped portion as if it were primary memory.”
The memory-mapped file facility makes direct use of the operating system’s virtual memory (VM) subsystem, and in modern operating systems that subsystem is tightly integrated with both the file system and the block cache that the file system uses. As an added bonus, many operating systems take memory that is not otherwise being used by applications, and assign it to the block cache. Thus, in these operating systems there is no such thing as idle memory - all of it gets used, all of the time, and consequently file I/O is very efficient.
LMDB uses memory-mapped files to make the entire database appear as if it’s in primary memory, even if the database itself is larger than the total complement of RAM on the system. This means that instead of computing a disk address for an entry, telling the file system to read it in, and then storing it somewhere, LMDB simply computes the memory address where the entry resides and hands that address back to the calling program. The calling program then accesses the memory address, and the operating system automatically handles paging the data into main memory as needed. It should come as no surprise, then, to learn that LMDB has no cache subsystem of its own. The operating system very efficiently handles all of LMDB’s caching needs. It’s also important to point out that only those portions of the database file that are actually accessed are read into primary memory, so other applications are not starved for memory. This is a concept called Single-Level Store, or SLS.

“But wait”, you say, “if the database occupies the same address space as the program that uses it, a bug in that program could cause part of the database to be overwritten and corrupt the database!” This leads to another important aspect of LMDB’s design: the memory area occupied by the database file is marked read-only. If the application using LMDB tries to modify memory that’s within the bounds of LMDB’s mapped database, the operating system immediately flags an error and terminates the application. Thus, the database is safe from stray pointers and other software bugs that could damage it.
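The two ideas above — a file that reads like primary memory, and a mapping marked read-only so stray writes are rejected — can be demonstrated with Python’s standard-library mmap module. This is a conceptual sketch, not LMDB itself; the file and its contents are made up for illustration:

```python
import mmap
import os
import tempfile

# Create a small file to stand in for a database file.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
with open(path, "wb") as f:
    f.write(b"record-one\x00record-two\x00")

# Map the file read-only, the way LMDB maps its database by default.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Reading works exactly like reading primary memory.
    assert mm[0:10] == b"record-one"

    # Writing through the read-only mapping is rejected immediately,
    # which is what protects the database from stray pointers.
    try:
        mm[0] = 0
    except TypeError as e:
        print("write rejected:", e)
    mm.close()
```

In LMDB proper the same rejection arrives as a hardware fault delivered by the OS (e.g. SIGSEGV) rather than a Python exception, but the protection mechanism is the same read-only page mapping.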
DATABASE FILE SIZE
We’ve discussed so far how LMDB makes the entire database file appear in memory. It follows, then, that the LMDB database file is going to be at least as large as the total size of the active records in the database, plus whatever space LMDB needs for record-keeping. And it is, sort of. Read on.
An active database will be in a constant state of flux as new records are added, existing records changed, and unneeded records deleted. A frequent question is this: why doesn’t the database file get smaller when records are deleted? In fact, (panicky rise in voice) why does the database file only grow? Read on.
When an LMDB database is initially created, it occupies only as much space in virtual memory (and therefore on disk) as is needed to accommodate the root data structures. As entries are added, more virtual memory is needed, and so the corresponding mapped file must also grow, which it does, until it reaches the maximum size set by the DBA. That accounts for the simplest case of database file growth. As database entries are deleted, the space they used is not returned to the operating system. Indeed, doing so would be expensive and require compaction, which would be I/O intensive and would slow everything down substantially. Instead, the space the entry took is marked as free and later used to store new entries. This approach does no harm - any physical RAM that the entry used will not be referenced and therefore will be available for use by the OS. Remember too, that this design eliminates the need for a performance-killing compaction phase. Database performance remains predictable.
It stands to reason, then, that in a system where the net number of entries in a database remains the same, the amount of virtual memory consumed by a database and, by extension, the size of the file on disk, will remain the same. So why does an LMDB database file sometimes grow beyond even the largest number of entries it ever had? Just as importantly, why is it that two OpenLDAP multi-master database files often differ in size, sometimes by a gigabyte or more?
To understand that, we need to look at two more aspects of modern databases: ACID and MVCC. An in-depth discussion of these is beyond the scope of this document, but there are excellent articles discussing ACID at https://en.wikipedia.org/wiki/ACID and MVCC at https://en.wikipedia.org/wiki/Multiversion_concurrency_control.
Imagine that an LDAP client issues a search (read) request and the server identifies four entries it needs to return to satisfy the request. It queues the four entries for transmission and begins the process of returning them. Then another LDAP client simultaneously issues a delete request for one of the entries that’s either queued for transmission or in transit. What happens? The first client has already been told there are four entries on their way, and now suddenly there’s going to be one less? I’d hate to be working with that database! Consider further, what happens if one of the entries being returned is modified before it’s actually sent back to the requesting client? Do we now return the modified version of the entry? What if the modified version no longer matches the search criteria that caused it to be selected in the first place? Again, this is unexpected behavior on the server’s part and is generally considered unacceptable.
To address these conditions, database designers describe a characteristic called “Isolation”, which simply means that, within a particular transaction, the data that make up the transaction — here, the four entries being returned to the client — are protected until that transaction completes, in this case until all the entries have been sent. In other words, even though one of the entries being read was deleted out from under the reader or modified, it remains in existence in its original form until the operation is complete. After the delete operation completes, but before the read operation completes, the entry is in the funny state of being free but still busy, so it can’t be re-used.

Now imagine the case of the entry that’s being modified by one client while simultaneously being read by another. In this case, *two* versions of the entry exist simultaneously: the original one that’s waiting to be returned to the client, and the new one that was just written. For just those few milliseconds, the database needs extra space to hold the new version of the entry until the read on the existing version is complete. If no free entries are available, more space in virtual memory is needed and the mapped file grows. Once the read completes, the database marks the original version of the entry as free. This characteristic, where multiple versions of the same data exist, is called multi-version concurrency control, or MVCC. This description is simplified for clarity, but it illustrates the basic concepts.
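The coexistence of two versions described above can be sketched in a few lines. This is a conceptual MVCC illustration, not LMDB’s real implementation (LMDB versions whole B-tree pages, not individual entries), and the key names are invented:

```python
# Toy MVCC sketch: a writer installs a new version of an entry while a
# reader pinned to an older snapshot still sees (and keeps alive) the
# original version, so both versions consume space at once.
versions = {}          # key -> list of (txn_id, value); old versions linger
current_txn = 0

def write(key, value):
    global current_txn
    current_txn += 1
    versions.setdefault(key, []).append((current_txn, value))

def read(key, snapshot_txn):
    # Return the newest version no newer than the reader's snapshot.
    return max((v for v in versions[key] if v[0] <= snapshot_txn),
               key=lambda v: v[0])[1]

write("cn=alice", "original entry")
reader_snapshot = current_txn          # a reader starts; it sees txn 1

write("cn=alice", "modified entry")    # a writer commits txn 2 meanwhile

# Two versions now coexist until the reader finishes.
print(read("cn=alice", reader_snapshot))   # -> original entry
print(read("cn=alice", current_txn))       # -> modified entry
print(len(versions["cn=alice"]))           # -> 2
```

Only after the snapshot-1 reader completes could the original version’s space be marked free for reuse.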
Now imagine that the situations described in the previous paragraphs are happening all the time in a database, with hundreds, or even thousands of clients simultaneously reading, adding, deleting, and modifying entries. It’s not hard to see how we’d need a lot of extra space to hold on to extra versions of entries while they’re being sent across the network. And since LMDB doesn’t return space back to the OS, we see growth to a point where an operating equilibrium is attained.
You might suggest that making copies of these entries in memory would allow space to be freed up in the database immediately, but that would mean copying data from one buffer to another. One of the cardinal design rules of OpenLDAP and LMDB is that data is not copied unless absolutely necessary; that’s what makes OpenLDAP as efficient and as fast as it is. Consider also what would happen if one needed to make copies of many very large result sets.

There’s one more wrinkle to all this. Remember we said that an entry that’s waiting to be transmitted to a client will exist in its current form until that read completes, regardless of how it’s changed by any subsequent operations? What would happen in a busy server that handles lots of modify traffic if clients took their own sweet time reading their search results? What if they stalled? Early versions of OpenLDAP with LMDB experienced truly explosive database growth because of this. However, this is no longer a problem, because a reader now loses its Isolation if it stops reading from the server.
Now on to the question that started all this: why are databases on otherwise equivalent multi-master servers sometimes wildly different in size? The answer is they aren’t, you’re just looking with the wrong tools (disappointing, huh?). Previous paragraphs already described how differing read patterns on different servers can cause varying numbers of temporary ‘extra’ copies of database entries. Looking at the actual allocated file size on disk just indicates how many entries, temporary and actual, might once have existed in the database, not how many there actually are. The mdb_stat utility, however, will tell you the actual number of active pages in each database. Since entries don’t map exactly to database pages, the page counts will hardly ever be identical across servers, but, depending on the activity level of each server, they should be quite close. Any unused pages stand ready, already fetched from the OS, to receive new changes.
System administrators who monitor a system’s virtual memory statistics as a way of judging its health will become concerned when a system running LMDB shows high virtual memory use. This can happen in particular if an LMDB database file has undergone extensive growth and the data in it has since contracted, leaving a large amount of unused space in the file. A panicky admin might stop the LDAP daemon, compact the database with mdb_copy -c, and restart it, but that is really just a waste of time, because the unused space in the database file is not mapped into physical RAM and doesn’t affect system performance in any way. But it’s there to be used if it’s needed.
There’s an often-forgotten aspect of virtual memory: it is not how much virtual memory is allocated beyond the amount of physical RAM that affects a system’s overall performance; it’s whether or not that memory is actually being referenced. If it’s not being referenced, it simply sits out on disk and doesn’t cause anything else to happen until it’s referenced. In the case of unused LMDB space, the memory is never referenced, so it doesn’t contribute to ‘thrashing’, ‘swapping’, or other adverse activity.
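This "allocated but never referenced" point can be observed directly. The following Linux-only sketch (it reads /proc/self/status, which other operating systems don't provide) maps a large sparse file and shows that virtual size (VmSize) jumps by the mapping size while resident memory (VmRSS) barely moves, because no page is ever touched:

```python
# Linux-only demonstration: mapping a large file inflates virtual size
# (VmSize) while resident memory (VmRSS) barely moves, because pages
# that are never referenced are never brought into RAM.
import mmap
import os
import tempfile

def vm_kb(field):
    # Parse a field like "VmSize:  123456 kB" from /proc/self/status.
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])

path = os.path.join(tempfile.mkdtemp(), "big.db")
with open(path, "wb") as f:
    f.truncate(256 * 2**20)   # 256 MiB sparse file; no data written

size_before, rss_before = vm_kb("VmSize"), vm_kb("VmRSS")
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    size_after, rss_after = vm_kb("VmSize"), vm_kb("VmRSS")

    # Virtual size grew by ~256 MiB; resident set grew by almost nothing.
    print("VmSize grew by", size_after - size_before, "kB")
    print("VmRSS  grew by", rss_after - rss_before, "kB")
    mm.close()
```

This is the same effect an admin sees with a large, mostly unused LMDB map: alarming VmSize, harmless VmRSS.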
CONCLUSION
We hope that this gives you a better understanding of how LMDB is designed and why the size of .mdb files can vary between providers and consumers (even by gigabytes), even when they all contain exactly the same data. While it is prudent to monitor the memory and disk footprint of an LMDB database, the ultimate measure is the consistency of the actual data between providers and consumers. Do the contextCSNs match on all the servers? If you created LDIF backups of the provider and a consumer, would the contents of the backups match? If the answer to these tests is yes, then nothing is wrong, even though the .mdb files aren’t exactly the same size.
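The "would the LDIF backups match?" test can be sketched as a simple order-insensitive comparison. This is a minimal illustration with made-up sample entries; a real comparison should also exclude server-specific operational attributes, which this sketch ignores:

```python
# Minimal sketch: compare two LDIF exports entry-by-entry, ignoring the
# order in which entries (and attributes) happen to appear in each dump.
def normalized_entries(ldif_text):
    # Entries are separated by blank lines; sort attribute lines within
    # each entry, then sort the entries, so ordering differences vanish.
    entries = [e.strip() for e in ldif_text.strip().split("\n\n") if e.strip()]
    return sorted("\n".join(sorted(e.splitlines())) for e in entries)

provider = """
dn: cn=alice,dc=example,dc=com
cn: alice
sn: Smith

dn: cn=bob,dc=example,dc=com
cn: bob
sn: Jones
"""

# Same data, different entry order -- as often happens across replicas.
consumer = """
dn: cn=bob,dc=example,dc=com
cn: bob
sn: Jones

dn: cn=alice,dc=example,dc=com
cn: alice
sn: Smith
"""

print(normalized_entries(provider) == normalized_entries(consumer))  # -> True
```

If this comparison holds across all servers, differing .mdb file sizes are cosmetic, not a consistency problem.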