Cont’d from Building a database in C# – Part 2
Caching and persistence seem to go hand-in-hand for this adventure. With the cost of serialization and IO it’s just not possible to get anything to perform well without a little caching. The questions I had around caching is really more along the lines of “how” and “what” not about “if”. Let’s be clear, I generally hate using cache. It often produces ugly and problematic code that is hard to follow and even harder to debug. There are cases where it makes sense to use a cache and I felt this was one of those.
First things first, answering “What do I cache?”
This is a fairly obvious answer given the B+Tree structure. I need to cache these ‘node’ things and their associated lock. The biggest question that came up was if I wanted to exclude the data-nodes from the cache and only cache the hierarchy nodes. Conceptually this would improve the cache hits as the tree structure will be used many times more than the leaf data nodes. However, as I experimented with this I came to two issues with the exclusion: 1) The cost to rehydrate data nodes can be more expensive than the hierarchy nodes and to exclude it may reduce overall performance. 2) The cache mechanic I have in place is also responsible for caching the lock instance for a given node.
Next answering the hard question “How do I cache?”
I decided to tackle this as a three-prong approach. I wanted to provide several cache mechanics, one that does not cache data, one that caches everything, and then strike a balance somewhere in the middle. For obvious reasons all of these are write-through caching. The first was fairly easy, I needed only a lookup to provide the lock instance used for the node and to access the node itself I continually pass-through to the storage interface. The ‘full/complete’ cache was also trivial and performs very well if you happen to be able to fit all your data in memory. The usefulness of this seems limited since though since the whole point here is that the data is too large to fit entirely in memory. After all if I could fit it all in memory I might prefer to just use a Dictionary and simply save the contents to disk. So this complete cache is cool and all but doesn’t serve us well, so where is the balance?
Caches in a database tend to follow these basic rules: 1) allocate a hoard of memory, 2) sub-allocate from it until it is depleted, 3) once depleted try and unload something. This is a reasonable approach for most C++ applications, especially dedicated database processes. However, I’m building an embedded database/storage engine not a dedicated database server. The idea of allocating 100mb or 1gb at startup seems absurd in the managed world. After all I have something those C++ guys don’t have, a Garbage Collector. The trick to leveraging it to house the cache is that I need a few things to happen: 1) I can’t hold a direct reference to everything so the GC can clean up, and 2) I need to hold a direct reference to enough stuff that the GC won’t constantly clear my entire cache. So the first is easy enough, I’m sure your all aware of the WeakReference class and it’s properties so I won’t go into detail there. For the second part of this, keeping references alive, I needed some code…
Enter the ObjectKeepAlive class. I needed a high-throughput class that will hold references for me, but what criteria to use to keep reference around? Well to keep things easy I went with a simple min/max range for the number of items and timespan that decides when to reduce the number from max to min. Thus I always keep at least min items alive and have at most max items that are no older than the timeout. This was fairly easy to put together as a forward-only linked list of arrays where the list stores a time-stamp of last modification. I actually started by modifying the last example lock-less queue I posted here. I’ll have the code updated soon and for those wanting to join me abusing the GC you might find the ObjectKeepAlive class useful ;)
I know these posts have not been very detailed and stay on vague side. Unfortunately it is difficult to deep dive some of these topics without brutally long backgrounds on both concepts and implementation details. With the implementation spanning 1.3k LOC across 20 files it’s more involved than anyone in their right mind would care to read about. As it is I hope the code will prove more useful than these topics, I promise it’s coming soon, it’s already at the self-imposed required 100% functional coverage. I’m only waiting on sorting out some issues with the migration to mercurial over on googlecode.com.