Cont’d from Building a database in C# – Part 1

So with the B+ tree semantics out of the way, it was time to start looking at what was missing. Obviously it was all in memory and not yet on disk, and once it is on disk I’ll certainly need to cache. Yet there was still one thing I wanted to address early on, before I started playing with persistence… synchronization.

Tackling synchronization and concurrency seemed like it was going to be a no-brainer. Take a simple reader-writer lock housed with each node, and lock accordingly… right? Well, it turns out there were two problems with this. First, what lock to use? I wanted a lock that would exhibit fairly consistent characteristics regardless of the type of load (heavy read, heavy write, or balanced read and write). Apparently that simple request was ‘shooting for the moon’…

ReaderWriterLock – This was my first stop; I already had an interface I had wrapped around it some time ago. Though at that time I quickly discovered that the implementation I had based on Monitor was more efficient, I never really dug any deeper. Surely the ReaderWriterLock would be more efficient than exclusive access, right? Well, it really depends on a couple of things: how long it takes to acquire the lock, and how long you are going to hold it. After some initial testing I found that the ReaderWriterLock does work wonders for very read-heavy uses. When 9 out of 10 operations or more are read locks, the ReaderWriterLock performs reasonably well. At 50% reads vs. writes its performance degrades notably, and at 90% writes it’s almost unusable. So if you need a reader-writer lock and will only take the write lock infrequently, this is an option. However, it’s completely unusable for my needs in this project… next…

ReaderWriterLockSlim – This one is a total WTF; I can’t find anything this implementation is good at. Its cost was consistently worse than using any other type of lock, even a Monitor. According to MSDN: “In addition, the performance of ReaderWriterLockSlim is significantly better than ReaderWriterLock. ReaderWriterLockSlim is recommended for all new development.” I call bull$h1t on that one.

FastResourceLock – I found this one online, and although it’s GPL-licensed, I gave it a try. As with most of these it performs very well at 1% to 10% writes; however, it degrades significantly as that percentage increases. Of all the locks, its performance for acquiring the read lock was far superior, but with the GPL license it’s really not worth considering for my use.

System.Threading.Monitor – Frankly speaking, this thing really moves. Even with a low percentage of writes, 10%, this lock is so efficient that it performs very close to the best reader-writer lock out there. Only when writes drop below that number do you start to have a performance issue. At 1% writes it is significantly (nearly an order of magnitude) slower than the plain old ReaderWriterLock. A bummer really; I sure wouldn’t mind just being able to use the lock keyword. However, since I can’t know the ratio of reads to writes beforehand, I need a lock that will perform well in all cases.

Given that none of the above really worked out, the sheer performance of the Monitor started me thinking: “Can I use this as a building block for my own lock?” So off I went to build my own reader-writer lock. The concept is simple really. For readers, I obtain the Monitor to ensure there are no writers, increment a count of readers, and release the Monitor. For writers, I obtain the Monitor, wait for the readers to decrement the count to zero, and then return. That sounds simple enough; let’s call it a SimpleReadWriteLock and get started:

SimpleReadWriteLock – So we only need a single piece of state, the reader count (int _readers), and we can write the lock as shown here in its most basic form:

    public class SimpleReadWriteLock
    {
        // Lock on a private object rather than 'this' so external code
        // can't interfere with the Monitor.
        private readonly object _sync = new object();
        private int _readers;

        public bool TryRead(int timeout)
        {
            // Hold the Monitor just long enough to prove no writer is active.
            if (!System.Threading.Monitor.TryEnter(_sync, timeout))
                return false;
            System.Threading.Interlocked.Increment(ref _readers);
            System.Threading.Monitor.Exit(_sync);
            return true;
        }

        public void ExitRead()
        {
            System.Threading.Interlocked.Decrement(ref _readers);
        }

        public bool TryWrite(int timeout)
        {
            if (!System.Threading.Monitor.TryEnter(_sync, timeout))
                return false;
            // The writer now owns the Monitor; wait for in-flight readers to drain.
            while (_readers > 0)
            { System.Threading.Thread.SpinWait(1000); /* some kind of wait and/or sleep */ }
            return true;
        }

        public void ExitWrite()
        {
            System.Threading.Monitor.Exit(_sync);
        }
    }

I’ve over-simplified the wait-for-readers loop for this post; however, it demonstrates just how simple the idea really is. In the version I’m actually using, I spin-wait for a few iterations before falling back to an AutoResetEvent’s WaitOne. Still, it should suffice to demonstrate the SimpleReadWriteLock concept and the basic idea of the implementation. Test after test has shown this performs very reliably and very fast. At the low end of writes, 1% to 10%, it outperforms the Monitor nicely. At the high end of writes, 50% or more, it performs on par with the Monitor, and never worse. So finally I have a lock that performs well enough at the low end, and very well at the high end, of writes.
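For the curious, here is a minimal sketch of that spin-then-fall-back idea. The iteration counts and the use of AutoResetEvent are my illustration of the concept, not the production code:

```csharp
using System.Threading;

// Sketch: readers count in and out; the last reader out signals the
// event so a waiting writer wakes promptly. A writer (already holding
// the Monitor) spins briefly, then falls back to a cheap timed wait.
class WriterWait
{
    private int _readers;
    private readonly AutoResetEvent _noReaders = new AutoResetEvent(false);

    public void EnterRead() => Interlocked.Increment(ref _readers);

    public void ExitRead()
    {
        if (Interlocked.Decrement(ref _readers) == 0)
            _noReaders.Set();           // wake a waiting writer
    }

    public void WaitForReaders()
    {
        int spins = 0;
        while (Volatile.Read(ref _readers) > 0)
        {
            if (spins++ < 100)
                Thread.SpinWait(1000);  // cheap busy-wait at first
            else
                _noReaders.WaitOne(1);  // then a timed kernel wait
        }
    }
}
```

The timed WaitOne (rather than an infinite one) guards against the race where the last reader exits between the count check and the wait.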

Lock Performance per 100,000 locks

This chart’s vertical axis is the average number of milliseconds over 3 runs; the horizontal axis is the percentage of write locks out of the total 100,000, with 1% at the far-left edge. It’s worth noting that inside each lock (read or write) the same minimal workload was performed. Had this been actual usage, with the lock held during an expensive operation, the results would vary drastically, especially where the Monitor is concerned.
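For reference, the benchmark loop was shaped roughly like this. The names and the fixed seed are illustrative; only the measured numbers above come from the real harness:

```csharp
using System;
using System.Diagnostics;

// Rough shape of the benchmark: N total lock acquisitions, a given
// percentage taken as writes, minimal work inside each lock.
static class LockBenchmark
{
    public static double Run(Action readLock, Action writeLock,
                             int total, int writePercent)
    {
        var rand = new Random(42);      // fixed seed for repeatable runs
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < total; i++)
        {
            if (rand.Next(100) < writePercent)
                writeLock();            // acquire/release + minimal work
            else
                readLock();
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

Each configuration was run three times and the elapsed milliseconds averaged, as the chart describes.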

So what’s the final conclusion on the lock for the database I’m building? Well, I started off using a variation of the SimpleReadWriteLock shown above. At present, however, I’m using an implementation that completely ignores the read-lock aspect and uses the Monitor for write locks only. Basically, the tree supports MVCC (Multi-Version Concurrency Control) to allow readers to perform reads without being blocked by write locks. This was a little tricky to get working, but I’ll cover that in the next post as I start to unravel the caching scheme.
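To give a rough idea of the MVCC direction before that post, here is a minimal sketch of the publish/read pattern, in which readers take a snapshot reference without locking while a writer swaps in a new version. This illustrates the concept only, not the tree’s actual code:

```csharp
using System.Threading;

// Readers see whichever fully-built version was current when they read;
// a writer (holding the write lock) publishes a replacement atomically.
class VersionedValue<T> where T : class
{
    private T _current;

    public VersionedValue(T initial) { _current = initial; }

    // Lock-free read: just an atomic reference load.
    public T Read() => Volatile.Read(ref _current);

    // Called by a writer after building the new version.
    public void Publish(T next) => Volatile.Write(ref _current, next);
}
```

Because a version is never mutated after it is published, a reader can never observe a half-written node, which is what lets the read lock disappear entirely.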


Ok, so for the last week or two I’ve been off on an adventure of sorts with persistent-media data structures. At first I thought, “Surely a good solution exists? This isn’t a new problem.” Unfortunately, I was unable to locate any solutions that were open source (non-GPL) and written entirely in managed code. So then I thought I’d give it a whirl and just go build something simple. That was a few days ago, and I thought I’d share some of the experiences I’ve been through, and those I’ve yet to tackle, by starting this series of posts.

First things first, I need a box. A ‘box’ meaning the criteria that define the solution. In essence, what am I trying to accomplish?

  1. Entirely managed code, and commercially embeddable open source (non-GPL)
  2. I want to interact with the database through an IDictionary interface
  3. I want the whole thing stored in one single file on disk
  4. I want to be able to update (read then write) an entry as an atomic operation
  5. ACID (Atomicity, Consistency, Isolation, Durability) operations
  6. A minimal MVCC, at least providing for lock-free reading from cache
  7. A low memory requirement; I want to be able to use it when the data doesn’t fit in memory
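To make criteria 2 and 4 concrete, here is a toy sketch of the kind of surface I’m after. The member names and the Dictionary-backed storage are illustrative only; the real store would sit on the tree and take its write lock:

```csharp
using System;
using System.Collections.Generic;

// Dictionary-like surface plus an atomic read-modify-write taking a
// delegate, so "read then write" happens under one write lock.
class AtomicDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _data = new Dictionary<TKey, TValue>();
    private readonly object _writeLock = new object();

    public void Add(TKey key, TValue value)
    { lock (_writeLock) _data.Add(key, value); }

    public bool TryGetValue(TKey key, out TValue value)
    { lock (_writeLock) return _data.TryGetValue(key, out value); }

    // Criterion #4: update an entry as a single atomic operation.
    public bool TryUpdate(TKey key, Func<TValue, TValue> update)
    {
        lock (_writeLock)
        {
            if (!_data.TryGetValue(key, out TValue current))
                return false;
            _data[key] = update(current);
            return true;
        }
    }
}
```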

That’s really all I want: a persistent IDictionary that is fast, concurrent, scalable, and quick and easy to use. That shouldn’t be hard, right? LMAO. Seriously, I knew going in this was no walk in the park, which probably explains the lack of choices available.

Picking a data structure… Well, I’m aware of several that fit this well. Rather quickly I narrowed my choices down to three well-defined data structures: a hash table, a B-Tree, or a B+ Tree. After reading through various websites on the application of these, I found the preferred choice most often seems to be the B+ Tree, as it provides several key benefits. First among these is that iteration over key+value pairs, including iteration over a key range, is easy to implement. Second, disaster recovery is a little easier, since all values are stored in the leaf nodes, as opposed to a true B-Tree where, if the hierarchy is corrupted, you can’t recover all the values.

So at that point I knew what I was building: a B+ Tree that supports ACID operations and implements the generic IDictionary. The first step along this road was easy enough: make anything work. At the same time, I didn’t want to rework the algorithm horribly once I got it working, so I spent some time looking around the net.

One of the things that concerned me from the get-go was the cost of rewriting the tree when I hit the bottom (a leaf). This requires that every node have knowledge of its parent. While this might seem reasonable at first, it proved problematic at best. The gist of the problem is that you wind up with one of two solutions: either you acquire write locks all the way down the tree so you can reorganize, or you acquire read locks down the tree and attempt to elevate the lock. Both of these approaches are badly broken. Read-lock elevation is just a broken concept to begin with, IMHO, so that was a non-starter; the inevitable dead-lock and restart would be the result. Write locks all the way down the tree basically mean you have no concurrency for write operations, so that’s out.

So what is the solution? Well, I settled on preemptive reorganization as you descend the tree. Basically, if I’m going to insert and I pass through a node that is full, I split that node; if I’m deleting and I pass through a node that is at the minimum size, I join it. This idea appealed greatly to me, as it allows me to release the parent’s lock immediately after I check the size of the child I just locked. Another benefit is that children no longer need any reference to their parent, making reorganization quick and easy. Another great benefit is that multiple writers can exist within the tree at any given moment, allowing for improved concurrency. Best of all, locks (write for insert/delete, or read) now originate at the root and work their way down the tree. This eliminates any possibility of dead-locks within the tree (at least for a single operation).
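The descent can be sketched in C#-flavored pseudocode; Node and its members here are placeholders, not the tree’s real types:

```csharp
// Precondition: the caller holds the write lock on 'parent' (the first
// call locks the root). Lock-coupling: lock the child, split it BEFORE
// descending if it is full, then release the parent and continue down.
void Insert(Node parent, TKey key, TValue value)
{
    Node child = parent.ChildFor(key);
    child.Lock.TryWrite(Timeout.Infinite);
    if (child.IsFull)
        parent.Split(child);      // preemptive split: parent never revisited
    parent.Lock.ExitWrite();      // parent can now accept other writers
    if (child.IsLeaf)
        child.AddEntry(key, value);
    else
        Insert(child, key, value); // lock-coupling continues downward
}
```

Deletion mirrors this shape, joining a minimum-size child instead of splitting a full one. Since every operation acquires locks strictly root-to-leaf, no two operations can wait on each other in opposite orders, which is why dead-lock is ruled out.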

So with this basic grasp of how I wanted the algorithm to behave, I quickly got started. To get moving quickly, I decided that my first pass would keep everything in memory and use a handle/pinning construct to allow me to swap out the storage mechanism later. I implemented the interface and immediately began writing the Add() method. To make my life easier, I added a Print() method to dump the tree, and then wrote a test harness to exercise it. By the time I finished Remove and Modify I had a pretty good start.

One thing I saw on Wikipedia about B+ Trees is that the leaf nodes can form a doubly-linked list, allowing for fast iteration. So my first pass at the above implementation provided support for this. I realized later that this was a point of concern. First, it offended me to have the additional if/else statements in the split/join logic, which was already complicated enough. But the real problem came with multiple writers: the potential for dead-lock was back, since I did not necessarily own a lock on the parent of the prev/next leaf. So I rewrote the enumeration to work from keys and threw out the whole linked-list idea.
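The key-driven enumeration amounts to re-descending from the root for each step, sketched here in C#-flavored pseudocode (TryGetNext is a placeholder for a root-to-leaf search for the first key greater than the one given):

```csharp
// Each step searches the tree afresh, so only the normal root-to-leaf
// locks are taken and no leaf needs to point at its neighbors.
IEnumerable<KeyValuePair<TKey, TValue>> EnumerateFrom(TKey start)
{
    TKey current = start;
    while (TryGetNext(current, out KeyValuePair<TKey, TValue> pair))
    {
        yield return pair;
        current = pair.Key;  // resume from the last key returned
    }
}
```

It costs an extra descent per step compared to linked leaves, but it keeps the locking order strictly top-down and the split/join code free of neighbor bookkeeping.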

Now for the hard parts, disk storage and caching…

… to be continued …