Ok, so for the last week or two I’ve been off on an adventure of sorts with persistent media data structures. At first I was like “Surely a good solution exists?” this isn’t a new problem. Unfortunately I was unable to locate any solutions that were open source (non-GPL) and written entirely in managed code. So then I thought I’d give it whirl and just go build something simple. That was a few days ago and I thought I’d share some of the experiences I’ve been through and those I’ve yet to tackle by starting this series of posts.
First things first, I need a box. A ‘box’ meaning what criteria defines the solution. In essence what am I trying to accomplish?
- Entirely managed code, and commercially embeddable open source (non-GPL)
- I want to interact with the database as an IDictionary for an interface
- I want the whole thing stored in one single file on disk
- I want to be able to update (read then write) an entry as an atomic operation
- ACID (Atomicity, Consistency, Isolation, Durability) operations
- A minimal MVCC at least providing for lockless reading from cache
- Low memory requirement, I want to be able to use it when it doesn’t fit in memory
Thats really all I want, a persistent IDictionary, fast, concurrent, scalable, quick and easy to use. That shouldn’t be hard, right? LMAO, Seriously I knew going in this was no walk in the park. That probably explains the lack of choices available.
Picking a data structure… Well I’m aware of several that fit this well. Rather quickly I narrowed my choices down to three well defined data structures: Hash Table, B Tree, or B+ Tree. After some reading through various websites on the application of these I found most often the preferred choice seems to be the B+ Tree as it provides several key benefits. First among these was iteration over key+value, including iteration over a key range, is easy to implement. Secondly disaster recovery is a little easier as all values are stored in the leaf nodes as opposed to a true B Tree where if your hierarchy is foo-bar you can’t recover all the values.
So at that point I know what I’m building, A B+ Tree that supports ACID operations and implements the generic IDictionary. The first step along this road was easy enough, make anything work. At the same time I didn’t want to rework the algorithm horribly once I got it working so I spent some time looking around the net.
One of the things that concerned me from the get-go was the cost to rewrite the tree when I hit the bottom (leaf). This requires that every node have knowledge of it’s parent. While this might seem like a reasonable thing at first it proved to be problematic at best. Basically the gist of the problem is that you wind up with one of two solutions, either you acquire write locks all the way down the tree so you can reorganize, or you acquire read locks down the tree and attempt to elevate the lock. Both of these approaches are crazy broken. Read lock elevation is just a broken concept to begin with IMHO, so that was a non-starter as the inevitable dead-lock and restart would be the result. Write locks all the way down the tree basically means you have no concurrency for write operations, so that’s out.
So what is the solution? Well I settled on a preemptive reorganization as you descend the tree. Basically if I’m going to insert, and I pass through a node that is full, I split the node. If I’m deleting, and I pass through a node that is at the minimum size, I join the node. This idea appealed greatly to me as it allows me to release the parent lock immediately after I check the size of the child I just locked. Another benefit is that children no longer need any reference to their parent making a reorganization quick and easy. Another great benefit is that multiple writers can exist within the tree at any given moment allowing for improved concurrency. Best of all was that now locks (write for insert/delete, or read) originate at the root and work their way down the tree. This eliminates any possibility of dead-locks within the tree (at least for a single operation).
So with this basic grasp of how I want the algorithm to behave I quickly got started. To get moving quickly I decided that my first pass would keep everything in memory and use a handle/pinning construct to allow me to swap out the storage mechanic later. I implemented the interface and immediately began writing the Add() method. To make my life easier I added a Print() method to dump the tree and then wrote a test harness to exercise the tree. By the time I finished Remove and Modify I had a pretty good start.
One thing I saw on Wikipedia about B+ Trees is that the leaf nodes can be a doubly-linked list allowing for fast iteration. So my first pass at the above implementation provided support for this. I realized later that this became a point of concern. Firstly it offended me to have the additional if/else statements in the split/join since they were already complicated enough. Yet the real problem came with multiple writers. The potential for dead-lock was back since I did not necessarily own a lock on the parent of the prev/next leaf. Thus I rewrote the enumeration to work from keys, and threw out the whole linked list thing.
Now for the hard parts, disk storage and caching…
… to be continued …