BPlusTree

 
Overview

BPlusTree is a implementation of the generic IDictionary<TKey, TValue> interface backed by a disk-based B+Tree. There is no practical size limit on the total amount of data that be stored or on the size of the key or value data used. Keys and values can be any class, struct, or built-in primitive written in .NET provided that you can supply an implementation of ISerializer<T> to read and write the object. There are also several available storage, concurrency, and caching strategies available to choose from.

Storage

All storage systems implement a single interface, INodeStorage. This interface is responsible for the creation, reading, writing, and deletion of a node in the tree. You can implement your own storage subsystem and provide it to the options class via the StorageSystem property. Most of the time you will want to use one of the two forms of storage BPlusTree supports by default each of which have numerous options.

File Storage – The first and most common of these storage types is the file based storage. To use the file store simply specify the FileName property and the CreateFile policy on the Options class. The file storage uses a fixed block sub-allocation scheme to allocate blocks within the specified file. This block size can be specified or left as the 4kb default. The other important setting here is the FileOpenOptions setting which allows you to specify a combination of the System.IO.FileOptions flags. Both WriteThrough and NoBuffering (FILE_FLAG_NO_BUFFERING = 0×20000000, see File Buffering) are supported. Using the WriteThrough and NoBuffering options provides you with the most durable storage at a significant performance cost.

Memory Storage – Most of the time this is fairly useless as the whole B+Tree goes since you would ordinarily just use a regular Dictionary<TKey, TValue>. It is possible the BPlusTree can outperform a Monitor locked Dictionary; however, real world use cases of the excessive contention required seem unlikely. If you simply must have an IDictionary that is completely thread safe you are welcome to give it a try, just set the StorageType property on the options class to Memory.

Concurrency

Concurrent reads and writes are completely safe. By default writers do not block readers so a reader is only guaranteed to read a modification if the read operation began after the completion of the write. In addition the use of localized node level locking for writers allows multiple concurrent writers to co-exist in different locations of the tree.

The BPlusTree uses an implementation of an IFactory<ILockStrategy> to construct locks for each node read from the store. The default ILockStrategy uses a Monitor TryEnter/Exit to control access only for writers. If you are using the BPlusTree from a single thread, or you are opening it in ReadOnly mode, then you can disable the locking by setting the LockingFactory property of the Options class to an instance of the IgnoreLockFactory class.

With a few exceptions all public members and enumerators are completely thread safe by default. Those exceptions are diagnostics methods like Print, and Validate, as well as the methods EnableCount, Clear, and UnloadCache. All of these methods may be made thread safe by specifying a valid reader/writer lock (i.e. SimpleReadWriteLocking) on the CallLevelLock property of the options class. In addition once the CallLevelLock is set you can obtain a Write lock on this to obtain exclusive access to the entire tree.

Caching

The cache is fairly baked into the implementation at present and is therefore not extensible; however, there are several options to control its behavior. Setting the CachePolicy of the Options class is the primary control allowing for None (nothing is cached, everything is read from disk), Recent (high utilization nodes are kept in memory), and All (everything is kept in cache).

The default is Recent and this cache option has a few options of its own. The CacheKeepAliveMinimum/MaximumHistory properties control the range of items allowed to be cached while the CacheKeepAliveTimeout defines how quickly the cache should be reduced back to the minimum value. It is important to note that these numbers do not reflect unique nodes, but rather include each visit to a node. By default the cache holds at least the most recent 10 nodes visited and at most the last 100 nodes visited. With mild activity this cache will reduced itself after 1 minute. Please note that the BPlusTree does not create or manage threads so without some level of activity the cache may remain at its maximum capacity.

Fault Tolerance

What can go wrong will go wrong. Every effort has been made to ensure that the BPlusTree continues to behave even when things don’t go as expected. All modifications to the tree and the file are transacted internally to ensure modifications to the data or tree remain atomic and either complete or fail as a single unit.

Because of the fact that threads may be killed by unmanaged code or other catastrophic thread event (stack overflow, stack corruption) there may be a case when .NET’s Monitor becomes confused about who owns it. If this happens every call to the BPlusTree may throw a TimeoutException when attempting to acquire a lock. This is the other reason apart from regaining memory for the UnloadCache method (see Concurrency and the CallLevelLock property).

Though I’ve never seen it happen under normal circumstances, if your file does become corrupt you can attempt to recover the data with the RecoverFile/RecoveryScan.

Multiple Instances

Multiple instances of a BPlusTree using the same file is not supported unless the ReadOnly property on the Options class is set to true. Though it is possible to build this into or on top of the BPlusTree I personally prefer to use a single instance and IPC/RPC to allow cross-domain or cross-process access.

Exception-less Errors

Although BPlusTree supports the IDictionary interface the public methods available directly on the class do not always mirror those semantics. You should note that the public Add method returns a boolean result. This is also true for the Update, AddOrUpdate, and Remove methods. If you like exceptions using the IDictionary method Add with an existing key will raise one for you.

Simple Example:

    BPlusTree<string, Guid>.Options options = new BPlusTree<string, Guid>.Options(
        PrimitiveSerializer.String, PrimitiveSerializer.Guid, StringComparer.Ordinal)
         {
             CreateFile = CreatePolicy.IfNeeded,
             FileName = @"C:\Temp\Storage.dat"
         };

    using(BPlusTree<string, Guid> map = new BPlusTree<string, Guid>(options))
    {
        Assert.IsTrue(map.Add("foo", Guid.NewGuid()), "Failed to add foo");
        Assert.IsTrue(map.ContainsKey("foo"), "Failed to find foo");
    }
Online Resources
 

A few weeks ago I published NuGet packages for version 1.11.924.348.

 
I also managed to get protobuf-csharp-port on board. I absolutely live by protobuffers.

 
Finally I recently published a complete protobuf services rpc layer build on the RpcLibrary and protobuf-csharp-port. Of course much of the code there could be used with any raw transport like WCF, TCP, Named Pipes, etc so check it out. This library is also available on NuGet:

 
If you haven’t tried NuGet you should look at it. If you manage an open source library you really need to get published there. It has allowed me to publish projects that have dependencies without the need to check those dependencies into source control. This reduces the burden on source control, allows for smaller archives, and makes it easier to upgrade dependencies. So trash the binaries from your source control and get started with NuGet.

 

Ok I swear this is the last B+Tree chart, graph, or metric I’ll post… at least for a while. I wanted to hit new limits so I fired up 6 threads on my six-core and let them rip… After three hours I finally stopped the run to look at the results.

Each of the six threads had inserted over 34 million records, a total of over 200 million records inserted in 3 hours averaging over 18k inserts per second. The data file had reached 5.79 GB (6,221,824,032 bytes). I guess that isn’t too bad considering I was using my slow 2tb WD caviar green (I didn’t want to thrash my SSD). Opening the file doesn’t do much other than to read the first block, from a cold start it took 62ms. The first thousand random seeks took an average of 10ms and thereafter settled in at around 9ms. Sequential scans of 1000 keys at a random offset takes about 500ms averaging .5ms each.

The following is a graph of one thread (all 6 near identical) showing the time in seconds (vertical) and total records inserted in millions (horizontal). Obviously the point being made here is that the line is straight :)

Time / Records
B+Tree Insert Progression
 
If you are wondering why these numbers don’t line up with the B+Tree benchmarks previously posted there are two reasons. First I totally screwed up the multi-threaded test for those numbers. The second difference was the amount of data/record size. The above graph shows inserts of an Int32 key and a Guid value or 20 byte records. The benchmark used a Guid key on a 100 byte value or 116 byte records. The major reason for reducing the record size was to improve throughput so I could target a larger number of records.

 

Test Overview
The benchmark results below are from a single thread on a process with the thread-affinity set except two. The BPlusTree(Threaded) and obviously MySql were not constrained. All data used during the test was generated before the test ran. The number displayed in the logarithmic horizontal axis is the average number of operations completed per second. This is the worst average of the last 5 of 8 total runs.

Benchmark Results
Inserts per second

BPlusTree
The strong showing here is most certainly the select time. BPlusTree and STSdb outpace all the rest by a full order of magnitude. As you can see though the insert, update, and delete times, though respectable, are not near some of the other options.

BPlusTree (Threaded)
This test used 4 CPUs and 4 threads to produce the throughput shown. The general goal here was to demonstrate the throughput was higher than a single-threaded test. Showing a 50% gain on insert and update times across 4 threads gives you an idea of the level of resource contention.

So why are no other tests repeated as a multi-threaded version? Well quite simply I tried and all of the other libraries either did not support threading, or failed to prove thread-safe during testing. Raptor DB corrputed state and hung the application. Even the beloved Berkeley DB blew up on threaded deletes both in transactional and non-transactional modes. While MySql worked the benefits in performance were not worth noting here.

RaptorDB
Although this library did produce some good numbers on modifications I really have to call this one a non-starter.  Not only can you not close/unload it, but it basically does nothing more than append the data to a log and perform a manual lookup (at least until their indexing thread catches up).  This, IMHO, is just a nasty design flaw and shows up well in their blazing slow seek performance.  I managed to crash and/or corrupt the store several times, even on a single thread.  As for the claimed multi-threading capabilities of this library it’s a joke, this library is far from thread safe.  Frankly you would be better off appending the data to a single file with a Dictionary<K,V> to give you the file offsets.  That is basically what it seems to be doing behind the scene, even so far as to make you call SaveIndex() to persist the index. (Update: Review of RaptorDB version 1.7)

BPlusDotNet
Nothing spectacular here unless you need Java/C++ compatibility which is nice.  The select times on this seem to be sub-par and there is no support AFAIK for multi-threading.  All-in-all it behaved well in every single-threaded test.  Unlike RaptorDB it seems stable enough to use in the right environment.  It even performs much better than my own for deletes, I’ll have to take a look at that ;)

STSdb
This library really blew me away.  Although I could not find any documentation about multi-threading it is so wicked fast I’m not sure I care.  The only down side is that this library is GPL licensed unless your willing to pay for commercial use.  I don’t know what the cost of it is but if you need that kind of performance it may be worth investigating.  Definitely a two-thumbs-up library if you have plenty of ram and hard-drive space to work with ;)

BerkeleyDB
Again I was surprised to see this outpace a fully managed solution by a margin of 3:1.  Outside of it’s relatively slow select speed it was a consistent high-performer.  I will warn you that I was able to corrupt it’s state a few times, mostly in multi-threaded deletes which is why I did not post a threaded performance chart.  If you’ve never worked with it, the library is cumbersome to work with due to the infinite number of options.  I’ll also warn you to stay away from it’s transacted mode, with transactions enabled this performs very poorly especially for inserts (1350/sec).

MySQL
Just threw this in to give a comparison to a well-known RDBMS solution.

Resource Usage
Inserts per second

The chart above indicates kilobytes per 1000 records of 116 bytes. All considered everyone was close on resource usages except the clear winner and loser. I was using the GC for obtaining memory usage so Berkely does not have memory usage information; however, it’s disk cost was very good. The looser here is not really a looser based on it’s performance, but clearly we know where STSdb is getting all the performance from :)

Summary
Out of all the purely managed, free, and open source (non-GPL) I still think the BPlusTree stands out as a winner based on it’s low memory usage, fast seek times, and multi-threading support that actually works.  If I were willing to buy it though I’d have to go with STSdb and just buy some more memory and hard-drive space :)

 

Before I go much farther talking about the BPlusTree I want to address one thing. Does the BPlusTree support ACID operations? (definitions below from wikipedia)

Atomicity

Atomicity requires that database modifications must follow an “all or nothing” rule. Each transaction is said to be atomic. If one part of the transaction fails, the entire transaction fails and the database state is left unchanged.

All modifications to the B+ tree and the disk are transacted internally to ensure that if an exception occurs the tree and storage file are left in a valid state. So while this is mostly supported this one is a No see Durability below.

Consistency

Consistency states that only consistent (valid according to all the rules defined) data will be written to the database. Quite simply, whatever rows will be affected by the transaction will remain consistent with each and every rule that is applied to them.

The only consistency constraint imposed by the BPlusTree is that keys are unique, and that the serializer provided can serialize both the key and value. So as far as those simple rules are concerned the BPlusTree does support consistency.

Isolation

Isolation refers to the requirement that no transaction should be able to interfere with another transaction at all. In other words, it should not be possible that two transactions affect the same rows run concurrently, as the outcome would be unpredicted and the system thus unreliable.

The BPlusTree fully supports isolation when using the default configuration.

Durability

Durability means that once a transaction has been committed, it will remain so. In other words, every committed transaction is protected against power loss/crash/errors and cannot be lost by the system and can thus be guaranteed to be completed.

Each modification to the BPlusTree amounts to a single ‘node’ that is being modified. In the case of reorganization of the hierarchy (split/join) copies are made for the new child nodes but the parent is ‘modified’. The modification of a node essentially rewrites the file block associated with it in a single write call (followed by a flush). During this write (usually a 4kb block) if the power is interrupted or another God ordained event happens, like a BSOD, it is possible to get only part of the 4kb block written. Since the BPlusTree does not currently use a transaction file, this can cause corruption in the store. This is detected by the BPlusTree via a CRC32 checksum written into every file block so you will receive a specific exception. In short, the BPlusTree is not a Durable store by this definition.

 

Cont’d from Building a database in C# – Part 2

Caching and persistence seem to go hand-in-hand for this adventure. With the cost of serialization and IO it’s just not possible to get anything to perform well without a little caching. The questions I had around caching is really more along the lines of “how” and “what” not about “if”. Let’s be clear, I generally hate using cache. It often produces ugly and problematic code that is hard to follow and even harder to debug. There are cases where it makes sense to use a cache and I felt this was one of those.

First things first, answering “What do I cache?”

This is a fairly obvious answer given the B+Tree structure. I need to cache these ‘node’ things and their associated lock. The biggest question that came up was if I wanted to exclude the data-nodes from the cache and only cache the hierarchy nodes. Conceptually this would improve the cache hits as the tree structure will be used many times more than the leaf data nodes. However, as I experimented with this I came to two issues with the exclusion: 1) The cost to rehydrate data nodes can be more expensive than the hierarchy nodes and to exclude it may reduce overall performance. 2) The cache mechanic I have in place is also responsible for caching the lock instance for a given node.

Next answering the hard question “How do I cache?”

I decided to tackle this as a three-prong approach. I wanted to provide several cache mechanics, one that does not cache data, one that caches everything, and then strike a balance somewhere in the middle. For obvious reasons all of these are write-through caching. The first was fairly easy, I needed only a lookup to provide the lock instance used for the node and to access the node itself I continually pass-through to the storage interface. The ‘full/complete’ cache was also trivial and performs very well if you happen to be able to fit all your data in memory. The usefulness of this seems limited since though since the whole point here is that the data is too large to fit entirely in memory. After all if I could fit it all in memory I might prefer to just use a Dictionary and simply save the contents to disk. So this complete cache is cool and all but doesn’t serve us well, so where is the balance?

Caches in a database tend to follow these basic rules: 1) allocate a hoard of memory, 2) sub-allocate from it until it is depleted, 3) once depleted try and unload something. This is a reasonable approach for most C++ applications, especially dedicated database processes. However, I’m building an embedded database/storage engine not a dedicated database server. The idea of allocating 100mb or 1gb at startup seems absurd in the managed world. After all I have something those C++ guys don’t have, a Garbage Collector. The trick to leveraging it to house the cache is that I need a few things to happen: 1) I can’t hold a direct reference to everything so the GC can clean up, and 2) I need to hold a direct reference to enough stuff that the GC won’t constantly clear my entire cache. So the first is easy enough, I’m sure your all aware of the WeakReference class and it’s properties so I won’t go into detail there. For the second part of this, keeping references alive, I needed some code…

Enter the ObjectKeepAlive class. I needed a high-throughput class that will hold references for me, but what criteria to use to keep reference around? Well to keep things easy I went with a simple min/max range for the number of items and timespan that decides when to reduce the number from max to min. Thus I always keep at least min items alive and have at most max items that are no older than the timeout. This was fairly easy to put together as a forward-only linked list of arrays where the list stores a time-stamp of last modification. I actually started by modifying the last example lock-less queue I posted here. I’ll have the code updated soon and for those wanting to join me abusing the GC you might find the ObjectKeepAlive class useful ;)

I know these posts have not been very detailed and stay on vague side. Unfortunately it is difficult to deep dive some of these topics without brutally long backgrounds on both concepts and implementation details. With the implementation spanning 1.3k LOC across 20 files it’s more involved than anyone in their right mind would care to read about. As it is I hope the code will prove more useful than these topics, I promise it’s coming soon, it’s already at the self-imposed required 100% functional coverage. I’m only waiting on sorting out some issues with the migration to mercurial over on googlecode.com.

 
Building a database in C# - Part 2

Cont’d from Building a database in C# – Part 1 So with the B+ tree semantics out of the way it was time to start looking at what was missing. Obviously it was all in memory and not yet on disk, and once it is on disk I’ll certainly need to cache. Yet there was [...]

 

Ok, so for the last week or two I’ve been off on an adventure of sorts with persistent media data structures. At first I was like “Surely a good solution exists?” this isn’t a new problem. Unfortunately I was unable to locate any solutions that were open source (non-GPL) and written entirely in managed code. [...]