BPlusTree

 
Overview

BPlusTree is an implementation of the generic IDictionary<TKey, TValue> interface backed by a disk-based B+Tree. There is no practical limit on the total amount of data that can be stored or on the size of the key or value data. Keys and values can be any .NET class, struct, or built-in primitive, provided you can supply an implementation of ISerializer<T> to read and write the object. There are also several storage, concurrency, and caching strategies to choose from.

Storage

All storage systems implement a single interface, INodeStorage. This interface is responsible for the creation, reading, writing, and deletion of nodes in the tree. You can implement your own storage subsystem and provide it to the options class via the StorageSystem property. Most of the time, however, you will want to use one of the two forms of storage BPlusTree supports by default, each of which has numerous options.

File Storage – The first and most common of these storage types is file-based storage. To use the file store, simply specify the FileName property and the CreateFile policy on the Options class. The file storage uses a fixed-block sub-allocation scheme to allocate blocks within the specified file. The block size can be specified or left at the 4 KB default. The other important setting here is FileOpenOptions, which allows you to specify a combination of the System.IO.FileOptions flags. Both WriteThrough and NoBuffering (FILE_FLAG_NO_BUFFERING = 0x20000000, see File Buffering) are supported. Using the WriteThrough and NoBuffering options provides the most durable storage, at a significant performance cost.

Memory Storage – As far as the whole B+Tree goes this is usually of limited use, since you would ordinarily just use a regular Dictionary<TKey, TValue>. It is possible for the BPlusTree to outperform a Monitor-locked Dictionary; however, real-world use cases with the level of contention required seem unlikely. If you simply must have an IDictionary that is completely thread-safe, you are welcome to give it a try: just set the StorageType property on the options class to Memory.
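For example, an in-memory tree might be configured like this (a sketch; the option and enum names are taken from the text above, but worth verifying against your library version):

```csharp
using System;
using CSharpTest.Net.Collections;
using CSharpTest.Net.Serialization;

var options = new BPlusTree<int, string>.Options(
    PrimitiveSerializer.Int32, PrimitiveSerializer.String)
{
    StorageType = StorageType.Memory // no FileName needed; everything stays in RAM
};

using (var map = new BPlusTree<int, string>(options))
{
    map.Add(42, "forty-two");
}
```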

Concurrency

Concurrent reads and writes are completely safe. By default writers do not block readers, so a reader is only guaranteed to see a modification if the read operation began after the completion of the write. In addition, the use of localized node-level locking for writers allows multiple concurrent writers to coexist in different parts of the tree.

The BPlusTree uses an implementation of IFactory<ILockStrategy> to construct locks for each node read from the store. The default ILockStrategy uses Monitor TryEnter/Exit to control access for writers only. If you are using the BPlusTree from a single thread, or you are opening it in ReadOnly mode, you can disable locking by setting the LockingFactory property of the Options class to an instance of the IgnoreLockFactory class.
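A read-only, single-threaded open might then look like this (a sketch; IgnoreLockFactory is named in the text above, but verify the type name against your library version):

```csharp
using System;
using CSharpTest.Net.Collections;
using CSharpTest.Net.Serialization;
using CSharpTest.Net.Synchronization;

var options = new BPlusTree<string, Guid>.Options(
    PrimitiveSerializer.String, PrimitiveSerializer.Guid, StringComparer.Ordinal)
{
    FileName = @"C:\Temp\Storage.dat",
    ReadOnly = true,
    LockingFactory = new IgnoreLockFactory() // no-op locks; safe only when uncontended
};
```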

With a few exceptions, all public members and enumerators are completely thread-safe by default. The exceptions are the diagnostic methods Print and Validate, as well as the methods EnableCount, Clear, and UnloadCache. All of these can be made thread-safe by specifying a valid reader/writer lock (e.g. SimpleReadWriteLocking) on the CallLevelLock property of the options class. In addition, once the CallLevelLock is set you can acquire a write lock on it to gain exclusive access to the entire tree.
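Configuring a call-level lock, and using it for exclusive access, might look roughly like this (a sketch; the exact lock-acquisition API, shown here as a disposable Write(), may differ in your library version):

```csharp
using System;
using CSharpTest.Net.Collections;
using CSharpTest.Net.Serialization;
using CSharpTest.Net.Synchronization;

var options = new BPlusTree<string, Guid>.Options(
    PrimitiveSerializer.String, PrimitiveSerializer.Guid, StringComparer.Ordinal)
{
    // Makes Clear/UnloadCache/EnableCount/Print/Validate thread-safe.
    CallLevelLock = new SimpleReadWriteLocking()
};

using (var map = new BPlusTree<string, Guid>(options))
{
    // Acquiring the write lock gives exclusive access to the whole tree.
    using (options.CallLevelLock.Write())
        map.Clear();
}
```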

Caching

The cache is fairly baked into the implementation at present and is therefore not extensible; however, there are several options to control its behavior. The CachePolicy property of the Options class is the primary control, allowing None (nothing is cached, everything is read from disk), Recent (highly utilized nodes are kept in memory), and All (everything is kept in cache).

The default is Recent, and this cache option has a few settings of its own. The CacheKeepAliveMinimumHistory/CacheKeepAliveMaximumHistory properties control the range of items allowed in the cache, while CacheKeepAliveTimeout defines how quickly the cache is reduced back to the minimum. It is important to note that these numbers do not count unique nodes, but rather each visit to a node. By default the cache holds at least the most recent 10 nodes visited and at most the last 100 nodes visited. With mild activity this cache will reduce itself after 1 minute. Please note that the BPlusTree does not create or manage threads, so without some level of activity the cache may remain at its maximum capacity.
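Putting those cache settings together (a sketch; the exact property names are expanded from the text above and should be verified against your library version):

```csharp
using System;
using CSharpTest.Net.Collections;
using CSharpTest.Net.Serialization;

var options = new BPlusTree<string, Guid>.Options(
    PrimitiveSerializer.String, PrimitiveSerializer.Guid, StringComparer.Ordinal)
{
    CachePolicy = CachePolicy.Recent,   // the default
    CacheKeepAliveMinimumHistory = 10,  // shrink no further than the last 10 visits
    CacheKeepAliveMaximumHistory = 100, // cap at the last 100 visits
    CacheKeepAliveTimeout = 60000       // reduce toward the minimum after ~1 minute (ms)
};
```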

Fault Tolerance

What can go wrong will go wrong. Every effort has been made to ensure that the BPlusTree continues to behave correctly even when things don’t go as expected. All modifications to the tree and the file are transacted internally to ensure that changes to the data or tree remain atomic, either completing or failing as a single unit.

Because threads may be killed by unmanaged code or another catastrophic thread event (stack overflow, stack corruption), there can be cases where .NET’s Monitor becomes confused about who owns it. If this happens, every call to the BPlusTree may throw a TimeoutException when attempting to acquire a lock. This, apart from reclaiming memory, is the other reason for the UnloadCache method (see Concurrency and the CallLevelLock property).

Though I’ve never seen it happen under normal circumstances, if your file does become corrupt you can attempt to recover the data with the RecoverFile/RecoveryScan methods.

Multiple Instances

Multiple instances of a BPlusTree using the same file are not supported unless the ReadOnly property on the Options class is set to true. Though it is possible to build this into, or on top of, the BPlusTree, I personally prefer to use a single instance and IPC/RPC to allow cross-domain or cross-process access.

Exception-less Errors

Although BPlusTree supports the IDictionary interface, the public methods available directly on the class do not always mirror its semantics. Note that the public Add method returns a boolean result, as do the Update, AddOrUpdate, and Remove methods. If you prefer exceptions, calling Add through the IDictionary interface with an existing key will raise one for you.
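The difference in semantics looks like this (a sketch, assuming an already-constructed options instance):

```csharp
using System;
using System.Collections.Generic;
using CSharpTest.Net.Collections;

using (var map = new BPlusTree<string, int>(options))
{
    bool added = map.Add("foo", 1); // true: key was inserted
    bool dup   = map.Add("foo", 2); // false: key exists, no exception thrown

    IDictionary<string, int> dict = map;
    dict.Add("foo", 3);             // throws: IDictionary.Add requires a new key
}
```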

Simple Example:

    BPlusTree<string, Guid>.Options options = new BPlusTree<string, Guid>.Options(
        PrimitiveSerializer.String, PrimitiveSerializer.Guid, StringComparer.Ordinal)
         {
             CreateFile = CreatePolicy.IfNeeded,
             FileName = @"C:\Temp\Storage.dat"
         };

    using(BPlusTree<string, Guid> map = new BPlusTree<string, Guid>(options))
    {
        Assert.IsTrue(map.Add("foo", Guid.NewGuid()), "Failed to add foo");
        Assert.IsTrue(map.ContainsKey("foo"), "Failed to find foo");
    }
Version 2.x Released

The information above applies to Version 1 of the BPlusTree; for more information about the changes in Version 2, see the following series of articles.

Online Resources
 

In my previous post, “Why GUID primary keys are a database’s worst nightmare”, I spoke about some of the downsides to using GUID primary keys. In this post I’m going to focus more specifically on the .NET Framework’s implementation of System.Guid. I have a few complaints about this class; let’s cover them…

1. Guid.ToByteArray()

This, IMHO, is one of the worst things about a Guid. First, this method returns a byte array in little-endian format. Not only is this directly opposite to the standard GUID format, it also means the bytes do not compare the same as the structured value. Thus when you combine System.Guid with sequential guid generation you wind up with either a structure that doesn’t compare sequentially, or a byte array that doesn’t compare sequentially. Yes, you can always swap bytes (1,4), (2,3), (5,6), and (7,8), but really, should I have to?
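The swap in question is just reversing the first three fields that ToByteArray() emits little-endian. A minimal sketch (the helper name is mine, not part of any library):

```csharp
using System;

static class GuidBytes
{
    // Convert Guid.ToByteArray()'s little-endian layout to big-endian
    // RFC 4122 order: the (1,4), (2,3), (5,6), (7,8) byte swaps.
    public static byte[] ToBigEndianBytes(Guid guid)
    {
        byte[] b = guid.ToByteArray();
        Array.Reverse(b, 0, 4); // the int field  (bytes 1-4)
        Array.Reverse(b, 4, 2); // first short    (bytes 5-6)
        Array.Reverse(b, 6, 2); // second short   (bytes 7-8)
        return b;               // bytes 9-16 are already in order
    }
}
```

With this ordering the byte array compares the same way the structured Guid displays.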

The second issue I have with this method is that it absolutely requires me to allocate the bytes on the heap. While this may not sound like a big deal, if you are moving large volumes of data and serializing those Guids the only way you can (as byte[]), it will thrash the GC. I’m hopeful that some future version of .NET will include something like ToByteArray(GuidFormatType, byte[], offset), where GuidFormatType would allow little-endian or big-endian formats.

2. Guid.GetHashCode()

This one puzzles me. They chose to completely ignore bytes 9, 10, 12, 13, 14, and 15 when computing a Guid’s hash code. OK, so maybe with random Guids this really doesn’t matter, but when you start trying to use Guids with sequential values this starts to impact hash performance. If, as in my example in the previous post, bytes 1–8 are date-time, then your entire hash value amounts to the entropy introduced in bytes 11 and 16. To be specific, byte 11 XORs with the MSB and byte 16 XORs with the LSB.

3. Guid.NewSequentialGuid()

Don’t you wish that .NET could just do this for us? We’ve known this to be a problem, and have known of a resolution to it, since 2002. They’ve even added “NEWSEQUENTIALID()” to SQL Server. Wouldn’t you think they could get this done for us? I guess until they fix the endianness of ToByteArray it’s really a moot point anyway.

Of course one option is to just PInvoke UuidCreateSequential.

static class Win32
{
    [DllImport("rpcrt4.dll", SetLastError=true)]
    public static extern int UuidCreateSequential(out Guid guid);
}

4. Guid.(Missing Properties)

Another annoyance is the lack of access to the values within the Guid. Would it have really been that hard to expose the same integer, two shorts, and 8 bytes that we can pass to the constructor? One could then create an efficient binary serializer for the Guid type without a GC allocation and without worrying about byte-swapping little-endian values. In addition, the underlying randomization could then be reused without using the ToByteArray method. Maybe this isn’t a big deal for most people, but I’m still miffed that (AFAIK) there is not a single thread-safe means of creating a random Guid value without allocating a byte[].

 

When you ask most people why using a GUID column for a primary key in a database might be slower than an auto-incremented number, the answer you’re likely to get is usually along these lines:

“Guids are 16 bytes, and integers are only 4 bytes.”

While this information is technically accurate, it is completely the wrong answer to the question. The performance delta from a 16-byte key to a 4-byte key is most likely negligible and almost certainly impossible to measure accurately. To understand what is wrong with a guid, you first need to know a little about how a database stores its data.

The most common storage structure used by databases (to the best of my knowledge) is a B+Tree. They offer great performance even for very large sets of data. They do, however, suffer from one problem; let’s call this ‘key density’.

By key density I’m referring to the density, or close proximity, of the keys being accessed at any given moment as compared to the universe of all keys in storage. For example, let’s say I have 10 million records, each keyed by a unique integer numbered from 1 to 10,000,000. I can say that keys {44,46,52} have a high density, whereas keys {100,10553,733555} have a very low density.

Cache is everything. When a database needs to read or write a record, it traverses nodes in the tree, reading them into memory from disk. Disk IO is the single biggest time expense a database has, so to reduce the number of reads, nodes visited while fetching or writing a record are usually cached. This allows more efficient retrieval of the record the next time it is requested. As more low-density keys are accessed, more and more unique nodes are fetched from disk into cache. Yet every cache has its limits, bounded primarily by the hardware available and often by software configuration. Once the available cache space has been used, stale/old nodes are removed from the cache to make room for newer ones.

So now let us imagine how this applies when using an auto-numbered field. To insert the next new row I’ll need to travel down the right edge (highest key values) of the tree. Once at the leaf, insert and done (oversimplified). Since each write uses the next possible key after the last one used, it is practically guaranteed to find the nodes it needs in cache. So, a few quick memory look-ups later, the record can be inserted without reading from disk. It should now be obvious why Guids will have problems…

Creating a new Guid is essentially done by taking a value ‘uniformly at random’ from the entire range of possible Guid values. That means that, unlike our auto-numbered field with its high key density, our Guid keys are designed to be sparse, i.e. to have a low key density. Because of this, two Guid values landing in the same tree node is statistically less likely than winning the lottery. Using a Guid as a primary key practically obliterates the database’s ability to find nodes in the cache based upon previous queries. By the time the database has passed about 10 million rows your performance will fall drastically.

Want more proof? Read the Guid article on Wikipedia under the section “Sequential algorithms”; it discusses this very topic. It also discusses a solution, first introduced by Jimmy Nilsson in 2002, called a COMB Guid (combined guid) for its combination of a timestamp with random data.

So are Guids evil in a database? No. It’s the algorithm that generates the Guid that is the issue, not the Guid itself. As mentioned in the articles linked above, there are other algorithms, and numerous possibilities are documented online. I’ve implemented one of these that out-performs the native Guid.NewGuid() implementation and it will be released in the next few weeks. Until then, the following is fairly similar to the algorithm I’m using…


static class SequentialGuidGenerator
{
    static int _sequenced = (int)DateTime.UtcNow.Ticks;

    private static readonly System.Security.Cryptography.RNGCryptoServiceProvider _random =
        new System.Security.Cryptography.RNGCryptoServiceProvider();
    private static readonly byte[] _buffer = new byte[6];

    public static Guid NewGuid()
    {
        long ticks = DateTime.UtcNow.Ticks;
        int sequenceNum = System.Threading.Interlocked.Increment(ref _sequenced);
        lock (_buffer)
        {
            _random.GetBytes(_buffer);
            return new Guid(
                (int) (ticks >> 32), (short) (ticks >> 16), (short) ticks,
                (byte) (sequenceNum >> 8), (byte) sequenceNum,
                _buffer[0], _buffer[1], _buffer[2], _buffer[3], _buffer[4], _buffer[5]
                );
        }
    }
}

The problem with this code is that not all databases compare Guids the same way, or in the same format. So you should be cautious about using this approach until you understand how your database stores and compares a Guid type.

Lessons Learned and Applied Concept

The interesting, and rather undocumented, aspect of this issue is that it applies just as well to all types of composite keys. Let’s take an example: we have a simple logging structure we are writing to with NLog in a database. The rows are identified by a Guid, but we almost never query these records by Id. Most of the time when we query this data we are looking within a range of dates for a specific event. How would you model the primary key and indexes? Most people want as small a primary key as possible, so the natural assumption is to use the ID of the record. This, as we’ve already covered, is generally a bad idea just because we are using Guids, but even more so because our queries will be time-based. By promoting the timestamp into the primary key we not only gain better query performance, we also remove the problem of the GUID identifier. Let’s see the example:

TABLE LogFile {
    Column DateTime timestamp,
    Column GUID id,
    Column int eventType,
    Column string message,
    Primary Key ( timestamp, id )
}

With the primary key using the timestamp as its first comparison, we will always be writing to the same area within the table and will consistently hit the cache for writes. When seeking for data, the timestamp will group all the needed records together, so that the data we are after is stored as densely as possible, requiring the fewest possible reads.

Now let’s drift this example a little to a new scenario. Let’s say the log above is being written from an ASP.NET application in which all users are authenticated. We want to add this user identity to the LogFile data being written, and we want to constrain queries to be associated with a user. In the example above it may be safe to simply modify the primary key (timestamp, id) to include the user as the first field. So will the primary key (userId, timestamp, id) now perform well for a single user? The answer is yes and no. It really depends on the application. Introducing userId as the first field of the primary key means that we could be writing to as many places in the file as we have users. If the application is a mobile app polling our web server every minute, then we’ve just scattered our writes across all N thousands of users in our system. Yet if our application requires a human being using a web browser or mobile app, the number of active write points in the file drops considerably… at least until you’re Facebook, and at that point you cash out and go home :)

One last example. Say you are building SkyDrive, or GDrive, or whatever, and you want to store the following entities: Customer, Drive, Folder, File. Each entity is identified by a GUID. What does the File’s primary key look like? Well, I’d probably key the File by (CustomerID, DriveID, FolderID, Timestamp, FileID). You would obviously need an ancillary index by FileID in order to access the file directly. Food for thought anyway…

The reason I bring all this up is that there is no rule of thumb for a good primary key. Each application has different needs and different requirements. What you should take away is that the density of keys being written to in the first field of your primary key will ultimately dictate your write throughput, and likely your scalability. Be conscious of this and choose keys wisely. The Guid ID of a record is not always the best answer for a primary key.

As a side note, using Guids as a primary key in the B+Tree will work, but much more slowly at large volumes (2+ million records). Using a sequential guid generator like the one above, or an ordered natural key (like a qualified file name), will serve you much better. Ordered (or near-ordered) keys provide linear scalability, whereas unique random GUIDs will start to suffer once you’ve exceeded the cache space.

 

Almost the first thing you need in order to create a BPlusTree is the key/value serializers, which are implementations of the ISerializer<T> interface. They are also one of the things that can make or break your first impressions of using the BPlusTree: a decently performing implementation is important to get the most out of your tree. Aside from performance, there is one other area of concern, that of the semantic interface.

Implementing the ISerializer<T> Interface

First let’s take a quick look at the interface definition for those that are not familiar with it.

public interface ISerializer<T>
{
    void WriteTo(T value, Stream stream);
    T ReadFrom(Stream stream);
}

Simple enough, right? Well, not exactly. One thing is not obvious from this interface: the ‘stream’ parameter is not constrained in any way. One could seek to an offset one should not be at, or read past the end of the item one is supposed to be reading. Exceptions are also a bad idea here since they can prevent you from reading the tree. So the following guidelines should be used when implementing this interface for use with the BPlusTree:

  1. Always use an immutable class, a value type, or ensure that references to the class inserted/updated/fetched are not modified.
  2. When writing values you should try to avoid any case for throwing an exception.
  3. Always use a length-prefix or termination value to prevent reading more data than was written for this value.
  4. When reading you should only throw exceptions when you believe the data has been corrupted.
  5. Never read more data than was written for this record as this will cause subsequent values to be corrupted.
  6. Do not seek or use positional information from the stream, assume it is either append-only writing, or forward-only reading.
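As an illustration of these rules, here is a hand-rolled, length-prefixed string serializer (for real use, PrimitiveSerializer.String already exists; this sketch only shows the guidelines in practice, and the class name is mine):

```csharp
using System;
using System.IO;
using System.Text;
using CSharpTest.Net.Serialization; // for ISerializer<T>

public class LengthPrefixedStringSerializer : ISerializer<string>
{
    public void WriteTo(string value, Stream stream)
    {
        if (value == null)
        {
            stream.Write(BitConverter.GetBytes(-1), 0, 4); // null marker
            return;                                        // rule 2: nothing here throws
        }
        byte[] bytes = Encoding.UTF8.GetBytes(value);
        stream.Write(BitConverter.GetBytes(bytes.Length), 0, 4); // rule 3: length prefix
        stream.Write(bytes, 0, bytes.Length);
    }

    public string ReadFrom(Stream stream)
    {
        byte[] lenBytes = new byte[4];
        if (stream.Read(lenBytes, 0, 4) != 4)
            throw new InvalidDataException("corrupt length prefix"); // rule 4
        int len = BitConverter.ToInt32(lenBytes, 0);
        if (len < 0) return null;
        byte[] bytes = new byte[len];
        int read = 0, n;
        while (read < len && (n = stream.Read(bytes, read, len - read)) > 0)
            read += n; // rule 5: never read past the bytes written for this record
        if (read != len)
            throw new InvalidDataException("corrupt record"); // rule 4
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Note that it only moves forward through the stream (rule 6) and never inspects Position or Length.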

Primitive Serialization

For most primitive types you should not need to write one; instead use the PrimitiveSerializer class (declared in CSharpTest.Net.Serialization). The PrimitiveSerializer class has static properties for many of the serializers you will need for primitives. These are all written using little-endian byte ordering for numeric values.

If you need a more compact format, there is another class to help you with numeric encoding. The VariantNumberSerializer class exposes protocol-buffer-like variant (varint) encoding of integer values. Basically, this can be used to store positive integer values in a more compact format than the PrimitiveSerializer uses.

Neither of these provides an array serializer for anything other than byte[]. This can often be a desired type, so here is a generic implementation that will serialize arrays given a serializer for the item type.

public class ArraySerializer<T> : ISerializer<T[]>
{
    private readonly ISerializer<T> _itemSerializer;
    public ArraySerializer(ISerializer<T> itemSerializer)
    {
        _itemSerializer = itemSerializer;
    }

    public T[] ReadFrom(Stream stream)
    {
        int size = PrimitiveSerializer.Int32.ReadFrom(stream);
        if (size < 0)
            return null;
        T[] value = new T[size];
        for (int i = 0; i < size; i++)
            value[i] = _itemSerializer.ReadFrom(stream);
        return value;
    }

    public void WriteTo(T[] value, Stream stream)
    {
        if (value == null)
        {
            PrimitiveSerializer.Int32.WriteTo(-1, stream);
            return;
        }
        PrimitiveSerializer.Int32.WriteTo(value.Length, stream);
        foreach (var i in value)
            _itemSerializer.WriteTo(i, stream);
    }
}
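Wiring it up as a value serializer might look like this (a sketch, following the Options constructor shape used in the examples above):

```csharp
using System;
using CSharpTest.Net.Collections;
using CSharpTest.Net.Serialization;

// Key: string, Value: int[] serialized via the ArraySerializer above.
var options = new BPlusTree<string, int[]>.Options(
    PrimitiveSerializer.String,
    new ArraySerializer<int>(PrimitiveSerializer.Int32),
    StringComparer.Ordinal);
```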

Using Protocol Buffers with the BPlusTree

Protocol buffers are an excellent way to serialize data for the BPlusTree. They are compact, fast, and can be extended/modified over time without having to rewrite the records. For this reason I highly recommend using protobuf-csharp-port to generate the classes you will use. In addition to the serialization benefits mentioned, this library is also built upon immutable types with builders. This is very important for the accuracy of the data, since the BPlusTree makes no guarantees about when an instance will be serialized and may use the same instance multiple times when reading. The following serializer provides a generic implementation that can be used with the protobuf-csharp-port library (NuGet Google.ProtocolBuffers).

using CSharpTest.Net.Serialization;
using Google.ProtocolBuffers;
internal class ProtoSerializer<T, TBuilder> : ISerializer<T>
    where T : IMessageLite<T, TBuilder>
    where TBuilder : IBuilderLite<T, TBuilder>, new()
{
    private delegate TBuilder FactoryMethod();
    private readonly FactoryMethod _factory;

    public ProtoSerializer()
    {
        // Pre-create a delegate for instance creation, new TBuilder() is expensive...
        _factory = new TBuilder().DefaultInstanceForType.CreateBuilderForType;
    }

    public T ReadFrom(Stream stream)
    {
        // Use Build or BuildPartial; the latter does not throw if required fields are missing.
        return _factory().MergeDelimitedFrom(stream).BuildPartial();
    }

    public void WriteTo(T value, Stream stream)
    {
        value.WriteDelimitedTo(stream);
    }
}

What about using protobuf-net? This is also very easy to implement and would be the preferred way to do code-first serialization (no proto description, just a .NET class/struct). As with protobuf-csharp-port, you must length-prefix the messages. Avoid using the IFormatter implementation in protobuf-net, since it has a semantic incompatibility with BinaryFormatter and other implementations (it is not length-prefixed and reads to end-of-stream). Here is a generic class for serialization using protobuf-net’s static Serializer class:

public class ProtoNetSerializer<T> : ISerializer<T>
{
    public T ReadFrom(Stream stream)
    {
        return ProtoBuf.Serializer.DeserializeWithLengthPrefix<T>(stream, ProtoBuf.PrefixStyle.Base128);
    }
    public void WriteTo(T value, Stream stream)
    {
        ProtoBuf.Serializer.SerializeWithLengthPrefix<T>(stream, value, ProtoBuf.PrefixStyle.Base128);
    }
}
 

Continued from: BPlusTree 2.0 Changes Part 4 – Rewriting the Storage Format

Getting Started

To utilize the new file format in the BPlusTree you will need to use the OptionsV2 class. The previous Options class is still there and continues to use the same file format.

// Create the OptionsV2 class to specify construction arguments:
var options = new BPlusTree<string, Guid>.OptionsV2(
    PrimitiveSerializer.String, PrimitiveSerializer.Guid, StringComparer.Ordinal)
    {
        CreateFile = CreatePolicy.IfNeeded,
        FileName = @"C:\Temp\Storage.dat",
        StoragePerformance = StoragePerformance.Default,
        TransactionLogLimit = 100 * 1024 * 1024, // 100 MB
    }
    .CalculateBTreeOrder(10, 16);

// Create the BPlusTree using these options
using(var map = new BPlusTree<string, Guid>(options))
{
    map.Add("foo", Guid.NewGuid());
    Assert.IsTrue(map.ContainsKey("foo"), "Failed to find foo");
}

Choosing a StoragePerformance Option

I’ve attempted to simplify the multitude of options available into this simple enumeration. The default (LogFileInCache) should be acceptable for most uses, but read the following carefully to understand what you are getting. The StoragePerformance enumeration has the following possible values and implications:

StoragePerformance.Fastest
This option is safe from corruption and provides the fastest possible throughput. It is essentially the same behavior as LogFileInCache, but without the ability to recover any data written since the last successful call to Commit() or Dispose(). For some uses, for example a temporary index of data that will not fit entirely in memory, this is going to be the best option. The performance can vary, but you should be able to see 100–200k writes/second.

StoragePerformance.LogFileInCache (Default)
This is the default option and the fastest durable option. The transaction log is written to system cache, so in the event of a power failure or OS crash some records may be lost. This option should not corrupt state, but may take some time to recover from a crash. The performance, even on slow disks, is quite good. You should expect anywhere from 30–60k writes/second or more, depending on the size of the data.

StoragePerformance.CommitToCache
This is the least durable option and the only one that may result in data file corruption. Corruption may occur in the event of a power failure or OS crash with a partially flushed system cache. Provided the OS cache is flushed entirely or not at all, the file will not be corrupted. The performance of this option will usually reach around 6–8k writes/second. This can be a viable option if you do not want a transaction log but need to keep every write on disk in case of a process crash.

StoragePerformance.LogFileNoCache
This option is intended to give you an ACID BPlusTree at the least cost in performance. The downside is that should the process crash or the power fail, the file will need to be recovered. See the discussion below on using transaction log files. Generally this option will perform at about 2–4k writes/second.

StoragePerformance.CommitToDisk
Using this option should give you a fully ACID-compliant BPlusTree. To be completely ACID (as with any database) you also need to disable cache and write-behind features on the hardware. Performance will suffer, as with any real database. Depending on the speed of the disk and the size of the data you can expect 1–2k writes/second or less.

Using Transaction Logs

The LogFileNoCache option opens its transaction log with FileOptions.WriteThrough so that every write goes through the system cache to disk; recovering the file from the log can be expensive, depending on the volume of data. The data file itself remains in its current state until a call to Commit() or Dispose(). Calling Rollback() will, if you are using transaction logs, revert the storage file to its last committed state. Commit and Rollback are exclusive operations on the entire tree, so all readers and writers must exit before they can proceed.

For long-running processes (i.e. services) you may want to provide a thread that performs periodic calls to Commit() when idle. If you do not have a way to detect idle conditions, or don’t want to bother, set OptionsV2.TransactionLogLimit to the maximum allowable log size. Keep in mind that calls to Commit() require an OS-level flush of all file buffers and, as such, are expensive. My recommendation is to use a large constant value, 100 MB or so.
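One way to do the periodic commit is from a timer (a sketch; remember that Commit() is an exclusive operation, so expect readers and writers to briefly stall):

```csharp
using System;
using System.Threading;
using CSharpTest.Net.Collections;

var tree = new BPlusTree<string, Guid>(options);

// Flush committed state every 5 minutes from a timer thread; the interval
// is generous because each Commit() forces an OS-level flush of file buffers.
var commitTimer = new Timer(
    _ => tree.Commit(),
    null,
    TimeSpan.FromMinutes(5),  // first commit after 5 minutes
    TimeSpan.FromMinutes(5)); // then every 5 minutes
```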

Transaction Log Recovery

Transaction logs will, by default, be auto-recovered the next time the BPlusTree is opened after a crash. This comes at the cost of performing the operation as an ‘exclusive’ task, which is generally what you are going to want. There are times, though, when controlling this recovery can be useful. To this end there is a new option, OptionsV2.ExistingLogAction, which can be used to control the activity performed when opening a BPlusTree that has existing log data. For instance, you can choose the “Truncate” option to discard all changes in the transaction log.

Upgrading Existing Data

Upgrading from one BPlusTree to another is easily accomplished via the BulkInsert operation. The following example leverages the fact that the existing tree will enumerate in key order:

using Btree = CSharpTest.Net.Collections.BPlusTree<int, string>;
static void UpgradeDatabase(Btree.Options source, Btree.OptionsV2 target)
{
    source = source.Clone();
    source.ReadOnly = true;
    using(Btree input = new Btree(source))
    using (Btree output = new Btree(target))
    {
        output.BulkInsert(input,
                          new BulkInsertOptions
                              {
                                  ReplaceContents = true,
                                  InputIsSorted = true,
                                  CommitOnCompletion = true,
                                  DuplicateHandling = DuplicateHandling.RaisesException,
                              }
        );
    }
}
 

Continued from: BPlusTree 2.0 Changes Part 3 – AddRange and BulkInsert

So all the previous updates took place back in April of this year. As I started to prepare to release a new drop of the BPlusTree I started having thoughts about an alternative storage format. So the update got shelved and I started working on revisiting the file format.

You might be asking yourself, “Why rewrite it? It’s stable and reasonably fast, what’s the problem?”. The problem, in a single word, is “durability”. My initial use-case for this library was primarily that of a ‘cache’: I can throw stuff in it and get it back, and if it dies I’ll rebuild it. Although I’ve never experienced a loss of data or corruption, I know the v1 file format is definitely capable of corrupting its file. So I began reading about durability in file stores… a lot.

File Storage Durability
Finally I came to realize that there are a few key concepts used to provide durability in a file store. The first possibility is the “don’t change data” approach, where you write data to a file and never modify it. MongoDB uses this concept, always appending to files; it is also the approach chosen by STSDB and RaptorDB. It has a drawback: either you never reclaim space from deleted items, or you have to write a ‘compact’ process that rewrites the database. This is often combined with another concept, storing data twice. By writing a single piece of data once, flushing the file, and then writing it again, you can guarantee that one of the two locations being written will be readable. The next common practice for durable stores is to use a transaction log to record the changes being made and replay them on failure. The benefit here is that the transaction log data can be thrown away after the file is committed or closed.

Warning, the following ramblings may cause drowsiness.

Initially I started down the path of just introducing a transaction log. This proved inadequate: to replay the transaction log I need to be able to open the file, and if the file format allows corruption (as v1 did) you find yourself dead in the water. So I needed a format that guarantees I get a valid file when I open it; then I can replay the log and all will be well. So began the search for a new format.

After toying with a couple of possibilities I chose a format that uses numeric ‘handles’ in the file to address one or more contiguous blocks of available space. This is close to the original concept, except that handles no longer require contiguous space. Mapping these ‘handles’ to file offsets means one of two things: either we introduce a new file for the map, or we store the map in the file somehow. More importantly, we need two copies of this map in order to be certain we do not corrupt state.

So I figured if the file block size is 4096 bytes, I can store 1024 4-byte int values in a single block. If I assume the entire file is broken into 4096-byte blocks, then a signed int can address over 8 terabytes. That certainly seemed to fit the bill, so the questions were where to store the second copy, and how to chain these together in the file. I chose to introduce the concept of a “segment” in the file: the segment size is the block size divided by 4, multiplied by the block size. This means that one block holds a 4-byte handle slot for every block in a segment. By reserving the first and last blocks in the segment I have two places to store the handle map. Since the first and last blocks are reserved, this also allows me to place a 32-bit CRC of the handle map at both the beginning and the end of each map block.
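The segment arithmetic above can be checked directly. A small sketch (function and variable names are mine, not the library’s):

```python
def segment_geometry(block_size: int):
    """Arithmetic for the segment layout described above: one 4-byte
    handle slot per block, with the first and last blocks of each
    segment reserved for the two copies of the handle map."""
    handles_per_segment = block_size // 4              # slots in one map block
    segment_bytes = handles_per_segment * block_size   # (block/4) * block
    usable_blocks = handles_per_segment - 2            # slots 0 and last reserved
    return handles_per_segment, segment_bytes, usable_blocks

handles, seg_bytes, usable = segment_geometry(4096)
assert handles == 1024                  # 1024 handles per segment
assert seg_bytes == 4 * 1024 * 1024     # each segment spans 4 MB of file
assert usable == 1022                   # blocks actually available for data
```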

So essentially I have 1024 handles in the first segment, with handles 0 and 1023 reserved. Each handle can then be read to find the block number its data is stored in. To extend the file I just add a new segment with the exact same format. To commit the file in a reliable way, each segment writes its first map of handles and we tell Windows to flush it to disk. Then we write all the second maps of handles and flush again. Sounds like a lot of work, I know, and it was.

This format gives the BPlusTree the ability to open a file and write data to it while keeping all the handle map blocks in cache. By using this approach we keep the file in its current state even though we are adding data to it. This allows the file to be reopened and read without issue, as its state has not changed. During the commit phase, if someone tries to read it (or the power goes out), they may read a partially modified map block. This is easily detected by the preceding and following CRC values, and all that is needed to recover is to read the secondary map block for that segment. To ensure a consistent state, we either load all first maps or all second maps; failing that, we fall back to the first or second per segment. The last case is only needed for multi-threaded ‘commit-on-write’, where each block modified will also flush both of its map tables for the segment containing the handle.
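The detect-and-fall-back logic for a torn map block might look roughly like this. Again a sketch of the concept under my own assumptions (a CRC bracketing the map at both ends of the block), not the library’s implementation:

```python
import struct
import zlib

BLOCK = 4096
MAP_BYTES = BLOCK - 8   # map body, bracketed by a leading and trailing CRC

def write_map_block(entries: bytes) -> bytes:
    # The same 32-bit CRC is placed at the beginning and end of the block;
    # a torn write can only match both by also matching the body checksum.
    assert len(entries) == MAP_BYTES
    crc = struct.pack("<I", zlib.crc32(entries))
    return crc + entries + crc

def read_map(primary: bytes, secondary: bytes) -> bytes:
    """Prefer the primary copy; fall back to the secondary when the
    bracketing CRCs reveal a partially written primary."""
    for block in (primary, secondary):
        head, body, tail = block[:4], block[4:-4], block[-4:]
        if head == tail and struct.unpack("<I", head)[0] == zlib.crc32(body):
            return body
    raise IOError("both map copies corrupt")

entries = b"\xab" * MAP_BYTES
good = write_map_block(entries)
torn = good[:100] + b"\xff" + good[101:]   # simulate a partial overwrite
assert read_map(torn, good) == entries
```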

I’m not sure if anything I just said makes sense to anyone else, but if you get into the code hopefully it will be easier to understand.

Beyond this there were a few other hurdles to overcome. The first is when a handle has more than one block’s worth of data written to it. One possibility was to chain handles together; however, that proved inefficient, and I wanted to keep reads and writes in contiguous space within the file. To accomplish this I had to reserve the high 4 bits of a handle’s int address (or block offset). These four bits store a value of 0–14, the number of contiguous blocks. Only when 15 or more blocks are written together do I need to actually read the offset the handle points to in order to determine its full length. This format then allows up to ((blocksize / 4) - 2) contiguous blocks to be allocated, so for a block size of 4096 we have 1022 × 4096 maximum contiguous bytes. At around 4mb for all the values in a node, this should suffice for most users. If you are writing large values (over 1mb) to the BPlusTree you could hit this limit. You will either need to increase the block size (8kb = 16mb limit, 16kb = 64mb limit), or you can always use the old file format ;)
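Packing a contiguous-block count into the high 4 bits of a 32-bit handle looks something like the following. The bit layout matches the description above (0–14 stored inline, 15 as an “extended” marker), but the constants and names are illustrative, not taken from the source:

```python
COUNT_SHIFT = 28
COUNT_MASK = 0xF << COUNT_SHIFT
OFFSET_MASK = (1 << COUNT_SHIFT) - 1   # low 28 bits hold the block offset
EXTENDED = 0xF   # 15 means "read the real length from the data itself"

def pack_handle(block_offset: int, contiguous_blocks: int) -> int:
    count = min(contiguous_blocks, EXTENDED)
    return (count << COUNT_SHIFT) | (block_offset & OFFSET_MASK)

def unpack_handle(handle: int):
    """Return (block_offset, count), where count is None when the
    extended marker says the length must be read from the block."""
    count = (handle & COUNT_MASK) >> COUNT_SHIFT
    offset = handle & OFFSET_MASK
    return offset, (None if count == EXTENDED else count)

assert unpack_handle(pack_handle(12345, 3)) == (12345, 3)
assert unpack_handle(pack_handle(12345, 200)) == (12345, None)
```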

The other side effect of reserving these 4 bits is a reduced address space. Instead of being able to address 2.1 billion blocks of 4kb, I can now only address 268 million. So for a 4kb block size the maximum data file is 1 terabyte. I think for the BPlusTree, 1 terabyte of data in a single file is a reasonable limit.

I considered converting the store to use an 8-byte long instead of an int. This would let me reserve more bits for the number of contiguous blocks, as well as increase the number of addressable blocks. The downside is that fewer 8-byte values fit in a 4kb block, and thus the number of blocks in a segment is halved. When the segment size is reduced, so is the maximum number of contiguous blocks. This means that for a 4kb block the largest node would need to be no larger than 2mb. This could be mitigated by reserving 2 blocks on each side of the segment instead of 1. In the end I decided the limits fit all my needs and were reasonable to impose on the BPlusTree.

Next we are going to cover example code and summarize the performance and durability of the new storage system…

Continue reading on BPlusTree 2.0 Changes Part 5 – Using the v2 Storage Format.

 