I finally made time to port the last project out of my old Google Code repository. Here are the links for the new packages and repositories:

BPlusTree (and supporting parts of Library) -> CSharpTest.Net.Collections
https://github.com/csharptest/CSharpTest.Net.Collections
https://www.nuget.org/packages/CSharpTest.Net.Collections

Library\Commands -> CSharpTest.Net.Commands
https://github.com/csharptest/CSharpTest.Net.Commands
https://www.nuget.org/packages/CSharpTest.Net.Commands

RpcLibrary -> CSharpTest.Net.RpcLibrary
https://github.com/csharptest/CSharpTest.Net.RpcLibrary
https://www.nuget.org/packages/CSharpTest.Net.RpcLibrary

Library\Tools\* -> CSharpTest.Net.Tools
https://github.com/csharptest/CSharpTest.Net.Tools
https://www.nuget.org/packages/CSharpTest.Net.Tools

The former CSharpTest.Net.BPlusTree package has been deprecated. It depended upon CSharpTest.Net.Library, and I wanted the package to stand alone. To avoid unneeded headaches I've chosen to rename it rather than cause confusion with the previous version(s). The new Collections package contains all the BPlusTree code plus the parts of the Library it depended on. Hopefully this stand-alone version will be much easier to use.

For all the packages, my goal was to not change code, add features, or fix bugs during the move. This is 99% the case: a few changes were made to reduce dependencies, but the overall code remains basically the same.

 

In my previous post "Why GUID primary keys are a database's worst nightmare" I spoke about some of the downsides to using GUID primary keys. In this post I'm going to focus more specifically on the .NET Framework's implementation of System.Guid. I have a few complaints about this class, so let's cover them…

1. Guid.ToByteArray()

This, IMHO, is one of the worst things about System.Guid. First, this method returns the byte array in little-endian format. Not only is this the opposite of the standard GUID format, it also means the bytes do not compare in the same order as the structured value. Thus when you combine System.Guid with sequential guid generation you wind up with either a structure that doesn't compare sequentially or a byte array that doesn't compare sequentially. Yes, you can always swap bytes (1,4), (2,3), (5,6), and (7,8), but really, should I have to?
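For illustration, here is a minimal sketch (the helper name is mine) of the swap described above, reordering the output of ToByteArray() into the big-endian layout used by the standard GUID string format:

// Reorders Guid.ToByteArray() output into big-endian (string/RFC) byte order by
// reversing the first three fields; the trailing 8 bytes are already in order.
static byte[] ToBigEndianBytes(Guid guid)
{
    byte[] b = guid.ToByteArray();
    return new byte[]
    {
        b[3], b[2], b[1], b[0],  // 4-byte field, byte-swapped
        b[5], b[4],              // 2-byte field, byte-swapped
        b[7], b[6],              // 2-byte field, byte-swapped
        b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]
    };
}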

The second issue I have with this method is that it forces me to allocate the bytes on the heap. That may not sound like a big deal, but if you are moving large volumes of data and serializing those Guids the only way you can (as byte[]), it will thrash the GC. I'm hopeful that some future version of .NET will give us something like ToByteArray(GuidFormatType, byte[], offset), where GuidFormatType would allow either little-endian or big-endian output.

2. Guid.GetHashCode()

This one puzzles me. They chose to completely ignore bytes 9, 10, 12, 13, 14, and 15 when computing a Guid's hash code. With random Guids this may not matter much, but when you start using sequential values it begins to hurt hash distribution. If, as in the example from my previous post, bytes 1-8 are a date-time, then your entire hash value amounts to the entropy introduced in bytes 11 and 16. To be specific, byte 11 is xor'd into the most significant byte of the hash and byte 16 into the least significant byte.
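To make that concrete, here is a sketch that reproduces the behavior described above (a reconstruction from the description, not the framework source):

// Combines only bytes 1-8, 11 and 16 of the guid, mirroring the description
// above: byte 11 lands in the most significant byte, byte 16 in the least.
static int DescribedGuidHash(Guid guid)
{
    byte[] b = guid.ToByteArray();                   // field-ordered bytes
    int a  = BitConverter.ToInt32(b, 0);             // bytes 1-4
    int bc = (BitConverter.ToInt16(b, 4) << 16)      // bytes 5-6
           | (ushort)BitConverter.ToInt16(b, 6);     // bytes 7-8
    return a ^ bc ^ ((b[10] << 24) | b[15]);         // bytes 11 and 16
}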

3. Guid.NewSequentialGuid()

Don't you wish .NET would just do this for us? We've known this to be a problem, and have known of a resolution, since 2002. They've even added "NEWSEQUENTIALID()" to SQL Server. Wouldn't you think they could get this done for us? I guess until they fix the endianness of ToByteArray it's a moot point anyway.

Of course one option is to just PInvoke UuidCreateSequential.

using System;
using System.Runtime.InteropServices;

static class Win32
{
    // The native sequential UUID generator; returns an RPC status code (0 == RPC_S_OK).
    [DllImport("rpcrt4.dll", SetLastError = true)]
    public static extern int UuidCreateSequential(out Guid guid);
}
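A hedged usage sketch (the wrapper is mine): the native call returns an RPC status code, where 0 is RPC_S_OK and 1824 (RPC_S_UUID_LOCAL_ONLY) indicates the value is only guaranteed unique to this machine but is still usable:

// Wraps the P/Invoke above and surfaces failures; tolerates RPC_S_UUID_LOCAL_ONLY.
static Guid NewSequentialGuid()
{
    Guid guid;
    int status = Win32.UuidCreateSequential(out guid);
    if (status != 0 /* RPC_S_OK */ && status != 1824 /* RPC_S_UUID_LOCAL_ONLY */)
        throw new InvalidOperationException("UuidCreateSequential failed: " + status);
    return guid;
}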

4. Guid.(Missing Properties)

Another annoyance is that the Guid's underlying values are not exposed. Would it really have been that hard to expose the same integer, two shorts, and 8 bytes that we can pass to the constructor? One could then write an efficient binary serializer for the Guid type without a GC allocation and without worrying about byte-swapping little-endian values. In addition, the underlying randomization could be reused without going through ToByteArray. Maybe this isn't a big deal for most people, but I'm still miffed that (AFAIK) there is not a single thread-safe means of creating a random value without allocating a byte[].
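One possible workaround, sketched below, is an explicit-layout overlay. It assumes Guid's in-memory layout matches its constructor order (an int, two shorts, and 8 bytes), which is an implementation detail rather than a documented guarantee:

using System;
using System.Runtime.InteropServices;

// Overlays a Guid with its constituent fields so they can be read without an
// allocation; relies on the (undocumented) assumption that the struct layout
// mirrors the Guid(int, short, short, byte...) constructor.
[StructLayout(LayoutKind.Explicit)]
struct GuidParts
{
    [FieldOffset(0)]  public Guid Value;
    [FieldOffset(0)]  public int A;
    [FieldOffset(4)]  public short B;
    [FieldOffset(6)]  public short C;
    [FieldOffset(8)]  public byte D;
    [FieldOffset(9)]  public byte E;
    [FieldOffset(10)] public byte F;
    [FieldOffset(11)] public byte G;
    [FieldOffset(12)] public byte H;
    [FieldOffset(13)] public byte I;
    [FieldOffset(14)] public byte J;
    [FieldOffset(15)] public byte K;
}

Assigning a Guid to Value then exposes the same fields the constructor accepts, with no byte[] allocation and no byte swapping.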

 

When you ask most people why a GUID column used as a primary key in a database might be slower than an auto-incremented number, the answer you're likely to get is usually along these lines:

“Guids are 16 bytes, and integers are only 4 bytes.”

While this information is technically accurate, it is completely the wrong answer to the question. The performance delta between a 16-byte key and a 4-byte key is most likely negligible and almost certainly impossible to measure accurately. To understand what is wrong with a GUID key, you first need to know a little about how a database stores its data.

The most common storage structure used by databases (to the best of my knowledge) is the B+Tree. B+Trees offer great performance even for very large sets of data. They do, however, suffer from one problem; let's call it 'key density'.

By key density I'm referring to the density, or close proximity, of the keys being accessed at any given moment as compared to the universe of all keys in storage. For example, say I have 10 million records, each keyed by a unique integer numbered from 1 to 10,000,000. The keys {44, 46, 52} have a high density, whereas the keys {100, 10553, 733555} have a very low density.

Cache is everything. When a database needs to read or write a record, it traverses nodes in the tree, reading them into memory from disk. Disk IO is the single biggest time expense a database has, so to reduce the number of reads, the nodes visited while fetching or writing a record are usually cached. This allows more efficient retrieval of the record the next time it is requested. As more low-density keys are accessed, more and more unique nodes are fetched from disk into the cache. Yet every cache has its limits, bound primarily by the available hardware and often by software configuration. Once the available cache space has been used, stale/old nodes are removed from the cache to make room for newer ones.

Now imagine how this applies to an auto-numbered field. To insert the next new row I travel down the right edge (highest key values) of the tree. Once at the leaf, insert and done (oversimplified). Since each write uses the next key after the last one used, it is practically guaranteed to find the nodes it needs in cache. A few quick memory look-ups later, the record can be inserted without reading from disk. It should now be obvious why Guids will have problems…

Creating a new Guid is essentially taking a value uniformly at random from the entire range of possible Guid values. That means that, unlike our auto-numbered field with its high key density, our Guid keys are designed to be sparse, to have a low key density. Because of this, the odds of two consecutively generated Guid values landing in the same tree node are statistically worse than winning the lottery. Using a Guid as a primary key practically obliterates the database's ability to find nodes in the cache based on previous queries. By the time the table has passed about 10 million rows, your performance will fall drastically.

Want more proof? Read the Guid article on Wikipedia, specifically the section "Sequential algorithms", which discusses this very topic. It also discusses a solution, first introduced by Jimmy Nilsson in 2002, called a COMB Guid (Combined Guid) because it combines the random value with a timestamp.

So are Guids evil in a database? No. It's the algorithm that generates the Guid that is the issue, not the Guid itself. As mentioned in the articles linked above, there are other algorithms, and numerous variations are documented online. I've implemented one that out-performs the native Guid.NewGuid() and will release it in the next few weeks. Until then, the following is fairly similar to the algorithm I'm using…


using System;

static class SequentialGuidGenerator
{
    // Monotonic counter seeded from the clock; its low 16 bits break ties between
    // guids created within the same tick window.
    static int _sequenced = (int)DateTime.UtcNow.Ticks;

    private static readonly System.Security.Cryptography.RNGCryptoServiceProvider _random =
        new System.Security.Cryptography.RNGCryptoServiceProvider();
    private static readonly byte[] _buffer = new byte[6];

    public static Guid NewGuid()
    {
        long ticks = DateTime.UtcNow.Ticks;
        int sequenceNum = System.Threading.Interlocked.Increment(ref _sequenced);
        lock (_buffer)
        {
            _random.GetBytes(_buffer);
            // The leading fields carry the timestamp, followed by the sequence
            // counter and six cryptographically random bytes.
            return new Guid(
                (int) (ticks >> 32), (short) (ticks >> 16), (short) ticks,
                (byte) (sequenceNum >> 8), (byte) sequenceNum,
                _buffer[0], _buffer[1], _buffer[2], _buffer[3], _buffer[4], _buffer[5]
                );
        }
    }
}

The problem with this code is that not all databases compare Guids the same way or store them in the same format. So be cautious about using this approach until you understand how your database stores and compares the Guid type.

Lessons Learned and Applied Concept

The interesting and rather undocumented aspect of this issue is that it applies just as well to all types of composite keys. Let's take an example: we have a simple logging table that NLog writes to in a database. The rows are identified by a Guid, but we almost never query these records by Id. Most of the time we are looking within a range of dates for a specific event. How would you model the primary key and indexes? Most people want as small a primary key as possible, so the natural assumption is to use the ID of the record. As we've already covered, that is generally a bad idea simply because we are using Guids, and even more so because our queries will be time-based. By promoting the timestamp into the primary key we not only gain better query performance, we also remove the problem of the GUID identifier. Let's see the example:

TABLE LogFile {
    Column DateTime timestamp,
    Column GUID id,
    Column int eventType,
    Column string message,
    Primary Key ( timestamp, id )
}

With the timestamp as the primary key's first comparison we will always be writing to the same area within the table and will consistently hit the cache for writes. When seeking for data, the timestamp groups all the needed records together, so the data we are after is stored as densely as possible, requiring the fewest possible reads.

Now let's drift this example a little into a new scenario. Say the log above is being written from an ASP.NET application in which all users are authenticated. We want to add this user identity to the LogFile data being written, and we want to constrain queries to be associated with a user. In the example above it may be safe to simply modify the Primary Key ( timestamp, id ) to include the user as the first field. So will Primary Key ( userId, timestamp, id ) perform well for a single user? The answer is yes and no; it really depends on the application. Introducing userId as the leading key field means that we could be writing to as many places in the file as we have users. If the application is a mobile app polling our web server every minute, then we've just scattered our writes across all N thousand users in our system. Yet if our application requires a human being to use a web browser or mobile app, the number of active write points in the file drops considerably… well, at least until you're Facebook, and at that point you cash out and go home :)

One last example. Say you are building SkyDrive, or GDrive, or whatever, and you want to store the following entities: Customer, Drive, Folder, File. Each entity is identified by a GUID. What does the File's primary key look like? I'd probably key the File by (CustomerID, DriveID, FolderID, Timestamp, FileID). You would obviously need an ancillary index on FileID in order to access the file directly. Food for thought anyway…

The reason I bring all this up is that there is no rule of thumb for a good primary key. Each application has different needs and different requirements. What you should take away is that the density of keys being written to the first field of your primary key will ultimately dictate your write throughput and likely your scalability. Be conscious of this and choose keys wisely. The Guid ID of a record is not always the best answer for a primary key.

As a side note, using Guids as a primary key in the B+Tree will work, just much more slowly at large volumes (2+ million rows). A sequential guid generator like the one above, or an ordered natural key (like a qualified file name), will serve you much better. Ordered (or near-ordered) keys provide linear scalability, whereas uniformly random GUIDs will start to suffer once you've exceeded the cache space.

 


 

Continued from: BPlusTree 2.0 Changes Part 4 – Rewriting the Storage Format

Getting Started

To use the new file format in the BPlusTree you will need to use the OptionsV2 class. The previous Options class is still there and continues to use the same file format.

// Create the OptionsV2 class to specify construction arguments:
var options = new BPlusTree<string, Guid>.OptionsV2(
    PrimitiveSerializer.String, PrimitiveSerializer.Guid, StringComparer.Ordinal)
    {
        CreateFile = CreatePolicy.IfNeeded,
        FileName = @"C:\Temp\Storage.dat",
        StoragePerformance = StoragePerformance.Default,
        TransactionLogLimit = 100 * 1024 * 1024, // 100 MB
    }
    .CalculateBTreeOrder(10, 16);

// Create the BPlusTree using these options
using(var map = new BPlusTree<string, Guid>(options))
{
    map.Add("foo", Guid.NewGuid());
    Assert.IsTrue(map.ContainsKey("foo"), "Failed to find foo");
}

Choosing a StoragePerformance Option

I've attempted to simplify the multitude of options available into this single enumeration. The default (LogFileInCache) should be acceptable for most uses, but read the following carefully to understand what you are getting. The StoragePerformance enumeration has the following values and implications:

StoragePerformance.Fastest
This option is safe from corruption and provides the fastest possible throughput. It is essentially the same behavior as LogFileInCache, but without the ability to recover any data written since the last successful call to Commit() or Dispose(). For some uses, for example a temporary index of data that will not fit entirely in memory, this is going to be the best option. The performance can vary, but you should be able to see 100~200k writes/second.

StoragePerformance.LogFileInCache (Default)
This is the default and the fastest durable option. The transaction log is written to the system cache, so in the event of a power failure or OS crash some records may be lost. This option should not corrupt state, but recovery from a crash may take some time. The performance, even on slow disks, is quite good; you should expect anywhere from 30~60k writes/second or more depending on the size of the data.

StoragePerformance.CommitToCache
This is the least durable option and the only one that may result in data file corruption. Corruption may occur in the event of a power failure or OS crash with a partially flushed system cache; provided the OS cache is flushed entirely or not at all, the file will not be corrupted. The performance of this option will usually reach around 6~8k writes/second. It can be a viable option if you do not want a transaction log but need every write to survive a process crash.

StoragePerformance.LogFileNoCache
This option is intended to give you an ACID BPlusTree at the least cost in performance. The downside is that should the process crash or the power fail, the file will need to be recovered.
See the discussion below on using transaction log files. Generally this option performs at about 2~4k writes/second.

StoragePerformance.CommitToDisk
Using this option should give you a fully ACID-compliant BPlusTree. To be completely ACID (as with any database) you also need to disable cache and write-behind features on the hardware. Performance will suffer, as with any real database. Depending on the speed of the disk and the size of the data you can expect 1~2k writes/second or less. A short configuration sketch follows.
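As a concrete illustration, a fully durable tree could be configured like this; the sketch reuses only the members shown in the earlier example, and the file name is hypothetical:

// Fully ACID configuration: every commit is flushed through to disk
// (assuming hardware write caching is also disabled, as noted above).
var durable = new BPlusTree<string, Guid>.OptionsV2(
    PrimitiveSerializer.String, PrimitiveSerializer.Guid, StringComparer.Ordinal)
    {
        CreateFile = CreatePolicy.IfNeeded,
        FileName = @"C:\Temp\Durable.dat",
        StoragePerformance = StoragePerformance.CommitToDisk,
    }
    .CalculateBTreeOrder(10, 16);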

Using Transaction Logs

Recovering the file from the log can be expensive depending on the volume of data.
The LogFileNoCache option opens the transaction log with FileOptions.WriteThrough so that every write goes through the system cache to disk. The data file itself remains in its current state until a call to Commit() or Dispose(). Calling Rollback() will, if using transaction logs, revert the storage file to its last committed state. Commit and Rollback are exclusive operations on the entire tree, so all readers and writers must exit before they can proceed.

For long-running processes (i.e., services) you may want to provide a thread that performs periodic calls to Commit() when the tree is idle. If you do not have a way to detect idle conditions, or don't want to bother, set OptionsV2.TransactionLogLimit to the maximum allowable log size. Keep in mind that calls to Commit() require an OS-level flush of all file buffers and, as such, are expensive. My recommendation is to use a large constant value, 100MB or so.
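For example, a minimal sketch of committing on a fixed interval rather than detecting idle time (the timer wiring is mine; the five-minute interval is arbitrary):

// Flush the transaction log every five minutes from a background timer.
// Commit() is exclusive on the tree, so writers will briefly block while it runs.
using (var map = new BPlusTree<string, Guid>(options))
using (var flushTimer = new System.Threading.Timer(
    _ => map.Commit(), null, TimeSpan.FromMinutes(5), TimeSpan.FromMinutes(5)))
{
    // ... normal reads and writes against `map` happen here ...
}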

Transaction Log Recovery

Transaction logs will, by default, be auto-recovered the next time the BPlusTree is opened after a crash. This comes at the cost of performing the recovery as an 'exclusive' task, which is generally what you are going to want. There are times, though, when controlling this recovery can be useful. To this end there is a new option, OptionsV2.ExistingLogAction, which controls the action performed when opening a BPlusTree that has existing log data. For instance, you can choose the "Truncate" option to discard all changes in the transaction log.
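Something along these lines should discard the log on open (a sketch; I am assuming the enumeration carries the same name as the option):

// Discard any changes found in an existing transaction log instead of replaying them.
options.ExistingLogAction = ExistingLogAction.Truncate;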

Upgrading Existing Data

Upgrading from one BPlusTree format to another is easily accomplished via the BulkInsert operation. The following example leverages the fact that the existing tree enumerates in key order:

using Btree = CSharpTest.Net.Collections.BPlusTree<int, string>;
static void UpgradeDatabase(Btree.Options source, Btree.OptionsV2 target)
{
    source = source.Clone();
    source.ReadOnly = true;
    using(Btree input = new Btree(source))
    using (Btree output = new Btree(target))
    {
        output.BulkInsert(input,
                          new BulkInsertOptions
                              {
                                  ReplaceContents = true,
                                  InputIsSorted = true,
                                  CommitOnCompletion = true,
                                  DuplicateHandling = DuplicateHandling.RaisesException,
                              }
        );
    }
}
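A hypothetical call site might look like the following; it assumes the V1 Options constructor accepts the same serializer arguments as the OptionsV2 constructor shown earlier, and the file names are placeholders:

// Upgrade an existing v1 file into a new v2 file using the method above.
var source = new Btree.Options(PrimitiveSerializer.Int32, PrimitiveSerializer.String)
{
    CreateFile = CreatePolicy.Never,      // the old file must already exist
    FileName = @"C:\Temp\Storage.v1.dat",
};
var target = new Btree.OptionsV2(PrimitiveSerializer.Int32, PrimitiveSerializer.String)
{
    CreateFile = CreatePolicy.Always,     // always start the new file fresh
    FileName = @"C:\Temp\Storage.v2.dat",
};
UpgradeDatabase(source, target);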
 

Continued from: BPlusTree 2.0 Changes Part 3 – AddRange and BulkInsert. All the previous updates took place back in April of this year. As I started preparing to release a new drop of the BPlusTree, I began having thoughts about an alternative storage format. So the update got shelved and I started working [...]

 
BPlusTree 2.0 Changes Part 3 - AddRange and BulkInsert

Continued from: Part 2 – Enumerate Ranges and First/Last. This is really the major piece of work in this release. There are a number of reasons these methods were added, but foremost among them is performance when merging two sets of data. Before we go further let's look at a typical [...]

 

Continued from: Part 1 – Interface Changes. Enumerating Ranges: I've been contacted several times about adding the ability to enumerate ranges from the b+tree. It's actually a very trivial thing to implement, and I've handed out the code several times. With this release the following range enumeration methods will be available: EnumerateFrom(TKey start) – Enumerate [...]

 

After much deliberation on the subject, I decided to introduce breaking changes to the interface. I know, I know, I don't do this lightly; in fact I think it's the third-worst thing I could possibly do (it follows only adding bugs and making semantic changes). So here are my reasons why I thought this was necessary: [...]

 

I've recently seen some comments about various aspects of btree/b+tree structures that are troublesome. They are, IMHO, just plain wrong. Let's look at a few of these and see if we can shed some light on b+trees and their implementations. 1. Items in a tree node must be presorted. FALSE. This is simply not true, [...]