In my previous post “Why GUID primary keys are a database’s worst nightmare” I spoke about some of downsides to using GUID primary keys. In this post I’m going to focus more specifically on the .NET Framework’s implementation of System.Guid. I have a few complaints about this class, let’s cover them…

1. Guid.ToByteArray()

This, IMHO, is one of the worst possible things about a Guid. First this method returns a byte array in little-endian format. Not only is this directly opposite the stated standard GUID format, but it also means the bytes do not compare the same as the structured value. Thus when you combine System.Guid with a sequential guid generation you wind up with either a structure that doesn’t compare sequentially, or a byte array that doesn’t compare sequentially. Yes you can always swap bytes (1,4), (2,3), (5,6), and (7,8), but really? should I have to?

The second issue I have with this method is that it absolutely requires me to allocate the bytes on the heap. While this may not sound like such a big deal, if you are moving large volumes of data and serializing those Guids the only way you can (as byte[]) it will thrash the GC. I’m hopeful that in some future version of .NET we will find a Guid with a ToByteArray(GuidFormatType, byte[], offset) where GuidFormatType would allow little-endian or big-endian formats.

2. Guid.GetHashCode()

This one does puzzle me. They choose to completely ignore bytes 9, 10, 12, 13, 14, and 15 when computing a Guid’s hash code. Ok so maybe with random Guids this really doesn’t matter, but when you start trying to use Guid with sequential values this starts to impact hash performance. If, as in my example on the previous post, bytes 1-8 are date-time, then your entire hash value amounts to the entropy introduced in bytes 11 and 16. To be specific, byte 11 xors with the MSB and byte 16 xors with the LSB.

3. Guid.NewSequentialGuid()

Don’t you wish that .NET could just do this for us? We’ve know this to be a problem, and have know of a resolution to it, since 2002. They’ve even added “NEWSEQUENTIALGUID()” to Sql Server. Wouldn’t you think they could get this done for us? I guess until they fix the endianness of ToByteArray it’s really a moot point anyway.

Of course one option is to just PInvoke UuidCreateSequential.

static class Win32
{
    [DllImport("rpcrt4.dll", SetLastError=true)]
    public static extern int UuidCreateSequential(out Guid guid);
}

4. Guid.(Missing Properties)

Another annoyance is the lack of exposing the values of the Guid. Would it have really been that hard to expose the same integer, two shorts, and 8 bytes that we can pass to the constructor? One could then create an efficient binary serializer for the Guid type without a GC allocation and without worrying about byte-swapping little-endian values. In addition, the underlying randomization could then be reused without using the ToByteArray method. Maybe this isn’t a big deal for most people, but I’m still miffed that (AFAIK) there is not a single thread-safe means of creating a random value without allocating a byte[].

 

When you ask most people why using a GUID column for a primary key in a database might be slower than auto-incremented number the answer your likely to get is usually along these lines:

“Guids are 16 bytes, and integers are only 4 bytes.”

While this information is technically accurate, it is completely the wrong answer to the question. The performance delta from a 16 byte key to a 4 byte key is most likely negligible and almost certainly impossible to accurately measure. To understand what is wrong with a guid, you first need to know a little about how a database stores it’s data.

The most common storage used by databases (to the best of my knowledge) is a B+Tree. They offer great performance even for very large sets of data. They do however suffer from one problem, let’s call this ‘key density‘.

By key density I’m referring to the density, or close proximity, of the keys being accessed at any given moment as it compares the universe of all the keys in storage. For example let’s say I have 10 million records, each keyed by a unique integer numbered from 1 to 10,000,000. I can say that keys {44,46,52} have a high-density, whereas keys {100,10553,733555} have a very low-density.

Cache is everything. When a database needs to read or write a record, they traverse nodes in the tree and read them into memory from disk. Disk IO is the single biggest time expense a database has. So to reduce the number of reads, nodes visited while fetching or writing a record are usually cached. This allows more efficient retrieval of the record next time it is requested. As more low-density keys are accessed, more and more unique nodes are fetched from disk into memory and cache. Yet every cache has its limits, bound primarily by the hardware available and often by software configuration. Once the available cache space has been used, stale/old nodes are removed from cache to make room for newer ones.

So now let us imagine how this applies when using a auto-numbered field. To insert the next new row I’ll need to travel down the right-edge (highest key values) of the tree. Once at the leaf, insert and done (oversimplified). Since the next write uses the next possible key from the last one used, it is practically guaranteed to find the nodes it needs in cache. So a few quick memory look-ups later the record can be inserted without reading from disk. It should now be obvious why Guids will have problems…

Creating a new Guid is essentially done by taking a value ‘uniformly at random’ from the entire range of possible Guid values. That means that unlike our auto-number field with a high key-density, our Guid keys are designed to be sparse, or to have a low key-density. Because of this for two Guid values to be stored in the same tree node is likely going to be statistically more improbable than winning the lottery. Using a Guid as a primary key practically obliterates the ability for a database to reliably find nodes in the cache based upon previous queries. By the time the database has passed about 10 million rows your performance will fall drastically.

Want more proof? Read the Guid article on Wikipedia under the section “Sequential algorithms” it discusses this very topic. It also discusses solutions to this, as first introduced by Jimmy Nilsson in 2002, called a COMB Guid or Combined Guid for the combination of a timestamp.

So are Guid’s Evil in a Database? No. It’s the algorithm that generates the Guid that is the issue, not the Guid itself. As mentioned in the articles linked above, there are other algorithms. There are numerous possibilities documented online. I’ve implemented one of these that out-performs the native Guid.NewGuid() implementation that will be released in the next few weeks. Until then, this is fairly similar to the algorithm I’m using…


static class SequentialGuidGenerator
{
    static int _sequenced = (int)DateTime.UtcNow.Ticks;

    private static readonly System.Security.Cryptography.RNGCryptoServiceProvider _random =
        new System.Security.Cryptography.RNGCryptoServiceProvider();
    private static readonly byte[] _buffer = new byte[6];

    public static Guid NewGuid()
    {
        long ticks = DateTime.UtcNow.Ticks;
        int sequenceNum = System.Threading.Interlocked.Increment(ref _sequenced);
        lock (_buffer)
        {
            _random.GetBytes(_buffer);
            return new Guid(
                (int) (ticks >> 32), (short) (ticks >> 16), (short) ticks,
                (byte) (sequenceNum >> 8), (byte) sequenceNum,
                _buffer[0], _buffer[1], _buffer[2], _buffer[3], _buffer[4], _buffer[5]
                );
        }
    }
}

The problem with this code is that not all databases compare Guids the same way, or in the same format. So you should be cautious about using this approach until you understand your database and how it stores and compares a Guid type.

Lessons Learned and Applied Concept

The interesting and rather undocumented aspect about this issue is that it applies just as well across all types of composite keys. Let’s take an example, we have a simple logging structure we are writing to with NLog in a database. The rows are identified by a Guid, but we almost never query these records by Id. Most of the time when we query this data we are looking within a range of dates for a specific event. How would you model the primary key and indexes? Well most people want to use as small of a primary key as possible and so the natural assumption is to use the ID of the record. This, as we’ve already covered, is generally a bad idea just because we are using Guids, but even more because our queries will be time based. By promoting the timestamp into the primary key we not only gain better query performance, but we also remove the problem of the GUID identifier. Let’s see the example:

TABLE LogFile {
    Column DateTime timestamp,
    Column GUID id,
    Column int eventType,
    Column string message,
    Primary Key ( timestamp, id )
}

With the primary key using the timestamp as it’s first comparison we will always be writing to the same area within the table and will consistently hit the cache for writes. When seeking for data the timestamp will group all the records needed together so that the data we are after is stored as dense as is possible requiring the fewest possible reads.

Now let’s drift this example a little to a new scenario. Let’s say the log above is being written from an ASP.NET application in which all users are authenticated. We want to add this user identity to the LogFile data being written, and we want to constrain queries to be associated with a user. In the example above it may be safe to simply modify the Primary Key ( timestamp, id ), to include the user as first key. Thus the Primary Key ( userId, timestamp, id ) will now perform well for a single user right? Well the answer Yes and No. It really depends greatly on the application. Introducing userId as the primary key means that we could be writing to as many places in the file as we have users. If the application is a mobile app polling our web server every minute, then we’ve just scattered our writes across all N thousands of users in our system. Yet if our application requires a human being to use a web browser or mobile app, the number of active writing points in the file drops considerably… Well at least until your Facebook and at that point you cash out and go home :)

One last example. Given you are building SkyDrive, or GDrive, or whatever you want to store the following entities: Customer, Drive, Folder, File. Each entity is identified by a GUID. What does the File’s primary key look like? Well I’d probably key the File by (CustomerID, DriveID, FolderID, Timestamp, FileID). You would obviously need an ancillary index by FileID in order to access the file directly. Food for thought anyway…

The reason I bring all this up is that there is no rule of thumb for a good primary key. Each application will have different needs and different requirements. Want you should take away is that the density of keys being written to in that first field of your primary key will ultimately dictate your write throughput and likely your scalability. Be conscious of this, and choose keys wisely. The Guid ID of a record is not always the best answer for a primary key.

As a side-note, using Guid’s as a primary key in the B+Tree will work but much more slowly at large volumes (2+ million). Using a sequential guid generator like the one above, or using an ordered natural key (like a qualified file name) will serve you much better. Ordered (or near-ordered) keys will provide linear scalability whereas unique random GUIDs will start to suffer once you’ve exceeded the cache space.

 

Sometimes it is necessary to generate a password. This might be done to create a secure account on a machine, or to reset a user’s password via email (although a one-time use security token is a much better answer). Another possible use is to generate passwords for your own use on a website. There are lots of ways to achieve this, but the method below would be my approach…

/// <summary>
/// Creates a pseudo-random password containing the number of character classes
/// defined by complexity, where 2 = alpha, 3 = alpha+num, 4 = alpha+num+special.
/// </summary>
public static string GeneratePassword(int length, int complexity)
{
    System.Security.Cryptography.RNGCryptoServiceProvider csp =
        new System.Security.Cryptography.RNGCryptoServiceProvider();
    // Define the possible character classes where complexity defines the number
    // of classes to include in the final output.
    char[][] classes =
    {
        @"abcdefghijklmnopqrstuvwxyz".ToCharArray(),
        @"ABCDEFGHIJKLMNOPQRSTUVWXYZ".ToCharArray(),
        @"0123456789".ToCharArray(),
        @" !""#$%&'()*+,./:;<>?@[\]^_{|}~".ToCharArray(),
    };

    complexity = Math.Max(1, Math.Min(classes.Length, complexity));
    if(length < complexity)
        throw new ArgumentOutOfRangeException("length");

    // Since we are taking a random number 0-255 and modulo that by the number of
    // characters, characters that appear earilier in this array will recieve a
    // heavier weight. To counter this we will then reorder the array randomly.
    // This should prevent any specific character class from recieving a priority
    // based on it's order.
    char[] allchars = classes.Take(complexity).SelectMany(c => c).ToArray();
    byte[] bytes = new byte[allchars.Length];
    csp.GetBytes(bytes);
    for (int i = 0; i < allchars.Length; i++)
    {
        char tmp = allchars[i];
        allchars[i] = allchars[bytes[i]%allchars.Length];
        allchars[bytes[i]%allchars.Length] = tmp;
    }

    // Create the random values to select the characters
    Array.Resize(ref bytes, length);
    char[] result = new char[length];

    while(true)
    {
        csp.GetBytes(bytes);
        // Obtain the character of the class for each random byte
        for (int i = 0; i < length; i++)
            result[i] = allchars[bytes[i]%allchars.Length];

        // Verify that it does not start or end with whitespace
        if (Char.IsWhiteSpace(result[0]) || Char.IsWhiteSpace(result[(length - 1) % length]))
            continue;

        string testResult = new string(result);
        // Verify that all character classes are represented
        if (0 != classes.Take(complexity).Count(c => testResult.IndexOfAny(c) < 0))
            continue;

        return testResult;
    }
}

Essentially this method starts by creating a randomly ordered set of characters to choose from. The reason this is randomly ordered is that we will offset into this array by taking a random number from 0-255 modulo the length of the array. Because of the modulo operation this can weight the first set of characters higher so we randomize it.

Once we have a randomly ordered set of characters to choose from we use the RNGCryptoServiceProvider.GetBytes to populate a byte array with random values. Each random byte will be used to select a character in the output array.

Lastly we make a few assertions about the result. Since a space ‘ ‘ can be a viable password character we make sure that the password neither starts with or ends with whitespace. Having whitespace at the beginning or end would not only be confusing, but some input forms will automatically remove the trailing whitespace. The second thing we verify is that each requested character class is represented by at least one character. This ensures that an alpha-numeric password will contain at least one number and one letter.

This brings us to the ‘how long of a password to use’ kind of question. The following is a chart to display how many password characters are required to produce an equivelent key in bit-length:

case insensitive
(26 chars)
case sensitive
(52 chars)
alpha + numeric
(62 chars)
alpha + numeric + special
(92 chars)
*56-bit 12 10 9 7
*64-bit 14 12 11 10
128-bit 28 23 22 20
256-bit 55 45 43 40

*56-bit keys are capable of being brute forced in just a few hours on current specialized hardware. Distributed systems are also capable of cracking 56-bit encryption. 128-bit keys are considered a minimum key space for today’s cryptographic algorithms.

The table is actually somewhat surprising to me in how little difference the added numeric and special characters make. To make a password as secure as the AES/128-bit algorithm you need 23 upper and lower case characters. Saving one keystroke by adding numbers to the mix just doesn’t seem all that worth while to me. When you look at most people using just 8-12 characters for what they consider a ‘good’ password it almost makes you laugh.

It seems to me that most password complexity rules requiring special characters and numbers make little sense. I guess it prevents dictionary attacks to a degree, but clearly the length of the password is much more important.

 

Ok I swear this is the last B+Tree chart, graph, or metric I’ll post… at least for a while. I wanted to hit new limits so I fired up 6 threads on my six-core and let them rip… After three hours I finally stopped the run to look at the results.

Each of the six threads had inserted over 34 million records, a total of over 200 million records inserted in 3 hours averaging over 18k inserts per second. The data file had reached 5.79 GB (6,221,824,032 bytes). I guess that isn’t too bad considering I was using my slow 2tb WD caviar green (I didn’t want to thrash my SSD). Opening the file doesn’t do much other than to read the first block, from a cold start it took 62ms. The first thousand random seeks took an average of 10ms and thereafter settled in at around 9ms. Sequential scans of 1000 keys at a random offset takes about 500ms averaging .5ms each.

The following is a graph of one thread (all 6 near identical) showing the time in seconds (vertical) and total records inserted in millions (horizontal). Obviously the point being made here is that the line is straight :)

Time / Records
B+Tree Insert Progression
 
If you are wondering why these numbers don’t line up with the B+Tree benchmarks previously posted there are two reasons. First I totally screwed up the multi-threaded test for those numbers. The second difference was the amount of data/record size. The above graph shows inserts of an Int32 key and a Guid value or 20 byte records. The benchmark used a Guid key on a 100 byte value or 116 byte records. The major reason for reducing the record size was to improve throughput so I could target a larger number of records.

 

So as I mentioned in the previous post, all this work to build lockless queues is really a waste of time. Why, well to answer that question we need something to compare it against. Thus the class below derives from Queue and provides the locking necessary to make the queue thread-safe (well, not thread safe, only those two methods are safe, but you get the idea). After a few benchmarks I quickly realized that after 10 million queue/dequeue operations (on two threads) the performance delta was around 2 seconds. So our lock + event overhead costs us around 0.0002 ms per queue/dequeue operation. When you compare that 200 nanoseconds with your 100 millisecond network latency to get the request in the first place it would be absurd to use a custom ‘Lockless’ queue. Maybe, JUST maybe, you could find use for something like that if your developing games or something; however, what you wind up with in most cases will be a maintenance nightmare and a debugging hell on earth.

Go ahead now and build yourself a little test harness using this queue and the one from the previous post and see for yourself. The below implementation is superior in many ways, it doesn’t have a polling loop so response to an enqueue are potentially faster, it supports any number of producer/consumers so you can perform operations in parallel, and you basically get this behavior free (or near to).

Now this post is not to put down the effort of the new threading objects in .Net 4.0. This is really more about what you should, and should not attempt to do on your own. I’ll leave it for you to decide if the .Net 4.0 team wasted their time ;)

    class LockingQueue<T> : Queue<T>
    {
        ManualResetEvent _mre = new ManualResetEvent(false);
        public LockingQueue(int size) : base(size)
        { }

        public bool IsEmpty { get { return base.Count == 0; } }
            public new void Enqueue(T obj)
        {
            lock (this)
            {
                base.Enqueue(obj);
                _mre.Set();
            }
        }

        public bool TryDequeue(int timeout, out T value)
        {
            lock (this)
            {
                if (base.Count > 0)
                {
                    value = base.Dequeue();
                    return true;
                }
                _mre.Reset();
            }
            _mre.WaitOne(timeout, false);
            lock (this)
            {
                if (base.Count > 0)
                {
                    value = base.Dequeue();
                    return true;
                }
            }
            value = default(T);
            return false;
        }
    }
 

Most people I’ve seen online compute a simple hash of password + salt for persistence and authentication. This is the accepted standard in a straight-forward solution: byte[] Hash(string password) { byte[] pass = System.Text.Encoding.UTF8.GetBytes(password); //Create the salt to use byte[] salt = new byte[32]; new RNGCryptoServiceProvider().GetBytes(salt); //Create the hash of password and salt HashAlgorithm hashAlgo [...]

 

Well just got the release version of VS2010 installed and all I can say is “OMG”.  Not an OMG as in “OMG that is so cool!”, but more like an “OMG, are you serious?”.   Several things are standing out as wickedly wrong before I even open a solution…   My hardware is no longer sufficient. [...]

 
The exponential cost of doing things "Right"

Sometimes a picture is worth a thousand words… I simply ‘cringe’ every time I hear a developer utter the words “I’m going to do this RIGHT!”. Why do we use the words “Right way” and “Wrong way” when describing things? The fact is that there is a lot more than just one “Wrong way” to [...]

 

There are three virtual methods that IMHO should have never been added to System.Object… • ToString() • GetHashCode() • Equals() All of these could have been implemented as an interface. Had they done so I think we’d be much better off. So why are these a problem? First let’s focus on ToString(): 1. If ToString() [...]

 

The answer is simple… only when the profiler tells you to. Optimizations often make code less reliable and often constrain implementations making them less flexible. Good software performance is not created by making lots of micro optimizations throughout your code. A good design from the architectural point of view is the critical key to success [...]