Continued from: BPlusTree 2.0 Changes Part 3 – AddRange and BulkInsert

So all the previous updates took place back in April of this year. As I started to prepare to release a new drop of the BPlusTree I started having thoughts about an alternative storage format. So the update got shelved and I started working on revisiting the file format.

You might be asking yourself “Why rewrite it? It’s stable and reasonably fast, what’s the problem?”. The problem in a single word is “durability”. My initial use-case for this library was primarily that of a ‘cache’, I can throw stuff in it and get it back and if it dies I’ll rebuild it. Although I’ve never experienced a loss of data or corruption I know the v1 file format is definitely capable of corrupting it’s file. So I began reading about durability in file stores… a lot.

File Storage Durability
Finally I came to realize that there are a few key concepts used to provide durability in a file store. The first possibility is the “don’t change data” approach. This is where you write data to a file and it will never be modified. MongoDB uses this concept always appending to files. This is also the chosen approach by STSDB and RaptorDB. It has a drawback, either you never reclaim space from deleted items, or you have to write a ‘compact’ process that rewrites the database. This is often employed with another concept, storing data twice. By writing a single piece of data once, flushing the file, and writing it again you can guarantee that one of two locations being written to will be readable. The next common practice for durable stores is to use a transaction log to record the changes being made and replay them on failure. The benefit here is that the transaction log data can be thrown away after the file is committed or closed.

Warning, the following ramblings may cause drowsiness.

Initially I started down the path of just introducing a transaction log. This proved to be inadequate since to replay the transaction log I need to be able to open the file, and if the file format allows corruption (like v1 did) you find yourself dead in the water. So I needed a format that allows me to make sure I get a valid file when I open it, then I can replay the log and all will be well. So began the search for a new format.

After toying with a couple of possibilities I chose a format that uses numeric ‘handles’ in the file to address one or more contiguous blocks of available space. This is close to the same concept as the original except it did not have a contiguous space requirement. To map these ‘handles’ to a file offset means one of two things, either we introduce a new file for this map, or we have to store the map in the file somehow. More importantly we need two copies of this map in order to be certain we do not corrupt state.

So I figured if a file block size is 4096, that means I can store 1024 int numbers in it. If I assume the entire file is broken into 4096 byte blocks then a signed int can address over 8 terabytes. That certainly seemed to fit the bill, so the question was where do I store the second copy, and how do I chain these together in the file? I chose to introduce a concept of a “segment” in the file, the segment is the block size divided by 4 and then multiplied by the block size. This means that in one block I have a 4-byte slot for a handle for every block in a segment. By reserving the first and last blocks in the segment I have two places to store the handle map. Since the first and last block are reserved this also allows me to place a 32-bit CRC of the handle map both at the beginning and end of the map block.

So essentially I have 1024 handles in the first segment, 0 and 1023 are reserved. Each handle can then be read to find the block number it’s data is stored in. To extend the file I just add a new segment with the exact same format. To commit the file in a reliable way, each segment writes it’s first map of handles and we tell windows to flush it to disk. Then we write all the second map of handles and again flush. Sounds like a lot of work I know, and it was.

This format gives the BPlusTree the ability to open a file, write data to it but keep all the handle map blocks in cache. By using this approach we keep the file in it’s current state even though we are adding data to it. This allows the file to be reopened and read without issue as it’s state has not changed. During the commit phase if someone tries to read it (or the power goes out) they may read a partially modified map block. This is easily detected by the preceding and following CRC values and all that is needed to recover is to read secondary map block for that segment. To ensure a consistent state, we either load all first maps, or all second maps, failing that we use the first or second. The last case is only needed for multi-threaded ‘commit-on-write’ where each block modified will also flush both it’s map tables for the segment containing the handle.

I’m not sure if anything I just said makes sense to anyone else, but if they get into the code hopefully it will be easier to understand.

Beyond this there were a few other hurtles to overcome. The first is when a handle has more than block size written to it. One possibility was to chain handles together; however, that proved to be inefficient. I wanted to keep the read/write in ‘contiguous space’ within the file. To accomplish this I had to reserve the high 4 bits of a handle’s int address (or block offset). This four bits is 0-14, the number of contiguous blocks. Only when 15 or more blocks are written together do I then need to actually read the offset the handle points to in order to determine it’s full length. This format then allows up to ((blocksize / 4)-2) contiguous blocks to be allocated. So for a block size of 4096 we have 1022 * 4096 maximum contiguous bytes. At around 4mb for all the values in a node, this should suffice for most users. If you are writing large values (over 1mb) to the BPlusTree you could hit this limit. You will either need to increase the block size (8kb=16mb limit, 16kb = 64mb limit), or you can always use the old file format ;)

The other side effect of reserving these 4 bits is the reduced address space. Instead of being able to address 2.1 billion blocks of 4kb, I can now only address 268 million blocks. So for a 4kb block size the maximum data file is 1 terabyte. I think for the BPlusTree, 1 terabyte of data in a single file seems like a reasonable limit.

I considered converting the store to use an 8-byte long instead of just an int. This would mean that I could reserve more bits for the number of concurrent blocks, as well as increase the number of addressable blocks. The downside to this approach is that fewer 8-byte values fit in a 4kb block and thus the number of blocks in a segment is halved. When the segment size is reduced so in the max number of contiguous blocks. This means that for a 4kb block the largest node would need to be no larger than 2mb. This could be mitigated by reserving 2 blocks on each side of the segment instead of 1. In the end I decided the limits fit all my needs and felt they where reasonable to impose on the BPlusTree.

Next we are going cover example code and summarize the performance and durability of the new storage system…

Continue reading on BPlusTree 2.0 Changes Part 5 – Using the v2 Storage Format.

Comments