I’ve recently seen some comments about various aspects of btree/b+tree structures that are troublesome. They are, IMHO, just plain wrong. Let’s look at a few of these and see if we can shed some light on b+trees and their implementations.
1. Items in a tree node must be presorted.
FALSE This is simply not true, a btree can decide to sort the items in a node, or to perform a linear search. Neither approach change more than the small constants in the big-O, it’s still O(n log(n)). The BPlusTree keeps it’s children sorted since the memory-move cost is nothing compared to the disk IO.
2. Split/Join of a node requires reorganizing it’s ancestors
FALSE While this is the most common approach it is also the most naive. Reorganization of the tree from the bottom-up is not only expensive but more complicated than the top-down approach. The BPlusTree uses a top-down reorg that says simply “If I’m inserting and this node is full, split” all the way down the tree. This means that at any given time I only need two nodes ‘locked’ for write, me and my parent. Once I’m sure that this node is not full I can release my parent’s lock and move down the tree. Not only does this strategy allow multiple concurrent writers, but it also prevents deadlocks. By using this top-down approach locks are always acquired at the root and traverse down so that a deadlock is simply impossible.
3. Nodes can have at most order-1 items
FALSE This is also a naive implementation detail where leaf nodes (or all nodes) have a vacant slot. It’s most commonly used when using a bottom-up reorganization so that you always have room to insert a split node. It is simply not necessary with either bottom-up or top-down reorganization, it’s just easier. BPlusTree uses all the space available in a node, only when inserting and already full will it split the node.
4. Nodes can have no fewer than order/2 items
FALSE The minimum items in a node is 1. When you calculate the lookup time with a minimum of 1 (let b = 1) you get O(logbn) or O(n). So you never want a minimum value of 1, but you can do it. Based on my own testing the best value for the minimum node count in BPlusTree is 1/3 of the order (or max).
5. The order of a btree is the same at all levels
FALSE The BPlusTree supports having a different order for hierarchy nodes than that of it’s leaf nodes. The reason is simple. Disk-based B+Trees optimize disk access based on what will fit in a ‘page’. Using 8kb as a page size, at the leaf node this is 8192 / (sizeof(key) + sizeof(value)). At all other locations in the hierarchy this is 8192 / (sizeof(key) + c) where c is some constant size of data used to reference a child node. This means that if I am storing 1kb records with a 4-byte key I want no more than 8 records in a leaf, whereas I can have as many as 680 children in a non-leaf node. Now we can let b = 680 for O(logbn), meaning we can find the leaf node of a 1,000,000 item tree in approximately 3 operations.