Building a database: Large Nodes
Our BufferManager differs from a traditional buffer manager in that it allows variable-sized pages, through the notion of page classes. The goal of variable-sized pages is to let us avoid overflow pages.
In a traditional system, if you need to insert data that is larger than the fixed page size, the layer above the buffer manager (e.g. the b+tree) will allocate additional pages and link them to the initial page. The data is then chopped up and bleeds into the additional pages. However, these pages are not usually guaranteed to be contiguous. This means that upper layers need to be aware that their data may be spread over various chunks of memory.
This makes using a traditional buffer manager more challenging.
So we have a buffer manager that can dole out different page sizes, and our btree nodes can be up to 2^32 bytes (which also acts as our upper limit on value size). Now we need to determine when to use these larger pages.
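As a concrete illustration of page classes, here's a small sketch. The names and the 4 KiB base size are hypothetical, not our actual BufferManager's API, but the power-of-two tiering idea is the same:

```rust
// Hypothetical sketch: page classes as power-of-two size tiers.
// Class 0 is the base page size; each class doubles the previous,
// up to the 2^32-byte ceiling mentioned above.
const BASE_PAGE_SIZE: usize = 4096; // assumption: 4 KiB base pages

/// Smallest page class whose capacity fits `bytes`.
fn page_class_for(bytes: usize) -> u32 {
    let mut class = 0;
    while (BASE_PAGE_SIZE << class) < bytes {
        class += 1;
    }
    class
}

fn main() {
    assert_eq!(page_class_for(100), 0); // fits in a base page
    assert_eq!(page_class_for(4096), 0); // exactly one base page
    assert_eq!(page_class_for(4097), 1); // spills into an 8 KiB page
    assert_eq!(page_class_for(1 << 20), 8); // 1 MiB = 4096 * 2^8
    println!("ok");
}
```

The allocator then only ever hands out one of these discrete sizes, which keeps free-space management tractable compared to arbitrary-length allocations.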
When we’re just inserting smaller key-value pairs, there would be no reason to utilize a large page. In these cases we simply perform splits and merges as normal.
However, what happens if we need to insert a large value? We will reach a point at which we attempt to insert into the node, and the node will say "hey, I do not have space for that", which is the signal to the btree that it should perform a node split.
However, if the value is large, we can do a few things:
1. Resize the node (allocate a new large node and replace the smaller one)
2. Split, but make the right node a large node
3. Split and resize the left node
Resizing the current node is obviously simple, but we don't want our strategy to devolve into "when out of space, make it bigger!". We need to balance these options. We can decide based on the number of keys: if the key count is low (fewer than 6, say), resize. Otherwise, choose between options 2 and 3 based on how much data would be moved around.
Note that this is only done on leaf nodes.
Rough sketch:

```rust
if node.key_count() < 6 {
    // Out of space with few keys: the values must be large,
    // so upsize the node in place.
} else {
    // Estimate how many bytes land on each side of the split.
    let (left, mut right) = node.capacity_for_split(pivot);
    let insert_left = node.is_insert_before(pivot, key);
    if insert_left
        && left + insert_data_len > node.total_capacity_with_buffer()
    {
        // The insert lands left and won't fit in the current
        // node's capacity: resize the (left) node.
    }
    if !insert_left {
        // The insert lands right: account for it when sizing
        // the new node.
        right += insert_data_len;
    }
    let right: Swip<Node> =
        self.buffer_manager.alloc_page_with_capacity(right);
}
```
Why might this be a reasonable strategy? My theory is that the larger the values, the fewer keys will be on a page when it needs to resize. Therefore, if we're out of space with fewer than 6 keys on the page, values average at least page_size / 6 bytes, so resizing the page means it will still hold few keys. We don't want pages with a huge number of keys, and this likely avoids that (though not in all cases).
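To put numbers on that claim, here's a quick sanity check of the arithmetic, assuming a hypothetical 4 KiB page and ignoring header overhead:

```rust
// Quick sanity check of the heuristic's arithmetic, assuming a
// hypothetical 4 KiB page and ignoring header overhead.

/// If a full page holds `key_count` entries and one more insert
/// won't fit, each entry averages at least this many bytes
/// (key_count resident entries plus 1 incoming).
fn min_avg_entry(page_size: usize, key_count: usize) -> usize {
    page_size / (key_count + 1)
}

fn main() {
    let page_size = 4096;
    // Below the "< 6 keys" resize threshold: 5 resident + 1 incoming.
    let floor = min_avg_entry(page_size, 5);
    assert_eq!(floor, 682); // ~page_size / 6, as claimed above

    // Even after doubling the page, entries of that size cap the
    // key count at ~12 -- resizing keeps the key count small.
    assert_eq!((page_size * 2) / floor, 12);
    println!("ok");
}
```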
This feels like a decent heuristic, versus something more complex like having the btree track statistics on stored values and use them to determine probabilities that drive better split/resize decisions.
When we split, we need to decide what size the right (new) node should be and whether we need to change the size of the current (left) node. The simplest model would be to use the current node's size for both left and right. While this works well when values are roughly the same size, it is less ideal for pages that hold a single huge value among otherwise small ones (which would naturally belong on a smaller page). To this end, we attempt to downsize the left and right nodes to fit the anticipated data.
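A sketch of that downsizing step, reusing the same hypothetical power-of-two page sizes as earlier (the real split path is more involved, and the 25% headroom figure is an assumption):

```rust
// Hypothetical sketch: pick the smallest power-of-two page size
// that holds the bytes expected on one side of a split, plus some
// headroom so the node isn't immediately full again.
const BASE_PAGE_SIZE: usize = 4096; // assumption: 4 KiB base pages

/// Smallest power-of-two capacity that fits `bytes` plus 25%
/// headroom, never smaller than the base page size.
fn split_node_capacity(bytes: usize) -> usize {
    let needed = bytes + bytes / 4; // 25% headroom (an assumption)
    let mut cap = BASE_PAGE_SIZE;
    while cap < needed {
        cap *= 2;
    }
    cap
}

fn main() {
    // Left side keeps mostly small values: fits back on a base page.
    assert_eq!(split_node_capacity(2000), 4096);
    // Right side receives one huge value: gets a larger page class.
    assert_eq!(split_node_capacity(20_000), 32_768);
    println!("ok");
}
```

The point of sizing each side independently is that a split triggered by one oversized value shouldn't force both resulting nodes onto large pages.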
✨