Looking through the analysis for each table, I noticed an easy way to reduce the key size for all the tables which include a document id (that's all of them except the spelling and synonym tables, the latter of which wasn't present in the analysed database).
Currently the document id is encoded in keys using the function pack_uint_preserving_sort(), which stores an initial byte saying how long the encoding is, followed by the value in big-endian representation with leading zero bytes dropped. So values encode like this:
Range of values | Encoded form |
---|---|
0 | 0x00000000 -> 0x00 |
0x01-0xff | 0x000000YZ -> 0x01 0xYZ |
0x0100-0xffff | 0x0000WXYZ -> 0x02 0xWX 0xYZ |
0x010000-0xffffff | 0x00UVWXYZ -> 0x03 0xUV 0xWX 0xYZ |
0x01000000-0xffffffff | 0xSTUVWXYZ -> 0x04 0xST 0xUV 0xWX 0xYZ |
The above is for values up to 32 bits, but the encoding works for larger types (up to 2048 bits!).
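For concreteness, here's a minimal sketch of that scheme for 32-bit values (not Xapian's actual implementation, which also handles wider types):

```cpp
#include <string>
#include <cstdint>

// Sketch of the current scheme: one length byte, then the value in
// big-endian order with leading zero bytes dropped.  Lexicographic order
// of the encoded strings matches numeric order of the values.
std::string pack_uint_preserving_sort_old(uint32_t value) {
    std::string buf;
    while (value) {
        buf.insert(buf.begin(), char(value & 0xff));  // prepend low byte
        value >>= 8;
    }
    buf.insert(buf.begin(), char(buf.size()));        // length byte first
    return buf;
}
// e.g. 0 -> "\x00", 0x1234 -> "\x02\x12\x34", 0xffffffff -> "\x04\xff\xff\xff\xff"
```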
But this encoding is quite wasteful - currently Xapian supports 32-bit document ids (which allows 4 billion documents in a database), so we're using a whole byte to store one of 5 possible values (and in fact, since document id 0 isn't valid in Xapian, one of 4 possible values).
We could put the length in the most significant bits of the length byte and use the remaining bits to store the most significant bits of the value being encoded. Saving less than a byte per encoded value may not sound very worthwhile, but when there are more than 10 billion keys, each bit saved equates to more than 1GB.
For illustrative purposes, if we were to reserve 4 bits for the length then the new encoding would look like this:
Range of values | Encoded form |
---|---|
0x00-0x0f | 0x0000000Z -> 0x0Z |
0x10-0x0fff | 0x00000XYZ -> 0x1X 0xYZ |
0x1000-0x0fffff | 0x000VWXYZ -> 0x2V 0xWX 0xYZ |
0x100000-0x0fffffff | 0x0TUVWXYZ -> 0x3T 0xUV 0xWX 0xYZ |
0x10000000-0xffffffff | 0xSTUVWXYZ -> 0x40 0xST 0xUV 0xWX 0xYZ |
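In code, this illustrative 4-bit-length variant might look something like the following (a hypothetical sketch, just to show the idea of folding the top bits of the value into the length byte):

```cpp
#include <string>
#include <cstdint>

// Illustrative variant: the top 4 bits of the first byte hold the number
// of *additional* bytes, and the low 4 bits hold the top bits of the value.
std::string pack_uint_preserving_sort_4bit(uint32_t value) {
    std::string tail;
    while (value > 0x0f) {                    // peel off whole trailing bytes
        tail.insert(tail.begin(), char(value & 0xff));
        value >>= 8;
    }
    std::string buf;
    buf += char((tail.size() << 4) | value);  // length nibble + top value bits
    return buf + tail;
}
// e.g. 0x0c -> "\x0c", 0x0abc -> "\x1a\xbc", 0xffffffff -> "\x40\xff\xff\xff\xff"
```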
For now, a 32-bit docid limit isn't likely to be an issue (especially since it is easy to increase to 64-bit for those who need it). If we always use at least the 2-byte encoding, even for values which could be encoded in 1 byte, we can store the length in 2 bits and use the other 6 bits for the most significant bits of the value:
Range of values | Old encoded length (bytes) | New encoded length (bytes) |
---|---|---|
0 | 1 | 2 |
0x01-0xff | 2 | 2 |
0x0100-0x3fff | 3 | 2 |
0x4000-0xffff | 3 | 3 |
0x010000-0x3fffff | 4 | 3 |
0x400000-0xffffff | 4 | 4 |
0x01000000-0x3fffffff | 5 | 4 |
0x40000000-0xffffffff | 5 | 5 |
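A sketch of this 2-bit-length form, again just for 32-bit values, and assuming the length field counts the additional bytes beyond the mandatory first extra byte:

```cpp
#include <string>
#include <cstdint>

// Sketch of the proposed scheme: the top 2 bits of the first byte hold
// (number of additional bytes - 1), the low 6 bits hold the top bits of
// the value, and there is always at least one additional byte.
std::string pack_uint_preserving_sort_new(uint32_t value) {
    std::string tail;
    tail.insert(tail.begin(), char(value & 0xff));     // always >= 1 extra byte
    value >>= 8;
    while (value > 0x3f) {
        tail.insert(tail.begin(), char(value & 0xff));
        value >>= 8;
    }
    std::string buf;
    buf += char(((tail.size() - 1) << 6) | value);     // 2 length bits + 6 value bits
    return buf + tail;
}
// e.g. 0x2a -> "\x00\x2a", 0x1234 -> "\x12\x34", 0xffffffff -> "\xc0\xff\xff\xff\xff"
```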
So for our database of 70,978,511 documents, we can expect to save 0.822 bytes per docid encoded in a key. There are 10,531,122,912 such keys, so that makes 7.95GB.
This figure is somewhat imprecise as the postlist table encodes the docids which start each chunk (other than the first), but it seems reasonable to assume they are fairly evenly spread. And we'll probably save a bit more as some keys are repeated in branch blocks, and also there will be fewer blocks, so we'll save on the per-block overhead.
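As a rough cross-check of the 0.822 figure, here's a small sketch which assumes the docids are simply 1 to 70,978,511 and that each is equally likely to appear in a key:

```cpp
#include <cstdint>
#include <cstdio>

// Encoded length under the current scheme: 1 length byte + value bytes.
static int old_len(uint32_t d) {
    int n = 1;
    while (d) { ++n; d >>= 8; }
    return n;
}

// Encoded length under the proposed scheme: a first byte holding 2 length
// bits and 6 value bits, plus at least one additional byte.
static int new_len(uint32_t d) {
    int n = 2;
    d >>= 8;
    while (d > 0x3f) { ++n; d >>= 8; }
    return n;
}

int main() {
    const uint32_t N = 70978511;           // documents in the analysed database
    uint64_t old_total = 0, new_total = 0;
    for (uint32_t d = 1; d <= N; ++d) {
        old_total += old_len(d);
        new_total += new_len(d);
    }
    printf("average saving per docid: %.3f bytes\n",
           double(old_total - new_total) / N);   // prints ~0.822
    return 0;
}
```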
As well as this saving, we can also save on how we encode terms into keys in the postlist table. This uses the pack_string_preserving_sort() function, which appends the string with any zero bytes changed to "\x00\xff", followed by two zero bytes. In normal text-based retrieval, terms don't contain zero bytes, so this adds a two byte overhead.
However, the second appended zero byte is wasteful - all we actually require is that this byte isn't \xff (so that it sorts before a longer term with a zero byte at that position). And in the postlist table, the encoded term is always followed by either nothing, or by the encoded document id, which can't start with \xff with either the current or the new pack_uint_preserving_sort() encoding.
And in the case of the first postlist chunk for a term where nothing follows the encoded term, we don't need the first appended zero byte either - we just need to transform any zero bytes in the key so that the first chunk's key sorts in the right place.
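To make the three cases concrete, here's a sketch using a hypothetical helper (not Xapian's actual pack_string_preserving_sort()) showing how the term part of a postlist key could be packed under these observations:

```cpp
#include <string>
#include <cstddef>

// Zero bytes in the term are escaped as "\x00\xff" as now; the difference
// is how many terminating zero bytes follow the escaped term.
std::string pack_term_key(const std::string& term, std::size_t terminator_zeros) {
    std::string key;
    for (char ch : term) {
        key += ch;
        if (ch == '\0') key += '\xff';   // escape embedded zero bytes
    }
    key.append(terminator_zeros, '\0');
    return key;
}

// Current encoding:            pack_term_key(term, 2) [+ encoded docid]
// Proposed, non-initial chunk: pack_term_key(term, 1) + encoded docid
//   (one zero byte suffices, as the docid encoding never starts with 0xff)
// Proposed, initial chunk:     pack_term_key(term, 0)
//   (nothing follows the term, so no terminator is needed at all)
```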
So we save 2 bytes per initial postlist chunk, of which we have 142,942,820 (the number of distinct terms in this database, as reported by delve -v). A quick count shows 566,064,831 total entries in the postlist table, of which 101,794 aren't postlist chunks (they're document length chunks and the metainfo chunk), so there are 423,020,217 non-initial postlist chunks for which we save a byte. That makes a total of 708,905,857 bytes, which is 0.66GB, so the total expected saving is around 8.61GB.
I spent most of yesterday implementing this, and knocked together a quick conversion utility which is currently running on the gmane database. I'll report the actual savings once it finishes.
Posted in xapian by Olly Betts on 2009-12-16 17:16