Looking through the analysis for each table, I noticed an easy way to reduce the key size for all the tables which include a document id (that's all of them except the spelling and synonym tables, the latter of which wasn't present in the analysed database).
Currently the document id is encoded in keys using the function pack_uint_preserving_sort(), which stores an initial byte saying how long the encoding is, followed by the value in big-endian representation with leading zero bytes dropped. So values encode like this:
Range of values | Encoded form |
---|---|
0 | 0x00000000 -> 0x00 |
0x01-0xff | 0x000000YZ -> 0x01 0xYZ |
0x0100-0xffff | 0x0000WXYZ -> 0x02 0xWX 0xYZ |
0x010000-0xffffff | 0x00UVWXYZ -> 0x03 0xUV 0xWX 0xYZ |
0x01000000-0xffffffff | 0xSTUVWXYZ -> 0x04 0xST 0xUV 0xWX 0xYZ |
The above is for values up to 32 bits, but the encoding works for larger types (up to 2048 bits!).
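For concreteness, here's a minimal sketch of that scheme for 32-bit values (not Xapian's actual implementation, which also handles wider types):

```cpp
#include <string>
#include <cstdint>

// Sketch of the current scheme: one length byte, then the value in
// big-endian order with leading zero bytes dropped.  Lexicographic order
// of the encoded strings matches numeric order of the values.
std::string pack_uint_preserving_sort_old(uint32_t value) {
    std::string buf;
    while (value) {
        buf.insert(buf.begin(), char(value & 0xff));  // prepend low byte
        value >>= 8;
    }
    buf.insert(buf.begin(), char(buf.size()));        // length byte first
    return buf;
}
// e.g. 0 -> "\x00", 0x1234 -> "\x02\x12\x34", 0xffffffff -> "\x04\xff\xff\xff\xff"
```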
But this encoding is quite wasteful - currently Xapian supports 32-bit document ids (which allows 4 billion documents in a database), so we're using a whole byte to store one of 5 possible values (and in fact, since document id 0 isn't valid in Xapian, one of 4 possible values).
We could put the length in the most significant bits of the length byte and use the remaining bits to store the most significant bits of the value being encoded. Saving less than a byte per encoded value may not sound very worthwhile, but when there are more than 10 billion keys, each bit saved equates to more than 1GB.
For illustrative purposes, if we were to reserve 4 bits for the length then the new encoding would look like this:
Range of values | Encoded form |
---|---|
0x00-0x0f | 0x0000000Z -> 0x0Z |
0x10-0x0fff | 0x00000XYZ -> 0x1X 0xYZ |
0x1000-0x0fffff | 0x000VWXYZ -> 0x2V 0xWX 0xYZ |
0x100000-0x0fffffff | 0x0TUVWXYZ -> 0x3T 0xUV 0xWX 0xYZ |
0x10000000-0xffffffff | 0xSTUVWXYZ -> 0x40 0xST 0xUV 0xWX 0xYZ |
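In code, this illustrative 4-bit-length variant might look something like the following (a hypothetical sketch, just to show the idea of folding the top bits of the value into the length byte):

```cpp
#include <string>
#include <cstdint>

// Illustrative variant: the top 4 bits of the first byte hold the number
// of *additional* bytes, and the low 4 bits hold the top bits of the value.
std::string pack_uint_preserving_sort_4bit(uint32_t value) {
    std::string tail;
    while (value > 0x0f) {                    // peel off whole trailing bytes
        tail.insert(tail.begin(), char(value & 0xff));
        value >>= 8;
    }
    std::string buf;
    buf += char((tail.size() << 4) | value);  // length nibble + top value bits
    return buf + tail;
}
// e.g. 0x0c -> "\x0c", 0x0abc -> "\x1a\xbc", 0xffffffff -> "\x40\xff\xff\xff\xff"
```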
For now, a 32-bit docid limit isn't likely to be an issue (especially since it is easy to increase to 64-bit for those who need it). If we always use at least the 2-byte encoding, even for values which could be encoded in 1 byte, we can store the length in 2 bits and use the other 6 bits for the most significant bits of the value:
Range of values | Old encoded length (bytes) | New encoded length (bytes) |
---|---|---|
0 | 1 | 2 |
0x01-0xff | 2 | 2 |
0x0100-0x3fff | 3 | 2 |
0x4000-0xffff | 3 | 3 |
0x010000-0x3fffff | 4 | 3 |
0x400000-0xffffff | 4 | 4 |
0x01000000-0x3fffffff | 5 | 4 |
0x40000000-0xffffffff | 5 | 5 |
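A sketch of this 2-bit-length form, again just for 32-bit values, and assuming the length field counts the additional bytes beyond the mandatory first extra byte:

```cpp
#include <string>
#include <cstdint>

// Sketch of the proposed scheme: the top 2 bits of the first byte hold
// (number of additional bytes - 1), the low 6 bits hold the top bits of
// the value, and there is always at least one additional byte.
std::string pack_uint_preserving_sort_new(uint32_t value) {
    std::string tail;
    tail.insert(tail.begin(), char(value & 0xff));     // always >= 1 extra byte
    value >>= 8;
    while (value > 0x3f) {
        tail.insert(tail.begin(), char(value & 0xff));
        value >>= 8;
    }
    std::string buf;
    buf += char(((tail.size() - 1) << 6) | value);     // 2 length bits + 6 value bits
    return buf + tail;
}
// e.g. 0x2a -> "\x00\x2a", 0x1234 -> "\x12\x34", 0xffffffff -> "\xc0\xff\xff\xff\xff"
```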
So for our database of 70,978,511 documents, we can expect to save 0.822 bytes per docid encoded in a key. There are 10,531,122,912 such keys, so that makes 7.95GB.
This figure is somewhat imprecise as the postlist table encodes the docids which start each chunk (other than the first), but it seems reasonable to assume they are fairly evenly spread. And we'll probably save a bit more as some keys are repeated in branch blocks, and also there will be fewer blocks, so we'll save on the per-block overhead.
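As a rough cross-check of the 0.822 figure, here's a small sketch which assumes the docids are simply 1 to 70,978,511 and that each is equally likely to appear in a key:

```cpp
#include <cstdint>
#include <cstdio>

// Encoded length under the current scheme: 1 length byte + value bytes.
static int old_len(uint32_t d) {
    int n = 1;
    while (d) { ++n; d >>= 8; }
    return n;
}

// Encoded length under the proposed scheme: a first byte holding 2 length
// bits and 6 value bits, plus at least one additional byte.
static int new_len(uint32_t d) {
    int n = 2;
    d >>= 8;
    while (d > 0x3f) { ++n; d >>= 8; }
    return n;
}

int main() {
    const uint32_t N = 70978511;           // documents in the analysed database
    uint64_t old_total = 0, new_total = 0;
    for (uint32_t d = 1; d <= N; ++d) {
        old_total += old_len(d);
        new_total += new_len(d);
    }
    printf("average saving per docid: %.3f bytes\n",
           double(old_total - new_total) / N);   // prints ~0.822
    return 0;
}
```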
As well as this saving, we can also save on how we encode terms into keys in the postlist table. This uses the pack_string_preserving_sort() function, which appends the string with any zero bytes changed to "\x00\xff", followed by two zero bytes. In normal text-based retrieval, terms don't contain zero bytes, so this adds a two byte overhead.
However, the second appended zero byte is wasteful - all we actually require is that this byte isn't \xff (so that it sorts before a longer term with a zero byte at that position). And in the postlist table, the encoded term is always followed by either nothing, or by the encoded document id, which can't start with \xff with either the current or the new pack_uint_preserving_sort() encoding.
And in the case of the first postlist chunk for a term where nothing follows the encoded term, we don't need the first appended zero byte either - we just need to transform any zero bytes in the key so that the first chunk's key sorts in the right place.
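To make the three cases concrete, here's a sketch using a hypothetical helper (not Xapian's actual pack_string_preserving_sort()) showing how the term part of a postlist key could be packed under these observations:

```cpp
#include <string>
#include <cstddef>

// Zero bytes in the term are escaped as "\x00\xff" as now; the difference
// is how many terminating zero bytes follow the escaped term.
std::string pack_term_key(const std::string& term, std::size_t terminator_zeros) {
    std::string key;
    for (char ch : term) {
        key += ch;
        if (ch == '\0') key += '\xff';   // escape embedded zero bytes
    }
    key.append(terminator_zeros, '\0');
    return key;
}

// Current encoding:            pack_term_key(term, 2) [+ encoded docid]
// Proposed, non-initial chunk: pack_term_key(term, 1) + encoded docid
//   (one zero byte suffices, as the docid encoding never starts with 0xff)
// Proposed, initial chunk:     pack_term_key(term, 0)
//   (nothing follows the term, so no terminator is needed at all)
```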
So we save 2 bytes per initial postlist chunk, of which we have 142,942,820 (the number of distinct terms in this database, as reported by delve -v). A quick count shows 566,064,831 total entries in the postlist table, of which 101,794 aren't postlist chunks (they're document length chunks and the metainfo chunk), so there are 423,020,217 non-initial postlist chunks for which we save a byte. That makes a total of 708,905,857 bytes, which is 0.66GB, so the total expected saving is around 8.61GB.
I spent most of yesterday implementing this, and knocked together a quick conversion utility which is currently running on the gmane database. I'll report the actual savings once it finishes.
Posted in xapian by Olly Betts on 2009-12-16 17:16