Xapian's position table holds positional information for terms. That is, given a term and a document id, it can tell you at what positions the given term appears in the given document (assuming you decide to record this information at indexing time). This is used to implement phrase searching and similar features.
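For a concrete picture of what the table answers, here's roughly what a position lookup looks like through Xapian's C++ API (the database path, docid and term below are placeholders):

```cpp
#include <xapian.h>
#include <iostream>
#include <string>

int main() {
    Xapian::Database db("/path/to/db");  // placeholder path
    Xapian::docid did = 1;               // placeholder docid
    std::string term = "example";        // placeholder term
    // Walk the positions at which `term` occurs in document `did`.
    for (Xapian::PositionIterator p = db.positionlist_begin(did, term);
         p != db.positionlist_end(did, term); ++p) {
        std::cout << term << " occurs at position " << *p << '\n';
    }
}
```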
The format chert uses is unchanged from that used by flint, so see the description of the format of flint's position table on the Xapian wiki for details.
For most databases (including gmane's) this is the largest table by quite a margin. It's often as large as the others put together.
Table size: 9,823,202,853 entries in 27,068,479 8KB blocks (206.5GB).
(Yes, nearly 10 billion entries!)
Level | Number of blocks | Comment |
---|---|---|
3 | 1 | The root block |
2 | 142 | |
1 | 61,458 | |
0 | 27,006,878 | Leaf blocks, which hold the tag data |
Key size range: 3 to 70 bytes (mean 10.35).
Frequency range: 3 to 1,610,372,078 occurrences of a particular key size.
The key is built from the document id (encoded in the same way as the keys of the record and termlist tables - a length byte, then the big-endian value with leading zero bytes dropped), with the term name appended. With 71 million documents, most entries will have 1 length byte and 4 docid bytes - the average is actually about 4.76 bytes. The average English word is apparently 5.10 characters, which would give an average key length of 9.86 bytes - pretty close to the observed mean of 10.35.
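As a sketch of that layout (the helper below is my own illustration, not code from chert):

```cpp
#include <cstdint>
#include <string>

// Sketch of the position table key layout described above: a length
// byte, the docid in big-endian order with leading zero bytes dropped,
// then the term name. (Hypothetical helper, not Xapian's actual code.)
std::string pack_position_key(uint32_t docid, const std::string& term) {
    // Big-endian docid with leading zero bytes dropped.
    std::string enc;
    while (docid) {
        enc.insert(enc.begin(), char(docid & 0xff));
        docid >>= 8;
    }
    std::string key;
    key += char(enc.size()); // length byte
    key += enc;              // encoded docid
    key += term;             // term name appended
    return key;
}

// e.g. pack_position_key(71000000, "hello") gives a 1 + 4 + 5 = 10 byte key.
```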
If we used a numeric termid in the key instead of the term name: this database has 142,942,820 distinct terms, so termids would encode as an average of 4.88 bytes using the same length-byte-prefixed sortable encoding, giving an average key length of 9.64 bytes. That's a reduction of 0.71 bytes per entry, or nearly 6.5GB.
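That 4.88 average is easy to verify: termids 1-255 take one value byte, 256-65,535 take two, and so on, with the bulk of the 142,942,820 termids needing four. A quick check (my own throwaway calculation, nothing from Xapian):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t n_terms = 142942820;
    uint64_t total_bytes = 0, done = 0;
    // Termids up to 2^8-1 need 1 value byte, up to 2^16-1 need 2, etc.
    for (unsigned k = 1; done < n_terms; ++k) {
        uint64_t upto = (uint64_t(1) << (8 * k)) - 1;
        if (upto > n_terms) upto = n_terms;
        total_bytes += (upto - done) * (k + 1); // +1 for the length byte
        done = upto;
    }
    printf("%.2f\n", double(total_bytes) / n_terms); // prints 4.88
}
```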
If we also specialise the key comparison function per table, we could knock another whole byte off that and save a total of 15.6GB.
And we don't need a whole byte to store the length of the docid - we can use 3 bits and still support 2^64 documents, while using the other 5 bits to store the highest bits of the docid, which should typically save another 5/8 of a byte per item - that's a total of 21.4GB.
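A sketch of that bit-packing (again my own illustration of the idea, not an existing chert format):

```cpp
#include <cstdint>
#include <string>

// Pack a docid using a 3-bit length field in the first byte; the
// remaining 5 bits of that byte hold the docid's most significant bits.
// With lengths 0-7 this variant covers docids up to 2^61-1; storing
// length-1 instead stretches to the full 2^64. (Illustrative sketch of
// the idea above, not an existing Xapian format.)
std::string pack_docid_3bit(uint64_t docid) {
    // How many whole tail bytes remain after the 5 bits we can squeeze
    // into the first byte?
    unsigned tail = 0;
    while ((docid >> (5 + 8 * tail)) != 0) ++tail;
    std::string out;
    out += char((tail << 5) | (docid >> (8 * tail))); // 3-bit length + top 5 bits
    for (unsigned i = tail; i-- > 0; )
        out += char((docid >> (8 * i)) & 0xff);
    return out;
}

// A docid around 71 million needs 27 bits, so tail = 3 and the encoded
// form is 4 bytes, versus 1 + 4 = 5 bytes with a separate length byte.
```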
Tag size range: 1 to 15,950 bytes (mean 3.11).
Frequency range: 1 to 3,585,463,883 occurrences of a particular tag size.
Total size of all tags is 29,151.6MB (or 28.5GB).
At first glance, the plotted red line may not be obvious - the graph starts at a maximum very close to the Y-axis, decaying quickly to small values running just above the X-axis.
This plot has log(1 + frequency) on the Y axis and shows the distribution more clearly.
The surprising thing here is that most entries are so short - I guess this shows how effective the interpolative coding technique is. The average key is more than 3.3 times the size of the average tag.
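For a flavour of why interpolative coding packs position lists so tightly, here's a toy bit-counter for the simple variant of the scheme (an illustration of the technique, not chert's actual coder):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Bits needed to distinguish `range` possible values: ceil(log2(range)).
static unsigned bits_for(uint64_t range) {
    unsigned b = 0;
    while ((uint64_t(1) << b) < range) ++b;
    return b;
}

// Count the bits binary interpolative coding uses for the sorted,
// distinct positions pos[lo..hi], known to lie in [low, high]: encode
// the middle element in however many bits its possible range needs,
// then recurse on each half with tightened bounds.
static uint64_t interpolative_bits(const std::vector<uint32_t>& pos,
                                   int lo, int hi,
                                   uint32_t low, uint32_t high) {
    if (lo > hi) return 0;
    int mid = (lo + hi) / 2;
    // pos[mid] must lie in [low + (mid - lo), high - (hi - mid)].
    uint32_t lo_bound = low + (mid - lo);
    uint32_t hi_bound = high - (hi - mid);
    uint64_t range = uint64_t(hi_bound) - lo_bound + 1;
    return bits_for(range)
         + interpolative_bits(pos, lo, mid - 1, low, pos[mid] - 1)
         + interpolative_bits(pos, mid + 1, hi, pos[mid] + 1, high);
}

int main() {
    // Positions of one term within one document.
    std::vector<uint32_t> pos = {3, 9, 11, 12, 30};
    uint64_t bits = interpolative_bits(pos, 0, int(pos.size()) - 1, 0, 40);
    std::cout << bits << " bits (" << (bits + 7) / 8 << " bytes)\n";
}
```

Five positions spread over a 40-position document come to 23 bits here - under 3 bytes - which fits the observed mean tag size of 3.11 bytes.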
It might be worth considering storing the positional data for each term as a chunked stream, as we do for postings - we could store a few hundred short entries in a chunk requiring only a single (albeit slightly more complex) key.
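To make that concrete, one hypothetical shape for such a chunk, by analogy with chert's posting chunks (none of these names exist in Xapian; this is a sketch of the suggestion, not a format anything implements):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical chunked layout for positional data. One key would cover
// a few hundred documents' position lists for a single term:
//
//   key: termid + first docid covered by the chunk
//   tag: for each document, a docid delta from the previous entry,
//        then that document's interpolatively-coded position list.
struct PositionChunk {
    uint32_t termid;       // assumes the numeric termid idea from above
    uint32_t first_docid;  // docid of the first entry in the chunk
    struct Entry {
        uint32_t docid_delta;          // gap from the previous docid
        std::string encoded_positions; // interpolative-coded list
    };
    std::vector<Entry> entries;        // a few hundred short entries
};
```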
Item | Total size (bytes) | Per entry (bytes) | Per leaf block (bytes) |
---|---|---|---|
Per entry overhead | 88,408,825,677 | 9.0000 | 3,273.5670 |
61,601 branch blocks | 504,635,392 | 0.0514 | 18.6854 |
Unused space in leaf blocks | 302,262,121 | 0.0308 | 11.1920 |
Leaf block headers | 297,075,658 | 0.0302 | 11.0000 |
Overhead in leaf blocks from split tags | 12,870,336 | 0.0013 | 0.4766 |
There are 721,568 continuation items for the 9,823,202,853 entries, which isn't all that many - avoiding splitting tags would only save 12.2MB.
Again, the per entry overhead is the most substantial of these (and the others are much less than a byte per entry in total).
It seems pretty clear that this table is currently rather inefficient in terms of disk space. The total size of the tags is 28.5GB, but the table is 206.5GB. We do need some overhead for the keys and other table structure which allow us to efficiently find an item, and to add, delete or replace items, but it shouldn't need to be so high.
This table is also almost 50% larger than all the others put together, and has more than 13 times as many entries as all the others put together, so per-block and per-item savings are particularly worthwhile here. Each byte off the per-entry overhead would save 9.1GB. Decreasing the average size of the key and specialising the key comparison function would also make a big difference.
Prefix-compressing keys would help, but it covers some of the same ground as decreasing the key size would.
Storing the positional data as chunked streams is worth considering too.
Posted in xapian by Olly Betts on 2009-12-15 14:41