
The Position Table

Xapian's position table holds positional information for terms. That is, given a term and a document id, it can tell you at what positions the given term appears in the given document (assuming you decide to record this information at indexing time). This is used to implement phrase searching and similar features.

The format chert uses is unchanged from that used by flint, so see the description of the format of flint's position table on the Xapian wiki for details.

For most databases (including gmane's) this is the largest table by quite a margin. It's often as large as the others put together.

Analysis

Table size: 9,823,202,853 entries in 27,068,479 8KB blocks (206.5GB).

(Yes, nearly 10 billion entries!)

Block level breakdown

Level | Number of blocks | Comment
------+------------------+--------------------------------------
    3 |                1 | The root block
    2 |              142 |
    1 |           61,458 |
    0 |       27,006,878 | Leaf blocks, which hold the tag data

Keys

Key size range: 3 to 70 bytes (mean 10.35).

Frequency range: 3 to 1,610,372,078 occurrences of a particular key size.

[Image: position-key.png - distribution of key sizes]

The key is built from the document id (stored in the same form as the keys of the record and termlist tables - a length byte, then the big-endian value with leading zero bytes dropped), with the term name appended. With 71 million documents, most entries will have 1 length byte and 4 docid bytes; the average is actually about 4.76 bytes. The average English word is apparently 5.10 characters, which would give an average key length of 9.86 bytes - pretty close to the observed mean of 10.35.
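As a concrete sketch, here's my reading of that key format in Python (the function names are mine, not chert's):

```python
def encode_docid(docid):
    # Length-byte-prefixed sortable encoding, as described above: a single
    # length byte, then the big-endian bytes of the docid with leading zero
    # bytes dropped.  (My reading of the format, not chert's actual code;
    # Xapian docids start at 1.)
    raw = docid.to_bytes((docid.bit_length() + 7) // 8, "big")
    return bytes([len(raw)]) + raw

def position_key(docid, term):
    # Position-table key: the encoded docid with the term name appended.
    return encode_docid(docid) + term.encode("utf-8")

# A docid around 71 million needs 4 value bytes plus the length byte,
# so a 5-letter term gives a 10-byte key, in line with the means above.
print(len(position_key(71_000_000, "hello")))  # 10
```

Note that putting the length byte first keeps the encoding sortable: a docid needing more bytes always compares greater.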

If we used a numeric termid in the key instead: this database has 142,942,820 distinct terms, which would encode as an average of 4.88 bytes using the same length-byte-prefixed sortable encoding. That gives an average key length of 9.64 bytes, a reduction of 0.71 bytes per entry, or nearly 6.5GB over the whole table.
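The arithmetic, for the record (figures taken from the text above):

```python
entries = 9_823_202_853            # entries in the position table
saving_per_entry = 10.35 - 9.64    # drop in mean key size, in bytes

# Total saving across the whole table, in GB (2^30 bytes).
print(entries * saving_per_entry / 2**30)  # ~ 6.5
```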

If we also specialise the key function per table, we could knock another whole byte off that and save a total of 15.6GB.

And we don't need a whole byte to store the length of the docid - we can use 3 bits and still support 2^64 documents, while using the other 5 bits to store the highest bits of the docid, which should typically save another 5/8 of a byte per item - that's a total of 21.4GB.
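A sketch of that packed encoding (hypothetical, and simplified: this naive version tops out at 2^61 rather than the full 2^64, which needs a slightly cleverer use of the length codes):

```python
def encode_docid_packed(docid):
    # Hypothetical packed encoding: the top 3 bits of the first byte count
    # the bytes which follow, and its low 5 bits hold the highest bits of
    # the docid itself.  Still sortable: a longer encoding has a larger
    # length field, which sits in the most significant bits.
    assert 0 < docid < 1 << 61
    extra = 0
    while docid >> (8 * extra) >= 32:  # high part must fit in 5 bits
        extra += 1
    first = (extra << 5) | (docid >> (8 * extra))
    tail = (docid & ((1 << (8 * extra)) - 1)).to_bytes(extra, "big")
    return bytes([first]) + tail

# 71,000,000 fits in 27 bits: 5 go in the first byte and 24 in three more,
# so 4 bytes rather than the 5 the current length-byte encoding uses.
print(len(encode_docid_packed(71_000_000)))  # 4
```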

Tags

Tag size range: 1 to 15,950 bytes (mean 3.11).

Frequency range: 1 to 3,585,463,883 occurrences of a particular tag size.

Total size of all tags is 29,151.6MB (or 28.5GB).

[Image: position-tag.png - distribution of tag sizes]

At first glance, the plotted red line may not be obvious - the graph starts at a maximum very close to the Y-axis, decaying quickly to small values very close to the X-axis.

This plot has log(1 + frequency) on the Y axis and shows the distribution more clearly:

[Image: position-tag-log.png - distribution of tag sizes, with log(1 + frequency) on the Y axis]

The surprising thing here is that most entries are so short - I guess this shows how effective the interpolative coding technique is. The average key is more than 3.3 times the size of the average tag.

It might be worth considering storing the positional data for each term as a chunked stream, as we do for postings - we could store a few hundred short entries in a chunk requiring only a single (albeit slightly more complex) key.
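A toy sketch of what that might look like (the layout and names here are my invention, not an actual chert format):

```python
def chunked_keys(term, postings, chunk_size=200):
    # Group (docid, encoded-positions) pairs for one term into chunks of a
    # few hundred entries, each stored under one key built from the term
    # and the first docid in the chunk - so one key per chunk instead of
    # one key per (docid, term) pair.
    chunks = []
    for i in range(0, len(postings), chunk_size):
        chunk = postings[i:i + chunk_size]
        first_docid = chunk[0][0]
        key = term.encode("utf-8") + b"\0" + first_docid.to_bytes(4, "big")
        chunks.append((key, chunk))
    return chunks

# 500 documents containing "hello" need just 3 keys instead of 500.
postings = [(docid, b"...") for docid in range(1, 501)]
print(len(chunked_keys("hello", postings)))  # 3
```

The big-endian first-docid suffix keeps chunks for a term in docid order, so a lookup can seek to the right chunk with one cursor positioning, much as the postlist table does for postings.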

Space breakdown

Item                                    | Total size (bytes) | Per entry (bytes) | Per leaf block (bytes)
----------------------------------------+--------------------+-------------------+-----------------------
Per entry overhead                      |     88,408,825,677 |            9.0000 |             3,273.5670
61,601 branch blocks                    |        504,635,392 |            0.0514 |                18.6854
Unused space in leaf blocks             |        302,262,121 |            0.0308 |                11.1920
Leaf block headers                      |        297,075,658 |            0.0302 |                11.0000
Overhead in leaf blocks from split tags |         12,870,336 |            0.0013 |                 0.4766

There are 721,568 continuation items for the 9,823,202,853 entries, which isn't all that many - avoiding splitting tags would only save 12.2MB.

Again, the per entry overhead is the most substantial of these (and the others are much less than a byte per entry in total).

Conclusions

It seems pretty clear that this table is currently rather inefficient in terms of disk space. The total size of the tags is 28.5GB, but the table is 206.5GB. We do need some overhead for the keys and other table structure which allow us to efficiently find an item, and to add, delete or replace items, but it shouldn't need to be so high.

This table is also almost 50% larger than all the others put together, and has more than 13 times as many entries as all the others put together, so per-block and per-item savings are particularly worthwhile here. Each byte off the per-entry overhead would save 9.1GB. Decreasing the average size of the key and specialising the key comparison function would also make a big difference.

Prefix-compressing keys would help too, but covers some of the same ground as decreasing the key size would.

Storing the positional data as chunked streams is worth considering too.


Posted in xapian by Olly Betts on 2009-12-15 14:41 | Permalink

