Olly
[ RSS | ATOM 1.0 ]
Powered by PyBlosxom

« The Spelling Table | Home | The Postlist Table »

The Record Table

The record table holds the "document data", as set by Xapian::Document::set_data().

The format chert uses is unchanged from that used by flint, so see the description of the format of flint's record table on the Xapian wiki for details.

The gmane indexer saves at most 340 bytes of "sample" from the message body, plus some other fields. Internally, Xapian compresses the tag using zlib, and it's the compressed size we're considering here. Obviously the numbers below are likely to vary quite considerably if you store more or less information in the document data, but this sort of size is probably fairly typical.

Analysis

Table size: 70,978,511 entries in 2,588,122 8KB blocks (20,219.7MB).

Block level breakdown

Level Number of blocks Comment
3 1 The root block
2 10  
1 4,980  
0 2,583,131 Leaf blocks, which hold the tag data

Keys

Key size range: 2 to 5 bytes (mean 4.76).

Frequency range: 255 to 54,201,296 occurrences of a particular key size.

The key consists of an initial byte which has the length of the key, followed by the document id in big-endian format with leading zero bytes removed. The initial byte is required to make the keys sort correctly with a string sort, so we could shorten all keys by a byte if we allowed the sort routine to be specialised for each table.

record-key.png

Tags

Tag size range: 35 to 513 bytes (mean 282.16).

Frequency range: 1 to 1,002,829 occurrences of a particular tag size.

Total size of all tags is 19,099.3MB.

record-tag.png

This plot has log(1 + frequency) on the Y axis and shows the tails of the distribution more clearly:

record-tag-log.png

There are 8,498,293 continuation items for the 70,978,511 entries. The maximum tag size is short enough that we should never have to split entries, but the B-tree manager can decide to do so if the split entry requires less storage than the unsplit item when wasted space at the end of blocks is taken into account.

Space breakdown

Item Total size (bytes) Per entry (bytes) Per leaf block (bytes)
Per entry overhead 638,806,599 9.0000 247.2993
Overhead in leaf blocks from split tags 116,909,065 1.6471 45.2587
4,991 branch blocks 40,886,272 0.5760 15.8282
Leaf block headers 28,414,441 0.4003 11.0000
Unused space in leaf blocks 11,800,570 0.1663 4.5683

Conclusions

The record table's contents is particularly variable depending on how Xapian is being used in the application. The tag sizes produced by gmane are probably fairly typical, but optimisations which also work as well or better for larger tag sizes are particularly worthwhile.

Avoiding the need to split large tags, or removing the overhead of doing so is likely to make a significant difference here, and more so for larger tags.

Reducing the per entry overhead would make quite a difference.

Being able to specialise the key comparison function would allow keys to be a byte shorter, though prefix compressing keys would probably be somewhat redundant with that.


Posted in xapian by Olly Betts on 2009-12-14 18:35 | Permalink


« The Spelling Table | Home | The Postlist Table »