« The Spelling Table | Home | The Postlist Table »
The record table holds the "document data", as set by Xapian::Document::set_data().
The format chert uses is unchanged from that used by flint, so see the description of the format of flint's record table on the Xapian wiki for details.
The gmane indexer saves at most 340 bytes of "sample" from the message body, plus some other fields. Internally, Xapian compresses the tag using zlib, and it's the compressed size we're considering here. Obviously the numbers below are likely to vary quite considerably if you store more or less information in the document data, but this sort of size is probably fairly typical.
Table size: 70,978,511 entries in 2,588,122 8KB blocks (20,219.7MB).
Level | Number of blocks | Comment |
---|---|---|
3 | 1 | The root block |
2 | 10 | |
1 | 4,980 | |
0 | 2,583,131 | Leaf blocks, which hold the tag data |
Key size range: 2 to 5 bytes (mean 4.76).
Frequency range: 255 to 54,201,296 occurrences of a particular key size.
The key consists of an initial byte which has the length of the key, followed by the document id in big-endian format with leading zero bytes removed. The initial byte is required to make the keys sort correctly with a string sort, so we could shorten all keys by a byte if we allowed the sort routine to be specialised for each table.
Tag size range: 35 to 513 bytes (mean 282.16).
Frequency range: 1 to 1,002,829 occurrences of a particular tag size.
Total size of all tags is 19,099.3MB.
This plot has log(1 + frequency) on the Y axis and shows the tails of the distribution more clearly:
There are 8,498,293 continuation items for the 70,978,511 entries. The maximum tag size is short enough that we should never have to split entries, but the B-tree manager can decide to do so if the split entry requires less storage than the unsplit item when wasted space at the end of blocks is taken into account.
Item | Total size (bytes) | Per entry (bytes) | Per leaf block (bytes) |
---|---|---|---|
Per entry overhead | 638,806,599 | 9.0000 | 247.2993 |
Overhead in leaf blocks from split tags | 116,909,065 | 1.6471 | 45.2587 |
4,991 branch blocks | 40,886,272 | 0.5760 | 15.8282 |
Leaf block headers | 28,414,441 | 0.4003 | 11.0000 |
Unused space in leaf blocks | 11,800,570 | 0.1663 | 4.5683 |
The record table's contents is particularly variable depending on how Xapian is being used in the application. The tag sizes produced by gmane are probably fairly typical, but optimisations which also work as well or better for larger tag sizes are particularly worthwhile.
Avoiding the need to split large tags, or removing the overhead of doing so is likely to make a significant difference here, and more so for larger tags.
Reducing the per entry overhead would make quite a difference.
Being able to specialise the key comparison function would allow keys to be a byte shorter, though prefix compressing keys would probably be somewhat redundant with that.
Posted in xapian by Olly Betts on 2009-12-14 18:35 | Permalink