« The Record Table | Home | The Termlist Table »
In Xapian's chert backend, the postlist table holds:
- A single entry containing summary statistics for the database (currently the total length of all documents, the used document id high-water mark, upper and lower bounds on the document length, and an upper bound on the wdf).
- User meta-data - one entry per metadata (key, value) pair.
- The document lengths, as a series of chunks.
- Statistics about values - one entry per value slot recording upper and lower bounds on the values which occur, and the number of documents for which a non-empty value is set.
- The document values, as a series of chunks. There's a separate series of chunks for each value slot.
- The posting list for each term, as a series of chunks.
The gmane database doesn't store any user meta-data or document values, so will only have the single summary statistics entry, the document length chunks, and the posting list chunks. The latter will hugely dominate.
Even if you're using values and user meta-data, it's likely the posting list chunks will dominate.
Table size: 566,064,831 entries in 7,050,309 8KB blocks (55,080.5MB).
Level | Number of blocks | Comment |
---|---|---|
3 | 1 | The root block |
2 | 65 | |
1 | 20,814 | |
0 | 7,029,429 | Leaf blocks, which hold the tag data |
Key size range: 1 to 193 bytes (mean 17.23).
Frequency range: 1 to 52,259,610 occurrences of a particular key size.
The keys for the postlist chunks contain the termname, and all chunks after the first also have the first document id of the chunk in the key. Almost all keys are fairly short.
Common words in human languages naturally tend to be shorter, so using the termname literally works pretty well, but we could try a lexicon to map termnames to integers which we then use in the postlist chunk keys. We could also use these in the keys of the position list table, and possibly in the termlist table tags.
Tag size range: 3 to 2,016 bytes (mean 74.60).
Frequency range: 6 to 165,083,656 occurrences of a particular tag size.
Total size of all tags is 40,272.7MB.
Shorter tags clearly dominate. This plot has log(1 + frequency) on the Y axis and shows the distribution of longer entries better:
There's a limit of about 2000 bytes on the chunk size which is why there's a sharp cutoff in length, with a spike in frequency for lengths just before it (we cut off at the end of an entry, so the length we cut off at can vary by a few bytes).
There are 17,129,701 continuation items for the 566,064,831 entries, so we're splitting a lot of entries. The chunk size limit should mean we aren't forced to split many entries (at least with the default 8KB block size) so these are presumably split because the Btree manager found it took less space overall to do so.
Item | Total size (bytes) | Per entry (bytes) | Per leaf block (bytes) |
---|---|---|---|
Per entry overhead | 5,094,583,479 | 9.0000 | 724.7507 |
Overhead in leaf blocks from split tags | 367,984,644 | 0.6501 | 52.3492 |
20,880 branch blocks | 171,048,960 | 0.3022 | 24.3333 |
Leaf block headers | 77,323,719 | 0.1366 | 11.0000 |
Unused space in leaf blocks | 61,818,819 | 0.1092 | 8.7943 |
The per-entry overhead is the largest of the overheads, but avoiding the overhead of splitting tags (or the alternative of wasting unused space) would help too.
It would be worthwhile experimenting with a lexicon and numeric term ids - the termname would then only be used in the key for the lexicon entry, and the postlist chunks would use the numeric termid as the key.
The keys can be fairly long, and many will have a common prefix, so prefix compressing keys would make a substantial difference. It's likely to achieve some of the same gain that numeric term ids would.
Posted in xapian by Olly Betts on 2009-12-14 18:55 | Permalink