The termlist table stores the list of terms for each document. This is used to allow postlist entries to be removed when documents are deleted or replaced, to report which query terms match a given document, and to implement query expansion.

If your documents have more distinct terms then the termlist tags will tend to be larger, though we store the terms sorted with common prefixes eliminated which will tend to save more for longer lists of terms.

Analysis

Table size: 70,978,511 entries in 8,632,963 8KB blocks (67,445.0MB).

Block level breakdown

Level	Number of blocks	Comment
3	1	The root block
2	33
1	16,603
0	8,616,326	Leaf blocks, which hold the tag data

Keys

Key size range: 2 to 5 bytes (mean 4.76).

Frequency range: 255 to 54,201,296 occurrences of a particular key size.

The key format is the same as that of the record table, and again we could shorten all keys by a byte if the sort routine could be specialised for each table.

Tags

Tag size range: 24 to 621,628 bytes (mean 972.85).

Frequency range: 1 to 80,928 occurrences of a particular tag size.

Total size of all tags is 65,852.9MB.

While there are some really long tags, most are fairly modest in size (although note the X-axis scale - the first tick is 100,000 bytes).

This plot has log(1 + frequency) on the Y axis and shows the distribution more clearly:

There are 32,634,436 continuation items for the 70,978,511 entries. Most tags aren't getting split, but many of those that are require a lot of continuation items.

Space breakdown

Item	Total size (bytes)	Per entry (bytes)	Per leaf block (bytes)
Per entry overhead	638,806,599	9.0000	74.1391
Overhead in leaf blocks from split tags	448,812,840	6.3232	52.0887
16,637 branch blocks	136,290,304	1.9202	15.8177
Leaf block headers	94,779,586	1.3353	11.0000
Unused space in leaf blocks	12,748,180	0.1796	1.4795

Conclusions

Unlike the previous tables we've looked at, the "per entry overhead" isn't so dominant here. We're losing a lot of space from having to split large tag values into a lot of <2KB pieces, and being able to store large tags unsplit would make a great difference here. Reducing the per entry overhead would be worthwhile too.

Being able to specialise the key comparison function would allow keys to be a byte shorter, though prefix compressed keys would probably be somewhat redundant with that.