« Numeric Term IDs | Home | Downloading flickr CSV Logs »
I've implemented much of the numeric term id support now. Currently only appending documents is fully supported, and I haven't changed the position table keys yet.
I made some minor additional changes too while I was working on the code.
In chert the first chunk of the postings is special and contains the term frequency (how many documents contain the term), and the collection frequency (how many times the term occurs in total), and also the first document in that chunk (in subsequent chunks this is read from the key). Also, all chunks contain a flag indicating if they are the last chunk for their term.
Both this special header and the flag mean that a chunk needs to be altered if it becomes or ceases to be the first or last chunk for a term, which makes the update code more complex and slower. So in the new code I've moved the term frequency and collection frequency to the lexicon entry (which also makes it cheaper if you just want to look these up), all chunk keys contain the first docid, and the lexicon entry also stores the lowest and highest document id which contain the term (the highest can be used to check if a chunk is the last one, replacing the flag in each chunk; the lowest is potentially useful elsewhere).
Overall these changes should be fairly size neutral - the frequencies have the same size wherever we put them; putting the first docid in the key shouldn't change much, and the last docid flag is a byte per chunk, while it'll take 1-4 bytes per term in the lexicon. Putting all these statistics for a term together also allows us to store them more compactly in some cases - for example if the term frequency is 1, we don't need to store it explicitly as its value can be deduced from the values of the lowest and highest document ids being equal.
Also, a term which only occurs once only has a lexicon entry, and no postlist chunk. In this case the wdf of the single occurrence is the collection frequency. In particular, this case covers unique ID terms which create one such term for each document.
I've built a couple of small test databases (of 5353 and 7829 documents). The indexing times seem to reliably be a bit faster than for chert, which is good. However, the sizes are disappointing - in both cases, the postlist is just over 10% larger with the modified brass than with chert. Compaction reduces the difference a little, but not by much. This may be less bad for larger databases, but it's hard to really know without building one.
I've also noticed some problems:
Currently I allocate termids when the batch of documents is flushed. At this point I could sort the terms such that the most common get lower term ids, but I haven't currently because I spotted an issue with using termids for the position keys - currently we write these out when the document is added or modified, which means we will either need to allocate the termids at that point, or else buffer all the positional data until we flush, which will vastly increase the amount of data in memory, and reduce the number of documents which can be buffered. Using termids in the position keys pretty much forces us to allocate termids as each document is added or updated.
Since the big gain from termids is in the position table keys, this seems rather a problem. We need to use them here, but if we do, it seems we can't allocate them efficiently.
I haven't tried to update xapian-compact yet, but in the merging case it will be much less efficient because the termids will need changing for data from all except one of the source databases, and we'll need to read data out of order from these databases so that we can write it to the output database in order (which gets us the best compaction).
Currently I've ignored the issue of recycling termids which are no longer in use - if documents are deleted a lot, we could end up with many unused termids, but tracking these needs us to store yet more data...
So I'm currently wondering how viable using numeric termids actually is. Making the postlist table bigger will tend to slow up all searches, as will adding the lexicon (as each term in the query will need an extra lookup).
We should save quite a bit by using termids in the keys of the position table, but we should be able to get a lot of that saving by prefix-compressing keys, which will also help the other tables.
So I'm currently thinking to put this to one side for now, and revisit it later once we've got prefix compressed keys. If numeric termids no longer improve the position key size then, there's not much going for them. On the other hand, inspiration might strike as to how to address the various issues above.
Posted in xapian by Olly Betts on 2010-01-25 21:39 | Permalink