Olly
[ RSS | ATOM 1.0 ]
Powered by PyBlosxom

« Numeric Term IDs | Home | Downloading flickr CSV Logs »

Numeric Term ID Implementation

I've implemented much of the numeric term id support now. Currently only appending documents is fully supported, and I haven't changed the position table keys yet.

I made some minor additional changes too while I was working on the code.

In chert the first chunk of the postings is special and contains the term frequency (how many documents contain the term), and the collection frequency (how many times the term occurs in total), and also the first document in that chunk (in subsequent chunks this is read from the key). Also, all chunks contain a flag indicating if they are the last chunk for their term.

Both this special header and the flag mean that a chunk needs to be altered if it becomes or ceases to be the first or last chunk for a term, which makes the update code more complex and slower. So in the new code I've moved the term frequency and collection frequency to the lexicon entry (which also makes it cheaper if you just want to look these up), all chunk keys contain the first docid, and the lexicon entry also stores the lowest and highest document id which contain the term (the highest can be used to check if a chunk is the last one, replacing the flag in each chunk; the lowest is potentially useful elsewhere).

Overall these changes should be fairly size neutral - the frequencies have the same size wherever we put them; putting the first docid in the key shouldn't change much, and the last docid flag is a byte per chunk, while it'll take 1-4 bytes per term in the lexicon. Putting all these statistics for a term together also allows us to store them more compactly in some cases - for example if the term frequency is 1, we don't need to store it explicitly as its value can be deduced from the values of the lowest and highest document ids being equal.

Also, a term which only occurs once only has a lexicon entry, and no postlist chunk. In this case the wdf of the single occurrence is the collection frequency. In particular, this case covers unique ID terms which create one such term for each document.

I've built a couple of small test databases (of 5353 and 7829 documents). The indexing times seem to reliably be a bit faster than for chert, which is good. However, the sizes are disappointing - in both cases, the postlist is just over 10% larger with the modified brass than with chert. Compaction reduces the difference a little, but not by much. This may be less bad for larger databases, but it's hard to really know without building one.

I've also noticed some problems:

So I'm currently wondering how viable using numeric termids actually is. Making the postlist table bigger will tend to slow up all searches, as will adding the lexicon (as each term in the query will need an extra lookup).

We should save quite a bit by using termids in the keys of the position table, but we should be able to get a lot of that saving by prefix-compressing keys, which will also help the other tables.

So I'm currently thinking to put this to one side for now, and revisit it later once we've got prefix compressed keys. If numeric termids no longer improve the position key size then, there's not much going for them. On the other hand, inspiration might strike as to how to address the various issues above.


Posted in xapian by Olly Betts on 2010-01-25 21:39 | Permalink


« Numeric Term IDs | Home | Downloading flickr CSV Logs »