Olly
[ RSS | ATOM 1.0 ]
Powered by PyBlosxom

« Kilmister Track | Home | Numeric Term ID Implementation »

Numeric Term IDs

Nearly a decade ago, Open Muscat (the project Xapian has evolved out of) used integer term ids to represent terms internally. This turned out to be awkward to deal with when running searches over several databases together since any term will generally have a different term id in each database. It's especially problematic when generating relevance feedback terms since we can't run through the lists of terms for each document in the same order without sorting them.

So in late 2000, we changed to representing terms internally as strings.

As part of the work GMX are sponsoring on reducing database size, I've been revisiting this decision.

The Current Situation

People are often surprised to discover Xapian stores terms as strings, assuming it incurs a big overhead over using numeric ids, but the overhead isn't so large in practice.

The first thing to note is that in natural languages, the most common words tend to be the shortest, and common words will produce the terms which get repeated most in the database.

Also, we avoid having to store a lexicon mapping term string to numeric id, which saves space and also avoids an extra lookup for every term in the query.

Looking in more details, there are 3 places where terms appear in the database:

  • In keys for the postlist table.
  • In keys for the position table.
  • In the tags for the termlist table.

Looking at each of these in turn:

Postlist

The postlist table maps a term to a list of documents it appears in, along with the wdf (within-document frequency) of the term in each document. It also contains some other data, but that's not relevant here.

In Chert, each postlist is stored split into chunks of at most about 2000 bytes which allows us to skip forward without having to read all the intervening entries. The key for the first chunk is built from the termname, and those of subsequent chunks from the termname and the first document id in that chunk. So a rare term will probably only have a single entry, but a common term may have many.

Position

The position table holds positional information for terms. Given a term and a document id, it can tell you at what positions the given term appears in the given document (assuming you decide to record this information at indexing time). This is used to implement phrase searching and similar features.

In Chert, each position list entry key contains the docid followed by the term. Each term appears in a key for each document which that term appears in (assuming positional data was indexed for it).

So a rare term will appear only a small number of times, while a common term will appear many many times.

Termlist

The termlist table stores the list of terms for each document.

For Chert, we store the terms in sorted (by byte value) order, but instead of storing each term in full, we store the length of the prefix to reuse from the previous term, and then what to append to that. This gives a very compact encoding, particular when a document contains a lot of terms.

With numeric termids, we would probably store them sorted by termid, and encoded by storing the difference between each and the next as a variable length integer, or perhaps using interpolative coding (like we do with positional data) if that tended to give a smaller result (the downside is it is slower to encode and decode).

Implementation Plan

To avoid the issues with multiple databases, my plan is that the numeric termids will only be used within the backend, and term will still be strings in the rest of Xapian. That will also allow use of databases with termids alongside those which don't use them without having to invent termids on the fly.

However, to avoid the problem with query expansion which we had before, my initial plan is to leave the termlist table as it is. To use termids there, we would need to decode the whole termlist to strings and sort them, which is likely to make generating terms from relevance feedback measurably slower. Because we already have a good compression scheme here, the potential for savings is less, but we can look at this later if numeric termids work out elsewhere.

How Numeric Term IDs might compare

We will generally be storing numeric term IDs in the database using variable length integer encodings, so we want to try to allocate lower IDs to the most common terms - experience from other implementations using term IDs is that this makes a big difference. The exact ordering isn't so important, especially as there are only a few lengths of encoded termid, each with a range of termid values (so for example, a lot of different termids encode as 2 bytes).

However, we generally need to start allocating termids before we know the frequencies of the terms in the corpus, and it's costly to change the termid for a term once allocated, so this is hard to get perfect. To some extent a frequent term will tend to be seen sooner than a rare one, so will tend to get a lower termid if we just allocated them in ascending order the first time we see a term.

Looking at the analysis of the gmane database I blogged about in December, the numeric termids should make the average position table keysize 7.91 bytes rather than the current 9.53 bytes, will save about 14.8GB. This is assuming each termid is the average length (which is on the pessimistic side), which would be about a 4.3% saving on the total database size.

The space saving in the postlist table is harder to predict as the changes there are more complex, but I'd guess we'll save 3.9GB or so from key size changes, which is about a 1.1% saving.

But to achieve these savings, we now need to store a lexicon which maps termname to termid. The key is probably going to average about 6 bytes, and the tag just under 4, plus about 10 bytes of overhead per entry (from the size analysis) makes 20 bytes and there are 142,942,820 distinct terms, so that's something like 2.6GB (or 0.8%) overhead.

So assuming the estimates are about right, the postlist table should be a little smaller, and the position table substantially so.


Posted in xapian by Olly Betts on 2010-01-14 15:55 | Permalink


« Kilmister Track | Home | Numeric Term ID Implementation »