I've implemented much of the numeric term id support now. Currently only appending documents is fully supported, and I haven't changed the position table keys yet.
I made some minor additional changes too while I was working on the code.
Posted in xapian by Olly Betts on 2010-01-25 21:39 | Permalink
Nearly a decade ago, Open Muscat (the project Xapian has evolved out of) used integer term ids to represent terms internally. This turned out to be awkward to deal with when running searches over several databases together since any term will generally have a different term id in each database. It's especially problematic when generating relevance feedback terms since we can't run through the lists of terms for each document in the same order without sorting them.
So in late 2000, we changed to representing terms internally as strings.
As part of the work GMX are sponsoring on reducing database size, I've been revisiting this decision.
Posted in xapian by Olly Betts on 2010-01-14 15:55 | Permalink
We'd meant to go for a walk on Sunday, but it got to about 3pm before we actually got moving (my fault mostly), so we decided it had better be something fairly local. It looked like it was pretty windy up on the tops, so we plumped for the Kilmister Track in Belmont Regional Park, as it looked to be fairly sheltered, was close by, and we'd not done most of the route before. Also, it wasn't in OpenStreetMap yet.
There doesn't seem to be a lot of information about this track online, so I thought it worth a quick write up. The most useful existing information we found was GWRC's brief description and map (note that the loop described there isn't the one we did).
Posted in tramping by Olly Betts on 2009-12-24 01:11 | Permalink
I have rerun the analysis scripts on the converted database, and will summarise the changes.
The total size of the database has dropped from 346GB to 338GB. Here's the breakdown of the key size statistics:
| Table | Original key size range (bytes) | Original key size mean (bytes) | New key size range (bytes) | New key size mean (bytes) | Reduction in mean (bytes) |
|---|---|---|---|---|---|
| record | 2-5 | 4.76 | 2-4 | 3.94 | 0.82 |
| termlist | 2-5 | 4.76 | 2-4 | 3.94 | 0.82 |
| spelling | 3-65 | 10.29 | 3-65 | 10.29 | 0 |
| position | 3-70 | 10.35 | 3-69 | 9.53 | 0.82 |
| postlist | 1-193 | 17.23 | 1-191 | 15.38 | 1.85 |
The graphs are all essentially the same shape as before.
Looking at the "space breakdown" tables, the per-entry overhead has increased relative to the others in every case, which should come as little surprise. So reducing the per item overhead is more important than ever.
More unexpectedly, there are now more continuation items from splitting tags for all the affected tables. The explanation for this is presumably that with a shorter key, forcing an entry to be split to make use of unused space at the end of a block is more often going to save space (splitting an entry means that the key has to be repeated), and this effect is greater than the reduction in the number of entries which have to be split because of their size.
Posted in xapian by Olly Betts on 2009-12-18 14:14 | Permalink
To summarise my previous entry, my calculations predicted a size reduction of around 8.61GB, with the expectation that it would probably be a little larger due to second-order savings not being accounted for.
The conversion utility has now finished running, and the actual saving is 8.73GB, which is 2.5% of the database size.
| Table | Before (KB) | After (KB) | Reduction (KB) | Saving |
|---|---|---|---|---|
| spelling | 75,360 | 75,360 | 0 | 0% |
| record | 20,725,220 | 20,658,724 | 66,496 | 0.32% |
| postlist | 56,457,616 | 55,382,492 | 1,075,124 | 1.90% |
| termlist | 69,131,224 | 69,040,416 | 90,808 | 0.13% |
| position | 216,759,520 | 208,832,488 | 7,927,032 | 3.66% |
There are no changes to the spelling keys, hence there's no size change there. The position table has a lot of small entries, so benefits most. And the postlist table benefits from the improvements to both the uint and string encodings, so does next best.
Posted in xapian by Olly Betts on 2009-12-16 22:52 | Permalink
Looking through the analysis for each table, I noticed an easy way to reduce the key size for all the tables which include a document id (which is all except the spelling and synonym tables - the second of which wasn't present in the analysed database).
Posted in xapian by Olly Betts on 2009-12-16 17:16 | Permalink
Xapian's position table holds positional information for terms. That is, given a term and a document id, it can tell you at what positions the given term appears in the given document (assuming you decide to record this information at indexing time). This is used to implement phrase searching and similar features.
The format chert uses is unchanged from that used by flint, so see the description of the format of flint's position table on the Xapian wiki for details.
For most databases (including gmane's) this is the largest table by quite a margin. It's often as large as the others put together.
Posted in xapian by Olly Betts on 2009-12-15 14:41 | Permalink
The termlist table stores the list of terms for each document. This is used to allow postlist entries to be removed when documents are deleted or replaced, to report which query terms match a given document, and to implement query expansion.
If your documents have more distinct terms then the termlist tags will tend to be larger, though we store the terms sorted with common prefixes eliminated which will tend to save more for longer lists of terms.
Posted in xapian by Olly Betts on 2009-12-14 20:27 | Permalink
In Xapian's chert backend, the postlist table holds:
- A single entry containing summary statistics for the database (currently the total length of all documents, the used document id high-water mark, upper and lower bounds on the document length, and an upper bound on the wdf).
- User meta-data - one entry per metadata (key, value) pair.
- The document lengths, as a series of chunks.
- Statistics about values - one entry per value slot recording upper and lower bounds on the values which occur, and the number of documents for which a non-empty value is set.
- The document values, as a series of chunks. There's a separate series of chunks for each value slot.
- The posting list for each term, as a series of chunks.
The gmane database doesn't store any user meta-data or document values, so will only have the single summary statistics entry, the document length chunks, and the posting list chunks. The latter will hugely dominate.
Even if you're using values and user meta-data, it's likely the posting list chunks will dominate.
Posted in xapian by Olly Betts on 2009-12-14 18:55 | Permalink
The record table holds the "document data", as set by Xapian::Document::set_data().
The format chert uses is unchanged from that used by flint, so see the description of the format of flint's record table on the Xapian wiki for details.
The gmane indexer saves at most 340 bytes of "sample" from the message body, plus some other fields. Internally, Xapian compresses the tag using zlib, and it's the compressed size we're considering here. Obviously the numbers below are likely to vary quite considerably if you store more or less information in the document data, but this sort of size is probably fairly typical.
Posted in xapian by Olly Betts on 2009-12-14 18:35 | Permalink
The spelling table holds data used for suggesting spelling corrections. In the gmane database I'm analysing, it was generated after the database was, taking words from those indexed, but ignoring the rarer words. This means it contains fewer words than if every word in the database had been added. If all the words had been indexed for spelling correction, the n-gram tags would certainly be longer, and there would be more word frequency entries. There would probably be a few more n-gram entries too, which would have fairly short tags (since they wouldn't occur in many words).
Posted in xapian by Olly Betts on 2009-12-14 17:07 | Permalink
The first step to reducing the database size is to look at what uses the space in databases at the moment.
I maintain the Xapian-based search for the gmane mailing list archive, so I've used that as my initial case-study. It's a fairly large database (about 71 million documents taking 346GB).
I've bundled up the scripts and code I used in case you want to analyse your own database. If you do, I'd love to see the results, especially if you have a large database and they differ from mine. There's a README file in the tarball which explains how to use them.
Note that this script will take several hours to process a large table. I did bear efficiency in mind while writing it, but only to a certain extent.
Posted in xapian by Olly Betts on 2009-12-14 13:22 | Permalink