Olly


Xapian 1.3 Branched

(Actually, we branched six weeks ago, but I've not got around to writing about it until now.)

The development branch approach we used for 1.1.x development releases leading to a stable 1.2.0 release seemed to work pretty well, so we're adopting that again.

The main problem last time was that it took a long time to actually stabilise 1.1.x because we kept slipping more changes in. For 1.3.x, we need to be more disciplined: changes should be developed on a branch and not merged prematurely. We now have solid git mirroring, so developing on a branch is a more pleasant experience than before. We also need to be brutal sooner. It's better for everyone to (say) achieve two release series in two years than to have one release series take two years.

When I was in the UK back in May, Richard and I sat down and hashed out a list of goals for a 1.4 release series. This is what we came up with (the order is just how they came to mind, so isn't really significant):


read more…

Posted in xapian by Olly Betts on 2011-07-22 16:29 | Permalink

Xapian GSoC Applications for 2011

Student applications for GSoC closed a few hours ago. This is Xapian's first year as a mentoring organisation (though I've been involved in previous years with SWIG and Debian) and we've been blown away by the response from students.

If you'd asked me when we got accepted, I'd have guessed we might get 20 applications and feel we'd done well, but counting up now we have 42. Ignoring two which were withdrawn (one a duplicate, one a spam application which, surprisingly, was withdrawn when I politely suggested such applications weren't useful), here is a graph of applications against time:

[Figure: graph of student applications to Xapian in GSoC 2011]

If you're an admin or a mentor, you can produce a similar graph for your own org(s) - just download this OpenDocument spreadsheet and follow the instructions inside.

Now the task of selection starts in earnest. I've gone through and marked the seven spam proposals as ineligible (that's one-line proposals, proposals with no connection at all to Xapian, and proposals which are just a title and/or a paste from our ideas list with a generic biography).

That leaves 33, but before our student applicants start to despair, not all of those are really in the running! I don't have a good picture yet, but it looks like there are something like 10-15 we'll be seriously considering.


Posted in xapian by Olly Betts on 2011-04-09 14:36 | Permalink

Numeric Term ID Implementation

I've implemented much of the numeric term id support now. Currently only appending documents is fully supported, and I haven't changed the position table keys yet.

I made some minor additional changes too while I was working on the code.


read more…

Posted in xapian by Olly Betts on 2010-01-25 21:39 | Permalink

Numeric Term IDs

Nearly a decade ago, Open Muscat (the project Xapian has evolved out of) used integer term ids to represent terms internally. This turned out to be awkward to deal with when running searches over several databases together since any term will generally have a different term id in each database. It's especially problematic when generating relevance feedback terms since we can't run through the lists of terms for each document in the same order without sorting them.

So in late 2000, we changed to representing terms internally as strings.
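To illustrate the ordering problem, here is a minimal Python sketch. It is not Xapian code: the two term dictionaries, the document contents and the helper names (`termlist_by_id`, `merge_counts`) are all made up for the example. It shows that termlists ordered by per-database integer ids don't line up across databases, whereas termlists ordered by the term string can be merged in a single pass.

```python
import heapq

# Hypothetical per-database term dictionaries. Ids are assigned independently
# by each database, so the same term gets a different id in each one.
db_a_ids = {"xapian": 1, "search": 2, "index": 3}
db_b_ids = {"index": 1, "xapian": 2, "query": 3}


def termlist_by_id(term_ids, terms):
    """Return a document's terms in ascending term-id order."""
    return sorted(terms, key=lambda term: term_ids[term])


# Termlists as they would be stored if keyed by integer term id.
doc_a = termlist_by_id(db_a_ids, {"xapian", "search", "index"})
doc_b = termlist_by_id(db_b_ids, {"index", "xapian", "query"})
# The two lists are not in a common order, so they can't be merged directly.


def merge_counts(*sorted_termlists):
    """Count how many lists contain each term, in one pass over sorted input."""
    counts = []
    for term in heapq.merge(*sorted_termlists):
        if counts and counts[-1][0] == term:
            counts[-1][1] += 1
        else:
            counts.append([term, 1])
    return [tuple(entry) for entry in counts]


# With string terms, every database can store termlists in string order,
# so a streaming merge works without any lookup or re-sorting step.
# (sorted() here stands in for "already stored in string order".)
merged = merge_counts(sorted(doc_a), sorted(doc_b))
print(merged)
```

With integer ids, each list would first have to be mapped back to strings and sorted before a merge like this could run, which is the extra work described above.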

As part of the work GMX are sponsoring on reducing database size, I've been revisiting this decision.


read more…

Posted in xapian by Olly Betts on 2010-01-14 15:55 | Permalink

Gmane Size Analysis Update

I have rerun the analysis scripts on the converted database, and will summarise the changes.

The total size of the database has dropped from 346GB to 338GB. Here's the breakdown of the key size statistics:

Table     Original key size (bytes)  New key size (bytes)  Reduction in
          range      mean            range      mean       mean (bytes)
record    2-5        4.76            2-4        3.94       0.82
termlist  2-5        4.76            2-4        3.94       0.82
spelling  3-65       10.29           3-65       10.29      0
position  3-70       10.35           3-69       9.53       0.82
postlist  1-193      17.23           1-191      15.38      1.85

The graphs are all essentially the same shape as before.

Looking at the "space breakdown" tables, the per-item overhead has increased relative to the other components in every case, which should come as little surprise. So reducing the per-item overhead is more important than ever.

More unexpectedly, there are now more continuation items from splitting tags for all the affected tables. The explanation for this is presumably that with a shorter key, forcing an entry to be split to make use of unused space at the end of a block is more often going to save space (splitting an entry means that the key has to be repeated), and this effect is greater than the reduction in the number of entries which have to be split because of their size.


Posted in xapian by Olly Betts on 2009-12-18 14:14 | Permalink

Low-Hanging Key Size Reduction Results

To summarise my previous entry, my calculations predicted a size reduction of around 8.61GB, with the expectation that it would probably be a little larger due to second-order savings not being accounted for.

The conversion utility has now finished running, and the actual saving is 8.73GB, which is 2.5% of the database size.

Table     Before (KB)   After (KB)  Reduction (KB)  Saving
spelling       75,360       75,360               0      0%
record     20,725,220   20,658,724          66,496   0.32%
postlist   56,457,616   55,382,492       1,075,124   1.90%
termlist   69,131,224   69,040,416          90,808   0.13%
position  216,759,520  208,832,488       7,927,032   3.66%

There are no changes to the spelling keys, hence there's no size change there. The position table has a lot of small entries, so benefits most. And the postlist table benefits from the improvements to both the uint and string encodings, so does next best.
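As a cross-check, the totals can be recomputed from the per-table figures. The short Python sketch below uses only the numbers in the table above. The one assumption is that 1GB here means 1024 × 1024 KB; the last digit of the GB figure depends on that choice and on rounding.

```python
# Table sizes in KB, copied from the table above: (before, after).
sizes_kb = {
    "spelling": (75_360, 75_360),
    "record": (20_725_220, 20_658_724),
    "postlist": (56_457_616, 55_382_492),
    "termlist": (69_131_224, 69_040_416),
    "position": (216_759_520, 208_832_488),
}

KB_PER_GB = 1024 * 1024  # assumption: binary units

total_before = sum(before for before, _ in sizes_kb.values())
total_saved = sum(before - after for before, after in sizes_kb.values())

# Per-table reduction and percentage saving.
for table, (before, after) in sizes_kb.items():
    saved = before - after
    print(f"{table:<9} {saved:>10,} KB  {100 * saved / before:.2f}%")

# Overall saving, in GB and as a percentage of the original database size.
saved_gb = total_saved / KB_PER_GB
saved_pct = 100 * total_saved / total_before
print(f"total     {total_saved:>10,} KB  {saved_gb:.1f} GB  {saved_pct:.1f}%")
```

The per-table reductions sum to 9,159,460 KB, which is consistent with the roughly 8.7GB and 2.5% quoted above.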


Posted in xapian by Olly Betts on 2009-12-16 22:52 | Permalink

