How search.gmane.org works
Olly Betts
Gmane is...
An archive of public mailing lists
Genesis
Lars Magne Ingebrigtsen:
"I just thought it was a good idea"
(ding) Gnus
Rewrite of GNUS newsreader
Norway
100% Free/Open Source Software
On the search side, notably libgmime and Xapian
Gmane statistics
- 12 thousand lists
- 92 million messages (74 million searchable)
- 35-40 thousand new messages each day
we:search
- First search facility added in December 2002
- "we:search", written by Lars in a weekend
- Worked pretty well, but only basic features
- Index fragmentation caused a slowdown, which required reindexing
- My co-worker suggested Xapian and got me involved
Search Hardware
- Current (rain)
- 2U Athlon 64 2GHz machine, 3GB RAM, ~800GB disk. INN spool NFS mounted.
- New (plane)
- 1U Dual dual-core Opteron 2.4GHz machine, 16GB RAM, 3x1.5TB disks. Own spool mirror.
The Date Spool
- Make Xapian numeric document ids match date order
- Provides very fast "sort by date"
- Similar trick can be used with other global orderings
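A minimal sketch of why this helps, in plain Python rather than Xapian itself: if document ids are handed out in date order, sorting a result set by date is just sorting the ids, with no per-document date lookup. The function names and sample ids are made up for illustration.

```python
def newest_first(matching_docids):
    """Order matches newest-first using only their docids.

    Assumes docids were assigned in increasing date order, so a larger
    docid never belongs to an earlier message.
    """
    return sorted(matching_docids, reverse=True)


def oldest_first(matching_docids):
    """Order matches oldest-first under the same assumption."""
    return sorted(matching_docids)


matches = {40213, 17, 903155, 5120}  # hypothetical docids from a search
print(newest_first(matches))  # [903155, 40213, 5120, 17]
```

The same idea applies to any other global ordering that is fixed at index time: assign ids in that order and the sort comes for free.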
Format of the Date Spool
One file per minute, e.g.: /ispool/datespool/2009/09/21/13/10
Contents:
macro@sourceware.org
1253538608
1426
From: macro@sourceware.org
Subject: src/gas/testsuite ChangeLog gas/mips/eret-1.d ...
Xref: news.gmane.org gmane.comp.gnu.binutils.cvs:14452
CVSROOT: /cvs/src
Module name: src
[...]
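The second line of the example contents looks like a Unix timestamp, and reading it as UTC gives exactly the minute in the example path. A small sketch of that mapping follows. That the spool uses UTC is an inference from this one example, and the function name is hypothetical.

```python
from datetime import datetime, timezone

DATESPOOL_ROOT = "/ispool/datespool"  # root directory taken from the example path


def minute_file(timestamp, root=DATESPOOL_ROOT):
    """Return the per-minute date spool file for a Unix timestamp.

    Assumes the directory layout is YYYY/MM/DD/HH/MM in UTC.
    """
    t = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return root + t.strftime("/%Y/%m/%d/%H/%M")


print(minute_file(1253538608))  # /ispool/datespool/2009/09/21/13/10
```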
Creating the Date Spool
- Parse all messages in the INN spool (use libgmime)
- Create list of:
<time_t> <path>
- Sort it
- Create date spool in that order
- Build Xapian index from datespool
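The first three steps above can be sketched roughly as follows. This uses Python's standard email module as a stand-in for libgmime, and the function names are hypothetical. It also assumes every message has a well-formed Date header with an explicit UTC offset, which real mail often does not.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def message_time(raw_message):
    """Return the message's Date header as a Unix timestamp (time_t)."""
    msg = message_from_string(raw_message)
    return int(parsedate_to_datetime(msg["Date"]).timestamp())


def date_order(messages):
    """Build the sorted "<time_t> <path>" list.

    `messages` is an iterable of (path, raw_message) pairs. Messages with
    the same timestamp are ordered by path so the result is deterministic.
    """
    pairs = [(message_time(raw), path) for path, raw in messages]
    pairs.sort()
    return pairs


spool = [
    ("b/2", "Date: Mon, 21 Sep 2009 13:10:08 +0000\n\nlater message"),
    ("a/1", "Date: Sun, 20 Sep 2009 13:10:08 +0000\n\nearlier message"),
]
for time_t, path in date_order(spool):
    print(time_t, path)
```

Writing the date spool in this order, and indexing it in the same order, is what makes document ids follow message dates.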
Updating the Date Spool
- New messages are added to the date spool
- Also appended to an incremental file
- Build incremental Xapian index and merge
Spelling
- Uses Xapian's spell correction features
- Generate spell dictionary without rarest terms
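As an illustration of the second point, a spelling dictionary can be built from term counts while dropping anything below a frequency threshold, presumably because the rarest terms are often misspellings themselves. This is a plain Python sketch, not Xapian's API. The whitespace tokenisation and the threshold value are assumptions.

```python
from collections import Counter


def spelling_terms(documents, min_count=2):
    """Return {term: count} for terms that occur at least `min_count` times.

    `min_count` is a hypothetical cut-off; the real threshold is not
    given in the slides.
    """
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return {term: n for term, n in counts.items() if n >= min_count}


docs = ["xapian index", "xapian serach", "index merge"]
print(spelling_terms(docs))  # {'xapian': 2, 'index': 2}
```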
Stemming
- Removal of linguistic suffixes
- Aim: to improve recall
- Stem by default
- Capitalised words and words inside quotes are not stemmed
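A rough sketch of these rules applied to a query string. The toy_stem function is a deliberately crude stand-in for a real stemmer, and lowercasing the terms that are left unstemmed is an assumption made for this example.

```python
import re

# A double-quoted phrase, or a run of non-space characters.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def toy_stem(word):
    """Stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def query_terms(query):
    """Stem by default, but not capitalised words or quoted phrases."""
    terms = []
    for phrase, word in TOKEN.findall(query):
        if phrase:
            # Inside quotes: keep every word unstemmed.
            terms.extend(w.lower() for w in phrase.split())
        elif word:
            if word[0].isupper():
                terms.append(word.lower())  # capitalised: unstemmed
            else:
                terms.append(toy_stem(word.lower()))
    return terms


print(query_terms('indexing Words "merging lists" searches'))
# ['index', 'words', 'merging', 'lists', 'search']
```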
Stop words
- Suppression of common words with little "meaning"
- Done at search time
- Not applied inside quotes, or to words with a plus prefix
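A sketch of search-time stop word handling along these lines. The stop word list is illustrative only, and the way the query is tokenised is an assumption.

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative list only

# A double-quoted phrase, or a run of non-space characters.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def drop_stop_words(query):
    """Remove stop words, except inside quotes or after a plus prefix."""
    kept = []
    for phrase, word in TOKEN.findall(query):
        if phrase:
            kept.append(f'"{phrase}"')  # quoted phrase: keep as typed
        elif word.startswith("+") and len(word) > 1:
            kept.append(word[1:])  # plus prefix forces the word in
        elif word and word.lower() not in STOP_WORDS:
            kept.append(word)
    return kept


print(drop_stop_words('the history of "the who" +the band'))
# ['history', '"the who"', 'the', 'band']
```

Because the suppression happens at search time, the stop words are still in the index, which is what lets quoted phrases and plus-prefixed words match them.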
Filtering
- Filter on author name or email
- Filter by group or sub-hierarchy
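Filtering by sub-hierarchy amounts to a prefix match on dot-separated group names. A minimal sketch, assuming a trailing ".*" marks a hierarchy and that the hierarchy pattern also matches a group named exactly like its prefix. The function name is hypothetical.

```python
def group_matches(group, pattern):
    """True if `group` is `pattern`, or lies under the hierarchy `prefix.*`."""
    if pattern.endswith(".*"):
        prefix = pattern[:-2]
        # Compare whole components, so "perl.*" does not match "perlish".
        return group == prefix or group.startswith(prefix + ".")
    return group == pattern


print(group_matches("gmane.comp.lang.perl.cpan.testers", "gmane.comp.lang.perl.*"))  # True
print(group_matches("gmane.comp.lang.perlish", "gmane.comp.lang.perl.*"))  # False
```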
Sorting
- By relevance
- Newest first
- Oldest first
Future Plans
- Finish commissioning the new server
- Search API (e.g. RSS feeds of results)
- Easier group search (cpan not gmane.comp.lang.perl.cpan.*)
- More frequent updates
The End

Questions welcome
- notmuch BOF (Carl Worth) - Thursday 11:30
- Building a Xapian index of Wikipedia - Thursday 1:30
- Xapian BOF - Thursday 2:30
- Also happy to chat any time during LCA