Modelling Your Data in Xapian

Olly Betts

Modern Expectations

Search is increasingly rich.

Overview

Xapian provides efficient search of large quantities of text
Lots of filtering and sorting options too
Somewhat akin to "NoSQL" databases

Like a Book's Index

But usually a book index isn't exhaustive, but a Xapian index is.

Not like a SQL Index

A SQL index provides efficient access to rows ordered by a key.

What's a Document?

Often obvious, but a key thing to decide!

The "document" is really the "thing you want to search for".

Examples of "Documents"

web page
song
person
newspaper article
book (e.g. library catalogue)
subsection of a book (e.g. full text search of Shakespeare)
search at a smaller granularity and collapse matches

Harder Cases

Long documents without explicit structure are problematic:

IRC log (conversation bursts? overlapping windows?)
scanned book (page?)
arbitrary chunking probably better than nothing

Examples of Documents

Web pages
Web bookmarks
Songs
Politicians
Products
VCS commits

Document Granularity

Sometimes the granularity needs thought:

Newspaper articles (not newspapers)
Books:
- Library catalogue search: each book is a document
- Full text search: page or section

Collapsing can help here

Xapian::Document Anatomy

A unique positive integer identifier
A blob of opaque "document data"
Terms
Document values

Document ID

Auto-increments by default
Can be set explicitly (e.g. to match external ID)
- Batch adding much more efficient if in ascending order
Non-integer external IDs: index as a term, store in data

Document Data

Opaque to Xapian - just a variable size cargo of bytes
Xapian will try to gzip it
Stored assuming used for showing "a result"
Some possible uses:
- Don't! (just use document ID)
- Store ID for external system (e.g. URL, filename, ISBN)
- Serialise fields e.g. title, author, date, sample
Don't look up document data during match!

Terms

Terms are the "index entries", used for both text searching and boolean filtering.

Each term maps to:

A list of document IDs
For each document in the list:
- Optional list of positions (phrases, NEAR, ADJ, etc)
- Within Document Frequency (wdf)

Term Uses

Words in text (perhaps per-field)
Meta-data for filtering (e.g. "Content-Type is text/html")
External unique ID

Terms from text

term may be prefixed for fielded search (e.g. Sxapian)
wdf is how many times this term occurs in this document
- document length = Σ(wdf)
positions are word offsets into the document

Boolean Filter Terms

no inherent internal difference
- wdf is usually 0
- usually no positions
terms prefixed to separate filter types (e.g. Mtext/html, Hxapian.org)

Document Values

numbered slots optionally containing a string for each document
fast access to many entries in the same slot

Uses of Document Values

ordering results (e.g. sort by date, price, distance from user)
filtering (e.g. date range restrictions, within X km)
- invoice modified:6/3/1999..10/12/1999
query-independent relevance (e.g. hyperlink analysis, age bias)
grouping and collapsing matches (e.g. no more than 3 matches per website)
don't abuse for storing fields just for displaying results

Also in the Database

Not directly related to documents:

each database has a UUID
spelling data - dynamic and/or dictionary
- edit distance with n-gram acceleration
synonyms
user metadata - arbitrary key->value store
- changes committed atomically with other database change

Stemming

Xapian supports stemming

built-in algorithms for many languages
supply your own
normalise forms of the "same word" - e.g. football and footballer -> footbal
improves "recall" at the expense of "precision"
Xapian default: unstemmed terms with positions + stems without positions

Filtering - term or value?

Both terms and values can be used for filtering, so which to use?

a value needs checking for every candidate
- the matcher knows to check values last
terms faster for an exact filter (e.g. Mtext/html)
values better for arbitrary ranges (e.g. 11/11/1918..12/10/2010)
if exact and range searches are common, "both" may be the right answer

Query Independent Weighting

(e.g. from hyperlink analysis)

Encode the weight for each document in a value slot, and use a PostingSource subclass to either supply the whole weight, or contribute to the weight.

Meeting Expectations

Geospatial Filtering and Weighting

...

The End

Questions welcome

Image credits:

openclipart.org for clipart
infobunny for You know it's an Andrea Dworkin index when ...

OSDC 2010