Modelling Your Data in Xapian
Olly Betts
Modern Expectations
Search is increasingly rich.
Overview
- Xapian provides efficient search of large quantities of text
- Lots of filtering and sorting options too
- Somewhat akin to "NoSQL" databases
Like a Book's Index
But usually a book index isn't exhaustive, but a Xapian index is.
Not like a SQL Index
A SQL index provides efficient access to rows ordered by a key.
What's a Document?
Often obvious, but a key thing to decide!
The "document" is really the "thing you want to search for".
Examples of "Documents"
- web page
- song
- person
- newspaper article
- book (e.g. library catalogue)
- subsection of a book (e.g. full text search of Shakespeare)
- search at a smaller granularity and collapse matches
Harder Cases
Long documents without explicit structure are problematic:
- IRC log (conversation bursts? overlapping windows?)
- scanned book (page?)
- arbitrary chunking probably better than nothing
Examples of Documents
- Web pages
- Web bookmarks
- Songs
- Politicians
- Products
- VCS commits
Document Granularity
Sometimes the granularity needs thought:
- Newspaper articles (not newspapers)
- Books:
- Library catalogue search: each book is a document
- Full text search: page or section
Collapsing can help here
Xapian::Document Anatomy
- A unique positive integer identifier
- A blob of opaque "document data"
- Terms
- Document values
Document ID
- Auto-increments by default
- Can be set explicitly (e.g. to match external ID)
- Batch adding much more efficient if in ascending order
- Non-integer external IDs: index as a term, store in data
Document Data
- Opaque to Xapian - just a variable size cargo of bytes
- Xapian will try to gzip it
- Stored assuming used for showing "a result"
- Some possible uses:
- Don't! (just use document ID)
- Store ID for external system (e.g. URL, filename, ISBN)
- Serialise fields e.g. title, author, date, sample
- Don't look up document data during match!
Terms
Terms are the "index entries", used for both text searching and boolean filtering.
Each term maps to:
- A list of document IDs
- For each document in the list:
- Optional list of positions (phrases, NEAR, ADJ, etc)
- Within Document Frequency (wdf)
Term Uses
- Words in text (perhaps per-field)
- Meta-data for filtering (e.g. "Content-Type is text/html")
- External unique ID
Terms from text
- term may be prefixed for fielded search (e.g. Sxapian)
- wdf is how many times this term occurs in this document
- positions are word offsets into the document
Boolean Filter Terms
- no inherent internal difference
- wdf is usually 0
- usually no positions
- terms prefixed to separate filter types (e.g. Mtext/html, Hxapian.org)
Document Values
- numbered slots optionally containing a string for each document
- fast access to many entries in the same slot
Uses of Document Values
- ordering results (e.g. sort by date, price, distance from user)
- filtering (e.g. date range restrictions, within X km)
- invoice modified:6/3/1999..10/12/1999
- query-independent relevance (e.g. hyperlink analysis, age bias)
- grouping and collapsing matches (e.g. no more than 3 matches per website)
- don't abuse for storing fields just for displaying results
Also in the Database
Not directly related to documents:
- each database has a UUID
- spelling data - dynamic and/or dictionary
- edit distance with n-gram acceleration
- synonyms
- user metadata - arbitrary key->value store
- changes committed atomically with other database change
Stemming
Xapian supports stemming
- built-in algorithms for many languages
- supply your own
- normalise forms of the "same word" - e.g. football and footballer -> footbal
- improves "recall" at the expense of "precision"
- Xapian default: unstemmed terms with positions + stems without positions
Filtering - term or value?
Both terms and values can be used for filtering, so which to use?
- a value needs checking for every candidate
- the matcher knows to check values last
- terms faster for an exact filter (e.g. Mtext/html)
- values better for arbitrary ranges (e.g. 11/11/1918..12/10/2010)
- if exact and range searches are common, "both" may be the right answer
Query Independent Weighting
(e.g. from hyperlink analysis)
Encode the weight for each document in a value slot, and use a PostingSource
subclass to either supply the whole weight, or contribute to the weight.
Meeting Expectations
Geospatial Filtering and Weighting
...
The End
Questions welcome