Modelling Your Data in Xapian

Olly Betts

Modern Expectations

Search is increasingly rich.

Overview

Like a Book's Index

A book's index usually isn't exhaustive, but a Xapian index is.
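As a rough illustration of what "exhaustive" means here, the toy sketch below records an entry for every word of every document, where a book's index would list only hand-picked entries. It is plain Python, not the Xapian API; the `build_index` name and the sample texts are invented for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map every word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical sample documents.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
}
index = build_index(docs)
print(index["the"])  # even a very common word gets an entry
print(index["fox"])
```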

Not like a SQL Index

A SQL index provides efficient access to rows ordered by a key.

What's a Document?

Often obvious, but a key thing to decide!

The "document" is really the "thing you want to search for".

Examples of "Documents"

Harder Cases

Long documents without explicit structure are problematic:

Examples of Documents

Document Granularity

Sometimes the granularity needs thought:

Collapsing can help here
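A toy sketch of the idea behind collapsing, in plain Python rather than the Xapian API: results arrive ranked best first, and only the first hit for each collapse key is kept. In Xapian the key is read from a value slot selected with `Enquire::set_collapse_key()`. The `collapse` function and the page/site data are hypothetical, chosen as one example of a granularity mismatch (pages indexed, but one result per site wanted).

```python
def collapse(ranked_hits, key_of):
    """Keep only the highest-ranked hit for each collapse key."""
    seen = set()
    kept = []
    for hit in ranked_hits:
        key = key_of(hit)
        if key not in seen:
            seen.add(key)
            kept.append(hit)
    return kept

# (page, site) pairs, best match first (made-up data).
hits = [("a/page1", "a"), ("a/page2", "a"), ("b/page1", "b"), ("a/page3", "a")]
top = collapse(hits, key_of=lambda hit: hit[1])
print(top)
```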

Xapian::Document Anatomy

Document ID

Document Data

Terms

Terms are the "index entries", used for both text searching and boolean filtering.
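A toy sketch of how one kind of index entry can serve both purposes. It is plain Python, not the Xapian API; the `XTYPE` prefix and the posting lists are invented, although prefixing filter terms to keep them apart from text terms follows the usual Xapian convention. In Xapian itself the two kinds of term would typically be added with `Document::add_term()` and `Document::add_boolean_term()`.

```python
# Hypothetical posting lists: term -> ids of documents indexed by it.
postings = {
    # terms generated from text
    "xapian": {1, 2, 3},
    "index": {2, 3},
    # boolean filter terms, kept apart by a prefix
    "XTYPEslides": {1, 3},
    "XTYPEmanual": {2},
}

def search(text_terms, filter_term):
    """Documents containing all text terms, restricted by one filter term."""
    matches = set(postings[filter_term])
    for term in text_terms:
        matches &= postings[term]
    return sorted(matches)

print(search(["xapian", "index"], "XTYPEslides"))
```

This toy only intersects sets; a real engine would also weight the text terms, while the filter term restricts the result set without affecting the ranking.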

Each term maps to:

Term Uses

Terms from text

Boolean Filter Terms

Document Values

Uses of Document Values

Also in the Database

Not directly related to documents:

Stemming

Xapian supports stemming
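To show what stemming buys you, here is a deliberately naive suffix stripper in plain Python. It is not the algorithm Xapian uses (Xapian exposes proper stemmers through `Xapian::Stem`, which are far more careful); `naive_stem` is a hypothetical helper that only demonstrates how different word forms can end up as the same term.

```python
def naive_stem(word):
    """Strip a few common English suffixes (illustration only)."""
    for suffix in ("ing", "ed", "s"):
        # Require a few characters to remain so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["connect", "connects", "connected", "connecting"]
stems = {naive_stem(w) for w in words}
print(stems)
```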

Filtering - term or value?

Both terms and values can be used for filtering, so which to use?
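One common rule of thumb, stated here as an assumption rather than as this talk's answer: terms suit exact-match filters, because matching documents come straight from the index, while values suit range filters, where each candidate's stored value has to be compared. The plain-Python sketch below contrasts the two access patterns; the `XCOLOUR` prefix, the `price` field and both function names are invented.

```python
# Hypothetical documents, each with filter terms and one stored value.
docs = {
    1: {"terms": {"XCOLOURred"}, "price": 5.0},
    2: {"terms": {"XCOLOURblue"}, "price": 12.5},
    3: {"terms": {"XCOLOURred"}, "price": 20.0},
}

# Term filter: look the term up in an inverted index built in advance.
postings = {}
for doc_id, doc in docs.items():
    for term in doc["terms"]:
        postings.setdefault(term, set()).add(doc_id)

def filter_by_term(term):
    """Exact match: a single lookup in the index."""
    return sorted(postings.get(term, set()))

def filter_by_range(low, high):
    """Range match: compare each document's stored value with the bounds."""
    return sorted(d for d, doc in docs.items() if low <= doc["price"] <= high)

print(filter_by_term("XCOLOURred"))
print(filter_by_range(10.0, 25.0))
```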

Query Independent Weighting

(e.g. from hyperlink analysis)

Encode the weight for each document in a value slot, and use a PostingSource subclass to either supply the whole weight, or contribute to the weight.
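The two options can be sketched in plain Python as follows. This is a toy model, not the PostingSource API: the `static_weight` table stands in for weights stored in a value slot, and the scores are made up. In Xapian, `ValueWeightPostingSource` is a ready-made PostingSource subclass that reads a weight from a value slot.

```python
# Hypothetical per-document static weights, e.g. derived from link analysis.
static_weight = {1: 0.2, 2: 1.5, 3: 0.7}

def rank(text_scores, contribute=True):
    """Order documents by static weight, optionally added to the text score."""
    def total(doc_id):
        base = text_scores[doc_id] if contribute else 0.0
        return base + static_weight.get(doc_id, 0.0)
    return sorted(text_scores, key=total, reverse=True)

text_scores = {1: 2.0, 2: 1.0, 3: 1.2}  # made-up query-dependent scores
combined = rank(text_scores)                     # static weight contributes
static_only = rank(text_scores, contribute=False)  # static weight is the whole weight
print(combined, static_only)
```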

Meeting Expectations

Geospatial Filtering and Weighting

...

The End

Questions welcome

 
