The search method provided by catalog objects is the entry point to the search programming interface. Here is its prototype and definition:
search(query=None, **kw)
Perform a search on the catalog with the given query. Returns an instance of the SearchResults class, which provides an API to retrieve the documents found (see below).
There are two ways to define the query: either we build it explicitly and pass it to the search method, or we use the named arguments that this method accepts.
Here is an example that shows both ways of performing the same query. Imagine we have a catalog of books indexed by author and title, and we want to find all the books written by somebody called Marx whose title contains the word capital.
We can either explicitly build the query:
>>> from itools.xapian import PhraseQuery, AndQuery
>>>
>>> q1 = PhraseQuery('author', 'marx')
>>> q2 = PhraseQuery('title', 'capital')
>>> query = AndQuery(q1, q2)
>>> results = catalog.search(query)
Or use the named arguments:
>>> results = catalog.search(author='marx', title='capital')
The second method is more compact, but less powerful. A query made implicitly from named arguments will always be an “and” query of one or more “phrase” queries.
If we want to make an “or” or “range” query, we need to build it explicitly.
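To make the equivalence concrete, here is a minimal sketch of how named arguments could be turned into an explicit query. The PhraseQuery and AndQuery classes below are simple stand-ins defined only for this illustration, and query_from_kwargs is a hypothetical helper. This is not how itools.xapian implements search, and leaving a single named argument unwrapped is an assumption.

```python
from collections import namedtuple

# Stand-ins for the itools.xapian query classes, defined here only so that
# the sketch runs on its own. They are NOT the real classes.
PhraseQuery = namedtuple('PhraseQuery', ['name', 'value'])
AndQuery = namedtuple('AndQuery', ['args'])


def query_from_kwargs(**kw):
    """Hypothetical helper: build the explicit query for named arguments."""
    # Sort by field name only to make the result deterministic.
    phrases = [PhraseQuery(name, value) for name, value in sorted(kw.items())]
    # Assumption: a single named argument needs no "and" wrapper.
    if len(phrases) == 1:
        return phrases[0]
    return AndQuery(tuple(phrases))


implicit = query_from_kwargs(author='marx', title='capital')
explicit = AndQuery((PhraseQuery('author', 'marx'),
                     PhraseQuery('title', 'capital')))
print(implicit == explicit)  # True
```

Since the helper only ever produces “and” and “phrase” nodes, it also shows why the named-argument form cannot express an “or” or a “range” query.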
The two simplest queries are EqQuery and PhraseQuery:
EqQuery(name, value)
Match all documents where the value of the field name matches or contains the given value, which must be a single word.
PhraseQuery(name, value)
Similar to EqQuery, except that value is not a single word but a phrase (that is to say, a sequence of words).
Typically we will use phrase queries when searching a text field, because in this context the phrase query is a generalisation of the equal query:
# These two are the same
>>> EqQuery('author', 'marx')
>>> PhraseQuery('author', 'marx')
# This is nonsense, because 'karl marx' is not one word but two
>>> EqQuery('author', 'karl marx')
The equal query (EqQuery) will typically be used for any other kind of field (keyword, boolean or integer), because a phrase query makes no sense in that context.
To perform an EqQuery or a PhraseQuery on a field, the field must have been declared as indexed.
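The difference between the two queries can be illustrated with a toy matcher working on a single text value. The functions tokenize, eq_match and phrase_match are hypothetical names introduced for this sketch, and the whitespace tokenizer is a deliberate simplification; the real analysis is done by the catalog and is certainly more elaborate.

```python
def tokenize(text):
    """Split a field value into lowercase words (a naive analyser)."""
    return text.lower().split()


def eq_match(field_value, word):
    """True if the given single word is one of the words of the field."""
    return word.lower() in tokenize(field_value)


def phrase_match(field_value, phrase):
    """True if the words of the phrase occur consecutively in the field."""
    words = tokenize(field_value)
    wanted = tokenize(phrase)
    n = len(wanted)
    return any(words[i:i + n] == wanted
               for i in range(len(words) - n + 1))


author = 'Karl Marx'
print(eq_match(author, 'marx'))           # True
print(phrase_match(author, 'marx'))       # True, same as the equal query
print(phrase_match(author, 'karl marx'))  # True
print(eq_match(author, 'karl marx'))      # False, 'karl marx' is not a word
```

With a one-word value both matchers agree, which is the sense in which the phrase query generalises the equal query.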
The simple queries seen above are for exact matches. If we want to match all values within a range we use the RangeQuery:
RangeQuery(name, left, right)
Match all documents whose field name has a value within the given range: greater than or equal to left, and less than or equal to right.
If left is None, all values smaller than right will be matched. If right is None, all values greater than left will be matched.
At least one of the limits must be given: left and right cannot both be None.
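The range rules above can be summarised with a small predicate. This is only a sketch of the semantics: in_range is a hypothetical function, raising ValueError when both limits are missing is an assumption about error handling, and treating the open-ended cases as inclusive is also an assumption.

```python
def in_range(value, left=None, right=None):
    """Toy model of the range test applied to one field value."""
    if left is None and right is None:
        # Assumption: the source only says this case is not allowed.
        raise ValueError('at least one of left and right must be given')
    if left is not None and value < left:
        return False
    if right is not None and value > right:
        return False
    return True


print(in_range(5, 1, 10))          # True, inside the range
print(in_range(10, 1, 10))         # True, the limits are included
print(in_range(11, 1, 10))         # False
print(in_range(-100, right=10))    # True, no lower limit
print(in_range(100, left=1))       # True, no upper limit
```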
Let’s see an example with dates. If we index documents by their last modification time (mtime), we could search for all documents that have been modified during the last week:
>>> from datetime import date, timedelta
>>> from itools.xapian import RangeQuery
>>>
>>> today = date.today()
>>> last_week = today - timedelta(7)
>>>
>>> last_week = last_week.strftime('%Y-%m-%d')
>>> query = RangeQuery('mtime', last_week, None)
Note that since we don’t have a field type for dates, we have to transform the date values to strings (the field type used would be KeywordField).
To perform a RangeQuery on a field, the field must have been declared as stored.
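The string conversion used in the example works for range comparisons because zero-padded year-month-day strings sort in the same order as the dates they represent. The following check uses only the standard library; it assumes that values of a keyword field are compared as plain strings, which the source does not state explicitly.

```python
from datetime import date, timedelta

# Ten consecutive days that cross a year boundary
start = date(2009, 12, 28)
days = [start + timedelta(n) for n in range(10)]

# The format used in the example above
keys = [d.strftime('%Y-%m-%d') for d in days]
print(keys == sorted(keys))            # True, string order is date order

# A day-first format would not work for range queries
day_first = [d.strftime('%d-%m-%Y') for d in days]
print(day_first == sorted(day_first))  # False
```

Any other format we choose for dates should keep this property, otherwise a range over the strings will not match the intended range of dates.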
We support three boolean queries:
AndQuery(*args)
Match the documents that satisfy all the given queries. Each positional argument must be a query, and there should be two or more of them.
OrQuery(*args)
Match the documents that satisfy any of the given queries. Each positional argument must be a query, and there should be two or more of them.
NotQuery(query)
Match all documents that are not matched by query.
Boolean queries can be combined to build very complex queries.
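One way to reason about combined boolean queries is to think of each query as the set of documents it matches. The sketch below models only these semantics with Python sets; the document ids, the set names and the helpers and_, or_ and not_ are all made up for the illustration, and this is not a description of how the catalog evaluates queries.

```python
# Toy model: every query is represented by the set of document ids it matches
all_docs = {1, 2, 3, 4, 5}
by_marx = {1, 2, 3}          # e.g. documents whose author is marx
about_capital = {2, 3, 4}    # e.g. documents whose title contains capital
drafts = {3}                 # e.g. documents flagged as drafts


def and_(*matches):
    """Documents matched by all the given queries."""
    result = set(matches[0])
    for other in matches[1:]:
        result &= other
    return result


def or_(*matches):
    """Documents matched by any of the given queries."""
    return set().union(*matches)


def not_(matched, universe=all_docs):
    """Documents of the catalog not matched by the given query."""
    return universe - matched


# Nested combination: by marx, about capital, and not a draft
print(and_(by_marx, about_capital, not_(drafts)))   # {2}
print(sorted(or_(by_marx, about_capital)))          # [1, 2, 3, 4]
```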
Now that we have built a query and performed a search, how do we retrieve the documents found? Remember that the value returned by the search method is an instance of the SearchResults class. This object offers two methods:
get_n_documents()
Return the number of documents found.
get_documents(sort_by=None, reverse=False, start=0, size=0)
Return the documents found. By default the documents are sorted by weight (how relevant they are to the performed query).
The documents may also be ordered by one of the stored fields. To do so, pass the argument sort_by with the name of the field to use as the sort criterion.
By default the results are ordered from greater to lesser (weight or field value). If the argument reverse is true, they will be ordered the other way round, from lesser to greater.
It is also possible to return only a batch of the total results. To do so, pass the arguments start and size, which indicate, respectively, the first document to return and the maximum number of documents to return.
Note that to sort by a field, it must be stored (see Section 14.4).
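The ordering and batching rules can be sketched on a plain list of dictionaries. The function get_documents_sketch and the sample data are hypothetical, and reading the default size=0 as “no limit” is an assumption; the real method works on the catalog, not on Python lists.

```python
def get_documents_sketch(docs, sort_by=None, reverse=False, start=0, size=0):
    """Toy model of the ordering and batching rules described above."""
    # Without sort_by the documents are ordered by weight
    key = 'weight' if sort_by is None else sort_by
    # Default order is from greater to lesser; reverse=True flips it
    ordered = sorted(docs, key=lambda doc: doc[key], reverse=not reverse)
    # Assumption: size=0 means "no limit"
    if size > 0:
        return ordered[start:start + size]
    return ordered[start:]


docs = [
    {'title': 'A', 'weight': 0.9, 'mtime': '2009-01-01'},
    {'title': 'B', 'weight': 0.5, 'mtime': '2009-03-01'},
    {'title': 'C', 'weight': 0.7, 'mtime': '2009-02-01'},
]


def titles(found):
    return [doc['title'] for doc in found]


print(titles(get_documents_sketch(docs)))                    # ['A', 'C', 'B']
print(titles(get_documents_sketch(docs, sort_by='mtime')))   # ['B', 'C', 'A']
print(titles(get_documents_sketch(docs, sort_by='mtime',
                                  reverse=True)))            # ['A', 'C', 'B']
print(titles(get_documents_sketch(docs, start=1, size=1)))   # ['C']
```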
Now let’s look again at the initial example:
>>> results = catalog.search(body='python')
>>> for document in results.get_documents():
...     print(document.url)
...
http://www.python.org
>>>
Note that the documents returned are not the original objects, but instances of the Document class defined by itools.xapian. These documents give access to the stored fields, so we can show some information to the users without having to load the original document.
And if we want to load the original document we use the external id (see Section 14.4.2):
>>> results = catalog.search(body='python')
>>> for document in results.get_documents():
...     handler = get_handler(document.url)
...     # Do something