The method get_catalog_fields defines the fields to index.
One thing we must choose when defining a field is its type. There are four built-in types to choose from (it is also possible to define custom field types, as we will see later):
TextField: this field type allows for full-text indexing. This means that the given text is split into words, so it is possible to search for individual words or for phrases.
KeywordField: this type of field indexes the value as it is, without any particular processing. This is useful, for example, for searching fields whose values belong to a defined set of possible values (like an enumeration).
The third type is for boolean values (True or False).
The fourth type is for integers.
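To make the difference between full-text and keyword indexing concrete, here is a minimal sketch in plain Python. It does not use the catalog's internals: the helper names text_terms and keyword_terms are hypothetical, and the lower-casing and whitespace splitting are simplifying assumptions about what full-text splitting might do.

```python
def text_terms(value):
    """Full-text style: split the value into lower-cased words with positions."""
    return [(word.lower(), position)
            for position, word in enumerate(value.split())]


def keyword_terms(value):
    """Keyword style: keep the whole value as one unprocessed term."""
    return [(value, 0)]


# Each word of the text becomes a separate searchable term.
print(text_terms('Python is a programming language'))
# The URL stays in one piece, so only an exact match will find it.
print(keyword_terms('http://www.python.org'))
```

With the first helper a search for "programming" can match; with the second, only the complete value can.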
Let’s look again at the field definitions in our example:
def get_catalog_fields(self):
    return [KeywordField('url', is_stored=True),
            TextField('body')]
Besides the field type, we must define the name of the field; in this example, url and body. As you might guess, the field name is how we refer to the field when indexing and searching.
Finally, a field may be indexed and/or stored [1].
If we choose to define a field as indexed (the default), we will be able to search for it later.
If we choose to define a field as stored, we will be able to retrieve its value from the catalog, without the need to load the original document; think of it as a cache. By default a field is not stored.
For example, when indexing office documents, we will want to be able to search their content, but we should not store it, because that would take too many resources. However, we may want to store some metadata, like the author and the title, so we can show this information to the user without loading the original document, hence speeding up the interface.
So the decision to index and/or store a field depends on how it will be used (there is no point in indexing a field we are never going to search) and on performance considerations.
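The trade-off between indexing and storing can be sketched with a toy in-memory catalog. This is only an illustration under stated assumptions: ToyCatalog and its methods are hypothetical names, not the itools API, and the "index" is a plain dictionary from words to document ids.

```python
from collections import defaultdict


class ToyCatalog:
    """Hypothetical in-memory stand-in for a catalog (not the itools API)."""

    def __init__(self, indexed, stored):
        self.indexed = set(indexed)    # field names we can search on
        self.stored = set(stored)      # field names whose values are kept
        self.index = defaultdict(set)  # (field, word) -> document ids
        self.values = {}               # document id -> stored values

    def index_document(self, doc_id, fields):
        for name, value in fields.items():
            if name in self.indexed:
                for word in value.lower().split():
                    self.index[(name, word)].add(doc_id)
        # Only the stored fields are kept; everything else is discarded.
        self.values[doc_id] = dict((name, value)
                                   for name, value in fields.items()
                                   if name in self.stored)

    def search(self, field, word):
        return sorted(self.index.get((field, word.lower()), set()))


# The body is searchable but not kept; title and author are kept only.
catalog = ToyCatalog(indexed=['body'], stored=['title', 'author'])
catalog.index_document('report.odt', {
    'title': 'Annual report',
    'author': 'Jane Doe',
    'body': 'Revenue grew steadily during the year',
})

hits = catalog.search('body', 'revenue')
print(hits)
print(catalog.values[hits[0]])
```

The search finds the document through its body, while the stored values give us the title and the author without opening the file. Searching on title finds nothing here, because that field was stored but not indexed.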
The first field in the definition (url in our example) is special: it defines the external id, that is, the value that uniquely identifies the original document and that can be used to load it.
This first field must be both indexed and stored, and should probably be of the type KeywordField.
Internally, the catalog only uses the external identifier when unindexing documents. The method unindex_document expects an external id value as its parameter, for example:
# Un-index
>>> catalog.unindex_document('http://www.python.org')
# Test
>>> results = catalog.search(body='python')
>>> for document in results.get_documents():
...     print document.url
...
>>>
To define your own field type, you must create a new class that inherits from BaseField. BaseField provides three static methods:
split: cuts the data up into words. This function must return the words in the form (word, position).
encode: translates the data into a storage form (a string).
decode: retrieves the data from the encoded form.
For example:
>>> from itools.xapian import make_catalog, CatalogAware
>>> from itools.xapian import BaseField, KeywordField
>>> from itools.xapian import register_field
>>>
>>> class FloatField(BaseField):
...     type = 'float'
...
...     @staticmethod
...     def split(value):
...         yield unicode(value), 0
...
...     @staticmethod
...     def decode(string):
...         return float(string)
...
...     @staticmethod
...     def encode(value):
...         return unicode(value)
>>>
>>> register_field(FloatField)
>>>
>>> class Document(CatalogAware):
...     def __init__(self, name, value):
...         self.name = name
...         self.value = value
...
...     def get_catalog_fields(self):
...         return [KeywordField('name', is_stored=True),
...                 FloatField('value', is_stored=True)]
...
...     def get_catalog_values(self):
...         return {'name': self.name, 'value': self.value}
>>>
>>> catalog = make_catalog('catalog_test')
>>> doc1 = Document('pi', 3.1415)
>>> doc2 = Document('e', 2.718)
>>> catalog.index_document(doc1)
>>> catalog.index_document(doc2)
>>>
>>> results = catalog.search()
>>>
>>> for document in results.get_documents(sort_by='value'):
...     print document.name, document.value
...
e 2.718
pi 3.1415
As you can see, you must register your new type with the function register_field.
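The same three-method pattern applies to other value types. As a further illustration, here is a sketch of a date field. It is deliberately written as a plain class so that it runs without itools: the name DateFieldSketch and the choice of ISO 8601 as the storage form are assumptions, and a real field would inherit from BaseField and be registered with register_field as shown above.

```python
from datetime import date, datetime


class DateFieldSketch(object):
    """Hypothetical date field following the split/encode/decode pattern.

    Plain class for illustration only; a real field type would inherit
    from BaseField and be registered with register_field.
    """

    type = 'date'

    @staticmethod
    def split(value):
        # A date is indexed as a single term at position 0.
        yield value.isoformat(), 0

    @staticmethod
    def encode(value):
        # Storage form: an ISO 8601 string such as '2008-03-14'.
        return value.isoformat()

    @staticmethod
    def decode(string):
        # Reverse of encode: parse the stored string back into a date.
        return datetime.strptime(string, '%Y-%m-%d').date()


day = date(2008, 3, 14)
encoded = DateFieldSketch.encode(day)
print(encoded)
print(DateFieldSketch.decode(encoded) == day)
```

The property worth checking in any custom field is that decode reverses encode; otherwise the values read back from the catalog will differ from the ones that were indexed.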
Footnotes