4.1 File Handlers

The itools package includes handlers for many file formats, like XML, HTML or CSV. Each one is able to parse and load the content of the file into an appropriate data structure; for example a CSV handler will store the data as a table: a list of rows, where each row is a list of values. Each handler class provides a distinct API to inspect and manipulate this data structure.
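To make the idea of "a table as a list of rows" concrete, here is a minimal sketch that uses only the Python standard library (Python 3 syntax). It does not use the itools CSV handler, whose API is not shown here; the function name parse_csv is hypothetical.

```python
import csv
import io


def parse_csv(text):
    """Parse CSV text into a table: a list of rows, each a list of values."""
    return [row for row in csv.reader(io.StringIO(text))]


sample = "name,language\nitools,Python\n"
table = parse_csv(sample)

# Each row is a plain list, so values are reached by row and column index.
print(table[1][0])
```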

When no available handler class understands the file format at hand, we can always use the basic File class, which offers access to the file’s content as a raw byte string:

    >>> from itools.handlers import File
    >>>
    >>> file = File('itools.pdf')
    >>> print file.uri
    file:///home/jdavid/sandboxes/itools-docs/itools.pdf

The instance variable uri tells us which file in the file system (or somewhere else) this file handler is attached to. To inspect its data we can type:

    >>> print type(file.data)
    <type 'str'>
    >>> print len(file.data)
    994739

The API to access and change the data of a basic file handler is quite simple:

The File class is the base class for all file handlers. Figure 4.1 shows a subset of the handler classes included in itools.

Figure 4.1: Some file handler classes included in itools.

4.1.1 Text Files

When the file we want to work with is a text file, we can use the TextFile handler class. This one represents the file’s content as a text string:

    >>> from itools.handlers import TextFile
    >>>
    >>> file = TextFile('itools.tex')
    >>> print type(file.data)
    <type 'unicode'>
    >>> print file.data[:40]
    \documentclass{book}

    \usepackage{color}

The public API is very similar to the base File handler’s API:

Here the method set_data expects a text string instead of a byte string, and the method to_str accepts an optional parameter that defines the encoding used to serialize the handler’s content.
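The difference between holding text and serializing it to bytes can be sketched without itools. The class below is a hypothetical stand-in (Python 3 syntax), not the TextFile implementation; the default encoding of UTF-8 and the type check in set_data are assumptions made for this example.

```python
class TextSketch:
    """Hypothetical stand-in for a text handler; not the itools class."""

    def __init__(self, data=""):
        self.data = data

    def set_data(self, data):
        # Assumption of this sketch: byte strings are rejected early.
        if not isinstance(data, str):
            raise TypeError("expected a text string")
        self.data = data

    def to_str(self, encoding="utf-8"):
        # Serialize the text to a byte string in the requested encoding.
        return self.data.encode(encoding)


sketch = TextSketch()
sketch.set_data("Ibáñez")
utf8_bytes = sketch.to_str()             # two bytes for each accented letter
latin1_bytes = sketch.to_str("latin-1")  # one byte per character
```

The same text yields byte strings of different lengths depending on the encoding, which is why the serialization method needs to know which one to use.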

4.1.2 Configuration Files

While not a standard file format, the format supported by the ConfigFile class can be used, for example, to manage some configuration files found in Unix systems.

It is also useful to study this handler class as an example of a file handler with some structure. This is an excerpt of the setup.conf file from the itools package:

    # The name of the package
    name = itools

    # The author details
    author_name = "J. David Ibáñez"
    author_email = jdavid@itaapy.com

    # The license
    license = "GNU General Public License (GPL)"

We have comments and variables.
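To see what "comments and variables" amounts to as a data structure, the following sketch parses the excerpt into a dictionary. It is a simplified illustration in standard-library Python 3, not the ConfigFile parser: the function name parse_config is hypothetical, and the real format may support features (such as multi-line values or escapes) that this sketch ignores.

```python
def parse_config(text):
    """Return a dict of the variables found in 'name = value' lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a variable assignment; ignored in this sketch
        value = value.strip()
        # Drop one pair of surrounding double quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        values[name.strip()] = value
    return values


excerpt = '''# The name of the package
name = itools

# The author details
author_name = "J. David Ibáñez"
author_email = jdavid@itaapy.com
'''
config = parse_config(excerpt)
print(config["author_name"])
```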

    >>> from itools.handlers import ConfigFile
    >>> 
    >>> config = ConfigFile('setup.conf')
    >>> print config.get_value('author_name')
    J. David Ibáñez

The code above shows how to get the value of a variable. What follows is an excerpt of the public API specific to the ConfigFile class:

4.1.3 Loading

File handlers support lazy loading, which means that the handler is only loaded when we try to retrieve its data:

    >>> from itools.handlers import TextFile
    >>> 
    >>> file = TextFile('itools.tex')
    >>> print file.__dict__.keys()
    ['uri']
    >>>
    >>> print len(file.data)
    994739
    >>> print file.__dict__.keys()
    ['dirty', 'timestamp', 'data', 'uri', 'encoding']

Loading the handler adds several instance variables; two of them, timestamp and dirty, deserve attention.

These variables are read-only: do not change them by hand! The dirty variable will be studied in Section 4.1.4.
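Lazy loading itself can be sketched in a few lines. The class below is a hypothetical illustration in standard-library Python 3; it is not how itools implements it, and the names LazyFile and path are assumptions. It relies on the fact that Python calls __getattr__ only when normal attribute lookup fails.

```python
import os
import tempfile


class LazyFile:
    """Hypothetical sketch of lazy loading; not the itools implementation."""

    def __init__(self, path):
        self.path = path  # only the location is stored up front

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so this fires
        # the first time 'data' is read and never again afterwards.
        if name == "data":
            with open(self.path, "rb") as f:
                self.data = f.read()
            return self.data
        raise AttributeError(name)


# Demonstrate with a throwaway file.
fd, path = tempfile.mkstemp()
os.write(fd, b"Hello\n")
os.close(fd)

lazy = LazyFile(path)
before = sorted(lazy.__dict__)  # only 'path' so far
content = lazy.data             # first access triggers the read
after = sorted(lazy.__dict__)   # 'data' is now stored on the instance
os.remove(path)
```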

The timestamp variable lets us know whether the file resource was changed after the file handler was loaded, which means that our file handler is out-of-date:

    # Create a file
    $ echo "Hello" > test.txt
    # Start the Python interpreter
    $ python
    ...
    >>> from itools.handlers import TextFile
    >>>
    >>> test = TextFile('test.txt')
    >>> test.load_state()
    >>> print test.timestamp
    2007-11-19 20:14:57
    >>> print test.is_outdated()
    False

Here we have learned how to explicitly load the state of a file handler with the load_state method, and how to check whether the handler is up-to-date with the is_outdated method.

But what happens if from another console we modify the test file?

    # From another console...
    $ echo "Bye" > test.txt
    # Switch back to the first console
    >>> print test.data
    Hello

    >>> print test.is_outdated()
    True

The handler still contains the old data, and the method is_outdated correctly reports that the file resource has been modified since the last time we loaded the file handler.

To re-load the handler and get things back in order:

    >>> test.load_state()
    >>> print test.to_str()
    Bye

    >>> print test.is_outdated()
    False
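The session above can be imitated with a small sketch that compares modification times. This is a hypothetical illustration in standard-library Python 3, not the itools code: the class name StampedFile is an assumption, and it stores the timestamp as a float returned by os.path.getmtime rather than as the datetime shown in the session.

```python
import os
import tempfile


class StampedFile:
    """Hypothetical sketch; compares modification times, nothing more."""

    def __init__(self, path):
        self.path = path
        self.load_state()

    def load_state(self):
        # Record the modification time first, then read the content.
        self.timestamp = os.path.getmtime(self.path)
        with open(self.path, "rb") as f:
            self.data = f.read()

    def is_outdated(self):
        # Outdated means the file changed after we last loaded it.
        return os.path.getmtime(self.path) > self.timestamp


fd, path = tempfile.mkstemp()
os.write(fd, b"Hello\n")
os.close(fd)

test = StampedFile(path)
fresh = test.is_outdated()   # nothing has changed yet

with open(path, "wb") as f:  # stands in for the other console
    f.write(b"Bye\n")
# Push the modification time forward so the change is visible even on
# filesystems with coarse timestamps.
os.utime(path, (test.timestamp + 10, test.timestamp + 10))

stale = test.is_outdated()   # the file is now newer than our state
old_data = test.data         # the handler still holds the old content
test.load_state()
new_data = test.data
reloaded = test.is_outdated()
os.remove(path)
```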

Programming Interface

This is the full collection of load-related methods:

Note that the last three methods actually modify the handler’s state with content that does not come from the associated file resource. This does not change the timestamp, but sets the dirty variable to the current datetime, meaning that the handler’s state has changed and is newer than the associated file resource.

This brings us to the next section: saving changes.

4.1.4 Saving

We continue with our test file from above. Now we are going to change the handler’s state:

    >>> print test.dirty
    None
    >>> test.set_data(u'The king is naked.\n')
    >>> print test.dirty
    2008-03-27 14:25:54.080461
    >>> print test.to_str()
    The king is naked.

    # From another console...
    $ cat test.txt
    Bye

To know whether the handler has been modified to become newer than the associated file resource we just check the dirty variable. To save the changes made to the associated file resource we use save_state:

    >>> test.save_state()
    >>> print test.dirty
    None
    # From another console...
    $ cat test.txt
    The king is naked.
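The role of the dirty variable can be sketched as follows. This is a hypothetical illustration in standard-library Python 3, not the itools implementation: the class name DirtyFile is an assumption, and it works on byte strings to stay short.

```python
import os
import tempfile
from datetime import datetime


class DirtyFile:
    """Hypothetical sketch of change tracking; not the itools class."""

    def __init__(self, path):
        self.path = path
        with open(path, "rb") as f:
            self.data = f.read()
        self.dirty = None            # None: in sync with the file

    def set_data(self, data):
        self.data = data
        self.dirty = datetime.now()  # remember when the state changed

    def save_state(self):
        with open(self.path, "wb") as f:
            f.write(self.data)
        self.dirty = None            # state and file agree again


fd, path = tempfile.mkstemp()
os.write(fd, b"Bye\n")
os.close(fd)

test = DirtyFile(path)
test.set_data(b"The king is naked.\n")
changed = test.dirty is not None     # there are unsaved changes
with open(path, "rb") as f:
    on_disk_before = f.read()        # the file is untouched so far

test.save_state()
with open(path, "rb") as f:
    on_disk_after = f.read()
saved = test.dirty is None
os.remove(path)
```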

Programming Interface

This is the programming interface for save operations:

Note that the last two methods do not set the dirty variable to None, since the handler’s state has not been saved to its associated file resource, but to some other file.

4.1.5 The Registry

So far we have explicitly chosen which handler class to use for a given file. It is also possible to let itools.handlers choose the best available handler class for us, with the get_handler function:

    >>> from itools.handlers import get_handler
    >>>
    >>> get_handler('itools.pdf')
    <itools.handlers.file.File object at 0x2b65c5f01910>

Here the get_handler function did not find a specific handler class for the PDF document, so it chose the basic File class. But we can do better:

    >>> import itools.pdf
    >>>
    >>> get_handler('itools.pdf')
    <itools.pdf.pdf.PDFFile object at 0xf5d450>

The itools.handlers package provides the basic infrastructure and a few handler classes. For more specific handler classes the right package must be imported, like itools.pdf, itools.xml or itools.odf.

How it works

To find the best available handler class for a file, itools uses the file’s mimetype (see footnote 2) and keeps a registry that maps each mimetype to a handler class.

The programming interface of the registry is:

To illustrate the register interface, this is what a handler class looks like:

    from itools.handlers import File
    from itools.handlers import register_handler_class

    class PDFFile(File):
        class_mimetypes = ['application/pdf']

    register_handler_class(PDFFile)
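The registry idea can be sketched as a plain dictionary from mimetype to class. The code below is a self-contained illustration in standard-library Python 3 and does not import itools: the classes and register_handler_class are local stand-ins, get_handler_class is a hypothetical helper that returns a class rather than a handler instance, and the standard mimetypes module replaces the vfs.get_mimetype function mentioned in footnote 2. How the real registry resolves fallbacks is not shown in this chapter, so the lookup here is an assumption.

```python
import mimetypes

registry = {}  # maps a mimetype string to a handler class


class File:
    """Stand-in base class, used when no specific class is registered."""
    class_mimetypes = ['application/octet-stream']


def register_handler_class(handler_class):
    # Register the class under every mimetype it declares.
    for mimetype in handler_class.class_mimetypes:
        registry[mimetype] = handler_class


def get_handler_class(filename):
    # Guess the mimetype from the file name, then look it up.
    mimetype, _ = mimetypes.guess_type(filename)
    return registry.get(mimetype, File)


class PDFFile(File):
    class_mimetypes = ['application/pdf']


before = get_handler_class('itools.pdf')  # nothing registered yet
register_handler_class(PDFFile)
after = get_handler_class('itools.pdf')   # the specific class is found
```

This mirrors the session above: until the more specific class is registered (in itools, by importing the package that defines it), the lookup falls back to the basic File class.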

4.1.6 New Handlers

So far we have seen how to load a file handler for a file resource that already exists, in the local filesystem or somewhere else. But sometimes we want to create new files, or just to work with temporary files that will never be stored anywhere:

    >>> from itools.html import HTMLFile
    >>>
    >>> file = HTMLFile()
    >>> print file.uri
    None

Note that we created the handler by calling the handler class without passing any arguments. This creates a new handler that is not associated with any resource, so the value of its uri variable is None. The general prototype for a handler class is:

For instance, we are going to build an HTML handler with some title:

    >>> file = HTMLFile(title=u'Hello World')
    >>> print file.to_str()
    <html>
      <head>
        <meta http-equiv="Content-Type" content="text/html; ...
        <title>Hello World</title>
      </head>
      <body></body>
    </html>

State initialization

When writing a new handler class, the method new must be implemented: it initializes the handler’s state for handlers not associated with a file resource. For example, the handler class for a PDF file may look like:

    from itools.handlers import File

    class PDFFile(File):
        class_mimetypes = ['application/pdf']

        def new(self):
            self.data = '%PDF-1.4\n'

Note that the only intent of the example above is to show the prototype of the new method; don’t expect it to work properly (I don’t really know the PDF file format).
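How a constructor might hand its keyword arguments to new can be sketched as follows. This is a hypothetical illustration in standard-library Python 3, not the itools base class: the names Handler and Greeting, the ref parameter, and the way keyword arguments are forwarded are all assumptions made for this example.

```python
class Handler:
    """Hypothetical base class; not the itools implementation."""

    def __init__(self, ref=None, **kw):
        self.uri = ref
        if ref is None:
            # No file resource: build the initial state from keyword args.
            self.new(**kw)

    def new(self, **kw):
        raise NotImplementedError


class Greeting(Handler):
    def new(self, title=""):
        # Initialize the state of a handler that has no file behind it.
        self.data = "<title>%s</title>" % title

    def to_str(self):
        return self.data


page = Greeting(title="Hello World")
print(page.uri)
print(page.to_str())
```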

Footnotes

  1. All file handlers must implement the to_str method, which serializes the handler’s content to a byte string. It is required for the correct working of the load/save API explained later.
  2. To find out the file’s mimetype the vfs.get_mimetype function is used, see Chapter 11.