The itools package includes handlers for many file formats, like XML, HTML or CSV. Each one is able to parse and load the content of the file into an appropriate data structure; for example a CSV handler will store the data as a table: a list of rows, where each row is a list of values. Each handler class provides a distinct API to inspect and manipulate this data structure.
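To make the "list of rows" idea concrete, here is a minimal sketch that builds the same kind of table with the standard csv module. It does not use itools, the sample data is made up for illustration, and it is written for Python 3 (unlike the interpreter sessions in this chapter).

```python
import csv
import io

# Made-up sample content, standing in for a CSV file on disk.
raw = "name,license\nitools,GPL\n"

# Parse into a table: a list of rows, where each row is a list of values.
table = [row for row in csv.reader(io.StringIO(raw))]

for row in table:
    print(row)
```

The handler classes described below wrap this kind of structure behind their own API; the sketch only shows the shape of the data.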
When no handler class is available that understands the file format at hand, we can always use the basic File class, which offers access to the file’s content as a raw byte string:
>>> from itools.handlers import File
>>>
>>> file = File('itools.pdf')
>>> print file.uri
file:///home/jdavid/sandboxes/itools-docs/itools.pdf
The instance variable uri tells us what file in the file system (or somewhere else) this file handler is attached to. To inspect its data we can type:
>>> print type(file.data)
<type 'str'>
>>> print len(file.data)
994739
The API to access and change the data of a basic file handler is quite simple:
to_str()
Returns the content of the handler (a byte string).1
set_data(data)
Changes the content of the handler to the given byte string.
The File class is the base class for all file handlers. Figure 4.1 shows a subset of the handler classes included in itools.
When the file we want to work with is a text file, we can use the TextFile handler class. This one represents the file’s content as a text string:
>>> from itools.handlers import TextFile
>>>
>>> file = TextFile('itools.tex')
>>> print type(file.data)
<type 'unicode'>
>>> print file.data[:40]
\documentclass{book}
\usepackage{color}
The public API is very similar to the base File handler’s API:
to_str(encoding='utf-8')
Returns a byte string with the content of the handler, using the given encoding (by default UTF-8).
set_data(data)
Changes the content of the handler to the given text string.
Here the method set_data expects a text string instead of a byte string, and the method to_str accepts an optional parameter that defines the encoding used to serialize the handler’s content.
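The difference between text and bytes can be sketched without itools. The class below is a hypothetical stand-in (ToyTextHandler is not an itools class) and it is written for Python 3, where str is the text type and bytes is the byte string type.

```python
class ToyTextHandler:
    """Hypothetical stand-in for a text handler; not the itools class."""

    def __init__(self, data=""):
        self.data = data  # text string (str in Python 3)

    def set_data(self, data):
        # Expect text, not bytes, as described for the text handler.
        if not isinstance(data, str):
            raise TypeError("expected a text string")
        self.data = data

    def to_str(self, encoding="utf-8"):
        # Serialize the text to a byte string with the given encoding.
        return self.data.encode(encoding)


handler = ToyTextHandler()
handler.set_data("Ib\u00e1\u00f1ez")  # six characters, two of them non-ASCII
utf8_bytes = handler.to_str()
latin1_bytes = handler.to_str(encoding="latin-1")
print(len(handler.data), len(utf8_bytes), len(latin1_bytes))
```

The same text produces byte strings of different lengths depending on the encoding, which is why to_str takes the encoding as a parameter.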
While not a standard file format, the format supported by the ConfigFile class can be used, for example, to manage some configuration files found in Unix systems.
It is also useful to study this handler class as an example of a file handler with some structure. This is an excerpt of the setup.conf file from the itools package:
# The name of the package
name = itools
# The author details
author_name = "J. David Ibáñez"
author_email = jdavid@itaapy.com
# The license
license = "GNU General Public License (GPL)"
The file contains comments and variables.
>>> from itools.handlers import ConfigFile
>>>
>>> config = ConfigFile('setup.conf')
>>> print config.get_value('author_name')
J. David Ibáñez
The code above shows how to get the value of a variable. The following is an excerpt of the public API specific to the ConfigFile class:
set_value(name, value, comment=None)
Sets the variable with the given name to the given value. If a comment is given, attach it to the variable.
get_value(name, type=None)
Returns the value of the variable with the given name. The value returned will be a byte string, unless the type parameter is passed.
If the type parameter is passed, the value will be deserialized using that type.
has_value(name)
Returns True if there is a variable with the given name, False otherwise.
get_comment(name)
Returns the comment associated with the given variable.
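To give an idea of what a handler with some structure has to do, here is a minimal sketch of a parser for this kind of file. It is an illustration written for Python 3, not the itools implementation: the function parse_config, the sample text and the quoting rule are all assumptions.

```python
import re

# Hypothetical sample, modelled on the excerpt above.
SAMPLE = '''# The name of the package
name = itools

# The license
license = "GNU General Public License (GPL)"
'''

LINE = re.compile(r'^(\w+)\s*=\s*(.*)$')


def parse_config(text):
    """Return {name: (value, comment)} for simple 'name = value' lines."""
    values = {}
    comment = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('#'):
            # Collect comment lines; they attach to the next variable.
            comment.append(line[1:].strip())
            continue
        match = LINE.match(line)
        if match:
            name, value = match.groups()
            # Drop surrounding double quotes, if any.
            if len(value) >= 2 and value[0] == value[-1] == '"':
                value = value[1:-1]
            values[name] = (value, ' '.join(comment) or None)
        # A variable or a blank line ends the pending comment.
        comment = []
    return values


config = parse_config(SAMPLE)
print(config['license'][0])
print(config['name'][1])
```

Keeping the comment next to the value is what makes operations such as get_comment possible once the file has been loaded.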
File handlers support lazy loading, which means that the file is only read when we try to retrieve the handler’s data:
>>> from itools.handlers import TextFile
>>>
>>> file = TextFile('itools.tex')
>>> print file.__dict__.keys()
['uri']
>>>
>>> print len(file.data)
994739
>>> print file.__dict__.keys()
['dirty', 'timestamp', 'data', 'uri', 'encoding']
Among the new instance variables that show up, two deserve a closer look:
timestamp
The modification time of the file as of the last time the handler and the file were synchronised through a load or save operation.
dirty
A datetime value, the last time the state of the handler changed, or None while the handler and the file are synchronised.
These variables are read-only: do not change them by hand! The dirty variable is studied in Section 4.1.4.
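One way to obtain this behaviour in plain Python is to load the state the first time a missing attribute is requested. The sketch below is written for Python 3 and only illustrates the idea: LazyHandler is a hypothetical class, and nothing here is taken from the itools source.

```python
import os
import tempfile


class LazyHandler:
    """Hypothetical handler that reads its file on first access to data."""

    def __init__(self, path):
        self.path = path

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, that is,
        # before the state has been loaded.
        if name in ('data', 'timestamp', 'dirty'):
            self.load_state()
            return self.__dict__[name]
        raise AttributeError(name)

    def load_state(self):
        with open(self.path, 'rb') as f:
            self.data = f.read()
        self.timestamp = os.path.getmtime(self.path)
        self.dirty = None


# Create a small temporary file to attach the handler to.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'Hello\n')

handler = LazyHandler(tmp.name)
before = sorted(vars(handler))   # nothing loaded yet
size = len(handler.data)         # triggers the load
after = sorted(vars(handler))
print(before, size, after)
os.remove(tmp.name)
```

Until data is touched the instance only knows where its file is, which keeps the creation of many handlers cheap.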
The timestamp variable lets us know whether the file resource was changed after the file handler was loaded, which means that our file handler is out of date:
# Create a file
$ echo "Hello" > test.txt
# Start the Python interpreter
$ python
...
>>> from itools.handlers import TextFile
>>>
>>> test = TextFile('test.txt')
>>> test.load_state()
>>> print test.timestamp
2007-11-19 20:14:57
>>> print test.is_outdated()
False
Here we have learned how to explicitly load the state of a file handler with the load_state method, and how to check whether the handler is up to date with the is_outdated method.
But what happens if from another console we modify the test file?
# From another console...
$ echo "Bye" > test.txt
# Switch back to the first console
>>> print test.data
Hello
>>> print test.is_outdated()
True
The handler still contains the old data, and the method is_outdated correctly reports that the file resource has been modified since the last time we loaded the file handler.
To re-load the handler and get things back in order:
>>> test.load_state()
>>> print test.to_str()
Bye
>>> print test.is_outdated()
False
This is the full collection of load-related methods:
is_outdated()
Returns True if the file resource has been modified since the handler was loaded (or saved) for the last time; False otherwise.
load_state()
(Re)loads the handler’s state from its associated file resource. The timestamp is updated.
load_state_from_string(string)
Updates the handler’s state with the contents of the given byte string.
load_state_from_file(file)
Updates the handler’s state with the contents of the given open file.
load_state_from(uri)
Updates the handler’s state with the contents of the file resource identified by the given URI reference.
Note that the last three methods modify the handler’s state with content that does not come from the associated file resource. This does not change the timestamp, but sets the dirty variable to the current datetime, meaning that the handler’s state has changed and is newer than the associated file resource.
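The check behind is_outdated can be sketched by comparing modification times. This is a Python 3 illustration under the assumption that the timestamp is the file's modification time, as described above; TimestampedHandler is a hypothetical class, not the itools one.

```python
import os
import tempfile


class TimestampedHandler:
    """Hypothetical handler that remembers the file's modification time."""

    def __init__(self, path):
        self.path = path
        self.load_state()

    def load_state(self):
        with open(self.path, 'rb') as f:
            self.data = f.read()
        # Remember the modification time seen at load time.
        self.timestamp = os.path.getmtime(self.path)

    def is_outdated(self):
        # True when the file is newer than what we loaded.
        return os.path.getmtime(self.path) > self.timestamp


with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'Hello\n')

handler = TimestampedHandler(tmp.name)
fresh = handler.is_outdated()

# Simulate a change from "another console": rewrite the file and push
# its modification time forward so the example does not need to sleep.
with open(tmp.name, 'wb') as f:
    f.write(b'Bye\n')
os.utime(tmp.name, (handler.timestamp + 10, handler.timestamp + 10))

stale = handler.is_outdated()
old_data = handler.data
handler.load_state()
reloaded = handler.is_outdated()
print(fresh, stale, old_data, handler.data, reloaded)
os.remove(tmp.name)
```

The handler keeps serving the old data until load_state is called again, which matches the session shown earlier.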
This brings us to the next section: saving changes.
We continue with our test file from above. Now we are going to change the handler’s state:
>>> print test.dirty
None
>>> test.set_data(u'The king is naked.\n')
>>> print test.dirty
2008-03-27 14:25:54.080461
>>> print test.to_str()
The king is naked.
# From another console...
$ cat test.txt
Bye
To know whether the handler has been modified and is now newer than the associated file resource, we just check the dirty variable. To save the changes to the associated file resource, we use save_state:
>>> test.save_state()
>>> print test.dirty
None
# From another console...
$ cat test.txt
The king is naked.
This is the programming interface for save operations:
dirty
Read-only variable that holds the datetime of the handler’s last modification, or None if the handler has not been modified since it was last loaded or saved.
save_state()
Saves the handler’s state to its associated file, so that the handler and its file resource are synchronized again.
save_state_to(uri)
Saves the handler’s state to the file resource identified by the given URI.
save_state_to_file(file)
Saves the handler’s state to the given open file.
Note that the last two methods do not set the dirty variable to None, since the handler’s state has not been saved to its associated file resource, but to some other file.
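The rule about the dirty variable can be sketched in a few lines. SavableHandler below is a hypothetical Python 3 class that works on plain paths instead of URIs; it only illustrates the behaviour described above and is not taken from itools.

```python
import os
import shutil
import tempfile
from datetime import datetime


class SavableHandler:
    """Hypothetical handler tracking unsaved changes with a dirty flag."""

    def __init__(self, path, data=b''):
        self.path = path
        self.data = data
        self.dirty = None

    def set_data(self, data):
        self.data = data
        # Record when the in-memory state diverged from the file.
        self.dirty = datetime.now()

    def save_state_to(self, path):
        # Writing elsewhere leaves the handler dirty.
        with open(path, 'wb') as f:
            f.write(self.data)

    def save_state(self):
        self.save_state_to(self.path)
        # Handler and its own file are synchronised again.
        self.dirty = None


workdir = tempfile.mkdtemp()
main = os.path.join(workdir, 'test.txt')
copy = os.path.join(workdir, 'copy.txt')

handler = SavableHandler(main)
handler.set_data(b'The king is naked.\n')

handler.save_state_to(copy)
dirty_after_copy = handler.dirty is not None

handler.save_state()
dirty_after_save = handler.dirty is not None

with open(main, 'rb') as f:
    saved = f.read()
print(dirty_after_copy, dirty_after_save, saved)
shutil.rmtree(workdir)
```

Saving a copy is treated as an export: only writing to the handler's own file clears the dirty variable.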
So far we have explicitly chosen which handler class we want to use to work with some file. It is also possible to let itools.handlers choose the best handler class available for us, with the get_handler function:
>>> from itools.handlers import get_handler
>>>
>>> get_handler('itools.pdf')
<itools.handlers.file.File object at 0x2b65c5f01910>
Here the get_handler function did not find a specific handler class for the PDF document, so it chose the basic File class. But we can do better:
>>> import itools.pdf
>>>
>>> get_handler('itools.pdf')
<itools.pdf.pdf.PDFFile object at 0xf5d450>
The itools.handlers package provides the basic infrastructure and a few handler classes. Most of the more specific handler classes require importing the right package, such as itools.pdf, itools.xml or itools.odf.
To find out the best available handler class for a file itools uses the file’s mimetype2, and keeps a registry from mimetype to handler class.
The programming interface of the registry is:
register_handler_class(handler_class)
Registers the given handler class into the registry. The class must define the variable class_mimetypes, which must be a list with the mimetypes the handler class is able to manage.
get_handler_class(uri)
Returns the handler class that best fits the resource identified by the given uri.
To illustrate the registry interface, this is what a handler class looks like:
from itools.handlers import File
from itools.handlers import register_handler_class

class PDFFile(File):
    class_mimetypes = ['application/pdf']

register_handler_class(PDFFile)
So far we have seen how to load a file handler for a file resource that already exists, in the local filesystem or somewhere else. But sometimes we want to create new files, or just to work with temporary files that will never be stored anywhere:
>>> from itools.html import HTMLFile
>>>
>>> file = HTMLFile()
>>> print file.uri
None
Note that we have created the handler by calling the handler class without passing any arguments. This creates a new handler that is not associated with any resource, so the value of handler.uri is None. The general prototype for a handler class is:
<handler_class>(uri=None, **kw)
If a URI reference is given, build a handler instance for it.
If a URI reference is not given, create a new handler that is not associated with any resource. Named parameters may be passed; they will be used to initialize the handler’s state (which named parameters are accepted depends on the handler class).
For instance, we are going to build an HTML handler with some title:
>>> file = HTMLFile(title=u'Hello World')
>>> print file.to_str()
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; ...
<title>Hello World</title>
</head>
<body></body>
</html>
When writing a new handler class, the method new must be implemented; it initializes the handler’s state for handlers not associated with a file resource. For example, the handler class for a PDF file may look like:
from itools.handlers import File

class PDFFile(File):
    class_mimetypes = ['application/pdf']

    def new(self):
        self.data = '%PDF-1.4\n'
Note that the only intent of the example above is to show the prototype of the new method; don’t expect it to work properly (I don’t really know the PDF file format).
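How the constructor and the new method may fit together can be sketched as follows. This is a guess at the mechanism, written for Python 3: ToyHandler and ToyHTMLFile are hypothetical classes, and the HTML template is made up for the example.

```python
class ToyHandler:
    """Hypothetical base class following the (uri=None, **kw) prototype."""

    def __init__(self, uri=None, **kw):
        self.uri = uri
        if uri is None:
            # No resource: build a fresh state from the named parameters.
            self.new(**kw)

    def new(self):
        self.data = b''


class ToyHTMLFile(ToyHandler):
    def new(self, title=''):
        # Made-up minimal template, only to show how a parameter is used.
        self.data = ('<html><head><title>%s</title></head>'
                     '<body></body></html>' % title)

    def to_str(self):
        return self.data


page = ToyHTMLFile(title='Hello World')
print(page.uri)
print(page.to_str())
```

Because the named parameters are forwarded as they are, each handler class decides which ones it accepts simply through the signature of its new method.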
Footnotes