16.1 The parser

The lowest-level layer offered by itools.xml is an event-driven parser. The following example shows how to use it:

    >>> from itools.xml import (XMLParser, START_ELEMENT,
    ...     END_ELEMENT, TEXT)
    >>>
    >>> data = 'Hello <em>Baby</em>'
    >>> for event_type, value, line in XMLParser(data):
    ...     if event_type == START_ELEMENT:
    ...         tag_uri, tag_name, attributes = value
    ...         print 'START TAG :', tag_name
    ...     elif event_type == END_ELEMENT:
    ...         tag_uri, tag_name = value
    ...         print 'END TAG   :', tag_name
    ...     elif event_type == TEXT:
    ...         print 'TEXT      :', value
    ...
    TEXT      : Hello
    START TAG : em
    TEXT      : Baby
    END TAG   : em

The example prints a message to standard output each time the parser finds the start of an element, the end of an element, or a text node.

The parser produces a sequence of events. Every event is a tuple of three values: the event type, the value (whose form depends on the event type), and the line number. The following events are implemented:

Event           Value
--------------  -------------------------------
XML_DECL        (version, encoding, standalone)
DOCUMENT_TYPE   (name, doctype)
START_ELEMENT   (tag uri, tag name, attributes)
END_ELEMENT     (tag uri, tag name)
TEXT            value
COMMENT         value
PI              (name, value)
CDATA           value

All values (text nodes, comments, attribute values, etc.) are returned as byte strings in the source encoding. In the DOCUMENT_TYPE event, doctype is a DocType instance.
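Because values arrive as byte strings in the source encoding, a caller that needs unicode text has to decode them itself. The following is only a sketch of one way to do that, using the encoding carried by the XML_DECL event. The event constants are stand-in values and the events are written by hand so that the code runs without itools installed; decode_text and its default_encoding parameter are hypothetical names, and the None used for standalone is a placeholder.

```python
# Stand-in event type constants (assumption: real code would import them
# from itools.xml, as the usage example does for START_ELEMENT and TEXT).
XML_DECL, START_ELEMENT, END_ELEMENT, TEXT = range(4)


def decode_text(events, default_encoding='utf-8'):
    """Yield the TEXT values of an event stream as unicode strings."""
    encoding = default_encoding
    for event_type, value, line in events:
        if event_type == XML_DECL:
            version, declared_encoding, standalone = value
            # Assumption: the declared encoding may be absent.
            if declared_encoding:
                encoding = declared_encoding
        elif event_type == TEXT:
            yield value.decode(encoding)


# Hand-written events mimicking what the parser might produce for:
#   <?xml version="1.0" encoding="iso-8859-1"?><p>caf\xe9</p>
events = [
    (XML_DECL, ('1.0', 'iso-8859-1', None), 1),
    (START_ELEMENT, (None, 'p', {}), 1),
    (TEXT, b'caf\xe9', 1),
    (END_ELEMENT, (None, 'p'), 1),
]

texts = list(decode_text(events))
```

Falling back to UTF-8 when no encoding is declared follows the XML default; a document without an XML declaration simply never changes the encoding variable.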

Attributes

Element attributes are returned as a dictionary. Each key is a tuple of the attribute's namespace URI and local name, and each value is the attribute's value.

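For example, consider the XML fragment:

    <x xmlns="namespace1" xmlns:n2="namespace2">
      <test a="1" n2:b="2" />
    </x>

For the "test" element, the START_ELEMENT event carries this value:

    ('namespace1', 'test',
     {('namespace2', 'b'): '2', (None, 'a'): '1'})

The unprefixed attribute "a" has no namespace (None), because a default namespace declaration applies to elements but not to attributes.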

The parser always resolves element and attribute prefixes and returns the corresponding namespace URIs instead. Namespace declarations are themselves returned as attributes.
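Since attribute keys are (namespace URI, local name) tuples, a lookup needs both parts, even for attributes without a namespace. The small helper below is a sketch, not part of itools; get_attribute is a hypothetical name, and the data is the value shown above for the "test" element.

```python
def get_attribute(attributes, name, uri=None, default=None):
    """Look up an attribute by local name and namespace URI.

    Hypothetical helper: unprefixed attributes are keyed with a
    namespace URI of None, so that is the default.
    """
    return attributes.get((uri, name), default)


# The START_ELEMENT value for the "test" element.
tag_uri, tag_name, attributes = (
    'namespace1', 'test',
    {('namespace2', 'b'): '2', (None, 'a'): '1'})

a = get_attribute(attributes, 'a')                    # unprefixed attribute
b = get_attribute(attributes, 'b', uri='namespace2')  # written as n2:b
missing = get_attribute(attributes, 'b')              # no unprefixed "b"
```

Looking up by URI rather than by prefix means the code keeps working when a document binds the same namespace to a different prefix.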