15.1 The HTML parser

The itools.html package includes a parser for HTML documents. Its programming interface is similar, but not identical, to that of the XML parser in the itools.xml package (see Section 16.1).

Example:

    >>> from itools.html import HTMLParser
    >>> from itools.xml import START_ELEMENT, END_ELEMENT, TEXT
    >>>
    >>> data = 'Hello <em>Baby</em>'
    >>> for event_type, value, line in HTMLParser(data):
    ...     if event_type == START_ELEMENT:
    ...         tag_uri, tag_name, attributes = value
    ...         print 'START TAG :', tag_name
    ...     elif event_type == END_ELEMENT:
    ...         tag_uri, tag_name = value
    ...         print 'END TAG   :', tag_name
    ...     elif event_type == TEXT:
    ...         print 'TEXT      :', value
    ...
    TEXT      : Hello
    START TAG : em
    TEXT      : Baby
    END TAG   : em

This example prints a message to standard output each time the parser finds the start of an element, the end of an element, or a text node.

The parser returns a list of events. Each event is a tuple of three values: the event type, the value (whose structure depends on the event type), and the line number. The implemented events are:

Event            Value
---------------  --------------------------------------------
DOCUMENT_TYPE    (tag name, system_id, public_id, intSubSet?)
START_ELEMENT    (tag uri, tag name, attributes)
END_ELEMENT      (tag uri, tag name)
TEXT             value
COMMENT          value
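As an illustration of how the event tuples can be consumed, the following sketch concatenates the values of all TEXT events and ignores everything else. It does not import itools: the event type constants are stand-ins with placeholder values (real code would import them from itools.xml), the events are written by hand rather than produced by the parser, and None is used for the tag URI only as a placeholder. The function name extract_text is hypothetical.

```python
# Stand-in event type constants. Real code would import these from
# itools.xml; the numeric values here are placeholders (assumption).
START_ELEMENT, END_ELEMENT, TEXT, COMMENT = range(4)


def extract_text(events):
    """Concatenate the values of all TEXT events, skipping other events."""
    chunks = []
    for event_type, value, line in events:
        if event_type == TEXT:
            chunks.append(value)
    return ''.join(chunks)


# Hand-written events shaped like the tuples described in the table.
# The tag URI is shown as None purely as a placeholder.
events = [
    (TEXT, 'Hello ', 1),
    (START_ELEMENT, (None, 'em', {}), 1),
    (TEXT, 'Baby', 1),
    (END_ELEMENT, (None, 'em'), 1),
    (COMMENT, ' not part of the text ', 1),
]

plain = extract_text(events)
```

Because the function relies only on the three-value tuple layout, it should work unchanged on the events returned by the parser, provided the real constants are used.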

All values (text nodes, comments, attribute values, etc.) are returned as byte strings, in the source encoding.
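Since the values are byte strings, a caller that needs text has to decode them. The sketch below assumes the caller already knows the source encoding; how to discover it is not covered here. The helper names decode_text and decode_attributes are hypothetical and not part of itools.

```python
def decode_text(value, encoding):
    """Decode one byte-string value (text node, comment, ...)."""
    return value.decode(encoding)


def decode_attributes(attributes, encoding):
    """Return a copy of an attributes dictionary with decoded values.

    Attribute names are left untouched in this sketch.
    """
    return dict((name, value.decode(encoding))
                for name, value in attributes.items())


# Hand-written sample data, assumed to be UTF-8 encoded.
raw_text = b'Caf\xc3\xa9'
raw_attributes = {'title': b'Caf\xc3\xa9'}

text = decode_text(raw_text, 'utf-8')
attributes = decode_attributes(raw_attributes, 'utf-8')
```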

Attributes

The attributes of an element are returned as a dictionary that maps each attribute name to its value.

For example, given the HTML fragment:

    <a href="http://www.gnu.org/"
      title="GNU's Not Unix">GNU</a>

the parser returns the following attributes dictionary:

    {'href': 'http://www.gnu.org/',
     'title': "GNU's Not Unix"}
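A common use of the attributes dictionary is to collect the links in a document. The following sketch shows one way to do it over a list of events. As before, it does not import itools: START_ELEMENT is a stand-in constant with a placeholder value, the single event is written by hand, the tag URI is a None placeholder, and collect_links is a hypothetical helper.

```python
# Stand-in constant. Real code would import START_ELEMENT from
# itools.xml; the value here is a placeholder (assumption).
START_ELEMENT = 0


def collect_links(events):
    """Return (href, title) pairs for every <a> element with an href."""
    links = []
    for event_type, value, line in events:
        if event_type != START_ELEMENT:
            continue
        tag_uri, tag_name, attributes = value
        if tag_name == 'a' and 'href' in attributes:
            # The title is optional, so fall back to None when missing.
            links.append((attributes['href'], attributes.get('title')))
    return links


# Hand-written event carrying the attributes dictionary shown above.
events = [
    (START_ELEMENT,
     (None, 'a', {'href': 'http://www.gnu.org/',
                  'title': "GNU's Not Unix"}),
     1),
]

links = collect_links(events)
```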