Author: | Ian Bicking |
---|
Since version 2.0, lxml comes with a dedicated Python package for dealing with HTML: lxml.html. It is based on lxml's HTML parser, but provides a special Element API for HTML elements, as well as a number of utilities for common HTML processing tasks.
Contents
The main API is based on the lxml.etree API, and thus, on the ElementTree API.
There are several functions available to parse HTML:
Parses the named file or url, or if the object has a .read() method, parses from that.
If you give a URL, or if the object has a .geturl() method (as file-like objects from urllib.urlopen() have), then that URL is used as the base URL. You can also provide an explicit base_url keyword argument.
The normal HTML parser is capable of handling broken HTML, but for pages that are far enough from HTML to call them 'tag soup', it may still fail to parse the page in a useful way. A way to deal with this is ElementSoup, which deploys the well-known BeautifulSoup parser to build an lxml HTML tree.
However, note that the most common problem with web pages is the lack of (or the existence of incorrect) encoding declarations. It is therefore often sufficient to only use the encoding detection of BeautifulSoup, called UnicodeDammit, and to leave the rest to lxml's own HTML parser, which is several times faster.
HTML elements have all the methods that come with ElementTree, but also include some extra methods:
One of the interesting modules in the lxml.html package deals with doctests. It can be hard to compare two HTML pages for equality, as whitespace differences aren't meaningful and the structural formatting can differ. This is even more a problem in doctests, where output is tested for equality and small differences in whitespace or the order of attributes can let a test fail. And given the verbosity of tag-based languages, it may take more than a quick look to find the actual differences in the doctest output.
Luckily, lxml provides the lxml.doctestcompare module that supports relaxed comparison of XML and HTML pages and provides a readable diff in the output when a test fails. The HTML comparison is most easily used by importing the usedoctest module in a doctest:
>>> import lxml.html.usedoctest
Now, if you have an HTML document and want to compare it to an expected result document in a doctest, you can do the following:
>>> import lxml.html >>> html = lxml.html.fromstring('''\ ... <html><body color="white"> ... <p>Hi !</p> ... </body></html> ... ''') >>> print lxml.html.tostring(html) <html><body color="white"><p>Hi !</p></body></html> >>> print lxml.html.tostring(html) <html> <body color="white" > <p>Hi !</p> </body> </html> >>> print lxml.html.tostring(html) <html> <body color="white" > <p>Hi !</p> </body> </html>
In documentation, you would likely prefer the pretty printed HTML output, as it is the most readable. However, the three documents are equivalent from the point of view of an HTML tool, so the doctest will silently accept any of the above. This allows you to concentrate on readability in your doctests, even if the real output is a straight ugly HTML one-liner.
Note that there is also an lxml.usedoctest module which you can import for XML comparisons. The HTML parser notably ignores namespaces and some other XMLisms.
lxml.html comes with a predefined HTML vocabulary for the E-factory, originally written by Fredrik Lundh. This allows you to quickly generate HTML pages and fragments:
>>> from lxml.html import builder as E >>> from lxml.html import usedoctest >>> html = E.HTML( ... E.HEAD( ... E.LINK(rel="stylesheet", href="great.css", type="text/css"), ... E.TITLE("Best Page Ever") ... ), ... E.BODY( ... E.H1(E.CLASS("heading"), "Top News"), ... E.P("World News only on this page", style="font-size: 200%"), ... "Ah, and here's some more text, by the way.", ... lxml.html.fromstring("<p>... and this is a parsed fragment ...</p>") ... ) ... ) >>> print lxml.html.tostring(html) <html> <head> <link href="great.css" rel="stylesheet" type="text/css"> <title>Best Page Ever</title> </head> <body> <h1 class="heading">Top News</h1> <p style="font-size: 200%">World News only on this page</p> Ah, and here's some more text, by the way. <p>... and this is a parsed fragment ...</p> </body> </html>
Note that you should use lxml.html.tostring and not lxml.tostring. lxml.tostring(doc) will return the XML representation of the document, which is not valid HTML. In particular, things like <script src="..."></script> will be serialized as <script src="..." />, which completely confuses browsers.
A handy method for viewing your HTML: lxml.html.open_in_browser(lxml_doc) will write the document to disk and open it in a browser (with the webbrowser module).
There are several methods on elements that allow you to see and modify the links in a document.
This yields (element, attribute, link, pos) for every link in the document. attribute may be None if the link is in the text (as will be the case with a <style> tag with @import).
This finds any link in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute. It also searches style attributes for url(link), and <style> tags for @import and url().
This function does not pay attention to <base href>.
This makes all links in the document absolute, assuming that base_href is the URL of the document. So if you pass base_href="http://localhost/foo/bar.html" and there is a link to baz.html that will be rewritten as http://localhost/foo/baz.html.
If resolve_base_href is true, then any <base href> tag will be taken into account (just calling self.resolve_base_href()).
This rewrites all the links in the document using your given link replacement function. If you give a base_href value, all links will be passed in after they are joined with this URL.
For each link link_repl_func(link) is called. That function then returns the new link, or None to remove the attribute or tag that contains the link. Note that all links will be passed in, including links like "#anchor" (which is purely internal), and things like "mailto:bob@example.com" (or javascript:...).
If you want access to the context of the link, you should use .iterlinks() instead.
In addition to these methods, there are corresponding functions:
These functions will parse html if it is a string, then return the new HTML as a string. If you pass in a document, the document will be copied (except for iterlinks()), the method performed, and the new document returned.
Any <form> elements in a document are available through the list doc.forms (e.g., doc.forms[0]). Form, input, select, and textarea elements each have special methods.
Input elements (including <select> and <textarea>) have these attributes:
The value of an input, the content of a textarea, the selected option(s) of a select. This attribute can be set.
In the case of a select that takes multiple options (<select multiple>) this will be a set of the selected options; you can add or remove items to select and unselect the options.
Select attributes:
Input attributes:
The form itself has these attributes:
A dictionary-like object used to access values by their name. form.inputs returns elements, this only returns values. Setting values in this dictionary will effect the form inputs. Basically form.fields[x] is equivalent to form.inputs[x].value and form.fields[x] = y is equivalent to form.inputs[x].value = y. (Note that sometimes form.inputs[x] returns a compound object, but these objects also have .value attributes.)
If you set this attribute, it is equivalent to form.fields.clear(); form.fields.update(new_value)
Note that you can change any of these attributes (values, method, action, etc) and then serialize the form to see the updated values. You can, for instance, do:
>>> from lxml.html import fromstring, tostring >>> form_page = fromstring('''<html><body><form> ... Your name: <input type="text" name="name"> <br> ... Your phone: <input type="text" name="phone"> <br> ... Your favorite pets: <br> ... Dogs: <input type="checkbox" name="interest" value="dogs"> <br> ... Cats: <input type="checkbox" name="interest" value="cats"> <br> ... Llamas: <input type="checkbox" name="interest" value="llamas"> <br> ... <input type="submit"></form></body></html>''') >>> form = form_page.forms[0] >>> form.fields = dict( ... name='John Smith', ... phone='555-555-3949', ... interest=set(['cats', 'llamas'])) >>> print(tostring(form)) <html> <body> <form> Your name: <input name="name" type="text" value="John Smith"> <br>Your phone: <input name="phone" type="text" value="555-555-3949"> <br>Your favorite pets: <br>Dogs: <input name="interest" type="checkbox" value="dogs"> <br>Cats: <input checked name="interest" type="checkbox" value="cats"> <br>Llamas: <input checked name="interest" type="checkbox" value="llamas"> <br> <input type="submit"> </form> </body> </html>
You can submit a form with lxml.html.submit_form(form_element). This will return a file-like object (the result of urllib.urlopen()).
If you have extra input values you want to pass you can use the keyword argument extra_values, like extra_values={'submit': 'Yes!'}. This is the only way to get submit values into the form, as there is no state of "submitted" for these elements.
You can pass in an alternate opener with the open_http keyword argument, which is a function with the signature open_http(method, url, values).
Example:
>>> from lxml.html import parse, submit_form >>> page = parse('http://tinyurl.com').getroot() >>> page.forms[0].fields['url'] = 'http://lxml.de/' >>> result = parse(submit_form(page.forms[0])).getroot() >>> [a.attrib['href'] for a in result.xpath("//a[@target='_blank']")] ['http://tinyurl.com/2xae8s', 'http://preview.tinyurl.com/2xae8s']
The module lxml.html.diff offers some ways to visualize differences in HTML documents. These differences are content oriented. That is, changes in markup are largely ignored; only changes in the content itself are highlighted.
There are two ways to view differences: htmldiff and html_annotate. One shows differences with <ins> and <del>, while the other annotates a set of changes similar to svn blame. Both these functions operate on text, and work best with content fragments (only what goes in <body>), not complete documents.
Example of htmldiff:
>>> from lxml.html.diff import htmldiff, html_annotate >>> doc1 = '''<p>Here is some text.</p>''' >>> doc2 = '''<p>Here is <b>a lot</b> of <i>text</i>.</p>''' >>> doc3 = '''<p>Here is <b>a little</b> <i>text</i>.</p>''' >>> print htmldiff(doc1, doc2) <p>Here is <ins><b>a lot</b> of <i>text</i>.</ins> <del>some text.</del> </p> >>> print html_annotate([(doc1, 'author1'), (doc2, 'author2'), ... (doc3, 'author3')]) <p><span title="author1">Here is</span> <b><span title="author2">a</span> <span title="author3">little</span></b> <i><span title="author2">text</span></i> <span title="author2">.</span></p>
As you can see, it is imperfect as such things tend to be. On larger tracts of text with larger edits it will generally do better.
The html_annotate function can also take an optional second argument, markup. This is a function like markup(text, version) that returns the given text marked up with the given version. The default version, the output of which you see in the example, looks like:
def default_markup(text, version): return '<span title="%s">%s</span>' % ( cgi.escape(unicode(version), 1), text)
This example parses the hCard microformat.
First we get the page:
>>> import urllib >>> from lxml.html import fromstring >>> url = 'http://microformats.org/' >>> content = urllib.urlopen(url).read() >>> doc = fromstring(content) >>> doc.make_links_absolute(url)
Then we create some objects to put the information in:
>>> class Card(object): ... def __init__(self, **kw): ... for name, value in kw: ... setattr(self, name, value) >>> class Phone(object): ... def __init__(self, phone, types=()): ... self.phone, self.types = phone, types
And some generally handy functions for microformats:
>>> def get_text(el, class_name): ... els = el.find_class(class_name) ... if els: ... return els[0].text_content() ... else: ... return '' >>> def get_value(el): ... return get_text(el, 'value') or el.text_content() >>> def get_all_texts(el, class_name): ... return [e.text_content() for e in els.find_class(class_name)] >>> def parse_addresses(el): ... # Ideally this would parse street, etc. ... return el.find_class('adr')
Then the parsing:
>>> for el in doc.find_class('hcard'): ... card = Card() ... card.el = el ... card.fn = get_text(el, 'fn') ... card.tels = [] ... for tel_el in card.find_class('tel'): ... card.tels.append(Phone(get_value(tel_el), ... get_all_texts(tel_el, 'type'))) ... card.addresses = parse_addresses(el)