DOMForm

This module is unmaintained. Maybe someday...

DOMForm is a Python module for web scraping and web testing. It knows how to evaluate embedded JavaScript code in response to appropriate events. DOMForm supports both the ClientForm 0.1.x HTML form interface and the HTML DOM level 2 interface (note that, at the moment, the DOM is written to an out-of-date version of the specification, and has some hacks to get it to work with "DOM as deployed"). The ClientForm interface makes it easy to parse HTML forms, fill them in and return them to the server. The DOM interface makes it easy to get at other parts of the document, and makes JavaScript support possible. The ability to switch back and forth between the two interfaces allows simpler code than would result from using either interface alone.

DOMForm is partly derived from several third-party libraries. JavaScript support currently depends on Mozilla's GPLed spidermonkey JavaScript interpreter (which is available separately from Mozilla itself), and on a Python interface to spidermonkey.

This package allows you to use web pages containing JavaScript code, have that code automatically executed at appropriate times, and have the results reflected both in an HTML DOM tree and in a higher-level browser-like object model (only the ClientForm part of this browser interface is implemented so far). Of course, automatic execution of much code depends on the use of either the browser-like interface or equivalent DOM methods: otherwise, the code can't know when the JavaScript should be executed. Much is not implemented yet: for example, javascript: URLs (though these would be easy to add).

It's easy to switch between the ClientForm API and the DOM, thus making it hard to get stuck in a position where further progress requires disproportionate coding effort:

from urllib2 import urlopen
from DOMForm import ParseResponse

response = urlopen("http://www.example.com/")
window = ParseResponse(response)
doc = window.document  # HTML DOM Level 2 HTMLDocument interface
forms = window._htmlforms  # list of objects supporting ClientForm.HTMLForm i/face
form = forms[0]

assert form.name == "some_form"
domform = form.node  # level 2 HTML DOM HTMLFormElement interface
control = form.find_control("some_control")  # ClientForm.Control i/face
domcontrol = control.node  # corresponding level 2 HTML DOM HTMLElement i/face
doc.some_form._htmlform  # back to the ClientForm.HTMLForm interface again
doc.some_form.some_control._control  # ClientForm.Control interface again

response = urlopen(form.click())  # domform.submit() also works

Note that the level 2 HTML DOM interface is currently based on an old version of the specification, with some imperfect changes to provide some support for XHTML.
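
As an illustration only of the kind of two-way link the example above relies on (form.node in one direction, ._htmlform in the other), here is a minimal sketch. The FormNode and FormWrapper classes are hypothetical stand-ins invented for this sketch; they are not DOMForm's classes and say nothing about how DOMForm is actually implemented.

```python
class FormNode(object):
    """Hypothetical DOM-side object (stands in for an HTMLFormElement)."""

    def __init__(self, name):
        self.name = name
        self._htmlform = None  # back-reference, filled in by FormWrapper


class FormWrapper(object):
    """Hypothetical form-side object (stands in for a ClientForm-style form)."""

    def __init__(self, node):
        self.node = node       # forward link to the DOM node
        node._htmlform = self  # back link from the DOM node

    @property
    def name(self):
        # Read through to the node so both views always agree.
        return self.node.name


node = FormNode("some_form")
form = FormWrapper(node)

# Hop from one view to the other and back again.
same_form = form.node._htmlform

node.name = "renamed_form"  # a change made via the DOM side...
current_name = form.name    # ...is visible from the form side
```

Because the wrapper reads its state from the node rather than keeping a copy, neither view can go stale in this sketch, which is the property that makes hopping between interfaces safe.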

To interpret JavaScript, you need to pass the interpret argument to ParseResponse or ParseFile:

window = ParseResponse(response, interpret=["javascript"])

The HTML DOM should allow you to get at anything you need to know. Still, since the DOM does some normalisation and is only created after the original HTML has been fed through HTMLTidy, you may sometimes need or want access to the original HTML. ClientCookie's SeekableProcessor is one way of doing that:

from ClientCookie import build_opener, SeekableProcessor
opener = build_opener(SeekableProcessor)
response = opener.open("http://www.example.com/")
window = ParseResponse(response)
html = response.read()
response.seek(0)
# carry on using response object as if it hadn't been .read()
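
To show the general idea behind a seekable response, here is a minimal sketch that buffers a read-once stream so it can be read, rewound and read again. The SeekableBody class is hypothetical and written for this sketch; it is not ClientCookie's implementation, and the io.BytesIO object merely stands in for a real HTTP response.

```python
import io


class SeekableBody(object):
    """Hypothetical wrapper: buffer a read-once stream so it can be rewound."""

    def __init__(self, stream):
        # Read the whole body once and keep it in memory.
        self._buffer = io.BytesIO(stream.read())

    def read(self, size=-1):
        return self._buffer.read(size)

    def seek(self, offset, whence=0):
        return self._buffer.seek(offset, whence)


# A fake "response", standing in for the object returned by opener.open().
raw = io.BytesIO(b"<html><body><form name='some_form'></form></body></html>")
response = SeekableBody(raw)

html = response.read()  # grab the original HTML
response.seek(0)        # rewind so later consumers see the full body again
```

The cost of this approach is that the whole body is held in memory, which is usually acceptable for HTML pages.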

Or you can store the HTML somewhere, then use ParseFile instead of ParseResponse.
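
A minimal sketch of the storing half of that approach, using only the standard library. The temporary file and the sample HTML are assumptions made for illustration, and the ParseFile call appears only as a comment because its exact arguments are not described here (see the docstrings).

```python
import os
import tempfile

# Stand-in for the bytes obtained from response.read().
html = b"<html><body><form name='some_form'></form></body></html>"

# Store the original HTML somewhere (here: a temporary file).
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as out:
    out.write(html)

# Later, reopen the stored copy as a file object.
stored = open(path, "rb")
try:
    # A file object like this is what would be handed to ParseFile:
    # window = ParseFile(stored, ...)
    round_trip = stored.read()
finally:
    stored.close()
    os.remove(path)
```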

If you want the HTML after the JavaScript has been interpreted, use:

from xml.dom.ext import XHtmlPrint
XHtmlPrint(doc, fileobj)

XHtmlPrettyPrint makes nicer output. Both functions will print any DOM node, not just an HTMLDocument.
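
If PyXML is not to hand, the same point (any node can be serialised, not just the document) can be tried out with the standard library's xml.dom.minidom. This is a swapped-in stand-in, not the xml.dom.ext functions named above, and the sample markup is invented for the sketch.

```python
from xml.dom.minidom import parseString

doc = parseString(
    "<html><body><form name='some_form'>"
    "<input name='some_control'/></form></body></html>"
)

# Serialise the whole document (output includes an XML declaration).
whole = doc.toxml()

# Serialise just one node: the first <form> element.
form_node = doc.getElementsByTagName("form")[0]
fragment = form_node.toxml()

# toprettyxml() adds indentation and newlines, in the spirit of
# XHtmlPrettyPrint.
pretty = form_node.toprettyxml(indent="  ")
```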

There's some more documentation in the docstrings.

Thanks to Andrew Clover for advice and code on DOM 'liveness', all the PyXML contributors, and Gisle Aas, for the HTML::Form Perl code from which ClientForm was originally derived.

Major known bugs and surprises

Most of the bugs are in JavaScript support (which is very dodgy) and the DOM implementation. The ClientForm work-alike stuff is relatively stable (but see the entities and select_default bugs listed below).

Download

For installation instructions, see the INSTALL file included in the distribution.

Python 2.3 and PyXML 0.8.3 are required (earlier versions may work, but are untested). Currently mxTidy is required (I may switch to uTidylib at some point). The spidermonkey Python module is required if you want JavaScript interpretation.

Development release. This is the first alpha release: there are many known bugs, and interfaces will change.

FAQs

I prefer questions and comments to be sent to the mailing list rather than direct to me.

John J. Lee, May 2006.