DOMForm

This module is unmaintained. Maybe someday...

DOMForm is a Python module for web scraping and web testing. It knows how to evaluate embedded JavaScript code in response to appropriate events. DOMForm supports both the ClientForm 0.1.x HTML form interface and the HTML DOM level 2 interface (note that at the moment the DOM is written to an out-of-date version of the specification, and has some hacks to get it to work with "DOM as deployed").

The ClientForm interface makes it easy to parse HTML forms, fill them in and return them to the server. The DOM interface makes it easy to get at other parts of the document, and makes JavaScript support possible. The ability to switch back and forth between the two interfaces allows simpler code than would result from using either interface alone.

DOMForm is partly derived from several third-party libraries. JavaScript support currently depends on Mozilla's GPLed spidermonkey JavaScript interpreter (which is available separately from Mozilla itself), and a Python interface to spidermonkey.

This package allows you to use web pages containing JavaScript code, have that code automatically executed at appropriate times, and have the results reflected both in an HTML DOM tree and in a higher-level browser-like object model (only the ClientForm part of this browser interface is implemented so far). Of course, automatic execution of much of that code depends on the use of either the browser-like interface or equivalent DOM methods: otherwise, DOMForm can't know when the JavaScript should be executed. XXX lots of stuff not implemented yet, e.g. javascript: URLs (easy to do, though).

It's easy to switch between the ClientForm API and the DOM, thus making it hard to get stuck in a position where further progress requires disproportionate coding effort:

from urllib2 import urlopen
from DOMForm import ParseResponse

response = urlopen("http://www.example.com/")
window = ParseResponse(response)
doc = window.document  # HTML DOM Level 2 HTMLDocument interface
forms = window._htmlforms  # list of objects supporting ClientForm.HTMLForm i/face
form = forms[0]

assert form.name == "some_form"
domform = form.node  # level 2 HTML DOM HTMLFormElement interface
control = form.find_control("some_control")  # ClientForm.Control i/face
domcontrol = control.node  # corresponding level 2 HTML DOM HTMLElement i/face
doc.some_form._htmlform  # back to the ClientForm.HTMLForm interface again
doc.some_form.some_control._control  # ClientForm.Control interface again

response = urlopen(form.click())  # domform.submit() also works

Note that the level 2 HTML DOM interface is currently based on an old version of the specification, with some imperfect changes to provide some support for XHTML.

To interpret JavaScript, you need to pass the interpret argument to ParseResponse or ParseFile:

window = ParseResponse(response, interpret=["javascript"])

The HTML DOM should allow you to get at anything you need to know. Still, since the DOM does some normalisation and is only created after the original HTML has been fed through HTMLTidy, you may sometimes need or want access to the original HTML. ClientCookie's SeekableProcessor is one way of doing that:

from ClientCookie import build_opener, SeekableProcessor
from DOMForm import ParseResponse
opener = build_opener(SeekableProcessor)
response = opener.open("http://www.example.com/")
window = ParseResponse(response)
html = response.read()
response.seek(0)
# carry on using response object as if it hadn't been .read()

Or you can store the html somewhere, then use ParseFile instead of ParseResponse.
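
The read-then-rewind pattern in the SeekableProcessor example above works because the response keeps a buffered copy of its body. Here is a minimal sketch of that idea, written for a current Python interpreter and importing neither ClientCookie nor DOMForm. The make_seekable helper and the OneShotBody class are hypothetical names made up for illustration; this is not how SeekableProcessor is implemented, it only shows why seek(0) lets a second reader see the same bytes.

```python
import io


class OneShotBody:
    """Stand-in for a non-seekable response body: it can only be read once."""

    def __init__(self, data):
        self._chunks = [data]

    def read(self):
        # Hand back the data the first time, then an empty byte string.
        return self._chunks.pop() if self._chunks else b""


def make_seekable(fileobj):
    """Return an in-memory, seekable copy of a file-like object's contents.

    Hypothetical helper for illustration; not part of ClientCookie or DOMForm.
    """
    return io.BytesIO(fileobj.read())


raw = OneShotBody(b"<html><body><form name='some_form'></form></body></html>")

response = make_seekable(raw)
html = response.read()        # first reader consumes the whole body
response.seek(0)              # rewind so the next reader starts from the top
html_again = response.read()  # same bytes as before
leftover = raw.read()         # the original body is exhausted by now
```

The same buffered bytes could just as well be written to disk, which is the "store the html somewhere" route mentioned above.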

If you want the HTML after the JavaScript has been interpreted, use:

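from xml.dom.ext import XHtmlPrint
# doc is the DOM document (window.document); fileobj is any writable
# file-like object, such as sys.stdout.
XHtmlPrint(doc, fileobj)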

XHtmlPrettyPrint makes nicer output. Both functions will print any DOM node, not just an HTMLDocument.

There's some more documentation in the docstrings.

Thanks to Andrew Clover for advice and code on DOM 'liveness', all the PyXML contributors, and Gisle Aas, for the HTML::Form Perl code from which ClientForm was originally derived.

Major known bugs and surprises

Most of the bugs are in JavaScript support (which is very dodgy) and the DOM implementation. The ClientForm work-alike stuff is relatively stable (but see the entities and select_default bugs listed below).

Error / exception handling across JS/Python boundary is pathetic. IIRC, I'm waiting for Pyrex's except * feature to be fixed. There are a few print statements scattered about, as a result of this. Note that code listed with JavaScript error messages can be the WRONG CODE! Don't take it seriously.
Adding or removing form controls (e.g. by JavaScript) doesn't get reflected in the ClientForm API. You have to call decorate_DOM(window) after this happens, to regenerate the HTMLForm and all its Controls, and rebind them to the DOM. I probably won't fix this (I'm guessing it won't cause problems).
Much of the Window class is still just stubs. This will be fixed, gradually. For now, it should be fairly easy to derive your own Window class with stubs that suit your application, and pass it to one of the Parse* functions through the window_class argument.
Stuff like javascript: scheme URLs, external JavaScript loading, etc. aren't implemented yet (but they're easy to add).
The HTML DOM is based on an old version of the specification (see above). This probably won't be fixed all in one go. Instead, changes will gradually be applied to improve compatibility with real-world JS code. For a while at least, this will mean a mixture of the two versions of the standard.
There are various other problems with the DOM: e.g. innerHTML isn't implemented. Thanks to my hacks (for live-ness, IE compatibility, bug fixes, changes to match the newest DOM standard etc.), it's probably quite buggy, too.
Entities in attribute values aren't decoded. Seems to be a bug in sgmlop.
select_default argument is broken for RADIO controls. This should be fixed soon.
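
One way to read the Window item above: subclass the window class and replace the stubs you care about with methods that record what the page's JavaScript asked for. The sketch below is written for a current Python interpreter and does not import DOMForm. BaseWindow is a hypothetical stand-in for DOMForm's Window class, and the alert and confirm method names follow the usual browser object model rather than anything confirmed about DOMForm's own stubs.

```python
class BaseWindow:
    """Hypothetical stand-in for DOMForm's Window class (stubs only)."""

    def alert(self, message):
        pass  # stub: silently ignores the message

    def confirm(self, message):
        return False  # stub: behaves as if the user clicked Cancel


class RecordingWindow(BaseWindow):
    """Window whose dialog methods record their calls."""

    def __init__(self):
        # Leading underscores keep these lists hidden from page JavaScript,
        # following the underscore rule described in the FAQs.
        self._alerts = []
        self._confirmations = []

    def alert(self, message):
        self._alerts.append(message)

    def confirm(self, message):
        self._confirmations.append(message)
        return True  # behave as if the user always clicked OK


window = RecordingWindow()
window.alert("Session expired")
answer = window.confirm("Submit the form?")
```

With DOMForm itself, a class like this would be handed to one of the Parse* functions through window_class; the call is left out here because the constructor arguments DOMForm expects are not documented on this page.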

No browser class yet. This means that, for example, it's a pain to get some event handlers - such as onclick - executed. You just have to fire your own events:

from DOMForm import fireHTMLEvent, fireMouseEvent
# Say we've got a DOM node, domnode, representing a button, and we want to
# simulate clicking it.
fireHTMLEvent(domnode, "focus")
fireMouseEvent(domnode, "click")
fireHTMLEvent(domnode, "blur")
# Of course, this is missing events like mouseover, which would be fired
# by a browser, but we probably don't even need the focus or blur either.

No frame support yet. Nothing to prevent using them, just no actual support.
No Java support. Probably won't be 'fixed', because I don't want this feature. Java's HttpUnit (accessible from Jython) supports this, as do Mozilla, Konqueror and MSIE.
spidermonkey bridge probably leaks memory like crazy ATM.
It's slooow!
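
The focus, click, blur sequence from the "No browser class yet" item can be wrapped in a small helper so it isn't repeated for every button. This is only a sketch for a current Python interpreter: simulate_click is a hypothetical name, and the two event-firing callables are passed in as parameters (assumed to take a node and an event name, like fireHTMLEvent and fireMouseEvent in the snippet above) so the helper can be tried out without DOMForm installed.

```python
def simulate_click(node, fire_html_event, fire_mouse_event):
    """Fire focus, click and blur on node, in that order.

    Both callables are assumed to accept (node, event_name), matching the
    way fireHTMLEvent and fireMouseEvent are called in the README snippet.
    """
    fire_html_event(node, "focus")
    fire_mouse_event(node, "click")
    fire_html_event(node, "blur")


# Exercise the helper with recording fakes instead of real DOM events.
fired = []
simulate_click(
    "button-node",
    lambda node, name: fired.append(("html", node, name)),
    lambda node, name: fired.append(("mouse", node, name)),
)
```

With DOMForm available, the equivalent call would presumably be simulate_click(domnode, fireHTMLEvent, fireMouseEvent).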

Download

For installation instructions, see the INSTALL file included in the distribution.

Python 2.3 and PyXML 0.8.3 are required (earlier versions may work, but are untested). Currently mxTidy is required (I may switch to uTidylib at some point). The spidermonkey Python module is required if you want JavaScript interpretation.

Development release. This is the first alpha release: there are many known bugs, and interfaces will change.

FAQs

Why not use Mozilla, Konqueror or MSIE through their automation interfaces (see General FAQs)?
Good question. I wanted something smaller, not dependent on any browser, and also liked the idea of an easy-to-understand implementation of the browser object model in pure Python.
Which version of Python do I need?
2.3 (earlier versions may work, but are untested).
Which license?
The BSD license (included in distribution). Note that spidermonkey and its Python interface are under the GPL.
Why do attributes like _htmlforms begin with an underscore?
Because attributes that start with an underscore ("_") are not exposed to JavaScript by the spidermonkey module.
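
The naming rule in that answer can be illustrated with a few lines of plain Python. This is not the spidermonkey module's implementation, only a sketch of the convention for a current Python interpreter; js_visible_names and FakeWindow are hypothetical names made up for the example.

```python
def js_visible_names(obj):
    """Return the instance attribute names of obj that don't start with "_"."""
    return sorted(name for name in vars(obj) if not name.startswith("_"))


class FakeWindow:
    """Hypothetical stand-in for a window object, for illustration only."""

    def __init__(self):
        self.document = "HTMLDocument placeholder"
        self._htmlforms = []  # Python-side only: hidden by the underscore rule


visible = js_visible_names(FakeWindow())
```

Under this rule, document would be reachable from page scripts while _htmlforms stays a Python-only convenience.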
I'm having trouble debugging my code.
The ClientCookie package makes it easy to get seek()able response objects, which is convenient for debugging. See also here for a few relevant tips. Also see General FAQs.
Where can I find out more about JavaScript?
- XXX Mozilla docs.
- XXX that page in defence of JavaScript with good intro...
Where can I find out more about the DOM?
- XXX W3C level 2 spec.
- XXX Mozilla docs.
Where can I find out more about the browser object model?
- XXX O'Reilly book (Dynamic HTML: The Definitive Guide)
- XXX MS docs
- XXX Mozilla docs.

I prefer questions and comments to be sent to the mailing list rather than direct to me.

John J. Lee, May 2006.