This module is unmaintained. Maybe someday...
DOMForm is a Python module for web scraping and web testing. It knows how to evaluate embedded JavaScript code in response to appropriate events. DOMForm supports both the ClientForm 0.1.x HTML form interface and the HTML DOM level 2 interface (note that at present the DOM implementation follows an out-of-date version of the specification, and has some hacks to get it to work with "DOM as deployed"). The ClientForm interface makes it easy to parse HTML forms, fill them in and return them to the server. The DOM interface makes it easy to get at other parts of the document, and makes JavaScript support possible. The ability to switch back and forth between the two interfaces allows simpler code than would result from using either interface alone. DOMForm is partly derived from several third-party libraries. JavaScript support currently depends on Mozilla's GPLed spidermonkey JavaScript interpreter (which is available separately from Mozilla itself), and on a Python interface to spidermonkey.
This package allows you to use web pages containing JavaScript code, have
that code automatically executed at appropriate times, and have the results
reflected both in an HTML DOM tree and in a higher-level browser-like object
model (only the ClientForm part of this browser interface is implemented so
far). Of course, automatic execution of much code depends on the use of either
the browser-like interface or equivalent DOM methods: otherwise, the code can't
know when the JavaScript should be executed. XXX lots of stuff not implemented
yet, e.g. javascript: URLs (easy to do, though).
It's easy to switch between the ClientForm API and the DOM, thus making it hard to get stuck in a position where further progress requires disproportionate coding effort:

    from urllib2 import urlopen
    from DOMForm import ParseResponse

    response = urlopen("http://www.example.com/")
    window = ParseResponse(response)
    doc = window.document  # HTML DOM Level 2 HTMLDocument interface
    forms = window._htmlforms  # list of objects supporting the ClientForm.HTMLForm interface
    form = forms[0]
    assert form.name == "some_form"
    domform = form.node  # level 2 HTML DOM HTMLFormElement interface
    control = form.find_control("some_control")  # ClientForm.Control interface
    domcontrol = control.node  # corresponding level 2 HTML DOM HTMLElement interface
    doc.some_form._htmlform  # back to the ClientForm.HTMLForm interface again
    doc.some_form.some_control._control  # ClientForm.Control interface again
    response = urlopen(form.click())  # domform.submit() also works

Note that the level 2 HTML DOM interface is currently based on an old version of the specification, with some imperfect changes to provide some support for XHTML.
To interpret JavaScript, you need to pass the interpret argument to
ParseResponse or ParseFile:

    window = ParseResponse(response, interpret=["javascript"])

The HTML DOM should allow you to get at anything you need to know. Still,
since the DOM does some normalisation and is only created after the original
HTML has been fed through HTMLTidy, you may sometimes need or want access to
the original HTML. ClientCookie's SeekableProcessor is one way of doing that:

    from ClientCookie import build_opener, SeekableProcessor
    from DOMForm import ParseResponse

    opener = build_opener(SeekableProcessor)
    response = opener.open("http://www.example.com/")
    window = ParseResponse(response)
    html = response.read()
    response.seek(0)
    # carry on using response object as if it hadn't been .read()

Or you can store the HTML somewhere, then use ParseFile instead of ParseResponse.
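The read-then-rewind pattern above can be illustrated without ClientCookie. The following is only a minimal sketch of the idea, not ClientCookie's implementation: SeekableResponse is a hypothetical name, it buffers the whole body eagerly, and an in-memory io.BytesIO object stands in for a real network response.

```python
import io


class SeekableResponse(object):
    """Hypothetical wrapper (not part of ClientCookie or DOMForm).

    Buffers the body of a file-like response in memory so that it can be
    read, rewound with seek(0), and read again.
    """

    def __init__(self, response):
        # Eagerly read the whole body; fine for a sketch, wasteful for
        # very large responses.
        self._buffer = io.BytesIO(response.read())

    def read(self, size=-1):
        return self._buffer.read(size)

    def seek(self, offset, whence=0):
        return self._buffer.seek(offset, whence)


# Stand-in for a network response (assumption: any object with .read()).
raw = io.BytesIO(b"<html><body>hello</body></html>")
response = SeekableResponse(raw)

html = response.read()   # grab the original HTML
response.seek(0)         # rewind, as if it had never been read
again = response.read()  # a later consumer sees the same bytes
```

The point of the design is that whatever consumes the response next does not need to know that the HTML has already been read once.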
If you want the HTML after the JavaScript has been interpreted, use

    from xml.dom.ext import XHtmlPrint
    # doc is a DOM node (e.g. window.document); fileobj is a writable file object
    XHtmlPrint(doc, fileobj)

XHtmlPrettyPrint makes nicer output. Both functions will print any DOM node,
not just an HTMLDocument.
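For readers without PyXML installed, the same idea (serialising any node, compactly or pretty-printed) can be tried with the standard library's xml.dom.minidom. This is a stand-in for illustration only: the serialize helper is a hypothetical name, and the minidom methods used here belong to minidom trees, so nothing below should be read as part of DOMForm's or PyXML's API.

```python
from xml.dom import minidom


def serialize(node, pretty=False):
    """Return markup for any minidom node, not just a Document.

    Hypothetical helper: toxml() gives compact output and toprettyxml()
    gives indented output, loosely analogous to XHtmlPrint and
    XHtmlPrettyPrint.
    """
    if pretty:
        return node.toprettyxml(indent="  ")
    return node.toxml()


doc = minidom.parseString("<html><body><p>hi</p></body></html>")
paragraph = doc.getElementsByTagName("p")[0]

compact = serialize(paragraph)                        # a single element
pretty = serialize(doc.documentElement, pretty=True)  # a whole subtree
```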
There's some more documentation in the docstrings.
Thanks to Andrew Clover for advice and code on DOM 'liveness', to all the PyXML
contributors, and to Gisle Aas for the HTML::Form Perl code from which
ClientForm was originally derived.
Most of the bugs are in JavaScript support (which is very dodgy) and the DOM
implementation. The ClientForm work-alike stuff is relatively stable (but see
the entities and select_default bugs listed below).
except * feature to be fixed. There are a few print statements scattered about, as a result of this. Note that code listed with JavaScript error messages can be the WRONG CODE! Don't take it seriously.
decorate_DOM(window) after this happens, to regenerate the HTMLForm and all its Controls, and rebind them to the DOM. I probably won't fix this (I'm guessing it won't cause problems).
Window class is still just stubs. This will be fixed, gradually. At the moment, you can probably derive your own Window class quite easily, with stubs that suit your application, and pass it to one of the Parse* functions through the window_class argument.
javascript: scheme URLs, external JavaScript loading, etc. aren't implemented yet (but they're easy to add).
innerHTML isn't implemented. Thanks to my hacks (for liveness, IE compatibility, bug fixes, changes to match the newest DOM standard etc.), it's probably quite buggy, too.
sgmlop.
RADIO controls. This should be fixed soon.
onclick - executed. You just have to fire your own events:

    from DOMForm import fireHTMLEvent, fireMouseEvent

    # Say we've got a DOM node, domnode, representing a button, and we want to
    # simulate clicking it.
    fireHTMLEvent(domnode, "focus")
    fireMouseEvent(domnode, "click")
    fireHTMLEvent(domnode, "blur")
    # Of course, this is missing events like mouseover, which would be fired
    # by a browser, but we probably don't even need the focus or blur either.

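The focus, click, blur sequence shown in the last item can be wrapped in a small helper if you need it in several places. This is only a sketch: simulate_click and its parameters are hypothetical names, not part of DOMForm. The two firing functions are passed in as arguments so that the helper can be tried without DOMForm installed; with DOMForm you would pass fireHTMLEvent and fireMouseEvent.

```python
def simulate_click(node, fire_html_event, fire_mouse_event, with_focus=True):
    """Fire the events a simple button click would produce, in order.

    Hypothetical helper. fire_html_event and fire_mouse_event are expected
    to take (node, event_name), as in the fireHTMLEvent and fireMouseEvent
    calls shown above. Events such as mouseover are deliberately omitted.
    """
    if with_focus:
        fire_html_event(node, "focus")
    fire_mouse_event(node, "click")
    if with_focus:
        fire_html_event(node, "blur")


# Recording stand-ins, so the order of events can be inspected.
events = []


def record_html(node, name):
    events.append(("html", node, name))


def record_mouse(node, name):
    events.append(("mouse", node, name))


simulate_click("button", record_html, record_mouse)
```

Passing with_focus=False fires only the click, for pages whose scripts don't care about focus or blur.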
For installation instructions, see the INSTALL file included in the distribution.
Python 2.3 and PyXML 0.8.3 are required (earlier versions may work, but are untested). Currently mxTidy is required (I may switch to uTidylib at some point). The spidermonkey Python module is required if you want JavaScript interpretation.
Development release. This is the first alpha release: there are many known bugs, and interfaces will change.
Good question. I wanted something smaller, not dependent on any browser, and also liked the idea of an easy-to-understand implementation of the browser object model in pure Python.
2.3 (earlier versions may work, but are untested).
The BSD license (included in distribution). Note that spidermonkey and its Python interface are under the GPL.
_htmlforms begin with an underscore?
Because attributes that start with an underscore ("_") are not exposed to
JavaScript by the spidermonkey module.
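The naming convention is easy to picture in plain Python. The sketch below is purely illustrative and is not the spidermonkey module's code: exposed_names and FakeWindow are hypothetical, and they only show how a bridge that follows this rule would hide names such as _htmlforms from scripts while leaving names such as document visible.

```python
def exposed_names(obj):
    """Return the attribute names that do not start with an underscore.

    Hypothetical illustration of the convention described above.
    """
    return sorted(name for name in dir(obj) if not name.startswith("_"))


class FakeWindow(object):
    """Stand-in object with one public and one underscore attribute."""

    def __init__(self):
        self.document = "the DOM document"  # visible to scripts
        self._htmlforms = []                # hidden from scripts


visible = exposed_names(FakeWindow())
```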
The ClientCookie package makes it easy to get seek()able response objects,
which is convenient for debugging. See also here for a few relevant tips. Also
see General FAQs.
I prefer questions and comments to be sent to the mailing list rather than direct to me.
John J. Lee, May 2006.