Another bit of code put together: this time, an automated web browser for Python. It’s something like Perl’s WWW::Mechanize: use it to navigate to a page, follow links, fill out forms, and the like. Get the code, or look at the documentation or the syntax-highlighted code.
It’s not working for me – what do I need to install for the “from xml.dom.ext.reader import HtmlLib” bit to work?
Darn, I thought that came with Python.
It’s the Python/XML distribution (Debian package python2.3-xml). I think the HtmlLib stuff is from 4DOM, the Fourthought Python XML stuff. It appears that the Debian people collect together a load of Python stuff and bung it into one package, which is nice of them but makes it difficult to recommend a package to non-Debian-Linux-people.
I’m open to other suggestions for an HTML parser; all I really need is something that can parse even broken HTML and give me a DOM tree out of it. I tried using the Twisted people’s microdom first but gave it up in favour of HtmlLib, and even that I had to patch twice in the code (once to cope with bad namespaced elements, and once to fix a bug). If you have a better suggestion for an HTML-to-DOM parser in Python, say the word, especially if it’s either easy to distribute with browser.py or comes with Python by default!
You might want to take a look at Fredrik Lundh’s ElementTidy, or mxTidy, which both use a library version of Dave Raggett’s HTML Tidy utility to fix any problems with the HTML before converting it to XHTML (the XML version of HTML).
http://www.effbot.org/zone/element-tidylib.htm
The webunit package contains a module called SimpleDOM:
http://www.mechanicalcat.net/tech/webunit/README.html#simpledom
It seems useful. Besides, webunit seems in some ways similar to Browser.py.
SimpleDOM.NestingError: Open tags <html>, <body>, <table>, <tr>, <td>, <table>, <tr>, <td>, <script> do not match close tag </iframe>, at line 76, column 290
The major problem here is needing to parse invalid HTML, which is what most of the web is. The effbot interface to tidylib would work but requires a C extension. mxTidy certainly didn’t use to be an interface to tidylib; instead, it called the tidy executable directly. There are other tidylib interfaces, some of which need ctypes or similar…
The Twisted guys have microdom, which is specifically designed to correct nonstandard HTML, and can probably be easily lifted out of the package.
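For what it’s worth, here is a rough sketch of the “forgiving HTML in, DOM tree out” idea using only the standard library. It assumes a current Python 3 (html.parser and xml.dom.minidom), not the Python 2.3 setup discussed here, and LenientDOMBuilder and parse_html are made-up names for illustration, not part of browser.py, microdom or HtmlLib.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class LenientDOMBuilder(HTMLParser):
    """Hypothetical helper: builds a minidom tree and tolerates bad nesting."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.doc = minidom.Document()
        self.stack = [self.doc]  # open nodes, innermost last

    def handle_starttag(self, tag, attrs):
        # A Document may hold only one element, so later top-level
        # elements are placed inside the existing root instead.
        if self.stack[-1] is self.doc and self.doc.documentElement is not None:
            self.stack.append(self.doc.documentElement)
        node = self.doc.createElement(tag)
        for name, value in attrs:
            node.setAttribute(name, value or "")
        self.stack[-1].appendChild(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open tag;
        # a close tag with no matching open tag is simply ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tagName == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        # Text outside any element cannot be attached to the Document.
        if self.stack[-1] is not self.doc:
            self.stack[-1].appendChild(self.doc.createTextNode(data))


def parse_html(text):
    """Parse possibly broken HTML and return a minidom Document."""
    builder = LenientDOMBuilder()
    builder.feed(text)
    builder.close()
    return builder.doc
```

A stray close tag such as the </iframe> in the NestingError above is dropped rather than treated as fatal, which is the behaviour you want for real-world pages.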
This is great stuff. Thanks. The XML parser from http://pyxml.sourceforge.net/topics/download.html works well.
I added a little patch for titles:
I used PyXML as well and indeed it works fine. Just noticed that if I install it with the setup.py’s defaults, it gets installed in site-packages/_xmlplus… Python can’t find it there; it must sit in site-packages/xml. (I may be wrong, I’m a newbie at Python.)
I have had similar problems. I chose to compile tidy as a standalone tool, pipe the data I needed to parse through it using popen2, and then parse the result. No problem, except figuring out the encoding stuff for strange Swedish letters.
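A minimal sketch of that pipe-it-through-tidy approach, written for a current Python 3 with subprocess rather than popen2. It assumes a tidy binary is on the PATH; pipe_through and tidy_to_xhtml are hypothetical names. Forcing UTF-8 in both directions is one way to keep the Swedish letters intact.

```python
import subprocess


def pipe_through(command, data):
    """Send bytes to a command's stdin; return (exit code, stdout bytes)."""
    result = subprocess.run(command, input=data,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL)
    return result.returncode, result.stdout


def tidy_to_xhtml(html_text, tidy="tidy"):
    """Clean HTML with an external tidy binary (assumed to be installed)."""
    # -asxhtml: emit XHTML; -quiet: less chatter; -utf8: UTF-8 in and out.
    code, out = pipe_through([tidy, "-asxhtml", "-quiet", "-utf8"],
                             html_text.encode("utf-8"))
    # tidy conventionally exits with 1 for warnings and 2 for errors.
    if code >= 2:
        raise RuntimeError("tidy reported errors")
    return out.decode("utf-8")
```

The output of tidy_to_xhtml can then be handed to any ordinary XML parser, since tidy has already repaired the markup.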
Geoff: I did use microdom first, but I stopped using it in favour of HtmlLib (although annoyingly I can’t remember why!).
Sounds like the Debian package is PyXML, so non-Debian people can simply download and install PyXML from the link Kevin gives.
Stefan: PyXML should work fine installed in lib/site-packages/_xmlplus; the base Python distribution’s lib/xml/__init__.py has a special hook which loads it in place of the basic libraries.
Being the person who maintains the python-xml package, I feel obliged to say that there’s no “collection of different stuff” in Debian packages. What often happens is that one upstream package ends up split into several binary packages (which is the case for the Python interpreter). python-xml is PyXML as released on http://pyxml.sf.net/ (with the xbel parts split off into separate packages).
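If you are unsure which xml package your interpreter actually picks up, a quick check along these lines may help. This is a sketch for a current Python 3; module_available and package_location are made-up helper names, and on an interpreter without PyXML the HtmlLib line would be expected to report False.

```python
import importlib
import importlib.util


def module_available(dotted_name):
    """Return True if the dotted module name can be located."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. xml.dom.ext) is itself missing.
        return False


def package_location(name):
    """Import a package and report the file it was loaded from."""
    module = importlib.import_module(name)
    return getattr(module, "__file__", None)


print("xml loaded from:", package_location("xml"))
print("HtmlLib available:", module_available("xml.dom.ext.reader.HtmlLib"))
```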
I used the following modification to be able to use images as submit buttons :
Getting an error I can’t resolve. I am guessing that this arises when a returned page is over a certain size (64k).
I am using the standard package distributions from OpenBSD 3.4: Python 2.21 and PyXML 0.7.1.
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "browser.py", line 233, in get
    self.__htmldom = self.__reader.fromString(self.__data)
  File "/usr/obj/i386/py-xml-0.7.1/fake-i386/usr/local/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/HtmlLib.py", line 70, in fromString
  File "/usr/obj/i386/py-xml-0.7.1/fake-i386/usr/local/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/HtmlLib.py", line 28, in fromStream
  File "/usr/obj/i386/py-xml-0.7.1/fake-i386/usr/local/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/Sgmlop.py", line 57, in parse
ValueError: character reference too large
Any thoughts anyone?
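The message talks about a character reference rather than page size, so one thing that might be worth checking (an assumption, not a confirmed diagnosis) is whether the page contains numeric character references with large values. The sketch below, for a current Python 3, lists them; large_charrefs is a hypothetical helper and the limit of 255 is chosen purely for illustration.

```python
import re

# Matches decimal (&#8217;) and hexadecimal (&#x2019;) character references.
CHARREF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));?")


def large_charrefs(html_text, limit=255):
    """Return (reference, code point) pairs whose value exceeds limit.

    The default limit is an assumption for illustration, not a documented
    threshold of any particular parser.
    """
    found = []
    for match in CHARREF.finditer(html_text):
        hex_digits, dec_digits = match.groups()
        value = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if value > limit:
            found.append((match.group(0), value))
    return found
```

Running this over the raw page data before it reaches the parser would at least show whether such references are present.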
Hi, another newbie I’m afraid. I’ve picked up on Browser.py so that I could process my Yahoo email (I’m using a Perl program to do so at the moment). Unfortunately I am getting errors that I don’t understand. The first occurs when simply changing the example in the documentation from http://www.yahoo.com to http://www.yahoo.co.uk. I get the following error:
>>> from browser import Browser
>>> b=Browser()
>>> b.get('http://www.yahoo.co.uk/')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "browser.py", line 233, in get
    self.__htmldom = self.__reader.fromString(self.__data)
  File "C:\Python23\Lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 69, in fromString
    return self.fromStream(stream, ownerDoc, charset)
  File "C:\Python23\Lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 27, in fromStream
    self.parser.parse(stream)
  File "C:\Python23\Lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 57, in parse
    self._parser.parse(stream.read())
  File "C:\Python23\Lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 160, in finish_starttag
    unicode(value, self._charset))
  File "browser.py", line 95, in newSetAttributeNS
    Element.setAttributeNS(self,ns,qname.upper(),value)
  File "C:\Python23\Lib\site-packages\_xmlplus\dom\Element.py", line 170, in setAttributeNS
    raise InvalidCharacterErr()
xml.dom.InvalidCharacterErr: Invalid or illegal character
>>>
The error is way beyond my knowledge so any comments would be welcome.
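Reading the traceback, the exception is raised while an attribute is being set on an element, which suggests the page has an attribute whose name the DOM will not accept. One possible workaround, sketched here for a current Python 3 with made-up names (NAME, split_attributes), is to filter attribute names before handing them to the DOM. The pattern is a simplified, ASCII-only approximation of an XML name and not the full rule from the XML specification.

```python
import re

# Simplified, ASCII-only approximation of an XML name with optional prefix.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*(?::[A-Za-z_][A-Za-z0-9._-]*)?")


def split_attributes(attrs):
    """Separate (name, value) pairs into DOM-safe and rejected lists."""
    good, bad = [], []
    for name, value in attrs:
        (good if NAME.fullmatch(name) else bad).append((name, value))
    return good, bad
```

Printing the rejected list for the failing page would show which attribute is tripping things up.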
The second hurdle comes when, using yahoo.com, I get to the login page. I can correctly load the form, but when I set field values and submit, the following error occurs. This appears to be related to https, but again the error is beyond me.
>>> from browser import Browser
>>> b=Browser()
>>> b.get('http://www.yahoo.com/')
>>> b.follow_link('Mail')
>>> b.dump_forms()
Form login_form
Action: https://login.yahoo.com/config/login?1c907898i8vmr
Method: POST
Hidden: .tries 1
Hidden: .src ym
Hidden: .md5 (no value)
Hidden: .hash (no value)
Hidden: .js (no value)
Hidden: .last (no value)
Hidden: promo (no value)
Hidden: .intl us
Hidden: .bypass (no value)
Hidden: .partner (no value)
Hidden: .u 06ta7l8vvr0ds
Hidden: .v 0
Hidden: .challenge hmb1p1cZHX45VKiGm9wQchCdirYw
Hidden: .yplus (no value)
Hidden: .emailCode (no value)
Hidden: pkg (no value)
Hidden: stepid (no value)
Hidden: .ev (no value)
Hidden: hasMsgr 0
Hidden: .chkP Y
Hidden: .done http://mail.yahoo.com
Textbox: login (no value)
Password: passwd (no value)
Checkbox: .persistent y (off)
Button: .save Sign In
>>> b.form('login_form')
>>> b.field('login','fred')
>>> b.submit()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "browser.py", line 340, in submit
    self.get(action,method,self.__form.fieldValues)
  File "browser.py", line 221, in get
    fp = ClientCookie.urlopen(newuri,urllib.urlencode(data))
  File "C:\Python23\Lib\site-packages\ClientCookie\_urllib2_support.py", line 829, in urlopen
    return _opener.open(url, data)
  File "C:\Python23\Lib\site-packages\ClientCookie\_urllib2_support.py", line 520, in open
    response = urllib2.OpenerDirector.open(self, req, data)
  File "C:\Python23\lib\urllib2.py", line 338, in open
    'unknown_open', req)
  File "C:\Python23\lib\urllib2.py", line 313, in _call_chain
    result = func(*args)
  File "C:\Python23\lib\urllib2.py", line 862, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: https>
>>>
Comments and help appreciated
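The last line of that traceback says no handler was registered for the https scheme. One common cause is a Python build without SSL support, though that is a guess from the message and not something confirmed for this setup. A quick way to check on a current Python 3 might look like the sketch below (https_supported is a hypothetical helper name).

```python
import urllib.request


def https_supported():
    """Return True if this interpreter can open https:// URLs."""
    try:
        import ssl  # noqa: F401  (missing when Python is built without SSL)
    except ImportError:
        return False
    # urllib.request only defines HTTPSHandler when SSL support is present.
    return hasattr(urllib.request, "HTTPSHandler")


print("https supported:", https_supported())
```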
Can someone e-mail me and tell me what kind of program Python 2.21 is and what it is used for in XP?
Gary,
Python is a programming language. Browser.py, the subject of this page, is an automated web testing tool which is written in Python (and therefore needs Python present on your system to run). Get Python from http://www.python.org/. If you do any programming then you will find it a simpler and more powerful way to work than whatever you’re currently using.
This program cannot be interpreted correctly at all. It keeps telling me that it cannot find a specific module, xml.dom.ext.reader. Just an FYI.
I can’t go to Yahoo email and Yahoo Messenger. Can you tell me what the reason is?