scrape.py

Ka-Ping Yee

scrape.py is a Python module for scraping content from webpages. Using it, you can easily fetch pages, follow links, and submit forms. Cookies, redirections, and SSL are handled automatically. (For SSL, you either need a version of Python with the socket.ssl function, or the curl command-line utility.)

scrape.py does not parse the page into a complete parse tree, so it can handle pages with sloppy syntax. You are free to locate content in the page according to nearby text, tags, or even comments.

You can download the module or read the documentation page. This code is released under the Apache License, version 2.0.

Here's a quick walkthrough.

Fetching a page

To fetch a page, you call the go(url) method on a Session object. The module provides a default session object in a variable called s.

>>> from scrape import *
>>> s.go('http://zesty.ca/')
<Region 0:25751>

The result is a Region object spanning the entire retrieved document (all 25751 bytes). Region objects are what you use to get around inside an HTML document; they represent a region of text in the HTML source code, with a starting point and an ending point.

After any successful fetch, the session's doc attribute also contains the document. The headers attribute contains the headers that were received, and the url attribute contains the URL that was retrieved (which might be different from the URL you requested, if redirection took place).

>>> s.doc
<Region 0:25751>
>>> s.headers
{'content-length': '25751',
 'accept-ranges': 'bytes',
 'server': 'Apache/2.2.8',
 'last-modified': 'Tue, 10 Sep 2013 21:38:28 GMT',
 'connection': 'close',
 'etag': '"5f4b02-6497-4e60e5347fd00"',
 'date': 'Tue, 10 Sep 2013 21:55:37 GMT',
 'content-type': 'text/html'}
>>> s.url
'http://zesty.ca/'

On a Region, the raw content is available in the content attribute, and the plain text is available in the text attribute. (In this case, both of these are Unicode strings because a Unicode encoding was specified by the server.)

>>> d = s.doc
>>> print d.content[:70]
<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN">
<html
>>> d.text[:30]
u'Ka-Ping Yee\nKa-Ping Yee pingze'

Extracting content

A Region object can be sliced, just like a string. The object remembers its starting and ending positions in the original document, but any indices you supply are with respect to the region itself.

>>> d
<Region 0:25751>
>>> r = d[1400:1450]
>>> r
<Region 1400:1450>
>>> len(r)
50
>>> r.start
1400
>>> r.end
1450
>>> r.content
'g-flowers-crop.jpg" alt="picture of Ping among som'
>>> r[-15:]
<Region 1435:1450>

Call first(tagname) on a Region object to find the first block (inside the region) with matching start and end tags of the tag name you specify. The resulting Region object retains information about the tag; you get dictionary-style access to the attributes. The region starts just after the start tag and ends just before the end tag.

>>> title = d.first('title')
>>> title
<Region 311:322 title>
>>> title.tagname
'title'
>>> title.text
u'Ka-Ping Yee'
>>> span = d.first('span')
>>> span
<Region 2791:2792 span class='flag'>
>>> span.keys()
['class']
>>> span['class']
'flag'

last(tagname) finds the last block inside the region; next(tagname) finds the first block that starts after the end of the region; and previous(tagname) finds the last block that ends before the start of the region.
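
For instance, continuing with the regions found above (a sketch only -- the tag names are arbitrary and assume such blocks exist on the page):

>>> last_link = d.last('a')      # the last <a>...</a> block in the whole document
>>> after = title.next('p')      # the first <p> block starting after the title
>>> before = span.previous('a')  # the last <a> block ending before the span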

Basic navigation

If you know the exact text of a link anchor, follow(anchor) will find the link, resolve it, and follow it. There happens to be a link on my home page that says "CV".

>>> s.follow('CV')
<Region 0:22594>
>>> s.headers
{'date': 'Tue, 10 Sep 2013 21:59:31 GMT',
 'accept-ranges': 'bytes',
 'content-type': 'text/html',
 'connection': 'close',
 'server': 'Apache/2.2.8'}
>>> s.url
'http://zesty.ca/cv.html'
>>> s.doc
<Region 0:22594>

The doc attribute contains the retrieved document (the same thing returned by go(), submit(), or follow()).

Instead of the exact anchor text, you can supply a regular expression for the anchor. There's a link on my CV to the University of Waterloo, but the text isn't exactly "Waterloo". It ends in "Waterloo", though.

>>> s.follow('Waterloo')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ping/python/scrape.py", line 243, in follow
    raise ScrapeError('link %r not found' % anchor)
scrape.ScrapeError: link 'Waterloo' not found
>>> import re
>>> s.follow(re.compile('.*waterloo', re.I))
<Region 0:41738>

Calling back() takes us back to my CV page.

>>> s.url
'http://www.uwaterloo.ca/'
>>> s.doc
<Region 0:41738>
>>> s.back()
'http://zesty.ca/cv.html'
>>> s.doc
<Region 0:22594>

A Region object can be associated with an HTML element, in which case the starting point is just after the start tag, and the ending point is just before the end tag; or it can be associated with an individual tag, in which case the starting point is just before the "<" and the ending point is just after the ">".
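
For instance, the span block found earlier (on the home page) is associated with an element, so its content excludes the tags; a slice of the parent document taken just outside its boundaries takes in the surrounding markup. (This is only a sketch -- the 30-character padding is arbitrary.)

>>> inner = span.content                      # just the text between the <span> tags
>>> outer = d[span.start - 30:span.end + 30]  # a region whose content includes the tags themselves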
