scrape.py is a Python module for scraping content from webpages. Using it, you can easily fetch pages, follow links, and submit forms. Cookies, redirections, and SSL are handled automatically. (For SSL, you need either a version of Python with the socket.ssl function, or the curl command-line utility.) scrape.py does not parse the page into a complete parse tree, so it can handle pages with sloppy syntax. You are free to locate content in the page according to nearby text, tags, or even comments.

You can download the module or read the documentation page. This code is released under the Apache License, version 2.0.
Here's a quick walkthrough.
Fetching a page
To fetch a page, you call the go(url) method on a Session object. The module provides a default session object in a variable called s.
>>> from scrape import *
>>> s.go('http://zesty.ca/')
<Region 0:25751>
The result is a Region object spanning the entire retrieved document (all 25751 bytes). Region objects are what you use to get around inside an HTML document; they represent a region of text in the HTML source code, with a starting point and an ending point.

After any successful fetch, the session's doc attribute also contains the document. The headers attribute contains the headers that were received, and the url attribute contains the URL that was retrieved (which might be different from the URL you requested, if redirection took place).
>>> s.doc
<Region 0:25751>
>>> s.headers
{'content-length': '25751', 'accept-ranges': 'bytes', 'server': 'Apache/2.2.8',
'last-modified': 'Tue, 10 Sep 2013 21:38:28 GMT', 'connection': 'close',
'etag': '"5f4b02-6497-4e60e5347fd00"', 'date': 'Tue, 10 Sep 2013 21:55:37 GMT',
'content-type': 'text/html'}
>>> s.url
'http://zesty.ca/'
On a Region, the raw content is available in the content attribute, and the plain text is available in the text attribute. (In this case, both of these are Unicode strings because a Unicode encoding was specified by the server.)
>>> d = s.doc
>>> print d.content[:70]
<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN">
<html
>>> d.text[:30]
u'Ka-Ping Yee\nKa-Ping Yee pingze'
Extracting content
A Region object can be sliced, just like a string. The object remembers its starting and ending positions in the original document, but any indices you supply are with respect to the region itself.
>>> d
<Region 0:25751>
>>> r = d[1400:1450]
>>> r
<Region 1400:1450>
>>> len(r)
50
>>> r.start
1400
>>> r.end
1450
>>> r.content
'g-flowers-crop.jpg" alt="picture of Ping among som'
>>> r[-15:]
<Region 1435:1450>
Call first(tagname) on a Region object to find the first block (inside the region) with matching start and end tags of the tag name you specify. The resulting Region object retains information about the tag; you get dictionary-style access to the attributes. The region starts just after the start tag and ends just before the end tag.
>>> title = d.first('title')
>>> title
<Region 311:322 title>
>>> title.tagname
'title'
>>> title.text
u'Ka-Ping Yee'
>>> span = d.first('span')
>>> span
<Region 2791:2792 span class='flag'>
>>> span.keys()
['class']
>>> span['class']
'flag'
last(tagname) finds the last block inside the region; next(tagname) finds the first block that starts after the end of the region; and previous(tagname) finds the last block that ends before the start of the region.
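The walkthrough doesn't show these lookups in action, so here is a rough plain-Python sketch of their positional semantics on a made-up snippet. This is only an illustration of the start/end arithmetic described above, not scrape.py's implementation, and the sample HTML is invented:

```python
import re

# Toy source: three <p> blocks with a <b> block between the first two.
source = '<p>one</p><b>mid</b><p>two</p><p>three</p>'

def blocks(tagname):
    """(start, end) offsets of the content of each <tag>...</tag> block."""
    pat = re.compile(r'<%s>(.*?)</%s>' % (tagname, tagname))
    return [(m.start(1), m.end(1)) for m in pat.finditer(source)]

# Pretend the current region is the content of the <b> block.
start, end = blocks('b')[0]

p_blocks = blocks('p')
# next('p'): the first <p> block that starts after the region's end.
nxt = [b for b in p_blocks if b[0] >= end][0]
# previous('p'): the last <p> block that ends before the region's start.
prev = [b for b in p_blocks if b[1] <= start][-1]

print(source[nxt[0]:nxt[1]])    # 'two'
print(source[prev[0]:prev[1]])  # 'one'
```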
Basic navigation
If you know the exact text of a link anchor, follow(anchor) will find the link, resolve it, and follow it. There happens to be a link on my home page that says "CV".
>>> s.follow('CV')
<Region 0:22594>
>>> s.headers
{'date': 'Tue, 10 Sep 2013 21:59:31 GMT', 'accept-ranges': 'bytes',
'content-type': 'text/html', 'connection': 'close', 'server': 'Apache/2.2.8'}
>>> s.url
'http://zesty.ca/cv.html'
>>> s.doc
<Region 0:22594>

The doc attribute contains the retrieved document (the same thing returned by go(), submit(), or follow()).
Instead of the exact anchor text, you can supply a regular expression for the anchor. There's a link on my CV to the University of Waterloo, but the text isn't exactly "Waterloo". It ends in "Waterloo", though.
>>> s.follow('Waterloo')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ping/python/scrape.py", line 243, in follow
    raise ScrapeError('link %r not found' % anchor)
scrape.ScrapeError: link 'Waterloo' not found
>>> s.follow(re.compile('.*waterloo', re.I))
<Region 0:41738>
Calling back() takes us back to my CV page.
>>> s.url
'http://www.uwaterloo.ca/'
>>> s.doc
<Region 0:41738>
>>> s.back()
'http://zesty.ca/cv.html'
>>> s.doc
<Region 0:22594>
A Region object can be associated with an HTML element, in which case the starting point is just after the start tag, and the ending point is just before the end tag; or it can be associated with an individual tag, in which case the starting point is just before the "<" and the ending point is just after the ">".
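As a concrete sketch of these two kinds of spans, here is the same offset arithmetic worked out in plain Python on a made-up snippet (again, an illustration of the rule above, not scrape.py itself):

```python
source = '<p>Hello</p>'

# Element region: starts just after the start tag, ends just before the end tag.
elem_start = source.index('>') + 1   # just past '<p>'
elem_end = source.index('</p>')      # just before '</p>'
print(source[elem_start:elem_end])   # 'Hello'

# Tag region: starts just before '<' and ends just after '>' of a single tag.
tag_start = source.index('<p>')
tag_end = source.index('>') + 1
print(source[tag_start:tag_end])     # '<p>'
```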
>>> from scrape import *
>>> s.go('http://zesty.ca/')
<Region 0:25751>