Today we're going to take a closer look at collections in Python, and two special features (iterators and generators) that let you create your own collection-like objects with special properties.
1. The Collection Protocol (again)
You probably recall from an earlier lab exercise that there are some special methods you can implement to emulate the behaviour of sequences. Here they are once more:
len(x) can be implemented by x.__len__() x[i] can be implemented by x.__getitem__(i) x[i] = y can be implemented by x.__setitem__(i, y) del x[i] can be implemented by x.__delitem__(i) y in x can be implemented by x.__contains__(y)
This time, i'll actually ask you to give this a whirl.
Have a look at the following, which is just the beginning of a class whose instances represent a collection of files in a directory, using dictionary-like behaviour.
import os class DictDir: def __init__(self, path): self.path = path if not os.path.exists(path): os.makedirs(path) # see help(os.makedirs) def keys(self): return os.listdir(self.path) ...
A DictDir
instance should be keyed on the names of
files in its directory; getting or setting the value associated
with a key should get or set the contents of the file.
For example, if the directory /tmp/spam
already
contains a text file named Manuel
,
and the file contains the string “Barcelona”,
we would want the following behaviour:
>>> dd = DictDir('/tmp/spam') >>> dd['manuel'] 'Barcelona' >>> 'basil' in dd 0 >>> dd['basil'] = 'Fawlty' >>> 'basil' in dd 1 >>> dd['basil'] 'Fawlty' >>> del dd['manuel'] >>> dd.keys() ['basil'] >>>
(And at this point,
/tmp/spam/manuel
would be gone,
and there would be a new text file at
/tmp/spam/basil
containing the string
“Fawlty”.)
Q1. Fill in the DictDir
class
by implementing the methods of the collection protocol.
Tip: Instead of manipulating strings yourself to
assemble the directory name and filename into a path,
use the handy os.path.join()
function,
which works on a variety of platforms.
2. The Default Iteration Protocol
What happens when you try to loop over a DictDir
?
If your implementation is anything like mine, you'll probably get an error.
>>> dd = DictDir('/tmp/spam') >>> for filename in dd: ... print filename ... Traceback (most recent call last): File "<stdin>", line 2, in ? File "dictdir.py", line 13, in __getitem__ path = os.path.join(self.path, name) File "/usr/lib/python2.2/posixpath.py", line 48, in join if b[:1] == '/': TypeError: unsubscriptable object >>>
What's happening here?
Q2.
What is causing this error?
Tip: Try adding print
statements to your __getitem__
method.
Check back at the bottom of Explore 6
for a description of what the for
loop is doing here.
3. Iterators
For situations like this,
Python provides a special iterator protocol
to give you more control over how for
loops
operate on your objects.
The iterator protocol also affects iteration performed
by other built-in functions,
including list()
, map()
,
filter()
, and reduce()
.
In Python, an iterator is an object whose job is to obtain the next item from a collection of items. Iterators pretty much have to be mutable, because every time an iterator returns an item, it has to move forward to the next item. And iterators are usually a separate object from the collection being iterated over, since you usually want to be able to iterate over a collection without destroying it.
When Python starts a loop over an object,
it looks for a method named __iter__()
on the object. It calls this method to obtain
an iterator. Then, each time through the loop,
it calls the next()
method on the
iterator to get the next item.
Iteration stops when the iterator raises a
StopIteration
exception.
In order to be a proper iterator,
an object must implement just two methods:
__iter__()
,
which should just return self
(since an iterator is its own iterator),
and next()
.
Here's an example of an iterator that would work
for the above DictDir
object.
class ListIterator: def __init__(self, items): self.items = items[:] # make a copy of the list self.index = -1 def __iter__(self): return self def next(self): self.index += 1 if self.index < len(self.items): return self.items[self.index] else: raise StopIteration
DictDir
object like this:
class DictDir: ... def __iter__(self): return ListIterator(self.keys()) ...
for filename in dd
where dd
is a DictDir
object,
and thereby loop over the filenames in the directory.
I showed you ListIterator
so you would see an example of how iterators are implemented.
But note that in real life, ListIterator
is actually unnecessary, because it implements exactly
the normal iteration behaviour of a list.
Python has a built-in function iter()
that already implements the default iteration behaviour.
So it would also work to add just this to DictDir
:
class DictDir: ... def __iter__(self): return iter(self.keys()[:]) ...
In general, there are two ways to use any iterator.
- Loop over it:
for item in iterator: ...
. - Call its
next()
method repeatedly:while blah: ... iterator.next() ...
The first way is simpler; the second way gives you more control. You might want to use the second way if you have to go off and do something else before coming back to deal with the next item.
Okay. Now here's a little iterator exercise.
Recall that the built-in range()
function
produces a list of numbers, which you can then iterate over.
But this can be wasteful of time and space;
if you write for i in range(1000000): print i
,
Python will actually construct a list with a million numbers
in it before starting the loop.
Iterators are one way to improve the situation.
If you have an object that stores only the necessary
information about the range
(range()
takes three arguments:
start, stop, and step), it can easily compute the
numbers one by one as they are needed,
instead of wasting a lot of space on the entire list.
Here's a start.
class Range: def __init__(self, start, stop, step): self.start, self.stop, self.step = start, stop, step def __iter__(self): ... class RangeIterator: def __init__(self, ...): ... def __iter__(self): return self def next(self): ...
Q3.
Fill in the parts marked with ...
in the above example,
so that for i in Range(x, y, z)
has the same effect as for i in range(x, y, z)
,
only without ever constructing the entire list of numbers.
Q4. Write an iterator that takes any sequence or iterator as an argument, and produces every other item of the given sequence (starting with the first item, then the third, and so on).
Generators
Sometimes when you're trying to write an iterator,
it isn't practical to design a next()
method
that can return after each item.
There may be too much going on to store and
remember in the iterator object.
So Python recently gained the ability to do something truly remarkable: it can suspend functions in the middle of execution, and resume them later. This really changes what you can do with an iterator. It also means you can have multiple suspended functions waiting around, and you can pick which one you want to resume.
This ability is triggered by the yield
keyword.
It is aptly named, for it both yields a value to
its caller, and yields control back to its caller.
A function which generates values in this way is called
a generator.
Generators are defined using def
just like
normal functions, but if the keyword yield
appears anywhere in the function, you get a generator
object instead of a normal function.
And generator objects can be used as iterators.
For example, here is a generator that produces the sequence 1, -1, 2, -2, 3, -3, 4, -4, 5, -5.
def alternate(): for i in range(1, 6): yield i yield -i
Writing the same thing without generators would be much more tedious, in part because the class now has to remember whether the next number should be positive or negative:
class alternate(): def __init__(self): self.i = 1 self.negative = 0 def __iter__(self): return self def next(self): if self.negative: item = -self.i self.i += 1 else: if self.i >= 6: raise StopIteration item = self.i self.negative = not self.negative return item
The generator gains a lot of simplicity because
the interpreter remembers where it was in the generator
each time the generator pauses; it remembers that it
was inside a loop, and it remembers which yield
statement it was on. In the non-generator version,
both of these things have to be explicitly stored as variables.
Note that generators are such a new feature that they are enabled by default only in Python 2.3. To use generators in Python 2.2, you need to say
from __future__ import generators
Q5. Write an generator that takes any sequence or iterator as an argument, and produces every other item of the given sequence (starting with the first item, then the third, and so on).
Generators are particularly powerful when you want to do iteration over a hierarchy instead of a list, since generators can call other generators recursively.
Q6. Write a generator that returns the list of all files in a subdirectory tree.
Here's a start:
import os def allfiles(directory): for filename in os.listdir(directory): path = os.path.join(directory, filename) if os.path.isdir(path): ... if os.path.isfile(path): ...
Fill in the parts marked with ...
.
There's no assignment this week. Please spend the time working out your project's design and completing the test suite, which will be due on April 7. On April 7, the following are due:
- Outline of the modules and classes in your project.
- Documentation explaining what each function and method is supposed to do. (It doesn't have to be a lot. But write enough so that I can see how all the pieces work and fit together.)
- A test suite (in
unittest
style) containing tests for any functions and methods that can be tested independently of the whole project. (You don't have to write tests for things that can only be tested by a user. But try to separate as much functionality as possible out of the user interface and into modules.) There should be one test file corresponding to each module (if the module is calledspam.py
, put the tests intest_spam.py
).