Beautiful Code, Spring 2003: Explore 11

Today we're going to take a closer look at collections in Python, and two special features (iterators and generators) that let you create your own collection-like objects with special properties.

1. The Collection Protocol (again)

You probably recall from an earlier lab exercise that there are some special methods you can implement to emulate the behaviour of sequences. Here they are once more:

len(x)       can be implemented by     x.__len__()
x[i]         can be implemented by     x.__getitem__(i)
x[i] = y     can be implemented by     x.__setitem__(i, y)
del x[i]     can be implemented by     x.__delitem__(i)
y in x       can be implemented by     x.__contains__(y)

This time, i'll actually ask you to give this a whirl.

Have a look at the following, which is just the beginning of a class whose instances represent a collection of files in a directory, using dictionary-like behaviour.

import os

class DictDir:
    def __init__(self, path):
        self.path = path
        if not os.path.exists(path):
            os.makedirs(path)           # see help(os.makedirs)

    def keys(self):
        return os.listdir(self.path)

    ...

A DictDir instance should be keyed on the names of files in its directory; getting or setting the value associated with a key should get or set the contents of the file. For example, if the directory /tmp/spam already contains a text file named Manuel, and the file contains the string “Barcelona”, we would want the following behaviour:

>>> dd = DictDir('/tmp/spam')
>>> dd['manuel']
'Barcelona'
>>> 'basil' in dd
0
>>> dd['basil'] = 'Fawlty'
>>> 'basil' in dd
1
>>> dd['basil']
'Fawlty'
>>> del dd['manuel']
>>> dd.keys()
['basil']
>>>

(And at this point, /tmp/spam/manuel would be gone, and there would be a new text file at /tmp/spam/basil containing the string “Fawlty”.)

Q1. Fill in the DictDir class by implementing the methods of the collection protocol. Tip: Instead of manipulating strings yourself to assemble the directory name and filename into a path, use the handy os.path.join() function, which works on a variety of platforms.

2. The Default Iteration Protocol

What happens when you try to loop over a DictDir?

If your implementation is anything like mine, you'll probably get an error.

>>> dd = DictDir('/tmp/spam')
>>> for filename in dd:
...     print filename
...
Traceback (most recent call last):
  File "<stdin>", line 2, in ?
  File "dictdir.py", line 13, in __getitem__
    path = os.path.join(self.path, name)
  File "/usr/lib/python2.2/posixpath.py", line 48, in join
    if b[:1] == '/':
TypeError: unsubscriptable object
>>>

What's happening here?

Q2. What is causing this error? Tip: Try adding print statements to your __getitem__ method.

Check back at the bottom of Explore 6 for a description of what the for loop is doing here.

3. Iterators

For situations like this, Python provides a special iterator protocol to give you more control over how for loops operate on your objects. The iterator protocol also affects iteration performed by other built-in functions, including list(), map(), filter(), and reduce().

In Python, an iterator is an object whose job is to obtain the next item from a collection of items. Iterators pretty much have to be mutable, because every time an iterator returns an item, it has to move forward to the next item. And iterators are usually a separate object from the collection being iterated over, since you usually want to be able to iterate over a collection without destroying it.

When Python starts a loop over an object, it looks for a method named __iter__() on the object. It calls this method to obtain an iterator. Then, each time through the loop, it calls the next() method on the iterator to get the next item. Iteration stops when the iterator raises a StopIteration exception.

In order to be a proper iterator, an object must implement just two methods: __iter__(), which should just return self (since an iterator is its own iterator), and next().

Here's an example of an iterator that would work for the above DictDir object.

class ListIterator:
    def __init__(self, items):
        self.items = items[:]           # make a copy of the list
        self.index = -1

    def __iter__(self):
        return self

    def next(self):
        self.index += 1
        if self.index < len(self.items):
            return self.items[self.index]
        else:
            raise StopIteration

This iterator would work on any list. Then we would add this iterator to the DictDir object like this:

class DictDir:
    ...
    def __iter__(self):
        return ListIterator(self.keys())
    ...

After making the above change, we would be able to write for filename in dd where dd is a DictDir object, and thereby loop over the filenames in the directory.

I showed you ListIterator so you would see an example of how iterators are implemented. But note that in real life, ListIterator is actually unnecessary, because it implements exactly the normal iteration behaviour of a list. Python has a built-in function iter() that already implements the default iteration behaviour. So it would also work to add just this to DictDir:

class DictDir:
    ...
    def __iter__(self):
        return iter(self.keys()[:])
    ...

In general, there are two ways to use any iterator.

Loop over it: for item in iterator: ....
Call its next() method repeatedly: while blah: ... iterator.next() ...

The first way is simpler; the second way gives you more control. You might want to use the second way if you have to go off and do something else before coming back to deal with the next item.

Okay. Now here's a little iterator exercise.

Recall that the built-in range() function produces a list of numbers, which you can then iterate over. But this can be wasteful of time and space; if you write for i in range(1000000): print i, Python will actually construct a list with a million numbers in it before starting the loop.

Iterators are one way to improve the situation. If you have an object that stores only the necessary information about the range (range() takes three arguments: start, stop, and step), it can easily compute the numbers one by one as they are needed, instead of wasting a lot of space on the entire list.

Here's a start.

class Range:
    def __init__(self, start, stop, step):
        self.start, self.stop, self.step = start, stop, step

    def __iter__(self):
        ...

class RangeIterator:
    def __init__(self, ...):
        ...

    def __iter__(self):
        return self

    def next(self):
        ...

Q3. Fill in the parts marked with ... in the above example, so that for i in Range(x, y, z) has the same effect as for i in range(x, y, z), only without ever constructing the entire list of numbers.

Q4. Write an iterator that takes any sequence or iterator as an argument, and produces every other item of the given sequence (starting with the first item, then the third, and so on).

Generators

Sometimes when you're trying to write an iterator, it isn't practical to design a next() method that can return after each item. There may be too much going on to store and remember in the iterator object.

So Python recently gained the ability to do something truly remarkable: it can suspend functions in the middle of execution, and resume them later. This really changes what you can do with an iterator. It also means you can have multiple suspended functions waiting around, and you can pick which one you want to resume.

This ability is triggered by the yield keyword. It is aptly named, for it both yields a value to its caller, and yields control back to its caller. A function which generates values in this way is called a generator. Generators are defined using def just like normal functions, but if the keyword yield appears anywhere in the function, you get a generator object instead of a normal function. And generator objects can be used as iterators.

For example, here is a generator that produces the sequence 1, -1, 2, -2, 3, -3, 4, -4, 5, -5.

def alternate():
    for i in range(1, 6):
        yield i
        yield -i

Writing the same thing without generators would be much more tedious, in part because the class now has to remember whether the next number should be positive or negative:

class alternate():
    def __init__(self):
        self.i = 1
        self.negative = 0
    def __iter__(self):
        return self
    def next(self):
        if self.negative:
            item = -self.i
            self.i += 1
        else:
            if self.i >= 6:
                raise StopIteration
            item = self.i
        self.negative = not self.negative
        return item

The generator gains a lot of simplicity because the interpreter remembers where it was in the generator each time the generator pauses; it remembers that it was inside a loop, and it remembers which yield statement it was on. In the non-generator version, both of these things have to be explicitly stored as variables.

Note that generators are such a new feature that they are enabled by default only in Python 2.3. To use generators in Python 2.2, you need to say

from __future__ import generators

at the top of your Python program, or at the beginning of your interpreter session.

Q5. Write an generator that takes any sequence or iterator as an argument, and produces every other item of the given sequence (starting with the first item, then the third, and so on).

Generators are particularly powerful when you want to do iteration over a hierarchy instead of a list, since generators can call other generators recursively.

Q6. Write a generator that returns the list of all files in a subdirectory tree.

Here's a start:

import os

def allfiles(directory):
    for filename in os.listdir(directory):
        path = os.path.join(directory, filename)
        if os.path.isdir(path):
            ...
        if os.path.isfile(path):
            ...

Fill in the parts marked with ....

There's no assignment this week. Please spend the time working out your project's design and completing the test suite, which will be due on April 7. On April 7, the following are due:

Outline of the modules and classes in your project.
Documentation explaining what each function and method is supposed to do. (It doesn't have to be a lot. But write enough so that I can see how all the pieces work and fit together.)
A test suite (in unittest style) containing tests for any functions and methods that can be tested independently of the whole project. (You don't have to write tests for things that can only be tested by a user. But try to separate as much functionality as possible out of the user interface and into modules.) There should be one test file corresponding to each module (if the module is called spam.py, put the tests in test_spam.py).