Beautiful Code, Spring 2003: Explore 3

Threeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee.

Make sure you find your partner before starting this. You and your partner will need to arrange a time to meet before the next homework assignment is due.

1. Optional and Keyword Arguments

This is a feature of function calls in Python that isn't present in most other programming languages. When you define a function, the arguments have names:

>>> def parrot(first, second, third):
...     print first, second, third
...
>>>

You can use those names when you call the function, to specify the arguments in any order you want. You simply write the argument name and an equals sign in front of each argument's value:

>>> parrot(1, 2, 3)
1 2 3
>>> parrot(first='we', third='pythons', second='like')
we like pythons
>>>

This can help make your program more resilient to change when you're using a function that has lots of arguments. Someone else can rearrange the arguments and your program will still work.

You can use the same syntax when defining a function to specify an optional argument. The value after the equals sign is the default value that the argument will get if the caller doesn't specify that argument. You can have as many optional arguments as you like, but they must go at the end of the list.

In the following example, the exponent argument is optional and defaults to 2.

>>> def root(x, exponent=2):
...     return x ** (1.0 / exponent)
... 
>>> root(4)
2.0
>>> root(5)
2.2360679774997898
>>> root(5, 3)
1.7099759466766968
>>>

These two features are very handy together when you're describing an operation that has many small options. For example, if you were writing a drawing program, it might have a function like this:

def draw_circle(x, y, radius=100, thickness=1, colour='black', fill='white'):
    ...

Then you could call this function as simply draw_circle(200, 200), or as draw_circle(200, 200, colour='red'), or as draw_circle(200, 200, radius=50, fill='blue'), depending on how specific you wanted to be.
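
If you want to see how those three calls get filled in, here is a rough sketch. The body of draw_circle below is a stand-in I made up: it draws nothing and simply hands back the values it received as a list, so you can inspect them.

```python
# Hypothetical stand-in for the drawing function: it draws nothing and
# just returns the settings it received, in the order they are declared.
def draw_circle(x, y, radius=100, thickness=1, colour='black', fill='white'):
    return [x, y, radius, thickness, colour, fill]

plain = draw_circle(200, 200)
red = draw_circle(200, 200, colour='red')
small_blue = draw_circle(200, 200, radius=50, fill='blue')

# plain      is [200, 200, 100, 1, 'black', 'white']
# red        is [200, 200, 100, 1, 'red', 'white']
# small_blue is [200, 200, 50, 1, 'black', 'blue']
```

Any option you don't mention keeps its default, and the ones you name can come in any order.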

The default values of the arguments can be expressions. The expressions are evaluated and wrapped up in the function at the time the function is defined, not when it is called.
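
A small sketch may make the timing clearer. The names base and add_base are made up for this illustration. The default is computed once, when the def statement runs, so changing base afterwards makes no difference.

```python
base = 10

def add_base(x, amount=base * 2):   # base * 2 is evaluated right here, giving 20
    return x + amount

first = add_base(1)      # 21

base = 1000              # changing base now has no effect on the default
second = add_base(1)     # still 21

third = add_base(1, 5)   # 6, because the caller supplied amount
```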

2. Iteration

The for keyword lets you run a loop that iterates over the elements of any sequence.

>>> for c in 'abcd':
...     print c
... 
a
b
c
d
>>> for item in [3, [], 'ooga', 5.7]:
...     print item
... 
3
[]
ooga
5.7
>>>

In the first example above, we're iterating over the four elements of a string. (Each character is a string of length 1.) In the second example, we're iterating over the four elements of a list.

We can also use for to iterate over the lines of a text file. Files are opened using the built-in open function.

Suppose the file foo.txt contains:

Ah. I'd like to have an argument, please.
Certainly sir. Have you been here before?
No, I haven't, this is my first time.
I see. Well, do you want to have just one argument,
or were you thinking of taking a course?

Then we could read the contents like this:

>>> file = open('foo.txt')
>>> for line in file:
...     print line
... 
Ah. I'd like to have an argument, please.

Certainly sir. Have you been here before?

No, I haven't, this is my first time.

I see. Well, do you want to have just one argument, 

or were you thinking of taking a course?

>>> line
'or were you thinking of taking a course?\n'
>>>

Notice a couple of things here. The printed output is double-spaced because each line we read from the file is a string with a newline character ('\n') on the end. The print statement adds its own newline character at the end, so we get an extra blank line.

Also notice that the variable line retains its last value after the for loop has ended. You can see from the above that it contains a trailing newline character.
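
One way to avoid the double spacing is to strip the newline off each line as you read it. The sketch below uses the rstrip() method, which is described in section 4. To keep the example self-contained it first writes a two-line file into a temporary directory; that setup is my own assumption, not part of the exercise.

```python
import os
import tempfile

# Create a small throwaway file so the example can run anywhere.
# (The location is made up for this sketch.)
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')
out = open(path, 'w')
out.write("Ah. I'd like to have an argument, please.\n")
out.write("Certainly sir. Have you been here before?\n")
out.close()

lines = []
infile = open(path)
for line in infile:
    lines.append(line.rstrip())   # rstrip() removes the trailing newline
infile.close()

# lines now holds two strings, neither ending in '\n'.
# The loop variable line still holds the last raw line, newline included.
```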

Iterating with a for statement always requires that you have a sequence to iterate over. So, in order to run a loop to count numbers, we have to generate a sequence of numbers. The range() function takes care of this for us. This function can take one, two, or three arguments:

range(end) produces the list of integers from 0 to end-1.
range(start, end) produces the list of integers from start to end-1.
range(start, end, step) produces the list of integers starting from start and changing by step each time, stopping before it reaches end.

>>> range(5)
[0, 1, 2, 3, 4]
>>> range(3, 8)
[3, 4, 5, 6, 7]
>>> range(3, 30, 7)
[3, 10, 17, 24]
>>> range(20, 10, -1)
[20, 19, 18, 17, 16, 15, 14, 13, 12, 11]
>>>
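
Putting range() together with for gives you a counting loop. Here is a minimal sketch; the variable names total and evens are just for illustration.

```python
total = 0
for i in range(1, 6):        # i takes the values 1, 2, 3, 4, 5
    total = total + i
# total is 15

evens = []
for i in range(0, 10, 2):    # i takes the values 0, 2, 4, 6, 8
    evens.append(i)
# evens is [0, 2, 4, 6, 8]
```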

Q1. What happens if you ask for the range() of a single negative number?

Q2. What happens if the start argument is bigger than the end argument?

Q3. What happens if some of the arguments contain fractions?

When you get to this point, get out of your chair, find someone else in the room, and ask them if they have any Grey Poupon. No, really. I mean it. Take a break.

3. List methods

You've already seen the .append() method on lists: it alters the list in place, adding one element on the end.

You can use dir() to discover the other methods on lists:

>>> dir([])
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__',
'__delslice__', '__eq__', '__ge__', '__getattribute__', '__getitem__',
'__getslice__', '__gt__', '__hash__', '__iadd__', '__imul__',
 '__init__', '__le__', '__len__', '__lt__', '__mul__', '__ne__',
 '__new__', '__reduce__', '__repr__', '__rmul__', '__setattr__',
 '__setitem__', '__setslice__', '__str__', 'append', 'count', 'extend',
 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
>>>

You can ignore all of the names that start and end with underscores, for now. Those are all special methods. (Special methods are methods that are called automatically instead of by name. For example, some of them correspond to operators: x + y corresponds to calling x.__add__(y). But we won't worry about this until later.)

The ones we care about are the nine "normal" methods:

x.append(item) adds a single item onto the end of list x, increasing the length of x by 1.
x.count(y) tells you how many times y occurs in the list x.
x.extend(items) expects another sequence as the argument, and appends all the elements of that sequence onto the end of x.
x.index(y) tells you the first position at which y occurs as an element of x.
x.insert(index, item) inserts an item into the list x at a particular index, and shoves everything after it to the right.
x.pop(index) removes the item at position index from the list x, and returns it. If you leave out the index, it removes and returns the last item.
x.remove(y) finds the first occurrence of y in the list and removes it.
x.reverse() reverses the list x in place.
And x.sort() sorts the list x in place.
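
As a starting point, here is one short run through all nine methods on a made-up list. The comments show what x looks like after each step.

```python
x = [3, 1, 2]
x.append(1)            # x is [3, 1, 2, 1]
n = x.count(1)         # n is 2
x.extend([7, 8])       # x is [3, 1, 2, 1, 7, 8]
where = x.index(1)     # where is 1 (the first 1 sits at position 1)
x.insert(0, 'spam')    # x is ['spam', 3, 1, 2, 1, 7, 8]
gone = x.pop(0)        # gone is 'spam'; x is [3, 1, 2, 1, 7, 8]
x.remove(1)            # x is [3, 2, 1, 7, 8]
x.reverse()            # x is [8, 7, 1, 2, 3]
x.sort()               # x is [1, 2, 3, 7, 8]
```

Notice that only count(), index(), and pop() hand back something useful. The others change x and return nothing.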

Give them all a whirl.

Q4. What happens if you try to insert at a negative index?

Q5. What happens if you try to append a list to itself?

Q6. What happens if you try to extend a list with itself?

Q7. Write a little function to find the median of a list of numbers.

4. String methods

Strings also have a lot of methods, which will come in handy as you're doing the assignment. Again, you can get a list of them using dir():

>>> dir('')
['__add__', '__class__', '__contains__', '__delattr__', '__eq__',
'__ge__', '__getattribute__', '__getitem__', '__getslice__', '__gt__',
'__hash__', '__init__', '__le__', '__len__', '__lt__', '__mul__',
'__ne__', '__new__', '__reduce__', '__repr__', '__rmul__',
'__setattr__', '__str__', 'capitalize', 'center', 'count', 'decode',
'encode', 'endswith', 'expandtabs', 'find', 'index', 'isalnum',
'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper',
'join', 'ljust', 'lower', 'lstrip', 'replace', 'rfind', 'rindex',
'rjust', 'rstrip', 'split', 'splitlines', 'startswith', 'strip',
'swapcase', 'title', 'translate', 'upper']
>>>

Here are the most commonly used ones:

x.strip() produces a new string by removing all the whitespace characters (spaces, tabs, newlines) from the beginning and end of x. The similar method x.lstrip() strips whitespace only off the beginning, and x.rstrip() strips whitespace only off the end.
x.split() produces a list of words by splitting x on whitespace. Any whitespace at the beginning and end of the string is ignored.
x.split(substring) splits the string x on every occurrence of substring.
x.join(list) takes a list of strings, and joins them all together with repetitions of x. If the list has length n, then n - 1 copies of x will be used to join it together.
x.replace(old, new) replaces every occurrence of old with new, and returns the new string.
x.lower() and x.upper() produce a new string with all the characters changed to lowercase or uppercase.
x.startswith(string) and x.endswith(string) compare the beginning or ending part of x to a given string.
x.find(substring) searches for a substring within the string x, and returns the index of the first occurrence. It returns -1 if the substring is not found. x.rfind(substring) searches from right to left, returning the last occurrence. Both these methods also accept an optional second argument, an index at which to start looking.
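
Here is a quick sketch of these methods at work on a made-up string. None of them changes s itself; each one produces a new value.

```python
s = '  spam and eggs  '

stripped = s.strip()                       # 'spam and eggs'
words = s.split()                          # ['spam', 'and', 'eggs']
fields = 'a,b,,c'.split(',')               # ['a', 'b', '', 'c']
joined = '-'.join(words)                   # 'spam-and-eggs'
swapped = stripped.replace('spam', 'ham')  # 'ham and eggs'
shout = stripped.upper()                   # 'SPAM AND EGGS'
starts = stripped.startswith('spam')       # true
pos = stripped.find('and')                 # 5
missing = stripped.find('ni')              # -1
last_g = stripped.rfind('g')               # 11
```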

Try them out.

There is also an operator called in that you can use to test if something is a member of a sequence.

>>> 'a' in 'abc'
1
>>> 'd' in 'abc'
0
>>> 3 in [1, 2, 3]
1
>>> 3 in [1, 2, 4]
0
>>>

Curiously enough, the opposite of in is not in.

>>> 3 not in [1, 2, 4]
1
>>>

Q8. Write a little function that takes a sentence and converts it (very simplistically) into Pig Latin. Each word that begins with a vowel should have "way" appended in the resulting sentence; each word that begins with a consonant should have the consonant moved to the end of the word and "ay" appended after that. Don't worry about punctuation, combined consonants, or capital letters; just handle these two simple cases:

>>> def piglatin(sentence):
...     <you fill this in>
...
>>> piglatin('  ethel the aardvark goes quantity surveying.')
'ethelway hetay aardvarkway oesgay uantityqay urveying.say'
>>>

When you get here, find someone you haven't met yet and show them your Pig Latin program.

5. Dictionaries

The dictionary type in Python is a totally new type of collection. Like lists, dictionaries contain things, but they don't order them in sequence. Instead, dictionaries contain associations between pairs of things. Each pair consists of a key and a value. You look up things in a dictionary by providing the key, and the dictionary returns the value.

To write a dictionary, you use curly braces, and join each key-value pair with a colon.

>>> d = {'a': 'aardvark', 'b': 'balloon'}
>>> print d
{'a': 'aardvark', 'b': 'balloon'}
>>> len(d)
2
>>> d['a']
'aardvark'
>>> d['b']
'balloon'
>>> d['c']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: c
>>> d['x'] = 'xylophone'
>>> print d
{'a': 'aardvark', 'b': 'balloon', 'x': 'xylophone'}
>>> d['x']
'xylophone'
>>>

As you can see from the example, len() returns the number of pairs in the dictionary. You use square brackets to look up things in a dictionary, just like a list, except the thing inside the square brackets is a key instead of a numeric index. You also use square brackets to add a new pair to a dictionary, like we did with the key 'x' and the value 'xylophone'.

There will be more to say about dictionaries later, but for now you only need to know two methods:

>>> d.keys()
['a', 'b', 'x']
>>> print d.get('c')
None
>>> print d.get('c', 'flugelhorn')
flugelhorn
>>>

The keys() method returns a list of the keys in a dictionary, which is helpful for getting at the contents. The keys will be returned in no particular order. Printing a dictionary also prints the key-value pairs in no particular order.

The get() method gives you a safe way of looking up a value when you don't know whether the key is present. If you call d.get(x) and x is not a key in the dictionary, you will get None. You can also specify a second argument to get(), which will be the default value returned if the key is missing.

You can also test to see if a key is in a dictionary using the in operator. Note that this only checks for the existence of a key; it doesn't check for a value.

>>> 'a' in d
1
>>> 'c' in d
0
>>> 'aardvark' in d
0
>>> for key in d: print key
...
a
b
x
>>>

The looping statement for x in d loops over the keys. It does the same thing as for x in d.keys().
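
Since the keys come back in no particular order, a common trick is to sort them before looping. The sketch below does that and looks up each value along the way. Wrapping keys() in list() is an extra precaution of mine so the same code also runs on newer versions of Python; the variable names are made up.

```python
d = {'a': 'aardvark', 'b': 'balloon', 'x': 'xylophone'}

keys = list(d.keys())
keys.sort()                      # now the order is predictable

pairs = []
for key in keys:
    pairs.append(key + ' is for ' + d[key])
# pairs is ['a is for aardvark', 'b is for balloon', 'x is for xylophone']

fallback = d.get('q', 'nothing yet')   # 'q' is not a key, so we get the default
```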

Q9. Write a little function that will count how many times each element occurs in a sequence. The result should be a dictionary; in each pair, the key should be one of the elements, and the value should be the number of times it occurred.

>>> count([3,7,6,5,5,6,7,3,5])
{3: 2, 5: 3, 6: 2, 7: 2}
>>> count('bcbabbebcbbabeba')
{'a': 3, 'c': 2, 'b': 9, 'e': 2}
>>>

6. Regular Expressions

The re module lets you search for more interesting patterns in strings, using the regular expressions we described in class.

A regular expression is a string that specifies a pattern for matching against other strings. Regular expressions use a special syntax (for example, to allow for wildcards).

Here are some of the common constructs in regular expressions:

. matches any character
spam matches the exact string "spam"
[abc] matches the character a, b, or c
[a-m] matches any lowercase letter from a to m
[^aeiou] matches any character except for a lowercase vowel

You can specify repetitions or optional parts using the following operators:

x* matches zero or more repetitions of x
x? matches zero or one occurrence of x
x+ matches at least one repetition of x

To make these operators apply to more than one character, you group parts of the expression in parentheses:

abc+ matches ab followed by at least one repetition of c
(abc)+ matches at least one repetition of abc

You can also combine parts of the expression:

spam|eggs matches spam or eggs

Finally, you can specify whether your pattern has to occur at the beginning or end of the string:

^pres will match any string that starts with pres
ing$ will match any string that ends with ing

There are more features and operators in regular expressions, but these should suffice for now. Here are some examples.

[aeiou]+ will match the entire string eeeee
[aeiou]+ will match the first three letters of the string auireu
[aeiou]+$ will match the last two letters of the string auireu
^[aeiou]+$ will not match the string auireu
[aeiou]+ will match the last two letters of the string zoo
^[aeiou]+ will not match the string zoo

To use regular expressions in a program, you must import the re module and use it to compile your regular expressions. This will produce a pattern object. The pattern object has a method, search(), that you can then call to search a string for a match. Here are some of the above examples in Python:

>>> import re
>>> pat = re.compile('[aeiou]+')
>>> pat.search('eeeee')
<_sre.SRE_Match object at 0x81c2b90>
>>> pat.search('auireu')
<_sre.SRE_Match object at 0x8193b38>
>>>

The weird-looking SRE_Match object represents the results of the match:

>>> match = pat.search('auireu')
>>> match.start()
0
>>> match.end()
3
>>> match.group()
'aui'
>>> match = pat.search('zoo')
>>> match.start()
1
>>> match.end()
3
>>> match.group()
'oo'
>>>

The group() method tells you what part matched, and the start() and end() methods return its position in the string.

The search() method on a pattern will return None if there is no match.

>>> pat = re.compile('^[aeiou]+')
>>> print pat.search('zoo')
None
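
Because search() can return None, it is a good idea to check the result before calling methods on it. Here is a minimal sketch of that check; leading_vowels is a made-up helper name, not something from the re module.

```python
import re

pat = re.compile('^[aeiou]+')

def leading_vowels(word):
    # Hypothetical helper: return the run of vowels at the start of word,
    # or an empty string if the word doesn't start with a vowel.
    match = pat.search(word)
    if match is None:
        return ''
    return match.group()

a = leading_vowels('zoo')      # ''
b = leading_vowels('auireu')   # 'aui'
```

Calling match.group() without the check would fail on 'zoo', because None has no group() method.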

Patterns also have a method called sub(replacement, string) that will find all occurrences of the pattern in the string and substitute in a replacement.

>>> pat = re.compile('[aeiou]+')
>>> pat.sub('', 'zozoozieiou')
'zzz'
>>> pat.sub('ee', 'zozoozieiou')
'zeezeezee'
>>>

The replacement string can refer to what was matched in the original string. Every time you use a pair of parentheses to group together part of a regular expression, that part is called a group. When a pattern search is performed, the part of the original string that matches each group is saved in the match object. The first left-parenthesis starts group 1, the second left-parenthesis starts group 2, and so on.

Wherever the replacement string contains a backslash followed by a number, the result will substitute a copy of the group referenced by that number. An example will probably help make this clearer:

>>> pat = re.compile('(a..) (b..)')
>>> pat.sub('\\2 \\1', 'art bat cat ack bop')
'bat art cat bop ack'
>>>

In the above example, the pattern looks for a three-letter word starting with 'a', followed by a space, followed by a three-letter word starting with 'b'. The pattern matches the string "art bat cat ack bop" twice: it matches "art bat" and also matches "ack bop". Both of these matches are replaced in the result. The replacement string is the second group, followed by a space, followed by the first group. So the effect is to swap the two words "art" and "bat", and to swap the two words "bop" and "ack".
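
You can also look at the saved groups directly. As a small sketch, passing a group number to the match object's group() method returns the text that group matched; the variable names here are just for illustration.

```python
import re

pat = re.compile('(a..) (b..)')
match = pat.search('art bat cat ack bop')

whole = match.group()     # 'art bat', the whole match
first = match.group(1)    # 'art', what group 1 matched
second = match.group(2)   # 'bat', what group 2 matched

# The same substitution as above, for comparison.
swapped = pat.sub('\\2 \\1', 'art bat cat ack bop')   # 'bat art cat bop ack'
```

Note that search() only reports the first match, while sub() replaces every match it finds.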

The special pattern '\b' matches the boundary at the beginning or end of a word, and the special pattern '\w' matches a character in a word (a letter, a digit, or an underscore).

Given this, you can do the Pig Latin transformation of a sentence in just two steps. First, you handle all the words that start with vowels; then you handle all the words that start with consonants. Here's the first step:

>>> aardvark = 'ethel the aardvark goes quantity surveying.'
>>>
>>> pat = re.compile('\\b([aeiouAEIOU]\\w*)')
>>> pat.sub('\\1way', aardvark)
'ethelway the aardvarkway goes quantity surveying.'
>>>

Q10. Try writing the regular expression and substitution for the second step, words that start with consonants.

Q11. (Optional, open-ended.) Adjust your regular expression to handle more interesting cases, like words that start with "tr" or "qu".

I know regular expressions are kind of hairy-looking. Please feel free to ask me about them if you find them confusing.

Onward to the third assignment.

If you have any questions about these exercises or the assignment, feel free to send me e-mail at bczestyca.