DekGenius.com
[ Team LiB ] Previous Section Next Section

27.4 Data Structure Manipulations

One of Python's greatest features is that it provides the list, tuple, and dictionary built-in types. They are so flexible and easy to use that once you've grown used to them, you'll find yourself reaching for them automatically. While we covered all of the operations on each data structure as we introduced them, now's a good time to go over tasks that can apply to all data structures, such as how to make copies, sort objects, randomize sequences, etc. Many functions and algorithms (theoretical procedures describing how to implement a complex task in terms of simpler basic tasks) are designed to work regardless of the type of data being manipulated. It is therefore useful to know how to do generic things for all data types.

27.4.1 Making Copies

Making copies of objects is a reasonable task in many programming contexts. Often, the only kind of copy that's needed is just another reference to an object, as in:

x = 'tomato'
y = x                   # y is now 'tomato'.
x = x + ' and cucumber' # x is now 'tomato and cucumber', but y is unchanged.

Due to Python's reference management scheme, the statement a = b doesn't make a copy of the object referenced by b; instead, it makes a new reference to that same object. When the object being copied is an immutable object (e.g., a string), there is no real difference. When dealing with mutable objects like lists and dictionaries, however, sometimes a real, new copy of the object, not just a shared reference, is needed. How to do this depends on the type of the object in question. The simplest way of making a copy is to use the list( ) or tuple( ) constructors:

newList = list(myList)
newTuple = tuple(myTuple)

As opposed to the simplest, the most common way to make copies of sequences like lists and tuples is somewhat odd. If myList is a list, then to make a copy of it, you can use:

newList = myList[:]

which you can read as "slice from beginning to end," since the default index for the start of a slice is the beginning of the sequence (0), and the default index for the end of a slice is the end of sequence (see Chapter 3). Since tuples support the same slicing operation as lists, this same technique can also be applied to tuples, except that if x is a tuple, then x[:] is the same object as x, since tuples are immutable. Dictionaries, on the other hand, don't support slicing. To make a copy of a dictionary myDict, you can either use:

newDict = myDict.copy(  )

or the dict( ) constructor:

newDict = dict(myDict)

For a different kind of copying, if you have a dictionary oneDict, and want to update it with the contents of a different dictionary otherDict, simply type oneDict.update(otherDict). This is the equivalent of:

for key in otherDict.keys(  ):
    oneDict[key] = otherDict[key]

If oneDict shared some keys with otherDict before the update( ) operation, the old values associated with the keys in oneDict are obliterated by the update. This may be what you want to do (it usually is). If it isn't, the right thing to do might be to raise an exception. To do this, make a copy of one dictionary, then look over each entry in the second. If we find shared keys, we raise an exception, if not, we just add the key-value mapping to the new dictionary.

def mergeWithoutOverlap(oneDict, otherDict):
    newDict = oneDict.copy(  )
    for key in otherDict:
        if key in oneDict:
            raise ValueError, "the two dictionaries share keys!"
        newDict[key] = otherDict[key]
    return newDict

or, alternatively, combine the values of the two dictionaries, with a tuple, for example. Using the same logic as in mergeWithoutOverlap, but combining the values instead of throwing an exception:

def mergeWithOverlap(oneDict, otherDict):
    newDict = oneDict.copy(  )
    for key in otherDict:
        if key in oneDict:
            newDict[key] = oneDict[key], otherDict[key]
        else:
            newDict[key] = otherDict[key]
    return newDict

To illustrate the differences between the preceding three algorithms, consider the following two dictionaries:

phoneBook1 = {'michael': '555-1212', 'mark': '554-1121', 'emily': '556-0091'}
phoneBook2 = {'latoya': '555-1255', 'emily': '667-1234'}

If phoneBook1 is possibly out of date, and phoneBook2 is more up to date but less complete, the right usage is probably phoneBook1.update(phoneBook2). If the two phoneBooks are supposed to have nonoverlapping sets of keys, using newBook = mergeWithoutOverlap(phoneBook1, phoneBook2) lets you know if that assumption is wrong. Finally, if one is a set of home phone numbers and the other a set of office phone numbers, chances are newBook = mergeWithOverlap(phoneBook1, phoneBook2) is what you want, as long as the subsequent code that uses newBook can deal with the fact that newBook['emily'] is the tuple ('556-0091', '667-1234').

27.4.1.1 The copy module

The [:] and .copy( ) tricks will get you copies in 90% of the cases. If you are writing functions that, in true Python spirit, can deal with arguments of any type, it's sometimes necessary to make copies of x, regardless of what x is. In comes the copy module. It provides two functions, copy and deepcopy. The first is just like the [:] sequence slice operation or the copy method of dictionaries. The second is more subtle and has to do with deeply nested structures (hence the term deepcopy). Take the example of copying a list (listOne) by slicing it from beginning to end using the [:] construct. This technique makes a new list that contains references to the same objects contained in the original list. If the contents of that original list are immutable objects, such as numbers or strings, the copy is as good as a "true" copy. However, suppose that the first element in listOne is itself a dictionary (or any other mutable object). The first element of the copy of listOne is a new reference to the same dictionary. So if you then modify that dictionary, the modification is evident in both listOne and the copy of listOne. An example makes it much clearer:

>>> import copy
>>> listOne = [{"name": "Willie", "city": "Providence, RI"}, 1, "tomato", 3.0]
>>> listTwo = listOne[:]                   # Or listTwo=copy.copy(listOne)
>>> listThree = copy.deepcopy(listOne)
>>> listOne.append("kid")
>>> listOne[0]["city"] = "San Francisco, CA"
>>> print listOne, listTwo, listThree
[{'name': 'Willie', 'city': 'San Francisco, CA'}, 1, 'tomato', 3.0, 'kid']
[{'name': 'Willie', 'city': 'San Francisco, CA'}, 1, 'tomato', 3.0]
[{'name': 'Willie', 'city': 'Providence, RI'}, 1, 'tomato', 3.0]

As you can see, modifying listOne directly modified only listOne. Modifying the first entry of the list referenced by listOne led to changes in listTwo, but not in listThree; that's the difference between a shallow copy ([:]) and a deep copy. The copy module functions know how to copy all the built-in types that are reasonably copyable,[4] including classes and instances.

[4] Some objects don't qualify as "reasonably copyable," such as modules, file objects, and sockets. Remember that file objects are different from files on disk as they are opened at a particular point, and are possibly not even fully written to disk yet. For copying files on disk, the shutil module is introduced later in this chapter.

27.4.2 Sorting

Lists have a sort method that does an in-place sort. Sometimes you want to iterate over the sorted contents of a list, without disturbing the contents of this list. Or you may want to list the sorted contents of a tuple. Because tuples are immutable, an operation such as sort, which modifies it in place, is not allowed. The only solution is to make a list copy of the elements, sort the list copy, and work with the sorted copy, as in:

listCopy = list(myTuple)
listCopy.sort(  )
for item in listCopy:
    print item                             # Or whatever needs doing

This solution is also the way to deal with data structures that have no inherent order, such as dictionaries. One of the reasons that dictionaries are so fast is that the implementation reserves the right to change the order of the keys in the dictionary. It's really not a problem, however, given that you can iterate over the keys of a dictionary using an intermediate copy of the keys of the dictionary:

keys = myDict.keys(  )                          # Returns an unsorted list of
                                           # the keys in the dict.
keys.sort(  )
for key in keys:                           # Print key/value pairs 
    print key, myDict[key]                 # sorted by key.

The sort method on lists uses the standard Python comparison scheme. Sometimes, however, that scheme isn't what's needed, and you need to sort according to some other procedure. For example, when sorting a list of words, case (lower versus UPPER) may not be significant. The standard comparison of text strings, however, says that all uppercase letters come before all lowercase letters, so 'Baby' is less than 'apple', but 'baby' is greater than 'apple'. In order to do a case-independent sort, you need to define a comparison function that takes two arguments, and returns -1, 0, or 1 depending on whether the first argument is smaller than, equal to, or greater than the second argument. So, for case-independent sorting, you can use:

>>> def caseIndependentSort(something, other):
...    something, other = something.lower(  ), other.lower(  )
...    return cmp(something, other)
... 
>>> testList = ['this', 'is', 'A', 'sorted', 'List']
>>> testList.sort(  )
>>> print testList
['A', 'List', 'is', 'sorted', 'this']
>>> testList.sort(caseIndependentSort)
>>> print testList
['A', 'is', 'List', 'sorted', 'this']

We're using the built-in function cmp, which does the hard part of figuring out that 'a' comes before 'b', 'b' before 'c', etc. Our sort function simply converts both items to lowercase and compares the lowercase versions. Also note that the conversion to lowercase is local to the comparison function, so the elements in the list aren't modified by the sort.

27.4.3 Randomizing

What about randomizing a sequence, such as a list of lines? The easiest way to randomize a sequence is to call the shuffle function in the random module, which randomizes a sequence in-place:[5]

[5] Another useful function in the random module is the choice function, which returns a random element in the sequence passed in as argument.

random.shuffle(myList)

If you need to shuffle a nonlist object, it's usually easiest to convert that object to a list and shuffle the list version of the same data, rather than come up with a new strategy for each data type. This might seem a wasteful strategy, given that it involves building intermediate lists that might be quite large. In general, however, what seems large to you probably won't seem so to the computer, thanks to the reference system. Also, consider the time saved by not having to come up with a different strategy for each data type! Python is designed to save programmer time; if that means running a slightly slower or bigger program, so be it. If you're handling enormous amounts of data, it may be worthwhile to optimize. But never optimize until the need for optimization is clear; that would be a waste of your time.

27.4.4 Making New Data Structures

This chapter emphasizes the silliness involved in reinventing wheels. This point is especially important when it comes to data structures. For example, Python lists and dictionaries might not be the lists and dictionaries or mappings you're used to, but you should avoid designing your own data structure if these structures will suffice. The algorithms they use have been tested under wide ranges of conditions, and they're fast and stable. Sometimes, however, the interface to these algorithms isn't convenient for a particular task.

For example, computer science textbooks often describe algorithms in terms of other data structures such as queues and stacks. To use these algorithms, it may make sense to come up with a data structure that has the same methods as these data structures (such as pop and push for stacks or enqueue/dequeue for queues). However, it also makes sense to reuse the built-in list type in the implementation of a stack. In other words, you need something that acts like a stack but is based on a list. A simple solution is to use a class wrapper around a list. For a minimal stack implementation, you can do this:

class Stack:
    def __init__(self, data):
        self._data = list(data)
        self.push = data.append
        self.pop = data.pop

The following is simple to write, to understand, to read, and to use:

>>> thingsToDo = Stack(['write to mom', 'invite friend over', 'wash the kid'])
>>> thingsToDo.push('do the dishes')
>>> print thingsToDo.pop(  )
do the dishes
>>> print thingsToDo.pop(  )
wash the kid

Two standard Python naming conventions are used in the Stack class above. The first is that class names start with an uppercase letter, to distinguish them from functions. The other is that the _data attribute starts with an underscore. This is a half-way point between public attributes (which don't start with an underscore), private attributes (which start with two underscores; see Chapter 7), and Python-reserved identifiers (which both start and end with two underscores). What it means is that _data is an attribute of the class that shouldn't be needed by clients of the class. The class designer expects such pseudo-private attributes to be used only by the class methods and by the methods of any eventual subclass.

27.4.5 Making New Lists and Dictionaries

The Stack class presented earlier does its minimal job just fine. It assumes a fairly minimal definition of what a stack is, specifically, something that supports just two operations, a push and a pop. Some of the features of lists would be nice to use, such as the ability to iterate over all the elements using the for . . . in . . . construct. While you could continue in the style of the previous class and delegate to the "inner" list object, at some point it makes more sense to simply reuse the implementation of list objects directly, through subclassing. In this case, you should derive a class from the list base class. The dict base class can also be used to create dictionary-like classes.

# Subclass the list class.
class Stack(list):
    push = list.append

This Stack is a subclass of the list class. The list class implements the pop methods among others. You don't need to define your own __init__ method because list defines a perfectly good default. The push method is defined just by saying that it's the same as list's append method. Now we can do list-like things as well as stack-like things:

>>> thingsToDo = Stack(['write to mom', 'invite friend over', 'wash the kid'])
>>> print thingsToDo                  # Inherited from list base class
['write to mom', 'invite friend over', 'wash the kid']
>>> thingsToDo.pop(  )        
'wash the kid'
>>> thingsToDo.push('change the oil')
>>> for chore in thingsToDo:          # We can also iterate over the contents.
...    print chore 
...
write to mom
invite friend over
change the oil
    [ Team LiB ] Previous Section Next Section