
27.5 Manipulating Files and Directories

So far so good—we know how to create objects, we can convert between different data types, and we can perform various kinds of operations on them. In practice, however, as soon as one leaves the computer science classroom one is faced with tasks that involve manipulating data that lives outside of the program and performing processes that are external to Python. That's when it becomes very handy to know how to talk to the operating system, explore the filesystem, and read and modify files.

27.5.1 The os and os.path Modules

The os module provides a generic interface to the operating system's most basic set of tools. Different operating systems behave differently, and this is true at the programming interface as well, which makes it hard to write so-called "portable" programs that run well regardless of the operating system. Having generic interfaces independent of the operating system helps, as does using an interpreted language like Python. The specific set of calls the os module defines depends on which platform you use. (For example, the permission-related calls are available only on platforms that support them, such as Unix and Windows.) Nevertheless, it's recommended that you always use the os module, instead of the platform-specific versions of the module (with names such as posix, nt, and mac). Table 27-4 lists some of the most often used functions in the os module. When referring to files in the context of the os module, one is referring to filenames, not file objects.

Table 27-4. Most frequently used functions from the os module

Function name

Behavior

getcwd( )

Returns a string referring to the current working directory (cwd):

>>> print os.getcwd(  )
h:\David\book

listdir(path)

Returns a list of all of the files in the specified directory:

>>> os.listdir(os.getcwd(  ))
['preface.doc', 'part1.doc', 'part2.doc']

chown(path, uid, gid)

Changes the owner ID and group ID of specified file

chmod(path, mode)

Changes the permissions of specified file with numeric mode mode (e.g., 0644 means read/write for owner, read for everyone else)

rename(src, dest)

Renames file named src with name dest

remove(path) or unlink(path)

Deletes specified file (see rmdir( ) to remove directories)

rmdir(path)

Deletes specified directory

removedirs(path)

Works like rmdir( ) except that if the leaf directory is successfully removed, directories corresponding to rightmost path segments will be pruned away.

mkdir(path[, mode])

Creates a directory named path with numeric mode mode (see os.chmod):

>>> os.mkdir('newdir')

makedirs(path[, mode])

Like mkdir( ), but makes all intermediate-level directories needed to contain the leaf directory:

>>> os.makedirs('newdir/newsubdir/newsubsubdir')

system(command)

Executes the shell command in a subshell; the return value is the return code of the command

symlink(src, dest)

Creates soft link from file src to file dest

link(src, dest)

Creates hard link from file src to file dest

stat(path)

Returns data about the file, such as size, last modified time, and ownership:

>>> os.stat('TODO.txt') 
# It returns something like a tuple.
(33206, 0L, 3, 1, 0, 0, 1753L, 1042186004, 
1042186004, 1042175785)
>>> os.stat('TODO.txt').st_size 
# Just look at the size.
1753L
>>> time.asctime(time.localtime
              (os.stat('TODO.txt').st_mtime))
'Fri Jan 10 00:06:44 2003'

walk(top, topdown=True, onerror=None) (Python 2.3 and later)

For each directory in the directory tree rooted at top (including top itself, but excluding '.' and '..'), yield a 3-tuple:

dirpath, dirnames, filenames

With just these modules, you can find out a lot about the current state of the filesystem, as well as modify it:

>>> print os.getcwd(  )        # Where am I?
C:\Python22
>>> print os.listdir('.')    # What's here?
['DLLs', 'Doc', 'include', 'Lib', 'libs', 'License.txt', ...]
>>> os.chdir('Lib')          # Let's go explore the library.
>>> print os.listdir('.')    # What's here?
['aifc.py', 'anydbm.py', 'anydbm.pyc', 'asynchat.py',
'asyncore.py', 'atexit.py', 'atexit.pyc', 'atexit.pyo',
'audiodev.py', 'base64.py', ...]
>>> os.remove('atexit.pyc')  # We can remove .pyc files safely.
>>>

There are many other functions in the os module; in fact, just about any function that's part of the POSIX standard and widely available on most Unix and Unix-like platforms is supported by Python on Unix. The interfaces to these routines follow the POSIX conventions. You can retrieve and set UIDs, PIDs, and process groups; control nice levels; create pipes; manipulate file descriptors; fork processes; wait for child processes; send signals to processes; use the execv variants; and so on. (If you don't know what half of the words in this paragraph mean, don't worry; you probably don't need to.)

The os module also defines some important attributes that aren't functions:

  • The os.name attribute defines the current version of the platform-specific operating-system interface. Registered values for os.name are 'posix', 'nt', 'dos', and 'mac'. It's different from sys.platform, primarily in that it's less specific—for example, Solaris and Linux will have the same value ('posix') for os.name, but different values of sys.platform.

  • os.error defines an exception class used when calls in the os module raise errors. It's the same thing as OSError, one of the built-in exception classes. When this exception is raised, the exception object contains two values: the number corresponding to the error (known as errno) and a string message explaining it (known as strerror):

    >>> os.rmdir('nonexistent_directory')      # How it usually shows up
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    os.error: (2, 'No such file or directory')
    >>> try:                                   # We can catch the error and take
    ...    os.rmdir('nonexistent directory')   # it apart.
    ... except os.error, value:
    ...     print value[0], value[1]
    ...
    2 No such file or directory
  • The os.environ dictionary contains key/value pairs corresponding to the environment variables of the shell from which Python was started. Because this environment is inherited by the commands that are invoked using the os.system call, modifying the os.environ dictionary modifies the environment:

    >>> print os.environ['SHELL']
    /bin/sh
    >>> os.environ['STARTDIR'] = 'MyStartDir'
    >>> os.system('echo $STARTDIR')           # 'echo %STARTDIR%' on DOS/Win
    MyStartDir                                # Printed by the shell
    0                                         # Return code from echo

The os module also includes a set of strings that define portable ways to refer to directory-related parts of filename syntax, as shown in Table 27-5.

Table 27-5. String attributes of the os module

Attribute name

Meaning and values

curdir

A string that denotes the current directory: '.' on Unix, DOS, and Windows; ':' on the Mac

pardir

A string that denotes the parent directory: '..' on Unix, DOS, and Windows; '::' on the Mac

sep

The character that separates pathname components: '/' on Unix, '\' on DOS and Windows, ':' on the Mac

altsep

An alternate character to sep when available; set to None on all systems except DOS and Windows, where it's '/'

pathsep

The character that separates path components: ':' on Unix, ';' on DOS and Windows

These strings are used by the functions in the os.path module, which manipulate file paths in portable ways (see Table 27-6). Note that the os.path module is an attribute of the os module, not a sub-module of an os package; it's imported automatically when the os module is loaded, and (unlike packages) you don't need to import it explicitly. The outputs of the examples in Table 27-6 correspond to code run on a Windows or DOS machine. On another platform, the appropriate path separators would be used instead. A useful relevant bit of knowledge is that the forward slash (/) can be used safely in Windows to indicate directory traversal, even though the native separator is the backslash (\)—Python and Windows both do the right thing with it.
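For instance, os.path.join inserts the platform's separator between components, so path-building code stays portable (a small sketch; the component names are purely illustrative):

```python
import os

# join puts os.sep between plain relative components, so the same call
# yields 'data/raw/log.txt' on Unix and 'data\raw\log.txt' on Windows.
path = os.path.join('data', 'raw', 'log.txt')
print(path == os.sep.join(['data', 'raw', 'log.txt']))   # True on any platform
```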

Table 27-6. Most frequently used functions from the os.path module

Function name

Behavior

split(path) (equivalent to the tuple (dirname(path), basename(path)))

Splits the given path into a pair consisting of a head and a tail; the head is the path up to the directory, and the tail is the filename:

>>> os.path.split("h:/David/book/part2.doc")
('h:/David/book', 'part2.doc')

splitdrive(p)

Splits a pathname into drive and path specifiers:

>>> os.path.splitdrive(r"C:\foo\bar.txt")
('C:', '\\foo\\bar.txt')

splitext(p)

Splits the extension from a pathname:

>>> os.path.splitext(r"C:\foo\bar.txt")
('C:\\foo\\bar', '.txt')

splitunc(p)

Splits a pathname into UNC mount point and relative path specifiers:

>>> os.path.splitunc(r"\\machine\mount\directory\file.txt")
('\\\\machine\\mount', '\\directory\\file.txt')

join(path, ...)

Joins path components intelligently:

>>> print os.path.join(os.getcwd(  ),
... os.pardir, 'backup', 'part2.doc')
h:\David\book\..\backup\part2.doc

exists(path)

Returns true if path corresponds to an existing path

expanduser(path)

Expands the argument with an initial argument of ~ followed optionally by a username:

>>> print os.path.expanduser('~/mydir')
h:\David\mydir

expandvars(path)

Expands the path argument with the variables specified in the environment:

>>> print os.path.expandvars('$TMP')
C:\TEMP

isfile(path), isdir(path), islink(path), ismount(path), isabs(path)

Returns true if the specified path is a file, directory, link, mount point, or an absolute path, respectively

getatime(filename), getmtime(filename), getsize(filename)

Gets the last access time, last modification time, and size of a file, respectively

normpath(path)

Normalizes the given path, collapsing redundant separators and uplevel references:

>>> print os.path.normpath("/foo/bar\\../tmp")
\foo\tmp

normcase(s)

Normalizes case of pathname; makes all characters lowercase and all slashes into backslashes:

>>> print os.path.normcase(r'c:/foo\BAR.txt')
c:\foo\bar.txt

samefile(p, q)

Returns true if both arguments refer to the same file

walk(p, visit, arg)

Calls the function visit with arguments (arg, dirname, names) for each directory in the directory tree rooted at p (including p itself, if it's a directory); the argument dirname specifies the visited directory; the argument names lists the files in the directory:

>>> def test_walk(arg, dirname, names):
...     print arg, dirname, names
...
>>> os.path.walk('..', test_walk, 'show')
show ..\logs ['errors.log', 'access.log']
show ..\cgi-bin ['test.cgi']
...
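Pulling a few of these functions together: here's a small sketch that lists the files in a directory with a given extension (the function name files_with_extension is our own, not part of the library):

```python
import os

def files_with_extension(dirname, ext):
    # Keep only the names whose os.path.splitext tail matches ext and
    # that are actual files (not subdirectories).
    result = []
    for name in os.listdir(dirname):
        if os.path.splitext(name)[1] == ext:
            if os.path.isfile(os.path.join(dirname, name)):
                result.append(name)
    return result

print(files_with_extension(os.curdir, '.py'))
```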

27.5.2 Copying Files and Directories: The shutil Module

The keen-eyed reader might have noticed that the os module, while it provides lots of file-related functions, doesn't include a copy function. In DOS, copying a file is basically the same thing as opening one file in read/binary mode, reading all its data, opening a second file in write/binary mode, and writing the data to the second file. On Unix and Windows, making that kind of copy fails to copy the stat bits (permissions, modification times, etc.) associated with the file. On the Mac, that operation won't copy the resource fork, which contains data such as icons and dialog boxes. In other words, copying files is just more complicated than one could reasonably believe. Nevertheless, often you can get away with a fairly simple function that works on Windows, DOS, Unix, and Mac, as long as you're manipulating just data files with no resource forks. That function, called copyfile, lives in the shutil module. This module includes a few generally useful functions, shown in Table 27-7.

Table 27-7. Functions of the shutil module

Function name

Behavior

copyfile(src, dest)

Makes a copy of the file src and calls it dest (straight binary copy).

copymode(src, dest)

Copies mode information (permissions) from src to dest.

copystat(src, dest)

Copies all stat information (mode, utime) from src to dest.

copy(src, dest)

Copies data and mode information from src to dest (doesn't include the resource fork on Macs).

copy2(src, dest)

Copies data and stat information from src to dest (doesn't include the resource fork on Macs).

copytree(src, dest, symlinks=0)

Copies a directory recursively using copy2. The symlinks flag specifies whether symbolic links in the source tree must result in symbolic links in the destination tree, or whether the files being linked to must be copied. The destination directory must not already exist.

rmtree(path, ignore_errors=0, onerror=None)

Recursively deletes the directory indicated by path. If ignore_errors is set to a true value, errors are ignored; otherwise (the default), if onerror is set, it's called to handle the error, and if not, an exception is raised on error.
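A quick round trip through copytree and rmtree, using a throwaway directory so nothing of value is at risk:

```python
import os, shutil, tempfile

# Build a small throwaway tree, copy it with copytree, then remove
# both copies with rmtree.
src = tempfile.mkdtemp()
f = open(os.path.join(src, 'notes.txt'), 'w')
f.write('hello')
f.close()

dest = src + '_copy'            # copytree requires that dest not exist yet
shutil.copytree(src, dest)
print(os.path.exists(os.path.join(dest, 'notes.txt')))   # True

shutil.rmtree(src)
shutil.rmtree(dest)
```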

27.5.3 Filenames and Directories

While the previous section lists common functions for working with files, many tasks require more than a single function call.

Let's take a typical example: you have lots of files, all of which have a space in their name, and you'd like to replace the spaces with underscores. All you need is the os.curdir attribute (which returns an operating-system specific string that corresponds to the current directory), the os.listdir function (which returns the list of filenames in a specified directory), and the os.rename function:

import os, sys
if len(sys.argv) == 1:                     # If no filenames are specified,
    filenames = os.listdir(os.curdir)      # use current dir;
else:                                      # otherwise, use files specified
    filenames = sys.argv[1:]               # on the command line.
for filename in filenames:
    if ' ' in filename:
        newfilename = filename.replace(' ', '_')
        print "Renaming", filename, "to", newfilename, "..."
        os.rename(filename, newfilename)

This program works fine, but it reveals a certain Unix-centrism. That is, if you call it with wildcards, such as:

python despacify.py *.txt

you find that on Unix machines, it renames all the files with names with spaces in them and that end with .txt. In a DOS-style shell, however, this won't work because the shell normally used in DOS and Windows doesn't convert from *.txt to the list of filenames; it expects the program to do it. This is called globbing, because the * is said to match a glob of characters. Luckily, Python helps us make the code portable.

27.5.4 Matching Sets of Files

The glob module exports a single function, also called glob, which takes a filename pattern and returns a list of all the filenames that match that pattern (in the current working directory):

import sys, glob
print sys.argv[1:]
sys.argv = [item for arg in sys.argv for item in glob.glob(arg)]
print sys.argv[1:]

Running this on Unix and DOS shows that on Unix, the Python glob didn't do anything because the globbing was done by the Unix shell before Python was invoked, and in DOS, Python's globbing came up with the same answer:

/usr/python/book$ python showglob.py *.py
['countlines.py', 'mygrep.py', 'retest.py', 'showglob.py', 'testglob.py']
['countlines.py', 'mygrep.py', 'retest.py', 'showglob.py', 'testglob.py']

C:\python\book> python showglob.py *.py
['*.py']
['countlines.py', 'mygrep.py', 'retest.py', 'showglob.py', 'testglob.py']

It's worth looking at the list comprehension line in showglob.py (the line that rebuilds sys.argv) and understanding exactly what happens there, especially if you're new to the list comprehension concept (discussed in Chapter 14).
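The doubly nested comprehension reads left to right as two nested for loops. A small sketch with a hardcoded argument list makes the equivalence explicit:

```python
import glob

args = ['showglob.py', '*.py']

# [item for arg in args for item in glob.glob(arg)] is the same as:
result = []
for arg in args:
    for item in glob.glob(arg):
        result.append(item)

print(result == [item for arg in args for item in glob.glob(arg)])   # True
```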

27.5.5 Using Temporary Files

If you've ever written a shell script and needed to use intermediary files for storing the results of some intermediate stages of processing, you probably suffered from directory litter. You started out with 20 files called log_001.txt, log_002.txt, etc., and all you wanted was one summary file called log_sum.txt. In addition, you had a whole bunch of log_001.tmp, log_001.tm2, etc. files that, while they were labeled temporary, stuck around. To put order back into your directories, use temporary files in specific directories and clean them up afterwards.

To help in this temporary file management problem, Python provides a nice little module called tempfile that publishes two functions: mktemp( ) and TemporaryFile( ). The former returns the name of a file not currently in use in a directory on your computer reserved for temporary files (such as /tmp on Unix or C:\TEMP on Windows). The latter returns a new file object directly. For example:

# Read input file
inputFile = open('input.txt', 'r')

import tempfile
# Create temporary file
tempFile = tempfile.TemporaryFile(  )                   # We don't even need to 
first_process(input = inputFile, output = tempFile)   # know the filename...

# Rewind the temporary file so the second step reads it from the start.
tempFile.seek(0)

# Create final output file
outputFile = open('output.txt', 'w')
second_process(input = tempFile, output = outputFile)

Using tempfile.TemporaryFile( ) works well in cases where the intermediate steps manipulate file objects. One of its nice features is that when the file object is deleted, it automatically deletes the file it created on disk, thus cleaning up after itself. One important use of temporary files, however, is in conjunction with the os.system call, which means using a shell, hence using filenames, not file objects. For example, let's look at a program that creates form letters and mails them to a list of email addresses (on Unix only):

import os, tempfile

formletter = """Dear %s,\nI'm writing to you to suggest that ..."""    # etc. 
myDatabase = [('Michael Jackson', 'michael@neverland.odd'),
              ('Bill Gates', 'bill@microsoft.com'),
              ('Bob', 'bob@subgenius.org')]
for name, email in myDatabase:
    specificLetter = formletter % name
    tempfilename = tempfile.mktemp(  )
    tempfileobj = open(tempfilename, 'w')   # Don't name this 'tempfile', or it
    tempfileobj.write(specificLetter)       # would shadow the module!
    tempfileobj.close(  )
    os.system('/usr/bin/mail %(email)s -s "Urgent!" < %(tempfilename)s' % vars(  )) 
    os.remove(tempfilename)

The first line in the for loop returns a customized version of the form letter based on the name it's given. That text is then written to a temporary file that's emailed to the appropriate email address using the os.system call. Finally, to clean up, the temporary file is removed.

The vars( ) function is a built-in function that returns a dictionary corresponding to the variables defined in the current local namespace. The keys of the dictionary are the variable names, and the values of the dictionary are the variable values. vars( ) comes in quite handy for exploring namespaces. It can also be called with an object as an argument (such as a module, a class, or an instance), and it will return the namespace of that object. Two other built-ins, locals( ) and globals( ), return the local and global namespaces, respectively. In all three cases, modifying the returned dictionaries doesn't guarantee any effect on the namespace in question, so view these as read-only and you won't be surprised. You can see that the vars( ) call creates a dictionary that is used by the string interpolation mechanism; it's thus important that the names inside the %(...)s bits in the string match the variable names in the program.
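A minimal illustration of vars( ) applied to a local namespace:

```python
def sample():
    x = 3
    label = 'spam'
    # With no argument, vars() returns the local namespace as a dictionary:
    # variable names as keys, variable values as values.
    return vars()

print(sample())
```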

27.5.6 Modifying Input and Outputs

The argv attribute of the sys module holds one set of inputs to the current program—the command-line arguments, more precisely a list of the words input on the command line, excluding the reference to Python itself if it exists. In other words, if you type at the shell:

csh> python run.py a x=3 foo

then when run.py starts, the value of the sys.argv attribute is ['run.py', 'a', 'x=3', 'foo']. The sys.argv attribute is mutable (after all, it's just a list). Common usage involves iterating over the arguments of the Python program, that is, sys.argv[1:]; slicing from index 1 till the end gives all of the arguments to the program itself, but doesn't include the name of the program (module) stored in sys.argv[0]. There are two modules that help you process command line options. The first, an older module called getopt, is replaced in Python 2.3 by a similar but more powerful module called optparse. Check the library reference for further details on how to use them.

Experienced programmers will know that there are other inputs to a program, especially the standard input stream, with siblings for output and error messages. Python lets the programmer access and modify these through three file attributes in the sys module: sys.stdin, sys.stdout, and sys.stderr. Standard input is generally associated by the operating system with the user's keyboard; standard output and standard error are usually associated with the console. The print statement in Python outputs to standard output (sys.stdout), while error messages such as exceptions are output on the standard error stream (sys.stderr). Python lets you modify these on the fly: you can redirect the output of a Python program to a file simply by assigning to sys.stdout:

sys.stdout = open('log.out', 'w')

After this line, any output will be written to the file log.out instead of showing up on the console. Note that if you don't save it first, the reference to the "original" standard out stream is lost. It's generally a good idea to save a reference before reallocating any of the standard streams, as in:

old_stdout = sys.stdout
sys.stdout = open('log.out', 'w')
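And when you're done logging, restore the saved stream so later output goes back to the console (a sketch; 'log.out' is just an illustrative filename):

```python
import sys

old_stdout = sys.stdout                  # Save a reference to the real stream.
sys.stdout = open('log.out', 'w')
print('this goes to the log file')
sys.stdout.close()                       # Flush and close the log file...
sys.stdout = old_stdout                  # ...and console output is back.
print(open('log.out').read().strip())    # prints: this goes to the log file
```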

27.5.7 Using Standard I/O to Process Files

Why have a standard input stream? After all, it's not that hard to type open('input.txt') in the program. The major argument for reading and writing with standard streams is that you can chain programs so that the standard output from one becomes the standard input of the next, with no file used in the transfer. This facility, known as piping, is at the heart of the Unix philosophy. Using standard I/O this way means that you can write a program to do a specific task once, and then use it to process files or the intermediate results of other programs at any time in the future. As an example, a simple program that counts the number of lines in a file could be written as:

import sys
data = sys.stdin.readlines(  )
print "Counted", len(data), "lines."

On Unix, you could test it by doing something like:

% cat countlines.py | python countlines.py 
Counted 3 lines.

On Windows or DOS, you'd do:

C:\> type countlines.py | python countlines.py 
Counted 3 lines.

You can get each line in a file simply by iterating over a file object. This comes in very handy when implementing simple filter operations. Here are a few examples of such filter operations.

27.5.7.1 Finding all lines that start with a #
# Show comment lines (lines that start with a #, like this one).
import sys
for line in sys.stdin:
    if line[0] == '#':
        print line,

Note that a final comma is used at the end of the print statement to indicate that print should not add a newline; without it, the output would be double-spaced, since each line string already ends in a newline character.

The last two programs can easily be combined using pipes to combine their power. To count the number of comment lines in commentfinder.py:

C:\> type commentfinder.py | python commentfinder.py | python countlines.py
Counted 1 lines.

Some other filtering tasks that take from standard input and write to standard output follow.

27.5.7.2 Extracting the fourth column of a file (where columns are defined by whitespace)
import sys
for line in sys.stdin:
    words = line.split(  ) 
    if len(words) >= 4:
        print words[3]

We look at the length of the words list to check whether there are indeed at least four words. The last two lines could also be replaced by a try/except statement, which is quite common in Python:

try:
    print words[3]
except IndexError:                     # There aren't enough words.
    pass
27.5.7.3 Extracting the fourth column of a file, where columns are separated by colons, and making it lowercase
import sys
for line in sys.stdin:
    words = line.split(':') 
    if len(words) >= 4:
        print words[3].lower(  )

If iterating over all of the lines isn't what you want, just use the readlines( ) or read( ) methods of file objects.

27.5.7.4 Printing the first 10 lines, the last 10 lines, and every other line
import sys
lines = sys.stdin.readlines(  )
sys.stdout.writelines(lines[:10])          # First 10 lines
sys.stdout.writelines(lines[-10:])         # Last 10 lines
for lineIndex in range(0, len(lines), 2):  # Get 0, 2, 4, ...
    sys.stdout.write(lines[lineIndex])     # Get the indexed line.
27.5.7.5 Counting the number of times the word "Python" occurs in a file
text = open(fname).read(  )
print text.count('Python')
27.5.7.6 Changing a list of columns into a list of rows

In this more complicated example, the task is to transpose a file; imagine you have a file that looks like:

Name:   Willie   Mark   Guido   Mary  Rachel   Ahmed
Level:    5       4      3       1     6        4
Tag#:    1234   4451   5515    5124   1881    5132

And you really want it to look like the following instead:

Name:  Level:  Tag#:
Willie 5       1234
Mark   4       4451
...

You could use code like the following:

import sys
lines = sys.stdin.readlines(  )
wordlists = [line.split(  ) for line in lines]
for row in zip(*wordlists):
    print '\t'.join(row)

Of course, you should really use much more defensive programming techniques to deal with the possibility that not all lines have the same number of words in them, that there may be missing data, etc. Those techniques are task-specific and are left as an exercise to the reader.
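The zip(*wordlists) trick deserves a closer look: the * unpacks the list so that each row becomes a separate argument to zip, which then pairs up first elements, second elements, and so on—exactly a transpose. A tiny sketch:

```python
rows = [['Name:', 'Willie', 'Mark'],
        ['Level:', '5', '4'],
        ['Tag#:', '1234', '4451']]

# zip(*rows) is zip(rows[0], rows[1], rows[2]).
transposed = [list(t) for t in zip(*rows)]
print(transposed)
# [['Name:', 'Level:', 'Tag#:'], ['Willie', '5', '1234'], ['Mark', '4', '4451']]
```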

27.5.7.7 Choosing chunk sizes

All the preceding examples assume you can read the entire file at once. In some cases, however, that's not possible, for example, when processing really huge files on computers with little memory, or when dealing with files that are constantly being appended to (such as log files). In such cases, you can use a while/readline combination, where some of the file is read a bit at a time, until the end of file is reached. In dealing with files that aren't line-oriented, you can read the file a character at a time:

# Read character by character.
while 1:
    next = sys.stdin.read(1)            # Read a one-character string
    if not next:                        # or an empty string at EOF.
        break
    # Process character 'next'.

Notice that the read( ) method on file objects returns an empty string at end of file, which breaks out of the while loop. Most often, however, the files you'll deal with consist of line-based data and are processed a line at a time:

# Read line by line.
while 1:
    next = sys.stdin.readline(  )       # Read a one-line string
    if not next:                        # or an empty string at EOF.
        break
    # Process line 'next'.
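Between these two extremes you can also read fixed-size blocks, which is usually the right choice for binary or very large files. A sketch (the 4096-byte block size is just a common choice, not a requirement):

```python
def process_blocks(fileobj, blocksize=4096):
    # Read blocksize bytes at a time; read() returns an empty
    # string/bytes object at end of file, which ends the loop.
    nblocks = 0
    while 1:
        block = fileobj.read(blocksize)
        if not block:
            break
        nblocks = nblocks + 1
        # Process 'block' here.
    return nblocks
```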

27.5.8 Doing Something to a Set of Files Specified on the Command Line

Being able to read stdin is a great feature; it's the foundation of the Unix toolset. However, one input is not always enough: many tasks need to be performed on sets of files. This is usually done by having the Python program parse the list of arguments sent to the script as command-line options. For example, if you type:

% python myScript.py input1.txt input2.txt input3.txt output.txt

you might think that myScript.py wants to do something with the first three input files and write a new file, called output.txt. Let's see what the beginning of such a program could look like:

import sys
inputfilenames, outputfilename = sys.argv[1:-1], sys.argv[-1]
for inputfilename in inputfilenames:
    inputfile = open(inputfilename, "r")
    do_something_with_input(inputfile)
    inputfile.close(  )
outputfile = open(outputfilename, "w")
write_results(outputfile)
outputfile.close(  )

The second line extracts parts of the argv attribute of the sys module. Recall that it's a list of the words on the command line that called the current program. It starts with the name of the script. So, in the example above, the value of sys.argv is:

['myScript.py', 'input1.txt', 'input2.txt', 'input3.txt', 'output.txt'].

The script assumes that the command line consists of one or more input files and one output file. So the slicing of the input file names starts at 1 (to skip the name of the script, which isn't an input to the script in most cases), and stops before the last word on the command line, which is the name of the output file. The rest of the script should be pretty easy to understand (but won't work until you provide the do_something_with_input( ) and write_results( ) functions).

Note that the preceding script doesn't actually read in the data from the files, but passes the file object down to a function to do the real work. A generic version of do_something_with_input( ) is:

def do_something_with_input(inputfile):
    for line in inputfile:
        process(line)

27.5.9 Processing Each Line of One or More Files

The combination of this idiom with the preceding one regarding opening each file in the sys.argv[1:] list is so common that there is a module, fileinput, to do just this task:

import fileinput
for line in fileinput.input(  ):
    process(line)

The fileinput.input( ) call parses the arguments on the command line, and if there are no arguments to the script, uses sys.stdin instead. It also provides several useful functions that let you know which file and line number you're currently manipulating, as we can see in the following script:

import fileinput, sys
# Take the first argument out of sys.argv and assign it to searchterm.
searchterm, sys.argv[1:] = sys.argv[1], sys.argv[2:]
for line in fileinput.input(  ):
   num_matches = line.count(searchterm)
   if num_matches:                     # A nonzero count means there was a match.
       print "found '%s' %d times in %s on line %d." % (searchterm, num_matches, 
           fileinput.filename(  ), fileinput.filelineno(  ))

Running mygrep.py on a few Python files produces:

% python mygrep.py in *.py
found 'in' 2 times in countlines.py on line 2.
found 'in' 2 times in countlines.py on line 3.
found 'in' 2 times in mygrep.py on line 1.
found 'in' 4 times in mygrep.py on line 4.
found 'in' 2 times in mygrep.py on line 5.
found 'in' 2 times in mygrep.py on line 7.
found 'in' 3 times in mygrep.py on line 8.
found 'in' 3 times in mygrep.py on line 12.

27.5.10 Dealing with Binary Data: The struct Module

A file is considered a binary file if it's not a text file or a file written in a format based on text, such as HTML and XML. Image and sound files are prototypical examples of binary files. A frequent question about file manipulation is "How do I process binary files in Python?" The answer to that question usually involves the struct module. It has a simple interface, since it exports just three functions: pack, unpack, and calcsize.

Let's start with the task of decoding a binary file. Imagine a binary file bindat.dat that contains data in a specific format: first a float corresponding to a version number, then a long integer corresponding to the size of the data, and then that many unsigned bytes of actual data. The key to using the struct module is to define a format string that corresponds to the format of the data you wish to read, and find out which subset of the file corresponds to that data. For example:

import struct
data = open('bindat.dat').read(  )
start, stop = 0, struct.calcsize('fl')
version_number, num_bytes = struct.unpack('fl', data[start:stop])
start, stop = stop, stop + struct.calcsize('B'*num_bytes)
bytes = struct.unpack('B'*num_bytes, data[start:stop])

'f' is a format string for a single floating-point number (a C float, to be precise), 'l' is for a long integer, and 'B' is a format string for an unsigned char. The available unpack format strings are listed in Table 27-8. Consult the library reference manual for usage details.

Table 27-8. Common format codes used by the struct module

Format

C type

Python

x

pad byte

No value

c

char

String of length 1

b

signed char

Integer

B

unsigned char

Integer

h

short

Integer

H

unsigned short

Integer

i

int

Integer

I

unsigned int

Integer

l

long

Integer

L

unsigned long

Integer

f

float

Float

d

double

Float

s

char[ ]

String

p

char[ ]

String

P

void *

Integer

At this point, bytes is a tuple of num_bytes Python integers. If we know that the data is in fact storing characters, we could use chars = map(chr, bytes). To be more efficient, we could change the last unpack to use 'c' instead of 'B', which would do the conversion and return a tuple of num_bytes single-character strings. More efficiently still, we could use a format string that specifies a string of characters of a specified length, such as:

chars = struct.unpack(str(num_bytes)+'s', data[start:stop])

The packing operation (struct.pack) is the exact converse; instead of taking a format string and a data string, and returning a tuple of unpacked values, it takes a format string and a variable number of arguments and packs those arguments using that format string into a new packed string.
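A quick round trip shows the symmetry (using explicit little-endian '<' codes so the byte layout doesn't depend on the platform; the values are illustrative):

```python
import struct

# Pack a float version number and a long record count into 8 bytes
# (4 for the little-endian float, 4 for the little-endian long).
packed = struct.pack('<fl', 1.0, 1753)
print(len(packed))                      # 8
print(struct.unpack('<fl', packed))     # (1.0, 1753)
```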

Note that the struct module can process data that's encoded with either kind of byte-ordering,[6] thus allowing you to write platform-independent binary file manipulation code. For large files, also consider using the array module.

[6] The order with which computers list multibyte words depends on the chip used (so much for standards). Intel and DEC systems use little-endian ordering, while Motorola and Sun-based systems use big-endian ordering. Network transmissions also use big-endian ordering, so the struct module comes in handy when doing network I/O on PCs.
