27.3 Manipulating Strings

The vast majority of programs perform string operations. We've covered most of the properties and variants of string objects in Chapter 5, but there are two areas that we haven't touched on thus far: the string module and regular expressions. As we'll see, the first is simple and mostly a historical note, while the second is complex and powerful.

27.3.1 The string Module

The string module is somewhat of a historical anomaly. If Python were being designed today, the string module would not exist; it is mostly a remnant of a less civilized age before everything was a first-class object. Nowadays, string objects have methods like split and join, which replace the functions that are still defined in the string module.

The string module does define a convenient function, maketrans, which builds the mapping table used by the translate method of string objects. maketrans/translate is useful when you want to translate several characters in a string at once. For example, suppose you want to replace all occurrences of the space character with an underscore, change underscores to minus signs, and change minus signs to plus signs. Doing so with repeated .replace( ) operations is in fact quite tricky, because each later replacement also rewrites the characters produced by the earlier ones, but doing it with maketrans is trivial:

    >>> import string
    >>> conversion = string.maketrans(" _-", "_-+")
    >>> input_string = "This is a two_part - one_part"
    >>> input_string.translate(conversion)
    'This_is_a_two-part_+_one-part'

In addition, the string module defines a few useful constants, which haven't been implemented as string attributes yet. These are shown in Table 27-2.
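Returning briefly to the maketrans example: the session above is Python 2. As a minimal sketch for readers on Python 3 (an assumption, not something this chapter covers), the same table can be built with the str.maketrans static method. The sketch also shows what goes wrong with chained replace calls; the variable names translated and chained are illustrative only.

```python
# Python 3 sketch: str.maketrans builds the table that str.translate consumes.
conversion = str.maketrans(" _-", "_-+")
input_string = "This is a two_part - one_part"

# Every character is mapped exactly once, in a single pass.
translated = input_string.translate(conversion)
print(translated)  # This_is_a_two-part_+_one-part

# Chained replace calls interfere with each other: each later call also
# rewrites the characters produced by the earlier ones.
chained = input_string.replace(" ", "_").replace("_", "-").replace("-", "+")
print(chained)     # This+is+a+two+part+++one+part
```

Because translate looks at each input character only once, the three substitutions cannot cascade into one another the way the chained calls do.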
The constants in Table 27-2 are useful to test whether specific characters fit a criterion; for example, for a single character x, x in string.whitespace returns true only if x is one of the whitespace characters. Note that the values given in the table aren't always the values you'll find; for example, the definition of 'lowercase' depends on the locale: if you're running a French operating system, string.lowercase will include ç and ê.

27.3.2 Complicated String Matches with Regular Expressions

If strings and their methods aren't enough (and they do get clumsy in many perfectly normal use cases), Python provides a specialized string-processing tool in the form of a regular expression engine.

Regular expressions are strings that let you define complicated pattern matching and replacement rules for strings. The syntax for regular expressions emphasizes compact notation over mnemonic value. For example, the single character . means "match any single character." The character + means "one or more of what just preceded me." Table 27-3 lists some of the most commonly used regular expression symbols and their meanings in English.

Describing the full set of regular expression tokens and their meaning would take quite a few pages; instead, we'll cover a simple use case and walk through how to solve the problem using regular expressions.
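Before the walk-through, the two symbols just mentioned can be tried in isolation. This is a small sketch, assuming Python 3 and the re module's findall function; the sample strings and the names dot_matches and plus_matches are made up for illustration.

```python
import re

# '.' matches any single character (by default, anything except a newline).
dot_matches = re.findall(r"c.t", "cat cot cut ct")
print(dot_matches)    # ['cat', 'cot', 'cut']

# '+' means "one or more of what just preceded me" (here, the letter b).
plus_matches = re.findall(r"ab+", "a ab abb abbb")
print(plus_matches)   # ['ab', 'abb', 'abbb']
```

Note that 'ct' and the lone 'a' are not matched: the first has no character between c and t, and the second has no b at all.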
27.3.2.1 A real regular expression problem

Suppose you need to write a program to replace the strings "green pepper" and "red pepper" with "bell pepper" if and only if they occur together in a paragraph before the word "salad" and not if they are followed (with no space) by the string "corn." Although the specific requirements are silly, the general kind (conditional replacement of subparts of text based on specific contextual constraints) is surprisingly common in computing. We will explain each step of the program that solves this task.

Assume that the file you need to process is called pepper.txt. Here's an example of such a file:

    This is a paragraph that mentions bell peppers multiple times. For
    one, here is a red pepper and dried tomato salad recipe. I don't like
    to use green peppers in my salads as much because they have a harsher
    flavor.

    This second paragraph mentions red peppers and green peppers but not
    the "s" word (s-a-l-a-d), so no bells should show up.

    This third paragraph mentions red peppercorns and green peppercorns,
    which aren't vegetables but spices (by the way, bell peppers really
    aren't peppers, they're chilies, but would you rather have a good cook
    or a good botanist prepare your salad?).

The first task is to open the file and read in the text:

    file = open('pepper.txt')
    text = file.read( )

We read the entire text at once and avoid splitting it into lines, since we will assume that paragraphs are defined by two consecutive newline characters. This is easy to do using the split method of string objects:

    paragraphs = text.split('\n\n')

At this point we've split the text into a list of paragraph strings, and all there is left to do is perform the actual replacement operation.
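The paragraph-splitting step can be checked on its own. The sketch below assumes Python 3 and uses a short inline string as a hypothetical stand-in for the contents of pepper.txt, so that it runs without the file.

```python
# Hypothetical stand-in for the text read from pepper.txt.
# With a real file, one option is:
#     with open('pepper.txt') as f:
#         text = f.read()
text = "first paragraph,\nsecond line\n\nsecond paragraph\n\nthird paragraph"

# Paragraphs are assumed to be separated by exactly two newline characters.
paragraphs = text.split("\n\n")

print(len(paragraphs))   # 3
print(paragraphs[0])     # a single newline stays inside its paragraph
```

Splitting on '\n\n' leaves single newlines untouched, which is why the first paragraph still spans two lines.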
Here's where regular expressions come in:

    import re

    matchstr = re.compile(
        r"""\b(red|green)     # 'red' or 'green' starting new words
            (\s+              # followed by whitespace
             pepper           # The word 'pepper',
             (?!corn)         # if not followed immediately by 'corn'
             (?=.*salad))""", # and if followed at some point by 'salad',
        re.IGNORECASE |       # allow pepper, Pepper, PEPPER, etc.
        re.DOTALL |           # Allow dots to match newlines as well.
        re.VERBOSE)           # This allows the comments and the newlines above.

    for paragraph in paragraphs:
        fixed_paragraph = matchstr.sub(r'bell\2', paragraph)
        print fixed_paragraph+'\n'

The first line is simple but key: all of Python's regular expression smarts are in the re module.

The re.compile statement is the hardest one; it creates a regular expression pattern, which is like a program (that's the raw string), and compiles it. Such a pattern specifies two things: which parts of the strings we're interested in and how they should be grouped. Let's go over these in turn.

The re.compile( ) call takes a string (although the syntax of that string is quite particular) and returns an object called a compiled regular expression object, which corresponds to that string.

Defining which parts of the string we're interested in is done by specifying a pattern of characters that defines a match. This is done by concatenating smaller patterns, each of which specifies a simple matching criterion (e.g., "match the string 'pepper'," "match one or more whitespace characters," "don't match 'corn'," etc.). We're looking for the words "red" or "green" followed by the word "pepper" that is itself followed by the word "salad," as long as "pepper" isn't followed immediately by "corn." Let's take each line of the re.compile( . . . ) expression in turn.

The first thing to notice about the string in the re.compile( ) is that it's a "raw" string (the quotation marks are preceded by an r).
Prepending such an r to a string (single- or triple-quoted) turns off the interpretation of the backslash characters within the string.[3] We could have used a regular string instead and used \\b instead of \b and \\s instead of \s. (Left undoubled in a regular string, \b would be read as a backspace character before the regular expression engine ever saw it.) In this case, it makes little difference; for complicated regular expressions, raw strings allow for much clearer syntax than escaped backslashes.
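The difference between raw and regular strings is easy to check interactively. The following is a minimal sketch, assuming Python 3; the test string 'a red pepper' and the variable names are illustrative.

```python
import re

# Without the r prefix, "\b" is one character (a backspace, ASCII 8);
# with it, the string holds two characters: a backslash and a 'b'.
lengths = (len("\b"), len(r"\b"))
print(lengths)                        # (1, 2)

# A raw string and a doubled-backslash string are the same string.
same = (r"\bred" == "\\bred")
print(same)                           # True

# Only the raw (or doubled) form reaches the regex engine as a word boundary.
raw_found = re.search(r"\bred", "a red pepper") is not None
plain_found = re.search("\bred", "a red pepper") is not None
print(raw_found, plain_found)         # True False
```

The last search fails because the pattern it receives asks for a literal backspace followed by 'red', which does not occur in the text.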
The first line in the pattern is \b(red|green). \b stands for "the empty string, but only at the beginning or end of a word"; using it here prevents matches that have red or green as the final part of a word (as in "tired pepper"). The (red|green) pattern specifies an alternation: either 'red' or 'green'. Ignore the left parenthesis that follows for now. \s is a special symbol that means "any whitespace character," and + means "one or more occurrences of whatever comes before me," so, put together, \s+ means "one or more whitespace characters." Then, pepper just means the string 'pepper'. (?!corn) prevents matches of "patterns that have 'corn' at this point," so we prevent the match on 'peppercorn'. Finally, (?=.*salad) says that for the pattern to match, it must be followed by any number of arbitrary characters (that's what .* means), followed by the word salad. The ?= specifies that while the pattern should determine whether the match occurs, it shouldn't be "used up" by the match process; it's a subtle point that we won't cover in detail here. At this point we've defined the pattern corresponding to the substring.

Now, note that there are two parentheses: the one before \s+ and the last one. These two define a "group," which starts after the red or green and goes to the end of the pattern. We'll use that group in the next operation, the actual replacement.

The three flags are joined by the | symbol (the bitwise "or" operation) to form the second argument to re.compile. These specify kinds of pattern matches. The first, re.IGNORECASE, says that the text comparisons should ignore whether the text and the match have similar or different cases. The second, re.DOTALL, specifies that the . character should match any character, including the newline character (that's not the default behavior). The third, re.VERBOSE, allows us to insert extra newlines and # comments in the regular expression, making it easier to read and understand.
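Each piece of the pattern can be probed with a short test string. The sketch below assumes Python 3 and rebuilds the same verbose pattern; the test strings and the names pattern, match, and rejected are invented for illustration.

```python
import re

pattern = re.compile(
    r"""\b(red|green)    # group 1: 'red' or 'green' at the start of a word
        (\s+             # group 2 opens: one or more whitespace characters
         pepper          # the literal word 'pepper'
         (?!corn)        # negative lookahead: not followed by 'corn'
         (?=.*salad))    # positive lookahead: 'salad' appears later
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE)

# IGNORECASE accepts 'Red'; DOTALL lets .* cross the newline to reach 'salad'.
match = pattern.search("A Red pepper\nsalad")
print(repr(match.group(0)))   # 'Red pepper'
print(repr(match.group(1)))   # 'Red'
print(repr(match.group(2)))   # ' pepper'

# Each of these fails for a different reason.
rejected = [
    pattern.search("tired pepper salad"),    # \b: 'red' is not at a word start
    pattern.search("red peppercorn salad"),  # (?!corn) rejects 'peppercorn'
    pattern.search("red pepper soup"),       # (?=.*salad): no 'salad' follows
]
print(rejected)               # [None, None, None]
```

The lookaheads only test what follows; neither 'corn' nor 'salad' ever appears in the matched text or in group 2.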
The re.compile statement could have been written more compactly as:

    matchstr = re.compile(r"\b(red|green)(\s+pepper(?!corn)(?=.*salad))", re.I | re.S)

The actual replacement operation is done with the line:

    fixed_paragraph = matchstr.sub(r'bell\2', paragraph)

We're calling the sub method of the matchstr object. That object is a compiled regular expression object, meaning that some of the processing of the expression has already been done (in this case, outside the loop), thus speeding up the total program execution. We use a raw string again to write the first argument to the method. The \2 is a reference to group 2 in the regular expression (the second group of parentheses), which in our case captures the whitespace followed by 'pepper'. The two lookahead assertions inside that group, (?!corn) and (?=.*salad), only test what follows and consume no text, so nothing between 'pepper' and 'salad' is part of the group or of the match. Therefore, this line means, "Replace each occurrence of the matched substring with the string 'bell' followed by the whitespace and 'pepper' that were matched, throughout the paragraph string."

So, does it work? The pepper.txt file had three paragraphs: the first satisfied the requirements of the match twice, the second didn't because it didn't mention the word 'salad', and the third didn't because the 'red' and 'green' words are before peppercorn, not pepper. As it was supposed to, our program (saved in a file called pepper.py) modifies only the first paragraph:

    /home/David/book$ python pepper.py
    This is a paragraph that mentions bell peppers multiple times. For
    one, here is a bell pepper and dried tomato salad recipe. I don't like
    to use bell peppers in my salads as much because they have a harsher
    flavor.

    This second paragraph mentions red peppers and green peppers but not
    the "s" word (s-a-l-a-d), so no bells should show up.

    This third paragraph mentions red peppercorns and green peppercorns,
    which aren't vegetables but spices (by the way, bell peppers really
    aren't peppers, they're chilies, but would you rather have a good cook
    or a good botanist prepare your salad?).

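For readers who want to reproduce this result, the steps of the walk-through can be collected into one script. The sketch below is a Python 3 adaptation (print is a function there), and it inlines the sample text as an assumption so that it runs without pepper.txt; the list name fixed is illustrative.

```python
import re

# Inline copy of the sample pepper.txt text, so no file is needed.
text = """This is a paragraph that mentions bell peppers multiple times. For
one, here is a red pepper and dried tomato salad recipe. I don't like
to use green peppers in my salads as much because they have a harsher
flavor.

This second paragraph mentions red peppers and green peppers but not
the "s" word (s-a-l-a-d), so no bells should show up.

This third paragraph mentions red peppercorns and green peppercorns,
which aren't vegetables but spices (by the way, bell peppers really
aren't peppers, they're chilies, but would you rather have a good cook
or a good botanist prepare your salad?)."""

# Compiled once, outside the loop, using the compact form of the pattern.
matchstr = re.compile(r"\b(red|green)(\s+pepper(?!corn)(?=.*salad))",
                      re.IGNORECASE | re.DOTALL)

paragraphs = text.split("\n\n")

# \2 re-inserts the whitespace and 'pepper' captured by group 2.
fixed = [matchstr.sub(r"bell\2", paragraph) for paragraph in paragraphs]

for paragraph in fixed:
    print(paragraph + "\n")
```

Only the first element of fixed differs from its original paragraph; the other two come back unchanged.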
This example, while artificial, shows how regular expressions can compactly express complicated matching rules. If this kind of problem occurs often in your line of work, mastering regular expressions can be a worthwhile investment of time and effort.

A more thorough coverage of regular expressions is beyond the scope of this book. Jeffrey Friedl provides excellent coverage of regular expressions in his book Mastering Regular Expressions (O'Reilly). This book is a must-have for anyone doing serious text processing. For the casual user, the descriptions in the Library Reference or Python in a Nutshell do the job most of the time. Be sure to use the re module, not the regex or regsub modules, which are deprecated (they probably won't be around in a later version of Python):

    >>> import regex
    __main__:1: DeprecationWarning: the regex module is deprecated; please use the re module