
5.1 String Literals

By and large, strings are fairly easy to use in Python. Perhaps the most complicated thing about them is that there are so many ways to write them in your code:

  • Single quotes: 'spa"m'

  • Double quotes: "spa'm"

  • Triple quotes: '''... spam ...''', """... spam ..."""

  • Escape sequences: "s\tp\na\0m"

  • Raw strings: r"C:\new\test.spm"

  • Unicode strings: u'eggs\u0020spam'

The single- and double-quoted forms are by far the most common; the others serve specialized roles. Let's take a quick look at each of these options.

5.1.1 Single- and Double-Quoted Strings Are the Same

Around Python strings, single and double quote characters are interchangeable. That is, string literals can be written enclosed in either two single or two double quotes—the two forms work the same, and return the same type of object. For example, the following two strings are identical, once coded:

>>> 'shrubbery', "shrubbery"
('shrubbery', 'shrubbery')

The reason for including both is that it allows you to embed a quote character of the other variety inside a string, without escaping it with a backslash: you may embed a single quote character in a string enclosed in double quote characters, and vice versa:

>>> 'knight"s', "knight's"
('knight"s', "knight's")

Incidentally, Python automatically concatenates adjacent string literals, although it is almost as simple to add a + operator between them, to invoke concatenation explicitly.

>>> title = "Meaning " 'of' " Life"
>>> title
'Meaning of Life'
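As a small sketch of where implicit concatenation tends to be useful, the lines below split one long literal across several source lines inside parentheses and compare it with the explicit + form. The variable names are illustrative only; print is called with parentheses around a single argument, a form that also runs on later Python versions.

```python
# Adjacent literals are joined into one string; parentheses let the
# pieces span several lines without backslash continuations.
title = ("Meaning "
         'of'
         " Life")

# The explicit + operator builds the same string.
explicit = "Meaning " + 'of' + " Life"

print(title)              # Meaning of Life
print(title == explicit)  # True
```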

Notice in all of these outputs that Python prefers to print strings in single quotes, unless they embed one. You can also embed quotes by escaping them with backslashes:

>>> 'knight\'s', "knight\"s"
("knight's", 'knight"s')

To understand why this works, we need to look at how escapes work in general.

5.1.2 Escape Sequences Code Special Bytes

The last example embedded a quote inside a string by preceding it with a backslash. This is representative of a general pattern in strings: backslashes are used to introduce special byte codings, known as escape sequences.

Escape sequences let us embed byte codes in strings that cannot be easily typed on a keyboard. The character \, and one or more characters following it in the string literal, are replaced with a single character in the resulting string object, which has the binary value specified by the escape sequence. For example, here is a five-character string that embeds a newline and a tab:

>>> s = 'a\nb\tc'

The two characters \n stand for a single character—the byte containing the binary value of the newline character in your character set (usually, ASCII code 10). Similarly, the sequence \t is replaced with the tab character. The way this string looks when printed depends on how you print it. The interactive echo shows the special characters as escapes, but print interprets them instead:

>>> s
'a\nb\tc'
>>> print s
a
b       c

To be completely sure how many bytes are in this string, you can use the built-in len function—it returns the actual number of bytes in a string, regardless of how it is displayed.

>>> len(s)
5

This string is five bytes long: an ASCII "a" byte, a newline byte, an ASCII "b" byte, and so on; the original backslash characters are not really stored with the string in memory.
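One way to confirm what is actually stored is to look at the integer code of each character with the built-in ord function. This is only a sketch; the name codes is chosen here for illustration.

```python
s = 'a\nb\tc'

# ord() gives the integer code of each stored character; the backslashes
# typed in the literal are not among them.
codes = [ord(ch) for ch in s]

print(len(s))   # 5
print(codes)    # [97, 10, 98, 9, 99]
```

The 10 and 9 are the newline and tab codes; no code 92 (backslash) appears in the list.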

For coding such special bytes, Python recognizes a full set of escape code sequences, listed in Table 5-2. Some sequences allow you to embed absolute binary values into the bytes of a string. For instance, here's another five-character string that embeds two binary zero bytes:

>>> s = 'a\0b\0c'
>>> s
'a\x00b\x00c'
>>> len(s)
5

Table 5-2. String backslash characters

Escape        Meaning
\newline      Ignored (continuation)
\\            Backslash (keeps a \)
\'            Single quote (keeps ')
\"            Double quote (keeps ")
\a            Bell
\b            Backspace
\f            Formfeed
\n            Newline (linefeed)
\r            Carriage return
\t            Horizontal tab
\v            Vertical tab
\N{id}        Unicode database ID
\uhhhh        Unicode 16-bit hex
\Uhhhh...     Unicode 32-bit hex[1]
\xhh          Hex digits value hh
\ooo          Octal digits value
\0            Null (doesn't end string)
\other        Not an escape (kept)

[1] The \Uhhhh... escape sequence takes exactly eight hexadecimal digits (h); both \u and \U can be used only in Unicode string literals.

In Python, the zero (null) byte does not terminate a string the way it typically does in C. Instead Python keeps both the string's length and text in memory. In fact, no character terminates a string in Python; here's one that is all absolute binary escape codes—a binary 1 and 2 (coded in octal), followed by a binary 3 (coded in hexadecimal):

>>> s = '\001\002\x03'
>>> s
'\x01\x02\x03'
>>> len(s)
3

This becomes more important to know when you process binary data files in Python. Because their contents are represented as strings in your scripts, it's okay to process binary files that contain any sort of binary byte values. More on files in Chapter 7.[2]

[2] But if you're especially interested in binary data files: the chief distinction is that you open them in binary mode (use open mode flags with a "b", such as "rb", "wb", and so on). See also the standard struct module, which can parse binary data loaded from a file.
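As a rough illustration of the struct module mentioned in the footnote, the sketch below packs two integers into a byte string and unpacks them again, with no file involved. The format string '<hi' (little-endian, a 2-byte and a 4-byte integer) and the variable names are choices made for this example. Later Python versions return the packed data as a separate bytes type rather than a normal string, but the calls are the same.

```python
import struct

# Pack a 2-byte and a 4-byte integer, little-endian, with no padding.
# The zero bytes in the result are ordinary data, not terminators.
data = struct.pack('<hi', 1, 2)

# Unpacking with the same format string recovers the original values.
values = struct.unpack('<hi', data)

print(len(data))   # 6
print(values)      # (1, 2)
```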

Finally, as the last entry in Table 5-2 implies, if Python does not recognize the character after a "\" as being a valid escape code, it simply keeps the backslash in the resulting string:

>>> x = "C:\py\code"     # keeps \ literally
>>> x
'C:\\py\\code'
>>> len(x)
10

Unless you're able to commit all of Table 5-2 to memory, you probably shouldn't rely on this behavior; to code literal backslashes, double up ("\\" is an escape for "\"), or use raw strings, described in the next section.
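A minimal sketch of the doubled-backslash form recommended here, using the same path as the example above; the checks simply count what ends up in the string.

```python
x = "C:\\py\\code"       # "\\" is the escape for one literal backslash

print(len(x))            # 10
print(x.count("\\"))     # 2
print(x.split("\\"))     # ['C:', 'py', 'code']
```

The length matches the earlier result because each doubled backslash is stored as a single character.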

5.1.3 Raw Strings Suppress Escapes

As we've seen, escape sequences are handy for embedding special byte codes within strings. Sometimes, though, the special treatment of backslashes for introducing escapes can lead to trouble. It's surprisingly common, for instance, to see Python newcomers in classes trying to open a file with a filename argument that looks something like this:

myfile = open('C:\new\text.dat', 'w')

thinking that they will open a file called text.dat in directory C:\new. The problem here is that \n is taken to stand for a newline character, and \t is replaced with a tab. In effect, the call tries to open a file named C:(newline)ew(tab)ext.dat, with usually less than stellar results.

This is just the sort of thing that raw strings are useful for. If the letter "r" (uppercase or lowercase) appears just before the opening quote of a string, it turns off the escape mechanism—Python retains your backslashes literally, exactly as you typed them. To fix the filename problem, just remember to add the letter "r" on Windows:

myfile = open(r'C:\new\text.dat', 'w')

Because two backslashes are really an escape sequence for one backslash, you can also keep your backslashes by simply doubling-up, without using raw strings:

myfile = open('C:\\new\\text.dat', 'w')
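The three spellings of the filename can be compared directly, without opening any file. This is a sketch only; the names broken, raw, and doubled are made up for the comparison.

```python
broken = 'C:\new\text.dat'      # \n and \t are read as newline and tab
raw = r'C:\new\text.dat'        # raw string keeps both backslashes
doubled = 'C:\\new\\text.dat'   # each \\ yields one backslash

print('\n' in broken)           # True
print('\t' in broken)           # True
print(len(broken))              # 13
print(len(raw))                 # 15
print(raw == doubled)           # True
```

The broken form is two characters shorter because each backslash-letter pair collapsed into one control character.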

In fact, Python itself sometimes uses this doubled scheme when it prints strings with embedded backslashes:

>>> path = r'C:\new\text.dat'
>>> path                          # Show as Python code.
'C:\\new\\text.dat'
>>> print path                    # User-friendly format
C:\new\text.dat
>>> len(path)                     # String length
15

There really is just one backslash in the string where Python printed two in the first output of this code. As with numeric representation, the default format at the interactive prompt prints results as if they were code, but the print statement provides a more user-friendly format. To verify, check the result of the built-in len function again, to see the number of bytes in the string, independent of display formats. If you count, you'll see that there really is just one character per backslash for a total of 15.
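The two display formats can also be requested explicitly with the built-in repr and str functions, which is one way to see the difference outside the interactive prompt. A small sketch, with illustrative variable names:

```python
path = r'C:\new\text.dat'

as_code = repr(path)    # the as-code form, with quotes and doubled backslashes
as_text = str(path)     # the user-friendly form

print(as_code)          # 'C:\\new\\text.dat'
print(as_text)          # C:\new\text.dat
print(len(path))        # 15
print(len(as_code))     # 19
```

The as-code form is four characters longer: two enclosing quotes plus one extra backslash for each of the two stored backslashes.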

Besides directory paths on Windows, raw strings are also commonly used for regular expressions (text pattern matching, supported with module re); you'll meet this feature later in this book. Also note that Python scripts can usually use forward slashes in directory paths on both Windows and Unix, because Python tries to interpret paths portably. Raw strings are useful if you code paths using native Windows backslashes.
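As a brief preview of the regular expression case, the sketch below writes a pattern as a raw string so that its backslashes reach the re module untouched. The pattern and the sample text are invented for illustration.

```python
import re

# \d and \. reach the re module intact because the raw string
# leaves backslashes alone.
pattern = r'(\d+)\.(\d+)'
match = re.search(pattern, 'Python 2.3 release')

print(match.group(1))   # 2
print(match.group(2))   # 3
```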

5.1.4 Triple Quotes Code Multiline Block Strings

So far, you've seen single quotes, double quotes, escapes, and raw strings. Python also has a triple-quoted string literal format, sometimes called a block string, which is a syntactic convenience for coding multiline text data. This form begins with three quotes (of either the single or double variety), is followed by any number of lines of text, and is closed with the same triple quote sequence that opened it. Single and double quotes in the text may be, but do not have to be, escaped. For example:

>>> mantra = """Always look
...  on the bright
... side of life."""
>>>
>>> mantra
'Always look\n on the bright\nside of life.'

This string spans three lines (in some interfaces, the interactive prompt changes to "..." on continuation lines; IDLE simply drops down one line). Python collects all the triple-quoted text into a single multiline string, with embedded newline characters (\n) at the places that your code has line breaks. Notice that the second line in the result has a leading space as it did in the literal—what you type is truly what you get.

Triple-quoted strings are handy any time you need multiline text in your program, for example, to code error messages or HTML and XML code. You can embed such blocks directly in your script, without resorting to external text files or explicit concatenation and newline characters.
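For instance, a small HTML fragment might be embedded as in the sketch below. The markup and variable names are invented for illustration; note that the single and double quotes inside the block need no escaping.

```python
page = """<html>
<body>
  <p>It's "spam" time</p>
</body>
</html>"""

# Each line break in the literal became one newline character.
lines = page.split('\n')

print(page.count('\n'))   # 4
print(len(lines))         # 5
print(lines[2])           # the <p> line, with its two leading spaces kept
```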

5.1.5 Unicode Strings Encode Larger Character Sets

The last way to write strings in your scripts is perhaps the most specialized, and the least commonly used. Unicode strings are sometimes called "wide" character strings. Because each character may be represented with more than one byte in memory, Unicode strings allow programs to encode richer character sets than standard strings.

Unicode strings are typically used to support internationalization of applications (sometimes referred to as "i18n", to compress the 18 characters between the first and last characters of the term). For instance, they allow programmers to directly support European or Asian character sets in Python scripts. Because such character sets have more characters than a single byte can represent, Unicode is normally used to process these forms of text.

In Python, Unicode strings may be coded in your script by adding the letter "U" (lower or uppercase), just before the opening quote of a string:

>>> u'spam'
u'spam'

Technically, this syntax generates a Unicode string object, which is a different data type than normal strings. However, Python allows you to freely mix Unicode and normal strings in expressions, and converts up to Unicode for mixed-type results (more on + concatenation in the next section):

>>> 'ni' + u'spam'        # Mixed string types
u'nispam'

In fact, Unicode strings are defined to support all the usual string processing operations you'll meet in the next section, so the difference in types is often trivial to your code. Like normal strings, Unicode may be concatenated, indexed, sliced, matched with the re module, and so on, and cannot be changed in place. If you ever do need to convert between the two types explicitly, you can use the built-in str and unicode functions:

>>> str(u'spam')          # Unicode to normal
'spam'
>>> unicode('spam')       # Normal to unicode
u'spam'

Because Unicode is designed to handle multibyte characters, you can also use the special \u and \U escapes to encode binary character values that are larger than 8 bits:

>>> u'ab\x20cd'           # 8-bit/1-byte characters
u'ab cd'
>>> u'ab\u0020cd'         # 2-byte characters
u'ab cd'
>>> u'ab\U00000020cd'     # 4-byte characters
u'ab cd'

The first of these embeds the binary code for a space character; its binary value in hexadecimal notation is 0x20. The second and third do the same, but give the value in 2-byte and 4-byte Unicode escape notation.
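The equivalence of the three notations can be checked directly, as in this sketch. The last line uses a character code above 8 bits (hex 20AC, the euro sign) chosen for the example; it cannot be written with a two-digit \x escape, yet still counts as a single character.

```python
a = u'ab\x20cd'          # hex escape, two hex digits
b = u'ab\u0020cd'        # 16-bit escape, four hex digits
c = u'ab\U00000020cd'    # 32-bit escape, eight hex digits

print(a == b == c)       # True
print(ord(a[2]))         # 32
print(len(u'\u20ac'))    # 1
```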

Even if you don't think you will need Unicode strings, you might use them without knowing it. Because some programming interfaces (e.g., the COM API on Windows) represent text as Unicode, it may find its way into your script as API inputs or results, and you may sometimes need to convert back and forth between normal and Unicode types. Since Python treats the two string types interchangeably in most contexts, the presence of Unicode strings is often transparent to your code: you can largely ignore the fact that text is being passed around as Unicode objects, and use normal string operations.

Unicode is a useful addition to Python; because it is built-in, it's easy to handle such data in your scripts when needed. Unfortunately, from this point forward, the Unicode story becomes fairly complex. For example:

  • Unicode objects provide an encode method that converts a Unicode string into a normal 8-bit string using a specific encoding.

  • The built-in function unicode and module codecs support registered Unicode "codecs" (for "COders and DECoders").

  • The module unicodedata provides access to the Unicode character database.

  • The sys module includes calls for fetching and setting the default Unicode encoding scheme (the default is usually ASCII).

  • You may combine the raw and unicode string formats (e.g., ur'a\b\c').
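To give a flavor of the first and third items in this list, the sketch below encodes a short Unicode string under two encodings and looks up one character in the Unicode database. The sample text and the choice of the 'utf-8' and 'latin-1' encodings are assumptions made for the example, not recommendations.

```python
import unicodedata

text = u'caf\xe9'                     # four characters; the last is e-acute

utf8 = text.encode('utf-8')           # e-acute takes two bytes in UTF-8
latin1 = text.encode('latin-1')       # and one byte in Latin-1

print(len(text))                      # 4
print(len(utf8))                      # 5
print(len(latin1))                    # 4
print(unicodedata.name(text[3]))      # LATIN SMALL LETTER E WITH ACUTE
print(utf8.decode('utf-8') == text)   # True
```

The same four characters occupy a different number of bytes depending on the encoding, which is why an encoding has to be named when converting to an 8-bit string.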

Because Unicode is a relatively advanced and rarely used tool, we will omit further details in this introductory text. See the Python standard manual for the rest of the Unicode story.
