C++ in a Nutshell: 8.3 Wide and Multibyte Characters

8.3 Wide and Multibyte Characters

The familiar char type is sometimes called a narrow character, as opposed to wchar_t, which is a wide character. The key difference between a narrow and wide character is that a wide character can represent any single character in any character set that an implementation supports. A narrow character, on the other hand, might be too small to represent all characters, so multiple narrow char objects can make up a single, logical character called a multibyte character.

Beyond some minimal requirements for the character sets (see Chapter 1), the C++ standard is purposely open-ended and imposes few restrictions on an implementation. One basic behavioral requirement is that converting a narrow character to a wide character must produce an equivalent character, and converting back to a narrow character must restore the original character. The open nature of the standard gives the compiler and library vendor wide latitude. For example, a compiler for Japanese customers might support a variety of Japanese Industrial Standard (JIS) character sets, but not any European character sets. Another vendor might support multiple ISO 8859 character sets for Western and Eastern Europe, but not any Asian multibyte character sets. Although the standard defines universal characters in terms of the Unicode (ISO/IEC 10646) standard, it does not require any support for Unicode character sets.
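The round-trip requirement can be checked for single-byte characters with the standard functions std::btowc and std::wctob from <cwchar>. The following is a minimal sketch, not a conformance test; the helper name round_trips is hypothetical, and the result depends on the current C locale.

```cpp
#include <cstdio>   // EOF
#include <cwchar>   // std::btowc, std::wctob, std::wint_t, WEOF

// Hypothetical helper: returns true if the single-byte character c
// survives a trip to a wide character and back in the current C locale.
bool round_trips(char c) {
    // btowc expects the value of an unsigned char (or EOF).
    std::wint_t wide = std::btowc(static_cast<unsigned char>(c));
    if (wide == WEOF)
        return false;              // c is not a complete one-byte character
    int narrow = std::wctob(wide); // back to a single byte, or EOF
    return narrow != EOF && static_cast<char>(narrow) == c;
}
```

Characters in the basic execution character set, such as 'A', are expected to round-trip in every locale.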

This section discusses some of the broad issues in dealing with wide and multibyte characters, but the details of specific characters and character sets are implementation-defined.

8.3.1 Wide Characters

A program that must deal with international character sets might work entirely with wide characters. Although wide characters usually require more memory than narrow characters, they are usually easier to use. Searching for substrings in a wide string is easy because you never have the problem of matching partial characters (which can happen with multibyte characters).
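As a small illustration of how simple searching becomes, the sketch below looks for a substring in a std::wstring. The function name find_wide is hypothetical. The point is that every element of the string is a whole character, so an ordinary find cannot land in the middle of one (assuming the implementation's wchar_t can hold each character in a single element).

```cpp
#include <string>

// Hypothetical helper: index of the first occurrence of needle in
// haystack, or std::wstring::npos if there is none.
std::wstring::size_type find_wide(const std::wstring& haystack,
                                  const std::wstring& needle) {
    // Each wchar_t element is a complete character, so no extra
    // check for partial characters is needed after a match.
    return haystack.find(needle);
}
```

For example, find_wide(L"2\u03c0r", L"\u03c0") reports the position of pi as index 1.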

A common implementation of wchar_t is to use Unicode UTF-32 encoding, which means each wide character is 32 bits and represents a single Unicode character. Suppose you want to declare a wide string that contains the Greek letter pi (π). You can specify the string with a universal character name (see Chapter 1):

wchar_t wpi[] = L"\u03c0";

Using UTF-32, the string would contain L"\x03c0". With a different wchar_t implementation, the wpi string would contain different values.
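To see what a particular implementation stores, you can inspect the wide string directly. The sketch below assumes an implementation whose wchar_t holds Unicode code points (true of many current compilers, but not guaranteed by the standard); make_wide_pi and pi_is_one_unicode_element are hypothetical names.

```cpp
#include <string>

// Hypothetical helper: a wide string holding only the Greek letter pi,
// written as a universal character name with the L prefix.
std::wstring make_wide_pi() {
    return L"\u03c0";
}

// Returns true when the implementation stores pi as the single
// value 0x03C0, as a Unicode-based wchar_t would.
bool pi_is_one_unicode_element() {
    std::wstring wpi = make_wide_pi();
    return wpi.size() == 1 && wpi[0] == 0x03c0;
}
```

On an implementation with a different wide-character encoding, the second function would return false even though the string still represents pi.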

The standard wstring class supports wide strings, and all the I/O streams support wide characters (e.g., wistream, wostream).
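The wide stream classes are used just like their narrow counterparts. As a brief sketch, the hypothetical function below formats a value into a std::wstring through a std::wostringstream:

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: formats "r = <radius>" as a wide string.
std::wstring describe_radius(int radius) {
    std::wostringstream out;   // wide counterpart of std::ostringstream
    out << L"r = " << radius;  // wide literals and numbers both work
    return out.str();          // str() returns a std::wstring
}
```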

8.3.2 Multibyte Characters

A multibyte character represents a single character as a series of one or more bytes, or narrow characters. Because a single character might occupy multiple bytes, working with multibyte strings is more difficult than working with wide strings. For example, if you search a multibyte string for the character '\x20', when you find a match, you must test whether the matching character is actually part of a multibyte character and is therefore not actually a match for the single character you want to find.
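The following sketch shows the extra test in a concrete encoding. It assumes Shift_JIS, in which a lead byte in the range 0x81-0x9F or 0xE0-0xFC is followed by one trail byte, and the trail byte can coincide with an ASCII value such as the backslash (0x5C). The function names are hypothetical, and the code does not validate trail bytes.

```cpp
#include <cstddef>
#include <string>

// Assumption: Shift_JIS lead bytes are 0x81-0x9F and 0xE0-0xFC.
bool is_sjis_lead(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Finds the first single-byte character equal to target, ignoring any
// byte that is the second half of a double-byte character.
std::size_t find_single_byte(const std::string& s, char target) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (is_sjis_lead(b)) {
            ++i;        // step onto the trail byte; the loop then skips it
            continue;
        }
        if (s[i] == target)
            return i;   // a genuine single-byte match
    }
    return std::string::npos;
}
```

Given the bytes 0x83 0x5C followed by "a\\", a plain std::string::find for the backslash stops at index 1, inside the double-byte character, whereas find_single_byte returns index 3.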

Consider the problem of comparing multibyte strings. Suppose you need to sort the strings in ascending order. If one string starts with the character '\xA1' and another starts with '\xB2', it seems that the first is smaller than the second and therefore should come before the second. On the other hand, these characters might be the first of multibyte character sequences, so the strings cannot be compared until you have analyzed the strings for multibyte character sequences.
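One way to perform that analysis is to convert both strings to wide strings and compare the results. The sketch below uses std::mbstowcs from <cstdlib>, which interprets the bytes according to the current C locale; to_wide and mb_less are hypothetical names. The comparison orders by wide-character value, which is not necessarily the locale's collation order.

```cpp
#include <cstddef>
#include <cstdlib>    // std::mbstowcs
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a multibyte string to a wide string
// using the current C locale's encoding.
std::wstring to_wide(const std::string& mb) {
    // With a null destination, mbstowcs returns the number of wide
    // characters required, or (size_t)-1 for an invalid sequence.
    std::size_t n = std::mbstowcs(nullptr, mb.c_str(), 0);
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    std::wstring wide(n, L'\0');
    if (n > 0)
        std::mbstowcs(&wide[0], mb.c_str(), n);
    return wide;
}

// Compares whole characters rather than individual bytes.
bool mb_less(const std::string& a, const std::string& b) {
    return to_wide(a) < to_wide(b);
}
```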

Multibyte character sets abound, and a particular C++ compiler and library might support only one or just a few. Some multibyte character sets specifically support a particular language, such as the Chinese Big5 character set. The UTF-8 encoding supports all Unicode characters using one to four narrow characters. (The original UTF-8 design allowed sequences of up to six bytes, but the current definition stops at four.)

For example, consider how an implementation might encode the Greek letter pi (π), which has a Unicode value of 0x03C0:

char    pi[]  = "\u03c0";

If the implementation's narrow character set is ISO 8859-7 (8-bit Greek), the encoding is 0xF0, so pi[] contains "\xf0". If the narrow character set is UTF-8 (8-bit Unicode), the representation is a multibyte character, and pi[] would contain the two bytes "\xcf\x80". Many character sets do not have any encoding for π, in which case the contents of pi[] might be "?", or some other implementation-defined marker for unknown characters.
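The UTF-8 bytes for a character can be worked out by hand from its Unicode value. The following sketch encodes one code point; the function name to_utf8 is hypothetical, and for brevity it does not reject surrogates or values above 0x10FFFF.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For pi, 0x03C0 falls in the two-byte range: the upper bits give 0xC0 | 0x0F = 0xCF, and the lower six bits give 0x80 | 0x00 = 0x80.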

8.3.3 Shift State

You can convert a multibyte character sequence to a wide character and back using the functions in <cwchar>. When performing such conversions, the library might need to keep track of state information during the conversion. This is known as the shift state and is stored in an mbstate_t object (also defined in <cwchar>).
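The sketch below shows one way to carry an mbstate_t across calls so that a multibyte string can be converted in pieces. It uses std::mbrtowc from <cwchar> and the current C locale; append_chunk is a hypothetical name.

```cpp
#include <cstddef>
#include <cwchar>     // std::mbrtowc, std::mbstate_t
#include <stdexcept>
#include <string>

// Hypothetical helper: converts len bytes at data to wide characters
// and appends them to out. The caller owns state, so a character that
// is split across two chunks can be completed by the next call.
void append_chunk(std::wstring& out, const char* data, std::size_t len,
                  std::mbstate_t& state) {
    const char* p = data;
    const char* end = data + len;
    while (p < end) {
        wchar_t wc;
        std::size_t n = std::mbrtowc(&wc, p,
                                     static_cast<std::size_t>(end - p),
                                     &state);
        if (n == static_cast<std::size_t>(-1))
            throw std::runtime_error("invalid multibyte sequence");
        if (n == static_cast<std::size_t>(-2))
            return;     // incomplete character; state remembers the partial bytes
        if (n == 0)
            n = 1;      // a null character was converted; step over its byte
        out.push_back(wc);
        p += n;
    }
}
```

A caller would start with a zero-initialized state, for example std::mbstate_t st = std::mbstate_t();, and pass the same object to every call for a given input.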

For example, the Japanese Industrial Standard (JIS) encodes single-byte characters and double-byte characters. A 3-byte character sequence shifts from single- to double-byte mode, and another sequence shifts back. The shift state keeps track of the current mode. The initial shift state is single-byte. Thus, the multibyte string "\x1B$B&P\x1B(B" represents one wide character, namely, the Greek letter pi (π). The first three characters switch to double-byte mode. The next two characters encode the character, and the final three characters restore single-byte mode.
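To make the role of the shift state concrete, here is a toy scanner that counts the logical characters in such a string. It is a sketch that recognizes only the two escape sequences mentioned above and performs no validation; the names Mode and count_characters are hypothetical. A real library keeps the equivalent information in an mbstate_t.

```cpp
#include <cstddef>
#include <string>

// The shift state of this toy scanner: the current byte mode.
enum class Mode { SingleByte, DoubleByte };

// Hypothetical helper: counts logical characters, treating ESC $ B as
// "enter double-byte mode" and ESC ( B as "return to single-byte mode".
std::size_t count_characters(const std::string& s) {
    Mode mode = Mode::SingleByte;   // the initial shift state
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        if (s.compare(i, 3, "\x1B$B") == 0) {
            mode = Mode::DoubleByte;
            i += 3;                 // escape sequences are not characters
        } else if (s.compare(i, 3, "\x1B(B") == 0) {
            mode = Mode::SingleByte;
            i += 3;
        } else {
            i += (mode == Mode::DoubleByte) ? 2 : 1;
            ++count;
        }
    }
    return count;
}
```

Applied to "\x1B$B&P\x1B(B", the scanner skips both escape sequences and counts the two bytes "&P" as one character.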

Shift states are especially important when performing I/O. By definition, file I/O uses multibyte characters. That is, a file is treated as a sequence of narrow characters. When reading a wide-character stream, the narrow characters are converted to wide characters, and when writing a wide stream, wide characters are converted back to multibyte characters. Seeking to a new position in a file might seek to a position that falls in the middle of a multibyte sequence. Therefore, a file position is required to keep track of a shift state in addition to a byte position in the file. See <ios> in Chapter 13.
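The position type of the standard streams reflects this: std::streampos and std::wstreampos are both std::fpos<std::mbstate_t>, which pairs an offset with a shift state. The sketch below saves a position and seeks back to it. It uses a std::wistringstream as a stand-in for a file so that the example is self-contained, and reread_after_seek is a hypothetical name; with a wide file stream the saved state would matter for encodings that have shift sequences.

```cpp
#include <cwchar>    // std::mbstate_t
#include <sstream>
#include <string>

// Hypothetical helper: reads the second character of text, seeks back
// to where it started, and reads it again. text needs at least two
// characters.
wchar_t reread_after_seek(const std::wstring& text) {
    std::wistringstream in(text);
    wchar_t skipped;
    in.get(skipped);                     // consume the first character

    // pos_type is std::fpos<std::mbstate_t>: an offset plus a shift state.
    std::wistringstream::pos_type mark = in.tellg();
    std::mbstate_t saved = mark.state(); // the state stored with the position
    (void)saved;                         // shown for illustration only

    wchar_t first_read, second_read;
    in.get(first_read);
    in.seekg(mark);                      // restores offset and shift state
    in.get(second_read);
    return second_read;                  // same character as first_read
}
```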