8.3 Wide and Multibyte Characters

The familiar char type is sometimes called a narrow character, as opposed to wchar_t, which is a wide character. The key difference between a narrow and a wide character is that a wide character can represent any single character in any character set that an implementation supports. A narrow character, on the other hand, might be too small to represent all characters, so multiple narrow char objects can make up a single, logical character called a multibyte character.

Beyond some minimal requirements for the character sets (see Chapter 1), the C++ standard is purposely open-ended and imposes few restrictions on an implementation. One basic behavioral requirement is that converting a narrow character to a wide character must produce an equivalent character, and converting back to a narrow character must restore the original character.

The open nature of the standard gives the compiler and library vendor wide latitude. For example, a compiler for Japanese customers might support a variety of Japanese Industrial Standard (JIS) character sets, but not any European character sets. Another vendor might support multiple ISO 8859 character sets for Western and Eastern Europe, but not any Asian multibyte character sets. Although the standard defines universal characters in terms of the Unicode (ISO/IEC 10646) standard, it does not require any support for Unicode character sets.

This section discusses some of the broad issues in dealing with wide and multibyte characters, but the details of specific characters and character sets are implementation-defined.

8.3.1 Wide Characters

A program that must deal with international character sets might work entirely with wide characters. Although wide characters usually require more memory than narrow characters, they are usually easier to use.
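The round-trip requirement stated in the introduction to this section (narrow to wide and back must restore the original character) can be checked with a minimal sketch such as the following. It assumes the single-byte conversion functions std::btowc and std::wctob from <cwchar> and the default "C" locale. The helper name round_trips is hypothetical and used only for illustration.

```cpp
#include <cassert>  // assert, for the usage sketch at the end
#include <cstdio>   // EOF
#include <cwchar>   // std::btowc, std::wctob, std::wint_t, WEOF

// Hypothetical helper: returns true if the narrow character c converts to a
// wide character and back to the same narrow character in the current locale.
bool round_trips(char c)
{
    // btowc expects the character as an unsigned char value (or EOF).
    std::wint_t wide = std::btowc(static_cast<unsigned char>(c));
    if (wide == WEOF)
        return false;  // c is not a valid single-byte character

    int narrow = std::wctob(wide);
    return narrow != EOF && static_cast<char>(narrow) == c;
}

// Usage sketch: assert(round_trips('A'));
```

These two functions handle only single-byte characters in the initial shift state. Characters that need a multibyte sequence require the restartable conversion functions instead.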
Searching for substrings in a wide string is easy because you never have the problem of matching partial characters (which can happen with multibyte characters).

A common implementation of wchar_t is to use Unicode UTF-32 encoding, which means each wide character is 32 bits and represents a single Unicode character. Suppose you want to declare a wide string that contains the Greek letter pi (π):

    wchar_t wpi[] = L"\u03c0";

Using UTF-32, the string would contain L"\x03c0". With a different wchar_t implementation, the wpi string would contain different values.

The standard wstring class supports wide strings, and all the I/O streams support wide characters (e.g., wistream, wostream).

8.3.2 Multibyte Characters

A multibyte character represents a single character as a series of one or more bytes, or narrow characters. Because a single character might occupy multiple bytes, working with multibyte strings is more difficult than working with wide strings. For example, if you search a multibyte string for the character '\x20', when you find a match, you must test whether the matching byte is actually part of a multibyte character and is therefore not a match for the single character you want to find.

Consider the problem of comparing multibyte strings. Suppose you need to sort the strings in ascending order. If one string starts with the character '\xA1' and the other starts with '\xB2', it seems that the first is smaller than the second and therefore should come before the second. On the other hand, these characters might be the first of multibyte character sequences, so the strings cannot be compared until you have analyzed the strings for multibyte character sequences.

Multibyte character sets abound, and a particular C++ compiler and library might support only one or just a few. Some multibyte character sets specifically support a particular language, such as the Chinese Big5 character set. The UTF-8 encoding supports all Unicode characters using one to four narrow characters. For example, consider how an implementation might encode the Greek
letter pi (π):

    char pi[] = "\u03c0";

If the implementation's narrow character set is ISO 8859-7 (8-bit Greek), the encoding is 0xF0, so pi[] contains "\xf0". If the narrow character set is UTF-8 (8-bit Unicode), the representation is a multibyte character, and pi[] would contain "\xcf\x80". Many character sets do not have any encoding for π.

8.3.3 Shift State

You can convert a multibyte character sequence to a wide character and back using the functions in <cwchar>. When performing such conversions, the library might need to keep track of state information during the conversion. This is known as the shift state and is stored in an mbstate_t object (also defined in <cwchar>).

For example, the Japanese Industrial Standard (JIS) encodes
single-byte characters and double-byte characters. A 3-byte character sequence shifts from single- to double-byte mode, and another sequence shifts back. The shift state keeps track of the current mode. The initial shift state is single-byte. Thus, the multibyte string "\x1B$B&P\x1B(B" represents one wide character, namely, the Greek letter pi (π).

Shift states are especially important when performing I/O. By definition, file I/O uses multibyte characters. That is, a file is treated as a sequence of narrow characters. When reading a wide-character stream, the narrow characters are converted to wide characters, and when writing a wide stream, wide characters are converted back to multibyte characters. Seeking to a new position in a file might seek to a position that falls in the middle of a multibyte sequence. Therefore, a file position is required to keep track of a shift state in addition to a byte position in the file. See <ios> in Chapter 13.
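The multibyte-to-wide conversion with an explicit shift state described in this section can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes std::mbrtowc from <cwchar> and whatever multibyte encoding the current C locale uses, and the function name to_wide is hypothetical. Which multibyte sequences are accepted depends entirely on the implementation and the active locale.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <string>

// Hypothetical helper: converts a null-terminated multibyte string to a wide
// string, one character at a time, using the current C locale's encoding.
// Returns false if the input contains an invalid or incomplete sequence.
bool to_wide(const char* mb, std::wstring& out)
{
    assert(mb != 0);
    std::mbstate_t state = std::mbstate_t();      // initial shift state
    std::size_t remaining = std::strlen(mb) + 1;  // include the terminator
    out.clear();

    for (;;) {
        wchar_t wc;
        // mbrtowc updates state, so any shift sequence seen so far is
        // remembered across calls.
        std::size_t n = std::mbrtowc(&wc, mb, remaining, &state);
        if (n == static_cast<std::size_t>(-1) ||
            n == static_cast<std::size_t>(-2))
            return false;  // invalid or incomplete multibyte character
        if (n == 0)
            return true;   // converted the terminating null character
        out.push_back(wc);
        mb += n;           // skip the bytes that made up this character
        remaining -= n;
    }
}
```

Passing the terminating null character to mbrtowc lets the library consume any trailing shift sequence before reporting the end of the string. In the default "C" locale, a call such as to_wide("pi", w) is expected to succeed and store L"pi" in w.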