1.2 Tokens
All source code is divided into a stream of tokens. The compiler tries to collect as many contiguous characters as it can to build a valid token. (This is sometimes called the "max munch" rule.) It stops when the next character it would read cannot possibly be part of the token it is reading. A token can be an identifier, a reserved keyword, a literal, or an operator or punctuation symbol. Each kind of token is described later in this section.
Step 3 of the compilation process reads preprocessor tokens. These tokens are converted automatically to ordinary compiler tokens as part of the main compilation in Step 7. The differences between a preprocessor token and a compiler token are small:
1.2.1 Identifiers
An identifier is a name that you define or that is defined in a library. An identifier begins with a nondigit character and is followed by any number of digits and nondigits. A nondigit character is a letter, an underscore, or one of a set of universal characters. The exact set of nondigit universal characters is defined in the C++ standard and in ISO/IEC PDTR 10176. Basically, this set contains the universal characters that represent letters. Most programmers restrict themselves to the characters a...z, A...Z, and underscore, but the standard permits letters in other languages. Not all compilers support universal characters in identifiers.
Certain identifiers are reserved for use by the standard library:
1.2.2 Keywords
A keyword is an identifier that is reserved in all contexts for special use by the language. The following is a list of all the reserved keywords. (Note that some compilers do not implement all of the reserved keywords; these compilers allow you to use certain keywords as identifiers. See Section 1.5 later in this chapter for more information.)
1.2.3 Literals
A literal is an integer, floating-point, Boolean, character, or string constant.
1.2.3.1 Integer literals
An integer literal can be a decimal, octal, or hexadecimal constant. A prefix specifies the base or radix: 0x or 0X for hexadecimal, 0 for octal, and nothing for decimal. An integer literal can also have a suffix that is a combination of U and L, for unsigned and long, respectively. The suffix can be uppercase or lowercase and can be in any order. The suffix and prefix are interpreted as follows:
Some compilers offer other suffixes as extensions to the standard. See Appendix A for examples.
Here are some examples of integer literals:
    314       // Legal
    314u      // Legal
    314LU     // Legal
    0xFeeL    // Legal
    0ul       // Legal
    078       // Illegal: 8 is not an octal digit
    032UU     // Illegal: cannot repeat a suffix
1.2.3.2 Floating-point literals
A floating-point literal has an integer part, a decimal point, a fractional part, and an exponent part. You must include the decimal point, the exponent, or both. You must include the integer part, the fractional part, or both. The signed exponent is introduced by e or E. The literal's type is double unless there is a suffix: F for type float and L for long double. The suffix can be uppercase or lowercase.
Here are some examples of floating-point literals:
    3.14159       // Legal
    .314159F      // Legal
    314159E-5L    // Legal
    314.          // Legal
    314E          // Illegal: incomplete exponent
    314f          // Illegal: no decimal or exponent
    .e24          // Illegal: missing integer or fraction
1.2.3.3 Boolean literals
There are two Boolean literals, both keywords: true and false.
1.2.3.4 Character literals
Character literals are enclosed in single quotes. If the literal begins with L (uppercase only), it is a wide character literal (e.g., L'x'). Otherwise, it is a narrow character literal (e.g., 'x'). Narrow characters are used more frequently than wide characters, so the "narrow" adjective is usually dropped.
The value of a narrow or wide character literal is the value of the character's encoding in the execution character set. If the literal contains more than one character, the literal value is implementation-defined. Note that a character might have different encodings in different locales. Consult your compiler's documentation to learn which encoding it uses for character literals.
A narrow character literal with a single character has type char. With more than one character, the type is int (e.g., 'abc'). The type of a wide character literal is always wchar_t.
A character literal can be a plain character (e.g., 'x'), an escape sequence (e.g., '\b'), or a universal character (e.g., '\u03C0'). Table 1-1 lists the possible escape sequences. Note that you must use an escape sequence for a backslash or single-quote character literal. Using an escape for a double quote or question mark is optional. Only the characters shown in Table 1-1 are allowed in an escape sequence. (Some compilers extend the standard and recognize other escape sequences.)
1.2.3.5 String literals
String literals are enclosed in double quotes. A string contains characters that are similar to character literals: plain characters, escape sequences, and universal characters. A string cannot cross a line boundary in the source file, but it can contain escaped line endings (backslash followed by newline).
A wide string literal is prefaced with L (always uppercase). In a wide string literal, a single universal character always maps to a single wide character. In a narrow string literal, the implementation determines whether a universal character maps to one or multiple characters (called a multibyte character). See Chapter 8 for more information on multibyte characters.
Two adjacent string literals (possibly separated by whitespace, including new lines) are concatenated at compile time into a single string. This is often a convenient way to break a long string across multiple lines. Do not try to combine a narrow string with a wide string in this way.
After concatenating adjacent strings, the null character ('\0' or L'\0') is automatically appended after the last character in the string literal.
Here are some examples of string literals. Note that the first three form identical strings.
    "hello, reader"
    "hello, \
    reader"
    "hello, " "rea" "der"
    "Alert: \a; ASCII tab: \011; portable tab: \t"
    "illegal: unterminated string
    L"string with \"quotes\""
A string literal's type is an array of const char. For example, "string"'s type is const char[7]. Wide string literals are arrays of const wchar_t. All string literals have static lifetimes (see Chapter 2 for more information about lifetimes).
As with an array of const anything, the compiler can automatically convert the array to a pointer to the array's first element. You can, for example, assign a string literal to a suitable pointer object:
    const char* ptr;
    ptr = "string";
As a special case, you can also convert a string literal to a non-const pointer.
Attempting to modify the string results in undefined behavior. This conversion is deprecated, and well-written code does not rely on it.
1.2.4 Symbols
Nonalphabetic symbols are used as operators and as punctuation (e.g., statement terminators). Some symbols are made of multiple adjacent characters. The following are all the symbols used for operators and punctuation:
You cannot insert whitespace between characters that make up a symbol, and C++ always collects as many characters as it can to form a symbol before trying to interpret the symbol. Thus, an expression such as x+++y is read as x ++ + y.
A common error when first using templates is to omit a space between closing angle brackets in a nested template instantiation. The following is an example with that space:
    std::list<std::vector<int> > list;    // Note the space between the two > characters.
The example is incorrect without the space character because the adjacent greater-than signs would be interpreted as a single right-shift operator, not as two separate closing angle brackets.
Another, slightly less common, error is instantiating a template with a template argument that uses the global scope operator:
    ::std::list< ::std::list<int> > list;    // Note the space after < and the space between the two > characters.
Again, a space is needed, this time between the angle bracket (<) and the scope operator (::), to prevent the compiler from seeing the first token as <: rather than <. The <: token is an alternative token, as described in Section 1.5 later in this chapter.