CSharp Cookbook

Recipe 8.7 A Better Tokenizer

Problem

A simple method of tokenizing—or breaking up a string into its discrete elements—was presented in Recipe 2.6. However, this is not powerful enough to handle all your string-tokenizing needs. You need a tokenizer—also referred to as a lexer—that can split up a string based on a well-defined set of characters.

Solution

Using the Split method of the Regex class, we can use a regular expression to indicate the types of tokens and separators that we are interested in gathering. This technique works especially well with equations, since the tokens of an equation are well-defined. For example, the code:

using System;
using System.Text.RegularExpressions;

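public static string[] Tokenize(string equation)
{
    // The character class lists the delimiters; the surrounding parentheses
    // form a capturing group, so Split returns the delimiters as tokens too.
    Regex RE = new Regex(@"([\+\-\*\(\)\^\\])");
    return (RE.Split(equation));
}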

will divide up a string according to the regular expression specified in the Regex constructor. In other words, the string passed to the Tokenize method is split on the delimiters +, -, *, (, ), ^, and \. The following method calls Tokenize to tokenize the equation (y - 3)(3111*x^21 + x + 320):

public void TestTokenize( )
{
    foreach(string token in Tokenize("(y - 3)(3111*x^21 + x + 320)"))
        Console.WriteLine("String token = " + token.Trim( ));
}

which displays the following output:

String token =
String token = (
String token = y
String token = -
String token = 3
String token = )
String token =
String token = (
String token = 3111
String token = *
String token = x
String token = ^
String token = 21
String token = +
String token = x
String token = +
String token = 320
String token = )
String token =

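Notice that each operator, parenthesis, variable, and number has been broken out into its own token. The empty tokens appear wherever a delimiter sits at the very start or end of the string, or where two delimiters are adjacent, as with the )( in the middle of the equation.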
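If the empty tokens in the output are not wanted, one option is to trim each token and discard the ones that end up empty. The following is a minimal sketch of that idea, not part of the original recipe: the class name TokenizerSketch and the helper TokenizeNonEmpty are hypothetical, and the use of LINQ is an assumption about the target framework.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizerSketch
{
    // Same delimiter set as the recipe's Tokenize method.
    private static readonly Regex Delimiters = new Regex(@"([\+\-\*\(\)\^\\])");

    // Hypothetical helper (not in the original recipe): trims every token
    // and drops the ones that are empty after trimming.
    public static string[] TokenizeNonEmpty(string equation)
    {
        return Delimiters.Split(equation)
                         .Select(token => token.Trim())
                         .Where(token => token.Length > 0)
                         .ToArray();
    }

    public static void Main()
    {
        foreach (string token in TokenizeNonEmpty("(y - 3)(3111*x^21 + x + 320)"))
            Console.WriteLine("String token = " + token);
    }
}
```

For the sample equation this yields 16 tokens instead of 19, because the three empty entries are filtered out.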

Discussion

The tokenizer created in Recipe 2.6 would be useful in specific controlled circumstances. However, in real-world projects, we do not always have the luxury of being able to control the set of inputs to our code. By making use of regular expressions, we can take the original tokenizer and make it flexible enough to handle many other kinds of input simply by changing the pattern.

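The key method used here is the Split instance method of the Regex class. Its return value is a string array whose elements are the individual tokens of the source string (the equation, in this case). Because the character class in the pattern is wrapped in capturing parentheses, Split returns the delimiters themselves as elements of the array rather than discarding them.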
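The effect of the parentheses in the pattern can be seen by splitting the same input with and without a capturing group. This is an illustrative sketch only: the class and method names are hypothetical, and the pattern is a reduced version of the recipe's that covers just + and ^.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitCaptureDemo
{
    // No capturing group: the delimiters are consumed and discarded.
    public static string[] SplitWithoutCapture(string input)
    {
        return Regex.Split(input, @"[\+\^]");
    }

    // Same character class inside ( ): the delimiters are returned as tokens.
    public static string[] SplitWithCapture(string input)
    {
        return Regex.Split(input, @"([\+\^])");
    }

    public static void Main()
    {
        // prints: x | 21 | 320
        Console.WriteLine(string.Join(" | ", SplitWithoutCapture("x^21+320")));

        // prints: x | ^ | 21 | + | 320
        Console.WriteLine(string.Join(" | ", SplitWithCapture("x^21+320")));
    }
}
```

For an equation tokenizer the capturing form is usually the one you want, since the operators are needed later to evaluate or parse the expression.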

Notice that the static Split method accepts RegexOptions enumeration values as an argument, while the instance Split method has overloads that accept a maximum number of substrings to return and a starting position for the search. With the instance method, any options are supplied to the Regex constructor instead. This may have some bearing on whether you choose the static or instance method.
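The difference between the static and instance overloads might look like the following in practice. This is a sketch under stated assumptions: the class name, method names, and sample inputs are hypothetical and chosen only to make each overload's effect visible.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitOverloadsDemo
{
    // Static overload: options are passed alongside the pattern.
    public static string[] StaticWithOptions(string input)
    {
        return Regex.Split(input, "plus", RegexOptions.IgnoreCase);
    }

    // Instance overload: count caps the number of substrings returned;
    // the last element holds the unsplit remainder.
    public static string[] InstanceWithCount(string input, int count)
    {
        Regex plus = new Regex(@"\+");
        return plus.Split(input, count);
    }

    // Instance overload: the search for delimiters begins at startAt.
    // Text before startAt is not split, but it is still returned as
    // part of the first element.
    public static string[] InstanceWithCountAndStart(string input, int count, int startAt)
    {
        Regex plus = new Regex(@"\+");
        return plus.Split(input, count, startAt);
    }

    public static void Main()
    {
        // prints: 1 | 2 | 3
        Console.WriteLine(string.Join(" | ", StaticWithOptions("1PLUS2plus3")));

        // prints: a | b+c+d
        Console.WriteLine(string.Join(" | ", InstanceWithCount("a+b+c+d", 2)));

        // prints: a+b | c | d
        Console.WriteLine(string.Join(" | ", InstanceWithCountAndStart("a+b+c+d", 10, 2)));
    }
}
```

If the same pattern is applied to many strings, constructing the Regex once and reusing the instance avoids repeating the pattern at every call site.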
