
Recipe 8.1 Enumerating Matches

Problem

You need to find one or more substrings corresponding to a particular pattern within a string. You need to be able to inform the searching code to return either all matching substrings or only the matching substrings that are unique within the set of all matched strings.

Solution

Call the FindSubstrings method, which executes a regular expression and obtains all matching text. This method returns either all matching results or only the unique matches; this behavior is controlled by the findAllUnique parameter. Note that if the findAllUnique parameter is set to true, the unique matches are returned sorted alphabetically. Its source code is as follows:

using System;
using System.Collections;
using System.Text.RegularExpressions;

public static Match[] FindSubstrings(string source, string matchPattern,
                                     bool findAllUnique)
{
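    // A SortedList keeps its entries ordered by key, so the unique matches
    // come back sorted by their matched text.
    SortedList uniqueMatches = new SortedList( );
    Match[] retArray = null;

    // RegexOptions.Multiline only changes how ^ and $ anchor (per line);
    // it does not make . match a newline.
    Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
    MatchCollection theMatches = RE.Matches(source);

    if (findAllUnique)
    {
        // Keep only the first Match found for each distinct matched string.
        for (int counter = 0; counter < theMatches.Count; counter++)
        {
            if (!uniqueMatches.ContainsKey(theMatches[counter].Value))
            {
                uniqueMatches.Add(theMatches[counter].Value, 
                                  theMatches[counter]);
            }
        }

        retArray = new Match[uniqueMatches.Count];
        uniqueMatches.Values.CopyTo(retArray, 0);
    }
    else
    {
        // Return every match in the order it was found in the source.
        retArray = new Match[theMatches.Count];
        theMatches.CopyTo(retArray, 0);
    }

    return (retArray);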
}

The following method searches for any tags in an XML string; it does this by searching for a block of text that begins with the < character and ends with the > character.

This method first displays all unique tag matches present in the XML string and then displays all tag matches within the string:

public static void TestFindSubstrings( )
{
    string matchPattern = "<.*>";

    string source = @"<?xml version='1.0' encoding='UTF-8'?>
             <!-- my comment -->
             <![CDATA[<escaped> <><chars>>>>>]]>
             <Window ID='Main'>
               <Control ID='TextBox'>
                 <Property Top='0' Left='0' Text='BLANK'/>
               </Control>
               <Control ID='Label'>
                 <Property Top='0' Left='0' Caption='Enter Name Here'/>
               </Control>
               <Control ID='Label'>
                 <Property Top='0' Left='0' Caption='Enter Name Here'/>
               </Control>
             </Window>";

    Console.WriteLine("UNIQUE MATCHES");
    Match[] x1 = FindSubstrings(source, matchPattern, true);
    foreach(Match m in x1)
    {
        Console.WriteLine(m.Value);
    }

    Console.WriteLine( );
    Console.WriteLine("ALL MATCHES");
    Match[] x2 = FindSubstrings(source, matchPattern, false);
    foreach(Match m in x2)
    {
        Console.WriteLine(m.Value);
    }
}

The following text will be displayed:

UNIQUE MATCHES
<!-- my comment -->
<![CDATA[<escaped> <><chars>>>>>]]>
</Control>
</Window>
<?xml version='1.0' encoding='UTF-8'?>
<Control ID='Label'>
<Control ID='TextBox'>
<Property Top='0' Left='0' Caption='Enter Name Here'/>
<Property Top='0' Left='0' Text='BLANK'/>
<Window ID='Main'>

ALL MATCHES
<?xml version='1.0' encoding='UTF-8'?>
<!-- my comment -->
<![CDATA[<escaped> <><chars>>>>>]]>
<Window ID='Main'>
<Control ID='TextBox'>
<Property Top='0' Left='0' Text='BLANK'/>
</Control>
<Control ID='Label'>
<Property Top='0' Left='0' Caption='Enter Name Here'/>
</Control>
<Control ID='Label'>
<Property Top='0' Left='0' Caption='Enter Name Here'/>
</Control>
</Window>

Discussion

As you can see, the regular expression classes in the Framework Class Library (FCL) are quite easy to use. The first step is to create an instance of the Regex class, passing it the regular expression pattern along with any options for running that pattern. The second step is to call one of two instance methods on that Regex object: Match, which returns a single Match object for the first match found, or Matches, which returns a MatchCollection containing a Match object for every match found. Use Match if you need only the first match and Matches if you need more than that.
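The two steps can be sketched as follows. This is a minimal illustration, not part of the recipe: the class name MatchVersusMatches, the helper methods FirstValue and AllValues, the \d+ pattern, and the sample input are all assumptions made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class MatchVersusMatches
{
    // Uses Regex.Match: returns the text of the first match, or null if none.
    public static string FirstValue(Regex re, string input)
    {
        Match first = re.Match(input);
        return first.Success ? first.Value : null;
    }

    // Uses Regex.Matches: returns the text of every match, in the order found.
    public static string[] AllValues(Regex re, string input)
    {
        MatchCollection all = re.Matches(input);
        string[] values = new string[all.Count];
        for (int i = 0; i < all.Count; i++)
        {
            values[i] = all[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        // Step 1: create the Regex with its pattern (no options needed here).
        Regex re = new Regex(@"\d+");
        string input = "10 apples, 20 pears, 30 plums";

        // Step 2: ask for the first match only, or for all matches.
        Console.WriteLine("First: " + FirstValue(re, input));
        Console.WriteLine("All: " + string.Join(", ", AllValues(re, input)));
    }
}
```

Checking Match.Success before reading Value matters because Match returns an unsuccessful Match object, not null, when nothing matches.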

The FindSubstrings method returns an array of Match objects that can be used by the calling code. You might have noticed that the unique matches are returned sorted, whereas the full set of matches is returned in the order in which the matches were found in the source string. This is because FindSubstrings stores the unique matches in a SortedList, keyed by the matched text, and a SortedList keeps its entries sorted by key as they are added.
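The sorting behavior can be seen in isolation with a small sketch. The class name SortedListOrderDemo, the helper SortedUniqueKeys, and the sample strings are assumptions for illustration; only SortedList and its ContainsKey, Add, and Keys members come from the recipe's own approach.

```csharp
using System;
using System.Collections;

// Hypothetical demo class; the names are illustrative only.
public static class SortedListOrderDemo
{
    // Adds each distinct string as a key and returns the keys in the
    // order the SortedList holds them, which is sorted key order.
    public static string[] SortedUniqueKeys(string[] items)
    {
        SortedList list = new SortedList();
        foreach (string item in items)
        {
            // Add throws on a duplicate key, so check first,
            // just as FindSubstrings does.
            if (!list.ContainsKey(item))
            {
                list.Add(item, null);
            }
        }

        string[] keys = new string[list.Count];
        list.Keys.CopyTo(keys, 0);
        return keys;
    }

    public static void Main()
    {
        string[] found = { "pear", "apple", "pear", "fig" };
        Console.WriteLine(string.Join(", ", SortedUniqueKeys(found)));
    }
}
```

The insertion order (pear, apple, fig) is lost, which is why callers that need document order should pass false for findAllUnique.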

The regular expression used in the TestFindSubstrings method is very simple and will work in most, but not all, conditions. For example, if two tags are on the same line, as shown here:

<tagData></tagData>

the regular expression will catch the entire line, not each tag separately, because .* is greedy and matches up to the last > on the line. You could change the regular expression from <.*> to <[^>]*> to match only up to the first closing > ([^>]* matches any run of characters that are not >). However, this will fail in the CDATA section, matching <![CDATA[<escaped>, <>, and <chars> instead of <![CDATA[<escaped> <><chars>>>>>]]>. The more complicated @"(<!\[CDATA.*>|<[^>]*>)" will match either <!\[CDATA.*> (a greedy match for everything within the CDATA section) or <[^>]*>, described previously.
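A short sketch makes the difference between the three patterns concrete by running each one against the two problem inputs discussed above. The class name TagPatternComparison and the helper MatchValues are assumptions for illustration; the patterns and input lines are taken from the text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class TagPatternComparison
{
    // Returns the text of every match of pattern in source, in order.
    public static string[] MatchValues(string source, string pattern)
    {
        MatchCollection matches = Regex.Matches(source, pattern);
        string[] values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        string[] patterns =
        {
            "<.*>",                       // greedy: runs to the last > on the line
            "<[^>]*>",                    // stops at the first >
            @"(<!\[CDATA.*>|<[^>]*>)"     // CDATA handled greedily, other tags not
        };
        string[] inputs =
        {
            "<tagData></tagData>",
            "<![CDATA[<escaped> <><chars>>>>>]]>"
        };

        foreach (string input in inputs)
        {
            Console.WriteLine("INPUT: " + input);
            foreach (string pattern in patterns)
            {
                string[] values = MatchValues(input, pattern);
                Console.WriteLine("  " + pattern + " gives " + values.Length +
                                  " match(es): " + string.Join(" | ", values));
            }
        }
    }
}
```

Note that the CDATA alternative is still greedy, so it would also swallow any other tag that follows the CDATA section on the same line.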

See Also

See the ".NET Framework Regular Expressions" and "SortedList Class" topics in the MSDN documentation.
