Recipe 8.1 Enumerating Matches
Problem
You need to find one or more substrings within a string that match a
particular pattern. The searching code must be able to return either
all matching substrings or only those matching substrings that are
unique within the set of all matched strings.
Solution
Call the FindSubstrings method, which executes a regular expression
and obtains all matching text. This method returns either all matching
results or only the unique matches; this behavior is controlled by the
findAllUnique parameter. Note that if the findAllUnique parameter is
set to true, the unique matches are returned sorted alphabetically by
their matched text. Its source code is as follows:
using System;
using System.Collections;
using System.Text.RegularExpressions;

public static Match[] FindSubstrings(string source, string matchPattern,
                                     bool findAllUnique)
{
    SortedList uniqueMatches = new SortedList();
    Match[] retArray = null;

    Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
    MatchCollection theMatches = RE.Matches(source);

    if (findAllUnique)
    {
        // Keep only the first Match for each distinct matched string.
        // The SortedList orders its entries by key (the matched text).
        for (int counter = 0; counter < theMatches.Count; counter++)
        {
            if (!uniqueMatches.ContainsKey(theMatches[counter].Value))
            {
                uniqueMatches.Add(theMatches[counter].Value,
                                  theMatches[counter]);
            }
        }

        retArray = new Match[uniqueMatches.Count];
        uniqueMatches.Values.CopyTo(retArray, 0);
    }
    else
    {
        // Return every match in the order in which it was found.
        retArray = new Match[theMatches.Count];
        theMatches.CopyTo(retArray, 0);
    }

    return retArray;
}
The following method searches for any
tags in an XML string; it does this by searching for a block of text
that begins with the < character and ends with
the > character.
This method first displays all unique tag matches present in the XML
string and then displays all tag matches within the string:
public static void TestFindSubstrings()
{
    string matchPattern = "<.*>";
    string source = @"<?xml version='1.0' encoding='UTF-8'?>
<!-- my comment -->
<![CDATA[<escaped> <><chars>>>>>]]>
<Window ID='Main'>
<Control ID='TextBox'>
<Property Top='0' Left='0' Text='BLANK'/>
</Control>
<Control ID='Label'>
<Property Top='0' Left='0' Caption='Enter Name Here'/>
</Control>
<Control ID='Label'>
<Property Top='0' Left='0' Caption='Enter Name Here'/>
</Control>
</Window>";

    Console.WriteLine("UNIQUE MATCHES");
    Match[] x1 = FindSubstrings(source, matchPattern, true);
    foreach (Match m in x1)
    {
        Console.WriteLine(m.Value);
    }

    Console.WriteLine();
    Console.WriteLine("ALL MATCHES");
    Match[] x2 = FindSubstrings(source, matchPattern, false);
    foreach (Match m in x2)
    {
        Console.WriteLine(m.Value);
    }
}
The following text will be displayed:
UNIQUE MATCHES
<!-- my comment -->
<![CDATA[<escaped> <><chars>>>>>]]>
</Control>
</Window>
<?xml version='1.0' encoding='UTF-8'?>
<Control ID='Label'>
<Control ID='TextBox'>
<Property Top='0' Left='0' Caption='Enter Name Here'/>
<Property Top='0' Left='0' Text='BLANK'/>
<Window ID='Main'>

ALL MATCHES
<?xml version='1.0' encoding='UTF-8'?>
<!-- my comment -->
<![CDATA[<escaped> <><chars>>>>>]]>
<Window ID='Main'>
<Control ID='TextBox'>
<Property Top='0' Left='0' Text='BLANK'/>
</Control>
<Control ID='Label'>
<Property Top='0' Left='0' Caption='Enter Name Here'/>
</Control>
<Control ID='Label'>
<Property Top='0' Left='0' Caption='Enter Name Here'/>
</Control>
</Window>
Discussion
As you can see, the regular expression classes in the FCL are quite
easy to use. The first step is to create an instance of the Regex
class that contains the regular expression pattern along with any
options for running this pattern. The second step is to call one of
two instance methods on that Regex object. Call the Match method if
you need only the first match found; it returns a single Match
object. Call the Matches method if you need more than just the first
match; it returns a MatchCollection containing every match object.
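The difference between the two methods can be sketched in a few lines. This is a minimal illustration, not part of the recipe: the class name MatchVersusMatches, the helper names FirstMatch and CountMatches, and the sample input are all made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchVersusMatches
{
    // Returns the text of the first match, or an empty string
    // when the pattern does not match at all.
    public static string FirstMatch(string source, string pattern)
    {
        Regex re = new Regex(pattern);
        Match first = re.Match(source);
        return first.Success ? first.Value : string.Empty;
    }

    // Returns the number of non-overlapping matches in source.
    public static int CountMatches(string source, string pattern)
    {
        Regex re = new Regex(pattern);
        MatchCollection all = re.Matches(source);
        return all.Count;
    }

    public static void Main()
    {
        string source = "<a><b><c>";   // hypothetical sample input
        string pattern = "<[^>]*>";

        Console.WriteLine(FirstMatch(source, pattern));    // prints <a>
        Console.WriteLine(CountMatches(source, pattern));  // prints 3
    }
}
```

Checking the Success property before using a Match matters because the Match method returns an object even when nothing was found.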
The FindSubstrings method returns an array of Match objects that can
be used by the calling code. You might have noticed that the unique
elements are returned sorted, while the nonunique elements are
returned in the order in which they were found. The unique matches
are stored in a SortedList keyed on the matched text, and a
SortedList keeps its entries sorted by key as they are added.
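The sorting behavior is easy to see in isolation. The following sketch assumes nothing beyond the nongeneric SortedList that the recipe already uses; the class name SortedListOrderDemo, the helper name UniqueSorted, and the sample words are hypothetical.

```csharp
using System;
using System.Collections;

public static class SortedListOrderDemo
{
    // Adds each distinct string as a key and returns the keys in
    // the order in which the SortedList stores them.
    public static string[] UniqueSorted(string[] items)
    {
        SortedList unique = new SortedList();
        foreach (string item in items)
        {
            // Skip duplicates; Add would throw on a repeated key.
            if (!unique.ContainsKey(item))
            {
                unique.Add(item, null);
            }
        }

        string[] result = new string[unique.Count];
        unique.Keys.CopyTo(result, 0);
        return result;
    }

    public static void Main()
    {
        string[] found = { "pear", "apple", "pear", "fig" };
        foreach (string s in UniqueSorted(found))
        {
            Console.WriteLine(s);   // prints apple, fig, pear on separate lines
        }
    }
}
```

With the default constructor, string keys are compared through their IComparable implementation, which is culture-sensitive. As a result, the relative order of keys that differ only in punctuation may vary between cultures and platforms.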
The regular expression used in the TestFindSubstrings method is
simplistic and will work in most, but not all, conditions. For
example, if two tags are on the same line, as shown here:
<tagData></tagData>
the regular expression will match the entire line rather than each
tag separately, because .* is greedy and runs to the last > on the
line. You could change the regular expression from <.*> to <[^>]*>
so that it matches only up to the first closing > ([^>]* matches any
run of characters other than >). However, this fails on the CDATA
section, producing the three matches <![CDATA[<escaped>, <>, and
<chars> instead of the single match
<![CDATA[<escaped> <><chars>>>>>]]>. The more complicated pattern
@"(<!\[CDATA.*>|<[^>]*>)" matches either <!\[CDATA.*> (a greedy match
for everything within the CDATA section) or <[^>]*>, described
previously.
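To compare the three patterns side by side, you can run each of them against the two problem inputs discussed here. This is only a sketch: the class name TagPatternComparison and the helper name MatchValues are invented for the example, and it uses the static Regex.Matches method rather than the recipe's FindSubstrings method.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagPatternComparison
{
    // Returns the text of every match of pattern within source.
    public static string[] MatchValues(string source, string pattern)
    {
        MatchCollection matches = Regex.Matches(source, pattern);
        string[] values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        string sameLine = "<tagData></tagData>";
        string cdata = "<![CDATA[<escaped> <><chars>>>>>]]>";

        string greedy = "<.*>";
        string negated = "<[^>]*>";
        string combined = @"(<!\[CDATA.*>|<[^>]*>)";

        // Two tags on one line.
        Console.WriteLine(MatchValues(sameLine, greedy).Length);    // prints 1
        Console.WriteLine(MatchValues(sameLine, negated).Length);   // prints 2
        Console.WriteLine(MatchValues(sameLine, combined).Length);  // prints 2

        // The CDATA line.
        Console.WriteLine(MatchValues(cdata, negated).Length);      // prints 3
        Console.WriteLine(MatchValues(cdata, combined).Length);     // prints 1
    }
}
```

Note that the CDATA branch of the combined pattern is still greedy. If a CDATA section were followed by other tags on the same line, that branch would run on to the last > of the line.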
See Also
See the ".NET Framework Regular
Expressions" and "SortedList
Class" topics in the MSDN documentation.