
Recipe 8.12 Using Common Patterns

Problem

You need a quick list from which to choose regular expression patterns that match standard items. These standard items could be a Social Security Number, a zip code, a word containing only letters, an alphanumeric word, an email address, a URL, a date, or one of many other items used throughout business applications.

These patterns are useful for checking that user input is well-formed before your application acts on it. They can also serve as an extra security measure against attackers who try to break your code by entering strange or malformed input (e.g., SQL injection or cross-site scripting attacks). Note that these regular expressions are not a silver bullet that will stop all attacks on your system; rather, they are an added layer of defense.

Solution

  • Match only alphanumeric characters along with the characters -, +, ., and any whitespace (note that \w also matches the underscore character):

    ^([\w\.+-]|\s)*$

    Be careful using the - character within a character class (the part of a regular expression enclosed within [ and ]). That character is also used to specify a range of characters, as in a-z for a through z inclusive. If you want to use a literal - character, either escape it with \ or put it at the end of the character class, as shown in the previous and next examples.


  • Match the same set of characters as the previous expression, but require at least 1 and no more than 10 of them:

    ^([\w\.+-]|\s){1,10}$
  • Match a date in the form ##/##/#### where the day and month can each be a one- or two-digit value, and the year can be either a two- or four-digit value (the values themselves are not range-checked):

    ^\d{1,2}\/\d{1,2}\/(\d{2}|\d{4})$
  • Match a time with an optional am or pm suffix (note that this regular expression also accepts 24-hour times and does not range-check the hour or minute values):

    ^\d{1,2}:\d{2}\s?([ap]m)?$
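  • Match an IP address (four dot-separated octets, each in the range 0 to 255):

    ^((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$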
  • Verify that an email address is in the form name@address where address is not an IP address:

    ^[A-Za-z0-9_\-\.]+@(([A-Za-z0-9\-])+\.)+([A-Za-z\-])+$
  • Verify that an email address is in the form name@address where address is an IP address:

    ^[A-Za-z0-9_\-\.]+@((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$
  • Match only a dollar amount with the optional $ and + or - preceding characters (note that any number of decimal places may be added):

    ^\$?[+-]?[\d,]*(\.\d*)?$

    This is similar to the previous regular expression except that only up to two decimal places are allowed:

    ^\$?[+-]?[\d,]*\.?\d{0,2}$
  • Match a credit card number to be entered as four sets of four digits separated with a space, -, or no character at all:

    ^((\d{4}[- ]?){3}\d{4})$
  • Match a zip code entered as five digits with an optional four-digit extension:

    ^\d{5}(-\d{4})?$
  • Match a North American phone number with an optional area code and an optional - character to be used in the phone number and no extension:

    ^(\(?[0-9]{3}\)?)?\-?[0-9]{3}\-?[0-9]{4}$
  • Match a phone number similar to the previous regular expression, but allow an optional extension of exactly five digits prefixed with either ext or extension (whitespace is allowed before the prefix but not between the prefix and the digits):

    ^(\(?[0-9]{3}\)?)?\-?[0-9]{3}\-?[0-9]{4}(\s*ext(ension)?[0-9]{5})?$
  • Match a full path beginning with the drive letter and optionally match a filename with an extension of up to three characters (note that no .. characters signifying a move up the directory hierarchy are allowed, nor is a directory name with a . followed by an extension):

    ^[a-zA-Z]:[\\/]([_a-zA-Z0-9]+[\\/]?)*([_a-zA-Z0-9]+\.[_a-zA-Z0-9]{0,3})?$
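As a rough illustration of how a few of the patterns above might be put to work for input validation, here is a minimal sketch. The recipe itself does not name a host language, so Python's re module is an assumption, as are the dictionary keys and the is_valid helper; only the pattern strings come from the list above.

```python
import re

# Patterns copied from the list above; the keys are illustrative names only.
PATTERNS = {
    "zip": r"^\d{5}(-\d{4})?$",
    "credit_card": r"^((\d{4}[- ]?){3}\d{4})$",
    "time": r"^\d{1,2}:\d{2}\s?([ap]m)?$",
    "phone": r"^(\(?[0-9]{3}\)?)?\-?[0-9]{3}\-?[0-9]{4}$",
}

# Compile once up front; re.ASCII keeps \d and \s limited to ASCII characters.
COMPILED = {name: re.compile(p, re.ASCII) for name, p in PATTERNS.items()}


def is_valid(kind: str, text: str) -> bool:
    """Return True if the whole of text matches the named pattern."""
    # fullmatch is used because, in Python, $ also matches just before
    # a trailing newline; fullmatch insists on consuming the entire string.
    return COMPILED[kind].fullmatch(text) is not None


if __name__ == "__main__":
    for kind, sample in [
        ("zip", "12345-6789"),
        ("credit_card", "1234 5678 9012 3456"),
        ("time", "9:30 pm"),
        ("phone", "(555)555-1212"),
        ("zip", "1234"),
    ]:
        print(kind, repr(sample), is_valid(kind, sample))
```

Keeping the patterns in one table makes it easy to add or adjust entries without touching the validation logic.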

Discussion

Regular expressions are effective at finding specific information, and they have a wide range of uses. Many applications use them to locate specific information within a larger body of text, as well as to filter out bad input. This filtering is very useful in tightening the security of an application and preventing an attacker from using carefully formed input to gain access to a machine on the Internet or a local network. By using a regular expression to allow only good input to be passed to the application, you can reduce the likelihood of many types of attacks, such as SQL injection or cross-site scripting.

The regular expressions presented in this recipe provide only a small cross-section of what can be accomplished with them. By taking these expressions and manipulating parts of them, you can easily adapt them to your application. Take, for example, the following expression, which accepts between 1 and 10 characters, each of which must be alphanumeric, whitespace, or one of a few symbols:

^([\w\.+-]|\s){1,10}$

If you change the {1,10} part of the regular expression to {0,200}, the expression matches either a blank entry or an entry of up to and including 200 of the specified characters.
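The effect of changing the quantifier can be sketched as follows. Python's re module is assumed here, and bounded_pattern is a hypothetical helper introduced purely for illustration; the character set is the one from the expression above.

```python
import re


def bounded_pattern(min_len: int, max_len: int) -> re.Pattern:
    """Build the recipe's character-set pattern with a custom {min,max} quantifier."""
    return re.compile(r"^([\w\.+-]|\s){%d,%d}$" % (min_len, max_len))


short = bounded_pattern(1, 10)   # same as ^([\w\.+-]|\s){1,10}$
long_ = bounded_pattern(0, 200)  # same as ^([\w\.+-]|\s){0,200}$

if __name__ == "__main__":
    print(bool(short.match("hello")))    # within 1..10 characters
    print(bool(short.match("")))         # blank entry, rejected by {1,10}
    print(bool(long_.match("")))         # blank entry, accepted by {0,200}
    print(bool(long_.match("a" * 201)))  # one character too many
```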

Note the use of the ^ character at the beginning of the expression and the $ character at the end of the expression. These characters start the match at the beginning of the text and match all the way to the end of the text. Adding these characters forces the regular expression to match the entire string or none of it. By removing these characters, you can search for specific text within a larger block of text. For example, the following regular expression matches only a string containing nothing but a U.S. zip code (there can be no leading or trailing spaces):

^\d{5}(-\d{4})?$

This version also matches a string containing nothing but a zip code, but it tolerates optional leading or trailing whitespace (notice the addition of the \s* to the start and end of the expression):

^\s*\d{5}(-\d{4})?\s*$

However, this modified expression matches a zip code found anywhere within a string (including a string containing just a zip code):

\d{5}(-\d{4})?
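The difference between the three zip code variants can be seen in a short sketch. Python's re module is assumed, and the constant names and sample text are made up for illustration.

```python
import re

ZIP_EXACT = re.compile(r"^\d{5}(-\d{4})?$")          # whole string must be a zip code
ZIP_PADDED = re.compile(r"^\s*\d{5}(-\d{4})?\s*$")   # surrounding whitespace tolerated
ZIP_ANYWHERE = re.compile(r"\d{5}(-\d{4})?")         # zip code anywhere in the text

text = "Ship to Springfield, IL 62704-1234 by Friday"

exact = ZIP_EXACT.search(text)           # no match: the text holds more than a zip code
padded = ZIP_PADDED.search("  62704  ")  # match: only whitespace surrounds the zip code
found = ZIP_ANYWHERE.search(text)        # match: the zip code is located inside the text

if __name__ == "__main__":
    print(exact)
    print(padded is not None)
    print(found.group(0))
```

One thing to keep in mind with the unanchored form is that it will also pick five digits out of a longer run of digits, so in practice you may want to surround it with word boundaries (\b).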

Use the regular expressions in this recipe and modify them to suit your needs.

See Also

Two good books that cover regular expressions are Regular Expression Pocket Reference by Tony Stubblebine (O'Reilly) and Mastering Regular Expressions, Second Edition, by Jeffrey Friedl (O'Reilly).
