Recipe 8.12 Using Common Patterns
Problem
You need a quick list of regular expression patterns that match
standard items. These standard items could be a Social Security
number, a zip code, a word containing only letters, an alphanumeric
word, an email address, a URL, a date, or one of many other items
used throughout business applications.
These patterns are useful for checking that user input is well-formed
and of the expected kind. They can also serve as an extra security
measure by rejecting strange or malformed input that attackers use in
attempts to break your code (e.g., SQL injection or cross-site
scripting attacks). Note that these regular expressions are not a
silver bullet that will stop all attacks on your system; rather, they
are an added layer of defense.
Solution
Match only alphanumeric characters along with the characters -, +, .,
and any whitespace:
^([\w\.+-]|\s)*$
Be careful using the - character within a character class (a set of
characters enclosed within [ and ]). That character is also used to
specify a range of characters, as in a-z for a through z inclusive.
If you want to use a literal - character, either escape it with \ or
put it at the end of the character class, as shown in the previous
and next examples.
Match only alphanumeric characters along with the characters -, +, .,
and any whitespace, with the stipulation that there is at least one
of these characters and no more than 10 of these characters:
^([\w\.+-]|\s){1,10}$
Match a date in the form ##/##/####, where the day and month can each
be a one- or two-digit value and the year can be two to four digits:
^\d{1,2}\/\d{1,2}\/\d{2,4}$
Match a time entered with an optional am or pm suffix (note that this
regular expression also handles military time):
^\d{1,2}:\d{2}\s?([ap]m)?$
Match an IPv4 address in dotted-decimal form, with each of the four
parts in the range 0 to 255:
^((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$
Verify that an email address is in the form name@address, where
address is not an IP address:
^[A-Za-z0-9_\-\.]+@(([A-Za-z0-9\-])+\.)+([A-Za-z\-])+$
Verify that an email address is in the form name@address, where
address is an IP address:
^[A-Za-z0-9_\-\.]+@((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$
Match only a dollar amount with an optional leading $ followed by an
optional + or - (note that any number of decimal places may be added):
^\$?[+-]?[\d,]*(\.\d*)?$
This is similar to the previous regular expression, except that at
most two decimal places are allowed:
^\$?[+-]?[\d,]*\.?\d{0,2}$
Match a credit card number entered as four sets of four digits
separated by a space, a -, or no character at all:
^((\d{4}[- ]?){3}\d{4})$
Match a zip code entered as five digits with an optional four-digit
extension:
^\d{5}(-\d{4})?$
Match a North American phone number with an optional area code,
optional - characters between the parts of the number, and no
extension:
^(\(?[0-9]{3}\)?)?\-?[0-9]{3}\-?[0-9]{4}$
Match a phone number similar to the previous regular expression, but
allow an optional five-digit extension prefixed with either ext or
extension:
^(\(?[0-9]{3}\)?)?\-?[0-9]{3}\-?[0-9]{4}(\s*ext(ension)?[0-9]{5})?$
Match a full path beginning with the drive letter and optionally
match a filename with an extension of up to three characters (note
that no .. characters signifying a move up the directory hierarchy
are allowed, nor is a directory name with a . followed by an
extension):
^[a-zA-Z]:[\\/]([_a-zA-Z0-9]+[\\/]?)*([_a-zA-Z0-9]+\.[_a-zA-Z0-9]{0,3})?$
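As a rough illustration of how a few of the patterns above behave, the following sketch runs them through Python's re module. Python is an assumption here (the recipe does not name a language), and the PATTERNS table and is_valid helper are hypothetical names introduced only for this example.

```python
import re

# Patterns copied from the Solution; the dictionary keys are illustrative labels.
PATTERNS = {
    "date": r"^\d{1,2}\/\d{1,2}\/\d{2,4}$",
    "time": r"^\d{1,2}:\d{2}\s?([ap]m)?$",
    "zip": r"^\d{5}(-\d{4})?$",
    "credit_card": r"^((\d{4}[- ]?){3}\d{4})$",
    "phone": r"^(\(?[0-9]{3}\)?)?\-?[0-9]{3}\-?[0-9]{4}$",
}

# Compile once so repeated validations do not re-parse the expressions.
COMPILED = {name: re.compile(pattern) for name, pattern in PATTERNS.items()}


def is_valid(kind, text):
    """Return True when text matches the named pattern from start to end."""
    return COMPILED[kind].search(text) is not None


if __name__ == "__main__":
    samples = [
        ("date", "7/4/2024"),
        ("time", "9:30 pm"),
        ("zip", "12345-6789"),
        ("credit_card", "1234-5678 9012 3456"),
        ("phone", "(555)555-1212"),
        ("phone", "555 555 1212"),  # spaces are not allowed by this pattern
    ]
    for kind, text in samples:
        print(kind, repr(text), is_valid(kind, text))
```

These expressions check shape only. For example, is_valid("date", "99/99/99") is true, because the date pattern counts digits and slashes and knows nothing about calendars.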
Discussion
Regular
expressions are effective at finding specific information, and they
have a wide range of uses. Many applications use them to locate
specific information within a larger range of text, as well as to
filter out bad input. The filtering action is very useful in
tightening the security of an application and preventing an attacker
from attempting to use carefully formed input to gain access to a
machine on the Internet or a local network. By using a regular
expression to allow only good input to be passed to the application,
you can reduce the likelihood of many types of attacks, such as SQL
injection or cross-site-scripting.
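One way to apply this allow-only-good-input idea is a small gate that rejects anything the pattern does not match before the value goes any further. The sketch below is written in Python as an assumption, reuses the recipe's name@address email expression, and require_email is a hypothetical helper, not part of any library.

```python
import re

# The recipe's email expression (the address part is a host name, not an IP address).
EMAIL = re.compile(r"^[A-Za-z0-9_\-\.]+@(([A-Za-z0-9\-])+\.)+([A-Za-z\-])+$")


def require_email(text):
    """Return text unchanged if it matches EMAIL; otherwise raise ValueError."""
    if EMAIL.search(text) is None:
        raise ValueError("input is not in the form name@address")
    return text


if __name__ == "__main__":
    print(require_email("alice@example.com"))
    suspicious = (
        "alice@example.com'; DROP TABLE users;--",
        "<script>@example.com",
    )
    for candidate in suspicious:
        try:
            require_email(candidate)
        except ValueError as err:
            print("rejected:", err)
```

Rejecting by default keeps the check simple: the expression describes what is allowed, and everything else is refused. It remains one layer of defense and does not replace other safeguards in the application.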
The regular expressions presented in this recipe provide only a small
cross-section of what can be accomplished with them. By taking these
expressions and manipulating parts of them, you can easily modify
them to work with your application. Take, for example, the following
expression, which allows between 1 and 10 characters, each of which
must be alphanumeric, whitespace, or one of a few symbols:
^([\w\.+-]|\s){1,10}$
By changing the {1,10} part of the regular expression to {0,200},
this expression will now match either a blank entry or an entry of up
to 200 of the allowed characters.
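The effect of changing only the quantifier can be sketched as follows. Python is assumed for illustration, and bounded_pattern is a hypothetical helper that builds the recipe's expression with a caller-supplied minimum and maximum length.

```python
import re


def bounded_pattern(min_len, max_len):
    """Build the recipe's expression with a {min_len,max_len} quantifier."""
    # The character set is unchanged; only the repetition bounds vary.
    # Note that \w also matches the underscore character.
    return re.compile(r"^([\w\.+-]|\s){%d,%d}$" % (min_len, max_len))


short = bounded_pattern(1, 10)    # the original {1,10} form
relaxed = bounded_pattern(0, 200)  # the modified {0,200} form

if __name__ == "__main__":
    for text in ("", "hello-1.0", "hello world", "a<b"):
        print(repr(text), bool(short.search(text)), bool(relaxed.search(text)))
```

The empty string and the 11-character "hello world" are rejected by the {1,10} form and accepted by the {0,200} form, while "a<b" is rejected by both because < is not in the allowed set.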
Note the use of the ^ character at the beginning
of the expression and the $ character at the end
of the expression. These characters start the match at the beginning
of the text and match all the way to the end of the text. Adding
these characters forces the regular expression to match the entire
string or none of it. By removing these characters, you can search
for specific text within a larger block of text. For example, the
following regular expression matches only a string containing nothing
but a U.S. zip code (there can be no leading or trailing spaces):
^\d{5}(-\d{4})?$
This version also accepts a zip code surrounded by optional leading
or trailing whitespace (notice the addition of the \s* to the start
and end of the expression):
^\s*\d{5}(-\d{4})?\s*$
However, this modified expression matches a zip code found anywhere
within a string (including a string containing just a zip code):
\d{5}(-\d{4})?
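The difference between the three zip code expressions can be seen in a short sketch. Python is assumed here, and the names EXACT, PADDED, ANYWHERE, and find_zip are hypothetical labels chosen for this example.

```python
import re

EXACT = re.compile(r"^\d{5}(-\d{4})?$")            # whole string must be a zip code
PADDED = re.compile(r"^\s*\d{5}(-\d{4})?\s*$")     # surrounding whitespace allowed
ANYWHERE = re.compile(r"\d{5}(-\d{4})?")           # no anchors: search inside text


def find_zip(text):
    """Return the first zip-code-shaped substring of text, or None."""
    match = ANYWHERE.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(bool(EXACT.search("12345")))        # anchored match on the bare value
    print(bool(EXACT.search(" 12345 ")))      # spaces defeat the anchored form
    print(bool(PADDED.search(" 12345 ")))     # \s* absorbs the spaces
    print(find_zip("Ship to 12345-6789 today"))
    print(find_zip("order 1234567"))
```

Without the anchors, the expression also matches the first five digits of a longer run of digits, so find_zip("order 1234567") returns "12345". Keep the anchors whenever the goal is validation rather than searching.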
Use the regular expressions in this recipe and modify them to suit
your needs.
See Also
Two good
books that cover regular expressions are Regular Expression
Pocket Reference by Tony Stubblebine
(O'Reilly) and Mastering Regular
Expressions, Second Edition, by Jeffrey Friedl
(O'Reilly).