A.1 What Directives Use Regular Expressions?
Two main categories of Apache directives use regular expressions.
Any directive with a name containing the word Match, such as
FilesMatch, can be assumed to use regular expressions in its
arguments. In addition, the directives supplied by the module
mod_rewrite use regular expressions to accomplish their work.
For more about mod_rewrite, see Chapter 5.
Each SomethingMatch directive implements the same functionality as
its counterpart without the Match. For example, the
RedirectMatch directive does essentially the same
thing as the Redirect directive, except that the
first argument, rather than being a literal string, is a regular
expression, which will be compared to the incoming request URL.
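As a rough illustration of the difference, the two lines below sketch the same redirect written both ways. The /docs/ path and the manual.example.com hostname are made-up names used only for this comparison.

```apache
# Redirect: the first argument is a literal URL-path prefix.
# Anything after /docs/ is appended to the target URL.
Redirect /docs/ http://manual.example.com/

# RedirectMatch: the first argument is a regular expression.
# The parentheses capture the rest of the path, and $1 puts it
# back on the end of the target URL.
RedirectMatch ^/docs/(.*)$ http://manual.example.com/$1
```

In practice you would use only one of the two lines. They are shown together here purely for comparison.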
A.1.1 Regular Expression Basics
To get started writing your own regular expressions, you'll need
to know a few basic pieces of vocabulary, as shown in Table A-1 and
Table A-2. These constitute the bare minimum that
you need to know. Although this will hardly qualify you as an expert,
it will enable you to solve many of the regex scenarios you will find
yourself faced with.
Table A-1. A basic regex vocabulary
.  | Matches any character. This is the wildcard character.
+  | Matches one or more of the previous character. For example, M+ would match one or more Ms; .+ would match one or more characters of any kind.
*  | Matches zero or more of the previous character. For example, M* would match zero or more Ms. This means that it will not only match M, MM, and MMM, but it will also match a string that doesn't have any Ms in it at all.
?  | Makes the previous character optional. For example, the regular expression monkeys? will match a string containing either monkey or monkeys. Note that the ? applies only to a single character in the absence of any enclosing parentheses.
^  | Indicates that the following characters must appear at the beginning of the string being tested. Thus, a regular expression of ^zim requires that the string being tested start with the characters zim. ^ is referred to as an anchor, because it anchors the match to the beginning of the string. In the context of a character class (see below), the ^ character has another special meaning.
$  | Indicates that the characters to be matched must appear at the end of the string. Thus, a regular expression of gif$ requires that the string being tested end with the characters gif. $ is referred to as an anchor, because it anchors the match to the end of the string.
\  | Escapes the following character, meaning that it removes the "specialness" of the character. For example, a pattern containing \. would match a literal . character, since the \ removes the special meaning of the . character.
[] | Character class. Matches one of the characters contained in the square brackets. For example, [abc] will match either an a, a b, or a c. [abc]+, on the other hand, would match a sequence of a's, b's, and c's, or any combination of them. Note that within a character class, the ^ character, when it is the first character in the class, doesn't have its normal anchor status but means any character except those in the class. Thus, a character class of [^abc] will match any character that is not an a, b, or c. A character class containing a - between two characters means an entire range of characters. For example, the character class [a-q] means all of the lowercase letters starting from a and ending with q. [a-zA-Z] would be all uppercase and all lowercase letters. In addition to character classes that you form yourself, there are a number of special predefined character classes to represent commonly used groups of characters. See Table A-2 for a list of these predefined character classes.
() | Groups a set of characters together. This allows you to consider them as a single unit. For example, you could apply a + or ? to an entire group of characters, rather than just a single character. The expression (monkeys)?, for example, would make the entire word monkeys an optional part of the match. In some regular expression libraries, the ( ) characters also capture the contents of the match so that they can be used later.
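The vocabulary in Table A-1 can be seen at work in a few RedirectMatch lines. This is only a sketch: the paths and target URLs are hypothetical, and each line is meant to be read on its own rather than as a complete configuration.

```apache
# ^ and $ anchor the pattern to the whole URL path; \. is a literal dot
RedirectMatch ^/index\.htm$ http://www.example.com/index.html

# ? makes the final l optional, so this matches /old.htm and /old.html
RedirectMatch ^/old\.html?$ http://www.example.com/new.html

# [0-9]+ matches one or more digits; ( ) captures them for use as $1
RedirectMatch ^/issue([0-9]+)$ http://www.example.com/issues/$1.html

# .* matches zero or more characters of any kind
RedirectMatch ^/tmp/.*$ http://www.example.com/
```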
Table A-2. Predefined regular expression character classes
[[:alnum:]]  | Any alphanumeric character
[[:alpha:]]  | Any alphabetical character
[[:blank:]]  | A space or horizontal tab
[[:cntrl:]]  | A control character
[[:digit:]]  | A decimal digit
[[:graph:]]  | A nonspace, noncontrol character
[[:lower:]]  | A lowercase letter
[[:print:]]  | Same as graph, but also includes the space character
[[:punct:]]  | A punctuation character
[[:space:]]  | Any whitespace character, including newline and return
[[:upper:]]  | An uppercase letter
[[:xdigit:]] | A valid hexadecimal digit
[[:<:]]      | The boundary between the left end of a word and nonword characters
[[:>:]]      | The boundary between the right end of a word and nonword characters
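The predefined classes in Table A-2 are used wherever a character class you wrote yourself could appear. The lines below are a minimal sketch with hypothetical paths and targets, not a recommended configuration.

```apache
# [[:digit:]]+ behaves like [0-9]+ for ordinary ASCII URLs
RedirectMatch ^/issue([[:digit:]]+)$ http://www.example.com/issues/$1.html

# A letter followed by zero or more letters or digits
RedirectMatch ^/user/([[:alpha:]][[:alnum:]]*)$ http://www.example.com/~$1/
```

Support for the word-boundary forms [[:<:]] and [[:>:]] varies between regular expression libraries, so it is worth testing them on your own server before relying on them.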
A.1.2 Examples
The previous concepts can best be illustrated by a few examples of
regular
expressions in action.
A.1.2.1 Redirecting several URLs
We'll start with something fairly simple. In this
scenario, we're getting a new web server to handle
the customer support portion of our web site. So, all requests that
previously went to http://www.example.com/support/ will now go
to the new server, http://support.example.com/. Ordinarily, this
could be accomplished with a simple Redirect
statement, but it appears that our web site developer has been
careless and has been using mod_speling (see
Recipe 5.10), so there are links throughout
the site to both http://www.example.com/support/ and to
http://www.example.com/Support/,
which would actually require not one but two
Redirect statements.
So, instead of using two Redirect statements,
we will use a single RedirectMatch
directive:
RedirectMatch ^/[sS]upport/ http://support.example.com/
The square brackets indicate a character class, causing this one
statement to match requests with either the upper- or lowercase
s.
Note also the ^ on the front of the argument,
causing this directive to apply only to URLs that
start with the specified pattern, rather than
URLs that simply happen to contain that pattern somewhere in them.
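For comparison, the two Redirect statements that this single RedirectMatch replaces would look something like the following sketch.

```apache
# One literal prefix per spelling
Redirect /support/ http://support.example.com/
Redirect /Support/ http://support.example.com/
```

One difference is worth knowing. Redirect appends whatever follows the matched prefix to the target URL, whereas the RedirectMatch line shown earlier sends every matching request to the top of the new server, because nothing from the original path is captured.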
A.1.2.2 Catching common misspellings
While watching the logfiles, we see that a number of people are
misspelling
support as suport. This is
easily fixed by slightly altering our
RedirectMatch directive:
RedirectMatch ^/[sS]upp?ort/ http://support.example.com/
The ? makes the second p
optional, thus catching those requests that are misspelled and
redirecting them to the appropriate place anyway.
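If the support pages keep their names on the new server (an assumption; the scenario above does not say either way), the same pattern could be extended with a capturing group so that the rest of the path is carried across. This is a sketch of that variation, not a replacement for the directive above.

```apache
# Matches /support/, /Support/, /suport/, and /Suport/.
# (.*) captures the rest of the path; $1 appends it to the new URL.
RedirectMatch ^/[sS]upp?ort/(.*)$ http://support.example.com/$1
```

With this form, a request for /suport/faq.html would be redirected to http://support.example.com/faq.html rather than to the front page of the new server.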
A.1.3 For More Information
By far the best resources for learning about regular
expressions are Jeffrey Friedl's book
Mastering Regular Expressions and Tony
Stubblebine's book Regular Expression
Pocket Reference, both published by
O'Reilly. They cover regular expressions in many
languages, as well as the theory behind regular expressions in
general.
For a free resource on regular expressions, you should see the Perl
documentation on the topic. Just type perldoc
perlre on any system that has Perl installed. Or
you can view this documentation online at http://www.perldoc.com/perl5.6.1/pod/perlre.html.
But be aware that there are subtle (and not-so-subtle) differences
between the regular expression vocabulary of Perl and that of
Apache.