Apache Cookbook

A.1 What Directives Use Regular Expressions?

Two main categories of Apache directives use regular expressions. Any directive with a name containing the word Match, such as FilesMatch, can be assumed to use regular expressions in its arguments. In addition, the directives supplied by the module mod_rewrite use regular expressions to accomplish their work.

For more about mod_rewrite, see Chapter 5.

SomethingMatch directives each implement the same functionality as their counterpart without the Match. For example, the RedirectMatch directive does essentially the same thing as the Redirect directive, except that the first argument, rather than being a literal string, is a regular expression, which will be compared to the incoming request URL.
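The difference between a literal argument and a regular expression argument can be sketched outside of Apache. The following is only an illustration: it uses Python's re module as a stand-in for Apache's regex library (the two differ in places), it reduces Redirect's behavior to a plain prefix test, and the helper names and the sample path are hypothetical.

```python
import re

def literal_prefix_match(path, prefix):
    """Simplified Redirect-style test: the path starts with a literal string."""
    return path.startswith(prefix)

def regex_match(path, pattern):
    """RedirectMatch-style test: the argument is a regular expression."""
    return re.search(pattern, path) is not None

path = "/indexXhtml"  # hypothetical request path

# As a literal string, the "." only matches a real period.
as_literal = literal_prefix_match(path, "/index.html")

# As a regular expression, the unescaped "." matches any character, including X.
as_regex = regex_match(path, r"^/index.html")

# Escaping the "." restores its literal meaning.
as_escaped = regex_match(path, r"^/index\.html")
```

Here as_literal and as_escaped are False, while as_regex is True, which is why periods in a Match directive's pattern usually need a backslash in front of them.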

A.1.1 Regular Expression Basics

To get started writing your own regular expressions, you'll need to know a few basic pieces of vocabulary, as shown in Table A-1 and Table A-2. These constitute the bare minimum that you need to know. Although this will hardly qualify you as an expert, it will enable you to solve many of the regex scenarios you will find yourself faced with.

Table A-1. A basic regex vocabulary

Character   Meaning
.           Matches any character. This is the wildcard character.
+           Matches one or more of the previous character. For example, M+ would match one or more Ms; .+ would match one or more characters of any kind.
*           Matches zero or more of the previous character. For example, M* would match zero or more Ms. This means that it will not only match M, MM, and MMM, but it will also match a string that doesn't have any Ms in it at all.
?           Makes the previous character optional. For example, the regular expression monkeys? will match a string containing either monkey or monkeys. Note that the ? applies only to a single character in the absence of any enclosing parentheses.
^           Indicates that the following characters must appear at the beginning of the string being tested. Thus, a regular expression of ^zim requires that the string being tested start with the characters zim. ^ is referred to as an anchor, because it anchors the match to the beginning of the string. In the context of a character class (see below), the ^ character has another special meaning.
$           Indicates that the characters to be matched must appear at the end of the string. Thus, a regular expression of gif$ requires that the string being tested end with the characters gif. $ is referred to as an anchor, because it anchors the match to the end of the string.
\           Escapes the following character, meaning that it removes the "specialness" of the character. For example, a pattern containing \. would match a literal . character, since the \ removes the special meaning of the . character.
[]          Character class. Matches one of the things contained in the square brackets. For example, [abc] will match either an a, a b, or a c. [abc]+, on the other hand, would match a sequence of a's, b's, and c's, or any combination of them. Note that within a character class, the ^ character doesn't have its normal anchor status but means any character except those in the class. Thus, a character class of [^abc] will match any character that is not an a, b, or c. A character class containing a - between two characters means an entire range of characters. For example, the character class [a-q] means all of the lowercase letters starting from a and ending with q. [a-zA-Z] would be all uppercase and all lowercase letters. In addition to character classes that you form yourself, there are a number of special predefined character classes to represent commonly used groups of characters. See Table A-2 for a list of these predefined character classes.
()          Groups a set of characters together. This allows you to consider them as a single unit. For example, you could apply a + or ? to an entire group of characters, rather than just a single character. The expression (monkeys)?, for example, would make the entire word monkeys an optional part of the match. In some regular expression libraries, the ( ) characters also capture the contents of the match so that they can be used later.
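The entries in Table A-1 can be checked with a short script. This is a minimal sketch that assumes Python's re module as a stand-in for Apache's regex library; the basic vocabulary shown here behaves the same way in both, but that is an assumption of the sketch, and the test strings are invented for illustration.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# (pattern, test string, expected result), one or two cases per table entry.
cases = [
    (r"M+", "MMM", True),
    (r"M+", "abc", False),
    (r"M*", "abc", True),            # zero Ms still counts as a match
    (r"monkeys?", "monkey", True),   # the trailing s is optional
    (r"^zim", "zimbabwe", True),
    (r"^zim", "nazim", False),       # zim is present, but not at the start
    (r"gif$", "logo.gif", True),
    (r"gif$", "gifts", False),       # gif is present, but not at the end
    (r"\.", "abc", False),           # an escaped . matches only a literal period
    (r"[^abc]", "abc", False),       # every character is in the excluded set
    (r"[^abc]", "abcd", True),
    (r"^(monkeys)?$", "", True),     # the whole group is optional
]

# Collect any case whose actual result differs from the expected one.
mismatches = [(p, s) for p, s, expected in cases if matches(p, s) != expected]
```

An empty mismatches list means every pattern behaved as the table describes.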

Table A-2. Predefined regular expression character classes

Character class   Meaning
[[:alnum:]]       Any alphanumeric character
[[:alpha:]]       Any alphabetical character
[[:blank:]]       A space or horizontal tab
[[:cntrl:]]       A control character
[[:digit:]]       A decimal digit
[[:graph:]]       A nonspace, noncontrol character
[[:lower:]]       A lowercase letter
[[:print:]]       Same as graph, but also includes the space character
[[:punct:]]       A punctuation character
[[:space:]]       Any whitespace character, including newline and return
[[:upper:]]       An uppercase letter
[[:xdigit:]]      A valid hexadecimal digit
[[:<:]]           The boundary between the left end of a word and nonword characters
[[:>:]]           The boundary between the right end of a word and nonword characters
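To make a few of the predefined classes concrete, the sketch below rewrites them as explicit ranges. Several things here are assumptions: Python's re module does not understand the [[:name:]] notation, so the classes are expanded by hand; the ranges are ASCII-only approximations of what the classes mean in the C locale; and the POSIX_ASCII table and expand_posix helper are hypothetical names, not part of Apache or any library.

```python
import re

# ASCII-only stand-ins for some of the classes in Table A-2 (C locale assumed).
POSIX_ASCII = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": " \\t",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}

def expand_posix(pattern):
    """Replace each [:name:] inside a character class with explicit ranges.

    Raises KeyError for a class name that is not in POSIX_ASCII.
    """
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)

digits = expand_posix("[[:digit:]]+")
capitalized = expand_posix("^[[:upper:]][[:lower:]]+$")
```

With these definitions, digits is the pattern [0-9]+ and capitalized is ^[A-Z][a-z]+$, so capitalized matches Apache but not apache.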

A.1.2 Examples

The previous concepts can best be illustrated by a few examples of regular expressions in action.

A.1.2.1 Redirecting several URLs

We'll start with something fairly simple. In this scenario, we're getting a new web server to handle the customer support portion of our web site. So, all requests that previously went to http://www.example.com/support/ will now go to the new server, http://support.example.com/. Ordinarily, this could be accomplished with a simple Redirect statement, but it appears that our web site developer has been careless and has been using mod_speling (see Recipe 5.10), so there are links throughout the site to both http://www.example.com/support/ and to http://www.example.com/Support/, which would actually require not one but two Redirect statements.

So, instead of using two Redirect statements, we will use a single RedirectMatch directive:

RedirectMatch ^/[sS]upport/ http://support.example.com/

The square brackets indicate a character class, causing this one statement to match requests with either the upper- or lowercase s.

Note also the ^ at the front of the argument, which causes this directive to apply only to URLs that start with the specified pattern, rather than to URLs that simply happen to contain that pattern somewhere in them.
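The effect of the character class and the anchor can be tried out before the directive goes into the configuration. The following is a rough check only, assuming Python's re module behaves like Apache's regex library for this pattern; the request paths are made up for the test.

```python
import re

# The same pattern that is given to RedirectMatch above.
pattern = re.compile(r"^/[sS]upport/")

# Hypothetical request paths.
paths = [
    "/support/faq.html",
    "/Support/faq.html",
    "/docs/support/",   # contains the pattern, but not at the start
    "/SUPPORT/",        # only the first letter may vary in case
]

matched = [p for p in paths if pattern.search(p)]
```

Only the first two paths end up in matched. The third fails because of the ^ anchor, and the fourth because the character class covers only the first letter.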

A.1.2.2 Catching common misspellings

While watching the logfiles, we see that a number of people are misspelling support as suport. This is easily fixed by slightly altering our RedirectMatch directive:

RedirectMatch ^/[sS]upp?ort/ http://support.example.com/

The ? makes the second p optional, thus catching those requests that are misspelled and redirecting them to the appropriate place anyway.
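The revised pattern can be checked the same way. Again, this is a sketch that assumes Python's re module as a stand-in for Apache's regex library, with invented request paths.

```python
import re

# The revised pattern: the second p is now optional.
pattern = re.compile(r"^/[sS]upp?ort/")

# Hypothetical request paths, including some misspellings.
paths = [
    "/support/",
    "/suport/",
    "/Suport/contact.html",
    "/suppport/",   # three p's: the ? allows one or two, not more
    "/sport/",      # the u is still required
]

matched = [p for p in paths if pattern.search(p)]
```

The first three paths match. The last two do not, which shows that the ? loosens the pattern only for the one misspelling it was meant to catch.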

A.1.3 For More Information

By far the best resources for learning about regular expressions are Jeffrey Friedl's book Mastering Regular Expressions and Tony Stubblebine's book Regular Expression Pocket Reference, both published by O'Reilly. They cover regular expressions in many languages, as well as the theory behind regular expressions in general.

For a free resource on regular expressions, you should see the Perl documentation on the topic. Just type perldoc perlre on any system that has Perl installed. Or you can view this documentation online at http://www.perldoc.com/perl5.6.1/pod/perlre.html. But be aware that there are subtle (and not-so-subtle) differences between the regular expression vocabulary of Perl and that of Apache.
