
B.1 Characters and Metacharacters

In a regular expression, some characters match themselves, such as the hyphen in the ZIP Code regex or the < in the HTML tag regex. Some characters have special meanings, such as the ? that makes something optional or the square brackets that mean "one character from the list inside the square brackets." The characters that match themselves are called literals. The characters that have special meanings are called metacharacters.

A pattern containing only literals matches strings that contain the sequence of literals in the pattern. For example, the pattern href= matches the strings <a href="/">Home</a>, schref=, and set href=12.
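As a quick check of the literal-only case, here is a minimal sketch. It assumes Python's re module as the regex engine, which treats these literal characters the same way. The pattern and the first three strings come from the paragraph above; the string "h ref=" is an extra, hypothetical non-match added for contrast.

```python
import re

# Literal-only pattern from the text: every character matches itself.
pattern = re.compile(r"href=")

# The first three strings are from the text; "h ref=" is an added
# counterexample (the space breaks up the literal sequence).
samples = ['<a href="/">Home</a>', "schref=", "set href=12", "h ref="]

# re.search looks for the pattern anywhere in the string.
results = {s: bool(pattern.search(s)) for s in samples}

for s, matched in results.items():
    print(f"{s!r}: {matched}")
```

The pattern matches wherever the exact sequence href= appears, regardless of what surrounds it.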

The metacharacter . (dot) matches any character.[2] So, the pattern d.g matches dog, d7g, adagio, digdug, and *d*g*, among other possibilities. It also matches d.g, since dot (the metacharacter) matches a literal . character. Without a quantifier (introduced in Section B.2), dot matches exactly one character. This means that d.g doesn't match ridge (it has no characters between the d and the g) or doug (it has more than one character between the d and the g).

[2] This isn't entirely true: by default, dot doesn't match a newline character. Turning on the s pattern modifier makes dot match a newline as well. This and other pattern modifiers are explained later in this appendix in Section B.6.
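The dot examples, including the newline caveat from the footnote, can be tried out with a short sketch. It assumes Python's re module; there the re.DOTALL flag plays the role that the s pattern modifier plays in the text. All test strings except "d\ng" are taken from the paragraph above.

```python
import re

dot = re.compile(r"d.g")

# Strings the text says match, and strings it says do not.
should_match = ["dog", "d7g", "adagio", "digdug", "*d*g*", "d.g"]
should_not_match = ["ridge", "doug"]

matches = {s: bool(dot.search(s)) for s in should_match + should_not_match}

# By default, dot does not match a newline character.
newline_default = bool(re.search(r"d.g", "d\ng"))

# With re.DOTALL (Python's counterpart to the s modifier), it does.
newline_dotall = bool(re.search(r"d.g", "d\ng", re.DOTALL))

for s, matched in matches.items():
    print(f"{s!r}: {matched}")
print("newline, default flags:", newline_default)
print("newline, DOTALL:", newline_dotall)
```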

The metacharacter | (bar) is for alternation. Use alternation to construct a pattern that matches more than one set of characters. For example, dog|cat matches strings that contain dog or cat, such as dog, cathode, redogame, and hotdog stand. The pattern dog|cat does not mean "match do, then either g or c, then at." Each alternative extends all the way back to the beginning of the pattern or forward to the end of the pattern. However, you can restrict the reach of alternation by enclosing the choices in parentheses. For example, s(cr|in)ew means "match s, then either cr or in, then ew." It matches screw, sinew, and my screwdriver, but not screen or deminews. Without the parentheses, the pattern scr|inew means "match scr or inew." This still matches screw and sinew, but it also matches screen and deminews. Alternation also works with more than two choices. For example, s(cr|in|tr|ch)ew matches screw, sinew, strew, and eschew.
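The difference that parentheses make to alternation can be seen in a minimal sketch, again assuming Python's re module. The helper name hits is hypothetical; the patterns and strings are the ones used in the paragraph above.

```python
import re

def hits(pattern, words):
    """Return the words in which the pattern matches somewhere."""
    return [w for w in words if re.search(pattern, w)]

words = ["screw", "sinew", "my screwdriver", "screen", "deminews"]

# Parentheses limit the alternation to the cr|in part.
grouped = hits(r"s(cr|in)ew", words)

# Without parentheses, the alternatives are "scr" and "inew".
ungrouped = hits(r"scr|inew", words)

# Alternation with more than two choices.
four_way = hits(r"s(cr|in|tr|ch)ew", ["screw", "sinew", "strew", "eschew"])

# The simple dog|cat pattern from the start of the paragraph.
pets = hits(r"dog|cat", ["dog", "cathode", "redogame", "hotdog stand"])

print("grouped:  ", grouped)
print("ungrouped:", ungrouped)
print("four-way: ", four_way)
print("dog|cat:  ", pets)
```

The grouped pattern rejects screen and deminews, while the ungrouped pattern accepts all five strings because each contains either scr or inew.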

Using parentheses to group together characters for alternation is called grouping. (Some things about regular expressions are straightforward.) Grouping also applies to quantifiers, as discussed in the next section. Parentheses also capture the text inside them for subsequent use. The characters that match the part of the pattern inside a set of parentheses are stored in a special variable so you can retrieve them later. Capturing is explained in more detail later in this appendix, in Section B.6.1 and Section B.6.2.
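To give a first feel for capturing ahead of Section B.6, here is a small sketch that assumes Python's re module. Note that Python hands back captured text through a match object rather than through a special variable, so the retrieval mechanism shown here is an assumption of this example, not the one the appendix goes on to describe.

```python
import re

# The parentheses both group the alternatives and capture what matched.
m = re.search(r"s(cr|in|tr|ch)ew", "my screwdriver")

whole = m.group(0)   # the text matched by the entire pattern
inner = m.group(1)   # the text matched inside the first parentheses

print("whole match:", whole)
print("captured:   ", inner)
```

Here the whole match is screw, and the captured text is the cr that the parenthesized part of the pattern matched.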
