[Nemeth10] 2.3. Regular expressions

2.3. Regular expressions

Regularexpressions are supported by most modern languages, though some takethem more to heart than others. They’re also used by UNIX commands suchas grep and vi.They are so common that the name is usually shortened to “regex.”Entire books have been written about how to harness their power, andthey have been the subject of numerous doctoral dissertations.

The filename matching and expansion performed by the shell when it interprets command lines such as wc -l *.pl is not a form of regex matching. It’s a different system called “shell globbing,” and it uses a different and simpler syntax.

Regularexpressions are powerful, but they cannot recognize all possiblegrammars. Their most notable weakness is that they cannot recognizenested delimiters. For example, it’s not possible to write a regular expression that recognizes valid arithmetic expressions when parentheses are allowed for grouping.

Regularexpressions reached the apex of their power and perfection in Perl. Infact, Perl’s pattern matching features are so elaborate that it’s notreally accurate to call them an implementation of regularexpressions. Perl patterns can match nested delimiters, recognizepalindromes, and match an arbitrary string of As followed by the samenumber of Bs—all feats beyond the reach of regular expressions. However, Perl can process vanilla regular expressions as well.

Perl’s pattern matching languageremains the industry benchmark, and it has been widely adopted by otherlanguages and tools. Philip Hazel’s PCRE (Perl-compatible regular expression) library makes it relatively easy for developers to incorporate the language into their own projects.

Regularexpressions are not themselves a scripting language, but they’re souseful that they merit featured coverage in any discussion of scripting;hence, this section.^[9] Here, we discuss them in their basic form with a few of Perl’s refinements.

^[9] Perl guru Tom Christiansen commented, “I don’t know what a ‘scripting language’ is, but I agree that regularexpressions are neither procedural nor functional languages. Rather,they are a logic-based or declarative language, a class of languagesthat also includes Prolog and Makefiles. And BNFs. One might also callthem rule-based languages. I prefer to call them declarative languagesmyself.”

The matching process

Code that evaluates a regular expressionattempts to match a single given text string to a single given pattern.The “text string” to match can be very long and can contain embeddednewlines. It’s often convenient to use a regex to match the contents ofan entire file or HTML document.

For the matcher to declaresuccess, the entire search pattern must match a contiguous section ofthe search text. However, the pattern can match at any position. After asuccessful match, the evaluator returns the text of the match alongwith a list of matches for any specially delimited subsections of thepattern.

Literal characters

In general, characters in a regular expression match themselves. So the pattern

I am the walrus

matches the string “I amthe walrus” and that string only. Since it can match anywhere in thesearch text, the pattern can be successfully matched to the string “I amthe egg man. I am the walrus. Koo koo ka-choo!” However, the actualmatch is limited to the “I am the walrus” portion. Matching is casesensitive.

Special characters

Table 2.4 shows the meanings of some common special symbols that can appear in regular expressions. These are just the basics—there are many, many more.

Table 2.4. Special characters in regular expressions (common ones)
Symbol	What it matches or does
`.`	Matches any character
`[`chars`]`	Matches any character from a given set
`[^`chars`]`	Matches any character not in a given set
`^`	Matches the beginning of a line
`$`	Matches the end of a line
`\w`	Matches any “word” character (same as `[A-Za-z0-9_]`)
`\s`	Matches any whitespace character (same as `[\f\t\n\r]`)^[a]
`\d`	Matches any digit (same as `[0-9]`)
`\|`	Matches either the element to its left or the one to its right
`(`expr)	Limits scope, groups elements, allows matches to be captured
	Allows zero or one match of the preceding element
`*`	Allows zero, one, or many matches of the preceding element
`+`	Allows one or more matches of the preceding element
`{`n}	Matches exactly n instances of the preceding element
`{`min,`}`	Matches at least min instances (note the comma)
`{`min,max`}`	Matches any number of instances from min to max

^[a] That is, a space, a form feed, a tab, a newline, or a return

Many special constructs, such as + and |,affect the matching of the “thing” to their left or right. In general, a“thing” is a single character, a subpattern enclosed in parentheses, ora character class enclosed in square brackets. For the |character, however, thingness extends indefinitely to both left andright. If you want to limit the scope of the vertical bar, enclose thebar and both things in their own set of parentheses. For example,

I am the (walrus|egg man)\.

matches either “I am thewalrus.” or “I am the egg man.”. This example also demonstrates escapingof special characters (here, the dot). The pattern

(I am the (walrus|egg man)\. ?){1,2}

matches any of the following:

I am the walrus.
I am the egg man.
I am the walrus. I am the egg man.
I am the egg man. I am the walrus.

Unfortunately,it also matches “I am the egg man. I am the egg man.”. (What kind ofsense does that make?) More importantly, it also matches “I am thewalrus. I am the egg man. I am the walrus.”, even though the number ofrepetitions is explicitly capped at two. That’s because the pattern neednot match the entire search text. Here, the regex matches two sentencesand terminates, declaring success. It simply doesn’t care that anotherrepetition is available.

It is a common error to confuse the regular expression metacharacter * (the zero-or-more quantifier) with the shell’s * globbing character. The regex version of the star needs something to modify; otherwise, it won’t do what you expect. Use .* if any sequence of characters (including no characters at all) is an acceptable match.

Example regular expressions

In the United States,postal (“zip”) codes have either five digits or five digits followed by adash and four more digits. To match a regular zip code, you must match a five-digit number. The following regular expression fits the bill:

^\d{5}$

The ^ and $match the beginning and end of the search text but do not actuallycorrespond to characters in the text; they are “zero-width assertions.”These characters ensure that only texts consisting of exactly fivedigits match the regular expression—the regex will not match five digits within a larger string. The \d escape matches a digit, and the quantifier {5} says that there must be exactly five digit matches.

To accommodate either a five-digit zip code or an extended zip+4, add an optional dash and four additional digits:

^\d{5}(-\d{4})?$

The parentheses group thedash and extra digits together so that they are considered one optionalunit. For example, the regex won’t match a five-digit zip code followedby a dash. If the dash is present, the four-digit extension must bepresent as well or there is no match.

A classic demonstration of regex matching is the following expression,

M[ou]'?am+[ae]r ([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]

which matches most of the variant spellings of the name of Libyan head of state Moammar Gadhafi, including

Muammar al-Kaddafi	(BBC)
Moammar Gadhafi	(Associated Press)
Muammar al-Qadhafi	(Al-Jazeera)
Mu’ammar Al-Qadhafi	(U.S. Department of State)

Do you see how each of these would match the pattern?

This regular expression also illustrates how quickly the limits of legibility can be reached. Many regex systems (including Perl’s) support an xoption that ignores literal whitespace in the pattern and enablescomments, allowing the pattern to be spaced out and split over multiplelines. You can then use whitespace to separate logical groups andclarify relationships, just as you would in a procedural language. Forexample:

M [ou] '? a m+ [ae] r   # First name: Mu'ammar, Moamar, etc.
\s                      # Whitespace; can't use a literal space here
(                       # Group for optional last name prefix
  [AEae] l              #   Al, El, al, or el
  [-\s]                 #   Followed by dash or space
)?
[GKQ] h? [aeu]+         # Initial syllable of last name: Kha, Qua, etc.
(                       # Group for consonants at start of 2nd syllable
  [dtz] [dhz]?          #   dd, dh, etc.
)+
af [iy]

This helps a little bit, butit’s still pretty easy to torture later readers of your code. So bekind: if you can, use hierarchical matching and multiple small matchesinstead of trying to cover every possible situation in one large regular expression.

Captures

When a match succeeds, every set ofparentheses becomes a “capture group” that records the actual text thatit matched. The exact manner in which these pieces are made available toyou depends on the implementation and context. In Perl, you can accessthe results as a list or as a sequence of numbered variables.

Since parentheses can nest,how do you know which match is which? Easy—the matches arrive in thesame order as the opening parentheses. There are as many captures asthere are opening parentheses, regardless of the role (or lack of role)that each parenthesized group played in the actual matching. When aparenthesized group is not used (e.g., Mu(')?ammar when matched against “Muammar”), its corresponding capture is empty.

If a group is matched more than once, only the contents of the last match are returned. For example, with the pattern

(I am the (walrus|egg man)\. ?){1,2}

matching the text

I am the egg man. I am the walrus.

there are two results, one for each set of parentheses:

I am the walrus.
walrus

Notethat both capture groups actually matched twice. However, only the lasttext to match each set of parentheses is actually captured.

Greediness, laziness, and catastrophic backtracking

Regular expressionsmatch from left to right. Each component of the pattern matches thelongest possible string before yielding to the next component, acharacteristic known as greediness.

If the regex evaluatorreaches a state from which a match cannot be completed, it unwinds a bitof the candidate match and makes one of the greedy atoms give up someof its text. For example, consider the regex a*aa being matched against the input text “aaaaaa”.

At first, the regex evaluator assigns the entire input to the a* portion of the regex, because the a*is greedy. When there are no more a’s to match, the evaluator goes onto try to match the next part of the regex. But oops, it’s an a, and there is no more input text that can match an a; time to backtrack. The a* has to give up one of the a’s it has matched.

Now the evaluator can match a*a, but it still cannot match the last a in the pattern. So it backtracks again and takes away a second a from the a*. Now the second and third a’s in the pattern both have a’s to pair with, and the match is complete.

This simple exampleillustrates some important general points. First, greedy matching plusbacktracking makes it expensive to match apparently simple patterns suchas <img.*></tr> when processing entire files.^[10] The .*portion starts by matching everything from the first <img to the endof the input, and only through repeated backtracking does it contractto fit the local tags.

^[10]Although this section shows HTML excerpts as examples of text to bematched, regular expressions are not really the right tool for this job.Our external reviewers were uniformly aghast. Perl and Python both haveexcellent add-ons that parse HTML documents the proper way. You canthen access the portions you’re interested in with XPath selectors. Seethe Wikipedia page for XPath and the respective languages’ modulerepositories for details.

Furthermore, the ></tr> that this pattern binds to is the last possiblevalid match in the input, which is probably not what you want. Morelikely, you meant to match an <img> followed by a </tr> tag.A better way to write this pattern is <img[^>]*></tr>,which allows the initial wild-card match to expand only to the end ofthe current tag because it cannot cross a right-angle-bracket boundary.

You can also use lazy (as opposed to greedy) wild card operators: *? instead of *, and +? instead of +.These versions match as few characters of the input as they can. Ifthat fails, they match more. In many situations, these operators aremore efficient and closer to what you want than the greedy versions.

Note, however, that they canproduce different matches than the greedy operators; the difference ismore than just one of implementation. In our HTML example, the lazypattern would be <img.*?></tr>. But even here, the .*? could eventually growto include unwanted >’s because the next tag after an <img>might not be a </tr>. Again, probably not what you want.

Patterns with multiplewild-card sections can cause exponential behavior in the regexevaluator, especially if portions of the text can match several of thewildcard expressions and especially if the search text does not in factmatch the pattern. This situation is not as unusual as it might sound,especially when pattern matching with HTML. Very often, you’ll want tomatch certain tags followed by other tags, possibly separated by evenmore tags, a recipe that may require the regex evaluator to try manypossible combinations.

Regex guru Jan Goyvaerts calls this phenomenon “catastrophic backtracking” and writes about it in his blog; see regular-expressions.info/catastrophic.html for details and some good solutions.

A couple of take-home points from all this:

If you can do pattern matching line-by-line rather than file-at-a-time, there is much less risk of poor performance.
Even though regex notation makes greedy operators the default, they probably shouldn’t be. Use lazy operators.
All instances of .* are inherently suspicious and should be scrutinized.

本站仅提供存储服务，所有内容均由用户发布，如发现有害或侵权内容，请点击举报。