This page (partially) represents my personal notes from the excellent O'Reilly book 'Mastering Regular Expressions' by Jeffrey Friedl.

see also: this on multi-line searching


Generic Examples

gr[ea]y, gr(e|a)y matches grey or gray. The [ea] demonstrates the character class construct.
<[Hh][1-6] *> matches <H1>, <H2>, ... tags, allowing for a space, and upper/lower case 'H'
[0-9A-Fa-f] matches a hex digit
q[^u] matches a 'q' followed by any character other than 'u'. It needs a character after the 'q', so it does not match a 'q' at the very end of a line
colou?r makes British and Americans happy
<HR( +size *= *[0-9]+)? *> matches <HR> tags with or without the size modifier
[-+]?[0-9]+(\.[0-9]*)? matches a floating point number
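A quick way to try the patterns above is to feed them to any regex engine. The following is only a sketch using Python's `re` module (my choice of tool, not something these notes prescribe); the helper names `matches_whole` and `found_in` are made up for illustration.

```python
import re

# Patterns copied from the list above.
FLOAT = r"[-+]?[0-9]+(\.[0-9]*)?"
COLOR = r"colou?r"
Q_NOT_U = r"q[^u]"
HEADING = r"<[Hh][1-6] *>"

def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

def found_in(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

print(matches_whole(FLOAT, "-3.14"))   # sign, digits, fraction
print(matches_whole(FLOAT, ".5"))      # no digit before the dot
print(found_in(Q_NOT_U, "Iraqi"))      # 'q' followed by 'i'
print(found_in(Q_NOT_U, "Iraq"))       # nothing follows the 'q'
```

Note that the float pattern insists on at least one digit before the decimal point, so it accepts `42.` but rejects `.5`.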

Fun with e?grep

grep vs. egrep
In grep, you have to escape the + and () characters with a backslash. For clarity, the examples below use egrep syntax.
GNU egrep combines a DFA with an NFA engine (it primarily uses a DFA for speed, but then does an NFA pass to handle backreferences).

\<([A-Za-z]+) +\1\> finds a word that is repeated twice in a row. The \1 (known as a backreference) matches whatever text the first set of parentheses matched
\<ray\> finds ray when it is a whole word. Matches 'ray' but not 'array' or 'stray'
\<ray finds ray when it starts a word. Matches 'raymond' and 'ray' but not 'array'
(an){2,3} matches when at least 2 and up to 3 cases of 'an' occur (e.g. 'banana', 'Montanan')
gr(e|a)y matches grey or gray. The (e|a) demonstrates alternation. NOTE: unescaped alternation (|) does not work with grep; use egrep (GNU grep also accepts \| as an extension)
(([a-z])\2).*\1 finds all words that contain the same double letter more than once (e.g. 'Mississippi', 'uselessness', 'schoolroom')
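The backreference examples above can also be tried outside egrep. This is a sketch in Python's `re` module, on the assumption that `\b` is an acceptable stand-in for egrep's `\<` and `\>` word anchors (Python has no direct equivalent); `doubled_words` is a hypothetical helper name.

```python
import re

# \b replaces egrep's \< and \> here (an assumption of this port).
DOUBLED_WORD = re.compile(r"\b([A-Za-z]+) +\1\b")
WHOLE_WORD_RAY = re.compile(r"\bray\b")
REPEATED_AN = re.compile(r"(an){2,3}")
REPEATED_DOUBLE = re.compile(r"(([a-z])\2).*\1")

def doubled_words(text):
    """Return each word that appears twice in a row."""
    return [m.group(1) for m in DOUBLED_WORD.finditer(text)]

print(doubled_words("this is is a test of the the engine"))
print(bool(WHOLE_WORD_RAY.search("array")))
print(bool(REPEATED_DOUBLE.search("mississippi")))
```

The leading word anchor matters: without it, 'this is' would be reported as a doubled 'is', because the tail of 'this' would count as the first copy.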

Perl Variations

Basic Syntax

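Basic form: $stringToProcess =~ m/regex/modifiers (match) or $stringToProcess =~ s/regex/replacement/modifiers (substitute)
  • 'm' means 'match', 's' means 'substitute'. If you omit this character (and keep / as the delimiter), Perl assumes match.
  • The / delimiter can be almost any other character (yes, Perl is weird). I like '!' too.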

Useful Perl Quickies

s!<[^>]*>! !g replaces <TAGS> with a space
s![^a-z0-9 ]!!ig removes every character that is not a letter, digit, or space
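For comparison, here is roughly what those two quickies look like in Python's `re` module. This is an illustrative port, not part of the original notes; the function names are invented, and `re.ASCII` is added on the assumption that only ASCII letters should count.

```python
import re

def strip_tags(text):
    """Replace each <...> tag with a single space."""
    return re.sub(r"<[^>]*>", " ", text)

def keep_alnum_and_spaces(text):
    """Delete everything except ASCII letters, digits, and spaces."""
    # re.sub replaces every match, so no equivalent of /g is needed.
    return re.sub(r"[^a-z0-9 ]", "", text, flags=re.IGNORECASE | re.ASCII)

print(repr(strip_tags("<b>bold</b> text")))
print(keep_alnum_and_spaces("Hello, World! 123"))
```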

Disassembling a TWiki Word...

   # Make BSFLeaders into BSF Leaders, but don't make
   #  InterfacingTCL into Interfacing TC L

   # Make BSFLeaders into BSF Leaders
   $name =~ s!([A-Z\s]+)([A-Z][^A-Z\s]+)!$1 $2!g;

   # make DogWalkers into Dog Walkers
   $name =~ s!([a-z])([A-Z])!$1 $2!g;

   # make Lotus123 into Lotus 123
   $name =~ s!([A-Z])([^A-Z\s])!$1 $2!gi;

   # Make 1999Corvette into 1999 Corvette
   $name =~ s!([^A-Z\s])([A-Z])!$1 $2!gi;
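To check how the four substitutions interact, it can help to run them in sequence on the sample names from the comments. The sketch below ports them one-for-one to Python's `re` module (an assumption of convenience; the notes themselves use Perl), with `space_out_wiki_word` as a made-up function name.

```python
import re

def space_out_wiki_word(name):
    """Apply the four substitutions above, in the same order."""
    # BSFLeaders -> BSF Leaders: a run of capitals, then a capitalised word
    name = re.sub(r"([A-Z\s]+)([A-Z][^A-Z\s]+)", r"\1 \2", name)
    # DogWalkers -> Dog Walkers: lower-case letter followed by a capital
    name = re.sub(r"([a-z])([A-Z])", r"\1 \2", name)
    # Lotus123 -> Lotus 123: with ignore-case, any letter followed by a non-letter
    name = re.sub(r"([A-Z])([^A-Z\s])", r"\1 \2", name, flags=re.IGNORECASE)
    # 1999Corvette -> 1999 Corvette: with ignore-case, a non-letter followed by a letter
    name = re.sub(r"([^A-Z\s])([A-Z])", r"\1 \2", name, flags=re.IGNORECASE)
    return name

for word in ["BSFLeaders", "InterfacingTCL", "DogWalkers", "Lotus123", "1999Corvette"]:
    print(word, "->", space_out_wiki_word(word))
```

The first pattern leaves 'TCL' alone because its second group needs at least one non-capital character after the final capital, and there is nothing after the 'L'.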

Perl's Metachars

\t tab
\n newline
\r carriage return
\s ; \S whitespace (tab, space, newline, formfeed, etc) ; anything NOT \s
\w ; \W a word character (letter, digit, or underscore), so \w+ matches a word ; anything NOT \w
\d ; \D a digit ; anything NOT \d

Modifiers...

i ignore case
x allows comments and free spacing (for multi-line, long regex convenience)
g global: match or substitute every occurrence, not just the first
m treat caret and dollar as start and end of each logical line, not just of the whole string
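The effect of each modifier is easiest to see side by side. The sketch below uses Python's `re` flags as rough counterparts of the Perl modifiers (an assumed mapping for illustration: `re.IGNORECASE` for i, `re.VERBOSE` for x, `re.MULTILINE` for m, and the `count` argument of `re.sub` in place of g).

```python
import re

text = "first line\nSecond line"

# i: ignore case
ignore_case = re.findall(r"second", text, flags=re.IGNORECASE)

# m: ^ matches at the start of every line, not only the start of the string
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
string_start_only = re.findall(r"^\w+", text)

# x: whitespace and comments inside the pattern are ignored
number = re.compile(r"""
    [-+]?          # optional sign
    [0-9]+         # integer part
    (?:\.[0-9]*)?  # optional fraction
""", flags=re.VERBOSE)

# g: re.sub replaces every match by default; count=1 mimics leaving g off
all_replaced = re.sub(r"line", "LINE", text)
first_replaced = re.sub(r"line", "LINE", text, count=1)

print(ignore_case, line_starts, string_start_only)
print(repr(all_replaced), repr(first_replaced))
```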

Non-greedy example:

This finds HTML tag pairs and replaces them with a big ugly 'found' tag (e.g. <B>this</B> becomes <-FOUND was (B)->this<-/FOUND (was /B)->).

NOTE: this does not work on nested tags (e.g. with <B>this <I>that</I> </B>, only the outer <B> tags get swapped and the inner ones get skipped). The only way I know to get around this with regexes is to replace with something that won't match if done again, then loop until done.

s!<(.*?)>(.*?)</(\1)>!<-FOUND was ($1)->$2<-/FOUND (was /$1)->!ig

If you don't want to use non-greedy quantifiers (or have to adapt this to something that doesn't support them), do something like this...

s!<([^>]+)>([^<>]*)</(\1)>!<-FOUND was ($1)->$2<-/FOUND (was /$1)->!ig
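The difference between the two approaches shows up on nested input. The following sketch ports both substitutions to Python's `re` module (my port, with the element content captured as group 2 in both patterns); `mark_pairs`, `LAZY`, and `NO_LAZY` are names invented for the example.

```python
import re

LAZY = re.compile(r"<(.*?)>(.*?)</\1>", re.IGNORECASE)
NO_LAZY = re.compile(r"<([^>]+)>([^<>]*)</\1>", re.IGNORECASE)
REPLACEMENT = r"<-FOUND was (\1)->\2<-/FOUND (was /\1)->"

def mark_pairs(pattern, html):
    """Replace every tag pair the pattern finds with the 'found' markers."""
    return pattern.sub(REPLACEMENT, html)

nested = "<B>this <I>that</I> </B>"
print(mark_pairs(LAZY, "<B>this</B>"))
print(mark_pairs(LAZY, nested))     # rewrites the outer <B> pair only
print(mark_pairs(NO_LAZY, nested))  # rewrites the inner <I> pair only
```

On this input the negated-class version picks the innermost pair, because `[^<>]*` cannot cross another tag. That makes it a natural fit for the loop-until-done idea in the note above, since each pass peels off one nesting level.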

Validating input: detecting floating point temperatures

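print "Enter a temperature (e.g. 32F or 100C)\n";
$input = <STDIN>;
chomp($input);   # chomp removes only the trailing newline

# !~ means NOT =~
# the /i means ignore case
# \d+ requires at least one digit, so a bare 'C' or '.C' is rejected
if ( $input !~ m/^([-+]?\d+(\.\d*)?)\s*([CF])$/i ) {
        print "validation failure\n";
} else {
        print "Got input value $1, type = $3\n";
}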
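The same validation can be wrapped in a function so that it is easy to test with a handful of inputs. This is a sketch in Python's `re` module rather than Perl; `parse_temperature` is a hypothetical name, and the pattern here assumes at least one digit is required (`\d+`).

```python
import re

# Group 1 is the number, group 3 is the scale letter.
TEMPERATURE = re.compile(r"^([-+]?\d+(\.\d*)?)\s*([CF])$", re.IGNORECASE)

def parse_temperature(line):
    """Return (number_text, scale) or None when validation fails."""
    m = TEMPERATURE.match(line.rstrip("\n"))
    if m is None:
        return None
    return m.group(1), m.group(3).upper()

for sample in ["98.6F\n", "-40 c", "hot", "C"]:
    print(repr(sample), "->", parse_temperature(sample))
```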

Substitution

  • $string =~ s!\bth!sh!ig -- substitutes 'th' with 'sh' when it starts a word. The /ig means substitute globally and case-insensitively
e.g.: This is the thirsty three thirty fifth becomes shis is she shirsty shree shirty fifth

  • $string =~ s!\b(some)\w*!$1-!g -- Here we see that you can use backreferences in the replacement too. This useless example turns any word that starts with some into some-. Slick IMHO.
e.g.: some someone something lonesomeone becomes some- some- some- lonesomeone

  • $line =~ s!^!|>!; -- changes $line so that it has a |> at the beginning

  • $line =~ s!\bwow\b!$&!g -- $& holds the text that matched, so you can wrap the matched text in the replacement (as written, each wow is replaced by itself; put the wrapping text around $&)
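The substitutions in this list can be checked against their example sentences. Below is a sketch in Python's `re` module (a port for illustration, not the Perl itself): `\1` plays the role of `$1`, `\g<0>` plays the role of `$&`, and the asterisks wrapped around the match in `emphasize_wow` are my own choice of wrapping text.

```python
import re

def th_to_sh(text):
    """Replace a word-initial 'th' (any case) with 'sh'."""
    return re.sub(r"\bth", "sh", text, flags=re.IGNORECASE)

def some_dash(text):
    """Turn every word that starts with 'some' into 'some-'."""
    return re.sub(r"\b(some)\w*", r"\1-", text)

def quote_prefix(line):
    """Put |> at the beginning of the line."""
    return re.sub(r"^", "|>", line)

def emphasize_wow(text):
    """Wrap each whole word 'wow' in asterisks using the whole-match reference."""
    return re.sub(r"\bwow\b", r"*\g<0>*", text)

print(th_to_sh("This is the thirsty three thirty fifth"))
print(some_dash("some someone something lonesomeone"))
print(quote_prefix("hello"))
print(emphasize_wow("oh wow, wowza"))
```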

Profound thoughts

  • quantified items (e.g., using +, *, {m,n}) match as much as they can without making some later part of the regex fail. We therefore call them greedy.
  • IDEA: since these regexes are all equivalent: to(ni(ght|te)|knight), tonite|toknight|tonight, to(k?night|nite), could the regex interpreter auto-generate the optimal expression from one of these for speed?
  • DFA (Deterministic Finite Automaton) engines are fast & consistent (but they do no backtracking and cannot handle backreferences)
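Two of the points above are easy to observe directly: greedy quantifiers give back only as much as they must, and the three 'tonight' expressions accept the same strings. This is a small sketch using Python's `re` module; the word list and the `accepted` helper are made up for the demonstration, and a spot check on a few words is of course not a proof of equivalence.

```python
import re

# Greedy: .* first grabs everything, then gives back just enough
# for \d+ to match a single digit.
greedy = re.search(r"(.*)(\d+)", "abc12345")
print(greedy.group(1), greedy.group(2))

ALTERNATIVES = [
    r"to(ni(ght|te)|knight)",
    r"tonite|toknight|tonight",
    r"to(k?night|nite)",
]
WORDS = ["tonight", "tonite", "toknight", "tonit", "toknite", "night"]

def accepted(pattern):
    """Return the set of sample words the pattern matches in full."""
    return {w for w in WORDS if re.fullmatch(pattern, w)}

for pattern in ALTERNATIVES:
    print(pattern, sorted(accepted(pattern)))
```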

-- MattWalsh - 15 Aug 2002

Topic revision: r7 - 25 Aug 2005 - MattWalsh