This page (partially) captures my personal notes on the excellent O'Reilly book 'Mastering Regular Expressions' by Jeffrey Friedl.
see also:
this on multi-line searching
Generic Examples
gr[ea]y, gr(e|a)y |
matches grey or gray. The [ea] demonstrates the character class construct. |
<[Hh][1-6] *> |
matches <H1> through <H6> tags with an upper- or lower-case 'H', allowing optional spaces before the closing > |
[0-9A-Fa-f] |
matches a hex digit |
q[^u] |
matches a 'q' followed by any character other than 'u'. A 'q' at the very end of a line does not match, because [^u] still has to match some character |
colou?r |
matches both 'colour' and 'color' (the ? makes the 'u' optional), which keeps British and Americans happy |
<HR( +size *= *[0-9]+)? *> |
matches <HR> tags with or without a numeric size attribute |
[-+]?[0-9]+(\.[0-9]*)? |
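matches an optionally signed floating point number such as 3, -2.5 or +10. (at least one digit is required before the optional fraction) |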
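To try a few of the generic patterns above, here is a minimal Perl sketch. The patterns are copied from the table; the `matches` helper, the hash keys and the sample strings are made up for illustration, and the `^`/`$` anchors on two of the patterns are my addition so that the whole string has to match.

```perl
use strict;
use warnings;

# Patterns from the table above. The anchors on hex_digit and float are
# an assumption added here so that the whole string must match.
my %pattern = (
    grey_gray => qr/gr[ea]y/,
    hex_digit => qr/^[0-9A-Fa-f]$/,
    q_not_u   => qr/q[^u]/,
    colour    => qr/colou?r/,
    float     => qr/^[-+]?[0-9]+(\.[0-9]*)?$/,
);

# Hypothetical helper: returns 1 if $text matches the named pattern, else 0.
sub matches {
    my ($name, $text) = @_;
    return ($text =~ $pattern{$name}) ? 1 : 0;
}

for my $case (['grey_gray', 'gray'], ['q_not_u', 'Iraqi'], ['q_not_u', 'Iraq'],
              ['float', '-3.14'], ['float', '.5']) {
    my ($name, $text) = @$case;
    printf "%-10s %-6s %s\n", $name, $text, matches($name, $text) ? 'match' : 'no match';
}
```

Note that 'Iraq' does not match q[^u] (nothing follows the 'q'), and '.5' does not match the float pattern (no digit before the point).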
Fun with e?grep
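grep vs. egrep |
In plain grep (basic regular expressions) the ( ) metacharacters must be backslash-escaped, and GNU grep also accepts \+, \? and \| as extensions. egrep takes them unescaped. For clarity, the examples here use the egrep form. |
GNU egrep combines a DFA with an NFA engine: it primarily uses the DFA for speed, then makes an NFA pass to handle backreferences. |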
\<([A-Za-z]+) +\1\> |
finds the same word repeated twice in a row. The \1 (known as a backreference) matches the text captured by the first set of parentheses |
\<ray\> |
finds ray when it is a word. Matches 'ray' but not 'array' or 'stray' |
\<ray |
finds ray when it starts a word. Matches 'raymond' and 'ray' but not 'array' |
(an){2,3} |
matches two or three consecutive occurrences of 'an' (e.g. 'banana', 'Montanan') |
gr(e|a)y |
matches grey or gray. The (e|a) demonstrates alternation. NOTE: a bare | does not work in plain grep; use egrep (GNU grep also accepts \| as an extension) |
(([a-z])\2).*\1 |
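finds words in which the same double letter occurs twice (e.g. 'Mississippi', 'uselessness', 'schoolroom'). Group 2 captures one letter, \2 requires it to repeat immediately, and \1 requires that same pair to show up again later |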
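The two backreference patterns above can also be tried from Perl. This is only a sketch: Perl has no \< and \> word anchors, so \b is used instead, and the helper names and sample sentences are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns each word that appears twice in a row.
# \b stands in for egrep's \< and \> word anchors.
sub doubled_words {
    my ($text) = @_;
    my @found;
    while ($text =~ /\b([A-Za-z]+) +\1\b/g) {
        push @found, $1;
    }
    return @found;
}

# Hypothetical helper: 1 if the same double letter occurs twice in $word.
sub has_repeated_double {
    my ($word) = @_;
    return ($word =~ /(([a-z])\2).*\1/) ? 1 : 0;
}

print join(',', doubled_words('this is is a test of the the regex')), "\n";
print has_repeated_double('Mississippi'), has_repeated_double('bookkeeper'), "\n";
```

The leading \b is what stops 'this is' from being reported as a doubled 'is', and the trailing \b does the same for 'the theory'.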
Perl Variations
Basic Syntax
Basic form:
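$stringToProcess =~ (m or s)/regex/Modifiers
- 'm' means match, 's' means substitute (a substitution also takes a replacement: s/regex/replacement/Modifiers). If you omit this character, Perl assumes match.
- The / delimiter can be almost any other character (yes, Perl is weird). I like '!' too. With a delimiter other than /, the leading m can no longer be omitted.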
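A minimal sketch of the basic form, showing the same match written with both delimiters and one substitution. The variable names and the sample string are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'The Grey cat';

# Match: the leading m is optional with / delimiters...
my $hit_slash = ($text =~ /gr[ea]y/i) ? 1 : 0;
# ...but required with any other delimiter, such as !
my $hit_bang  = ($text =~ m!gr[ea]y!i) ? 1 : 0;

# Substitute: s, then the pattern, the replacement, and the modifiers.
# Copy first so the original string is left untouched.
(my $american = $text) =~ s!grey!gray!ig;

print "$hit_slash $hit_bang $american\n";   # prints "1 1 The gray cat"
```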
Useful Perl Quickies
s!<[^>]*>! !g |
replaces <TAGS> with a space |
s![^a-z0-9 ]!!ig |
removes every character that is not a letter, a digit or a space |
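Wrapped in subs, the two quickies look roughly like this. The sub names and sample strings are hypothetical; the substitutions are the ones from the table.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every <TAG> with a single space.
sub strip_tags {
    my ($s) = @_;
    $s =~ s!<[^>]*>! !g;
    return $s;
}

# Hypothetical helper: drop everything except letters, digits and spaces.
sub keep_alnum {
    my ($s) = @_;
    $s =~ s![^a-z0-9 ]!!ig;
    return $s;
}

print strip_tags('<b>bold</b> text'), "\n";
print keep_alnum('Hello, World! 42'), "\n";   # prints "Hello World 42"
```

Since each tag becomes a space, strip_tags can leave leading or doubled spaces behind.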
Disassembling a TWiki Word...
# Make BSFLeaders into BSF Leaders, but don't make
# InterfacingTCL into Interfacing TC L
$name =~ s!([A-Z\s]+)([A-Z][^A-Z\s]+)!$1 $2!g;
# make DogWalkers into Dog Walkers
$name =~ s!([a-z])([A-Z])!$1 $2!g;
# make Lotus123 into Lotus 123
$name =~ s!([A-Z])([^A-Z\s])!$1 $2!gi;
# Make 1999Corvette into 1999 Corvette
$name =~ s!([^A-Z\s])([A-Z])!$1 $2!gi;
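To check the four substitutions together, they can be wrapped in a sub and run over the examples from the comments. The sub name `space_out_wiki_word` is made up; the regexes are unchanged. Note that with the i modifier, [A-Z] matches any letter and [^A-Z\s] matches anything that is neither a letter nor whitespace, which is how the last two steps separate letters from digits.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the four substitutions above.
sub space_out_wiki_word {
    my ($name) = @_;
    $name =~ s!([A-Z\s]+)([A-Z][^A-Z\s]+)!$1 $2!g;   # BSFLeaders -> BSF Leaders
    $name =~ s!([a-z])([A-Z])!$1 $2!g;               # DogWalkers -> Dog Walkers
    $name =~ s!([A-Z])([^A-Z\s])!$1 $2!gi;           # Lotus123 -> Lotus 123
    $name =~ s!([^A-Z\s])([A-Z])!$1 $2!gi;           # 1999Corvette -> 1999 Corvette
    return $name;
}

for my $word (qw(BSFLeaders InterfacingTCL DogWalkers Lotus123 1999Corvette)) {
    print "$word => ", space_out_wiki_word($word), "\n";
}
```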
Perl's Metachars
\t |
tab |
\n |
newline |
\r |
carriage return |
\s ; \S |
whitespace (tab, space, newline, formfeed, etc) ; anything NOT \s |
\w ; \W |
a word character (letter, digit or underscore). \w+ matches a whole word ; anything NOT \w |
\d ; \D |
a digit ; anything not \d |
Modifiers...
i |
ignore case |
x |
allows comments and free spacing (for multi-line, long regex convenience) |
g |
global: apply the substitution to every match, not just the first |
m |
multi-line mode: caret (and $) match at the start (and end) of each logical line, not just of the whole string |
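A short sketch of the modifiers in use. The float pattern is the one from the generic examples, spread out with x; the log text and variable names are made up for illustration.

```perl
use strict;
use warnings;

# x: whitespace and comments inside the regex are ignored.
my $float = qr/
    ^ [-+]?            # optional sign
    [0-9]+             # integer part
    (?: \. [0-9]* )?   # optional fraction
    $                  # end of string
/x;

my $log = "error: disk full\nwarning: low memory\nError: fan failed\n";

# m lets ^ and $ match at every line; i ignores case; g finds all matches.
# In list context a g match returns what the capture group caught each time.
my @errors = $log =~ /^error: (.*)$/mig;

print "float ok\n" if '-2.5' =~ $float;
print join('; ', @errors), "\n";   # prints "disk full; fan failed"
```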
Non-greedy example:
This finds HTML tag pairs and replaces them with a big ugly 'found' tag (e.g. <B>this</B> becomes <-FOUND was (B)->this<-/FOUND (was /B)->).
NOTE: this does not work on nested tags. For example, with <B>this <I>that</I> </B>, only the outer <B> tags get swapped and the inner ones get skipped. The only way I know to get around this with regexes is to replace with something that won't match if done again, then loop until nothing changes.
s!<(.*?)>(.*?)</\1>!<-FOUND was ($1)->$2<-/FOUND (was /$1)->!ig
If you don't want to use non-greedy quantifiers (or have to adapt this to something that doesn't support them), use negated character classes instead. The tag contents need their own set of parentheses so the replacement can refer to them:
s!<([^>]+)>([^<>]*)</\1>!<-FOUND was ($1)->$2<-/FOUND (was /$1)->!ig
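The workaround from the note on nested tags (replace with something that won't match again, then loop) might look like the sketch below. It is an illustration under assumptions, not a general HTML parser: the sub name is made up, the markers use parentheses instead of angle brackets so that already-processed text can sit inside an outer tag's contents, and tags with attributes are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: mark tag pairs innermost-first, repeating until
# no substitution happens. The contents class [^<>]* only matches once
# every inner pair has already been rewritten into (FOUND ...) markers.
sub mark_tag_pairs {
    my ($html) = @_;
    1 while $html =~ s!<([^<>/\s][^<>]*)>([^<>]*)</\1>!(FOUND $1)$2(/FOUND $1)!ig;
    return $html;
}

print mark_tag_pairs('<B>this <I>that</I> </B>'), "\n";
# prints "(FOUND B)this (FOUND I)that(/FOUND I) (/FOUND B)"
```

The first pass rewrites only the inner <I> pair; the second pass can then match the outer <B> pair because its contents no longer hold any angle brackets.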
Validating input: detecting floating point temperatures
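print "Enter a temperature (e.g. 32F or 100C)\n";
$input = <STDIN>;
chomp($input);   # strip the trailing newline (chop would remove any last character)
# !~ means NOT =~
# the /i means ignore case
# \d+ requires at least one digit, so a bare "C" is rejected
if ( $input !~ m/^([-+]?\d+(\.\d*)?)\s*([CF])$/i ) {
    print "validation failure\n";
} else {
    # $1 is the number, $3 the scale letter ($2 is the optional fraction)
    print "Got input value $1, type = $3\n";
}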
Substitution
- $string =~ s!\bth!sh!ig -- substitutes 'sh' for 'th' when it starts a word. The ig modifiers mean substitute globally and case-insensitively.
e.g.: This is the thirsty three thirty fifth becomes shis is she shirsty shree shirty fifth
- $string =~ s!\b(some)\w*!$1-!g -- Here we see that you can use backreferences in the replacement too. This useless example turns any word that starts with some into some-. Slick IMHO.
e.g.: some someone something lonesomeone becomes some- some- some- lonesomeone
- $line =~ s!^!|>!; -- prepends |> to $line
- $line =~ s!\bwow\b!$&!g -- $& holds the text that was matched, so you can wrap the match in the replacement. As written the replacement is only $&, which leaves the text unchanged; put the wrapping text on either side of it.
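The four substitutions above can be tried with a sketch like this. The sub names are hypothetical, and the asterisks wrapped around $& in the last one are an assumption chosen just to make the effect visible.

```perl
use strict;
use warnings;

# Hypothetical helpers, one per substitution in the list above.
sub lisp           { my ($s) = @_; $s =~ s!\bth!sh!ig;        return $s; }
sub hyphenate_some { my ($s) = @_; $s =~ s!\b(some)\w*!$1-!g; return $s; }
sub quote_line     { my ($s) = @_; $s =~ s!^!|>!;             return $s; }
# The asterisks are an assumed wrapper; $& is the matched text.
sub emphasize_wow  { my ($s) = @_; $s =~ s!\bwow\b!*$&*!g;    return $s; }

print lisp('This is the thirsty three thirty fifth'), "\n";
print hyphenate_some('some someone something lonesomeone'), "\n";
print quote_line('quoted text'), "\n";
print emphasize_wow('wow, bow wow'), "\n";   # prints "*wow*, bow *wow*"
```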
Profound thoughts
- Quantified items (e.g., using +, *, {m,n}) match as much as they can without making some later part of the regex fail. We therefore call them greedy.
- IDEA: since these regexes are all equivalent: to(ni(ght|te)|knight), tonite|toknight|tonight, to(k?night|nite), could the regex interpreter auto-generate the optimal expression from one of these for speed?
- DFA (Deterministic Finite Automaton) engines are fast and consistent, but they do no backtracking (and, as noted for GNU egrep above, need an NFA pass for backreferences).
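Two of the points above can be checked with a small sketch: greedy versus non-greedy capture, and the claim that the three 'tonight' regexes accept the same words. The variable names, the `verdicts` helper and the sample strings are made up; the ^...$ anchors (and the (?: ) group needed to anchor the plain alternation) are my additions so that whole words are compared.

```perl
use strict;
use warnings;

my $html = '<B>bold</B> and <I>italic</I>';

# Greedy .+ grabs as much as it can while still letting the final > match.
my ($greedy) = $html =~ /<(.+)>/;
# Non-greedy .+? stops at the first > that lets the match succeed.
my ($lazy)   = $html =~ /<(.+?)>/;

print "greedy: $greedy\n";   # prints "greedy: B>bold</B> and <I>italic</I"
print "lazy:   $lazy\n";     # prints "lazy:   B"

# The three equivalent regexes from the list, anchored to whole words.
my @regexes = (
    qr/^to(ni(ght|te)|knight)$/,
    qr/^(?:tonite|toknight|tonight)$/,
    qr/^to(k?night|nite)$/,
);

# Hypothetical helper: one digit per regex, 1 for a match and 0 otherwise.
sub verdicts {
    my ($word) = @_;
    my $bits = '';
    for my $re (@regexes) {
        $bits .= ($word =~ $re) ? 1 : 0;
    }
    return $bits;
}

print "$_: ", verdicts($_), "\n" for qw(tonight tonite toknight toknite);
```

All three regexes agree on every word tried here, which is consistent with them being equivalent; a handful of samples is of course not a proof.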
-- MattWalsh - 15 Aug 2002