Regular expressions are objects of type Regexp.
create
a = Regexp.new('^\s*[a-z]') → /^\s*[a-z]/
b = /^\s*[a-z]/ → /^\s*[a-z]/
c = %r{^\s*[a-z]} → /^\s*[a-z]/
options:
/i case insensitive
/o only interpolate #{} blocks once
/m multiline mode: '.' will match newline
/x extended mode: whitespace in the pattern is ignored
/[neus] encoding: none, EUC, UTF-8, SJIS, respectively
e.g. b = /^\s*[a-z]/i
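The following is a minimal sketch of the three creation forms and of the /i, /m, and /x options. The constant name TIME_PATTERN and the sample strings are illustrative only, and String#match? assumes Ruby 2.4 or later.

```ruby
# Three equivalent ways to build the same pattern.
literal     = /^\s*[a-z]/
constructed = Regexp.new('^\s*[a-z]')   # single quotes keep the backslash
percent_r   = %r{^\s*[a-z]}
p(literal == constructed)               # true
p(constructed == percent_r)             # true

# /i ignores case.
p("  Hello".match?(/^\s*[a-z]/))        # false: "H" is uppercase
p("  Hello".match?(/^\s*[a-z]/i))       # true

# /m lets "." match a newline.
p("a\nb".match?(/a.b/))                 # false
p("a\nb".match?(/a.b/m))                # true

# /x ignores whitespace and allows comments inside the pattern.
TIME_PATTERN = /
  (\d\d) :   # hour
  (\d\d)     # minute
/x
p("12:50" =~ TIME_PATTERN)              # 0
```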
match operators
=~ (positive match)
!~ (negative match)
name = "Fats Waller"
name =~ /a/ → 1
name =~ /z/ → nil
/a/ =~ name → 1
=~ returns the character position at which the match occurred, or nil if there is no match; !~ returns true or false.
$& receives the part of the string that was matched by the pattern,
$` receives the part of the string that preceded the match,
$' receives the string after the match.
The match also sets the variables $~ and $1 through $9, which are local to the current thread and method.
$~ is a MatchData object
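As a small sketch of how these variables fit together, the hypothetical helper match_parts (not part of Ruby) collects them into a hash after a successful =~.

```ruby
# Hypothetical helper: returns the pieces of a match as a hash, or nil.
def match_parts(string, pattern)
  return nil unless string =~ pattern
  # $~ is the MatchData object behind the other special variables.
  { before: $`, matched: $&, after: $', position: $~.begin(0) }
end

parts = match_parts("Fats Waller", /ll/)
puts parts[:before]     # Fats Wa
puts parts[:matched]    # ll
puts parts[:after]      # er
puts parts[:position]   # 7

p(match_parts("Fats Waller", /z/))   # nil
p("Fats Waller" !~ /z/)              # true
```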
To illustrate how matching works, define a method:
def show_regexp(a, re)
  if a =~ re
    "#{$`}<<#{$&}>>#{$'}"
  else
    "no match"
  end
end
show_regexp('very interesting', /t/) → very in<<t>>eresting
show_regexp('Fats Waller', /a/) → F<<a>>ts Waller
show_regexp('Fats Waller', /ll/) → Fats Wa<<ll>>er
show_regexp('Fats Waller', /z/) → no match
Patterns
All characters except ., |, (, ), [, ], {, }, +, \, ^, $, *, and ? match themselves.
Use '\' before any of these characters to match it literally.
A regular expression may contain #{…} expression substitutions.
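A short sketch of escaping and interpolation: Regexp.escape adds the backslashes for you, which is convenient when the text to match comes from a variable. The helper name literal_pattern and the sample strings are made up for illustration.

```ruby
# Hypothetical helper: build a pattern that matches `text` literally.
def literal_pattern(text)
  /#{Regexp.escape(text)}/
end

puts Regexp.escape('$12.')                      # \$12\.
p('It costs $12.' =~ literal_pattern('$12.'))   # 9
p('abc' =~ literal_pattern('a.c'))              # nil: the "." is literal
p('abc' =~ /a.c/)                               # 0: the "." matches "b"

# Plain interpolation inserts the value as pattern source.
currency = 'USD'
p('12 USD' =~ /\d+ #{currency}/)                # 0
```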
Anchors
By default, a regular expression will try to find the first match for the pattern in a string.
The patterns ^ and $ match the beginning and end of a line.
\A matches the beginning of a string.
\z and \Z match the end of a string. (Actually, \Z matches the end of a string unless the string ends with a \n, in which case it matches just before the \n.)
show_regexp("this is\nthe time", /^the/) → this is\n<<the>> time
show_regexp("this is\nthe time", /is$/) → this <<is>>\nthe time
show_regexp("this is\nthe time", /\Athis/) → <<this>> is\nthe time
show_regexp("this is\nthe time", /\Athe/) → no match
\b and \B match word boundaries and nonword boundaries.
Word characters are letters, numbers, and underscores.
show_regexp("this is\nthe time", /\bis/) → this <<is>>\nthe time
show_regexp("this is\nthe time", /\Bis/) → th<<is>> is\nthe time
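The difference between line anchors and string anchors matters most when checking a whole input. A minimal sketch, assuming Ruby 2.4+ for String#match?; the method name all_digits? is hypothetical.

```ruby
text = "this is\nthe time\n"
p(text =~ /^the/)     # 8   (start of the second line)
p(text =~ /\Athe/)    # nil (not at the start of the string)
p(text =~ /time$/)    # 12  (end of a line)
p(text =~ /time\z/)   # nil (the string ends with "\n")
p(text =~ /time\Z/)   # 12  (\Z ignores one trailing "\n")

# Hypothetical validator: anchoring with \A and \z means a match must
# cover the whole string rather than just one line of it.
def all_digits?(input)
  input.match?(/\A\d+\z/)
end

p(all_digits?("123"))            # true
p(all_digits?("123\nabc"))       # false
p("123\nabc".match?(/^\d+$/))    # true: only the first line is checked
```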
Character Classes
[aeiou] will match a vowel
[,.:;!?] matches punctuation
show_regexp('Price $12.', /[aeiou]/) → Pr<<i>>ce $12.
show_regexp('Price $12.', /[\s]/) → Price<< >>$12.
show_regexp('Price $12.', /[[:digit:]]/) → Price $<<1>>2.
show_regexp('Price $12.', /[[:space:]]/) → Price<< >>$12.
show_regexp('Price $12.', /[[:punct:]aeiou]/) → Pr<<i>>ce $12.
POSIX Character Classes
[:alnum:] Alphanumeric
[:alpha:] Uppercase or lowercase letter
[:blank:] Space or tab
[:cntrl:] Control characters (at least 0x00-0x1f, 0x7f)
[:digit:] Digit
[:graph:] Printable character excluding space
[:lower:] Lowercase letter
[:print:] Any printable character (including space)
[:punct:] Printable character excluding space and alphanumeric
[:space:] Whitespace (same as \s)
[:upper:] Uppercase letter
[:xdigit:] Hex digit (0-9, a-f, A-F)
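A brief sketch of the POSIX classes in use. Note that each one must itself sit inside a bracket expression, hence the doubled brackets. String#scan (a standard String method) is used here only to list every match; the sample strings are illustrative.

```ruby
price = 'Price $12.'
p(price.scan(/[[:digit:]]/))              # ["1", "2"]
p(price.scan(/[[:upper:]]/))              # ["P"]
p(price.scan(/[[:space:]]/))              # [" "]
p("a\tb c".scan(/[[:blank:]]/))           # ["\t", " "]
p('0x1F'.scan(/[[:xdigit:]]/))            # ["0", "1", "F"]

# Classes can be combined and negated like any other bracket contents.
p(price.scan(/[^[:alnum:][:space:]]/))    # ["$", "."]
```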
The sequence c1-c2 represents all the characters between c1 and c2, inclusive.
a = 'see [Design Patterns-page 123]'
show_regexp(a, /[A-F]/) → see [<<D>>esign Patterns-page 123]
show_regexp(a, /[A-Fa-f]/) → s<<e>>e [Design Patterns-page 123]
show_regexp(a, /[0-9]/) → see [Design Patterns-page <<1>>23]
show_regexp(a, /[0-9][0-9]/) → see [Design Patterns-page <<12>>3]
If you want to include the literal characters ] and - within a character class, they must appear at the start (or be escaped with a backslash).
Put a ^ immediately after the opening bracket to negate a character class.
a = 'see [Design Patterns-page 123]'
show_regexp(a, /[]]/) → see [Design Patterns-page 123<<]>>
show_regexp(a, /[-]/) → see [Design Patterns<<->>page 123]
show_regexp(a, /[^a-z]/) → see<< >>[Design Patterns-page 123]
show_regexp(a, /[^a-z\s]/) → see <<[>>Design Patterns-page 123]
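A quick sketch of ranges, negation, and literal brackets, using String#scan to list every match. Escaping ] and - with a backslash is shown as an alternative to placing them first; it works wherever they appear in the class.

```ruby
a = 'see [Design Patterns-page 123]'
p(a.scan(/[0-9]/))          # ["1", "2", "3"]
p(a.scan(/[A-Z]/))          # ["D", "P"]
p(a.scan(/[^a-zA-Z\s]/))    # ["[", "-", "1", "2", "3", "]"]
p(a.scan(/[\]\-]/))         # ["-", "]"]
```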
Table 5.1. Character class abbreviations
Sequence   As [ . . . ]     Meaning
\d         [0-9]            Digit character
\D         [^0-9]           Any character except a digit
\s         [ \t\r\n\f]      Whitespace character
\S         [^ \t\r\n\f]     Any character except whitespace
\w         [A-Za-z0-9_]     Word character
\W         [^A-Za-z0-9_]    Any character except a word character
show_regexp('It costs $12.', /\s/) → It<< >>costs $12.
show_regexp('It costs $12.', /\d/) → It costs $<<1>>2.
A period ( . ) appearing outside brackets represents any character except a newline.
a = 'It costs $12.'
show_regexp(a, /c.s/) → It <<cos>>ts $12.
show_regexp(a, /./) → <<I>>t costs $12.
show_regexp(a, /\./) → It costs $12<<.>>
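The abbreviations and the period can be tried out directly; this sketch again uses String#scan to collect all matches and the /m option to show the newline exception.

```ruby
a = 'It costs $12.'
p(a.scan(/\d/))        # ["1", "2"]
p(a.scan(/\w+/))       # ["It", "costs", "12"]
p(a.scan(/\S+/))       # ["It", "costs", "$12."]
p(a =~ /\./)           # 12: only the literal period matches

p("a\nb" =~ /a.b/)     # nil: "." does not match the newline
p("a\nb" =~ /a.b/m)    # 0:   with /m it does
```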
Repetition * + ? {m,n}
r* matches zero or more occurrences of r.
r+ matches one or more occurrences of r.
r? matches zero or one occurrence of r.
r{m,n} matches at least m and at most n occurrences of r.
r{m,} matches at least m occurrences of r.
r{m} matches exactly m occurrences of r.
r*? matches zero or more occurrences of r (non-greedy).
r+? matches one or more occurrences of r (non-greedy).
a = "The moon is made of cheese"
show_regexp(a, /\w+/) → <<The>> moon is made of cheese
show_regexp(a, /\s.*\s/) → The<< moon is made of >>cheese
show_regexp(a, /\s.*?\s/) → The<< moon >>is made of cheese
show_regexp(a, /[aeiou]{2,99}/) → The m<<oo>>n is made of cheese
show_regexp(a, /mo?o/) → The <<moo>>n is made of cheese
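Greedy versus non-greedy repetition is easiest to see on delimited text. A small sketch with made-up sample data; the method name iso_date? is hypothetical and only illustrates counted repetition.

```ruby
html = '<b>bold</b> and <i>italic</i>'
p(html[/<.+>/])         # "<b>bold</b> and <i>italic</i>" (greedy)
p(html[/<.+?>/])        # "<b>"                           (non-greedy)
p(html.scan(/<.+?>/))   # ["<b>", "</b>", "<i>", "</i>"]

# Hypothetical check built from {m} counts.
def iso_date?(text)
  text.match?(/\A\d{4}-\d{2}-\d{2}\z/)
end

p(iso_date?('2024-01-15'))   # true
p(iso_date?('24-1-15'))      # false
```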
Alternation |
a = "red ball blue sky"
show_regexp(a, /d|e/) → r<<e>>d ball blue sky
show_regexp(a, /al|lu/) → red b<<al>>l blue sky
show_regexp(a, /red ball|angry sky/) → <<red ball>> blue sky
Grouping ()
Everything within the group is treated as a single regular expression.
show_regexp('banana', /an*/) → b<<an>>ana
show_regexp('banana', /(an)*/) → <<>>banana
show_regexp('banana', /(an)+/) → b<<anan>>a
a = 'red ball blue sky'
show_regexp(a, /blue|red/) → <<red>> ball blue sky
show_regexp(a, /(blue|red) \w+/) → <<red ball>> blue sky
show_regexp(a, /(red|blue) \w+/) → <<red ball>> blue sky
show_regexp(a, /red|blue \w+/) → <<red>> ball blue sky
show_regexp(a, /red (ball|angry) sky/) → no match
a = 'the red angry sky'
show_regexp(a, /red (ball|angry) sky/) → the <<red angry sky>>
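Because | has very low precedence, anchors and neighbouring text attach to only one alternative unless the alternatives are grouped. A short sketch with illustrative strings:

```ruby
# Without a group, the pattern reads as "^cat" OR "dog$".
p('catalog' =~ /^cat|dog$/)      # 0   ("^cat" alone matches)
# With a group, both anchors apply to each alternative.
p('catalog' =~ /^(cat|dog)$/)    # nil
p('dog'     =~ /^(cat|dog)$/)    # 0

# A repetition applies to the whole group, not just the last character.
p('banana'[/an+/])       # "an"
p('banana'[/(an)+/])     # "anan"
```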
Within the pattern, the sequence \1 refers to the match of the first group, \2 the second group, and so on. Outside the pattern, the special variables $1, $2, and so on, serve the same purpose.
"12:50am" =~ /(\d\d):(\d\d)(..)/ → 0
"Hour is #$1, minute #$2" → "Hour is 12, minute 50"
"12:50am" =~ /((\d\d):(\d\d))(..)/ → 0
"Time is #$1" → "Time is 12:50"
"Hour is #$2, minute #$3" → "Hour is 12, minute 50"
"AM/PM is #$4" → "AM/PM is am"
Backreferences can also be used to look for various forms of repetition.
# match duplicated letter
show_regexp('He said "Hello"', /(\w)\1/) → He said "He<<ll>>o"
# match duplicated substrings
show_regexp('Mississippi', /(\w+)\1/) → M<<ississ>>ippi
# match delimiters
show_regexp('He said "Hello"', /(["']).*?\1/) → He said <<"Hello">>
show_regexp("He said 'Hello'", /(["']).*?\1/) → He said <<'Hello'>>
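The same ideas can be wrapped in small methods. In this sketch the helper names repeated_word and quoted are hypothetical; they combine \1 inside the pattern with $1 and $2 outside it.

```ruby
# Hypothetical helper: find a word that is immediately repeated.
def repeated_word(text)
  if text =~ /\b(\w+) \1\b/
    $1   # the word captured by the first group
  end
end

p(repeated_word('this is is a test'))   # "is"
p(repeated_word('no repeats here'))     # nil

# Hypothetical helper: return the contents of the first quoted phrase.
# \1 forces the closing quote to be the same character as the opening one.
def quoted(text)
  if text =~ /(["'])(.*?)\1/
    $2
  end
end

p(quoted(%q{He said "Hello"}))      # "Hello"
p(quoted(%q{He said 'Hello'}))      # "Hello"
p(quoted(%q{mismatched "quote'}))   # nil
```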
Pattern-Based Substitution
String#sub performs one replacement
String#gsub replaces every occurrence of the match
a = "the quick brown fox"
a.sub(/[aeiou]/, '*') → "th* quick brown fox"
a.gsub(/[aeiou]/, '*') → "th* q**ck br*wn f*x"
a.sub(/\s\S+/, '') → "the brown fox"
a.gsub(/\s\S+/, '') → "the"
With a block, each match is passed to the block and replaced by the block's return value:
a = "the quick brown fox"
a.sub(/^./) {|match| match.upcase } → "The quick brown fox"
a.gsub(/[aeiou]/) {|vowel| vowel.upcase } → "thE qUIck brOwn fOx"
def mixed_case(name)
  name.gsub(/\b\w/) {|first| first.upcase }
end
mixed_case("fats waller") → "Fats Waller"
mixed_case("louis armstrong") → "Louis Armstrong"
mixed_case("strength in numbers") → "Strength In Numbers"
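Two further points are worth a sketch: inside a block the group variables such as $1 are available, and sub/gsub return a new string while the sub!/gsub! forms modify the receiver and return nil when nothing was replaced. The method name camelize is hypothetical.

```ruby
# Hypothetical helper: turn snake_case into camelCase using $1 in the block.
def camelize(name)
  name.gsub(/_(\w)/) { $1.upcase }
end

p(camelize('pattern_based_substitution'))   # "patternBasedSubstitution"

title = 'the quick brown fox'.dup   # dup gives a string that is safe to modify
p(title.sub(/quick/, 'slow'))       # "the slow brown fox"
p(title)                            # "the quick brown fox" (unchanged)
p(title.gsub!(/z/, '*'))            # nil: nothing was replaced
p(title.gsub!(/o/, '0'))            # "the quick br0wn f0x"
p(title)                            # "the quick br0wn f0x" (modified in place)
```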
Backslash Sequences in the Substitution
"fred:smith".sub(/(\w+):(\w+)/, '\2, \1') → "smith, fred"
"nercpyitno".gsub(/(.)(.)/, '\2\1') → "encryption"
The substitution string also understands:
\& (last match),
\+ (last matched group),
\` (string prior to match),
\' (string after match),
\\ (a literal backslash)
str = 'a\b\c' → "a\b\c"
str.gsub(/\\/, '\\\\\\\\') → "a\\b\\c"
or
str = 'a\b\c' → "a\b\c"
str.gsub(/\\/, '\&\&') → "a\\b\\c"
or
str = 'a\b\c' → "a\b\c"
str.gsub(/\\/) { '\\\\' } → "a\\b\\c"
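A compact sketch comparing the ways to refer to groups in a replacement. In a double-quoted string the backslash must itself be escaped, and the block form sidesteps backslash sequences entirely.

```ruby
pattern = /(\w+):(\w+)/
p('fred:smith'.sub(pattern, '\2, \1'))            # "smith, fred"
p('fred:smith'.sub(pattern, "\\2, \\1"))          # "smith, fred"
p('fred:smith'.sub(pattern) { "#{$2}, #{$1}" })   # "smith, fred"

# \& stands for the whole match.
p('fred'.gsub(/[aeiou]/, '<\&>'))                 # "fr<e>d"

# A block's return value is inserted as-is, with no backslash processing.
puts 'a\b\c'.gsub(/\\/) { '/' }                   # a/b/c
```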
example:
n modifier (Japanese)
def unescapeHTML(string)
  str = string.dup
  str.gsub!(/&(.*?);/n) {
    match = $1.dup
    case match
    when /\Aamp\z/ni then '&'
    when /\Aquot\z/ni then '"'
    when /\Agt\z/ni then '>'
    when /\Alt\z/ni then '<'
    when /\A#(\d+)\z/n then Integer($1).chr
    when /\A#x([0-9a-f]+)\z/ni then $1.hex.chr
    end
  }
  str
end
puts unescapeHTML("1&lt;2 &amp;&amp; 4&gt;3")
puts unescapeHTML("&quot;A&quot; = &#65; = &#x41;")
produces:
1<2 && 4>3
"A" = A = A
Object-Oriented Regular Expressions
re = /(\d+):(\d+)/ # match a time hh:mm
md = re.match("Time: 12:34am")
md.class → MatchData
md[0] # == $& → "12:34"
md[1] # == $1 → "12"
md[2] # == $2 → "34"
md.pre_match # == $` → "Time: "
md.post_match # == $' → "am"
re = /(\d+):(\d+)/ # match a time hh:mm
md1 = re.match("Time: 12:34am")
md2 = re.match("Time: 10:30pm")
md1[1, 2] → ["12", "34"]
md2[1, 2] → ["10", "30"]
re = /(\d+):(\d+)/
md1 = re.match("Time: 12:34am")
md2 = re.match("Time: 10:30pm")
[ $1, $2 ] # last successful match → ["10", "30"]
$~ = md1
[ $1, $2 ] # previous successful match → ["12", "34"]
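Since Regexp#match returns nil when nothing matches, code that uses MatchData usually checks for that first. A minimal sketch; the method name parse_time and its hash keys are hypothetical.

```ruby
# Hypothetical helper: pull the numeric parts out of an hh:mm time.
def parse_time(text)
  md = /(\d+):(\d+)/.match(text)
  return nil unless md   # match returns nil on failure
  { hour: md[1].to_i, minute: md[2].to_i, suffix: md.post_match }
end

time = parse_time("Time: 12:34am")
p(time[:hour])                   # 12
p(time[:minute])                 # 34
p(time[:suffix])                 # "am"
p(parse_time("no time here"))    # nil

md = /(\d+):(\d+)/.match("Time: 10:30pm")
p(md.captures)             # ["10", "30"]
p(md.begin(0))             # 6
p(Regexp.last_match(1))    # "10" (same as $1)
```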
Regex Characters List:
. any character except newline
[ ] any single character of set
[^ ] any single character NOT of set
* 0 or more of the previous regular expression
*? 0 or more of the previous regular expression (non-greedy)
+ 1 or more of the previous regular expression
+? 1 or more of the previous regular expression (non-greedy)
? 0 or 1 of the previous regular expression
| alternation
( ) grouping regular expressions
^ beginning of a line or string
$ end of a line or string
{m,n} at least m but at most n of the previous regular expression
{m,n}? at least m but at most n of the previous regular expression (non-greedy)
\A beginning of a string
\b backspace (0x08) (inside [] only)
\b word boundary (outside [] only)
\B non-word boundary
\d digit, same as [0-9]
\D non-digit
\S non-whitespace character
\s whitespace character [ \t\n\r\f]
\W non-word character
\w word character [0-9A-Za-z_]
\z end of a string
\Z end of a string, or before newline at the end
(?# ) comment
(?: ) grouping without backreferences
(?= ) zero-width positive look-ahead assertion
(?! ) zero-width negative look-ahead assertion
(?ix-ix) turns on/off i/x options, localized in group if any.
(?ix-ix: ) turns on/off i/x options, localized in non-capturing group.
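The extended (?...) constructs at the end of this list are not demonstrated elsewhere in these notes, so here is a short sketch of each; the sample strings are made up for illustration.

```ruby
prices = 'price: 100 USD, 200 EUR'
# Look-ahead checks what follows without including it in the match.
p(prices.scan(/\d+(?= USD)/))        # ["100"]
p(prices.scan(/\b\d+\b(?! USD)/))    # ["200"]

sky = 'red ball blue sky'
# With a capturing group, scan returns the captures...
p(sky.scan(/(red|blue) \w+/))        # [["red"], ["blue"]]
# ...with (?: ) it returns the whole matches.
p(sky.scan(/(?:red|blue) \w+/))      # ["red ball", "blue sky"]

# Options can be switched on for just part of a pattern.
p('FATS Waller' =~ /(?i:fats) Waller/)   # 0
p('fats waller' =~ /(?i:fats) Waller/)   # nil

# (?# ) embeds a comment that the matcher ignores.
p('a1' =~ /a(?# then a digit)\d/)        # 0
```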
Special Character Classes:
[:alnum:] alpha-numeric characters
[:alpha:] alphabetic characters
[:blank:] space or tab only
[:cntrl:] control characters
[:digit:] decimal digits
[:graph:] printable characters, excluding space
[:lower:] lower case characters
[:print:] printable characters, including space
[:punct:] punctuation characters
[:space:] whitespace, including tabs, carriage returns, etc
[:upper:] upper case characters
[:xdigit:] hexadecimal digits