Regular expressions are objects of type Regexp.
Creating a regular expression
a = Regexp.new('^\s*[a-z]')   → /^\s*[a-z]/
b = /^\s*[a-z]/               → /^\s*[a-z]/
c = %r{^\s*[a-z]}             → /^\s*[a-z]/
Options
/i        case insensitive
/o        interpolate #{} blocks only once
/m        multiline mode: '.' also matches a newline
/x        extended mode: whitespace in the pattern is ignored and comments are allowed
/[neus]   encoding: none, EUC, UTF-8, SJIS, respectively
e.g. b = /^\s*[a-z]/i
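The /m and /x options are easier to see in action. The following is a minimal sketch; the helper name option_demo and the sample text are made up for illustration and are not part of the original notes.

```ruby
# Hypothetical helper: reports whether each pattern matches the sample text.
def option_demo
  text = "Name: Fats\nwaller"

  {
    # /i ignores case, so FATS matches "Fats".
    case_insensitive: !!(text =~ /FATS/i),
    # Without /m, "." refuses to match the newline between the two words.
    dot_without_m: !!(text =~ /Fats.waller/),
    # With /m, "." matches the newline as well.
    dot_with_m: !!(text =~ /Fats.waller/m),
    # /x ignores whitespace in the pattern and allows comments.
    extended: !!(text =~ /
      Name: \s*     # label followed by optional whitespace
      [A-Z][a-z]+   # a capitalized word
    /x)
  }
end

p option_demo
```

Only the pattern that relies on '.' matching a newline without /m should report false.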
Match operators
=~ (positive match)
!~ (negative match)
name = "Fats Waller"
name =~ /a/   → 1
name =~ /z/   → nil
/a/ =~ name   → 1
=~ returns the character position at which the match occurred, or nil if there is no match.
A successful match also sets several special variables:
$& receives the part of the string that was matched by the pattern,
$` receives the part of the string that preceded the match,
$' receives the part of the string after the match.
The match also sets $~ and $1 through $9, which are local to the current thread and method.
$~ is a MatchData object.
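As a rough sketch of how these variables relate to each other, the method below collects them after a match and checks them against the MatchData object in $~. The method name match_variables is an assumption made for this example.

```ruby
# Hypothetical helper: gathers the special variables set by a successful =~.
def match_variables(str, re)
  return nil unless str =~ re

  md = $~                     # MatchData for the match just made
  {
    position: md.begin(0),    # same position that =~ returned
    matched: $&,
    before: $`,
    after: $',
    # The $ variables are views onto the MatchData object in $~.
    same_as_matchdata: [$&, $`, $'] == [md[0], md.pre_match, md.post_match]
  }
end

p match_variables("Fats Waller", /a/)
p match_variables("Fats Waller", /z/)
```

The second call returns nil because the pattern does not match.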
To illustrate how matching works, define a method:
def show_regexp(a, re)
  if a =~ re
    "#{$`}<<#{$&}>>#{$'}"
  else
    "no match"
  end
end
show_regexp('very interesting', /t/)   → very in<<t>>eresting
show_regexp('Fats Waller', /a/)        → F<<a>>ts Waller
show_regexp('Fats Waller', /ll/)       → Fats Wa<<ll>>er
show_regexp('Fats Waller', /z/)        → no match
Patterns
All characters except ., |, (, ), [, ], {, }, +, \, ^, $, *, and ? match themselves.
Put a backslash (\) in front of any of these characters to match it literally.
A regular expression may contain #{...} expression substitutions.
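Substitution and escaping interact: text substituted into a pattern keeps its special meaning unless it is escaped first. The sketch below uses the standard Regexp.escape for this; the two helper names and the sample line are invented for illustration.

```ruby
# Hypothetical helpers for illustration; the names are not from the notes.

# Substitutes the text as-is, so special characters keep their regexp meaning.
def naive_price_pattern(amount)
  /price: #{amount}/
end

# Regexp.escape puts a backslash in front of special characters
# so that they match themselves.
def literal_price_pattern(amount)
  /price: #{Regexp.escape(amount)}/
end

line = 'price: $12.50'
p line =~ naive_price_pattern('$12.50')     # → nil ("$" acts as an end-of-line anchor)
p line =~ literal_price_pattern('$12.50')   # → 0   ("$" and "." match themselves)
```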
Anchors
By default, a regular expression will try to find the first match for the pattern in a string.
The patterns ^ and $ match the beginning and end of a line.
\A matches the beginning of a string.
\z and \Z match the end of a string. (Actually, \Z matches the end of a string unless the string ends with a \n, in which case it matches just before the \n.)
show_regexp("this is\nthe time", /^the/)     → this is\n<<the>> time
show_regexp("this is\nthe time", /is$/)      → this <<is>>\nthe time
show_regexp("this is\nthe time", /\Athis/)   → <<this>> is\nthe time
show_regexp("this is\nthe time", /\Athe/)    → no match
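The two end-of-string anchors only differ when the string ends with a newline. A small sketch of that difference follows; end_anchor_demo is a made-up name for this example.

```ruby
# Hypothetical helper: compares the end-of-string anchors on a string
# that ends with a newline.
def end_anchor_demo
  line = "the time\n"
  {
    upper_z: line =~ /time\Z/,            # \Z may match just before a trailing newline
    lower_z: line =~ /time\z/,            # \z insists on the very end of the string
    lower_z_with_newline: line =~ /time\n\z/
  }
end

p end_anchor_demo
```

Here \Z finds "time" at position 4, while \z only matches once the pattern accounts for the newline itself.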
\b and \B match word boundaries and nonword boundaries, respectively.
Word characters are letters, numbers, and underscores.
show_regexp("this is\nthe time", /\bis/)   → this <<is>>\nthe time
show_regexp("this is\nthe time", /\Bis/)   → th<<is>> is\nthe time
Character Classes
A character class is a set of characters between brackets; it matches any single character in the set.
[aeiou] matches a vowel
[,.:;!?] matches punctuation
show_regexp('Price $12.', /[aeiou]/)            → Pr<<i>>ce $12.
show_regexp('Price $12.', /[\s]/)               → Price<< >>$12.
show_regexp('Price $12.', /[[:digit:]]/)        → Price $<<1>>2.
show_regexp('Price $12.', /[[:space:]]/)        → Price<< >>$12.
show_regexp('Price $12.', /[[:punct:]aeiou]/)   → Pr<<i>>ce $12.
POSIX Character Classes
[:alnum:]    Alphanumeric
[:alpha:]    Uppercase or lowercase letter
[:blank:]    Space and tab
[:cntrl:]    Control characters (at least 0x00–0x1f, 0x7f)
[:digit:]    Digit
[:graph:]    Printable character excluding space
[:lower:]    Lowercase letter
[:print:]    Any printable character (including space)
[:punct:]    Printable character excluding space and alphanumeric
[:space:]    Whitespace (same as \s)
[:upper:]    Uppercase letter
[:xdigit:]   Hex digit (0–9, a–f, A–F)
Within brackets, the sequence c1-c2 represents all the characters between c1 and c2, inclusive.
a = 'see [Design Patterns-page 123]'
show_regexp(a, /[A-F]/)        → see [<<D>>esign Patterns-page 123]
show_regexp(a, /[A-Fa-f]/)     → s<<e>>e [Design Patterns-page 123]
show_regexp(a, /[0-9]/)        → see [Design Patterns-page <<1>>23]
show_regexp(a, /[0-9][0-9]/)   → see [Design Patterns-page <<12>>3]
If you want to include the literal characters ] and - within a character class, they must appear at the start.
Put a ^ immediately after the opening bracket to negate a character class.
a = 'see [Design Patterns-page 123]'
show_regexp(a, /[]]/)        → see [Design Patterns-page 123<<]>>
show_regexp(a, /[-]/)        → see [Design Patterns<<->>page 123]
show_regexp(a, /[^a-z]/)     → see<< >>[Design Patterns-page 123]
show_regexp(a, /[^a-z\s]/)   → see <<[>>Design Patterns-page 123]
Table 5.1. Character class abbreviations
Sequence   As [ . . . ]      Meaning
\d         [0-9]             Digit character
\D         [^0-9]            Any character except a digit
\s         [ \t\r\n\f]       Whitespace character
\S         [^ \t\r\n\f]      Any character except whitespace
\w         [A-Za-z0-9_]      Word character
\W         [^A-Za-z0-9_]     Any character except a word character
show_regexp('It costs $12.', /\s/)   → It<< >>costs $12.
show_regexp('It costs $12.', /\d/)   → It costs $<<1>>2.
A period (.) appearing outside brackets represents any character except a newline (in multiline mode it matches a newline, too).
a = 'It costs $12.'
show_regexp(a, /c.s/)   → It <<cos>>ts $12.
show_regexp(a, /./)     → <<I>>t costs $12.
show_regexp(a, /\./)    → It costs $12<<.>>
Repetition * + ? {m,n}
r*       matches zero or more occurrences of r.
r+       matches one or more occurrences of r.
r?       matches zero or one occurrence of r.
r{m,n}   matches at least m and at most n occurrences of r.
r{m,}    matches at least m occurrences of r.
r{m}     matches exactly m occurrences of r.
r*?      matches zero or more occurrences of r (non-greedy).
r+?      matches one or more occurrences of r (non-greedy).
a = "The moon is made of cheese"
show_regexp(a, /\w+/)             → <<The>> moon is made of cheese
show_regexp(a, /\s.*\s/)          → The<< moon is made of >>cheese
show_regexp(a, /\s.*?\s/)         → The<< moon >>is made of cheese
show_regexp(a, /[aeiou]{2,99}/)   → The m<<oo>>n is made of cheese
show_regexp(a, /mo?o/)            → The <<moo>>n is made of cheese
Alternation |
a = "red ball blue sky"
show_regexp(a, /d|e/)                  → r<<e>>d ball blue sky
show_regexp(a, /al|lu/)                → red b<<al>>l blue sky
show_regexp(a, /red ball|angry sky/)   → <<red ball>> blue sky
Grouping ()
Everything within the group is treated as a single regular expression.
show_regexp('banana', /an*/)     → b<<an>>ana
show_regexp('banana', /(an)*/)   → <<>>banana
show_regexp('banana', /(an)+/)   → b<<anan>>a
a = 'red ball blue sky'
show_regexp(a, /blue|red/)                → <<red>> ball blue sky
show_regexp(a, /(blue|red) \w+/)          → <<red ball>> blue sky
show_regexp(a, /(red|blue) \w+/)          → <<red ball>> blue sky
show_regexp(a, /red|blue \w+/)            → <<red>> ball blue sky
show_regexp(a, /red (ball|angry) sky/)    → no match
a = 'the red angry sky'
show_regexp(a, /red (ball|angry) sky/)    → the <<red angry sky>>
Within the pattern, the sequence \1 refers to the match of the first group, \2 the second group, and so on. Outside the pattern, the special variables $1, $2, and so on, serve the same purpose.
"12:50am" =~ /(\d\d):(\d\d)(..)/     → 0
"Hour is #$1, minute #$2"            → "Hour is 12, minute 50"
"12:50am" =~ /((\d\d):(\d\d))(..)/   → 0
"Time is #$1"                        → "Time is 12:50"
"Hour is #$2, minute #$3"            → "Hour is 12, minute 50"
"AM/PM is #$4"                       → "AM/PM is am"
Backreferences can also be used to look for various forms of repetition.
# match duplicated letter
show_regexp('He said "Hello"', /(\w)\1/)        → He said "He<<ll>>o"
# match duplicated substrings
show_regexp('Mississippi', /(\w+)\1/)           → M<<ississ>>ippi
# match delimiters
show_regexp('He said "Hello"', /(["']).*?\1/)   → He said <<"Hello">>
show_regexp("He said 'Hello'", /(["']).*?\1/)   → He said <<'Hello'>>
Pattern-Based Substitution
String#sub performs one replacement.
String#gsub replaces every occurrence of the match.
a = "the quick brown fox"
a.sub(/[aeiou]/, '*')    → "th* quick brown fox"
a.gsub(/[aeiou]/, '*')   → "th* q**ck br*wn f*x"
a.sub(/\s\S+/, '')       → "the brown fox"
a.gsub(/\s\S+/, '')      → "the"
The replacement can also be computed by a block, which receives the matched text:
a = "the quick brown fox"
a.sub(/^./) {|match| match.upcase }          → "The quick brown fox"
a.gsub(/[aeiou]/) {|vowel| vowel.upcase }    → "thE qUIck brOwn fOx"
def mixed_case(name)
  name.gsub(/\b\w/) {|first| first.upcase }
end
mixed_case("fats waller")           → "Fats Waller"
mixed_case("louis armstrong")       → "Louis Armstrong"
mixed_case("strength in numbers")   → "Strength In Numbers"
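Backslash Sequences in the Substitution
"fred:smith".sub(/(\w+):(\w+)/, '\2, \1')   → "smith, fred"
"nercpyitno".gsub(/(.)(.)/, '\2\1')         → "encryption"
Additional backslash sequences work in the replacement string:
\& (last match),
\+ (last matched group),
\` (string prior to match),
\' (string after match),
\\ (a literal backslash)
To replace each backslash with two (results are shown as the strings would print):
str = 'a\b\c'                   → "a\b\c"
str.gsub(/\\/, '\\\\\\\\')      → "a\\b\\c"
or
str = 'a\b\c'                   → "a\b\c"
str.gsub(/\\/, '\&\&')          → "a\\b\\c"
or
str = 'a\b\c'                   → "a\b\c"
str.gsub(/\\/) { '\\\\' }       → "a\\b\\c"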
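The examples above cover numbered groups and the literal backslash. A minimal sketch of the match-relative sequences follows; the method name substitution_sequences and the sample sentence are assumptions for this example. Note that the "string after match" sequence needs a double-quoted replacement here, because a backslash followed by a quote inside single quotes would just produce a quote.

```ruby
# Hypothetical helper: shows what each sequence expands to in a replacement.
def substitution_sequences(str = "the quick brown fox")
  {
    whole_match: str.sub(/quick/, '<\&>'),    # backslash-ampersand: the matched text
    before: str.sub(/quick/, '[\`]'),         # backslash-backtick: text before the match
    after: str.sub(/quick/, "[\\']")          # backslash-quote: text after the match
  }
end

substitution_sequences.each { |name, result| puts "#{name}: #{result}" }
```

This prints "the <quick> brown fox", "the [the ] brown fox", and "the [ brown fox] brown fox".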
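Example: a method that uses the n modifier (encoding "none"; the encoding modifiers matter mainly for multibyte text such as Japanese):
def unescapeHTML(string)
  str = string.dup
  # Replace each &...; entity with the character it stands for.
  str.gsub!(/&(.*?);/n) {
    match = $1.dup
    case match
    when /\Aamp\z/ni           then '&'
    when /\Aquot\z/ni          then '"'
    when /\Agt\z/ni            then '>'
    when /\Alt\z/ni            then '<'
    when /\A#(\d+)\z/n         then Integer($1).chr   # decimal character code
    when /\A#x([0-9a-f]+)\z/ni then $1.hex.chr        # hexadecimal character code
    end
  }
  str
end
puts unescapeHTML("1&lt;2 &amp;&amp; 4&gt;3")
puts unescapeHTML("&quot;A&quot; = &#65; = &#x41;")
produces:
1<2 && 4>3
"A" = A = A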
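Object-Oriented Regular Expressions
Regexp#match returns a MatchData object (or nil if there is no match) that carries the same information as the $ variables.
re = /(\d+):(\d+)/      # match a time hh:mm
md = re.match("Time: 12:34am")
md.class                → MatchData
md[0]          # == $&  → "12:34"
md[1]          # == $1  → "12"
md[2]          # == $2  → "34"
md.pre_match   # == $`  → "Time: "
md.post_match  # == $'  → "am"
Because the match data is stored in its own object, the results of several matches can be kept at the same time:
re = /(\d+):(\d+)/      # match a time hh:mm
md1 = re.match("Time: 12:34am")
md2 = re.match("Time: 10:30pm")
md1[1, 2]   → ["12", "34"]
md2[1, 2]   → ["10", "30"]
The $ variables take their values from $~, so assigning a MatchData object to $~ changes them:
re = /(\d+):(\d+)/
md1 = re.match("Time: 12:34am")
md2 = re.match("Time: 10:30pm")
[ $1, $2 ]   # last successful match       → ["10", "30"]
$~ = md1
[ $1, $2 ]   # previous successful match   → ["12", "34"]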
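Working with the MatchData object directly also makes the no-match case explicit. The following is a sketch only; the method name parse_time and the shape of the returned hash are choices made for this example.

```ruby
# Hypothetical helper: extracts hh:mm from a string via MatchData.
def parse_time(text)
  md = /(\d+):(\d+)/.match(text)
  return nil unless md                 # match returns nil when nothing matches

  # captures returns the groups as an array, without the whole match.
  hour, minute = md.captures.map { |s| Integer(s, 10) }
  { hour: hour, minute: minute, rest: md.post_match }
end

p parse_time("Time: 12:34am")
p parse_time("no time here")
```

The first call yields hour 12, minute 34, and "am" as the remainder; the second returns nil instead of raising.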
Regex Characters List:
.           any character except newline
[ ]         any single character of set
[^ ]        any single character NOT of set
*           0 or more of the previous regular expression
*?          0 or more of the previous regular expression (non-greedy)
+           1 or more of the previous regular expression
+?          1 or more of the previous regular expression (non-greedy)
?           0 or 1 of the previous regular expression
|           alternation
( )         grouping regular expressions
^           beginning of a line or string
$           end of a line or string
{m,n}       at least m but at most n of the previous regular expression
{m,n}?      at least m but at most n of the previous regular expression (non-greedy)
\A          beginning of a string
\b          backspace (0x08) (inside [] only)
\b          word boundary (outside [] only)
\B          non-word boundary
\d          digit, same as [0-9]
\D          non-digit
\S          non-whitespace character
\s          whitespace character [ \t\n\r\f]
\W          non-word character
\w          word character [0-9A-Za-z_]
\z          end of a string
\Z          end of a string, or before newline at the end
(?# )       comment
(?: )       grouping without backreferences
(?= )       zero-width positive look-ahead assertion
(?! )       zero-width negative look-ahead assertion
(?ix-ix)    turns on/off i/x options, localized in group if any.
(?ix-ix: )  turns on/off i/x options, localized in non-capturing group.
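The extended groups at the end of this list have no examples in the notes, so here is a small sketch of four of them. The method name extension_demo and the sample strings are invented for illustration.

```ruby
# Hypothetical helper: one small case per extended group.
def extension_demo
  {
    # (?: ) groups without capturing, so group 1 is the year, not the month.
    non_capturing: "Feb 2024"[/(?:Jan|Feb) (\d+)/, 1],
    # (?= ) requires "px" to follow but leaves it out of the match.
    lookahead: "width 120px"[/\d+(?=px)/],
    # (?! ) skips digits that are followed by another digit or by "px".
    negative_lookahead: "120px 45em"[/\d+(?!\d|px)/],
    # (?i: ) makes only the enclosed part case insensitive.
    inline_option: ["RUBY regexp", "RUBY Regexp"].map { |s| !!(s =~ /(?i:ruby) regexp/) }
  }
end

p extension_demo
```

The results are "2024", "120", "45", and [true, false]: in the last case only "ruby" is matched case-insensitively, so "Regexp" with a capital R fails.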
Special Character Classes (used inside brackets, e.g. [[:alnum:]]):
[:alnum:]    alphanumeric characters
[:alpha:]    alphabetic characters
[:blank:]    space and tab only (no newlines, carriage returns, etc.)
[:cntrl:]    control characters
[:digit:]    decimal digits
[:graph:]    printable characters excluding space
[:lower:]    lowercase characters
[:print:]    printable characters (including space)
[:punct:]    punctuation characters
[:space:]    whitespace, including tabs, carriage returns, etc.
[:upper:]    uppercase characters
[:xdigit:]   hexadecimal digits
If you repost this article, please credit the source: Captain's Log (船长日志). Link to this article: http://www.cslog.cn/Content/ruby_regexp/