Python basics: regular expressions and the re module

Source: Internet
Author: User

Regular Expressions

In essence, a regular expression (or RE) is a small, highly specialized programming language that is embedded in Python and made available through the re module. A regular expression pattern is compiled into a sequence of bytecodes, which is then executed by a matching engine written in C.

Character matching (ordinary characters and metacharacters):

1. Ordinary characters (exact match): most letters and characters simply match themselves.

    >>> import re
    >>> res = 'Hello World Good Morning'
    >>> re.findall('Hello', res)
    ['Hello']

2. Metacharacters (fuzzy match): . ^ $ * + ? { } [ ] | ( ) \

Metacharacters

. Wildcard: matches any single character except a newline. A newline in the middle of the text is not skipped; the match simply fails at that point.

    >>> import re
    >>> str1 = "hello\n,张三"
    >>> # re.findall(pattern, string): the first argument is the pattern, the second is the string to search.
    >>> # It returns a list of all matches. The other functions of the re module are covered later.
    >>> re.findall("l", str1)       # scan for l; each match becomes one element of the list
    ['l', 'l']
    >>> re.findall("h.", str1)      # h followed by any one character except a newline (two characters in total)
    ['he']
    >>> re.findall("hello.", str1)  # the character after hello is a newline, so nothing matches
    []
    >>> re.findall("张.", str1)     # the match may occur anywhere in the string
    ['张三']

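^ Start anchor: matches only at the beginning of the string. If the pattern is not found there, the match fails and the rest of the string is not searched.

    >>> re.findall("^hello", str1)  # the string starts with hello, so the match succeeds
    ['hello']
    >>> re.findall("^llo", str1)    # the string does not start with llo, so the match fails
    []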

$ End anchor: matches only at the end of the string.

    >>> re.findall("llo$", str1)   # the string ends with 三, not with llo
    []
    >>> re.findall("三$", str1)
    ['三']
    >>> re.findall("张三$", str1)  # several characters can be anchored at once
    ['张三']
    >>> re.findall("张.$", str1)   # different metacharacters can be combined
    ['张三']
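The two anchors are often combined to require that the whole string match. The following is a minimal sketch of that idea; the sample strings are made up for illustration and are not taken from the examples above.

```python
import re

pattern = "^hello$"  # hello must start at the beginning and stop at the end

exact = re.findall(pattern, "hello")              # ['hello']
with_suffix = re.findall(pattern, "hello world")  # []
with_prefix = re.findall(pattern, "oh hello")     # []

# By default $ also matches just before a newline that ends the string.
before_final_newline = re.findall("llo$", "hello\n")  # ['llo']

print(exact, with_suffix, with_prefix, before_final_newline)
```

The last line explains why "llo$" fails on str1 above: the newline there is in the middle of the string, not at its end.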

Repetition metacharacters

? Makes the preceding character or group optional: it may occur 0 or 1 times.

    >>> re.findall("ho?el", str1)  # o is optional; it is absent here, so hel matches
    ['hel']
    >>> re.findall("he?l", str1)   # e is optional; it is present here, so hel matches
    ['hel']
    >>> re.findall("he?el", str1)  # the first e is optional; it is treated as absent, so hel matches
    ['hel']

* Repeats the preceding character or group 0 or more times, i.e. the range [0, +∞).

    >>> str2 = "111111133111188211111111111134234"
    >>> re.findall("1*", str2)   # 1* means the character 1 occurs 0 or more times; every position that is not a 1 yields an empty match
    ['1111111', '', '', '1111', '', '', '', '111111111111', '', '', '', '', '', '']
    >>> re.findall("11*", str2)  # the first 1 is required; the second 1 occurs 0 or more times
    ['1111111', '1111', '111111111111']

+ Similar to *: repeats the preceding character or group 1 or more times, i.e. the range [1, +∞).

    >>> re.findall("11+", str2)  # the second 1 must occur 1 or more times
    ['1111111', '1111', '111111111111']
    >>> re.findall("1+", str2)   # 1 must occur once or more
    ['1111111', '1111', '111111111111']

{} {n,m} specifies a repetition range from n to m times. With m omitted, {n,} means n or more times; a single number, {n}, means exactly n times.

    >>> str2 = "111111133111188211111111111134234"
    >>> re.findall("1{4}", str2)    # the character 1 repeated exactly 4 times; this matches 5 times in total
    ['1111', '1111', '1111', '1111', '1111']
    >>> re.findall("3{2}", str2)    # 3 repeated exactly twice
    ['33']
    >>> re.findall("3{1,3}", str2)  # 3 repeated 1 to 3 times
    ['33', '3', '3']
    >>> re.findall("1{1,}", str2)   # 1 repeated at least once
    ['1111111', '1111', '111111111111']
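The repetition forms above can be compared side by side on one input. This is only an illustrative sketch; the sample string is invented for the comparison.

```python
import re

sample = "x ax aax aaax"  # illustrative string, not from the article

zero_or_one = re.findall(r"a?x", sample)       # at most one a before x
zero_or_more = re.findall(r"a*x", sample)      # any number of a before x
one_or_more = re.findall(r"a+x", sample)       # at least one a before x
two_to_three = re.findall(r"a{2,3}x", sample)  # two or three a before x

print(zero_or_one)   # ['x', 'ax', 'ax', 'ax']
print(zero_or_more)  # ['x', 'ax', 'aax', 'aaax']
print(one_or_more)   # ['ax', 'aax', 'aaax']
print(two_to_three)  # ['aax', 'aaax']
```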

Escape characters

\ A backslash in front of a metacharacter removes its special meaning; a backslash in front of certain ordinary characters gives them a special meaning.

\d  matches any decimal digit; for ASCII text it is equivalent to the class [0-9].
\D  matches any non-digit character; equivalent to the class [^0-9].
\s  matches any whitespace character; equivalent to the class [ \t\n\r\f\v].
\S  matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v].
\w  matches any letter, digit, underscore, or Chinese character; for ASCII text it is equivalent to the class [a-zA-Z0-9_].
\W  matches any character that is not a letter, digit, underscore, or Chinese character; for ASCII text it is equivalent to the class [^a-zA-Z0-9_].
\b  matches a word boundary, for example the position between a word character and a space, &, #, and so on.

Test string: contains special characters, digits, uppercase and lowercase letters, and a Chinese character

    >>> str3 = '123456  7890@#$abcd  efg!%&*HIJKLMN陈'

\d matches decimal digits

    >>> re.findall(r"\d", str3)
    ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']

\D matches non-digit characters

    >>> re.findall(r"\D", str3)
    [' ', ' ', '@', '#', '$', 'a', 'b', 'c', 'd', ' ', ' ', 'e', 'f', 'g', '!', '%', '&', '*', 'H', 'I', 'J', 'K', 'L', 'M', 'N', '陈']

\s matches any whitespace character

    >>> re.findall(r"\s", str3)
    [' ', ' ', ' ', ' ']

\S matches any non-whitespace character

    >>> re.findall(r"\S", str3)
    ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '@', '#', '$', 'a', 'b', 'c', 'd', 'e', 'f', 'g', '!', '%', '&', '*', 'H', 'I', 'J', 'K', 'L', 'M', 'N', '陈']

\w matches any letter, digit, underscore, or Chinese character

    >>> re.findall(r"\w", str3)
    ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'H', 'I', 'J', 'K', 'L', 'M', 'N', '陈']

\W matches any character that is not a letter, digit, underscore, or Chinese character

    >>> re.findall(r"\W", str3)
    [' ', ' ', '@', '#', '$', ' ', ' ', '!', '%', '&', '*']

\b matches a word boundary. Note: \b also means backspace to the Python interpreter, so the pattern must be written as a raw string (r prefix); the backslash then reaches the re module unchanged.

    >>> re.findall(r"abcd\b", str3)
    ['abcd']
    >>> re.findall(r"123456\b", str3)
    ['123456']
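A common use of \b is matching a whole word rather than a substring. The sketch below assumes a made-up sample sentence and only illustrates the difference.

```python
import re

text = "cat concat cat's catalog"  # illustrative text

# \b on both sides: cat must not be glued to other word characters.
whole_words = re.findall(r"\bcat\b", text)

# Without \b every occurrence of the three letters is found.
substrings = re.findall(r"cat", text)

print(len(whole_words), len(substrings))  # 2 4
```

Here only the standalone "cat" and the "cat" in "cat's" count as whole words, because an apostrophe is not a word character.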

Raw string (r prefix) example 1: match 'c\l' in "abc\le"; \l has no special meaning to Python

    >>> import re
    >>> str4 = "abc\le"
    >>> # Recent Python versions also warn about the unrecognized escape \l in non-raw literals such as "abc\le".
    >>> re.findall('c\l', 'abc\le')     # error: Python passes c\l through unchanged, and the re module cannot interpret the escape \l
    >>> re.findall('c\\l', 'abc\le')    # error: Python turns the two backslashes into one, so the re module again receives c\l
    >>> re.findall('c\\\l', 'abc\le')   # Python turns the three backslashes into two; the re module reads them as one literal backslash and matches c\l
    ['c\\l']
    >>> # The matched text c\l contains a backslash, so the interpreter escapes it once when displaying the result.
    >>> re.findall('c\\\\l', 'abc\le')  # Python turns backslashes 1-2 into one and 3-4 into another, so two backslashes reach the re module
    ['c\\l']
    >>> re.findall(r'c\\l', 'abc\le')   # with the r prefix Python passes both backslashes to the re module unchanged
    ['c\\l']

Raw string (r prefix) example 2: match 'c\b' in 'abc\be'; \b does have a special meaning to Python, namely the ASCII backspace character

    >>> str5 = "abc\be"
    >>> re.findall('c\b', 'abc\be')
    ['c\x08']
    >>> re.findall('c\\b', 'abc\be')
    ['c']
    >>> re.findall('c\\\b', 'abc\be')
    ['c\x08']
    >>> re.findall('c\\\\b', 'abc\be')
    []
    >>> re.findall(r'c\b', 'abc\be')
    ['c']
    >>> re.findall(r'c\\b', 'abc\be')
    []
    >>> re.findall(r'c\\\b', 'abc\be')
    []
    >>> re.findall(r'c\\\\b', 'abc\be')
    []
    >>> re.findall(r'c\b', r'abc\be')
    ['c']
    >>> re.findall(r'c\\b', r'abc\be')  # when the searched string itself contains an escape sequence, it also needs the r prefix to be kept as written
    ['c\\b']
    >>> re.findall(r'c\\\b', r'abc\be')
    ['c\\']
    >>> re.findall(r'c\\\\b', r'abc\be')
    []
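One way to keep the two escaping layers apart is to look at the string the re module actually receives. The sketch below is illustrative only; the variable names are made up, and re.escape() is used as one possible way to build a pattern that matches a text literally.

```python
import re

# Three ways to write a pattern; compare what the re module actually receives.
doubled = "c\\b"    # ordinary literal: \\ collapses to one backslash
raw = r"c\b"        # raw literal: the backslash is kept as typed
backspace = "c\b"   # ordinary literal: \b becomes the backspace character

print(doubled == raw)            # True: both are the 3 characters c \ b
print(len(raw), len(backspace))  # 3 2

# re.escape() adds the backslashes needed to match a text literally.
literal_pattern = re.escape(r"c\l")
found = re.findall(literal_pattern, r"abc\le")
print(found)                     # ['c\\l']
```

Writing every pattern as a raw string means only the re module's escaping rules need to be considered.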

() Grouping comes in two kinds: ordinary groups and named groups. A group treats part of the pattern as a single unit, so that a rule such as a quantifier applies to the unit as a whole.

    >>> str6 = "faefhuknghellohellohelloafeahelloadf"
    >>> re.findall("(hello)+", str6)    # hello is grouped into one unit and + applies to it; findall returns only the content of the group
    ['hello', 'hello']
    >>> re.findall("(?:hello)+", str6)  # ?: makes the group non-capturing, so the whole match is returned
    ['hellohellohello', 'hello']

Groups are typically used together with the re.search() and re.match() methods

re.search() takes the same arguments as findall, but it returns a match object; call the object's group() method to get the matched text. re.search() stops at the first match and does not continue searching the rest of the string, so there is at most one result.

    >>> re.search("(hello)+", str6)  # Python 3.7+ repr; older versions show _sre.SRE_Match
    <re.Match object; span=(9, 24), match='hellohellohello'>
    >>> ret = re.search("(hello)+", str6)
    >>> ret.group()
    'hellohellohello'
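Besides group() with no argument, a match object exposes the individual groups and the position of the match. A small sketch, reusing the lowercase test string from the grouping example:

```python
import re

text = "faefhuknghellohellohelloafeahelloadf"
m = re.search(r"(hello)+", text)

whole = m.group()   # same as m.group(0): the entire match
last = m.group(1)   # the text most recently captured by group 1
span = m.span()     # (start, end) indexes of the entire match

print(whole, last, span)  # hellohellohello hello (9, 24)
```

Because the group is repeated by +, group(1) holds only the last repetition, which is also why findall returned 'hello' for this pattern.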

The re.match() method behaves like the metacharacter ^: it matches only at the beginning of the string. It returns a match object, and the matched text is again obtained through the group() method.

    >>> str7 = "hellohellohellonameafeahelloadf"
    >>> ret = re.match("(hello)+", str7)  # the string starts with hello, so the match succeeds
    >>> ret.group()
    'hellohellohello'
    >>> ret = re.match("(name)+", str7)   # the string does not start with name, so nothing matches and ret is None
    >>> ret.group()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'NoneType' object has no attribute 'group'
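Since re.match() and re.search() return None when nothing matches, calling group() directly can raise the AttributeError shown above. One possible guard is sketched below; the helper name `leading_hellos` is hypothetical.

```python
import re

def leading_hellos(text):
    """Return the run of 'hello' at the start of text, or None if there is none."""
    m = re.match(r"(?:hello)+", text)
    # Check for None before touching the match object.
    return m.group() if m is not None else None

print(leading_hellos("hellohellohellonameafeahelloadf"))  # hellohellohello
print(leading_hellos("namehello"))                        # None
```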

Named groups: give a group a name, then retrieve its value by passing that name to the group() method.

    >>> str8 = "-blog-aticles-2015-04"
    >>> ret = re.search(r"-blog-aticles-(?P<year>\d+)-(?P<month>\d+)", str8)  # ?P<name> defines a named group; the name goes inside <>
    >>> ret.group('year')
    '2015'
    >>> ret.group('month')
    '04'
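All named groups of a match can also be read at once with groupdict(). A brief sketch using the same pattern and string:

```python
import re

m = re.search(r"-blog-aticles-(?P<year>\d+)-(?P<month>\d+)", "-blog-aticles-2015-04")

fields = m.groupdict()          # every named group as a dictionary
year = int(m.group("year"))     # captured text is a str; convert when a number is needed
month = int(m.group("month"))

print(fields)       # {'year': '2015', 'month': '04'}
print(year, month)  # 2015 4
```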

[] Character set: the characters inside the brackets are alternatives, so the set matches any one of them. Inside a set only the three characters - ^ \ keep a special meaning; all other metacharacters lose theirs.

    >>> str9 = "adf13415aggae8657dfc"
    >>> re.findall("a[dg]+", str9)  # a followed by one or more d or g
    ['ad', 'agg']

Inside a character set, - denotes a range

    >>> re.findall("[0-9]+", str9)  # one or more characters from the digits 0-9
    ['13415', '8657']
    >>> re.findall("[a-z]+", str9)
    ['adf', 'aggae', 'dfc']

Inside a character set, a leading ^ negates the set

    >>> re.findall("[^a-z]+", str9)  # one or more characters that are not the letters a-z
    ['13415', '8657']
    >>> re.findall("[^0-9]+", str9)
    ['adf', 'aggae', 'dfc']

Inside a character set, \ still introduces an escape

    >>> re.findall(r"[\d]", str9)
    ['1', '3', '4', '1', '5', '8', '6', '5', '7']
    >>> re.findall(r"[\w]", str9)
    ['a', 'd', 'f', '1', '3', '4', '1', '5', 'a', 'g', 'g', 'a', 'e', '8', '6', '5', '7', 'd', 'f', 'c']
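The rule that only - ^ \ stay special inside a character set can be checked directly. A minimal sketch with an invented sample string:

```python
import re

text = "a.b*c+d-e^f"  # illustrative string

# Inside [] the metacharacters . * + are ordinary characters.
punct = re.findall(r"[.*+]", text)

# A leading ^ negates the set; a - between two characters forms a range.
not_letters = re.findall(r"[^a-z]", text)

# To match - or ^ themselves, escape them with a backslash.
dash_or_caret = re.findall(r"[\-\^]", text)

print(punct)          # ['.', '*', '+']
print(not_letters)    # ['.', '*', '+', '-', '^']
print(dash_or_caret)  # ['-', '^']
```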

| Pipe symbol: means "or"

    >>> str10 = "www.oldboy.com;www.oldboy.cn;www.baidu.com;"
    >>> re.findall(r"www\.(?:\w+)\.(?:com|cn)", str10)
    ['www.oldboy.com', 'www.oldboy.cn', 'www.baidu.com']
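The group around com|cn matters, because | otherwise splits the entire pattern into two alternatives. The sketch below contrasts the two forms on the same string.

```python
import re

text = "www.oldboy.com;www.oldboy.cn;www.baidu.com;"

# Grouped: only the suffix varies between com and cn.
grouped = re.findall(r"www\.\w+\.(?:com|cn)", text)

# Ungrouped: the alternatives are "www\.\w+\.com" and plain "cn".
ungrouped = re.findall(r"www\.\w+\.com|cn", text)

print(grouped)    # ['www.oldboy.com', 'www.oldboy.cn', 'www.baidu.com']
print(ungrouped)  # ['www.oldboy.com', 'cn', 'www.baidu.com']
```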

Greedy and non-greedy matching (applies to repetition)

Greedy matching: take the longest possible result

    >>> str11 = "dasa11s6666dabccccasd"
    >>> re.findall("abc+", str11)  # + covers 1 to unlimited repetitions, so it takes every c that is there
    ['abcccc']
    >>> re.findall(r"\d+", str11)
    ['11', '6666']

Non-greedy matching: take the shortest possible result

    >>> re.findall(r"\d+?", str11)  # a ? after + means take the minimum of the range
    ['1', '1', '6', '6', '6', '6']
    >>> re.findall("abc+?", str11)
    ['abc']

An example of non-greedy matching in practice:

    >>> str12 = '<div>yuan</div><a href=""></div>'
    >>> re.findall("<div>.*?</div>", str12)  # stops at the first </div>
    ['<div>yuan</div>']
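Running the greedy and the non-greedy form on the same input makes the difference visible. A short sketch using the string from the example above:

```python
import re

html = '<div>yuan</div><a href=""></div>'

greedy = re.findall(r"<div>.*</div>", html)  # .* runs to the last </div>
lazy = re.findall(r"<div>.*?</div>", html)   # .*? stops at the first </div>

print(greedy)  # ['<div>yuan</div><a href=""></div>']
print(lazy)    # ['<div>yuan</div>']
```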

Non-greedy matching rules:

*?      repeat any number of times, but as few times as possible
+?      repeat 1 or more times, but as few times as possible
??      repeat 0 or 1 times, but as few times as possible
{n,m}?  repeat n to m times, but as few times as possible
{n,}?   repeat n or more times, but as few times as possible

.*? usage:

.  is any character
*  takes 0 to unlimited repetitions
?  switches to non-greedy mode

Together they take as few arbitrary characters as possible. The combination is rarely used on its own; it usually appears in a pattern such as ".*?a", which takes characters of any length up to the first a that follows.

re module methods

re.findall() method: finds all matches and returns them in a list.

re.search() method: returns a match object for the first match and does not search any further; the matched text is obtained through the group() method, as shown in the grouping section. If nothing matches, it returns None.

re.match() method: matches only at the beginning of the string and returns a match object; the matched text is obtained through the group() method. See the example in the grouping section.

re.split() method: splits the string wherever the pattern matches and returns the pieces in a list; the number of splits can be limited.

    >>> str13 = "hello23world12my7name"
    >>> re.split(r"\d+", str13)              # use the digits as separators and return the remaining pieces
    ['hello', 'world', 'my', 'name']
    >>> re.split(r"(\d+)", str13)            # with a capturing group the separators are returned as well
    ['hello', '23', 'world', '12', 'my', '7', 'name']
    >>> re.split(r"\d+", str13, maxsplit=1)  # split only once
    ['hello', 'world12my7name']

Special case: a split can produce empty strings

    >>> re.split("l", "hello bob")
    ['he', '', 'o bob']
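Two follow-ups are common in practice: splitting on any of several separators with a character set, and dropping the empty strings a split may produce. The inputs below are made up for illustration.

```python
import re

# Split on runs of commas, semicolons, or whitespace.
parts = re.split(r"[,;\s]+", "a, b;c  d")

# Filter out the empty string produced by the adjacent "ll" separators.
non_empty = [piece for piece in re.split("l", "hello bob") if piece]

print(parts)      # ['a', 'b', 'c', 'd']
print(non_empty)  # ['he', 'o bob']
```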

re.sub() method: replace. Argument order: pattern, replacement, string, and an optional count that limits the number of replacements.

    >>> str14 = "hello 123 123 name world my name"
    >>> re.sub(r"\d+", "a", str14)
    'hello a a name world my name'
    >>> re.sub(r"\d", "a", str14)
    'hello aaa aaa name world my name'
    >>> re.sub(r"\d", "a", str14, count=3)
    'hello aaa 123 name world my name'

re.subn(): like sub, but returns a tuple that also contains the number of replacements made

    >>> re.subn(r"\d", "a", str14, count=3)
    ('hello aaa 123 name world my name', 3)
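The replacement string may also refer back to groups captured by the pattern, using \1, \2, and so on. The following sketch shows that alongside subn(); the sample strings are illustrative.

```python
import re

text = "hello 123 123 name world my name"

# subn returns the new string together with the number of replacements.
masked, count = re.subn(r"\d+", "#", text)

# \1 and \2 in the replacement stand for the first and second group.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

print(masked, count)  # hello # # name world my name 2
print(swapped)        # world hello
```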

re.finditer(): returns an iterator of match objects; call the group() method on each one to get its text

    >>> ret = re.finditer(r"\d+", "dasfjk324khk4234kj234hkj234hkj234kj234k2j34hk2j3h4")
    >>> print(ret)
    <callable_iterator object at 0x000001ce69609470>
    >>> print(next(ret))
    <re.Match object; span=(6, 9), match='324'>
    >>> print(next(ret).group())  # the iterator has already advanced once, so this is the second match
    4234
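In practice the iterator is usually consumed with a loop or a comprehension rather than with next(). A minimal sketch on a shortened version of the string above:

```python
import re

text = "dasfjk324khk4234kj234"

# Collect each match together with its position in the string.
hits = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]

print(hits)  # [('324', (6, 9)), ('4234', (12, 16)), ('234', (18, 21))]
```

Unlike findall(), this gives access to the position of every match, not just its text.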

re.compile(): compiles a pattern into a pattern object whose methods can then be called; useful when the same pattern is used frequently

    >>> num_rule = re.compile(r'\d+')
    >>> print(num_rule, type(num_rule))  # Python 3.7+ output; older versions show _sre.SRE_Pattern
    re.compile('\\d+') <class 're.Pattern'>
    >>> num_rule.findall("hello 123 3241")
    ['123', '3241']
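A compiled pattern can be defined once and reused across many inputs. The sketch below assumes a small made-up list of lines; the name `NUMBER` is hypothetical.

```python
import re

NUMBER = re.compile(r"\d+")  # compiled once, reused for every line

lines = ["hello 123 3241", "no digits", "7 up"]
numbers_per_line = [NUMBER.findall(line) for line in lines]

print(numbers_per_line)  # [['123', '3241'], [], ['7']]
```

The pattern object offers the same operations as the module-level functions (findall, search, match, split, sub, finditer), without passing the pattern each time.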
