Python Learning Day 6: The re Module

One
Brief introduction:
In essence, a regular expression (or RE) is a small, highly specialized programming language. In Python it is embedded in the language and made available through the re module. A regular expression pattern is compiled into a sequence of bytecodes, which is then executed by a matching engine written in C.

Two
Character matching (ordinary characters and metacharacters):
1 Ordinary characters: most letters and characters simply match themselves.
>>> import re
>>> re.findall('alex', 'yuanalesxalexwupeiqi')
['alex']

2 Metacharacters: . ^ $ * + ? { } [ ] | ( ) \

The first metacharacters we examine are "[" and "]". They specify a character class, which is a set of characters that you want to match. Characters can be listed individually, or a range can be given as two characters separated by "-". For example, [abc] matches any one of the characters "a", "b", or "c"; the range [a-c] expresses the same set. To match only lowercase letters, write the RE as [a-z].
Metacharacters are not active inside a class. For example, [akm$] matches any of the characters "a", "k", "m", or "$". "$" is normally a metacharacter, but inside a character class it loses its special meaning and is treated as an ordinary character.
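
The character class rules above can be tried out with a short sketch. The sample strings are made up for illustration; only re.findall from the standard library is used.

```python
import re

# [abc] and [a-c] describe the same set of characters.
listed = re.findall(r"[abc]", "aXbYcZ")
ranged = re.findall(r"[a-c]", "aXbYcZ")
print(listed, ranged)          # ['a', 'b', 'c'] ['a', 'b', 'c']

# [a-z] matches lowercase letters only, so the capital H is skipped.
lowercase = re.findall(r"[a-z]", "Hello")
print(lowercase)               # ['e', 'l', 'l', 'o']

# Inside a class, "$" is an ordinary character.
with_dollar = re.findall(r"[akm$]", "k$9a")
print(with_dollar)             # ['k', '$', 'a']
```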

():
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'

[]: the metacharacters [] denote a character class. Inside a class only the characters ^, -, ] and \ keep a special meaning.
The character \ still escapes, the character - defines a range of characters, and the character ^ placed first negates the class.

* matches the preceding item 0 or more times
+ matches the preceding item 1 or more times
? matches the preceding item 0 or 1 times
{m} matches the preceding item exactly m times
{m,n} matches the preceding item m to n times
*?, +?, ??, {m,n}? The plain *, + and so on are greedy, that is, they match as much as possible. Appending ? makes them lazy (non-greedy).
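
As a minimal sketch of the quantifiers listed above, the following runs each one against the same made-up sample string so the results can be compared side by side.

```python
import re

text = "a ab abb abbb"

star = re.findall(r"ab*", text)          # b repeated 0 or more times
plus = re.findall(r"ab+", text)          # b repeated 1 or more times
optional = re.findall(r"ab?", text)      # b repeated 0 or 1 times
exact = re.findall(r"ab{2}", text)       # exactly two b
bounded = re.findall(r"ab{2,3}", text)   # two or three b, greedy
lazy = re.findall(r"ab{2,3}?", text)     # two or three b, as few as possible

print(star)      # ['a', 'ab', 'abb', 'abbb']
print(plus)      # ['ab', 'abb', 'abbb']
print(optional)  # ['a', 'ab', 'ab', 'ab']
print(exact)     # ['abb', 'abb']
print(bounded)   # ['abb', 'abbb']
print(lazy)      # ['abb', 'abb']
```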

As described above, '*', '+' and '?' are all greedy, which may not be what we want. Adding a question mark after them switches to the non-greedy strategy, which matches as little as possible. The following examples show the difference between the two:
>>> re.findall(r"a(\d+?)", "a23b")  # non-greedy mode
['2']
>>> re.findall(r"a(\d+)", "a23b")
['23']

>>> re.search('<(.*)>', '<H1>title</H1>').group()
'<H1>title</H1>'
>>> re.search('<(.*?)>', '<H1>title</H1>').group()
'<H1>'

Compare with this situation:
>>> re.findall(r"a(\d+)b", "a23b")
['23']
>>> re.findall(r"a(\d+?)b", "a23b")  # with a required match on both sides, non-greedy mode makes no difference
['23']


\:
A backslash followed by a metacharacter removes its special meaning.
A backslash followed by certain ordinary characters gives them a special meaning.
A backslash followed by a number refers to the string matched by the group with that number:
>>> re.search(r"(alex)(eric)com\2", "alexericcomeric").group()
'alexericcomeric'
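
A common use of numbered back-references is finding a word that is accidentally repeated. The sketch below is only an illustration with a made-up sentence; \1 refers to whatever group 1 matched.

```python
import re

# (\w+) captures a word; " \1" requires a space and then the same word again.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

repeated_text = doubled.group()    # the whole match
repeated_word = doubled.group(1)   # the captured word
print(repeated_text, "/", repeated_word)   # is is / is
```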

\d matches any decimal digit; it is equivalent to the class [0-9].
\D matches any non-digit character; it is equivalent to the class [^0-9].
\s matches any whitespace character; it is equivalent to the class [ \t\n\r\f\v].
\S matches any non-whitespace character; it is equivalent to the class [^ \t\n\r\f\v].
\w matches any alphanumeric character; it is equivalent to the class [a-zA-Z0-9_].
\W matches any non-alphanumeric character; it is equivalent to the class [^a-zA-Z0-9_].
These equivalences describe ASCII matching; in Python 3, str patterns also match the corresponding Unicode characters unless re.ASCII is set.
\b: matches a word boundary (including the start and end of the string), where a "word" is a run of consecutive letters, digits, and underscores. In other words, \b is the junction between a \w and a \W. It is a zero-width assertion: it matches a position, not a character. Because a word is defined as a sequence of alphanumeric characters, a word ends at whitespace or at any non-alphanumeric character.
>>> re.findall(r"abc\b", "dzx &abc sdsadasabcasdsadasdabcasdsa")
['abc']
>>> re.findall(r"\babc\b", "dzx &abc sdsadasabcasdsadasdabcasdsa")
['abc']
>>> re.findall(r"\babc\b", "dzx sabc sdsadasabcasdsadasdabcasdsa")
[]

For example, 'er\b' matches the 'er' in 'never' but not the 'er' in 'verb'.
\b matches only a position, such as the start or end of the string or the spot next to a space or line break; it does not match the whitespace itself.
For example, in "abc sdsadasabcasdsadasdabcasdsa", \sabc\s cannot match, but \babc\b matches "abc":
>>> re.findall("\babc\b", "abc sdsadasabcasdsadasdabcasdsa")  # not a raw string, so \b is a backspace character
[]
>>> re.findall(r"\babc\b", "abc sdsadasabcasdsadasdabcasdsa")
['abc']
Use \b when you want to match a whole word; if the text is not a whole word, it does not match. Suppose you want to match the word "I". Many words contain the letter I, but you only want the standalone word, so write \bI\b.
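
The whole-word case just described can be sketched as follows. The sentence is a made-up sample; it contains the standalone word "I" twice and one more capital I inside "IDEs".

```python
import re

sentence = "I think it is fine, and I like IDEs"

# Without boundaries, every capital I is found, including the one in "IDEs".
loose = re.findall(r"I", sentence)

# With \b on both sides, only the standalone word "I" is found.
whole = re.findall(r"\bI\b", sentence)

print(len(loose), len(whole))   # 3 2
```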
************************************************
Functions:

1
match: re.match(pattern, string, flags=0)
match only succeeds if the pattern matches at the start of the string.
flags are compile flags that modify how the regular expression is matched, for example case-insensitive matching or multi-line matching.
>>> re.match('com', 'comwww.runcomoob').group()
'com'
>>> re.match('com', 'Comwww.runcomoob', re.I).group()
'Com'

2
search: re.search(pattern, string, flags=0)
search scans the whole string and returns the first match.
>>> re.search(r'\dcom', 'www.4comrunoob.5com').group()
'4com'

Note:
re.match('com', 'comwww.runcomoob')
re.search(r'\dcom', 'www.4comrunoob.5com')
When the match succeeds, the result is a match object, which has the following methods:
group() returns the string matched by the RE
start() returns the position where the match starts
end() returns the position where the match ends
span() returns a tuple (start, end) with the positions of the match
group() can also take several group numbers at once, in which case it returns a tuple with the strings matched by those groups.
1. group() returns the whole string matched by the RE.
2. group(n, m) returns the strings matched by groups n and m; it raises an IndexError exception if a group number does not exist.
3. groups() returns a tuple containing the strings of all groups defined in the regular expression, from group 1 up to the last one. groups() is normally called without arguments.
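
The match object methods listed above can be exercised in one place. This is a minimal sketch; the pattern and the sample string are made up for illustration.

```python
import re

m = re.search(r"(\d+)([a-z]+)", "www.123abc.com")

whole = m.group()          # the whole match
position = m.span()        # same as (m.start(), m.end())
pair = m.group(1, 2)       # several groups at once, returned as a tuple
all_groups = m.groups()    # every group, from 1 to the last

# Asking for a group that does not exist raises IndexError.
try:
    m.group(3)
    missing_group_error = False
except IndexError:
    missing_group_error = True

print(whole, position, pair, all_groups, missing_group_error)
```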
import re
a = "123abc456"
re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(0)  # '123abc456', the whole match
re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(1)  # '123'
re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(2)  # 'abc'
re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(3)  # '456'

group(1) returns the part matched by the first pair of parentheses, group(2) the part matched by the second pair, and group(3) the part matched by the third pair.

-----------------------------------------------
3
findall
re.findall returns all the matching strings in a string as a list. For example:

p = re.compile(r'\d+')
print(p.findall('one1two2three3four4'))  # ['1', '2', '3', '4']

re.findall(r'\w*oo\w*', text) gets all the words in the string that contain 'oo'.

import re
text = "JGood is a handsome boy, he is handsome and cool, clever, and so on...."
print(re.findall(r'\w*oo\w*', text))  # result: ['JGood', 'cool']
# print(re.findall(r'(\w)*oo(\w)*', text))  # with (), the result is the subexpressions: [('G', 'd'), ('c', 'l')]

finditer():

>>> p = re.compile(r'\d+')
>>> iterator = p.finditer('drumm44ers drumming, 11 ... 10 ...')
>>> for match in iterator:
...     print(match.group(), match.span())
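
To make the behavior of finditer concrete, here is a small sketch on a made-up string. Unlike findall, each item is a match object, so the position of every match is available as well as its text.

```python
import re

pattern = re.compile(r"\d+")

# Collect (text, span) for every run of digits.
found = [(m.group(), m.span()) for m in pattern.finditer("one1two22three333")]

for text, span in found:
    print(text, span)
# 1 (3, 4)
# 22 (7, 9)
# 333 (14, 17)
```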

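4
sub, subn:

re.sub(pattern, repl, string, count=0)
re.subn takes the same arguments but returns a tuple (new_string, number_of_substitutions).
>>> re.sub("g.t", "have", 'I get A, I got B, I gut C')
'I have A, I have B, I have C'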
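
The count argument and the subn variant can be sketched with the same sample sentence. The variable names are chosen only for this illustration.

```python
import re

text = "I get A, I got B, I gut C"

# count=0 (the default) replaces every match.
all_replaced = re.sub(r"g.t", "have", text)

# count=2 stops after the first two replacements.
first_two = re.sub(r"g.t", "have", text, count=2)

# subn also reports how many replacements were made.
new_text, replacements = re.subn(r"g.t", "have", text)

print(all_replaced)   # I have A, I have B, I have C
print(first_two)      # I have A, I have B, I gut C
print(replacements)   # 3
```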

5
split
>>> p = re.compile(r'\d+')
>>> p.split('one1two2three3four4')
['one', 'two', 'three', 'four', '']

>>> re.split(r'\d+', 'one1two2three3four4')
['one', 'two', 'three', 'four', '']

6
re.compile(strPattern[, flag]):
This function is the factory for Pattern objects: it compiles a regular expression given as a string into a Pattern object. The second parameter, flag, is the matching mode; several values can be combined with the bitwise OR operator '|', for example re.I | re.M.
Regular expressions that you use often can be compiled into regular expression objects in advance, which can improve efficiency somewhat. The following is an example of a regular expression object:

import re
text = "JGood is a handsome boy, he is cool, clever, and so on..."
regex = re.compile(r'\w*oo\w*')
print(regex.findall(text))  # find all words that contain 'oo'
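
Combining flags with '|' at compile time might look like the following sketch. The pattern name and the sample text are made up; re.I ignores case and re.M lets ^ match at the start of every line.

```python
import re

# Compile once, then reuse the pattern object.
line_start_hello = re.compile(r"^hello", re.I | re.M)

text = "Hello world\nhello again\nsay hello"
hits = line_start_hello.findall(text)

print(hits)   # ['Hello', 'hello']  (the third line does not start with hello)
```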

Questions

1 Can findall return a list of all the matches instead of a list of the capturing group contents? Yes:
import re

a = 'abc123abv23456'
b = re.findall(r'23(a)?', a)
print(b)  # ['a', '']
b = re.findall(r'23(?:a)?', a)
print(b)  # ['23a', '23']

>>> re.findall(r"www\.(baidu|xinlang)\.com", "www.baidu.com")
['baidu']
>>> re.findall(r"www\.(?:baidu|xinlang)\.com", "www.baidu.com")
['www.baidu.com']
>>> re.findall(r"www\.(?:baidu|xinlang)\.com", "www.xinlang.com")
['www.xinlang.com']

When a group is used, findall returns the contents of the group rather than the whole match.
To get the whole match, add ?: at the start of the group to make it non-capturing.

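2 re.findall(r'\d*', 'www33333')  # the result contains empty strings, because \d* can also match zero digits

3 re.split("[bc]", "abcde")  # ['a', '', 'de']: the empty string sits between the adjacent separators b and c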

4 source = "1 - 2 * ((60-30 + (-9-2-5-2*3-5/3-40*4/2-3/5+6*3) * (-9-2-5-2*5/3 + 7/3*99/4*2998 + 10 * 568/14)) - (-4*3) / (16-3*2))"

re.search(r'\([^()]*\)', source).group()  # the first innermost pair of parentheses
regular = r'\d+\.?\d*([*/]|\*\*)[\-]?\d+\.?\d*'
re.search(regular, string).group()  # string is an expression without parentheses

add_regular = r'[\-]?\d+\.?\d*\+[\-]?\d+\.?\d*'
sub_regular = r'[\-]?\d+\.?\d*\-[\-]?\d+\.?\d*'
re.findall(sub_regular, "(3+4-5+7+9)")  # ['4-5']
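
The patterns above look like building blocks for evaluating an arithmetic expression step by step. The sketch below only shows the two lookup steps on shorter, made-up expressions without spaces; it does not evaluate anything, and the variable names are assumptions for this illustration.

```python
import re

expr = "1-2*((60-30)*(4/2))"

# \([^()]*\) matches a pair of parentheses with no parentheses inside,
# that is, the innermost sub-expressions.
innermost = re.findall(r"\([^()]*\)", expr)

# Inside a parenthesis-free expression, find the first multiplication or division.
mul_div = r"\d+\.?\d*([*/]|\*\*)[\-]?\d+\.?\d*"
step = re.search(mul_div, "9-2*3+40/5")
first_product = step.group()   # the operand-operator-operand text
operator = step.group(1)       # the operator captured by the group

print(innermost)                 # ['(60-30)', '(4/2)']
print(first_product, operator)   # 2*3 *
```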

5 Detect an IP address:
re.search(r"(([01]?\d?\d|2[0-4]\d|25[0-5])\.){3}([01]?\d?\d|2[0-4]\d|25[0-5])", "192.168.1.1")
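
One way to turn the IP pattern into a validity check is sketched below. The names OCTET, IPV4 and is_ipv4 are hypothetical helpers for this illustration. It uses fullmatch rather than search, because search would accept a string that merely contains something that looks like an address.

```python
import re

# One octet: 250-255, 200-249, or 0-199 (hypothetical helper names).
OCTET = r"(?:25[0-5]|2[0-4]\d|[01]?\d?\d)"
IPV4 = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")

def is_ipv4(text):
    """Return True if the whole string is a dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None

print(is_ipv4("192.168.1.1"))   # True
print(is_ipv4("256.1.1.1"))     # False

# search finds a match inside the invalid string, so it is not a validity check.
partial = IPV4.search("256.1.1.1").group()
print(partial)                  # 56.1.1.1
```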

-----------------------------------------------------------

re.I makes the match case insensitive
re.L makes the match locale-aware
re.M multi-line matching; affects ^ and $
re.S makes . match every character, including line breaks
>>> re.findall(".", "abc\nde")
['a', 'b', 'c', 'd', 'e']
>>> re.findall(".", "abc\nde", re.S)
['a', 'b', 'c', '\n', 'd', 'e']
re.U interprets characters according to the Unicode character set. This flag affects \w, \W, \b, \B.
re.X verbose mode: allows a more flexible layout, so that regular expressions are easier to write and read.
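
As an illustration of re.X, the sketch below writes one made-up pattern twice, once compactly and once in verbose form. In verbose mode, whitespace in the pattern is ignored and # starts a comment.

```python
import re

compact = re.compile(r"\d+(?:\.\d+)?")

verbose = re.compile(r"""
    \d+          # integer part
    (?:\.\d+)?   # optional fractional part
""", re.X)

sample = "3.14 and 42 and .5"
compact_hits = compact.findall(sample)
verbose_hits = verbose.findall(sample)

print(verbose_hits)   # ['3.14', '42', '5']
```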

re.S: . will match line breaks; by default . does not match line breaks
>>> re.findall(r"a(\d+)b.+a(\d+)b", "a23b\na34b")
[]
>>> re.findall(r"a(\d+)b.+a(\d+)b", "a23b\na34b", re.S)
[('23', '34')]
re.M: ^ and $ will match at every line; by default ^ matches only at the start of the string and $ only at the end of the string
>>> re.findall(r"^a(\d+)b", "a23b\na34b")
['23']
>>> re.findall(r"^a(\d+)b", "a23b\na34b", re.M)
['23', '34']
However, without the ^ anchor:
>>> re.findall(r"a(\d+)b", "a23b\na34b")
['23', '34']
As you can see, re.M is not needed in this case.

import re

n = '''12 drummers drumming,
11 pipers piping, 10 lords a-leaping'''

p = re.compile(r'^\d+')
p_multi = re.compile(r'^\d+', re.MULTILINE)  # set the MULTILINE flag
print(re.findall(p, n))  # ['12']
print(re.findall(p_multi, n))  # ['12', '11']
============================
import re
a = 'a23b'
print(re.findall(r'a(\d+?)', a))  # ['2']
print(re.findall(r'a(\d+)', a))  # ['23']
print(re.findall(r'a(\d+)b', a))  # ['23']
print(re.findall(r'a(\d+?)b', a))  # ['23']
============================
b = 'a23b\na34b'
# . matches any character except a line break

re.findall(r'a(\d+)b.+a(\d+)b', b)  # []

re.findall(r'a(\d+)b', b, re.M)  # ['23', '34']

re.findall(r'^a(\d+)b', b, re.M)  # ['23', '34']

re.findall(r'a(\d+)b', b)  # ['23', '34'], matches can be found on several lines

re.findall(r'^a(\d+)b', b)  # ['23'], by default ^ only matches at the start of the string

re.findall(r'a(\d+)b$', b)  # ['34'], by default $ only matches at the end of the string

re.findall(r'a(\d+)b.?', b, re.M)  # ['23', '34']

re.findall(r"a(\d+)b", "a23b\na34b")  # ['23', '34']
---------------------------------------------------------------

Recommended reading: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html

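About raw strings and \:

\n is a newline; its ASCII code is 10.
\r is a carriage return; its ASCII code is 13.

re.findall(r"\\", r"abc\de")  # ['\\'], a single literal backslash

f = open("C:\abc.txt")
\a is an escape sequence: character 007, the bell character BEL, so the path above is not the one intended.
f = open(r"D:\abc.txt")
Python string literals process backslash escapes themselves, so a backslash meant for the regular expression also has to be escaped with \ unless a raw string is used.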

>>> re.findall(r"\d", "ww2ee")
['2']
>>> re.findall("\d", "ww2ee")
['2']

>> It is strongly recommended to use raw strings for regular expressions.

You may have seen raw strings used in some of the previous examples. Raw strings exist precisely because of regular expressions. The reason is the conflict between ASCII escape characters and the special characters of regular expressions. For example, the special symbol "\b" represents the backspace character in ASCII, but "\b" is also a special symbol in regular expressions, meaning "match a word boundary". For the RE compiler to see the two characters "\b" rather than a backspace, you need to escape the backslash with another backslash, writing "\\b".
But this complicates matters, especially when the regular expression contains many special characters, and it easily becomes confusing. Raw strings are used to reduce this complexity. In fact, many Python programmers use only raw strings when defining regular expressions.
The following example shows the difference between the backspace character "\b" and the regular expression "\b" (with and without raw strings):
>>> m = re.search('\bblow', 'blow')  # backspace, no match

>>> re.search('\\bblow', 'I blow').group()  # escaped \, now it works
'blow'

>>> re.search(r'\bblow', 'I blow').group()  # use a raw string instead
'blow'

You may notice that we used "\d" in a regular expression without a raw string and had no problem. That is because ASCII has no corresponding special character, so the regular expression compiler knows you mean a decimal digit. Recent Python 3 versions emit a warning for such unrecognized escape sequences in ordinary strings, which is one more reason to prefer raw strings.
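
The difference between the three spellings of "\b" can be checked directly by looking at the strings themselves. This is a minimal sketch; the variable names are made up for illustration.

```python
import re

plain = "\bblow"      # Python turns \b into one backspace character
escaped = "\\bblow"   # a backslash, then the letter b
raw = r"\bblow"       # the same two characters, written as a raw string

print(len(plain), len(escaped), len(raw))   # 5 6 6
print(escaped == raw)                       # True

# Only the versions that keep the backslash reach the RE engine as a word boundary.
no_match = re.search(plain, "I blow")
match_text = re.search(raw, "I blow").group()
print(no_match, match_text)                 # None blow
```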
