Day 5 Module Learning: the re (Regular Expression) Module

Source: Internet
Author: User

1. Regular Expression Basics

1.1. Brief Introduction

Regular expressions are not part of Python itself. They are a powerful tool for processing strings, with their own syntax and an independent processing engine. They may be less efficient than the built-in str methods, but they are far more expressive. Because of this independence, the core regular expression syntax is the same in every language that provides it. What differs is which extensions each language supports, and the unsupported ones are usually rarely used. If you have already used regular expressions in another language, a quick look through this section is enough.

The process of matching with a regular expression is illustrated at: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html

 

The general matching process of a regular expression is as follows: the expression is compared with the characters in the text one after another. If every character matches, the match succeeds; as soon as one character fails to match, the match fails. If the expression contains quantifiers or boundary matchers, the process is slightly different, but it is still easy to understand: read the examples below and try them a few times yourself.

The regular expression metacharacters and syntax that Python supports are listed below:

1. "." matches any character other than a line break

>>> import re
>>> string = "abafsdafd\nafdasfd"  # contains a line break
>>> string1 = "adfasdfadfasdfwer12345656 character"  # no line break
>>> m = re.search(".+", string)  # check whether a line break can be matched
>>> m.group()
'abafsdafd'

>>> n = re.search(".+", string1)
>>> n.group()
'adfasdfadfasdfwer12345656 character'
From the output above, we can see that "." matches any character except a line break. When the "\n" line break is reached, the match ends.
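The behaviour of "." at a line break can be changed with a flag. The sketch below is only an illustration: the sample text is made up, and it uses the standard re.DOTALL flag (short form re.S), which the text above does not discuss.

```python
import re

text = "first line\nsecond line"  # illustrative sample, not from the original

# By default "." stops at the line break.
default_match = re.search(r".+", text).group()

# With re.DOTALL (re.S), "." also matches "\n".
dotall_match = re.search(r".+", text, re.DOTALL).group()

print(repr(default_match))  # 'first line'
print(repr(dotall_match))   # 'first line\nsecond line'
```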

2. "\" escape character

The escape character changes the meaning of the character that follows it. To match a literal * in a string, use \* or the character set [*]. "a\.c" matches a.c, and "a\\c" matches a\c.

>>> str_num = "22.567979 mafdasdf"
>>> m = re.search(r"\d+\.\d+", str_num)
>>> m.group()
'22.567979'
In a regular expression, "." matches any character except "\n". If it were not escaped here, it would match any such character instead of only the literal decimal point, so it has to be escaped.

>>> string = "dfafdasfd\fafdasfda"
>>> string
'dfafdasfd\x0cafdasfda'
In an ordinary Python string literal, a backslash can combine with the character after it to form an escape sequence, so the stored text is not what was typed: here "\f" became the form feed character \x0c. Writing the literal as a raw string (r"...", see section 1.3) keeps the backslash as it is.

3. [...] character set (character class)
[...]: A character set (character class) matches any one of the characters it contains. The characters can be listed one by one, as in [0123456789], or given as a range, as in [0-9]. If the first character in the set is ^, the set is negated. For example, [^abc] matches any character other than a, b, and c.

All special characters lose their special meaning inside a character set. To use ], -, or ^ as a literal inside a character set, put a backslash (\) in front of it, or place ] or - as the first character and ^ as a non-first character.

>>> string = "dafdafdasf[adfas^fad"

>>> m = re.search(r"[a-z]+\[", string)
>>> m.group()
'dafdafdasf['
From the script above, we can see that to match "[" outside a character set you need to put the "\" escape character before it.

>>> m = re.search(r"\w+[[]", string)  # (1) "[" unescaped inside a character set
>>> m.group()
'dafdafdasf['
>>> n = re.search(r"\w+[\[]", string)  # (2) "[" escaped inside a character set
>>> n.group()
'dafdafdasf['
Both forms match, so "[" does not have to be escaped inside a character set. Recent Python versions emit a FutureWarning for a set that begins with an unescaped "[", so the escaped form (2) is the safer choice. Note that the backslash before "w" is required: the pattern "w+[\[]" looks for a literal "w", finds nothing in this string, and search() returns None, so calling group() on the result raises AttributeError: 'NoneType' object has no attribute 'group'.
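The character set rules above can be tried out with a short sketch. The sample string is an assumption made for illustration (the original string extended with "]-x" so that every special character appears once).

```python
import re

text = "dafdafdasf[adfas^fad]-x"  # illustrative sample

# A range matches any one character between its endpoints.
letters = re.search(r"[a-z]+", text).group()

# A leading ^ negates the set: every character that is not a-z.
non_letters = re.findall(r"[^a-z]", text)

# Literal "]", "-" and "^" written with backslashes...
escaped = re.findall(r"[\]\-\^]", text)

# ...or by position: "]" first, "^" not first, "-" last.
positional = re.findall(r"[]^-]", text)

print(letters)      # dafdafdasf
print(non_letters)  # ['[', '^', ']', '-']
print(escaped)      # ['^', ']', '-']
print(positional)   # ['^', ']', '-']
```

The backslash form is usually easier to read, because it does not depend on where each character sits inside the brackets.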
Predefined Character Set

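1. \d digit

>>> string = "dfaMNA12581 paifa"
>>> m = re.search(r"\d", string)
>>> m.group()
'1'
\d matches a digit and is equivalent to [0-9]. (For str patterns in Python 3 it also matches other Unicode decimal digits unless the re.ASCII flag is set.)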
2. \D non-digit, equivalent to [^\d]

>>> string = "dfaMNA12581 paifa"
>>> m = re.search(r"\D", string)
>>> m.group()
'd'
>>> n = re.search(r"[^\d]", string)
>>> n.group()
'd'

>>> l = re.search(r"[^0-9]", string)
>>> l.group()
'd'
It can be seen from the above that "\D" matches non-digits. It matches any character other than the digits 0-9, which is equivalent to [^0-9] or [^\d].
3. \ s blank character [<space> \ r \ t \ n \ f \ v]

"\ S" matches any non-null characters, such as [<space> \ r \ t \ n \ f \ v]. The example is as follows:

>>> M = re. search ("\ s +", string) # \ s to match blank characters
>>> M. group ()
'\ T \ n \ x0c \ x0b \ R'
As shown above, \ s matches any blank characters, such as spaces, \ r, \ t, \ n, \ f, and \ v.

4. \ S non-blank characters \ S and \ s match exactly the opposite, is to match any non-empty characters. It is equivalent to [^ \ s] matching any non-null characters.

>>> String = "faMM records \ t \ n \ f \ v \ rDASDF"
>>> N = re. search ("\ S +", string)
>>> N. group ()
'Famm comment'
It can be seen from the above that "\ S" is a match for any non-null characters. If it encounters an empty character, it stops. It matches any non-null character and "." matches any character, except for line breaks.

>>> M = re. search (". +", string)
>>> M. group ()
'Famm success \ t'
It can be seen from the above that there are still differences between "\ S" and ".". One is any non-null character, and the other is any non-null character except "\ n.
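The difference between "\s", "\S" and "." can be checked side by side. This is a minimal sketch on an ASCII-only variant of the string used above; the variable names are chosen here for illustration.

```python
import re

text = "faMM\t\n\f\v\rDASDF"  # letters, then five different whitespace characters

whitespace_run = re.search(r"\s+", text).group()  # the whole whitespace run
first_word = re.search(r"\S+", text).group()      # stops at the first whitespace
dot_run = re.search(r".+", text).group()          # stops only at "\n"

print(repr(whitespace_run))  # '\t\n\x0c\x0b\r'
print(repr(first_word))      # 'faMM'
print(repr(dot_run))         # 'faMM\t'
```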

5. \ w characters [A-Z0-9a-z _]
\ W is a matching word character. Let's see if it can match Chinese characters or other characters:

>>> String = "faMM records \ t \ n \ f \ v \ rDASDF"

>>> M = re. search ("\ w +", string)
>>> M. group ()
'Famm comment' (1) script

>>> Format_string = "fdMM greater KMM length/Greater MM \ n \ tMMKDSI"
>>> M = re. search ("\ w +", format_string)
>>> M. group ()
'Fdmm greater KMM length' (2) script

We can see that "\ w" can match Chinese characters, but cannot match spaces or line breaks, but can match Chinese characters.

6. \ W is equivalent to a non-word character [^ \ w]

>>> Import re
>>> String = "naefda once LmKDS 1316547 \ n \ t \ r @ 3 $ & ^ $"
>>> M = re. search ("\ W +", string)

>>> M. group ()
''

We can see from the above that "\ W" matches out null, indicating that "" is not a word character, and "\ W" is a match of non-word characters.
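The heading above gives \w as [A-Za-z0-9_], yet the examples show it matching Chinese characters. The following sketch, on a made-up sample string, suggests how the two fit together: for str patterns Python 3 uses Unicode rules by default, and the standard re.ASCII flag (not mentioned in the text above) restricts \w to the ASCII set.

```python
import re

text = "abc_中文 42!"  # illustrative sample

unicode_words = re.findall(r"\w+", text)          # Unicode word characters (default)
ascii_words = re.findall(r"\w+", text, re.ASCII)  # only [A-Za-z0-9_]
non_words = re.findall(r"\W+", text)              # everything \w rejects

print(unicode_words)  # ['abc_中文', '42']
print(ascii_words)    # ['abc_', '42']
print(non_words)      # [' ', '!']
```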

Quantifiers (used after a character or)

1. "*" matches the first character 0 or infinitely before the first character

"*" Matches the first character 0 or unlimited times.

>>> Import re
>>> String = "naefda once LmKDS 1316547 \ n \ t \ r @ 3 $ & ^ $"
>>> M = re. search ("\ w *", string)
>>> M. group ()
'Naefda Zeng'
>>> N = re. search ("\ d *", string)
>>> N. group ()
''

>>> N = re. search ("plaintext", string)
>>> N
>>> N. group ()
Traceback (most recent call last ):
File "<stdin>", line 1, in <module>
AttributeError: 'nonetype 'object has no attribute 'group'

From the script code above, we can see that when the * character to be searched is used, it will match at the beginning, and it will not match in the middle. This lazy match method returns an error if it cannot be found in other cases.
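The positions involved make this behaviour easier to see. The sketch below uses a made-up sample string and the span() method of the match object to show where each match was found.

```python
import re

text = "abc 123"  # illustrative sample

# "\d*" may match zero digits, so it succeeds immediately at position 0.
star = re.search(r"\d*", text)
print(repr(star.group()), star.span())  # '' (0, 0)

# "\d+" needs at least one digit, so the search moves on to "123".
plus = re.search(r"\d+", text)
print(repr(plus.group()), plus.span())  # '123' (4, 7)

# No match anywhere: search() returns None, so test before calling group().
missing = re.search(r"xyz", text)
print(missing.group() if missing else "no match")  # no match
```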

2. "+" matches the first character once or infinitely (Remember, only match the first character)

>>> Import re
>>> String = "naefda once LmKDS 1316547 \ n \ t \ r @ 3 $ & ^ $"
>>> M = re. search ("\ w +", string)
>>> M. group ()
'Naefda Zeng'
>>> N = re. search ("\ s +", string)
>>> N. group ()
''
>>> D = re. search ("more", string)
>>> D. group ()
Traceback (most recent call last ):
File "<stdin>", line 1, in <module>
AttributeError: 'nonetype 'object has no attribute 'group'
From the above, we can see that "+" matches the previous character once or infinitely. If not, None is returned.

3 ."? "Match the first character 0 times or onceThe first character (Remember, only match the previous one)

>>> Import re
>>> String = "naefda once LmKDS 1316547 \ n \ t \ r @ 3 $ & ^ $"
>>> M = re. search ("\ w? ", String)
>>> M. group ()
'N'
>>> N = re. search ("f? ", String)
>>> N. group ()
''
We can see from the above ,? It is a greedy match .? Is to match the first character 0 times or 1 time. In the preceding example, a problem is found "? "And" * "are both matched from the beginning. If the start cannot match," "is returned. It is equivalent "? "And" * "are equivalent to match () from the beginning. If no match is found, no match is found. The difference is that match () returns None, and seach () returns "".
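A short comparison of search() and match() makes the point concrete. This is a sketch on a made-up sample string; the variable names are illustrative.

```python
import re

text = "naefda"  # illustrative sample

# An optional pattern can match zero characters, so both calls succeed at position 0.
search_optional = re.search(r"f?", text)
match_optional = re.match(r"f?", text)
print(repr(search_optional.group()), repr(match_optional.group()))  # '' ''

# A required "f": search() scans forward, match() only tries position 0.
search_required = re.search(r"f", text)
match_required = re.match(r"f", text)
print(search_required.span())  # (3, 4)
print(match_required)          # None
```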

4. {m} matches the preceding character exactly m times (remember: it applies only to the single preceding character)

>>> import re

>>> string = "dafMM\ngreater than 1134657 Qqcd m,l#!"
>>> m = re.search(r"\d{4}", string)  # match exactly four digits
>>> m.group()
'1134'

>>> n = re.search(r"\d{10}", string)
>>> n.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

It can be seen from the above that {m} matches the preceding character exactly m times, and that search() returns None when the match fails. To avoid demanding an exact count, a range of repetition counts can be given instead, as described next.

5. {m,n} matches the preceding character m to n times (remember: it applies only to the single preceding character)

Either m or n can be omitted. If m is omitted ({,n}), the character is matched 0 to n times. If n is omitted ({m,}), it is matched m times or more, without an upper limit. Do not put spaces inside the braces.

>>> import re

>>> string = "dafMM\ngreater than 1134657 Qqcd m,l#!"

>>> m = re.search(r"\d{1,10}", string)
>>> m.group()
'1134657'
>>> n = re.search(r"\d{,5}", string)
>>> n.group()
''
>>> d = re.search(r"\d{4,}", string)
>>> d.group()
'1134657'
We can see from the above that {m,n} matches a range of repetition counts. The example also exposes a pitfall: with search(), avoid quantifiers that allow zero repetitions. Once zero repetitions are allowed, the pattern matches the empty string "" at the very start of the string whenever the first character does not fit, and search() returns that empty match immediately.

*? +? ?? {m,n}? switch *, +, ?, and {m,n} to non-greedy mode.
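The counted quantifiers can be compared on one string. The sketch below uses a made-up sample; only the digit run matters.

```python
import re

text = "order 1134657 shipped"  # illustrative sample with a 7-digit run

exactly_four = re.search(r"\d{4}", text).group()   # first four digits only
one_to_ten = re.search(r"\d{1,10}", text).group()  # greedy: the whole run
four_or_more = re.search(r"\d{4,}", text).group()  # no upper limit
up_to_five = re.search(r"\d{,5}", text).group()    # zero allowed: empty match at position 0
too_many = re.search(r"\d{10}", text)              # no run of ten digits

print(exactly_four)      # 1134
print(one_to_ten)        # 1134657
print(four_or_more)      # 1134657
print(repr(up_to_five))  # ''
print(too_many)          # None
```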

Boundary match

1. ^ matches the start of the string. In multiline mode, it matches the beginning of each line.

>>> import re

>>> string = "dafMM\ngreater than 1134657 Qqcd m,l#!"

>>> m = re.search("^da", string)
>>> m.group()
'da'

Here the pattern is anchored at the beginning of the string, which has the same effect as using match().

2. $ matches the end of the string. In multiline mode, it matches the end of each line.

>>> import re

>>> string = "dafMM\ngreater than 1134657 Qqcd m,l#!"

>>> m = re.search("#!$", string)
>>> m.group()
'#!'
"$" matches the end of the string: whatever comes before, the match succeeds only if the text being looked for sits at the very end.

3. \A matches only the start of the string

>>> import re

>>> string = "dafMM\ngreater than 1134657 Qqcd m,l#!"

>>> m = re.search(r"\Adaf", string)
>>> m.group()
'daf'
"\A" matches only at the start of the string, never at the start of a later line.
4. \Z matches only the end of the string

>>> import re

>>> string = "dafMM\ngreater than 1134657 Qqcd m,l#!"

>>> n = re.search(r"l#!\Z", string)
>>> n.group()
'l#!'
"\Z" matches only at the end of the string. The difference between \Z and $ is that in multiline mode $ also matches at the end of each line, while \Z still matches only at the end of the string.

5. \b matches the position between a \w character and a \W character (a word boundary).

6. \B matches any position that is not a word boundary, that is, the opposite of \b.
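The six boundary matchers above can be compared in one sketch. The sample strings and variable names are made up for illustration; re.M is the multiline flag mentioned in items 1 and 2.

```python
import re

text = "first line\nsecond line"  # illustrative two-line sample

# ^ and $ anchor to the whole string by default, and to every line with re.M.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.M)
ends_multiline = re.findall(r"\w+$", text, re.M)

# \A and \Z ignore re.M and anchor only to the whole string.
starts_a = re.findall(r"\A\w+", text, re.M)
ends_z = re.findall(r"\w+\Z", text, re.M)

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(ends_multiline)    # ['line', 'line']
print(starts_a)          # ['first']
print(ends_z)            # ['line']

# \b is a word boundary, \B is any position that is not one.
words = "cat concat catalog"
whole_word = [m.start() for m in re.finditer(r"\bcat\b", words)]
word_start = [m.start() for m in re.finditer(r"\bcat", words)]
inside_word = [m.start() for m in re.finditer(r"\Bcat", words)]

print(whole_word)   # [0]
print(word_start)   # [0, 11]
print(inside_word)  # [7]
```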

Logical grouping

1. "|" matches either the expression on its left or the expression on its right.

It always tries the expression on the left first; once that succeeds, the expression on the right is skipped. If | is not enclosed in (), its scope is the entire regular expression.

>>> import re
>>> m = re.search("(?P<province>[0-9]{4})(?P<city>[0-9]{2})(?P<birthday>[0-9]{4})", "371481199306143242").groupdict()
>>> m
{'province': '3714', 'city': '81', 'birthday': '1993'}

Named groups, written (?P<name>...), are matched in sequence, and groupdict() returns a dictionary that maps each group name to the substring it matched.
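The two rules for "|" (left side first, scope limited by parentheses) can be sketched as follows. The sample words are made up, and re.fullmatch is used here only to make the scope difference visible.

```python
import re

# The left alternative is tried first, even when the right one would match more.
left_first = re.search(r"cat|category", "category").group()
print(left_first)  # cat

# Inside (...), the alternation covers only the group: "gr", then "a" or "e", then "y".
grouped = re.fullmatch(r"gr(a|e)y", "grey")
print(grouped is not None)  # True

# Without parentheses, "|" splits the whole expression into "gra" or "ey".
ungrouped = re.fullmatch(r"gra|ey", "grey")
print(ungrouped)  # None
```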

1.2. Greedy and non-Greedy modes of quantifiers

Regular expressions are usually used to search for matching strings in text. In Python, quantifiers are greedy by default (in a few languages they may be non-greedy by default) and always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible. For example, searching "abbbc" with the regular expression "ab*" finds "abbb", while the non-greedy form "ab*?" finds only "a".
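The "ab*" example can be run directly. The second pair of patterns, on a made-up tag-like string, is an added illustration of why the non-greedy form is often what is wanted.

```python
import re

greedy = re.search(r"ab*", "abbbc").group()
lazy = re.search(r"ab*?", "abbbc").group()
print(greedy)  # abbb
print(lazy)    # a

# Illustrative sample: match one bracketed item rather than everything up to the last ">".
greedy_tag = re.search(r"<.+>", "<a><b>").group()
lazy_tag = re.search(r"<.+?>", "<a><b>").group()
print(greedy_tag)  # <a><b>
print(lazy_tag)    # <a>
```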

1.3. The backslash problem

Like most programming languages, regular expressions use "\" as the escape character, which can lead to a pile-up of backslashes. To match a literal "\" in text, a regular expression written as an ordinary string literal needs four backslashes, "\\\\": the programming language first turns each pair into a single backslash, leaving two, and the regular expression engine then reads those two as one literal backslash. Python's raw (native) strings solve this problem neatly: the regular expression in this example can be written as r"\\". Similarly, "\\d", which matches a digit, can be written as r"\d". With raw strings you no longer have to worry about missing a backslash, and the expressions you write are easier to read.
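The equivalence between the two spellings can be checked directly. This is a minimal sketch; the sample text is made up.

```python
import re

text = "C:\\temp"  # the text itself contains one backslash: C:\temp

# Ordinary literal: four backslashes in source, two in the pattern, one literal backslash matched.
plain = re.search("\\\\", text)

# Raw literal: what is typed is what the regular expression engine sees.
raw = re.search(r"\\", text)

print(plain.span(), raw.span())  # (2, 3) (2, 3)
print("\\\\" == r"\\")           # True
print("\\d" == r"\d")            # True
```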

 

1.4. Matching Mode

Regular expressions provide several matching modes, such as case-insensitive and multiline matching. They are covered together with the Pattern class factory method re.compile(pattern[, flags]).

2. re Module
2.1. Start Using re
Python supports regular expressions through the re module. The general steps for using re are: first compile the string form of the regular expression into a Pattern instance, then use the Pattern instance to process the text and obtain the matching result (a Match instance), and finally use the Match instance to obtain information and perform other operations.

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'hello')

# Use the pattern to match the text and obtain the result; None is returned if the match fails
match = pattern.match('hello world!')

if match:
    # Use the Match object to obtain the group information
    print(match.group())

 

In the example above, the regular expression string is first compiled into a Pattern object, and the matching is then performed with that object.
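One reason to compile first is that the same Pattern object can be reused on many strings. The sketch below assumes a small made-up list of texts and also contrasts match(), which only tries the start of the string, with search().

```python
import re

pattern = re.compile(r"hello")
texts = ["hello world!", "say hello", "goodbye"]  # illustrative samples

# match() succeeds only when the pattern fits at the start of the text.
matched = []
for text in texts:
    m = pattern.match(text)
    matched.append(m.group() if m else None)

# search() scans the whole text.
found = [pattern.search(text) is not None for text in texts]

print(matched)  # ['hello', None, None]
print(found)    # [True, True, False]
```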

re.compile(strPattern[, flag]):

This method is the factory method of the Pattern class. It compiles a regular expression given as a string into a Pattern object. The second parameter, flag, is the matching mode. Several modes can take effect at once by combining them with the bitwise OR operator '|', for example re.I | re.M. The mode can also be specified inside the regex string itself: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').
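Optional values:

re.I (re.IGNORECASE): case-insensitive matching
re.M (re.MULTILINE): multiline mode, changes the behaviour of ^ and $
re.S (re.DOTALL): makes "." match any character, including a line break
re.L (re.LOCALE): makes the predefined character classes depend on the current locale
re.U (re.UNICODE): makes the predefined character classes follow Unicode character properties
re.X (re.VERBOSE): verbose mode, allows whitespace and comments inside the pattern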

The re module also provides the function escape(string), which returns the string with an escape character added before regular expression metacharacters such as *, +, and ?. This is useful when the text you need to match contains many metacharacters.
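Both points can be sketched briefly: combining flags in re.compile() versus writing them inline, and using re.escape() on text that contains metacharacters. The sample strings are made up, and the exact string returned by re.escape() differs between Python versions, so the sketch only checks that the escaped pattern matches the original text.

```python
import re

text = "x\nPATTERN one\npattern two"  # illustrative sample

# Flags passed as an argument and flags written inline behave the same.
with_flags = re.compile(r"^pattern", re.I | re.M)
inline = re.compile(r"(?im)^pattern")

print(with_flags.findall(text))  # ['PATTERN', 'pattern']
print(inline.findall(text))      # ['PATTERN', 'pattern']

# re.escape() makes every metacharacter in the text literal.
literal = "1+1=2?"
escaped = re.escape(literal)
escaped_ok = re.fullmatch(escaped, literal) is not None
unescaped_ok = re.fullmatch(literal, literal) is not None

print(escaped_ok)    # True
print(unescaped_ok)  # False: "+" and "?" act as quantifiers
```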

2.2. Match
>>> string = "aafaaMMaaaa"
>>> m = re.search("aa?", string)
>>> m.group()
'aa'
>>> n = re.search("aaa?", string)
>>> n.group()
'aa'
>>> d = re.search("aaaa?", string)
>>> d.group()
'aaaa'
In the code above, "?" applies only to the single character immediately before it in the regular expression and matches that character zero or one time.
