Python Regular Expressions: Learning Summary and Reference

Source: Internet
Author: User

Source: Michael_ Xiang _

Summary

    • In a regular expression, a character written literally matches exactly that character.

    • {m,n}? repeats the preceding character m to n times, taking as few repetitions as possible. In the string 'aaaaaa', a{2,4} matches 4 a's, but a{2,4}? matches only 2.
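The difference between the greedy and non-greedy forms can be checked directly. This is a minimal sketch using the 'aaaaaa' string from the bullet above; the variable names are only illustrative.

```python
import re

text = 'aaaaaa'

# Greedy: a{2,4} takes as many 'a' characters as allowed (up to 4).
greedy = re.match(r'a{2,4}', text).group()

# Non-greedy: a{2,4}? stops once the minimum of 2 is reached.
lazy = re.match(r'a{2,4}?', text).group()

print(greedy)  # aaaa
print(lazy)    # aa
```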

^ matches the beginning of a line, so ^\d means the line must begin with a digit.

$ matches the end of a line, so \d$ means the line must end with a digit.

You may have noticed that py also matches inside 'python' (it matches the leading 'py').

With ^py$, however, the pattern has to match the whole line: it matches only 'py', and trying it against 'python' finds nothing.
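A short sketch of the anchor behaviour described above. The sample strings '3 apples' and 'room 7' are made up for illustration.

```python
import re

# Without anchors, 'py' is found at the start of 'python'.
inside = re.search(r'py', 'python').group()

# With ^ and $, the pattern has to cover the whole line.
whole_ok = re.search(r'^py$', 'py') is not None
whole_fail = re.search(r'^py$', 'python') is None

# ^\d requires a leading digit; \d$ requires a trailing digit.
starts_with_digit = re.search(r'^\d', '3 apples') is not None
ends_with_digit = re.search(r'\d$', 'room 7') is not None

print(inside, whole_ok, whole_fail, starts_with_digit, ends_with_digit)
# py True True True True
```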

Reference table

Special sequences in regular expressions

^                     Matches the start of the line
$                     Matches the end of the line
.                     Matches any single character except a newline; with the re.S flag it matches a newline as well
[...]                 Matches any single character listed in the brackets (an "or" of the characters)
[^...]                Matches any single character not listed in the brackets
*                     Matches 0 or more occurrences of the preceding expression
+                     Matches 1 or more occurrences of the preceding expression
?                     Matches 0 or 1 occurrence of the preceding expression
{n}                   Matches exactly n occurrences of the preceding expression
{n,m}                 Matches at least n and at most m occurrences of the preceding expression
a|b                   Matches a or b
*?, +?, ??, {m,n}?    Non-greedy versions of *, +, ?, {m,n}
(re)                  Groups the regular expression and remembers the matched text
(?imx)                Turns on the i, m, or x option within the regular expression; in the scoped form (?imx:re) only that group is affected
(?:re)                Groups the regular expression without remembering the matched text
(?#...)               Comment
(?=re)                Asserts that the pattern matches at this position, without consuming any characters
(?!re)                Asserts that the pattern does not match at this position, without consuming any characters
(?<n1>..)             Match
\d                    A digit, equivalent to [0-9]
\D                    A non-digit, equivalent to [^0-9] or [^\d]
\s                    A whitespace character
\S                    A non-whitespace character
\w                    A word character: letter, digit, or underscore
\W                    A non-word character: anything other than a letter, digit, or underscore
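A few of the less obvious entries in the table can be tried out as follows. This is only a sketch; the sample strings ('abab7', 'foobar foobaz') are invented for the example.

```python
import re

# (?:re) groups without remembering: only the (\d) group is captured.
m = re.match(r'(?:ab)+(\d)', 'abab7')
captured = m.groups()

# (?=re) lookahead: match 'foo' only when 'bar' follows; 'bar' is not consumed.
ahead = re.search(r'foo(?=bar)', 'foobar').group()

# (?!re) negative lookahead: match 'foo' only when 'bar' does NOT follow.
not_ahead = re.search(r'foo(?!bar)', 'foobar foobaz')

print(captured)           # ('7',)
print(ahead)              # foo
print(not_ahead.span())   # (7, 10)
```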

The re module

re.compile(pattern[, flags])

Compiles a regular expression pattern, together with its flags, into a regular expression object whose match() and search() methods can then be used.

The flags defined by re include:

re.I    Ignore case

re.L    Makes the special character classes \w, \W, \b, \B, \s, \S depend on the current locale

re.M    Multi-line mode

re.S    Makes '.' match any character, including a newline (without this flag, '.' does not match a newline)

re.U    Makes the special character classes \w, \W, \b, \B, \d, \D, \s, \S depend on the Unicode character properties database

re.X    Verbose mode: to improve readability, whitespace in the pattern and comments after '#' are ignored
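The effect of the most commonly used flags can be seen in a small experiment. The sketch below uses made-up sample text; the name phone is only an illustration.

```python
import re

# re.I: case is ignored.
ignore_case = re.match(r'python', 'PYTHON', re.I) is not None

# re.M: ^ and $ also match at the start and end of each line.
lines = 'one\ntwo\nthree'
per_line = re.findall(r'^\w+$', lines, re.M)
whole_only = re.findall(r'^\w+$', lines)

# re.S: '.' also matches a newline.
dot_plain = re.match(r'a.b', 'a\nb') is None
dot_all = re.match(r'a.b', 'a\nb', re.S) is not None

# re.X: whitespace and '#' comments inside the pattern are ignored.
phone = re.compile(r'''
    (\d{3})     # area code
    -
    (\d{3,8})   # local number
''', re.X)
parts = phone.match('010-12345').groups()

print(per_line)    # ['one', 'two', 'three']
print(whole_only)  # []
print(parts)       # ('010', '12345')
```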

The following two usages give the same result:

A

compiled_pattern = re.compile(pattern)
result = compiled_pattern.match(string)

B

result = re.match(pattern, string)
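To make the comparison concrete, here is a runnable sketch of usages A and B. The pattern and string values are placeholders chosen for the example.

```python
import re

pattern = r'\d+'        # placeholder pattern
string = '42 apples'    # placeholder input

# A: compile once, then call the method on the compiled object.
compiled_pattern = re.compile(pattern)
result_a = compiled_pattern.match(string)

# B: call the module-level function directly.
result_b = re.match(pattern, string)

same = (result_a.group() == result_b.group()
        and result_a.span() == result_b.span())
print(result_a.group(), same)  # 42 True
```

Compiling first is mainly worthwhile when the same pattern is reused many times.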

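Python string literals also use \ as the escape character, so a backslash in a pattern has to be doubled in an ordinary string:

s = 'abc\\-001'  # Python string
# The corresponding regular expression string becomes:
# 'abc\-001'

Therefore, we strongly recommend using Python's r prefix (raw strings), so that escaping is not a concern.

s = r'abc\-001'  # Python string
# The corresponding regular expression string is unchanged:
# 'abc\-001'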
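The equivalence of the two spellings can be verified directly. A minimal sketch, in which the test input 'abc-001' is an assumption for illustration:

```python
import re

plain = 'abc\\-001'   # ordinary literal: the backslash is doubled
raw = r'abc\-001'     # raw literal: written exactly as the regex reads

same_string = plain == raw   # both hold the 8 characters abc\-001
matched = re.match(raw, 'abc-001').group()

print(same_string)  # True
print(len(raw))     # 8
print(matched)      # abc-001
```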

Search

re.search(pattern, string[, flags])

Scans the string for the first position where the regular expression pattern matches and returns a MatchObject instance, or None if no position matches.

Compiled regular expression objects (re.RegexObject) have the following search method:

search(string[, pos[, endpos]])

If regex is a compiled regular expression object, regex.search(string, 0, 50) is equivalent to regex.search(string[:50], 0).

>>> pattern = re.compile("a")
>>> pattern.search("abcde")     # Match at index 0
>>> pattern.search("abcde", 1)  # No match; the search starts at index 1, past the "a"
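The pos and endpos arguments can be explored with a small sketch like the one below. The 61-character test string is constructed only for this illustration.

```python
import re

regex = re.compile(r'\d')
string = 'x' * 60 + '7'   # the only digit sits at index 60

# endpos=50 limits the search as if the string were 50 characters long.
with_endpos = regex.search(string, 0, 50)
with_slice = regex.search(string[:50], 0)

# pos moves the starting point; spans still refer to the original string.
m = regex.search('a1b2', 2)

print(with_endpos, with_slice)  # None None
print(m.group(), m.span())      # 2 (3, 4)
```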

Match

re.match(pattern, string[, flags])

Determines whether the pattern matches at the beginning of the string. For RegexObject, there is the method:

match(string[, pos[, endpos]])

The match() function tries to match the regular expression only at the beginning of the string, that is, it reports only a match that starts at position 0, whereas search() scans the whole string for a match. To look for a match anywhere in the string, use search().

>>> pattern.match('bca', 2).group()
'a'

Although match() starts at the beginning of the string by default, it can still succeed when a starting position is given. The match is then anchored at that position and fails if the pattern does not match exactly there, which is where it differs from search().
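The contrast between match() and search(), with and without a starting position, can be sketched as follows, using the same "a" pattern and the 'bca' string:

```python
import re

pattern = re.compile('a')

at_start = pattern.match('bca')         # None: 'bca' does not begin with 'a'
anywhere = pattern.search('bca')        # found by scanning forward
at_pos_2 = pattern.match('bca', 2)      # anchored at index 2, where 'a' is
at_pos_1 = pattern.match('bca', 1)      # None: index 1 holds 'c'
scan_from_1 = pattern.search('bca', 1)  # scans on from index 1 and finds 'a'

print(at_start, at_pos_1)                  # None None
print(anywhere.span(), at_pos_2.group())   # (2, 3) a
print(scan_from_1.span())                  # (2, 3)
```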

The match() method tests whether the pattern matches: it returns a match object if the match succeeds, and None otherwise.

test = 'user-entered string'
if re.match(r'regular expression', test):
    print('ok')
else:
    print('failed')

Split

re.split(pattern, string[, maxsplit=0, flags=0])

Splits the string at every part that matches the regular expression and returns a list. For RegexObject, there is the method:

split(string[, maxsplit=0])

split() leaves the string unsplit if no match is found.

>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']

The split method that comes with strings is not flexible: it cannot treat a run of consecutive spaces as a single separator.

>>> re.split(r'\s+', 'a b   c')
['a', 'b', 'c']

See the difference? Very powerful!

One more, the ultimate version:

>>> re.split(r'[\s\,\;]+', 'a,b;; c  d')
['a', 'b', 'c', 'd']

The regular expression r'[\s\,\;]+' means that a run of one or more whitespace characters, commas, or semicolons acts as the separator, so the final result is still very clean.
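The maxsplit argument and the no-match case are not shown in the examples above, so here is a small sketch of both. Inside a character class the comma and semicolon do not need a backslash, so the pattern is written as [\s,;]+ here.

```python
import re

text = 'a,b;; c  d'

# maxsplit=2: stop after two splits; the remainder is returned untouched.
limited = re.split(r'[\s,;]+', text, maxsplit=2)

# No match anywhere: the whole string comes back as a one-element list.
unsplit = re.split(r'\d', 'abc')

print(limited)  # ['a', 'b', 'c  d']
print(unsplit)  # ['abc']
```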

Findall

re.findall(pattern, string[, flags])

Finds all substrings of the string that match the regular expression and returns them as a list. Likewise, for RegexObject there is:

findall(string[, pos[, endpos]])

# Get every occurrence of 'hh' and return them as a list
>>> pattern = re.compile(r'hh')
>>> pattern.findall('hhmichaelhh')
['hh', 'hh']

Finditer

re.finditer(pattern, string[, flags])

Similar to findall: finds all substrings of the string that match the regular expression, but returns them as an iterator of match objects. Likewise, for RegexObject there is:

finditer(string[, pos[, endpos]])
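As a sketch of how finditer might be used, the loop below collects the position and text of each match. It assumes the 'hh' pattern and the 'hhmichaelhh' string; the name found is illustrative.

```python
import re

pattern = re.compile(r'hh')

# Each item produced by finditer is a match object, so positions are available.
found = [(m.start(), m.end(), m.group())
         for m in pattern.finditer('hhmichaelhh')]

print(found)  # [(0, 2, 'hh'), (9, 11, 'hh')]
```

Unlike findall, which returns only the matched text, this gives access to start() and end() for each match.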

Sub

re.sub(pattern, repl, string[, count, flags])

Finds every substring of string that matches the regular expression pattern and replaces it with repl. If nothing matches, the string is returned unchanged. repl can be either a string or a function.

The return value is the new string after replacement.

For RegexObject there is:

sub(repl, string[, count=0])

>>> pattern = re.compile(r'\d')
>>> pattern.sub('no', '12hh34hh')
'nonohhnonohh'
>>> pattern.sub('no', '12hh34hh', 0)
'nonohhnonohh'
>>> pattern.sub('no', '12hh34hh', count=0)
'nonohhnonohh'
>>> pattern.sub('no', '12hh34hh', 1)
'no2hh34hh'

As the examples above show, count is optional; its default value is 0, which means every match is replaced.
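Since repl may also be a function, here is a minimal sketch of that form. The helper name double_digit is hypothetical; it receives each match object and returns the replacement text.

```python
import re

pattern = re.compile(r'\d')

def double_digit(match):
    """Return the matched digit multiplied by two, as a string."""
    return str(int(match.group()) * 2)

replaced = pattern.sub(double_digit, '12hh34hh')
print(replaced)  # 24hh68hh
```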

Subn

re.subn(pattern, repl, string[, count, flags])

Works like sub(), but returns a tuple of the new string and the number of substitutions made. Likewise, for RegexObject there is:

subn(repl, string[, count=0])

>>> pattern.subn('no', '12hh34hh', count=0)
('nonohhnonohh', 4)

Group

Besides simply testing whether a string matches, regular expressions can also extract substrings. Each group to be extracted is marked with (). For example:

^(\d{3})-(\d{3,8})$ defines two groups, which extract the area code and the local number directly from the matched string:

>>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
>>> m
<_sre.SRE_Match object; span=(0, 9), match='010-12345'>
>>> m.group(0)
'010-12345'
>>> m.group(1)
'010'
>>> m.group(2)
'12345'
>>> m.groups()
('010', '12345')

Experiment shows that if the pattern contains no parentheses, the resulting match object still supports a.group(0) or a.group(), but calling a.group(1) raises an error.
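The experiment can be reproduced with a sketch like the following, which uses the same phone-number pattern with the parentheses removed. The variable name a follows the text; error_type is illustrative.

```python
import re

# Same pattern as before, but without any capturing parentheses.
a = re.match(r'^\d{3}-\d{3,8}$', '010-12345')

whole = a.group()        # same as a.group(0): the entire match
no_groups = a.groups()   # empty tuple, since nothing was captured

try:
    a.group(1)
    error_type = None
except IndexError as exc:
    # Asking for a group that the pattern does not define raises IndexError.
    error_type = type(exc).__name__

print(whole)       # 010-12345
print(no_groups)   # ()
print(error_type)  # IndexError
```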

Greedy match

Regular expression matching is greedy by default, that is, it matches as many characters as possible. For example, to capture the zeros at the end of a number:

>>> re.match(r'^(\d+)(0*)$', '102300').groups()
('102300', '')

Because \d+ is greedy, it consumes all of the trailing zeros, leaving 0* nothing to match but the empty string.

For 0* to match the trailing zeros, \d+ must be non-greedy (match as few characters as possible). Appending ? makes \d+ non-greedy:

>>> re.match(r'^(\d+?)(0*)$', '102300').groups()
('1023', '00')
