"Python" Regular expression

Source: Internet
Author: User

Note: This article is based mainly on the regular expression tutorial on Liao Xuefeng's site, with a few changes made as needed. It is recorded here for later review.

0x01 Concepts and symbol meanings of regular expressions

To master regular expressions, you only need to remember what each symbol means and be able to generalize correctly the pattern (or rule) that the target strings follow.

1. Basic content

Character matching

    • In a regular expression, a character given literally matches exactly that character.
    • \d matches a digit
    • \D matches a non-digit
    • \w matches a letter, digit, or underscore _
    • \W matches any non-word character, equivalent to [^A-Za-z0-9_]
    • \s matches any whitespace character, including spaces, tabs, form feeds, and so on, equivalent to [ \f\n\r\t\v]
    • \S matches any non-whitespace character
    • \n matches a line feed (newline)
    • \r matches a carriage return
    • \t matches a tab
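
The shorthand classes above can be checked directly with Python's re module. The following is a minimal sketch; the sample characters and the `samples` dictionary are illustrative choices, not part of the original tutorial. It also shows that, for Python 3 str patterns, \d is Unicode-aware unless the re.ASCII flag is given.

```python
import re

# Map each shorthand class to sample characters it should accept.
samples = {
    r'\d': ['0', '7'],
    r'\D': ['x', ' '],
    r'\w': ['a', 'Z', '5', '_'],
    r'\W': ['-', ' '],
    r'\s': [' ', '\t', '\n'],
    r'\S': ['a', '9'],
}

# re.fullmatch succeeds only when the whole string matches the pattern.
results = {
    pattern: all(re.fullmatch(pattern, ch) for ch in chars)
    for pattern, chars in samples.items()
}
print(results)

# For str patterns in Python 3, \d and \w also accept non-ASCII digits and
# letters; the re.ASCII flag restricts them to the ASCII ranges.
arabic_three = '\u0663'  # ARABIC-INDIC DIGIT THREE
print(bool(re.fullmatch(r'\d', arabic_three)))            # True
print(bool(re.fullmatch(r'\d', arabic_three, re.ASCII)))  # False
```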

Quantity Matching

    • . matches any single character except "\n"
    • * matches the preceding subexpression zero or more times
    • + matches the preceding subexpression one or more times
    • ? matches the preceding subexpression zero or one time
    • {n}, where n is a non-negative integer, matches exactly n times
    • {n,m}, where n and m are non-negative integers with n<=m, matches at least n times and at most m times
    • {n,}, where n is a non-negative integer, matches at least n times
    • {,m} matches the preceding subexpression at most m times
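
As a rough illustration of the quantifiers listed above, the sketch below tests a few small patterns against whole strings. The helper name `full` is hypothetical and exists only for this example.

```python
import re

def full(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(full(r'ab*', 'a'))          # True: * allows zero 'b'
print(full(r'ab+', 'a'))          # False: + needs at least one 'b'
print(full(r'ab?', 'abb'))        # False: ? allows at most one 'b'
print(full(r'\d{3}', '123'))      # True: exactly three digits
print(full(r'\d{2,4}', '12345'))  # False: five digits exceed the maximum
print(full(r'\d{2,}', '12345'))   # True: at least two digits
print(full(r'\d{,2}', ''))        # True: {,m} means zero to m times
print(full(r'a.c', 'a\nc'))       # False: . does not match a newline by default
```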

Range Matching

    • x|y matches x or y
    • [xyz] character set, matches any one of the listed characters
    • [^xyz] negated character set, matches any one character that is not listed
    • [a-z] character range, matches any one character within the specified range
    • [^a-z] negated character range, matches any one character that is not in the specified range
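
The alternation and character-set forms above can be tried out with re.findall. This is a small sketch with made-up input strings. The last two lines show a common pitfall: a range written as [A-z] covers ASCII codes 65 to 122 and therefore also admits some punctuation.

```python
import re

print(re.findall(r'cat|dog', 'cat, dog, bird'))  # ['cat', 'dog']
print(re.findall(r'[xyz]', 'oxygen zone'))       # ['x', 'y', 'z']
print(re.findall(r'[^xyz]', 'xay'))              # ['a']
print(re.findall(r'[a-z]', 'aB3c'))              # ['a', 'c']
print(re.findall(r'[^a-z]', 'aB3c'))             # ['B', '3']

# Pitfall: [A-z] spans ASCII codes 65..122, which includes the six
# punctuation characters between 'Z' and 'a'.
print(re.findall(r'[A-z]', 'a_Z^1'))             # ['a', '_', 'Z', '^']
print(re.findall(r'[A-Za-z]', 'a_Z^1'))          # ['a', 'Z']
```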

Now look at a slightly more complicated example: \d{3,4}\s+\d{3,8}
Let's read it from left to right:

    1. \d{3,4} matches 3 to 4 digits, e.g. '010', '0755';
    2. \s matches a space (or a tab or other whitespace), so \s+ means at least one whitespace character, e.g. ' ', '  ', and so on;
    3. \d{3,8} matches 3 to 8 digits, e.g. '1234567'.

Taken together, this regular expression matches a telephone number whose area code is separated from the local number by any amount of whitespace.
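
A quick way to see this pattern at work is to run it against a few candidate strings. The inputs below are illustrative; re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

pattern = r'\d{3,4}\s+\d{3,8}'

candidates = ['010 12345', '0755\t26776666', '010   1234567', '01012345', '01 12345']
results = {text: re.fullmatch(pattern, text) is not None for text in candidates}

for text, ok in results.items():
    print(repr(text), ok)
# The first three match; '01012345' has no whitespace and '01 12345'
# has an area code that is too short.
```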


What if we want to match numbers like '010-12345' and '0755-26776666'?
The original tutorial escapes '-' with '\', so the pattern becomes \d{3,4}\-\d{3,8}. (Outside a character class [] the hyphen has no special meaning, so the escape is optional but harmless.)


However, this still cannot match '010 - 12345', because of the spaces around the hyphen. So we need more flexible ways of matching.
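
One possible way to tolerate spaces around the hyphen, which is not given in the original text, is to put \s* on both sides of it. The sketch below assumes that any amount of whitespace around the hyphen is acceptable; the name `loose_phone` is hypothetical.

```python
import re

# Hypothetical pattern: \s* permits zero or more whitespace characters
# on either side of the hyphen.
loose_phone = re.compile(r'\d{3,4}\s*-\s*\d{3,8}')

for text in ['010-12345', '010 - 12345', '0755 -26776666', '010 12345']:
    print(repr(text), bool(loose_phone.fullmatch(text)))

# Escaping the hyphen outside [] makes no difference to the result.
print(bool(re.fullmatch(r'\d{3,4}\-\d{3,8}', '010-12345')))  # True
print(bool(re.fullmatch(r'\d{3,4}-\d{3,8}', '010-12345')))   # True
```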

2. Advanced Content

To make a more accurate match, you can use [] to represent a range, such as:

    • [0-9a-zA-Z\_] matches a single digit, letter, or underscore;
    • [0-9a-zA-Z\_]+ matches a string of at least one digit, letter, or underscore, such as 'a100', '0_Z', 'Py3000', etc.;
    • [a-zA-Z\_][0-9a-zA-Z\_]* matches a string that starts with a letter or underscore, followed by any number of digits, letters, or underscores, which is the form of a valid Python variable name;
    • [a-zA-Z\_][0-9a-zA-Z\_]{0,19} restricts the length of the variable name more precisely to 1-20 characters (1 leading character + up to 19 following characters).
    • A|B matches A or B, so (P|p)ython matches 'Python' or 'python'.
    • ^ marks the beginning of a line, so ^\d means the line must begin with a digit.
    • $ marks the end of a line, so \d$ means the line must end with a digit.
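
The variable-name pattern from the list above can be exercised as follows. This is only a sketch of the ASCII rule described here; real Python identifiers also allow many Unicode letters and have no 20-character limit, so str.isidentifier() is the more faithful check in practice. The sample names are made up.

```python
import re

# ASCII-only naming rule from the list above, limited to 20 characters.
identifier = re.compile(r'[a-zA-Z_][0-9a-zA-Z_]{0,19}')

names = ['x', '_private', 'Py3000', '3rd', 'has-dash', 'a' * 21]
valid = {name: identifier.fullmatch(name) is not None for name in names}
for name, ok in valid.items():
    print(name, ok)

# Alternation and the line anchors from the same list.
print(bool(re.fullmatch(r'(P|p)ython', 'Python')))  # True
print(bool(re.search(r'^\d', '3 apples')))          # True: starts with a digit
print(bool(re.search(r'\d$', 'room 101')))          # True: ends with a digit
```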

You may have noticed that py can also match 'python', but ^py$ turns it into a whole-line match that matches only 'py'.
Likewise, with ^py the string 'I love python' cannot be matched, because it does not start with py.
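
The effect of the anchors can be seen by comparing a few searches. This is a minimal sketch using re.search, which looks for the pattern anywhere in the string unless an anchor restricts it.

```python
import re

print(bool(re.search(r'py', 'I love python')))   # True: 'py' occurs inside the string
print(bool(re.search(r'^py', 'I love python')))  # False: the string does not start with 'py'
print(bool(re.search(r'^py', 'python')))         # True
print(bool(re.search(r'^py$', 'python')))        # False: extra characters after 'py'
print(bool(re.search(r'^py$', 'py')))            # True: the whole line is 'py'
```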

3. Regular expression usage scenarios

    • Determining whether a string matches a specific pattern
    • Splitting a string
    • Extracting substrings that match a particular pattern
    • Replacing substrings that match a specified pattern

0x02 Python's regular expression module re

1. Determine whether a string matches a particular pattern

Example: area code + phone number

    # Import the re module
    import re

    # Matches
    result = re.match(r'\d{3,4}\-\d{3,8}$', '020-12345')
    print(result)

    # Does not match
    result2 = re.match(r'\d{3,4}\-\d{3,8}$', '020 12345')
    print(result2)

    # match() returns a Match object if the match succeeds; otherwise it returns None.
    # The usual way to test a string is:
    test = '020-12345'
    if re.match(r'\d{3,4}\-\d{3,8}$', test):
        print('Match')
    else:
        print('Not match')

Small exercise: determine whether a given email address is a Netec mailbox

    • Assume that the Netec company's mailbox format is surname + . + given name + optional number, followed by @netec.com.cn.
    • The number is optional and appears only when several employees share the same name; the name, in pinyin or English, uses lowercase letters only.

    import re

    email = '[email protected]'  # the sample address is obscured in the source text
    # The dots in the domain are escaped so that they match only a literal '.'
    pattern = r'^[a-z]{1,}\.[a-z]+\d*@netec\.com\.cn$'
    if re.match(pattern, email):
        print("It's a Netec mailbox.")
    else:
        print('Not a Netec mailbox')
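
To try the exercise end to end, the sketch below wraps the check in a function and runs it on a few made-up addresses. The names `zhang.san` and so on are hypothetical test data, and the dots in the domain are escaped here so that they match only a literal '.'.

```python
import re

# surname + '.' + given name + optional digits + fixed domain
NETEC_PATTERN = re.compile(r'^[a-z]+\.[a-z]+\d*@netec\.com\.cn$')

def is_netec_mailbox(email):
    """Return True if email follows the assumed Netec naming format."""
    return NETEC_PATTERN.match(email) is not None

# Hypothetical addresses used only for illustration.
candidates = [
    'zhang.san@netec.com.cn',   # plain name
    'zhang.san2@netec.com.cn',  # name with a disambiguating number
    'Zhang.San@netec.com.cn',   # uppercase letters are not allowed
    'zhangsan@netec.com.cn',    # missing dot between the names
    'zhang.san@netecXcom.cn',   # wrong domain
]
for email in candidates:
    print(email, is_netec_mailbox(email))
```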
2. Splitting a string

Splitting a string with a regular expression is more flexible than splitting on a fixed character. First, look at ordinary splitting code:

    result3 = 'a b   c'.split(' ')
    print(result3)

Plain split() cannot treat consecutive spaces as one separator, so the result contains empty strings. Try a regular expression instead:

    result4 = re.split(r'\s+', 'a b   c')
    print(result4)

No matter how many spaces there are, the string is split correctly. Now add "," and try:

    result5 = re.split(r'[\s\,]+', 'a,b, c  d')
    print(result5)

Then add ";" and try:

    result6 = re.split(r'[\s\,\;]+', 'a,b;; c  d')
    print(result6)
3. Extracting strings for specific patterns

Besides simply testing whether a string matches, regular expressions can also extract substrings.
Each group to be extracted is enclosed in (). For example: ^(\d{3,4})-(\d{3,8})$
This defines two groups, so the area code and the local number can be extracted directly from the matched string:

    m = re.match(r'^(\d{3,4})-(\d{3,8})$', '0755-12345')
    print(m)
    print(m.group(0))  # the entire matched string
    print(m.group(1))  # contents of the first parentheses, i.e. the first group
    print(m.group(2))  # contents of the second parentheses, i.e. the second group

A more complex example extracts the hours, minutes, and seconds from a given time string:

    t = '19:05:30'
    m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
    print(m.groups())
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))

In fact, the same pattern can be written more simply:

    t = '19:05:30'
    m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:([0-5]?[0-9])\:([0-5]?[0-9])$', t)
    print(m.groups())
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))
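
The shorter time pattern can be wrapped in a small helper to see which inputs it accepts and rejects. This is a sketch; the function name `parse_time` and the sample inputs are assumptions made for illustration.

```python
import re

TIME = re.compile(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9]):([0-5]?[0-9]):([0-5]?[0-9])$')

def parse_time(text):
    """Return (hour, minute, second) as integers, or None if text is not a valid time."""
    m = TIME.match(text)
    if m is None:
        return None
    return tuple(int(part) for part in m.groups())

print(parse_time('19:05:30'))  # (19, 5, 30)
print(parse_time('7:5:9'))     # (7, 5, 9)
print(parse_time('24:00:00'))  # None: hour out of range
print(parse_time('19:60:00'))  # None: minute out of range
```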
4. Replace the string with the specified pattern

    result = re.sub('[ae]', 'X', 'abcdefghi')
    print(result)
    result = re.subn('[ae]', 'X', 'abcdef')
    print(result)
5. Greedy match vs non-greedy match

Regular expression matching is greedy by default, which means it matches as many characters as possible. For example, try to capture the trailing zeros of a number:

    result7 = re.match(r'^(\d+)(0*)$', '102300').groups()
    print(result7)

Because \d+ is greedy, it consumes all the trailing zeros as well, so 0* can only match the empty string.
To let 0* capture the trailing zeros, \d+ must use non-greedy matching (that is, match as few characters as possible).
Appending a ? makes \d+ non-greedy:

    result8 = re.match(r'^(\d+?)(0*)$', '102300').groups()
    print(result8)
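
The difference between greedy and non-greedy matching is also easy to see on a string with several delimited parts. The sketch below uses an illustrative HTML-like string and then repeats the trailing-zeros example with its results.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the string.
print(re.findall(r'<.+>', html))   # ['<b>bold</b> and <i>italic</i>']
# Non-greedy: .+? stops at the first '>' it can.
print(re.findall(r'<.+?>', html))  # ['<b>', '</b>', '<i>', '</i>']

print(re.match(r'^(\d+)(0*)$', '102300').groups())   # ('102300', '')
print(re.match(r'^(\d+?)(0*)$', '102300').groups())  # ('1023', '00')
```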
6. Compiling regular expressions

When we use a regular expression in Python, the re module does two things internally:

    1. It compiles the regular expression; if the pattern string itself is invalid, an error is raised;
    2. It uses the compiled regular expression to match the string.

If a regular expression is going to be reused thousands of times, we can precompile it for efficiency.
Later uses then skip the compile step and match directly:

    # Compile
    re_telephone = re.compile(r'^(\d{3,4})-(\d{3,8})$')
    # Use directly
    print(re_telephone.match('010-12345').groups())
    # Use directly
    print(re_telephone.match('010-8086').groups())

Compilation produces a regular expression object. Because the object already contains the pattern,
its methods are called without passing the pattern string again.
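
Both steps described above can be observed separately. The following sketch compiles a valid pattern once and reuses it, then shows that an invalid pattern fails at compile time with re.error. The malformed pattern `(\d+` is just an illustrative example.

```python
import re

RE_TELEPHONE = re.compile(r'^(\d{3,4})-(\d{3,8})$')

# The compiled object remembers its pattern, so it is not passed again.
print(RE_TELEPHONE.pattern)
print(RE_TELEPHONE.match('010-12345').groups())  # ('010', '12345')
print(RE_TELEPHONE.match('010-8086').groups())   # ('010', '8086')

# An invalid pattern is rejected during compilation, before any matching.
try:
    re.compile(r'(\d+')  # unbalanced parenthesis
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print('invalid pattern:', exc)
```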

7. Commonly used functions in the re module

(1) compile()

compile() compiles a regular expression pattern and returns a pattern object, so a pattern can be compiled once and used many times in a program.

    import re

    text = "Tina is a good girl, she's cool, clever, and so on ..."
    rr = re.compile(r'\w*oo\w*')
    print(rr.findall(text))  # find all words containing 'oo'

(2) match()

match() checks whether the pattern matches at the beginning of the string. Note: this is not a whole-string match.

If the string still has characters left after the pattern ends, the match is still considered successful. If you want a whole-string match, add the boundary '$' at the end of the expression.

    print(re.match('com', 'comwww.runcomoob').group())
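
The point that match() only anchors the start of the string can be checked with a few calls. This sketch also uses re.fullmatch, a standard re function not mentioned in the text above, as an alternative to appending '$'.

```python
import re

text = 'comwww.runcomoob'

print(re.match('com', text).group())  # 'com': trailing characters are ignored
print(re.match('com$', text))         # None: '$' requires the string to end after 'com'
print(re.fullmatch('com', text))      # None: fullmatch requires the whole string to match
print(re.match('www', text))          # None: 'www' is not at the beginning
```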

(3) search()

re.search() scans the string for the pattern and returns as soon as the first match is found; if nothing in the string matches, it returns None.

    print(re.search(r'\dcom', 'www.4comrunoob.5com').group())
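
Because search() returns None when nothing matches, calling .group() on the result directly can raise AttributeError. A minimal sketch of a safer pattern follows; the helper name `first_match` is hypothetical.

```python
import re

def first_match(pattern, text):
    """Return the first matched substring, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group() if m else None

print(first_match(r'\dcom', 'www.4comrunoob.5com'))  # '4com'
print(first_match(r'\dcom', 'www.runoob.com'))       # None

# search() reports where the first match was found.
print(re.search(r'\dcom', 'www.4comrunoob.5com').span())  # (4, 8)
```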

(4) findall()

findall() scans the whole string and returns all matching substrings as a list.

    p = re.compile(r'\d+')
    print(p.findall('o1n2m3k4'))
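
One detail worth knowing is that findall() returns tuples of groups when the pattern contains more than one capturing group. The sketch below illustrates this with a made-up sentence containing two phone numbers.

```python
import re

# No groups: a list of matched substrings.
print(re.findall(r'\d+', 'o1n2m3k4'))  # ['1', '2', '3', '4']

# Two groups: a list of (group1, group2) tuples.
text = 'call 010-12345 or 0755-26776666'
pairs = re.findall(r'(\d{3,4})-(\d{3,8})', text)
print(pairs)  # [('010', '12345'), ('0755', '26776666')]
```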

(5) finditer()

finditer() finds all the substrings that the pattern matches and returns an iterator that yields each result (a Match object) in order.

    matches = re.finditer(r'\d+', 'drumm44ers drumming, 11 ... ...')
    for i in matches:
        print(i)
        print(i.group())
        print(i.span())

(6) split()

split() splits a string at every substring the pattern matches and returns the pieces as a list.
For example, re.split(r'\s+', text) splits a string into a list of words on whitespace.

    print(re.split(r'\d+', 'one1two2three3four4five5'))
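
When the string ends with a separator, re.split() leaves an empty string at the end of the list. The following sketch shows this and one simple way to drop empty pieces; it also contrasts str.split(' ') with re.split on repeated spaces.

```python
import re

parts = re.split(r'\d+', 'one1two2three3four4five5')
print(parts)  # ['one', 'two', 'three', 'four', 'five', '']

# Filter out empty pieces if they are not wanted.
words = [p for p in parts if p]
print(words)  # ['one', 'two', 'three', 'four', 'five']

# Fixed-character split keeps an empty string for each extra space.
print('a b   c'.split(' '))         # ['a', 'b', '', '', 'c']
print(re.split(r'\s+', 'a b   c'))  # ['a', 'b', 'c']
```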

(7) sub()

sub() replaces every substring of the string that the pattern matches and returns the resulting string.

    import re

    text = "Jgood is a handsome boy, he's cool, clever, and so on ..."
    print(re.sub(r'\s+', '-', text))

(8) subn()

subn() performs the same replacement as sub() and returns a tuple of the new string and the number of replacements made.

    print(re.subn('[1-2]', 'A', '123456abcdef'))
    print(re.sub("g.t", "have", 'I get A, I got B, I gut C'))
    print(re.subn("g.t", "have", 'I get A, I got B, I gut C'))
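
The tuple returned by subn() is usually unpacked into the new string and the count. The sketch below shows that, plus the optional count argument of re.sub, which limits how many replacements are made. The variable names are illustrative.

```python
import re

text = 'I get A, I got B, I gut C'

replaced, count = re.subn(r'g.t', 'have', text)
print(replaced)  # I have A, I have B, I have C
print(count)     # 3

# count=1 replaces only the first occurrence.
print(re.sub(r'g.t', 'have', text, count=1))  # I have A, I got B, I gut C

print(re.subn('[1-2]', 'A', '123456abcdef'))  # ('AA3456abcdef', 2)
```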

0x03 Reference links

    • https://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001386832260566c26442c671fa489ebc6fe85badda25cd000
    • http://www.runoob.com/python3/python3-reg-expressions.html
    • https://www.cnblogs.com/tina-python/p/5508402.html
    • https://www.jb51.net/tools/regexsc.htm