Regular Expression, Regular Expression

Source: Internet
Author: User

Syntax:

import re  # import the module

# Compile the pattern into a regular expression object.
# ^ anchors the match at the beginning of the string and [0-9] matches any
# single digit from 0 to 9, so the pattern matches a string whose first
# character is a digit.
p = re.compile("^[0-9]")

# Match the string against the compiled object.
# On success m is a match object; otherwise m is None.
m = p.match('14534abc')

if m:  # not None
    print(m.group())  # m.group() returns the matched text, here 1
else:
    print("doesn't match.")

The compile and match lines can also be merged into a single call:

m = re.match("^[0-9]", '14534Abc')

The result is the same. The difference is that the first form compiles (parses) the pattern once in advance, so nothing has to be compiled at match time, whereas the shorthand form passes the pattern string on every call. Python does cache recently compiled patterns, so the gap is smaller than it sounds, but each call still pays for the cache lookup. If you need to find every line that starts with a digit in a file with 5 million lines, compile the regular expression first and then match with the compiled object; this is faster.
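As a minimal sketch of the precompiled approach, the snippet below filters a small in-memory list. The `lines` list and the name `starts_with_digit` are illustrative stand-ins (not from the original) for the lines of a large file.

```python
import re

# Compile once, outside the loop, and reuse the pattern object for every line.
starts_with_digit = re.compile(r"^[0-9]")

# Hypothetical sample data standing in for the lines of a large file.
lines = ["14534abc", "abc123", "9 lives", ""]

# Keep only the lines whose first character is a digit.
matching = [line for line in lines if starts_with_digit.match(line)]
print(matching)
```

With a real file you would iterate over the file object instead of a list; the pattern object is used the same way.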

Illustration:

ip_adds = '192.168.1.2213'  # string to match
m = re.match(r'([0-9]{1,3}\.){3}\d{1,3}', ip_adds)
print(m.group())  # 192.168.1.221

 

Matching format:

Metacharacter Description
\ Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, "n" matches the character "n", while "\n" matches a line break. The sequence "\\" matches "\", and "\(" matches "(". This corresponds to the "escape character" concept in many programming languages.
^ Matches the start of the input string. In multiline mode (the Multiline property of a RegExp object, or re.M in Python), ^ also matches the position right after a line break.
$ Matches the end of the input string. In multiline mode, $ also matches the position right before a line break.
* Matches the preceding subexpression zero or more times. For example, "zo*" matches "z", "zo", and "zoo", but does not match "bo". * is equivalent to {0,}.
+ Matches the preceding subexpression one or more times. For example, "zo+" matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
? Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}.
{n} n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but matches the two o's in "food".
{n,} n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".
{n,m} m and n are non-negative integers with n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". "o{0,1}" is equivalent to "o?". Note that there must be no space between the comma and the two numbers.
? When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match becomes non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible. For example, against the string "oooo", "o+?" matches a single "o", while "o+" matches all of them.
. Matches any single character except a line break ("\n"; some engines also exclude "\r"). To match any character including line breaks, use a pattern like "[\s\S]".
(pattern) Matches pattern and captures the match. The captured text can be retrieved from the resulting Matches collection: through the SubMatches collection in VBScript, the $0…$9 properties in JScript, or m.group(n) in Python. To match literal parentheses, use "\(" or "\)".
(?:pattern) Non-capturing match: matches pattern but does not capture the result, so it is not stored for later use. This is useful when combining parts of a pattern with the "|" character. For example, "industr(?:y|ies)" is a more concise expression than "industry|industries".
(?=pattern) Non-capturing positive lookahead: matches the search string at any position where a string matching pattern begins. The lookahead is not captured for later use. For example, "Windows(?=95|98|NT|2000)" matches "Windows" in "Windows2000", but not "Windows" in "Windows3.1". A lookahead does not consume characters: after a match, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead.
(?!pattern) Non-capturing negative lookahead: matches the search string at any position where a string that does not match pattern begins. The lookahead is not captured for later use. For example, "Windows(?!95|98|NT|2000)" matches "Windows" in "Windows3.1", but not "Windows" in "Windows2000".
(?<=pattern) Non-capturing positive lookbehind: like positive lookahead, but in the opposite direction. For example, "(?<=95|98|NT|2000)Windows" matches "Windows" in "2000Windows", but not "Windows" in "3.1Windows". Python's re module requires a lookbehind pattern to have a fixed width, so alternatives of different lengths like these must be rewritten there.
(?<!pattern) Non-capturing negative lookbehind: like negative lookahead, but in the opposite direction. For example, "(?<!95|98|NT|2000)Windows" matches "Windows" in "3.1Windows", but not "Windows" in "2000Windows".
x|y Matches x or y. For example, "z|food" matches "z" or "food" (be careful: the alternation covers everything on each side, so this does not mean "zood" or "food"). "(z|f)ood" matches "zood" or "food".
[xyz] Character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".
[^xyz] Negated character set. Matches any character not enclosed. For example, "[^abc]" matches "p", "l", "i", and "n" in "plain".
[a-z] Character range. Matches any character in the specified range. For example, "[a-z]" matches any lowercase letter from "a" to "z". Note: a hyphen denotes a range only when it is inside a character set and sits between two characters. At the start of a set it stands for the hyphen itself.
[^a-z] Negated character range. Matches any character not in the specified range. For example, "[^a-z]" matches any character that is not in the range "a" to "z".
\b Matches a word boundary, that is, the position between a word and a space. (A regular expression can "match" in two senses: matching characters and matching positions. \b matches a position.) For example, "er\b" matches the "er" in "never", but not the "er" in "verb".
\B Matches a non-word boundary. "er\B" matches the "er" in "verb", but not the "er" in "never".
\cx Matches the control character specified by x. For example, \cM matches Control-M, the carriage return character. The value of x must be in A-Z or a-z. Otherwise, c is treated as a literal "c" character.
\d Matches a digit character. Equivalent to [0-9].
\D Matches a non-digit character. Equivalent to [^0-9].
\f Matches a form feed. Equivalent to \x0c and \cL.
\n Matches a line feed. Equivalent to \x0a and \cJ.
\r Matches a carriage return. Equivalent to \x0d and \cM.
\s Matches any whitespace character, including spaces, tabs, and form feeds. Equivalent to [ \f\n\r\t\v].
\S Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t Matches a tab. Equivalent to \x09 and \cI.
\v Matches a vertical tab. Equivalent to \x0b and \cK.
\w Matches any word character, including the underscore. Similar but not equivalent to "[A-Za-z0-9_]", because "word" characters here are drawn from the Unicode character set.
\W Matches any non-word character. Equivalent to "[^A-Za-z0-9_]".
\xn Matches n, where n is a hexadecimal escape value. The hexadecimal escape must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". This allows ASCII codes to be used in regular expressions.
\num Matches num, where num is a positive integer: a backreference to a captured match. For example, "(.)\1" matches two consecutive identical characters.
\n Identifies either an octal escape value or a backreference. If at least n subexpressions were captured before \n, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm Identifies either an octal escape value or a backreference. If at least nm subexpressions were captured before \nm, nm is a backreference. If at least n were captured, n is a backreference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm.
\nml If n is an octal digit (0-7) and m and l are both octal digits (0-7), matches the octal escape value nml.
\un Matches n, where n is a Unicode character written as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
\< \> Match the start (\<) and end (\>) of a word. For example, the regular expression \<the\> matches "the" in the string "for the wise", but not "the" in the string "otherwise". Note: this metacharacter is not supported by all software.
( ) Defines the expression between ( and ) as a "group" and saves the characters that match it to a temporary area (a regular expression can save up to 9 groups). They can be referenced with the symbols \1 to \9.
| Performs a logical OR on two matching conditions. For example, the regular expression (him|her) matches "it belongs to him" and "it belongs to her", but not "it belongs to them.". Note: this metacharacter is not supported by all software.
+ Matches one or more occurrences of the character immediately before it. For example, the regular expression 9+ matches 9, 99, and 999. Note: this metacharacter is not supported by all software.
? Matches zero or one occurrence of the character immediately before it. Note: this metacharacter is not supported by all software.
{i} {i,j} Matches a specified number of occurrences of the preceding expression. For example, the regular expression A[0-9]{3} matches the character "A" followed by exactly three digits, such as A123 and A348, but does not match A1234. The regular expression [0-9]{4,6} matches any four, five, or six consecutive digits.
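A few of the rows above are easier to see in action. The following is a small sketch using Python's re module. The sample strings are taken from the table's own examples and the variable names are illustrative only.

```python
import re

# Greedy vs. non-greedy: "o+" takes every "o", "o+?" stops after one.
greedy = re.match(r"o+", "oooo").group()
lazy = re.match(r"o+?", "oooo").group()

# Positive lookahead: match "Windows" only when one of the listed
# suffixes follows; the suffix itself is not part of the match.
ahead = re.search(r"Windows(?=95|98|NT|2000)", "Windows2000")
no_ahead = re.search(r"Windows(?=95|98|NT|2000)", "Windows3.1")

# Backreference: \1 repeats whatever group 1 captured.
doubled = re.search(r"(.)\1", "food").group()

# Non-capturing group: alternation without creating a numbered group.
plural = re.fullmatch(r"industr(?:y|ies)", "industries")

print(greedy, lazy)
print(ahead.group(), no_ahead)
print(doubled)
print(plural.groups())
```

The empty tuple printed on the last line shows that `(?:...)` groups the alternation without storing anything for later reference.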

  

Five commonly used regular expression operations

re.match(pattern, string) # match from the start of the string

re.search(pattern, string) # scan the whole string until a match is found

re.split() # split the string into a list, using each match as a split point

m = re.split("[0-9]", "alex1rain2jack3helen rachel8")
print(m)

Output: ['alex', 'rain', 'jack', 'helen rachel', '']

re.findall() # find every match and return them as a list

m = re.findall("[0-9]", "alex1rain2jack3helen rachel8")
print(m)

Output: ['1', '2', '3', '8']

re.sub(pattern, repl, string, count, flags) # replace matched characters

m = re.sub("[0-9]", "|", "alex1rain2jack3helen rachel8", count=2)
print(m)

Output: alex|rain|jack3helen rachel8
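Two details of the examples above are worth a quick look: the empty string at the end of the split result, and the effect of leaving out count in re.sub. This is a small sketch on the same sample string. The variable names are illustrative only.

```python
import re

text = "alex1rain2jack3helen rachel8"

# The string ends with a digit, so split leaves an empty piece at the end.
parts = re.split(r"[0-9]", text)

# Dropping empty pieces is a common follow-up step.
names = [p for p in parts if p]

# Without count, sub replaces every match rather than only the first two.
all_replaced = re.sub(r"[0-9]", "|", text)

print(parts)
print(names)
print(all_replaced)
```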

 

 

Regular expression examples

Character matching
Example Description
python Matches "python".
Character classes
Example Description
[Pp]ython Matches "Python" or "python"
rub[ye] Matches "ruby" or "rube"
[aeiou] Matches any one of the letters in the brackets
[0-9] Matches any digit. Same as [0123456789]
[a-z] Matches any lowercase letter
[A-Z] Matches any uppercase letter
[a-zA-Z0-9] Matches any letter or digit
[^aeiou] Matches any character other than the letters a, e, i, o, u
[^0-9] Matches any character other than a digit
Special character classes
Example Description
. Matches any single character except "\n". To match any character including "\n", use a pattern like "[\s\S]" or the re.S flag. (Inside brackets a dot is literal, so "[.\n]" matches only a dot or a line break.)
\d Matches a digit character. Equivalent to [0-9].
\D Matches a non-digit character. Equivalent to [^0-9].
\s Matches any whitespace character, including spaces, tabs, and form feeds. Equivalent to [ \f\n\r\t\v].
\S Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\w Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]' for ASCII text.
\W Matches any non-word character. Equivalent to '[^A-Za-z0-9_]' for ASCII text.
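The character classes above can be tried out with re.findall. The sketch below uses a made-up sample sentence; the sentence and the variable names are illustrative only.

```python
import re

sample = "Python 3 and python_2, ruby or rube?"

pythons = re.findall(r"[Pp]ython", sample)  # both capitalizations
rubies = re.findall(r"rub[ye]", sample)     # "ruby" and "rube"
digits = re.findall(r"\d", sample)          # single digit characters
words = re.findall(r"\w+", sample)          # runs of word characters

print(pythons)
print(rubies)
print(digits)
print(words)
```

Note that `\w+` keeps "python_2" together, because the underscore and digits count as word characters.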

 

 

Difference between re.match and re.search

re.match only matches at the start of the string: if the start of the string does not fit the regular expression, the match fails and the function returns None. re.search scans the whole string and returns the first match it finds, or None if there is none.
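A minimal sketch of the difference, using a made-up string that has digits only in the middle:

```python
import re

text = "abc123"

at_start = re.match(r"[0-9]+", text)    # None: the string starts with a letter
anywhere = re.search(r"[0-9]+", text)   # finds the digits later in the string

print(at_start)
print(anywhere.group(), anywhere.start())
```

Because re.match can return None, check the result before calling group() on it.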

Regular Expression Modifiers: Option Flags

Regular expression functions accept an optional modifier to control various aspects of matching. Modifiers are passed as an optional flags argument. You can combine several modifiers with bitwise OR (|), for example re.I | re.M. The available modifiers are:

Modifier Description
re.I Performs case-insensitive matching.

Example:

>>> string = "KOBE"
>>> m = re.match('[a-z]', string, flags=re.I)  # the keyword argument must be named flags
>>> print(m.group())
K

re.L Interprets words according to the current locale. This affects the alphabetic groups (\w and \W) as well as word boundary behavior (\b and \B).
re.M Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).
re.S Makes a period (dot) match any character, including a newline.
re.U Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, and \B. It is the default for str patterns in Python 3.
re.X Permits "cuter" regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats an unescaped # as a comment marker.
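To show the flags in use, here is a short sketch that combines re.M with re.I, compares the dot with and without re.S, and writes a commented pattern with re.X. The sample text and the names are illustrative only.

```python
import re

text = "first line\nSecond line"

# re.M lets ^ match at the start of each line; re.I ignores case.
line_starts = re.findall(r"^[a-z]+", text, flags=re.M | re.I)

# Without re.S the dot stops at the newline; with it the dot crosses lines.
no_dotall = re.match(r".+", text).group()
dotall = re.match(r".+", text, flags=re.S).group()

# re.X allows whitespace and comments inside the pattern.
number = re.compile(r"""
    \d+      # integer part
    \.       # decimal point
    \d+      # fractional part
""", re.X)
found = number.search("pi is about 3.14").group()

print(line_starts)
print(repr(no_dotall))
print(repr(dotall))
print(found)
```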

 

Several common Regular Expressions:

Matching mobile phone number:

phone_str = "hey my name is alex, and my phone number is 13651054607, please call me if you are pretty!"
phone_str2 = "hey my name is alex, and my phone number is 18651054604, please call me if you are pretty!"

m = re.search(r"(1)([358]\d{9})", phone_str2)
if m:
    print(m.group())  # 18651054604

Matching an IPv4 address:

ip_addr = "inet 192.168.60.223 netmask 0xffffff00 broadcast 192.168.60.255"
m = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", ip_addr)
print(m.group())  # 192.168.60.223
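The pattern above accepts any group of one to three digits, so it would also accept something like 999.1.1.1. One possible refinement, sketched below, is to capture the four octets and check their numeric range in Python. The names `IPV4_CANDIDATE` and `find_ipv4` are hypothetical helpers, not part of the re module.

```python
import re

# Candidate dotted quads; \b keeps the match from starting or ending mid-number.
IPV4_CANDIDATE = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")

def find_ipv4(text):
    """Return the dotted-quad strings in text whose octets are all in 0-255."""
    found = []
    for m in IPV4_CANDIDATE.finditer(text):
        if all(int(octet) <= 255 for octet in m.groups()):
            found.append(m.group())
    return found

good = find_ipv4("inet 192.168.60.223 netmask 0xffffff00 broadcast 192.168.60.255")
bad = find_ipv4("bogus 999.1.1.1")
print(good)
print(bad)
```

Doing the range check in Python keeps the pattern short. A pure-regex range check is possible but much harder to read.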

Group matching address:

contactInfo = 'Oldboy School, Beijing Changping Shahe: 010-8343245'

# Numbered groups. [\w ]+ lets the second group span the spaces in the address.
match = re.search(r'(\w+), ([\w ]+): (\S+)', contactInfo)
# >>> match.group(1)
# 'School'
# >>> match.group(2)
# 'Beijing Changping Shahe'
# >>> match.group(3)
# '010-8343245'

# Named groups: the same pattern with (?P<name>...).
match = re.search(r'(?P<last>\w+), (?P<first>[\w ]+): (?P<phone>\S+)', contactInfo)
# >>> match.group('last')
# 'School'
# >>> match.group('first')
# 'Beijing Changping Shahe'
# >>> match.group('phone')
# '010-8343245'
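Named groups can also be read back all at once with groupdict(). The sketch below uses a made-up record in "last, first: phone" form; the `record` string is illustrative sample data.

```python
import re

record = "Doe, John: 555-1212"  # illustrative sample record

m = re.search(r"(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)", record)

by_number = m.group(1)       # a named group still has its number
by_name = m.group("last")    # and can be fetched by name
fields = m.groupdict()       # all named groups as a dict

print(by_number, by_name)
print(fields)
```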

Matching email:

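email = "alex.li@126.com   http://www.oldboyedu.com"
# The dot before the final part is escaped so that it matches a literal "."
# rather than any character (unescaped, it would pick up the trailing space).
m = re.search(r"[0-9.a-z]{0,26}@[0-9.a-z]{0,20}\.[0-9a-z]{0,8}", email)
print(m.group())  # alex.li@126.com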

 

 

 

 

 
