Python full stack road 6: regular expressions

Last Update:2016-09-01 Source: Internet

Author: User


Regular expressions are a small language of their own:

A regular expression uses a single string to describe and match a whole set of strings that follow a given syntactic rule. It is very powerful for text processing and is often used in crawlers to extract specific content. Regular expressions are not part of Python's core syntax, but importing the re module from the standard library makes them available. Here is a look at how Python regular expressions are used.

The following lists the regular expression metacharacters and syntax supported by Python:

One, Python regular expression characters:

1, ordinary characters:

Most letters and characters simply match themselves.

2, metacharacters:

　　Metacharacters: . ^ $ * + ? { } [ ] | ( ) \

2.1, [] in detail

For example, [abc] matches any one of the characters "a", "b", or "c". You can also use the range [a-c] to describe the same character set; the effect is identical. If you only want to match lowercase letters, the pattern should be written as [a-z].
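A small sketch of character sets in practice. The sample string and variable names are made up for illustration only.

```python
import re

text = "abcdABCD"  # made-up sample input

# An explicit set and a range describe the same characters.
explicit = re.findall(r"[abc]", text)
ranged = re.findall(r"[a-c]", text)

# [a-z] covers every lowercase ASCII letter and nothing else.
lowercase = re.findall(r"[a-z]", text)

print(explicit)   # ['a', 'b', 'c']
print(ranged)     # ['a', 'b', 'c']
print(lowercase)  # ['a', 'b', 'c', 'd']
```

Character sets are case sensitive, which is why none of the uppercase letters appear in the results.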

2.2, () in detail

2.3, +, *, ?, {} in detail

　　Greedy mode and non-greedy mode:

The quantifiers '*', '+', '?' and '{m,n}' are all greedy: they match as much text as possible. That may not be what we want, so you can add a question mark after the quantifier to switch to the non-greedy strategy, which matches as little as possible. Example:

Note the difference between the two calls. Also note that when the pattern contains a group, findall returns only what the group captured; when there is no group, it returns the whole match.

    >>> re.findall(r"a(\d+?)", "a234b")   # non-greedy mode: \d+? stops after one digit
    ['2']
    >>> re.findall(r"a(\d+)", "a234b")    # greedy mode: \d+ takes every digit it can
    ['234']
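The same contrast shows up with '*' when there is no group at all. This is only a sketch; the HTML-like sample string is an assumption chosen to make the difference visible.

```python
import re

html = "<b>bold</b>"  # made-up sample input

# Greedy: .* runs to the last '>' it can reach.
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b>']
print(lazy)    # ['<b>', '</b>']
```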

The role of the r prefix in regular expression matching

    >>> re.findall(r"\bI", "I Love U")
    ['I']
    >>> re.findall(r"\bIl", "Ilove u")
    ['Il']

Two, the methods of the re module:

1, findall (gets all the matching substrings in the string)

findall() returns the matching results as a list. If nothing matches, it returns an empty list. Take a look at the usage in the code:

    import re

    l = re.findall(r'\d', '4g6gggg9,9')               # \d matches a digit; the matches are placed in a list
    print(l)                                          # ['4', '6', '9', '9']
    print(re.findall(r'\w', 'ds._ 4'))                # ['d', 's', '_', '4'], matches letters, digits and underscore
    print(re.findall(r'^sk', 'skggj,fd,7'))           # starts with sk: ['sk']
    print(re.findall(r'^sk', 'kggj,fd,7'))            # []
    print(re.findall(r'k{3,5}', 'ffkkkkk'))           # the preceding character 'k' 3 to 5 times: ['kkkkk']
    print(re.findall(r'a{2}', 'aasdaaaaaf'))          # the preceding character 'a' exactly twice: ['aa', 'aa', 'aa']
    print(re.findall(r'a*x', 'aaaaaax'))              # ['aaaaaax'], the preceding character 0 or more times, greedy
    print(re.findall(r'\d*', 'www33333'))             # ['', '', '', '33333', '']
    print(re.findall(r'a+c', 'aaaacccc'))             # ['aaaac'], the preceding character 1 or more times, greedy
    print(re.findall(r'a?c', 'aaaacccc'))             # ['ac', 'c', 'c', 'c'], the preceding character 0 or 1 times
    print(re.findall(r'a[.]d', 'acdggg abd'))         # . loses its special meaning inside [], so the result is []
    print(re.findall(r'[a-z]', 'h43.hb-gg'))          # ['h', 'h', 'b', 'g', 'g']
    print(re.findall(r'[^a-z]', 'h43.hb-gG'))         # negated set: ['4', '3', '.', '-', 'G']
    print(re.findall(r'ax$', 'dsadax'))               # ends with 'ax': ['ax']
    print(re.findall(r'a(\d+)b', 'a23666b'))          # ['23666']
    print(re.findall(r'a(\d+?)b', 'a23666b'))         # ['23666'], bounded on both sides, so non-greedy mode has no effect
    print(re.findall(r'a(\d+)', 'a23b'))              # ['23']
    print(re.findall(r'a(\d+?)', 'a23b'))             # ['2'], adding ? switches to non-greedy mode

Advanced usage of findall: ?:

By default findall returns only what the groups () capture. If you want the whole match, including the text outside the group, while still using parentheses, make the group non-capturing with (?:...).
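A minimal sketch of the difference between a capturing group and a ?: group in findall. The input string is made up for the example.

```python
import re

text = "abab abc"  # made-up sample input

# Capturing group: findall returns only the group's last capture per match.
capturing = re.findall(r"(ab)+", text)

# Non-capturing group: there is no group to report, so the whole match comes back.
non_capturing = re.findall(r"(?:ab)+", text)

print(capturing)      # ['ab', 'ab']
print(non_capturing)  # ['abab', 'ab']
```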

finditer(): iterative lookup

    >>> p = re.compile(r'\d+')
    >>> iterator = p.finditer('drumm44ers drumming, 11 ... ...')
    >>> for match in iterator:
    ...     match.group(), match.span()
    ...

2, match(pattern, string, flags=0)

pattern: the regular expression
string: the string to match
flags: flag bits, used to control how the regular expression is matched

match only tries to match at the beginning of the string and returns None if the pattern does not match there. On success it returns a Match object, which holds a lot of information about this match; that information can be read through the attributes and methods the Match object provides.

Methods:
1. group([group1, ...]): Returns the string captured by one or more groups. When several arguments are given, the results are returned as a tuple. group1 can be a number or a name; number 0 stands for the entire matched substring, and group() with no arguments returns group(0). A group that captured nothing returns None; a group that captured several times returns the last substring it captured.
2. groups([default]): Returns the strings captured by all groups as a tuple. Equivalent to calling group(1, 2, ..., last). default is the value used for groups that captured nothing; it defaults to None.
3. groupdict([default]): Returns a dictionary whose keys are the names of the named groups and whose values are the substrings those groups captured. Groups without a name are not included. default has the same meaning as above.
4. start([group]): Returns the start index in string of the substring captured by the given group (the index of the first character of the substring). group defaults to 0.
5. end([group]): Returns the end index in string of the substring captured by the given group (the index of the last character of the substring + 1). group defaults to 0.
6. span([group]): Returns (start(group), end(group)).
7. expand(template): Substitutes the matched groups into template and returns the result. The template can refer to groups with \id, \g<id> or \g<name>. \0 is not a group reference; use \g<0> for the whole match. \id and \g<id> are equivalent, but \10 is read as group 10; to express \1 followed by the character '0', use \g<1>0.
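The methods listed above can be tried out on a single Match object. This is only a sketch: the pattern, the group names year and month, and the input string are assumptions made up for the example.

```python
import re

# Third group is optional and does not take part in this match.
m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})(-\d{2})?", "2016-09 update")

whole = m.group()                 # the entire match
pair = m.group(1, 2)              # several groups at once, as a tuple
by_name = m.group("year")         # a group looked up by name
all_groups = m.groups()           # unmatched group shows up as None
with_default = m.groups("")       # ... or as the given default
named = m.groupdict()             # only named groups
position = (m.start(2), m.end(2), m.span())
expanded = m.expand(r"\g<month>/\1")

print(whole)         # 2016-09
print(pair)          # ('2016', '09')
print(by_name)       # 2016
print(all_groups)    # ('2016', '09', None)
print(with_default)  # ('2016', '09', '')
print(named)         # {'year': '2016', 'month': '09'}
print(position)      # (5, 7, (0, 7))
print(expanded)      # 09/2016
```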

3, search(pattern, string, flags=0)

search scans the string for the first place where the pattern matches and returns that single match; it matches only once. You can combine it with split: cut the string around the matched content, splice the pieces back together, and then loop to search again. findall, by contrast, finds every match, but when the pattern contains groups () it returns only what the groups captured, not the text outside them. findall also returns a list, as introduced above.
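To make the difference between match, search and findall concrete, here is a short sketch on one made-up input string.

```python
import re

text = "id=42, id=7"  # made-up sample input

at_start = re.match(r"\d+", text)      # match only looks at the start of the string
first = re.search(r"\d+", text)        # search returns the first occurrence only
everything = re.findall(r"\d+", text)  # findall returns every occurrence

print(at_start)        # None
print(first.group())   # 42
print(first.span())    # (3, 5)
print(everything)      # ['42', '7']
```

Because match and search return None when nothing is found, it is safer to test the result before calling group() on it.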

4, group and groups

group(0) returns the whole match

group(1) returns the content of the first group ()

group(2) returns the content of the second group ()

If the pattern has no groups, or the number is larger than the number of groups, an IndexError is raised.

5, sub(pattern, repl, string, count=0, flags=0)

sub replaces the parts of string that match pattern. The pattern argument is always interpreted as a regular expression, so a variable holding text that was found earlier by search or findall cannot safely be passed in as the pattern as it is.

For example, my calculator handles parentheses. After matching a parenthesized expression with search, passing that variable directly as the pattern of sub did not work, because the parentheses and operators in it are read as metacharacters. Escaping the text first with re.escape avoids this.
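The calculator situation described above can be reproduced in a few lines. This is a sketch under assumptions: the expression string and the replacement value "7" are made up, and re.escape is one possible fix rather than necessarily the one the original calculator used.

```python
import re

expression = "2*(3+4)-1"  # hypothetical calculator input

# Find the innermost parenthesized part.
inner = re.search(r"\([^()]+\)", expression).group()

# Used directly as a pattern, "(3+4)" means "one or more 3s, then 4" inside a
# group, which does not occur in the text, so nothing is replaced.
unescaped = re.sub(inner, "7", expression)

# re.escape turns the found text into a pattern that matches it literally.
escaped = re.sub(re.escape(inner), "7", expression)

print(inner)      # (3+4)
print(unescaped)  # 2*(3+4)-1
print(escaped)    # 2*7-1
```

When the text to replace is a fixed literal like this, str.replace(inner, "7") would work just as well and avoids the regular expression engine entirely.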

Notes on sub

sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):

Returns the string obtained by replacing each matched substring in string with repl.
When repl is a string, it can refer to groups with \id, \g<id> or \g<name>. \0 is not a group reference; use \g<0> for the whole match.
When repl is a function, it should accept exactly one argument (the Match object) and return the string to substitute (group references such as \1 in the returned string are not expanded).
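A short sketch of both kinds of repl. The date string, the group names y, m and d, and the helper next_year are all made up for illustration.

```python
import re

date = "2016-09-01"  # made-up sample input
pattern = re.compile(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})")

# repl as a string: refer to groups by number or by name.
by_number = pattern.sub(r"\3/\2/\1", date)
by_name = pattern.sub(r"\g<d>.\g<m>.\g<y>", date)

# \g<1>0 means "group 1, then a literal 0"; \10 would mean group 10.
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")

# repl as a function: receives the Match object, returns the replacement text.
def next_year(match):
    return str(int(match.group("y")) + 1) + match.group()[4:]

bumped = pattern.sub(next_year, date)

# count limits how many matches are replaced.
once = re.sub(r"\d", "#", "a1b2c3", count=1)

print(by_number)  # 01/09/2016
print(by_name)    # 01.09.2016
print(padded)     # a10b20
print(bumped)     # 2017-09-01
print(once)       # a#b2c3
```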

The subn method also returns the total number of replacements

subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):

Returns (sub(repl, string[, count]), number of replacements).

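    import re

    p = re.compile(r'(\w+) (\w+)')
    s = 'i say, hello world!'

    # Swap each pair of words; subn also reports how many replacements were made.
    print(p.subn(r'\2 \1', s))    # ('say i, world hello!', 2)

    def func(m):
        # Capitalize both words of the matched pair.
        return m.group(1).title() + ' ' + m.group(2).title()

    print(p.subn(func, s))        # ('I Say, Hello World!', 2)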

6, split(pattern, string, maxsplit=0, flags=0)

split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):
Splits string at every substring that the pattern matches and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.

    import re

    p = re.compile(r'\d+')
    print(p.split('one1two2three3four4'))    # ['one', 'two', 'three', 'four', '']
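A small sketch of two variations on split: limiting the number of splits with maxsplit, and keeping the separators by putting the pattern in a group. The second input string is made up for the example.

```python
import re

text = "one1two2three3four4"

all_parts = re.split(r"\d+", text)              # split at every match
limited = re.split(r"\d+", text, maxsplit=2)    # stop after two splits

# If the pattern contains a capturing group, the separators are kept in the result.
kept = re.split(r"(\d+)", "a1b22c")

print(all_parts)  # ['one', 'two', 'three', 'four', '']
print(limited)    # ['one', 'two', 'three3four4']
print(kept)       # ['a', '1', 'b', '22', 'c']
```

The trailing empty string in the first result appears because the text ends with a separator.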

7, re.compile(strPattern[, flag]): the compile method

If a matching rule is going to be used more than once, it can be compiled first, so the rule does not have to be written out every time.

This method is the factory method of the Pattern class: it compiles a regular expression given as a string into a Pattern object. The second parameter, flag, is the matching mode; several values can be combined with the bitwise OR operator '|' so that they take effect together, such as re.I | re.M. Regular expressions that are used often can be compiled into regular expression objects, which improves efficiency somewhat. The following is an example of a regular expression object:

    import re

    text = "Jgood is a handsome boy, he's cool, clever, and so on ..."
    regex = re.compile(r'\w*oo\w*')    # any word containing "oo"
    print(regex.findall(text))         # ['Jgood', 'cool']
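To see what combining flags with '|' does, here is a sketch using re.I (ignore case) and re.M (multi-line, so ^ also matches at the start of each line). The three-line input is made up for the example.

```python
import re

text = "Python\npython rocks\nPYTHON"  # made-up three-line input

default = re.findall(r"^python", text)             # ^ = start of string, case sensitive
ignore_case = re.findall(r"^python", text, re.I)   # case-insensitive
multi_line = re.findall(r"^python", text, re.M)    # ^ = start of every line

# Both flags together, passed once at compile time.
both = re.compile(r"^python", re.I | re.M).findall(text)

print(default)      # []
print(ignore_case)  # ['Python']
print(multi_line)   # ['python']
print(both)         # ['Python', 'python', 'PYTHON']
```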

Three, raw strings, compilation, grouping:

1, raw strings

Careful readers will have noticed that every time I write a matching rule I put an r in front of it. Why? The following code explains:

    import re

    # "\b" is the backspace character in ASCII; in a regular expression, \b matches a word boundary
    print(re.findall("\bblow", "jason blow cat"))     # here \b is a backspace, so nothing matches: []
    print(re.findall("\\bblow", "jason blow cat"))    # with the backslash escaped it matches: ['blow']
    print(re.findall(r"\bblow", "jason blow cat"))    # a raw string works directly: ['blow']

You may notice that a pattern such as "\d" also works without a raw string. That is because \d is not an escape sequence that Python string literals recognize, so the backslash is left in the string and the regular expression compiler still sees \d, a decimal digit. Recent Python versions do, however, emit a warning for such unrecognized escapes in ordinary string literals. To keep code rigorous and simple, it is best to always write patterns as raw strings.
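One way to see what the r prefix changes is to look at the strings themselves before they ever reach the re module. A minimal sketch, with variable names made up for the example:

```python
import re

plain = "\b"      # one character: backspace (ASCII 8)
raw = r"\b"       # two characters: a backslash and the letter b
doubled = "\\b"   # the same two characters, written with an escaped backslash

lengths = (len(plain), len(raw), len(doubled))
same = raw == doubled
found = re.findall(raw + "blow", "jason blow cat")

print(lengths)     # (1, 2, 2)
print(ord(plain))  # 8
print(same)        # True
print(found)       # ['blow']
```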

2, compiling

If we want to use a matching rule many times, we can compile it first and then avoid writing the rule out each time. See the usage:

    import re

    c = re.compile(r'\d')    # to use the rule again later, just call methods on c
    print(c.findall('as3..56,'))

3, grouping

Besides simply checking whether something matches, regular expressions can also extract substrings. Each () in a pattern marks a group to extract, and a pattern can contain more than one group. Groups have many uses; this is only a brief introduction.

    import re

    print(re.findall(r'(\d+)-([a-z])', '34324-dfsdfs777-hhh'))            # [('34324', 'd'), ('777', 'h')]
    print(re.search(r'(\d+)-([a-z])', '34324-dfsdfs777-hhh').group(0))    # 34324-d, the whole match
    print(re.search(r'(\d+)-([a-z])', '34324-dfsdfs777-hhh').group(1))    # 34324, the first group
    print(re.search(r'(\d+)-([a-z])', '34324-dfsdfs777-hhh').group(2))    # d, the second group
    # print(re.search(r'(\d+)-([a-z])', '34324-dfsdfs777-hhh').group(3))  # raises IndexError: no such group

    print(re.search(r"(jason)kk\1", "xjasonkkjason").group())             # jasonkkjason, \1 refers to group number 1
    print(re.search(r'(\d)gg\1', '2j333gg3jjj8').group())                 # 3gg3, \1 reuses what the first group \d matched
    # The next call returns None. Why is 3gg7 not matched? \1 does not just mean "the first group's
    # pattern": it must match exactly the same text the first group matched. The first group matched 3,
    # but the text after gg is 7, which is different, so there is no match.
    print(re.search(r'(\d)gg\1', '2j333gg7jjj8'))                         # None
    print(re.search(r'(?P<first>\d)abc(?P=first)', '1abc1').group())      # 1abc1, declare a group name and refer to the group by that name

    r = re.match(r'(?P<n1>h)(?P<n2>\w+)', 'hello,hi,help')                # another use of group names
    print(r.group())        # hello, the matched value
    print(r.groups())       # ('h', 'ello'), the matched groups
    print(r.groupdict())    # {'n1': 'h', 'n2': 'ello'}, the groups as a dictionary keyed by group name

    # Groups extract values from inside what has been matched
    origin = "hello alex,acd,alex"
    print(re.findall(r'(a)(\w+)(x)', origin))       # [('a', 'le', 'x'), ('a', 'le', 'x')]
    print(re.findall(r'a\w+', origin))              # ['alex', 'acd', 'alex']
    print(re.findall(r'a(\w+)', origin))            # ['lex', 'cd', 'lex']
    print(re.findall(r'(a\w+)', origin))            # ['alex', 'acd', 'alex']
    print(re.findall(r'(a)(\w+(e))(x)', origin))    # [('a', 'le', 'e', 'x'), ('a', 'le', 'e', 'x')]

    r = re.finditer(r'(a)(\w+(e))(?P<name>x)', origin)
    for i in r:
        print(i, i.group(), i.groupdict())
    # Output (the exact repr of the match object depends on the Python version):
    # <_sre.SRE_Match object; span=(6, 10), match='alex'> alex {'name': 'x'}
    # <_sre.SRE_Match object; span=(15, 19), match='alex'> alex {'name': 'x'}

    print(re.findall(r'(\w)*', 'alex'))                # ['x', '']: alex is matched, but the single group repeats 4 times and only the last capture, x, is kept
    print(re.findall(r'(\w)(\w)(\w)(\w)', 'alex'))     # [('a', 'l', 'e', 'x')]: four groups, so four values are captured

    origin = 'hello alex sss hhh kkk'
    print(re.split(r'a(\w+)', origin))    # ['hello ', 'lex', ' sss hhh kkk']
    print(re.split(r'a\w+', origin))      # ['hello ', ' sss hhh kkk']

This article is an English version of an article originally published in Chinese on aliyun.com.
