Regular Expressions

Source: Internet
Author: User

A regular expression is a combination of symbols with special meanings that describes a character or a string. In other words, a regular expression is a rule that strings either match or do not match.

Regular expressions are essentially a small programming language embedded in Python and made available through the re module.

 

1. Metacharacters

Regular expressions have 11 metacharacters in total:

. ^ $ * + ? {} [] () \ |

1) . matches any character except a line break. (To make it match line breaks as well, pass the re.S flag to the method.)

2) ^ matches the start of the string.

3) $ matches the end of the string. A pattern of the form ^...$ must therefore match the string from beginning to end.
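
The behaviour of . ^ and $ can be checked with a short sketch. The sample strings and variable names here are made up for illustration only.

```python
import re

text = 'abc\nabd'

dot_default = re.findall('a.*', text)         # . stops at the line break
dot_all = re.findall('a.*', text, re.S)       # with re.S, . also matches the line break
at_start = re.findall('^ab.', text)           # only the occurrence at the start of the string
at_end = re.findall('ab.$', text)             # only the occurrence at the end of the string
whole = re.search('^abd$', 'abd') is not None       # the whole string must be 'abd'
not_whole = re.search('^abd$', 'abdx') is not None  # extra character, so no match

print(dot_default)  # ['abc', 'abd']
print(dot_all)      # ['abc\nabd']
print(at_start)     # ['abc']
print(at_end)       # ['abd']
print(whole, not_whole)  # True False
```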

4) * + ? {}

All of these express repetition:

* means the preceding character is repeated zero or more times.

+ means the preceding character is repeated one or more times.

? means the preceding character is repeated zero times or once.

{} takes a number in the braces for an exact repeat count, or a range such as {1,3} for a minimum and maximum count.

ret = re.findall('10{1,3}', '100')
print(ret)  # ['100'], greedy by default: both zeros are matched
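
As a rough side-by-side comparison of the four quantifiers, the following sketch applies each of them to the same string. The test string is an arbitrary choice for illustration.

```python
import re

s = 'a ab abb abbb'

star = re.findall('ab*', s)        # b repeated zero or more times
plus = re.findall('ab+', s)        # b repeated one or more times
optional = re.findall('ab?', s)    # b repeated zero times or once
exact = re.findall('ab{2}', s)     # b repeated exactly twice
ranged = re.findall('ab{1,2}', s)  # b repeated one to two times

print(star)      # ['a', 'ab', 'abb', 'abbb']
print(plus)      # ['ab', 'abb', 'abbb']
print(optional)  # ['a', 'ab', 'ab', 'ab']
print(exact)     # ['abb', 'abb']
print(ranged)    # ['ab', 'abb', 'abb']
```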

5) The character set [] matches exactly one character out of the characters listed in the set.

Note that metacharacters lose their special meaning inside a character set.

Only three symbols have a function inside a character set: - ^ \

- denotes a range, for example 0-9, a-z, A-Z.

^ written at the start of the character set negates it ("not"). Note that the negation applies to the whole set: [^abc] matches any character that is not a, b, or c.

ret = re.findall('[^abc]', 'abdcdefg')
print(ret)  # ['d', 'd', 'e', 'f', 'g']

\ escapes the character that follows it.
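
The rules for character sets can be tried out with a small sketch like the one below. The input strings are invented for the example.

```python
import re

literal = re.findall('[.*+]', 'a.b*c+d')       # . * + are ordinary characters inside a set
in_range = re.findall('[0-9a-f]', 'x9fz')      # two ranges combined in one set
negated = re.findall('[^0-9]', 'a1b2')         # any character that is not a digit
escaped = re.findall(r'[\^\-\\]', r'a^b-c\d')  # a backslash makes ^ - \ literal

print(literal)   # ['.', '*', '+']
print(in_range)  # ['9', 'f']
print(negated)   # ['a', 'b']
print(escaped)   # ['^', '-', '\\']
```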

6) The pipe character | also means "or". The difference from a character set is that each alternative of | can be more than one character long.

If A and B are regular expressions, A|B matches any string that matches A or B. | has very low precedence, so that it behaves sensibly when you alternate between multi-character strings. For example, compare a character set with the pipe operator:

# To match Python and Jython, a character set is enough
ret = re.findall('[PJ]ython', 'Python,Jython')
# But to match Python and Perl, the alternatives differ by more than one character
ret = re.findall('P(?:ython|erl)', 'Python,Perl')
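
The low precedence of | is easy to trip over, so here is a minimal sketch of what happens with and without a group. The test string is made up for illustration.

```python
import re

s = 'Python,Perl,P'

# Without a group, | splits the whole pattern: this means 'Python' or 'erl'
loose = re.findall('Python|erl', s)
# A non-capturing group limits the alternation to the part after P
grouped = re.findall('P(?:ython|erl)', s)

print(loose)    # ['Python', 'erl']
print(grouped)  # ['Python', 'Perl']
```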

 

7) Escape character \

1. A backslash in front of a metacharacter removes its special meaning. For example, \. matches a literal dot.

2. A backslash in front of certain ordinary characters gives them a special meaning. For example, \d matches any decimal digit.

\d matches any decimal digit; it is equivalent to the class [0-9].
\D matches any non-digit character; it is equivalent to the class [^0-9].
\s matches any whitespace character; it is equivalent to the class [ \t\n\r\f\v].
\S matches any non-whitespace character; it is equivalent to the class [^ \t\n\r\f\v].

\w matches any alphanumeric character or underscore; it is equivalent to the class [a-zA-Z0-9_].

\W matches any character that is not alphanumeric or an underscore; it is equivalent to the class [^a-zA-Z0-9_].
\b matches a word boundary, that is, the position between a word character and a non-word character such as a space, &, or #.

\A matches the start of the string.

\Z matches the end of the string. In Python's re module it matches only at the very end, not before a trailing line break (use $ for that).

\z and \G appear in some other regular expression flavors (the end of the string, and the position where the previous match ended). Python's re module does not support \G, and \Z is the form to use for the end of the string.

\n matches a line feed.

\t matches a tab.

Note: To match a literal \, the pattern needs four backslashes ('\\\\'), because the pattern is first processed by the Python interpreter as a string literal and only then interpreted by the re module. Prefixing the pattern with r makes it a raw string, which skips the interpreter's escape processing, so two backslashes (r'\\') are enough.
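
A few of these escapes, and the raw-string note above, can be sketched as follows. The sample strings and variable names are invented for illustration.

```python
import re

digits = re.findall(r'\d+', 'room 101, floor 7')         # runs of digits
words = re.findall(r'\w+', 'hi_there, you!')             # runs of letters, digits, and _
parts = re.split(r'\s+', 'a b\tc\nd')                    # split on any whitespace
whole_words = re.findall(r'\bcat\b', 'cat concat cat!')  # 'cat' only as a whole word
dots = re.findall(r'\.', 'a.b.c')                        # \. is a literal dot

path = 'C:\\temp'                  # the string itself holds a single backslash
plain = re.findall('\\\\', path)   # four backslashes in a normal string literal
raw = re.findall(r'\\', path)      # two backslashes in a raw string literal

print(digits)       # ['101', '7']
print(words)        # ['hi_there', 'you']
print(parts)        # ['a', 'b', 'c', 'd']
print(whole_words)  # ['cat', 'cat']
print(dots)         # ['.', '.']
print(plain == raw, len(plain))  # True 1
```

Writing every pattern as a raw string avoids having to think about which escapes the interpreter would process first.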

8) Grouping () treats the expression inside the parentheses as a single unit.

You can give a group a name with (?P<name>...), which makes it a named group.

ret=re.search(r'(?P<first_mul>-?\d+\.?\d*)(?:\*|\/)(?P<second_mul>-?\d+\.?\d*)', s)

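In the findall method, groups take priority: if the pattern contains a group, findall returns only the content of the group. To cancel this, make the group non-capturing by adding ?: after the opening parenthesis.

ret = re.findall('p(ython|erl)', 'python,perl')
print(ret)  # ['ython', 'erl']
ret = re.findall('p(?:ython|erl)', 'python,perl')
print(ret)  # ['python', 'perl']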

2. Greedy Matching

Greedy matching means matching the longest possible string that satisfies the pattern. Matching is greedy by default: the quantifiers * + ? {} all match as much as they can.

s = '<div>zhang</div><a href=''></div>'
ret = re.findall('<div>.*</div>', s)
print(ret)  # ['<div>zhang</div><a href=></div>']

For a non-greedy match, add a question mark after the quantifier.

s = '<div>zhang</div><a href=''></div>'
ret = re.findall('<div>.*?</div>', s)
print(ret)  # ['<div>zhang</div>']

.*? is often used in crawlers: .*?x matches characters of any length up to the first occurrence of x.

3. Common Methods

1) re.findall() method

Finds all matches of the pattern and returns them in a list. If nothing matches, an empty list is returned.

ret = re.findall('[PJ]ython', 'Python,Jython')
print(ret)  # ['Python', 'Jython']

2) re.search() method

Finds only the first match and returns a match object containing the result.

Use the .group() method to get the matched string. If nothing matches, search returns None.

ret = re.search('[PJ]ython', 'Python,Jython')
print(ret.group())  # Python
ret = re.search('[AB]ython', 'Python,Jython')
print(ret)  # None
print(ret.group())  # AttributeError: 'NoneType' object has no attribute 'group'

Use .group(group_name) to extract only the content of the named group.

s = '3*2'
ret = re.search(r'(?P<first_mul>-?\d+\.?\d*)(?:\*|\/)(?P<second_mul>-?\d+\.?\d*)', s)
print(ret.group('first_mul'))   # 3
print(ret.group('second_mul'))  # 2

3) re.match() method

Similar to search, except that match only matches at the beginning of the string.
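
The difference between the two methods can be seen in a minimal sketch like this one. The sample string is an assumption made for the example.

```python
import re

s = 'say Python'

searched = re.search('Python', s)  # search looks anywhere in the string
matched = re.match('Python', s)    # match looks only at the beginning
at_start = re.match('say', s)      # 'say' is at the beginning, so this succeeds

print(searched.group())  # Python
print(matched)           # None
print(at_start.group())  # say

# Both methods return None on failure, so check before calling .group()
result = matched.group() if matched is not None else 'no match'
print(result)            # no match
```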

4) re.split() splits a string, using a regular expression as the separator.

s = 'hello124124world3235hi'
ret = re.split(r'\d+', s)
print(ret)  # ['hello', 'world', 'hi']

You can also specify the maximum number of splits.

s = 'hello124124world3235hi'
ret = re.split(r'\d+', s, maxsplit=1)
print(ret)  # ['hello', 'world3235hi']

You can also keep the delimiters in the result by putting the pattern in a capturing group.

s = 'hello124124world3235hi'
ret = re.split(r'(\d+)', s, maxsplit=1)
print(ret)  # ['hello', '124124', 'world3235hi']

5) re.sub() replaces the matches.

ret = re.sub('w.{2,3}d', 'everyone', 'hello wasd')
print(ret)  # hello everyone

6) re.subn() replaces the matches and also returns the number of replacements.

ret = re.subn('w.{2,3}d', 'everyone', 'hello wasd')
print(ret)  # ('hello everyone', 1)

7) re.compile() compiles a pattern into a pattern object. If a pattern is used frequently, you can compile it once and reuse the object instead of writing the pattern in every re call.

obj = re.compile(r'\d+')
ret = obj.findall('agasgagr4t24aga43')
print(ret)  # ['4', '24', '43']

8) re.finditer() returns an iterator of match objects. When there are many matches, putting them all in a list takes a lot of memory. With an iterator, each match is produced only when it is requested.

ret = re.finditer(r'\d+', 'asg67gas6gasd58gasg69asg58asg96g9as6ga78')
print(ret)                # <callable_iterator object at 0x000001D415E3B240>
print(next(ret).group())  # 67
print(next(ret).group())  # 6
print(next(ret).group())  # 58
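
In practice the iterator is usually consumed with a for loop rather than repeated next() calls. A minimal sketch, using a shortened version of the string above and the match object's start() and end() methods to show where each match was found:

```python
import re

s = 'asg67gas6gasd58'

found = []
for m in re.finditer(r'\d+', s):
    # each m is a match object, produced one at a time
    found.append((m.group(), m.start(), m.end()))

print(found)  # [('67', 3, 5), ('6', 8, 9), ('58', 13, 15)]
```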

4. Exercise

Crawl the name, score, number of ratings, and overview of the movies listed on the Douban Top 250 page.

import re
import requests

def getpage():
    # Fetch the first page of the Douban Top 250 list
    response = requests.get('https://movie.douban.com/top250?start=0&filter=')
    return response.text

def run():
    html = getpage()
    # re.S lets . match line breaks, so each .*? can span several lines of HTML
    obj = re.compile(r'<div class="item">.*?<em.*?>(?P<id>\d+)</em>.*?<span class="title">'
                     r'(?P<title>.*?)</span>.*?<span.*?>(?P<grade>\d+\.\d)</span>'
                     r'.*?<span>(?P<evalu>.*?)</span>.*?<span class="inq">(?P<intro>.*?)</span>', re.S)
    ret = obj.findall(html)
    # Write one movie per line: id, title, score, number of ratings, overview
    with open('movie.ini', 'w', encoding='utf-8') as f:
        for movie in ret:
            print(' '.join(movie), file=f)

run()