Regular Expressions for Python Crawlers

Source: Internet
Author: User

1. Regular Expression Overview

A regular expression is a logical formula for operating on strings. Predefined characters, and combinations of them, form a "rule string" that expresses a filtering logic for strings.

Regular expressions are a very powerful tool for matching strings, and most programming languages support them; Python is no exception. In a crawler, we use regular expressions to extract the content we want from the returned page.

The general matching process of a regular expression is as follows:
1. Compare the expression with the characters in the text, one after another.
2. If every character matches, the match succeeds; if any character fails to match, the match fails.
3. If the expression contains quantifiers or boundaries, the process is slightly different.

2. Regular expression syntax rules

 

3. Notes on Regular Expressions

(1) Greedy and non-greedy quantifiers

Regular expressions are usually used to search text for matching strings. In Python, quantifiers are greedy by default (in a few languages they may be non-greedy by default) and always try to match as many characters as possible; non-greedy quantifiers do the opposite and always try to match as few characters as possible. For example, searching "abbbc" with the regular expression "ab*" finds "abbb", while the non-greedy form "ab*?" finds only "a".

Note: We generally use the non-greedy mode for extraction.
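The difference can be checked with a short sketch. The first two calls reproduce the "abbbc" example above; the HTML fragment is a made-up sample, added only to show why the non-greedy form is usually the one wanted for extraction.

```python
import re

text = 'abbbc'

# Greedy: 'b*' takes as many 'b' characters as it can.
greedy = re.search(r'ab*', text).group()

# Non-greedy: 'b*?' takes as few 'b' characters as it can (here, none).
lazy = re.search(r'ab*?', text).group()

# Hypothetical page fragment: extract the text inside each <b> tag.
html = '<b>one</b><b>two</b>'
greedy_items = re.findall(r'<b>(.*)</b>', html)   # runs to the last </b>
lazy_items = re.findall(r'<b>(.*?)</b>', html)    # stops at the first </b>

print(greedy, lazy)
print(greedy_items)
print(lazy_items)
```

The greedy pattern swallows everything up to the last closing tag, so it returns a single item, while the non-greedy pattern returns one item per tag.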

(2) Backslashes

Like most programming languages, regular expressions use "\" as the escape character, which can lead to a pile-up of backslashes. To match a literal "\" in the text, a regular expression written as an ordinary string literal needs four backslashes, "\\\\": the programming language first turns each pair into a single backslash, leaving two backslashes, and the regular expression engine then interprets those two as one literal backslash.

Python's raw strings (also called native strings) solve this problem well. The regular expression in this example can be written as r"\\". Similarly, "\\d", which matches a digit, can be written as r"\d". With raw strings you no longer have to worry about missing a backslash, and the expressions are easier to read.
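As a small sketch of the point above, the ordinary literal and the raw literal below describe the same two-character regular expression. The sample text is an assumption chosen only because it contains one backslash.

```python
import re

# The text itself contains a single backslash: C:\temp
text = 'C:\\temp'

plain = '\\\\'   # ordinary literal: four backslashes typed, two stored
raw = r'\\'      # raw literal: two backslashes typed, two stored
same_pattern = (plain == raw)

# Either form matches exactly one backslash in the text.
found = re.search(raw, text).group()

# r'\d+' is the raw-string way of writing '\\d+'.
digits = re.findall(r'\d+', 'a1b22')

print(same_pattern, len(found), digits)
```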

4. The Python re Module

Python comes with the re module, which provides support for regular expressions. The main methods used are as follows:

 

    # Returns a pattern object
    re.compile(string[, flags])
    # The following are the matching functions
    re.match(pattern, string[, flags])
    re.search(pattern, string[, flags])
    re.split(pattern, string[, maxsplit])
    re.findall(pattern, string[, flags])
    re.finditer(pattern, string[, flags])
    re.sub(pattern, repl, string[, count])
    re.subn(pattern, repl, string[, count])

Before introducing these functions, let's first look at the concept of a pattern. A pattern can be understood as a compiled matching rule. How do we get one? We use the re.compile method. For example:

    pattern = re.compile(r'hello')

We pass in a raw string, re.compile compiles it into a pattern object, and we then use this object for further matching.

You may also have noticed the flags parameter. Here is what it means:

The flags parameter sets the matching mode. Several values can be combined with the bitwise OR operator '|', for example re.I | re.M.

Optional values:

• re.I (full name: IGNORECASE): ignore case (the full name is given in parentheses, likewise below)
• re.M (full name: MULTILINE): multi-line mode; changes the behavior of '^' and '$' so that they also match at the start and end of each line
• re.S (full name: DOTALL): dot-matches-all mode; changes the behavior of '.' so that it also matches a newline
• re.L (full name: LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale
• re.U (full name: UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode
• re.X (full name: VERBOSE): verbose mode; the regular expression can span multiple lines, whitespace is ignored, and comments can be added
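The effect of the most commonly used flags can be seen in a minimal sketch. The sample text and variable names are assumptions made for illustration.

```python
import re

text = 'Hello\nhello world'

# re.I ignores case; re.M lets '^' match at the start of every line.
line_starts = re.findall(r'^hello', text, re.I | re.M)

# Without re.M, '^' matches only at the very start of the string.
only_first = re.findall(r'^hello', text, re.I)

# '.' does not match a newline unless re.S is given.
without_s = re.search(r'Hello.hello', text)
with_s = re.search(r'Hello.hello', text, re.S)

# re.X allows whitespace and comments inside the pattern.
number = re.compile(r'''
    \d+     # integer part
    \.      # decimal point
    \d*     # fractional digits, if any
''', re.X)
number_found = number.search('pi is 3.14').group()

print(line_starts, only_first, without_s is None, with_s is not None, number_found)
```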

We use this pattern in the other functions just mentioned, such as re.match. We will introduce them one by one.

Note: flags in the following seven functions also sets the matching mode. If flags was already specified when the pattern was compiled, it does not need to be passed again.
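A short sketch of this note: the flag is given once at compile time and then travels with the pattern. The last part reflects the behavior of CPython's re module as far as I know it, where repeating flags alongside an already compiled pattern raises ValueError; treat that detail as an assumption to verify on your version.

```python
import re

# The flag is fixed when the pattern is compiled ...
pattern = re.compile(r'hello', re.I)

# ... so the call that uses the pattern needs no flags argument.
result = re.match(pattern, 'HELLO world')
matched = result.group() if result else None

# Passing flags again together with a compiled pattern is rejected.
try:
    re.match(pattern, 'HELLO world', re.I)
    second_flags_accepted = True
except ValueError:
    second_flags_accepted = False

print(matched, second_flags_accepted)
```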

(1) re.match(pattern, string[, flags])

This function tries to match pattern starting at the beginning of string (the text we want to match) and moving forward. If it meets a character that cannot be matched, or reaches the end of string before the pattern is complete, it returns None; both cases mean the match failed. Otherwise the match succeeds, and matching stops as soon as the whole pattern has been matched. The following is an example.

 

    # Import the re module
    import re

    # Compile the regular expression into a Pattern object.
    # The r in front of 'hello' marks a raw string.
    pattern = re.compile(r'hello')

    # Use re.match to match the text. None is returned if the match fails.
    result1 = re.match(pattern, 'hello')
    result2 = re.match(pattern, 'helloo CQC!')
    result3 = re.match(pattern, 'helo CQC!')
    result4 = re.match(pattern, 'hello CQC!')

    # If match 1 succeeded, use the Match object to obtain the group information
    if result1:
        print(result1.group())
    else:
        print('1 match failed!')

    # If match 2 succeeded
    if result2:
        print(result2.group())
    else:
        print('2 match failed!')

    # If match 3 succeeded
    if result3:
        print(result3.group())
    else:
        print('3 match failed!')

    # If match 4 succeeded
    if result4:
        print(result4.group())
    else:
        print('4 match failed!')

Running result:

hello
hello
3 match failed!
hello

Matching analysis

1. First match: the pattern is 'hello' and the target string is also 'hello', so the match succeeds from beginning to end.

2. Second match: the string is 'helloo CQC!'. Matching from the start of the string, the whole pattern can be matched. Once the pattern is exhausted, matching stops, and the trailing 'o CQC!' is not examined. A successful match is returned.

3. Third match: the string is 'helo CQC!'. Matching from the start of the string, the pattern expects a second 'l' where the string has 'o', so the match cannot be completed and None is returned.

4. Fourth match: the same principle as the second match; the space character does not affect the result.

Each branch ends by printing result.group(). What does this mean? The following describes the attributes and methods of the Match object.
A Match object is the result of one match and contains a lot of information about it. You can use the readable attributes and the methods provided by Match to obtain this information.

 

Attributes:
1. string: the text used for matching.
2. re: the Pattern object used for matching.
3. pos: the index in the text at which the regular expression started searching. It equals the pos argument of the Pattern.match() and Pattern.search() methods.
4. endpos: the index in the text at which the regular expression stopped searching. It equals the endpos argument of the Pattern.match() and Pattern.search() methods.
5. lastindex: the index (group number) of the last captured group. If no group was captured, the value is None.
6. lastgroup: the name of the last captured group. If that group has no name, or no group was captured, the value is None.

Methods:
1. group([group1, ...]): returns the string captured by one or more groups. If several arguments are given, the result is a tuple. group1 can be a number or a name; number 0 stands for the entire matched substring, and with no arguments group(0) is returned. A group that captured nothing returns None; a group that captured several times returns the last captured substring.
2. groups([default]): returns the strings captured by all groups as a tuple. It is equivalent to calling group(1, 2, ..., last). default is the value used for groups that captured nothing; it defaults to None.
3. groupdict([default]): returns a dictionary with the names of the named groups as keys and the substrings they captured as values. Groups without names are not included. default has the same meaning as in groups().
4. start([group]): returns the start index, within string, of the substring captured by the specified group (the index of its first character). group defaults to 0.
5. end([group]): returns the end index, within string, of the substring captured by the specified group (the index of its last character + 1). group defaults to 0.
6. span([group]): returns (start(group), end(group)).
7. expand(template): substitutes the captured groups into template and returns the result. In template, a group can be referenced as \id, \g<id>, or \g<name>. The form \0 cannot be used; the whole match is written \g<0>. \id is equivalent to \g<id>, but \10 is read as group 10. To write \1 followed by the character '0', use \g<1>0.

The following is an example.

 

    import re

    # Match the following: word + space + word + any characters
    m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

    print("m.string:", m.string)
    print("m.re:", m.re)
    print("m.pos:", m.pos)
    print("m.endpos:", m.endpos)
    print("m.lastindex:", m.lastindex)
    print("m.lastgroup:", m.lastgroup)
    print("m.group():", m.group())
    print("m.group(1, 2):", m.group(1, 2))
    print("m.groups():", m.groups())
    print("m.groupdict():", m.groupdict())
    print("m.start(2):", m.start(2))
    print("m.end(2):", m.end(2))
    print("m.span(2):", m.span(2))
    print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

Output result:

m.string: hello world!
m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
m.pos: 0
m.endpos: 12
m.lastindex: 3
m.lastgroup: sign
m.group(): hello world!
m.group(1, 2): ('hello', 'world')
m.groups(): ('hello', 'world', '!')
m.groupdict(): {'sign': '!'}
m.start(2): 6
m.end(2): 11
m.span(2): (6, 11)
m.expand(r'\2 \1\3'): world hello!

 

(2) re.search(pattern, string[, flags])

The search function is similar to match. The difference is that match() only checks whether the pattern matches at the start of the string (position 0) and returns None if it does not, whereas search() scans the whole string for the first position where the pattern matches. The object returned by search() has the same attributes and methods as the one returned by match().

    # Import the re module
    import re

    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'world')
    # Use search() to look for a matching substring; None is returned if there is none.
    # match() would fail to match in this example.
    match = re.search(pattern, 'hello world!')
    if match:
        # Use the Match object to obtain the group information
        print(match.group())
    ### Output ###
    # world
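To make the contrast explicit, the following sketch runs both functions on the same text; the variable names are illustrative only.

```python
import re

text = 'hello world!'
pattern = re.compile(r'world')

# match() only tries position 0, where the text has 'h', so it fails.
at_start = re.match(pattern, text)

# search() scans forward and finds 'world' further along.
anywhere = re.search(pattern, text)
position = anywhere.span()

print(at_start, anywhere.group(), position)
```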
(3) re.split(pattern, string[, maxsplit])

Splits string at each substring that matches and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.

 

    import re

    pattern = re.compile(r'\d+')
    print(re.split(pattern, 'one1two2three3four4'))

Output result:

['one', 'two', 'three', 'four', '']
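The maxsplit parameter is not exercised by the example above. A minimal sketch on the same input follows; dropping the empty trailing item is an extra step added here as a suggestion, not part of the original example.

```python
import re

text = 'one1two2three3four4'

# Split at every run of digits.
all_parts = re.split(r'\d+', text)

# Split at most twice; the remainder is returned unsplit as the last item.
limited = re.split(r'\d+', text, maxsplit=2)

# The input ends with a match, so the last item is empty; filter it out if unwanted.
non_empty = [part for part in all_parts if part]

print(all_parts)
print(limited)
print(non_empty)
```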

 

(4) re.findall(pattern, string[, flags])

Searches string and returns all matching substrings as a list.

    import re

    pattern = re.compile(r'\d+')
    print(re.findall(pattern, 'one1two2three3four4'))

Output result:

['1', '2', '3', '4']
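In a crawler, findall is often combined with capturing groups. When the pattern contains groups, findall returns the group contents rather than the whole match: strings for one group, tuples for several. The HTML fragment below is a made-up sample used only to sketch this.

```python
import re

# Hypothetical fragment of a downloaded page
html = '<a href="/page1">First</a> <a href="/page2">Second</a>'

# One group: a list of the captured strings.
hrefs = re.findall(r'href="(.*?)"', html)

# Two groups: a list of (href, link text) tuples.
links = re.findall(r'<a href="(.*?)">(.*?)</a>', html)

print(hrefs)
print(links)
```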

 

(5) re.finditer(pattern, string[, flags])

Returns an iterator that accesses each matching result (Match object) sequentially.

    import re

    pattern = re.compile(r'\d+')
    res = re.finditer(pattern, 'one1two2thr5ee3four4')
    for i in res:
        print(i.group())

Output result:

1
2
5
3
4
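Because finditer yields Match objects rather than plain strings, the position of each match is also available. A small sketch on the same input:

```python
import re

text = 'one1two2thr5ee3four4'

# Collect each matched substring together with its (start, end) span.
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]

for value, span in found:
    print(value, span)
```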

 

(6) re.sub(pattern, repl, string[, count])

Replaces each matching substring in string with repl and returns the resulting string.
When repl is a string, a group can be referenced as \id, \g<id>, or \g<name>; the form \0 cannot be used.
When repl is a function, it should accept a single argument (the Match object) and return the replacement string (group references inside the returned string are not expanded).
count specifies the maximum number of replacements. If it is not given, all matches are replaced.

    import re

    pattern = re.compile(r'(\w+) (\w+)')
    s = 'i say, hello world!'

    print(re.sub(pattern, r'\2 \1', s))

    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()

    print(re.sub(pattern, func, s))

Output result:

say i, world hello!
I Say, Hello World!
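The count parameter and the \g<...> reference forms described above do not appear in the example, so here is a short sketch of them. The group names first and second are hypothetical names chosen for illustration.

```python
import re

s = 'i say, hello world!'

# count=1: only the first match is replaced.
once = re.sub(r'(\w+) (\w+)', r'\2 \1', s, count=1)

# Named groups are referenced as \g<name>.
named = re.sub(r'(?P<first>\w+) (?P<second>\w+)', r'\g<second> \g<first>', s)

# \g<1>0 means group 1 followed by a literal '0'; r'\10' would mean group 10.
padded = re.sub(r'(\d)', r'\g<1>0', 'a1b2')

print(once)
print(named)
print(padded)
```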

 

(7) re.subn(pattern, repl, string[, count])

Returns a tuple (new string, number of replacements), where the new string is what re.sub(pattern, repl, string[, count]) would return.

    import re

    pattern = re.compile(r'(\w+) (\w+)')
    s = 'i say, hello world!'

    print(re.subn(pattern, r'\2 \1', s))

    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()

    print(re.subn(pattern, func, s))

Output result:

('say i, world hello!', 2)
('I Say, Hello World!', 2)

 

5. Another Way to Use the Python re Module

We have introduced seven functions above, such as match and search, all called as re.match, re.search, and so on. There is another way to call them: as methods of the pattern object, such as pattern.match and pattern.search, so that you do not need to pass pattern as the first argument.

Function API list

    Pattern method                    | Module-level function
    match(string[, pos[, endpos]])    | re.match(pattern, string[, flags])
    search(string[, pos[, endpos]])   | re.search(pattern, string[, flags])
    split(string[, maxsplit])         | re.split(pattern, string[, maxsplit])
    findall(string[, pos[, endpos]])  | re.findall(pattern, string[, flags])
    finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags])
    sub(repl, string[, count])        | re.sub(pattern, repl, string[, count])
    subn(repl, string[, count])       | re.subn(pattern, repl, string[, count])
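The pattern methods also accept the pos and endpos arguments listed in the left column, which limit the part of the string that is examined. A minimal sketch, with index values chosen for this particular sample string:

```python
import re

pattern = re.compile(r'\d+')
text = 'one1two2three3four4'

# No pattern argument is needed when calling methods on the compiled object.
from_start = pattern.match(text)           # position 0 is 'o', so no match

# pos=3: matching starts at index 3, where the text has '1'.
from_three = pattern.match(text, 3).group()

# pos=4, endpos=14: only text[4:14] ('two2three3') is examined.
window = pattern.findall(text, 4, 14)

# search() from index 8 onward finds the '3' at index 13.
later = pattern.search(text, 8)
later_value, later_start = later.group(), later.start()

print(from_start, from_three, window, later_value, later_start)
```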
