Regular expression basics for writing Python crawlers

Source: Internet
Author: User
The role of regular expressions in Python crawlers is like the roster an instructor uses for roll call: it is an essential tool.
Before moving on to a small crawler example, however, we should first go through the relevant regular expression features in Python in detail.

I. Regular expression basics
1.1. Introduction to concepts

Regular expressions are powerful tools for processing strings. They are not part of Python itself.
Other programming languages also have regular expressions. A regular expression has its own syntax and an independent processing engine, and the core syntax is the same in every language that provides it.
The difference lies only in how much of that syntax each language's implementation supports.
[Figure: the matching process using a regular expression]

The general matching process of a regular expression is as follows:
1. Compare the expression with the characters in the text one by one, in sequence.
2. If every character matches, the match succeeds; if any character fails to match, the match fails.
3. If the expression contains quantifiers or boundaries, the process is slightly different.
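The steps above can be tried out with a minimal sketch. The pattern and the sample strings are made up for illustration and are not from the original article.

```python
import re

# Hypothetical sample data, chosen only to illustrate the matching steps.
pattern = re.compile(r'abc')

# Every character of the expression matches in sequence: success.
ok = pattern.match('abcdef')

# The third character ('x') does not match 'c': the match fails.
failed = pattern.match('abxdef')

# A quantifier changes the process: 'b+' may consume several characters.
with_quantifier = re.match(r'ab+c', 'abbbc')

print(ok.group())               # abc
print(failed)                   # None
print(with_quantifier.group())  # abbbc
```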

[Table: regular expression metacharacters and syntax supported by Python]
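A few commonly used metacharacters can be exercised with the sketch below. It is only a partial sample, not the full list the article refers to, and every sample string is invented for illustration.

```python
import re

# Each entry: (pattern, text, expected first match). Samples are hypothetical.
samples = [
    (r'a.c',     'abc',      'abc'),    # .     any character except a newline
    (r'\d+',     'tel 4567', '4567'),   # \d    a digit; + means one or more
    (r'\w+',     'hi there', 'hi'),     # \w    a word character
    (r'\s',      'a b',      ' '),      # \s    a whitespace character
    (r'[aeiou]', 'crawler',  'a'),      # [...] a character set
    (r'^py',     'python',   'py'),     # ^     start of the string
    (r'on$',     'python',   'on'),     # $     end of the string
    (r'colou?r', 'color',    'color'),  # ?     zero or one repetition
    (r'\d{3}',   '12345',    '123'),    # {m}   exactly m repetitions
    (r'cat|dog', 'hot dog',  'dog'),    # |     alternation
]

# Search each text with its pattern and collect the first match.
results = [re.search(p, text).group() for p, text, _ in samples]

for (p, text, expected), found in zip(samples, results):
    print(p, repr(text), '->', repr(found))
```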

1.2. Greedy and non-greedy modes of quantifiers

Regular expressions are usually used to search for matching strings in text.
In greedy mode, a quantifier always tries to match as many characters as possible;
in non-greedy mode it does the opposite and always tries to match as few characters as possible.
In Python, quantifiers are greedy by default.
For example, searching "abbbc" with the regular expression "ab*" finds "abbb".
With the non-greedy quantifier, "ab*?", it finds "a".
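The "ab*" example can be checked directly. The HTML fragment in the second half is a hypothetical addition, included because this is where the difference usually matters when writing a crawler.

```python
import re

text = 'abbbc'
greedy = re.search(r'ab*', text).group()    # as many 'b' as possible
lazy = re.search(r'ab*?', text).group()     # as few 'b' as possible

# Hypothetical HTML fragment: greedy '.*' runs to the last closing tag,
# non-greedy '.*?' stops at the first one.
html = '<b>one</b><b>two</b>'
greedy_tag = re.search(r'<b>.*</b>', html).group()
lazy_tag = re.search(r'<b>.*?</b>', html).group()

print(greedy)      # abbb
print(lazy)        # a
print(greedy_tag)  # <b>one</b><b>two</b>
print(lazy_tag)    # <b>one</b>
```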

1.3. Backslash

Like most programming languages, regular expressions use "\" as the escape character, which can lead to a pile-up of backslashes.
To match a literal "\" in the text, an ordinary string literal in the programming language needs four backslashes, "\\\\":
the first and third backslashes escape the second and fourth in the programming language, leaving two backslashes,
and the regular expression engine then treats those two as one escaped backslash that matches a single "\".
This is obviously cumbersome.
Raw strings in Python solve this problem neatly: the regular expression in this example can be written as r"\\".
Similarly, "\\d", which matches a digit, can be written as r"\d".
With raw strings, you no longer need to worry about missing a backslash.
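A short sketch makes the equivalence concrete. The sample path is a made-up value used only to provide a text that contains one backslash.

```python
import re

# Hypothetical text containing a single backslash: C:\temp
text = 'C:\\temp'

normal = '\\\\'   # ordinary string literal: four backslashes in the source
raw = r'\\'       # raw string literal: two backslashes in the source

# Both literals produce the same two-character pattern.
same_pattern = (normal == raw)
pattern_length = len(raw)

# The pattern matches exactly one backslash in the text.
found = re.search(raw, text).group()

# The same holds for \d: the raw form needs no doubling.
same_digit_pattern = (r'\d' == '\\d')

print(same_pattern)        # True
print(pattern_length)      # 2
print(len(found))          # 1
print(same_digit_pattern)  # True
```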

II. Introduction to the re module

2.1. Compile

Python supports regular expressions through the re module.
The general steps for using re are:
Step 1: Compile the string form of the regular expression into a Pattern instance.
Step 2: Use the Pattern instance to process text and obtain a matching result (a Match instance).
Step 3: Use the Match instance to obtain information and perform other operations.
Let's create re01.py to try out re:

The code is as follows:


# -*- coding: utf-8 -*-
# A simple re example that matches the string "hello"
# Import the re module
import re

# Compile the regular expression into a Pattern object.
# The r before 'hello' marks a raw string, as described above.
pattern = re.compile(r'hello')

# Use the Pattern to match the text and obtain the result.
# If the match fails, None is returned.
match1 = pattern.match('hello world!')
match2 = pattern.match('helloo world!')
match3 = pattern.match('helllo world!')

# If match1 succeeded
if match1:
    # Use the Match object to obtain group information
    print(match1.group())
else:
    print('match1 failed!')

# If match2 succeeded
if match2:
    # Use the Match object to obtain group information
    print(match2.group())
else:
    print('match2 failed!')

# If match3 succeeded
if match3:
    # Use the Match object to obtain group information
    print(match3.group())
else:
    print('match3 failed!')

The console prints the three results: hello, hello, and match3 failed! (the third text has an extra "l", so the pattern stops matching at its fifth character).

Next, let's take a look at the key methods in the code.
★ re.compile(strPattern[, flag]):
This is a factory method of the Pattern class. It compiles a regular expression in string form into a Pattern object.
The second parameter, flag, is the matching mode. Several values can take effect at the same time by combining them with the bitwise OR operator '|', such as re.I | re.M.
The mode can also be specified inside the regex string itself;
for example, re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').
Optional values:
re.I (full name: IGNORECASE): case-insensitive matching (the full name is given in parentheses, likewise below)
re.M (full name: MULTILINE): multi-line mode; changes the behavior of '^' and '$' so that they also match at the start and end of each line
re.S (full name: DOTALL): dot-matches-all mode; changes the behavior of '.' so that it also matches a newline
re.L (full name: LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale
re.U (full name: UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode
re.X (full name: VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added.
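The effect of the most common flags can be seen in a small sketch. The two-line sample text is an assumption made for the demonstration.

```python
import re

# Hypothetical two-line text.
text = 'Hello\nworld'

# re.I: 'hello' matches 'Hello' regardless of case.
ignore_case = re.search(r'hello', text, re.I).group()

# re.M: '^' also matches at the start of every line.
multiline = re.findall(r'^\w+', text, re.M)
no_multiline = re.findall(r'^\w+', text)

# re.S: '.' also matches the newline between the two words.
dotall = re.search(r'Hello.world', text, re.S) is not None
no_dotall = re.search(r'Hello.world', text) is not None

# Flags combined with '|' behave like the inline form '(?im)'.
combined = re.findall(r'^WORLD', text, re.I | re.M)
inline = re.findall(r'(?im)^WORLD', text)

print(ignore_case)              # Hello
print(multiline, no_multiline)  # ['Hello', 'world'] ['Hello']
print(dotall, no_dotall)        # True False
print(combined == inline)       # True
```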

The following two regular expressions are equivalent:

The code is as follows:


# -*- coding: utf-8 -*-
# Two equivalent patterns, each matching a decimal number
import re

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

match11 = a.match('3.1415')
match12 = a.match('33')
match21 = b.match('3.1415')
match22 = b.match('33')

if match11:
    # Use the Match object to obtain group information
    print(match11.group())
else:
    print('match11 is not a decimal')

if match12:
    # Use the Match object to obtain group information
    print(match12.group())
else:
    print('match12 is not a decimal')

if match21:
    # Use the Match object to obtain group information
    print(match21.group())
else:
    print('match21 is not a decimal')

if match22:
    # Use the Match object to obtain group information
    print(match22.group())
else:
    print('match22 is not a decimal')

re also provides many module-level functions that perform the same regular expression operations.
Each of them can be replaced by the corresponding method of a Pattern instance. Their only advantage is that you write one less line of re.compile() code;
the drawback is that you do not get a compiled Pattern object to reuse.
These functions are introduced together with the instance methods of the Pattern class below.
For example, the hello example at the beginning can be shortened to:

The code is as follows:


# -*- coding: utf-8 -*-
# A simple re example that matches the string "hello"
import re

m = re.match(r'hello', 'hello world!')
print(m.group())

The re module also provides escape(string), which returns the string with an escape character added before each regular expression metacharacter in it, such as * / + / ?.
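A minimal sketch of why escape() is useful: the keyword below is a hypothetical piece of user input that happens to contain metacharacters, and it only matches literally once it has been escaped.

```python
import re

# Hypothetical keyword and text; '+' and '?' are metacharacters.
keyword = '1+1=2?'
text = 'is 1+1=2? yes'

# Used as-is, '+' and '?' act as quantifiers, so the literal text is not found.
unescaped = re.search(keyword, text)

# After escaping, every character is matched literally.
escaped = re.search(re.escape(keyword), text)

print(unescaped)        # None
print(escaped.group())  # 1+1=2?
```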

2.2. Match

A Match object is the result of one match and contains a lot of information about it. You can use the read-only attributes and the methods provided by Match to obtain this information.
Attributes:
string: the text used for matching.
re: the Pattern object used for matching.
pos: the index in the text at which the regular expression starts searching. Its value is that of the same-named parameter of Pattern.match() and Pattern.search().
endpos: the index in the text at which the regular expression stops searching. Its value is that of the same-named parameter of Pattern.match() and Pattern.search().
lastindex: the index of the last captured group. If no group was captured, the value is None.
lastgroup: the name of the last captured group. If that group has no name, or no group was captured, the value is None.
Methods:
group([group1, ...]):
Returns the string captured by one or more groups; when several arguments are given, the result is a tuple. group1 can be a number or a name. Number 0 stands for the entire matched substring, and with no arguments group(0) is returned. A group that captured nothing returns None; a group that captured several times returns the last captured substring.
groups([default]):
Returns the strings captured by all groups as a tuple. It is equivalent to calling group(1, 2, ..., last). default is the value used in place of groups that captured nothing; it defaults to None.
groupdict([default]):
Returns a dictionary whose keys are the names of the named groups and whose values are the substrings they captured. Groups without a name are not included. default has the same meaning as in groups().
start([group]):
Returns the starting index in string of the substring captured by the specified group (the index of the first character of the substring). The default value of group is 0.
end([group]):
Returns the ending index in string of the substring captured by the specified group (the index of the last character of the substring + 1). The default value of group is 0.
span([group]):
Returns (start(group), end(group)).
expand(template):
Substitutes the matched groups into template and returns the result. template can reference groups with \id, \g<id>, or \g<name>; the \id form cannot use number 0. \id and \g<id> are equivalent, but \10 is treated as the 10th group; to express \1 followed by the character '0', you must write \g<1>0.
Next we will use a small script that prints all of these values, for better understanding:

The code is as follows:


# -*- coding: utf-8 -*-
# A simple Match example

import re
# Match the following content: word + space + word + any characters
m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

print("m.string:", m.string)
print("m.re:", m.re)
print("m.pos:", m.pos)
print("m.endpos:", m.endpos)
print("m.lastindex:", m.lastindex)
print("m.lastgroup:", m.lastgroup)

print("m.group():", m.group())
print("m.group(1, 2):", m.group(1, 2))
print("m.groups():", m.groups())
print("m.groupdict():", m.groupdict())
print("m.start(2):", m.start(2))
print("m.end(2):", m.end(2))
print("m.span(2):", m.span(2))
print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

### Output ###
# m.string: hello world!
# m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
# m.pos: 0
# m.endpos: 12
# m.lastindex: 3
# m.lastgroup: sign
# m.group(): hello world!
# m.group(1, 2): ('hello', 'world')
# m.groups(): ('hello', 'world', '!')
# m.groupdict(): {'sign': '!'}
# m.start(2): 6
# m.end(2): 11
# m.span(2): (6, 11)
# m.expand(r'\2 \1\3'): world hello!
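Two points from the method descriptions are easy to miss: named groups can be read by name as well as by number, and \g<1>0 is needed when a group reference is followed by a digit. The sketch below reuses the same pattern and text; the variable names are chosen here for illustration.

```python
import re

m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

# A named group can be read by name or by its number.
by_name = m.group('sign')
by_number = m.group(3)

# \g<1>0 means "group 1 followed by the character 0".
followed_by_zero = m.expand(r'\g<1>0')

# \10 is read as group 10, which does not exist in this pattern,
# so the re module raises an error.
try:
    m.expand(r'\10')
    tenth_group_error = None
except re.error as exc:
    tenth_group_error = str(exc)

print(by_name, by_number)             # ! !
print(followed_by_zero)               # hello0
print(tenth_group_error is not None)  # True
```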

2.3. Pattern
A Pattern object is a compiled regular expression. You can use the methods provided by Pattern to match and search text.
Pattern cannot be instantiated directly; it must be constructed with re.compile(), that is, it is the object returned by re.compile().
Pattern provides several read-only attributes for obtaining information about the expression:
pattern: the expression string used for compilation.
flags: the matching mode used during compilation, as an integer.
groups: the number of groups in the expression.
groupindex: a dictionary whose keys are the names of the named groups in the expression and whose values are the numbers of those groups. Groups without a name are not included.
You can use the following example to view the attributes of a Pattern:

The code is as follows:


# -*- coding: utf-8 -*-
# A simple Pattern example

import re
p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)

print("p.pattern:", p.pattern)
print("p.flags:", p.flags)
print("p.groups:", p.groups)
print("p.groupindex:", dict(p.groupindex))

### Output ###
# p.pattern: (\w+) (\w+)(?P<sign>.*)
# p.flags: 48   (in Python 3: re.DOTALL is 16, plus re.UNICODE, 32, which is implied for str patterns)
# p.groups: 3
# p.groupindex: {'sign': 3}

The following describes the instance methods of Pattern and how to use them.

1. match

match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):
This method tries to match pattern starting at index pos of string.
If the end of pattern is reached and everything has matched, a Match object is returned;
if pattern fails to match at some point, or endpos is reached before pattern has been fully matched, None is returned.
The default values of pos and endpos are 0 and len(string), respectively.
re.match() cannot take these two parameters; its flags parameter specifies the matching mode used when compiling pattern.
Note: this method does not perform a full match.
If string still has characters left when pattern ends, the match is still considered successful.
To require a full match, add the boundary matcher '$' at the end of the expression.
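The note about partial matching, and the pos and endpos parameters, can be checked with a short sketch. The sample strings are assumptions made for the demonstration.

```python
import re

pattern = re.compile(r'hello')

# match() succeeds even though ' world!' is left over.
prefix_only = pattern.match('hello world!')

# With '$' appended, leftover characters make the match fail.
anchored = re.match(r'hello$', 'hello world!')
exact = re.match(r'hello$', 'hello')

# pos: start matching at index 4 instead of 0.
from_pos = pattern.match('say hello', 4)

# endpos: only 'hel' is visible, so the pattern cannot complete.
cut_short = pattern.match('hello', 0, 3)

print(prefix_only.group())  # hello
print(anchored)             # None
print(exact.group())        # hello
print(from_pos.span())      # (4, 9)
print(cut_short)            # None
```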
Here is a simple example of match():

The code is as follows:


# -*- coding: utf-8 -*-
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'hello')

# Use the Pattern to match the text and obtain the result.
# If the match fails, None is returned.
match = pattern.match('hello world!')

if match:
    # Use the Match object to obtain group information
    print(match.group())

### Output ###
# hello

2. search

search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):
This method searches string for a substring that matches successfully.
It tries to match pattern starting at index pos of string.
If the end of pattern is reached and everything has matched, a Match object is returned;
if the match fails, pos is increased by 1 and the match is tried again;
if there is still no match when pos reaches endpos, None is returned.
The default values of pos and endpos are 0 and len(string), respectively.
re.search() cannot take these two parameters; its flags parameter specifies the matching mode used when compiling pattern.
So what is the difference between search() and match()?
match() only checks whether the expression matches at the starting position of the string,
whereas search() scans the entire string for a match.

match() returns a result only when the match succeeds at position 0; if the match does not succeed at the starting position, match() returns None.
For example:
print(re.match('super', 'superstition').span())
prints (0, 5), while
print(re.match('super', 'insuperable'))
prints None.

search() scans the entire string and returns the first successful match.
For example:
print(re.search('super', 'superstition').span())
prints (0, 5), and
print(re.search('super', 'insuperable').span())
prints (2, 7).
Here is a search() example:

The code is as follows:


# -*- coding: utf-8 -*-
# A simple search example

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'world')

# Use search() to find a matching substring. If none exists, None is returned.
# match() would not succeed in this example.
match = pattern.search('hello world!')

if match:
    # Use the Match object to obtain group information
    print(match.group())

### Output ###
# world

3. split
split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):
Splits string at each matched substring and returns the pieces as a list.
maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.

The code is as follows:


import re

p = re.compile(r'\d+')
print(p.split('one1two2three3four4'))

### Output ###
# ['one', 'two', 'three', 'four', '']
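The maxsplit parameter is not shown in the example above. A minimal sketch, reusing the same pattern and text, with variable names chosen here for illustration:

```python
import re

p = re.compile(r'\d+')
text = 'one1two2three3four4'

# Without maxsplit the string is split at every match; the trailing
# empty string comes from the match at the very end of the text.
all_parts = p.split(text)

# With maxsplit=2 only the first two matches are used as separators.
two_splits = p.split(text, maxsplit=2)

# Empty pieces can be filtered out afterwards if they are not wanted.
non_empty = [part for part in all_parts if part]

print(all_parts)   # ['one', 'two', 'three', 'four', '']
print(two_splits)  # ['one', 'two', 'three3four4']
print(non_empty)   # ['one', 'two', 'three', 'four']
```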

4. findall
findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):
Searches string and returns all matching substrings as a list.

The code is as follows:


import re

p = re.compile(r'\d+')
print(p.findall('one1two2three3four4'))

### Output ###
# ['1', '2', '3', '4']
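In a crawler, findall() is typically used to pull many values out of a page at once. The sketch below assumes a hypothetical HTML fragment with hypothetical URLs. It also relies on a behavior the text above does not mention: when the pattern contains groups, findall() returns the group contents rather than the whole match.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded page.
html = '''
<a href="http://example.com/page1">First</a>
<a href="http://example.com/page2">Second</a>
'''

# One group: findall() returns the text captured by that group.
links = re.findall(r'<a href="(.*?)">', html)

# Several groups: findall() returns one tuple per match.
pairs = re.findall(r'<a href="(.*?)">(.*?)</a>', html)

print(links)
print(pairs)
```

The non-greedy ".*?" keeps each capture from running past the first closing quote or tag.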

5. finditer
finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):
Returns an iterator that yields each matching result (a Match object) in order.

The code is as follows:


import re

p = re.compile(r'\d+')
for m in p.finditer('one1two2three3four4'):
    print(m.group(), end=' ')

### Output ###
# 1 2 3 4
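Because finditer() yields Match objects rather than plain strings, the position of every match is available as well. A minimal sketch with the same pattern and text:

```python
import re

p = re.compile(r'\d+')
text = 'one1two2three3four4'

# Collect the matched text together with its (start, end) position.
positions = [(m.group(), m.span()) for m in p.finditer(text)]

for digit, span in positions:
    print(digit, span)
```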

6. sub

sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):
Replaces each matched substring in string with repl and returns the resulting string.
When repl is a string, it can reference groups with \id, \g<id>, or \g<name>; the \id form cannot use number 0.
When repl is a function, it should accept a single argument (a Match object) and return the replacement string (group references in the returned string are not expanded).
count specifies the maximum number of replacements; if it is not given, all matches are replaced.

The code is as follows:


import re

p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

print(p.sub(r'\2 \1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(p.sub(func, s))

### Output ###
# say i, world hello!
# I Say, Hello World!
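The count parameter and the \g<...> reference forms can be sketched as follows. The named-group pattern p_named and the bracket formatting are illustrative additions, not part of the original example; \g<0> refers to the entire match.

```python
import re

p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

# count=1: only the first match is replaced.
once = p.sub(r'\2 \1', s, count=1)

# \g<0> inserts the whole matched substring.
whole = p.sub(r'[\g<0>]', s)

# Named groups can be referenced with \g<name>.
p_named = re.compile(r'(?P<first>\w+) (?P<second>\w+)')
named = p_named.sub(r'\g<second> \g<first>', s)

print(once)   # say i, hello world!
print(whole)  # [i say], [hello world]!
print(named)  # say i, world hello!
```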

7. subn
subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):
Returns (sub(repl, string[, count]), number of replacements).

The code is as follows:


import re

p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

print(p.subn(r'\2 \1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(p.subn(func, s))

### Output ###
# ('say i, world hello!', 2)
# ('I Say, Hello World!', 2)

That concludes this basic introduction to regular expressions in Python. They are simple and practical, and I hope this helps you. ^_^
