Writing a Python Crawler from Scratch: Regular Expressions, the Essential Tool

Last Update:2016-06-06 Source: Internet

Author: User


The next step is to build a small example crawler for Qiushibaike.
But before doing that, here is a detailed summary of regular expressions in Python.
In a Python crawler, regular expressions play the role of the roster a teacher uses at roll-call: they are an indispensable tool.

I. Regular expression basics
1.1. Concept Introduction

A regular expression is a powerful tool for working with strings, and it is not unique to Python.
Other programming languages also have regular expressions; they differ only in how much of the syntax each implementation supports.
Regular expressions have their own syntax and a separate processing engine, and the core syntax is the same in every language that provides them.
[Figure: the process of matching using regular expressions]

The approximate matching process for regular expressions is:
1. Take each element of the expression in turn and compare it with the characters in the text.
2. If every character matches, the match succeeds; as soon as one comparison fails, the match fails.
3. If the expression contains quantifiers or boundaries, the process is slightly different.
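
Steps 1 and 2 can be sketched for the simplest case, a pattern made only of literal characters. This is a minimal illustration written for Python 3, not how the re engine is actually implemented; literal_match is a hypothetical helper name introduced here, and it deliberately ignores quantifiers and boundaries (step 3).

```python
def literal_match(pattern, text):
    """Hypothetical helper: match a pattern of literal characters
    against the start of text, one character at a time."""
    for i, expected in enumerate(pattern):
        # Fail as soon as the text runs out or one character differs.
        if i >= len(text) or text[i] != expected:
            return False
    # Every character of the pattern was matched.
    return True


print(literal_match("hello", "hello world!"))   # True
print(literal_match("hello", "helllo world!"))  # False
```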

[Figure: table of the regular expression metacharacters and syntax supported by Python]

1.2. Greedy and non-greedy quantifiers

Regular expressions are typically used to find matching strings in text.
A greedy quantifier always tries to match as many characters as possible;
a non-greedy quantifier does the opposite and always tries to match as few characters as possible.
Quantifiers in Python are greedy by default.
For example, the regular expression "ab*" will find "abbb" when it is used to search "abbbc".
With the non-greedy quantifier "ab*?", it will find "a".
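
The difference can be checked directly. The short sketch below (Python 3; the variable names are only for illustration) runs both quantifiers against the same text.

```python
import re

text = "abbbc"

# Greedy: b* takes as many 'b' characters as it can.
greedy = re.search(r"ab*", text).group()
# Non-greedy: b*? takes as few as it can, which here is none.
lazy = re.search(r"ab*?", text).group()

print(greedy)  # abbb
print(lazy)    # a
```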

1.3. The backslash problem

As in most programming languages, "\" is the escape character in regular expressions, which can lead to a plague of backslashes.
If you need to match the character "\" in the text, the regular expression written as an ordinary string literal needs 4 backslashes, "\\\\":
the first and third escape the second and fourth at the programming-language level,
producing two backslashes "\\", which the regular expression engine then reads as one escaped backslash that matches a literal "\".
This is obviously very tedious.
Python's raw strings solve this problem neatly: the regular expression in this example can be written as r"\\".
Similarly, "\\d", which matches a digit, can be written as r"\d".
With raw strings, you no longer have to worry about missing a backslash.
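
As a small check of the two escaping levels, the sketch below (Python 3; the sample text is made up for illustration) compares the ordinary and raw spellings of the same pattern.

```python
import re

text = "C:\\temp"     # the text itself contains ONE backslash: C:\temp
plain = "\\\\"        # ordinary literal: 4 typed backslashes become 2 characters
raw = r"\\"           # raw literal: 2 typed backslashes stay 2 characters

print(plain == raw)   # True
print(len(raw))       # 2

# The two-character pattern matches the single backslash in the text.
found = re.search(raw, text).group()
print(len(found))     # 1

# r"\d" and "\\d" are the same pattern.
print(re.findall(r"\d", "a1b22") == re.findall("\\d", "a1b22"))  # True
```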

II. Introduction to the re module

2.1. Compile

Python supports regular expressions through the re module.
The general steps for using re are:
Step 1: compile the string form of the regular expression into a Pattern instance.
Step 2: use the Pattern instance to process the text and obtain the matching result (a Match instance).
Step 3: use the Match instance to get information and carry out other operations.
Let's create a new file, re01.py, to try out re:

The code is as follows:

# -*- coding: utf-8 -*-
# A simple re example: match the string "hello" at the start of a string
# Import the re module
import re
# Compile the regular expression into a Pattern object; the r before 'hello' marks a raw string
pattern = re.compile(r'hello')
# Match text with the Pattern; match() returns None when there is no match
match1 = pattern.match('hello world!')
match2 = pattern.match('helloo world!')
match3 = pattern.match('helllo world!')
# If match1 matched
if match1:
    # Use the Match object to get the matched text
    print(match1.group())
else:
    print('match1 match failed!')
# If match2 matched
if match2:
    # Use the Match object to get the matched text
    print(match2.group())
else:
    print('match2 match failed!')
# If match3 matched
if match3:
    # Use the Match object to get the matched text
    print(match3.group())
else:
    print('match3 match failed!')

The console shows the three results: the first two calls print hello, and the third reports a failure because 'helllo world!' does not start with 'hello'.

Here is a look at the key methods in the code.
★ re.compile(strPattern[, flag]):
This function is the factory method of the Pattern class; it compiles a regular expression given as a string into a Pattern object.
The second parameter, flag, is the matching mode; values can be combined with the bitwise OR operator '|' so that they take effect together, for example re.I | re.M.
Alternatively, you can specify the mode inside the regex string itself:
re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').
The available values are:
re.I (full name: IGNORECASE): ignore case (the full name is given in parentheses, likewise below)
re.M (full name: MULTILINE): multiline mode, changes the behavior of '^' and '$' (see the syntax table above)
re.S (full name: DOTALL): dot-matches-all mode, changes the behavior of '.'
re.L (full name: LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale setting
re.U (full name: UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode
re.X (full name: VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added.
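
The equivalence between passing flags and writing them inline can be tried out as follows. This is a small Python 3 sketch; the sample text and variable names are assumptions made for the illustration.

```python
import re

text = "first line\nhello again"

with_flags = re.compile(r"^HELLO", re.I | re.M)   # flags passed as an argument
inline = re.compile(r"(?im)^HELLO")               # the same flags written inline

print(with_flags.search(text).group())   # hello
print(inline.search(text).group())       # hello
# Without the flags, '^' only matches at the very start and case matters.
print(re.search(r"^HELLO", text))        # None

# re.S lets '.' match a newline as well.
print(re.match(r"a.b", "a\nb"))                          # None
print(re.match(r"a.b", "a\nb", re.S).group() == "a\nb")  # True
```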

The following two regular expressions are equivalent:

The code is as follows:

# -*- coding: utf-8 -*-
# Two equivalent patterns that match a decimal number
import re
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
match11 = a.match('3.1415')
match12 = a.match('33')
match21 = b.match('3.1415')
match22 = b.match('33')
if match11:
    # Use the Match object to get the matched text
    print(match11.group())
else:
    print('match11 is not a decimal')
if match12:
    # Use the Match object to get the matched text
    print(match12.group())
else:
    print('match12 is not a decimal')
if match21:
    # Use the Match object to get the matched text
    print(match21.group())
else:
    print('match21 is not a decimal')
if match22:
    # Use the Match object to get the matched text
    print(match22.group())
else:
    print('match22 is not a decimal')

The re module also provides a number of module-level functions that perform the same regular expression operations.
These functions can be used in place of the corresponding methods of a Pattern instance; their only advantage is saving one line of re.compile() code,
but the compiled Pattern object cannot then be reused.
These functions are described together with the instance methods of the Pattern class below.
The first hello example can be shortened to:

The code is as follows:

# -*- coding: utf-8 -*-
# A simple re example: match the string "hello" at the start of a string
import re

m = re.match(r'hello', 'hello world!')
print(m.group())

The re module also provides an escape(string) function, which returns string with regular expression metacharacters such as * / + / ? escaped by a preceding backslash.
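
A quick sketch of why escape() is useful (Python 3; the sample string is made up): used unescaped, a string that contains metacharacters does not even match itself.

```python
import re

literal = "1+1=2?"
escaped = re.escape(literal)

# Unescaped, '+' and '?' act as quantifiers, so the pattern
# "1+1=2?" looks for something like "11=" and finds nothing here.
print(re.search(literal, literal))           # None

# Escaped, every metacharacter is treated as an ordinary character.
print(re.search(escaped, literal).group())   # 1+1=2?
```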

2.2. Match

A Match object is the result of one match. It contains a lot of information about that match, which can be read through the read-only attributes and methods that Match provides.
Attributes:
string: the text used for matching.
re: the Pattern object used for matching.
pos: the index in the text at which the regular expression starts searching. The value is the same as the parameter of the same name of the Pattern.match() and Pattern.search() methods.
endpos: the index in the text at which the regular expression stops searching. The value is the same as the parameter of the same name of the Pattern.match() and Pattern.search() methods.
lastindex: the number of the last captured group. None if no group was captured.
lastgroup: the alias of the last captured group. None if that group has no alias or no group was captured.
Methods:
group([group1, ...]):
Returns the string captured by one or more groups; a tuple is returned when several arguments are given. group1 can be a number or an alias; number 0 stands for the whole matched substring; with no arguments, group(0) is returned. A group that captured nothing returns None; a group that captured several times returns the last substring it captured.
groups([default]):
Returns the strings captured by all groups as a tuple, equivalent to calling group(1, 2, ..., last). default is the value substituted for groups that captured nothing; it defaults to None.
groupdict([default]):
Returns a dictionary whose keys are the aliases of the groups that have one and whose values are the substrings those groups captured; groups without an alias are not included. default has the same meaning as above.
start([group]):
Returns the start index in string of the substring captured by the given group (the index of the first character of the substring). group defaults to 0.
end([group]):
Returns the end index in string of the substring captured by the given group (the index of the last character of the substring + 1). group defaults to 0.
span([group]):
Returns (start(group), end(group)).
expand(template):
Substitutes the matched groups into template and returns the result. In template you can refer to groups with \id, \g<id> or \g<name>; \0 cannot be used (\g<0> refers to the whole match). \id and \g<id> are equivalent, but \10 is taken to mean the 10th group; to express \1 followed by the character '0', you must use \g<1>0.
The following Python example prints all of these to deepen understanding:

The code is as follows:

# -*- coding: utf-8 -*-
# A simple Match example

import re
# Matches: word + space + word + any characters
m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

print("m.string:", m.string)
print("m.re:", m.re)
print("m.pos:", m.pos)
print("m.endpos:", m.endpos)
print("m.lastindex:", m.lastindex)
print("m.lastgroup:", m.lastgroup)

print("m.group():", m.group())
print("m.group(1, 2):", m.group(1, 2))
print("m.groups():", m.groups())
print("m.groupdict():", m.groupdict())
print("m.start(2):", m.start(2))
print("m.end(2):", m.end(2))
print("m.span(2):", m.span(2))
print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

### output ###
# m.string: hello world!
# m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
# m.pos: 0
# m.endpos: 12
# m.lastindex: 3
# m.lastgroup: sign
# m.group(): hello world!
# m.group(1, 2): ('hello', 'world')
# m.groups(): ('hello', 'world', '!')
# m.groupdict(): {'sign': '!'}
# m.start(2): 6
# m.end(2): 11
# m.span(2): (6, 11)
# m.expand(r'\2 \1\3'): world hello!
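
Three of the finer points from the method descriptions (groups that capture nothing, groups that capture several times, and \g<1>0 versus \10) are not exercised by the example above. The following Python 3 sketch covers them; the patterns and variable names are made up for the illustration.

```python
import re

# A group that takes no part in the match yields None (or the given default).
optional = re.match(r"([a-z]+)(\d)?", "hello")
print(optional.group(2))        # None
print(optional.groups())        # ('hello', None)
print(optional.groups("n/a"))   # ('hello', 'n/a')

# A group that captures several times keeps only its last capture.
repeated = re.match(r"(?:(\w)-)+", "a-b-c-")
print(repeated.group(1))        # c

# \g<1>0 means "group 1, then the character 0"; \10 would mean group 10.
m = re.match(r"(\w+) (\w+)", "hello world")
print(m.expand(r"\g<1>0"))      # hello0
try:
    m.expand(r"\10")
    ambiguous_failed = False
except (re.error, IndexError):
    # There is no group 10 in this pattern, so the reference is rejected.
    ambiguous_failed = True
print(ambiguous_failed)         # True
```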

2.3. Pattern
A Pattern object is a compiled regular expression; the methods it provides can be used to match and search text.
Pattern cannot be instantiated directly; it must be constructed with re.compile(), that is, it is the object returned by re.compile().
Pattern provides several read-only attributes for getting information about the expression:
pattern: the expression string used at compile time.
flags: the matching mode used at compile time, in numeric form.
groups: the number of groups in the expression.
groupindex: a dictionary whose keys are the aliases of the groups that have one and whose values are the numbers of those groups; groups without an alias are not included.
You can use the following example to view the attributes of a Pattern:

The code is as follows:

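# -*- coding: utf-8 -*-
# A simple Pattern example

import re
p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)

print("p.pattern:", p.pattern)
print("p.flags:", p.flags)
print("p.groups:", p.groups)
# groupindex is a read-only mapping in Python 3; dict() gives a plain dictionary to print
print("p.groupindex:", dict(p.groupindex))

### output ###
# p.pattern: (\w+) (\w+)(?P<sign>.*)
# p.flags: 48  (re.DOTALL | re.UNICODE in Python 3; Python 2 prints 16)
# p.groups: 3
# p.groupindex: {'sign': 3}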

The following focuses on the instance methods of Pattern and how to use them.

1.match

match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):
This method tries to match pattern starting at index pos of string.
If pattern has been matched all the way to its end, a Match object is returned;
if pattern fails to match along the way, or endpos is reached before the match is complete, None is returned.
The default values of pos and endpos are 0 and len(string) respectively;
re.match() cannot specify these two parameters; its flags parameter specifies the matching mode used when compiling pattern.
Note: this method does not require a full match.
If string still has characters left when pattern ends, the match is still considered successful.
If you want a full match, add the boundary matcher '$' at the end of the expression.
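
The effect of pos, endpos and a trailing '$' can be sketched as follows (Python 3; the sample strings and variable names are assumptions made for the illustration).

```python
import re

pattern = re.compile(r"hello")

# Trailing text after the pattern is allowed.
print(pattern.match("hello world!").group())   # hello
# pos=4: matching starts at index 4 of the string.
print(pattern.match("say hello", 4).group())   # hello
# endpos=3: only 'hel' is visible, so the match cannot complete.
print(pattern.match("hello", 0, 3))            # None

# Adding '$' requires the text to end where the pattern ends.
exact = re.compile(r"hello$")
print(exact.match("hello world!"))             # None
print(exact.match("hello").group())            # hello
```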
Let's look at a simple case of match:

The code is as follows:

# encoding: utf-8
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'hello')

# Match text with the Pattern; match() returns None when there is no match
match = pattern.match('hello world!')

if match:
    # Use the Match object to get the matched text
    print(match.group())

### output ###
# hello

2.search

search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):
This method looks for a substring of string that matches successfully.
It tries to match pattern starting at index pos of string;
if pattern has been matched all the way to its end, a Match object is returned;
if the match fails, pos is increased by 1 and the match is tried again;
None is returned if there is still no match when pos reaches endpos.
The default values of pos and endpos are 0 and len(string) respectively;
re.search() cannot specify these two parameters; its flags parameter specifies the matching mode used when compiling pattern.
So what is the difference between it and match()?
match() only checks whether the RE matches at the start of the string,
while search() scans the whole string for a match.

match() succeeds only if the match starts at position 0; if the match does not start at the beginning, match() returns None.
For example:
print(re.match('super', 'superstition').span())
prints (0, 5), while
print(re.match('super', 'insuperable'))
prints None.

search() scans the whole string and returns the first successful match.
For example:
print(re.search('super', 'superstition').span())
prints (0, 5), and
print(re.search('super', 'insuperable').span())
prints (2, 7).
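
The "try at pos, then at pos + 1, and so on" description of search() can be emulated with match() and an increasing pos. This is only a conceptual sketch in Python 3, not how the re engine is implemented; search_by_retrying is a hypothetical helper name introduced here.

```python
import re


def search_by_retrying(pattern, string, pos=0, endpos=None):
    """Hypothetical helper: emulate search() by retrying match()
    at each successive start position."""
    if endpos is None:
        endpos = len(string)
    while pos <= endpos:
        m = pattern.match(string, pos, endpos)
        if m:
            return m
        pos += 1   # no match here, move one character to the right
    return None


p = re.compile(r"super")
print(search_by_retrying(p, "insuperable").span())  # (2, 7)
print(p.search("insuperable").span())               # (2, 7)
print(search_by_retrying(p, "hello"))               # None
```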
Look at an example of search:

The code is as follows:

# -*- coding: utf-8 -*-
# A simple search example

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'world')

# Use search() to find a matching substring; it returns None when there is none
# Using match() in this example would not match
match = pattern.search('hello world!')

if match:
    # Use the Match object to get the matched text
    print(match.group())

### output ###
# world

3.split
split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):
Splits string at every substring that matches and returns the pieces as a list.
maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.

The code is as follows:

import re

p = re.compile(r'\d+')
print(p.split('one1two2three3four4'))

### output ###
# ['one', 'two', 'three', 'four', '']
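
The example above does not use maxsplit; a short Python 3 sketch of its effect on the same input:

```python
import re

p = re.compile(r"\d+")
s = "one1two2three3four4"

# At most 2 splits: the rest of the string is returned unsplit.
print(p.split(s, 2))                      # ['one', 'two', 'three3four4']
# The module-level function accepts maxsplit as well.
print(re.split(r"\d+", s, maxsplit=1))    # ['one', 'two2three3four4']
```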

4.findall
findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):
Searches string and returns all matching substrings as a list.

The code is as follows:

import re

p = re.compile(r'\d+')
print(p.findall('one1two2three3four4'))

### output ###
# ['1', '2', '3', '4']
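
The optional pos and endpos arguments of the Pattern method restrict the part of the string that is searched. A brief Python 3 sketch on the same input:

```python
import re

p = re.compile(r"\d+")
s = "one1two2three3four4"

print(p.findall(s))         # ['1', '2', '3', '4']
# Start searching at index 4, just after 'one1'.
print(p.findall(s, 4))      # ['2', '3', '4']
# Search only the first 8 characters, 'one1two2'.
print(p.findall(s, 0, 8))   # ['1', '2']
```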

5.finditer
finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):
Searches string and returns an iterator that yields each match result (a Match object) in order.

The code is as follows:

import re

p = re.compile(r'\d+')
for m in p.finditer('one1two2three3four4'):
    print(m.group(), end=' ')

### output ###
# 1 2 3 4
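
Because finditer() yields Match objects rather than plain strings, position information is available for every match. A small Python 3 sketch (the variable name found is only for illustration):

```python
import re

p = re.compile(r"\d+")

# Collect the matched text together with its (start, end) span.
found = [(m.group(), m.span()) for m in p.finditer("one1two2three3four4")]
print(found)
# [('1', (3, 4)), ('2', (7, 8)), ('3', (13, 14)), ('4', (18, 19))]
```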

6.sub

sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):
Replaces each matching substring in string with repl and returns the resulting string.
When repl is a string, it can refer to groups with \id, \g<id> or \g<name>; \0 cannot be used (\g<0> refers to the whole match).
When repl is a function, it should accept a single argument (the Match object) and return the replacement string (group references in the returned string are not expanded).
count specifies the maximum number of replacements; if it is not given, all matches are replaced.

The code is as follows:

import re

p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

print(p.sub(r'\2 \1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(p.sub(func, s))

### output ###
# say i, world hello!
# I Say, Hello World!
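
The named-group reference and the count parameter described above are not used in that example. The Python 3 sketch below shows both; the group names first and second are made up for the illustration.

```python
import re

p = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
s = "i say, hello world!"

# Refer to groups by alias with \g<name>.
print(p.sub(r"\g<second> \g<first>", s))        # say i, world hello!
# count=1: only the first match is replaced.
print(p.sub(r"\g<2> \g<1>", s, count=1))        # say i, hello world!
# A function receives the Match object and returns the replacement.
print(p.sub(lambda m: m.group(0).upper(), s))   # I SAY, HELLO WORLD!
```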

7.subn
subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):
Returns (sub(repl, string[, count]), number of replacements).

The code is as follows:

import re

p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

print(p.subn(r'\2 \1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(p.subn(func, s))

### output ###
# ('say i, world hello!', 2)
# ('I Say, Hello World!', 2)

That concludes this basic introduction to regular expressions, an essential tool for Python crawlers. It is simple and practical, and I hope it is of some help ^_^



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
