Regular Expressions for Python Crawlers

Source: Internet
Author: User
When a page returns a large amount of messy text mixed with code, how can we extract and organize the parts we need? Let's start with a very powerful tool: regular expressions.

1. Understanding Regular Expressions

A regular expression is a logical formula for operating on strings: a "rule string" built from predefined special characters and combinations of them, used to express filtering logic for strings.

Regular expressions are a very powerful tool for matching strings. Many programming languages support them, and Python is no exception. In a crawler, we use regular expressions to extract the content we want from the returned page.

The approximate matching process for a regular expression is:

    • Take each element of the expression in turn and compare it with the characters in the text.
    • If every character matches, the match succeeds; as soon as one character fails to match, the match fails.
    • If the expression contains quantifiers or boundaries, the process is slightly different.
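
The basic process in this list can be sketched in plain Python for the simplest case, a pattern made only of literal characters. The function name `literal_match_at_start` is hypothetical, and this is not how the `re` module is implemented internally; it only illustrates the character-by-character comparison and checks it against `re.match`.

```python
import re


def literal_match_at_start(pattern, text):
    """Compare pattern and text one character at a time (literals only)."""
    for index, expected in enumerate(pattern):
        # Fail when the text runs out or a character differs.
        if index >= len(text) or text[index] != expected:
            return False
    # Every pattern character matched, so the match succeeds.
    return True


for text in ("hello", "helloo CQC!", "helo CQC!"):
    ours = literal_match_at_start("hello", text)
    # re.escape makes re treat every pattern character as a literal.
    theirs = re.match(re.escape("hello"), text) is not None
    print(text, ours, theirs)
```

Quantifiers and boundaries need backtracking and position checks, which this sketch deliberately leaves out.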

2. Syntax rules for regular expressions

Here are some of the matching rules for Python's regular expressions (the original post showed them in a figure taken from CSDN).
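
As a rough illustration, a few commonly used rules can be tried directly in Python. The selection below is an assumption about which rules are most useful for a first look, not a copy of the original figure, and the sample strings are made up for the demonstration.

```python
import re

# (pattern, text, substring that re.search is expected to find)
samples = [
    (r"a.c", "abc", "abc"),          # .     any character except a newline
    (r"\d+", "tel 12345", "12345"),  # \d    a digit, + one or more times
    (r"\w+", "hello world", "hello"),  # \w  a word character
    (r"[aeiou]", "python", "o"),     # [...] any one character from the set
    (r"colou?r", "color", "color"),  # ?     zero or one of the previous item
    (r"\d{2,3}", "12345", "123"),    # {m,n} between m and n repetitions
    (r"^py", "python", "py"),        # ^     start of the string
    (r"on$", "python", "on"),        # $     end of the string
    (r"cat|dog", "hot dog", "dog"),  # |     either alternative
    (r"(ab)+", "ababab!", "ababab"),  # (...) a group, repeated as a unit
]

for pattern, text, expected in samples:
    found = re.search(pattern, text).group()
    print(pattern, repr(text), "->", repr(found), found == expected)
```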

3. Notes on regular expressions

(1) Greedy and non-greedy quantifiers
Regular expressions are typically used to find matching strings in text. Quantifiers in Python are greedy by default (a few languages may default to non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible. For example, the regular expression "ab*" finds "abbb" when it is used to search "abbbc", whereas the non-greedy form "ab*?" finds only "a".

Note: We generally use non-greedy mode when extracting content.
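
A minimal sketch of the difference, using the "abbbc" example from the text plus a made-up HTML fragment (the `html` string is only an assumption chosen to show why non-greedy matching is handy for extraction):

```python
import re

text = "abbbc"
greedy = re.search(r"ab*", text).group()   # takes as many b's as possible
lazy = re.search(r"ab*?", text).group()    # takes as few b's as possible
print(greedy, lazy)

html = "<b>one</b><b>two</b>"
# Greedy .* runs to the last </b>, swallowing the tags in between.
greedy_items = re.findall(r"<b>(.*)</b>", html)
# Non-greedy .*? stops at the first </b> it can, giving one item per tag.
lazy_items = re.findall(r"<b>(.*?)</b>", html)
print(greedy_items)
print(lazy_items)
```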

(2) The backslash problem
As in most programming languages, "\" is the escape character in regular expressions, which can lead to a plague of backslashes. If you need to match a literal "\" in the text, the regular expression written as an ordinary string literal needs four backslashes, "\\\\": the string literal turns each pair into one backslash, leaving two backslashes, and the regular expression engine then reads those two as one escaped backslash.

Python's raw strings solve this problem neatly: the regular expression in this example can be written as r"\\". Similarly, "\\d", which matches a digit, can be written as r"\d". With raw strings you no longer have to worry about a missing backslash, and the expressions you write are more intuitive.
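
The two spellings can be compared directly. This is a small sketch; the sample text `C:\temp` is an assumption chosen only because it contains one backslash.

```python
import re

# The text to search contains a single backslash: C:\temp
text = "C:\\temp"

plain_pattern = "\\\\"  # ordinary literal: four backslashes typed
raw_pattern = r"\\"     # raw literal: two backslashes typed

# Both literals produce the same two-character regular expression.
print(plain_pattern == raw_pattern, len(raw_pattern))

# That expression matches exactly one backslash in the text.
match = re.search(raw_pattern, text)
print(match.group() == "\\", match.start())

# The same idea for a digit pattern: "\\d+" and r"\d+" are identical.
digits = re.findall(r"\d+", "a1b22")
print("\\d+" == r"\d+", digits)
```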

4. The Python re module

Python comes with the re module, which provides support for regular expressions. The main functions are listed below.

    # Returns a Pattern object
    re.compile(string[, flags])
    # The following functions are used for matching
    re.match(pattern, string[, flags])
    re.search(pattern, string[, flags])
    re.split(pattern, string[, maxsplit])
    re.findall(pattern, string[, flags])
    re.finditer(pattern, string[, flags])
    re.sub(pattern, repl, string[, count])
    re.subn(pattern, repl, string[, count])

Before introducing these functions, let's introduce the concept of a pattern, which can be understood as a compiled matching rule. How do we get one? Very simply, with the re.compile function. For example:

    pattern = re.compile(r'hello')

We pass in a raw string, re.compile builds a Pattern object from it, and we then use this object for further matching.

You may also have noticed another parameter, flags. Here is what it means:

The flags parameter sets the matching mode. Several values can be combined with the bitwise OR operator '|' so that they take effect together, for example re.I | re.M.

The optional values are:

re.I (full name: IGNORECASE): Ignore case (the full name is given in parentheses, here and below)
re.M (full name: MULTILINE): Multiline mode, which changes the behavior of '^' and '$' so that they also match at the start and end of each line
re.S (full name: DOTALL): Dot-matches-all mode, which changes the behavior of '.' so that it also matches a newline
re.L (full name: LOCALE): Makes the predefined character classes \w \W \b \B \s \S depend on the current locale setting
re.U (full name: UNICODE): Makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode (in Python 3 this is already the default for str patterns)
re.X (full name: VERBOSE): Verbose mode. In this mode the regular expression can span multiple lines, whitespace in it is ignored, and comments can be added.
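
A short sketch of how some of these flags change the result. The sample text and the `phone` pattern are assumptions made up for the demonstration.

```python
import re

text = "Hello\nhello world"

no_flags = re.findall(r"^hello", text)                # case-sensitive, ^ only at string start
ignore_case = re.findall(r"^hello", text, re.I)       # ignore case
both = re.findall(r"^hello", text, re.I | re.M)       # ignore case, ^ at every line start
print(no_flags, ignore_case, both)

# Without re.S the dot does not match the newline; with re.S it does.
without_s = re.search(r"Hello.hello", text)
with_s = re.search(r"Hello.hello", text, re.S)
print(without_s, repr(with_s.group()))

# re.X allows whitespace and comments inside the pattern.
phone = re.compile(r"""
    (\d{3})  # first three digits
    -        # separator
    (\d{4})  # last four digits
""", re.X)
parts = phone.match("555-1234").groups()
print(parts)
```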

We use this pattern together with several other functions, such as re.match, which are described below.

NOTE: The flags parameter in the following seven functions also specifies the matching mode. If the flags were already given when the pattern was compiled, they do not need to be passed again in the calls below (Python raises a ValueError if flags are passed together with an already compiled pattern).
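
A small sketch of this behavior as observed in current CPython releases; the variable name `flags_rejected` is only an illustration.

```python
import re

# Flags are given once, when the pattern is compiled.
pattern = re.compile(r'hello', re.I)
first = re.match(pattern, 'HELLO world').group()
print(first)

# Passing flags again together with a compiled pattern is rejected.
try:
    re.match(pattern, 'HELLO world', re.I)
    flags_rejected = False
except ValueError as error:
    flags_rejected = True
    print('ValueError:', error)
```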
(1) re.match(pattern, string[, flags])
This function starts at the beginning of string (the text we want to match) and tries to match pattern from there, moving forward through the text. If it meets a character that does not match, it returns None immediately; it also returns None if the end of the string is reached before the pattern has been fully matched. Both results mean the match failed. Otherwise the match succeeds and stops as soon as the pattern is complete, and the rest of the string is not examined. Let's look at the following example.

    # -*- coding: utf-8 -*-
    # Import the re module
    import re

    # Compile the regular expression into a Pattern object.
    # The r in front of 'hello' marks a raw string.
    pattern = re.compile(r'hello')

    # Use re.match to match the text; it returns None when the match fails
    result1 = re.match(pattern, 'hello')
    result2 = re.match(pattern, 'helloo CQC!')
    result3 = re.match(pattern, 'helo CQC!')
    result4 = re.match(pattern, 'hello CQC!')

    # If match 1 succeeded
    if result1:
        # Use the Match object to get the group information
        print(result1.group())
    else:
        print('1 match failed!')

    # If match 2 succeeded
    if result2:
        print(result2.group())
    else:
        print('2 match failed!')

    # If match 3 succeeded
    if result3:
        print(result3.group())
    else:
        print('3 match failed!')

    # If match 4 succeeded
    if result4:
        print(result4.group())
    else:
        print('4 match failed!')

Run results

    hello
    hello
    3 match failed!
    hello

Match analysis

1. In the first match, the pattern is 'hello' and the target string is also 'hello'. They match completely from beginning to end, so the match succeeds.

2. In the second match, the string is 'helloo CQC!'. Matching from the start of the string, the whole pattern can be matched. The match then ends, the remaining 'o CQC!' is not examined, and a successful match is returned.

3. In the third match, the string is 'helo CQC!'. Matching from the start of the string, the pattern cannot be completed once 'o' is reached, so the match stops and None is returned.

4. The fourth match works the same way as the second. The space after 'hello' has no effect on the result.

We also printed result.group() at the end. What does that mean? Let's look at the properties and methods of the Match object.
A Match object is the result of one match. It holds a lot of information about that match, which can be read through the properties and methods it provides.

Properties:
1. string: The text used for matching.
2. re: The Pattern object used for matching.
3. pos: The index in the text at which the regular expression starts searching. It has the same value as the parameter of the same name passed to Pattern.match() and Pattern.search().
4. endpos: The index in the text at which the regular expression stops searching. It has the same value as the parameter of the same name passed to Pattern.match() and Pattern.search().
5. lastindex: The index of the last captured group. It is None if no group was captured.
6. lastgroup: The alias of the last captured group. It is None if that group has no alias or no group was captured.
Methods:
1. group([group1, ...]):
Returns the string captured by one or more groups; when several arguments are given, the result is a tuple. group1 can be a number or an alias. The number 0 stands for the entire matched substring, and group() with no arguments returns group(0). A group that captured nothing returns None; a group that captured several times returns the last substring it captured.
2. groups([default]):
Returns the strings captured by all groups as a tuple, equivalent to calling group(1, 2, ..., last). default is the value used for groups that captured nothing; it defaults to None.
3. groupdict([default]):
Returns a dictionary whose keys are the aliases of the named groups and whose values are the substrings those groups captured; groups without an alias are not included. default has the same meaning as above.
4. start([group]):
Returns the start index in string of the substring captured by the given group (the index of the first character of the substring). group defaults to 0.
5. end([group]):
Returns the end index in string of the substring captured by the given group (the index of the last character of the substring + 1). group defaults to 0.
6. span([group]):
Returns (start(group), end(group)).
7. expand(template):
Substitutes the captured groups into template and returns the result. The template can refer to groups with \id or \g<id>, \g<name>. \id and \g<id> are equivalent for group numbers from 1 upward, but \10 is treated as the 10th group; to express \1 followed by the character '0', use \g<1>0. The whole match is available only as \g<0>, not as \0.
Let's use an example to understand them.

    # -*- coding: utf-8 -*-
    # A simple match example
    import re

    # Matches: word + space + word + any characters
    m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

    print("m.string:", m.string)
    print("m.re:", m.re)
    print("m.pos:", m.pos)
    print("m.endpos:", m.endpos)
    print("m.lastindex:", m.lastindex)
    print("m.lastgroup:", m.lastgroup)
    print("m.group():", m.group())
    print("m.group(1, 2):", m.group(1, 2))
    print("m.groups():", m.groups())
    print("m.groupdict():", m.groupdict())
    print("m.start(2):", m.start(2))
    print("m.end(2):", m.end(2))
    print("m.span(2):", m.span(2))
    print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

    ### output ###
    # m.string: hello world!
    # m.re: <repr of the compiled Pattern object>
    # m.pos: 0
    # m.endpos: 12
    # m.lastindex: 3
    # m.lastgroup: sign
    # m.group(): hello world!
    # m.group(1, 2): ('hello', 'world')
    # m.groups(): ('hello', 'world', '!')
    # m.groupdict(): {'sign': '!'}
    # m.start(2): 6
    # m.end(2): 11
    # m.span(2): (6, 11)
    # m.expand(r'\2 \1\3'): world hello!

(2) re.search(pattern, string[, flags])
The search function is very similar to match. The difference is that match() only checks whether the regular expression matches at the start of the string, while search() scans the whole string for a match. match() succeeds only if the match starts at position 0 and returns None otherwise. The object returned by search() has the same properties and methods as the one returned by match(). Let's try an example.

    # Import the re module
    import re

    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'world')
    # Use search() to find a matching substring; it returns None when no substring matches.
    # In this example, match() would not succeed.
    match = re.search(pattern, 'hello world!')
    if match:
        # Use the Match object to get the group information
        print(match.group())

    ### output ###
    # world

(3) re.split(pattern, string[, maxsplit])
Splits string at every substring that matches the pattern and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is omitted, the string is split at every match. Let's look at the following example.

    import re

    pattern = re.compile(r'\d+')
    print(re.split(pattern, 'one1two2three3four4'))

    ### output ###
    # ['one', 'two', 'three', 'four', '']
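
The effect of maxsplit can be seen by running the same split twice. This is a small sketch; the variable names are only for illustration.

```python
import re

text = 'one1two2three3four4'

# No limit: split at every run of digits.
all_parts = re.split(r'\d+', text)
# maxsplit=2: split at the first two matches only, keep the rest intact.
two_splits = re.split(r'\d+', text, maxsplit=2)

print(all_parts)
print(two_splits)
```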

(4) re.findall(pattern, string[, flags])
Searches string and returns all matching substrings as a list. Let's try this example.

    import re

    pattern = re.compile(r'\d+')
    print(re.findall(pattern, 'one1two2three3four4'))

    ### output ###
    # ['1', '2', '3', '4']

(5) re.finditer(pattern, string[, flags])
Searches string and returns an iterator that yields each match result (a Match object) in order. Let's look at the following example.

    import re

    pattern = re.compile(r'\d+')
    for m in re.finditer(pattern, 'one1two2three3four4'):
        print(m.group(), end=' ')

    ### output ###
    # 1 2 3 4

(6) re.sub(pattern, repl, string[, count])
Replaces each matched substring in string with repl and returns the resulting string.
When repl is a string, it can refer to groups with \id or \g<id>, \g<name>. The whole match is available as \g<0>, but not as \0.
When repl is a function, it must accept a single argument (the Match object) and return the replacement string (group references inside the returned string are not expanded).
count specifies the maximum number of replacements; if it is omitted, all matches are replaced.

    import re

    pattern = re.compile(r'(\w+) (\w+)')
    s = 'i say, hello world!'

    print(re.sub(pattern, r'\2 \1', s))

    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()

    print(re.sub(pattern, func, s))

    ### output ###
    # say i, world hello!
    # I Say, Hello World!
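
The count parameter and the \g<...> reference forms can be sketched as follows. The named groups `first` and `second` and the sample strings are assumptions introduced only for this illustration.

```python
import re

s = 'i say, hello world!'

# count=1 replaces only the first match and leaves the rest untouched.
only_first = re.sub(r'(\w+) (\w+)', r'\2 \1', s, count=1)
print(only_first)

# \g<name> refers to a named group.
named = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
               r'\g<second> \g<first>', 'hello world')
print(named)

# \g<1>0 means "group 1 followed by the character 0"; \10 would mean group 10.
padded = re.sub(r'(\d)', r'\g<1>0', 'a1b2')
print(padded)
```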
  

(7) re.subn(pattern, repl, string[, count])
Returns (sub(repl, string[, count]), number of replacements).

    import re

    pattern = re.compile(r'(\w+) (\w+)')
    s = 'i say, hello world!'

    print(re.subn(pattern, r'\2 \1', s))

    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()

    print(re.subn(pattern, func, s))

    ### output ###
    # ('say i, world hello!', 2)
    # ('I Say, Hello World!', 2)

5. Another way to use the Python re module

Above we introduced seven functions such as match and search, all called in the form re.match, re.search, and so on. There is another way to call them: as methods of the Pattern object, such as pattern.match and pattern.search. Called this way, the pattern does not have to be passed as the first argument. Use whichever form you prefer.

List of function APIs (Pattern method | module-level function)

    match(string[, pos[, endpos]])    | re.match(pattern, string[, flags])
    search(string[, pos[, endpos]])   | re.search(pattern, string[, flags])
    split(string[, maxsplit])         | re.split(pattern, string[, maxsplit])
    findall(string[, pos[, endpos]])  | re.findall(pattern, string[, flags])
    finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags])
    sub(repl, string[, count])        | re.sub(pattern, repl, string[, count])
    subn(repl, string[, count])       | re.subn(pattern, repl, string[, count])
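
A brief sketch of the Pattern-method form, including the pos and endpos parameters that only this form offers. The sample text is made up for the demonstration.

```python
import re

pattern = re.compile(r'\d+')
text = 'one1two2three3'

# match() at the default position 0 fails: the text starts with a letter.
at_start = pattern.match(text)
# With pos=3 the match is attempted at index 3, where '1' sits.
at_three = pattern.match(text, 3).group()
# search() from index 4 skips the first number and finds the second.
from_four = pattern.search(text, 4).group()
# findall() between pos=0 and endpos=8 only sees 'one1two2'.
window = pattern.findall(text, 0, 8)
# sub() and the others work the same way, minus the pattern argument.
replaced = pattern.sub('-', text, count=2)

print(at_start, at_three, from_four, window, replaced)
```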

The individual calls need no detailed explanation: the principle is the same and only the parameters differ. Give them a try.

Keep going, even if this section still feels foggy. Later we will use some practical examples to help you master regular expressions.
