Python Crawler Development, Part 1: Regular Expressions

Last Update:2018-07-29 Source: Internet

Author: User

1. Regular expressions

A regular expression is a logical formula for operating on strings: a "rule string" built from predefined special characters and combinations of those characters, used to express filtering logic for strings.

2. The re module

2.1, Steps for using the re module:

Use the compile() function to compile the string form of a regular expression into a Pattern object.
Use the methods provided by the Pattern object to match text and obtain the matching result, a Match object.
Finally, use the properties and methods provided by the Match object to get information and perform other actions as needed.
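
The three steps above can be sketched as follows. This is a minimal illustration written in Python 3 syntax (the article's own examples use Python 2 print statements); the sample text "order 66 shipped" is made up for the example.

```python
import re

# Step 1: compile the regular expression string into a Pattern object.
pattern = re.compile(r"\d+")

# Step 2: use a Pattern method to match text and obtain a Match object.
m = pattern.search("order 66 shipped")

# Step 3: use the Match object's methods to read the result.
if m:
    print(m.group(), m.span())
```

Checking the result for None before calling group() matters, because every matching method returns None when nothing matches.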

2.2, compile()

The compile function compiles a regular expression and produces a Pattern object. It is typically used as follows:

    import re
    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'\d+')

Some common methods of Pattern objects include:

match method: matches from the start position, returns one match
search method: searches from any position, returns one match
findall method: finds all matches, returns a list
finditer method: finds all matches, returns an iterator
split method: splits the string, returns a list
sub method: replaces matches
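
As a quick side-by-side of the six methods listed above, here is a small sketch in Python 3 syntax. The sample string "a1b22c333" is an assumption chosen only to make the differences visible.

```python
import re

pattern = re.compile(r"\d+")
text = "a1b22c333"

at_start = pattern.match(text)       # None: the text does not start with a digit
anywhere = pattern.search(text)      # Match object for the first run of digits
all_matches = pattern.findall(text)  # list of every matched substring
spans = [m.span() for m in pattern.finditer(text)]  # iterate over Match objects
pieces = pattern.split(text)         # text between the matches
replaced = pattern.sub("#", text)    # every match replaced by '#'

print(at_start, anywhere.group(), all_matches, spans, pieces, replaced)
```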

2.2.1, match method:

The match method tries to match at the beginning of the string (or at a specified starting position). It performs a single match: it returns as soon as one match is found instead of looking for all matches.

Its general form is: match(string[, pos[, endpos]])

Here string is the string to be matched; pos and endpos are optional parameters that specify the start and end positions of the range to match within the string, with default values 0 and len(string). So when pos and endpos are not specified, match tries to match at the head of the string.

When the match succeeds, a Match object is returned; if there is no match, None is returned.

Example 1:

    >>> import re
    >>> pattern = re.compile(r'\d+')  # matches one or more digits
    >>> m = pattern.match('one12twothree34four')  # try at the head of the string: no match
    >>> print m
    None
    >>> m = pattern.match('one12twothree34four', 2, 10)  # start matching at 'e': no match
    >>> print m
    None
    >>> m = pattern.match('one12twothree34four', 3, 10)  # start matching at '1': matches
    >>> print m  # a Match object is returned
    <_sre.SRE_Match object at 0x10a42aac0>
    >>> m.group(0)  # the 0 can be omitted
    '12'
    >>> m.start(0)  # the 0 can be omitted
    3
    >>> m.end(0)  # the 0 can be omitted
    5
    >>> m.span(0)  # the 0 can be omitted
    (3, 5)

- The group([group1, ...]) method returns the string matched by one or more groups; to get the entire matched substring, use group() or group(0).
- The start([group]) method returns the starting position, within the whole string, of the substring matched by the group (the index of the substring's first character); the parameter defaults to 0.
- The end([group]) method returns the ending position, within the whole string, of the substring matched by the group (the index of the substring's last character + 1); the parameter defaults to 0.
- The span([group]) method returns (start(group), end(group)).
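
The relationship between group(), start(), end(), and span() can be checked directly. The following is a small sketch in Python 3 syntax that reuses the string from Example 1; it also shows the usual guard against a None result.

```python
import re

pattern = re.compile(r"\d+")
text = "one12twothree34four"

no_match = pattern.match(text)  # the head of the string is not a digit
m = pattern.match(text, 3, 10)  # begin matching at index 3, the '1'

if m is not None:
    # span() is (start(), end()), and slicing the text with it
    # gives back exactly what group() returns.
    start, end = m.span()
    print(m.group(), start, end, text[start:end])
```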


Example 2:

    >>> import re
    >>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # re.I means ignore case
    >>> m = pattern.match('Hello World Wide Web')
    >>> print m  # the match succeeds and a Match object is returned
    <_sre.SRE_Match object at 0x10bea83e8>
    >>> m.group(0)  # the entire matched substring
    'Hello World'
    >>> m.span(0)  # the index range of the entire matched substring
    (0, 11)
    >>> m.group(1)  # the substring matched by the first group
    'Hello'
    >>> m.span(1)  # the index range of the first group's match
    (0, 5)
    >>> m.group(2)  # the substring matched by the second group
    'World'
    >>> m.span(2)  # the index range of the second group's match
    (6, 11)
    >>> m.groups()  # equivalent to (m.group(1), m.group(2), ...)
    ('Hello', 'World')
    >>> m.group(3)  # there is no third group
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IndexError: no such group

2.2.2, search method

The search method looks for a match at any position in the string. Like match, it performs a single match: it returns as soon as one match is found instead of looking for all matches.

Its general form is: search(string[, pos[, endpos]])

When the match succeeds, a Match object is returned; if there is no match, None is returned.

Example 1:

    >>> import re
    >>> pattern = re.compile(r'\d+')
    >>> m = pattern.search('one12twothree34four')  # match() would not match here
    >>> m
    <_sre.SRE_Match object at 0x10cc03ac0>
    >>> m.group()
    '12'
    >>> m = pattern.search('one12twothree34four', 10)  # specify where the search starts
    >>> m
    <_sre.SRE_Match object at 0x10cc03b28>
    >>> m.group()
    '34'
    >>> m.span()
    (13, 15)

Example 2:

    # -*- coding: utf-8 -*-
    import re

    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'\d+')

    # Use search() to find a matching substring; None is returned if there is none.
    # match() would not succeed here.
    m = pattern.search('hello 123456 789')
    if m:
        # Use the Match object to get group information
        print 'matching string:', m.group()
        # Start and end positions
        print 'position:', m.span()

Execution result:

    matching string: 123456
    position: (6, 12)
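
To make the difference between match and search concrete, here is a short sketch in Python 3 syntax. The helper name first_number is hypothetical and exists only for this illustration.

```python
import re

pattern = re.compile(r"\d+")
text = "hello 123456 789"

# match() only tries at the start of the string, so it fails here.
from_head = pattern.match(text)

# search() scans forward and stops at the first match it finds.
found = pattern.search(text)


def first_number(s):
    """Hypothetical helper: return the first run of digits in s, or None."""
    m = pattern.search(s)
    return m.group() if m else None


print(from_head, found.span(), first_number(text), first_number("no digits"))
```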

2.2.3, findall method

findall searches the entire string and returns every match.

The findall method is used in the following form: findall(string[, pos[, endpos]])

findall returns all matched substrings as a list, and returns an empty list if there is no match.

Example 1:

    import re
    pattern = re.compile(r'\d+')  # find digits

    result1 = pattern.findall('hello 123456 789')
    result2 = pattern.findall('one1two2three3four4', 0, 10)  # search only indexes 0 through 9

    print result1
    print result2

Execution result:

    ['123456', '789']
    ['1', '2']

Example 2:

    # re_test.py
    import re

    # The re module provides a compile() function: we pass it a matching rule
    # and it returns a Pattern instance that we then use to match strings
    pattern = re.compile(r'\d+\.\d*')

    # pattern.findall() finds every substring that matches
    result = pattern.findall("123.141593, 'bigcat', 232312, 3.15")

    # findall returns all matching substrings as a list
    for item in result:
        print item

Run result:

    123.141593
    3.15
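
One detail worth knowing when using findall: if the pattern contains capturing groups, the list holds the group contents rather than the whole match. The sketch below (Python 3 syntax, with a made-up input string) shows the three cases.

```python
import re

text = "x=1, y=22, z=333"

# No groups: each item is the whole matched substring.
whole = re.compile(r"\w=\d+").findall(text)

# One group: each item is the text captured by that group.
one_group = re.compile(r"\w=(\d+)").findall(text)

# Several groups: each item is a tuple with one entry per group.
two_groups = re.compile(r"(\w)=(\d+)").findall(text)

print(whole, one_group, two_groups)
```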

2.2.4, finditer method

The finditer method behaves much like findall: it also searches the entire string for all matches. The difference is that it returns an iterator that yields each match result (a Match object) in order.

    # -*- coding: utf-8 -*-
    import re

    pattern = re.compile(r'\d+')

    result_iter1 = pattern.finditer('hello 123456 789')
    result_iter2 = pattern.finditer('one1two2three3four4', 0, 10)

    print type(result_iter1)
    print type(result_iter2)

    print 'result1...'
    for m1 in result_iter1:  # m1 is a Match object
        print 'matching string: {}, position: {}'.format(m1.group(), m1.span())

    print 'result2...'
    for m2 in result_iter2:
        print 'matching string: {}, position: {}'.format(m2.group(), m2.span())

Execution result:

    <type 'callable-iterator'>
    <type 'callable-iterator'>
    result1...
    matching string: 123456, position: (6, 12)
    matching string: 789, position: (13, 16)
    result2...
    matching string: 1, position: (3, 4)
    matching string: 2, position: (7, 8)
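
Because finditer yields Match objects, it is convenient when both the matched text and its position are needed. A minimal sketch in Python 3 syntax, using the same input string as the example above:

```python
import re

pattern = re.compile(r"\d+")
text = "one1two2three3four4"

# Collect (matched text, span) pairs for matches within indexes 0..9.
found = [(m.group(), m.span()) for m in pattern.finditer(text, 0, 10)]

# The iterator is lazy, so next() can fetch just the first match
# (or a default when there is none) without scanning the rest.
first = next(pattern.finditer(text), None)

print(found, first.group())
```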

2.2.5, split method

The split method splits the string at every substring that matches the pattern and returns the pieces as a list.

It is used in the following form: split(string[, maxsplit])

Here maxsplit specifies the maximum number of splits; if it is not specified, the string is split at every match.

    import re
    p = re.compile(r'[\s\,\;]+')
    print p.split('a,b;; c   d')

Execution result:

    ['a', 'b', 'c', 'd']
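
The effect of maxsplit is easiest to see next to an unlimited split. This sketch uses Python 3 syntax and passes maxsplit by keyword; the input string is the one from the example above.

```python
import re

p = re.compile(r"[\s,;]+")
text = "a,b;; c   d"

# Without maxsplit the string is split at every match.
all_parts = p.split(text)

# With maxsplit=2 only the first two separators are used;
# the rest of the string is returned unsplit as the last item.
limited = p.split(text, maxsplit=2)

print(all_parts, limited)
```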

2.2.6, sub method

The sub method is used for substitution.

It is used in the following form: sub(repl, string[, count])

Here repl can be a string or a function:

- If repl is a string, each matched substring is replaced with repl and the resulting string is returned. repl can also refer to groups by number, such as \1 or \2, but not with \0 (use \g<0> to refer to the entire match).
- If repl is a function, it should accept exactly one argument (the Match object) and return the replacement string (group references inside the returned string are not expanded).
- count specifies the maximum number of replacements; if it is not specified, all matches are replaced.

    import re
    p = re.compile(r'(\w+) (\w+)')  # \w = [A-Za-z0-9_]
    s = 'hello 123, hello 456'

    print p.sub(r'hello world', s)  # replace 'hello 123' and 'hello 456' with 'hello world'
    print p.sub(r'\2 \1', s)        # reference groups

    def func(m):
        return 'hi' + ' ' + m.group(2)

    print p.sub(func, s)
    print p.sub(func, s, 1)         # replace at most once

Execution result:

    hello world, hello world
    123 hello, 456 hello
    hi 123, hi 456
    hi 123, hello 456
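
The sketch below, in Python 3 syntax, complements the example above with two cases it does not show: referring to the whole match with \g<0>, and a replacement function that computes its result from a group. The function name double is hypothetical.

```python
import re

p = re.compile(r"(\w+) (\w+)")
s = "hello 123, hello 456"

# \g<0> stands for the entire match, which a bare \0 cannot express.
wrapped = p.sub(r"[\g<0>]", s)


def double(m):
    """Hypothetical replacement function: double the number in group 2."""
    return m.group(1) + " " + str(int(m.group(2)) * 2)


doubled = p.sub(double, s)
doubled_once = p.sub(double, s, count=1)  # replace only the first match

print(wrapped, doubled, doubled_once, sep="\n")
```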

2.2.7, Matching Chinese

Suppose we want to extract the Chinese text from the string title = u'你好，hello，世界'. We can do it like this:

    import re

    title = u'你好，hello，世界'
    pattern = re.compile(ur'[\u4e00-\u9fa5]+')
    result = pattern.findall(title)

    print result

Note that the regular expression is preceded by the two-letter prefix ur, where r marks a raw string and u marks a Unicode string.

Execution result:

    [u'\u4f60\u597d', u'\u4e16\u754c']
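
The ur prefix exists only in Python 2; in Python 3 it is a syntax error, and ordinary strings are already Unicode. A Python 3 version of the same idea might look like the sketch below, which assumes a title that mixes Chinese and English text.

```python
import re

# Assumed sample input: Chinese words separated from English by full-width commas.
title = "你好，hello，世界"

# In Python 3 a plain raw string is enough; re understands \uXXXX escapes.
pattern = re.compile(r"[\u4e00-\u9fa5]+")
result = pattern.findall(title)

print(result)
```

The range \u4e00-\u9fa5 covers the commonly used CJK ideographs, so punctuation such as the full-width comma is not matched.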

3. Greedy mode and non-greedy mode

Greedy mode: match as much as possible, provided the entire expression still matches (for example, *);
Non-greedy mode: match as little as possible, provided the entire expression still matches (add ? after the quantifier);
Quantifiers in Python are greedy by default.

Example 1: source string: abbbc

With the greedy quantifier, the regular expression ab* matches abbb: * matches as many b characters as possible, so every b after the a is included. With the non-greedy quantifier, the regular expression ab*? matches a: even though * is present, ? makes it match as few b characters as possible, so no b is included.
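
Example 1 can be verified with a few lines of Python. This sketch uses Python 3 syntax and the module-level re.search function for brevity.

```python
import re

s = "abbbc"

greedy = re.search(r"ab*", s).group()   # * takes as many b's as it can
lazy = re.search(r"ab*?", s).group()    # *? takes as few b's as it can

print(greedy, lazy)
```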

Example 2: source string: aa<div>test1</div>bb<div>test2</div>cc

Regular expression using a greedy quantifier: <div>.*</div>

Match result: <div>test1</div>bb<div>test2</div>

The greedy mode is used here. The entire expression could already match successfully at the first "</div>", but because the quantifier is greedy, the engine keeps trying to match further to the right to see whether a longer substring can match. After reaching the second "</div>", no longer match is possible to the right, so matching ends and the result is "<div>test1</div>bb<div>test2</div>".
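
For comparison, the sketch below runs both the greedy pattern and a non-greedy variant (.*? instead of .*) against the same source string. It uses Python 3 syntax; the non-greedy pattern is an addition for contrast.

```python
import re

s = "aa<div>test1</div>bb<div>test2</div>cc"

# Greedy: .* runs on to the last </div> in the string.
greedy = re.search(r"<div>.*</div>", s).group()

# Non-greedy: .*? stops at the first </div> that lets the pattern match.
lazy = re.search(r"<div>.*?</div>", s).group()

# Combined with a group and findall, the non-greedy form
# extracts the content of each div separately.
contents = re.findall(r"<div>(.*?)</div>", s)

print(greedy, lazy, contents, sep="\n")
```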

4. Regular expression Test URL

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
