Python crawler from beginner to giving up (V): basic use of regular expressions

Last Update:2017-05-31 Source: Internet

Author: User


What is a regular expression

A regular expression is a pattern for operating on strings. It is built from ordinary characters and a set of predefined special characters; combined, they form a "pattern string" that expresses a filtering rule for text.

Regular expressions are not unique to Python; other languages have them too.
In Python, regular expressions are provided by the re module.

Common matching patterns in Python regular expressions

Pattern   Description
\w        Matches a letter, digit, or underscore
\W        Matches any character that is not a letter, digit, or underscore
\s        Matches any whitespace character, equivalent to [ \t\n\r\f\v]
\S        Matches any non-whitespace character
\d        Matches any digit
\D        Matches any non-digit
\A        Matches the start of the string
\Z        Matches the end of the string (in Python's re module, only the very end, even when the string ends with a newline)
\z        Matches the end of the string (a synonym for \Z in recent Python versions; older versions accept only \Z)
\G        Matches the position where the previous match ended (Perl-style; not supported by Python's re module)
\n        Matches a newline character
\t        Matches a tab character
^         Matches the start of the string
$         Matches the end of the string, or just before a newline at the end of the string
.         Matches any character except a newline; when the re.DOTALL flag is given, it also matches newlines
[...]     Matches any one character in the set: [amk] matches a, m, or k
[^...]    Matches any one character not in the set: [^abc] matches any character other than a, b, or c
*         Matches 0 or more repetitions of the preceding expression
+         Matches 1 or more repetitions of the preceding expression
?         Matches 0 or 1 occurrence of the preceding expression; placed after another quantifier, it makes that quantifier non-greedy
{n}       Matches exactly n repetitions of the preceding expression
{n,m}     Matches n to m repetitions of the preceding expression, greedily
a|b       Matches a or b
()        Matches the expression in parentheses and also defines a group
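A few of the patterns in the table can be tried out directly. The following is a minimal sketch; the sample string and variable names are made up for illustration and do not come from the original article.

```python
import re

# Sample text invented for this illustration.
text = "user_42 paid 15 dollars\ton 2017-05-31"

digits = re.findall(r"\d+", text)       # runs of digits
words = re.findall(r"\w+", text)        # runs of letters, digits, underscores
parts = re.split(r"\s+", text)          # split on whitespace (spaces and the tab)
symbols = re.findall(r"[^\w\s]", text)  # anything that is neither \w nor \s
in_set = re.findall(r"[amk]", "mark")   # characters from the set a, m, k

print(digits)
print(words)
print(parts)
print(symbols)
print(in_set)
```

Raw strings (the r prefix) are used so that backslashes reach the re module unchanged.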

re.match()

re.match() tries to match a pattern at the beginning of the string. If the pattern does not match at the start position, it returns None.
Syntax:
re.match(pattern, string, flags=0)
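As a small sketch of this behavior, the helper below (a hypothetical function named starts_with_hello, not part of the re module) shows that re.match only succeeds when the pattern matches at position 0, and that the optional flags argument changes how the pattern is applied.

```python
import re

def starts_with_hello(text):
    """Return True when the pattern matches at the very start of text."""
    return re.match(r"hello", text) is not None

at_start = starts_with_hello("hello world")        # pattern is at position 0
not_at_start = starts_with_hello("say hello")      # pattern appears later only
# flags is optional; re.I makes the match case-insensitive.
ignore_case = re.match(r"hello", "HELLO world", flags=re.I) is not None

print(at_start, not_at_start, ignore_case)
```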

The most basic match

    import re

    content = "Hello 123 4567 World_This is a regex Demo"
    result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$', content)
    print(result)
    print(result.group())
    print(result.span())

result.group() returns the text that was matched.
result.span() returns the (start, end) index range of the match within the string.

Generic match

The approach above is tedious to write. The same pattern can be simplified by using .* to stand for any run of characters:

    import re

    content = "Hello 123 4567 World_This is a regex Demo"
    result = re.match(r'^Hello.*Demo$', content)
    print(result)
    print(result.group())
    print(result.span())

This code produces the same result as the pattern above, but it is much easier to write.
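To check that claim, the sketch below runs both patterns against the same string and compares what they return. The variable names are illustrative only.

```python
import re

content = "Hello 123 4567 World_This is a regex Demo"

explicit = re.match(r"^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$", content)
generic = re.match(r"^Hello.*Demo$", content)

# Both patterns match the whole string, so group() and span() agree.
same_text = explicit.group() == generic.group()
start, end = generic.span()

print(same_text)
print(content[start:end] == generic.group())
```

span() returns indices that can be used to slice the original string, which is why the second print compares the slice with group().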

Match target

To extract a specific target from the string, enclose the corresponding part of the pattern in parentheses (), as in the following example:

    import re

    content = "Hello 1234567 World_This is a regex Demo"
    result = re.match(r'^Hello\s(\d+)\sWorld.*Demo$', content)
    print(result)
    print(result.group())
    print(result.group(1))
    print(result.span())

When the pattern contains parentheses, result.group(1) returns the text matched by the first parenthesized group, which here is 1234567.
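Groups are numbered from left to right by their opening parenthesis. The following sketch adds a second group to a similar pattern to show group(0), group(1), group(2), and groups() side by side; the second group is an addition made for illustration.

```python
import re

content = "Hello 1234567 World_This is a regex Demo"
result = re.match(r"^Hello\s(\d+)\s(\w+)", content)

whole = result.group()     # same as group(0): the entire matched text
number = result.group(1)   # first parenthesized group
word = result.group(2)     # second parenthesized group
both = result.groups()     # all groups as a tuple

print(whole)
print(number, word)
print(both)
```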

Greedy match

Let's look at the following code:

    import re

    content = "Hello 1234567 World_This is a regex Demo"
    result = re.match(r'^Hello.*(\d+).*Demo', content)
    print(result)
    print(result.group(1))

The output shows that group(1) captured only 7 rather than 1234567. The cause is the .* in front of the group: it is greedy, so it matches as much as it can and leaves just one digit for (\d+). This is called a greedy match.

If we want to match to 1234567, we need to change the regular expression to:

    result = re.match(r'^He.*?(\d+).*Demo', content)

With the non-greedy .*? the group matches 1234567.
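The difference between the two quantifiers can be seen side by side in a short sketch. It reuses the sample string from this section; the variable names are illustrative.

```python
import re

content = "Hello 1234567 World_This is a regex Demo"

# Greedy: .* swallows as many characters as possible, so only the
# last digit is left over for (\d+).
greedy = re.match(r"^He.*(\d+).*Demo", content).group(1)

# Non-greedy: .*? stops as early as possible, so (\d+) gets every digit.
lazy = re.match(r"^He.*?(\d+).*Demo", content).group(1)

print(greedy)
print(lazy)
```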

Matching mode

The content to match often spans several lines. In that case, pass the re.S flag so that . also matches newline characters:

    import re

    content = """hello 123456 world_this
    my name is zhaofan"""
    result = re.match(r'^he.*?(\d+).*?zhaofan$', content, re.S)
    print(result)
    print(result.group())
    print(result.group(1))

Here result.group(1) is 123456.
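A minimal sketch of what the flag changes: the same pattern is applied to the same two-line string with and without re.S. The string is written with an explicit \n to make the line break visible.

```python
import re

content = "hello 123456 world_this\nmy name is zhaofan"
pattern = r"^he.*?(\d+).*?zhaofan$"

# Without re.S the dot cannot cross the newline, so the match fails.
without_flag = re.match(pattern, content)

# With re.S (an alias of re.DOTALL) the dot also matches "\n".
with_flag = re.match(pattern, content, re.S)

print(without_flag)
print(with_flag.group(1))
```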

Escape

When the content to match contains special characters, escape them with a backslash \, as in the following example:

    import re

    content = "price is $5.00"
    result = re.match(r'price is \$5\.00', content)
    print(result)
    print(result.group())
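The sketch below shows why the escaping matters, and adds one thing the original does not mention: the standard library function re.escape, which escapes a literal string automatically. Treat its use here as a suggestion rather than the author's method.

```python
import re

content = "price is $5.00"

# Unescaped, $ means "end of string", so this pattern cannot match.
unescaped = re.match(r"price is $5.00", content)

# Escaped by hand, $ and . are treated as literal characters.
escaped = re.match(r"price is \$5\.00", content)

# re.escape builds an equivalent pattern from the literal text.
auto = re.match(re.escape("price is $5.00"), content)

print(unescaped)
print(escaped.group())
print(auto.group())
```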

Summary of the above:
Prefer generic matches such as .*, use parentheses to capture the target, prefer non-greedy patterns, and use re.S when the text contains newlines.
Remember that re.match only matches from the start of the string.

re.search

re.search scans the entire string and returns the first successful match.

    import re

    content = "extra things hello 123455 world_this is a Re extra things"
    result = re.search(r"hello.*?(\d+).*?Re", content)
    print(result)
    print(result.group())
    print(result.group(1))

There is no need to write ^ and $ here, because search scans the whole string.

Note: search is usually the more convenient choice. match must match from the beginning of the string, which is often inconvenient.
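The contrast between the two functions can be sketched by applying one pattern with both of them to a string that has extra text in front of the part of interest.

```python
import re

content = "extra things hello 123455 world_this is a Re extra things"
pattern = r"hello.*?(\d+).*?Re"

# match only tries position 0, where the string starts with "extra".
match_result = re.match(pattern, content)

# search tries every position and returns the first hit.
search_result = re.search(pattern, content)

print(match_result)
print(search_result.group())
print(search_result.group(1))
```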

Matching exercises

Example 1:

    import re

    # The rest of the sample HTML is not shown here.
    html = '''<div id="songs-list">'''
    result = re.search(r'<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S)
    print(result)
    if result:  # search returns None when nothing matches
        print(result.groups())
        print(result.group(1))
        print(result.group(2))
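Since the sample HTML is not available, here is a sketch that applies the same pattern to a small, invented fragment. The markup, song titles, and singer names are hypothetical and only mimic the structure the pattern expects: an li element marked active that contains a link with a singer attribute.

```python
import re

# Hypothetical markup, invented for this illustration.
html = '''<div id="songs-list">
<ul>
<li data-view="1"><a href="/1.mp3" singer="Singer A">Song One</a></li>
<li data-view="2" class="active">
<a href="/2.mp3" singer="Singer B">Song Two</a>
</li>
</ul>
</div>'''

result = re.search(r'<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S)
singer, title = result.groups()

print(singer)
print(title)
```

Note that the match starts at the first li element and the non-greedy .*? runs forward until it reaches the word active, so the captured singer and title come from the active item.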

re.findall

Searches the string and returns all matching substrings as a list.

The code example is as follows:

    import re

    # The rest of the sample HTML is not shown here.
    html = '''<div id="songs-list">'''
    results = re.findall(r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
    print(results)
    print(type(results))
    for result in results:
        print(result)
        print(result[0], result[1], result[2])

Example 2:

    import re

    # The rest of the sample HTML is not shown here.
    html = '''<div id="songs-list">'''
    results = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', html, re.S)
    print(results)
    for result in results:
        print(result[1])

Two parts of this pattern are worth noting:

\s*? handles the fact that some items contain a line break and others do not.

(<a.*?>)? handles the fact that some items contain an a tag and others do not; the trailing ? means the group matches zero or one time.
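The effect of the optional parts can be sketched on an invented list where one item has no link, one has a link on the same line, and one has a link on its own line. The markup and titles are hypothetical; the titles are single words because the pattern captures them with \w+.

```python
import re

# Hypothetical markup, invented for this illustration.
html = '''<ul>
<li>First</li>
<li><a href="/2.mp3">Second</a></li>
<li>
<a href="/3.mp3">Third</a>
</li>
</ul>'''

results = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', html, re.S)

# Each result is a tuple with one entry per group; optional groups
# that did not take part in the match come back as empty strings.
titles = [result[1] for result in results]

print(results)
print(titles)
```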

re.sub

Replaces every substring that matches the pattern and returns the resulting string.

re.sub(pattern, replacement, original string)

Example 1

    import re

    content = "Extra things hello 123455 World_This is a regex Demo extra things"
    content = re.sub(r'\d+', '', content)
    print(content)

As a result, every run of digits is replaced with an empty string, that is, removed.

Example 2: sometimes the replacement should keep the matched text and add something to it. This can be done as follows:

    import re

    content = "Extra things hello 123455 World_This is a regex Demo extra things"
    content = re.sub(r'(\d+)', r'\1 7890', content)
    print(content)

Here 123455 becomes 123455 7890. Note that \1 refers to the text matched by the first group, and the r prefix makes the replacement a raw string so that the backslash is not treated as an escape character.
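Both replacements can be compared in one sketch. The last line goes one step beyond the original text: re.sub also accepts a function as the replacement, which is called with each match object. The sample sentence in that line is invented.

```python
import re

content = "Extra things hello 123455 World_This is a regex Demo extra things"

# Replace each run of digits with nothing.
removed = re.sub(r"\d+", "", content)

# Keep the matched digits (\1) and append extra text.
appended = re.sub(r"(\d+)", r"\1 7890", content)

# Addition for illustration: compute the replacement from the match.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples and 10 pears")

print(removed)
print(appended)
print(doubled)
```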

re.compile

Compiles a regular expression into a pattern object so that it can be reused.

    import re

    content = """hello 12345 world_this
    123 fan"""
    # re.S lets .* run across the line break in content
    pattern = re.compile(r"hello.*fan", re.S)
    result = re.match(pattern, content)
    print(result)
    print(result.group())
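A compiled pattern also exposes match, search, findall, and sub as methods, so the flags are given once and the object is reused. The sketch below assumes the re.S flag and uses invented input strings.

```python
import re

# Compile once; the flag is stored with the pattern object.
pattern = re.compile(r"hello.*fan", re.S)

# Reuse the same object with different methods and inputs.
first = pattern.match("hello 12345 world_this\n123 fan")
second = pattern.search("say hello to every\nfan out there")

print(first.group())
print(second.group())
```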

A comprehensive regular expression exercise

Use regular expressions to extract book information from the Douban Books page:

    import re

    import requests

    content = requests.get('https://book.douban.com/').text
    pattern = re.compile(
        r'<li.*?cover.*?href="(.*?)".*?title="(.*?)".*?more-meta'
        r'.*?author">(.*?)</span>.*?year">(.*?)</span>.*?</li>',
        re.S,
    )
    results = re.findall(pattern, content)
    print(results)
    for result in results:
        url, name, author, date = result
        # Strip the whitespace that surrounds the text inside the tags
        author = re.sub(r'\s', '', author)
        date = re.sub(r'\s', '', date)
        print(url, name, author, date)
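The live page may change or be unreachable, so the same pattern can be tried offline. The sketch below runs it on a hand-written fragment; the markup, URL, title, author, and date are all hypothetical and only imitate the structure the pattern expects, not the real Douban page.

```python
import re

# Hypothetical fragment, invented for this illustration.
sample = '''
<li class="">
  <div class="cover">
    <a href="https://example.com/book/1" title="Sample Book"></a>
  </div>
  <div class="info">
    <p class="more-meta">
      <span class="author">
        Jane Doe
      </span>
      <span class="year">
        2017-5
      </span>
    </p>
  </div>
</li>
'''

pattern = re.compile(
    r'<li.*?cover.*?href="(.*?)".*?title="(.*?)".*?more-meta'
    r'.*?author">(.*?)</span>.*?year">(.*?)</span>.*?</li>',
    re.S,
)

books = []
for url, name, author, date in pattern.findall(sample):
    # Remove every whitespace character, as the original code does.
    books.append((url, name, re.sub(r"\s", "", author), re.sub(r"\s", "", date)))

print(books)
```

Removing every whitespace character also removes spaces inside a name, so "Jane Doe" becomes "JaneDoe". Calling author.strip() instead would trim only the surrounding whitespace.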

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
