Lecture 2: Several examples of crawling web pages
This lecture walks through a few examples to show the basic operation of crawling web pages, and it also reviews the last lesson. I trust you still remember the general code framework; we will need it again in this lecture. Haha, you haven't forgotten it, have you?
All right, here we go!
Example 1:
1 " "The novel crawls" "2 3 ImportRequests4 5 6 7 defgethtml_text (URL):8 9 Try:Ten OneR = Requests.get (Url,timeout = 20) A -R.raise_for_status ()#If the status is not 200, an exception is generated - theR.encoding =r.apparent_encoding - - returnR.text - + except: - + return 'Create an exception' A at - - if __name__=='__main__': - -URL ='http://wanmeishijiexiaoshuo.org/book/9.html' - in #the above is one of the novels on the website of the novel content - to Print(Gethtml_text (URL) [: 5000])
Let's run it and look at the result:
See? The content of the novel has been crawled down. Understand the operation now? Simply put, you give it the URL of the novel page, and the crawler framework fetches the content automatically.
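If you want to keep what you crawled, here is a minimal sketch that reuses the gethtml_text() framework above to write the page to a local file; the filename chapter.html is just an illustration:

text = gethtml_text('http://wanmeishijiexiaoshuo.org/book/9.html')
with open('chapter.html', 'w', encoding='utf-8') as f:
    f.write(text)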
Example 2:
1 " "Amazon Arcade Crawl '2 3 Import Requests4 5 6 7 def gethtml_text (URL):8 9 Try:Ten One kv = {' user-agent ': ' mozilla/5.0 '} #headers头信息 A - r = requests.get (url,timeout = 20,headers = kv) - the r.raise_for_status () #如果状态不是200, an exception is generated - - r.encoding = r.apparent_encoding - + return R.text - + except: A at return ' produces an exception ' - - - - if __name__ = = ' __main__ ': - in url = ' https://www.amazon.cn/' - to print (Gethtml_text (URL) [: +])
Look at the result of running it:
It looks like quite a mess, but that is not the point. The point is that we got all of the response information; later on we will get into extracting the useful parts from it. Don't worry for now, take it slowly: you eat a meal one mouthful at a time.
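Before moving on, here is a small sketch of a few Response attributes you can inspect to see what "all the response information" actually contains; the URL is the same Amazon one as above, but any page works:

import requests

r = requests.get('https://www.amazon.cn/', timeout=20)
print(r.status_code)                    # HTTP status code of the response
print(r.headers.get('Content-Type'))    # one of the response headers
print(r.encoding, r.apparent_encoding)  # declared vs. guessed text encoding
print(len(r.text))                      # length of the decoded response body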
The two new lines in this example are:

kv = {'user-agent': 'Mozilla/5.0'}  # request header information
r = requests.get(url, timeout=20, headers=kv)
Remember that the get() method mentioned earlier has a few parameters we have not covered yet; headers is one of them. After adding the header information, the crawler disguises itself as a Mozilla browser when it sends requests to the target URL. If the target server can identify crawlers, an undisguised one can be denied access in minutes. Here we pass the kv dictionary as the header information, which strengthens the crawler to a certain extent. Haha, see the sketch and the next example below.
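To see the disguise in action, here is a hedged sketch comparing the User-Agent that requests sends by default with the one from our kv dictionary; httpbin.org is just an echo service chosen for illustration:

import requests

# without headers, the server sees something like 'python-requests/2.x'
r1 = requests.get('https://httpbin.org/get', timeout=20)
print(r1.request.headers['User-Agent'])

# with the kv dictionary, the server sees 'Mozilla/5.0' instead
kv = {'user-agent': 'Mozilla/5.0'}
r2 = requests.get('https://httpbin.org/get', timeout=20, headers=kv)
print(r2.request.headers['User-Agent'])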
Example 3:
1 " "Baidu Search" "2 3 ImportRequests4 5 6 7 defgethtml_text (URL):8 9 Try:Ten OneKV = {'user-agent':'mozilla/5.0'} A -WD = {'WD':'Beijing Weather'} - theR = requests.get (url,timeout = 20,params = WD, headers =kv) - -R.raise_for_status ()#If the status is not 200, an exception is generated - +R.encoding =r.apparent_encoding - + returnR.text A at except: - - return 'Create an exception' - - - in if __name__=='__main__': - toURL ="http://www.baidu.com/s?" + - Print(Gethtml_text (URL) [: 5000])
Look at the result of running it:
I #@¥! @%! @#
Calm down, and make a change:
1 " "Baidu Search" "2 3 ImportRequests4 5 6 7 defgethtml_text (URL):8 9 Try:Ten OneWD = {'WD':'Beijing Weather'} A -R = requests.get (url,timeout = 20,params =wd) - theR.raise_for_status ()#If the status is not 200, an exception is generated - -R.encoding =r.apparent_encoding - + returnR.text - + except: A at return 'Create an exception' - - - - if __name__=='__main__': - inURL ="http://www.baidu.com/s?" - to Print(Gethtml_text (URL) [: 5000])
The result:
Now we have a result. The reason is that the headers parameter we sent earlier simulated a Mozilla browser, and Baidu's server refused that visit outright. No matter, it is not a big problem; we care about the process here, so don't get hung up on these details.
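One way to check what params actually did is to print r.url, which shows the final URL after the wd dictionary has been encoded into the query string; a minimal sketch:

import requests

wd = {'wd': 'Beijing weather'}
r = requests.get('http://www.baidu.com/s?', timeout=20, params=wd)
print(r.url)  # roughly http://www.baidu.com/s?wd=Beijing+weather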
Well, after the three examples above, I believe you have a deeper understanding of how to use the requests.get() method. To finish, let's summarize the parameters of the get() method.
Parameter | Description
params | dictionary or byte sequence, appended to the URL as query parameters
data | dictionary, byte sequence, or file object, sent as the request body
json | JSON-formatted data, sent as the request body
headers | dictionary, custom HTTP headers
cookies | dictionary or CookieJar, cookies for the request
auth | tuple, supports HTTP authentication
timeout | sets the timeout, in seconds
files | dictionary, for transferring files
proxies | dictionary, sets access proxies; can include login authentication
allow_redirects | True/False, default True; switch for following redirects
stream | True/False, default True; switch for downloading content immediately
verify | True/False, default True; switch for SSL certificate verification
cert | path to a local SSL certificate
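To make the table concrete, here is a hedged sketch that exercises a few of these parameters; the httpbin.org URLs and the proxy addresses are placeholders for illustration, not endpoints this lesson depends on:

import requests

# allow_redirects: turn redirect following off and observe the raw 302
r = requests.get('https://httpbin.org/redirect/1', timeout=20,
                 allow_redirects=False)
print(r.status_code)  # 302 instead of the final 200

# verify: skip SSL certificate verification (for testing only)
r = requests.get('https://httpbin.org/get', timeout=20, verify=False)

# proxies: route requests through proxies (hypothetical addresses)
proxies = {'http': 'http://10.10.1.10:3128',
           'https': 'http://10.10.1.10:1080'}
# r = requests.get('http://www.baidu.com', timeout=20, proxies=proxies)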
Our understanding of these parameters will deepen gradually in later lessons. That is it for this installment.
Python simple crawler, installment 2!