Python Simple Crawler, Part 2!

Source: Internet
Author: User
Tags: http, authentication, ssl certificate

Lecture 2: Several Examples of Crawling Web Pages

In this lecture we will walk through a few examples that demonstrate the basic operations of crawling web pages. They also serve as a review of the last lesson: I trust you still remember the general code framework, because we will need it again here. Ha ha, you haven't forgotten it already, have you?

All right, here we go!

Example 1:

1 " "The novel crawls" "2 3 ImportRequests4 5  6 7 defgethtml_text (URL):8 9     Try:Ten  OneR = Requests.get (Url,timeout = 20) A  -R.raise_for_status ()#If the status is not 200, an exception is generated -  theR.encoding =r.apparent_encoding -  -         returnR.text -  +     except: -  +         return 'Create an exception' A  at   -  - if __name__=='__main__': -  -URL ='http://wanmeishijiexiaoshuo.org/book/9.html' -  in #the above is one of the novels on the website of the novel content -  to Print(Gethtml_text (URL) [: 5000])

Let's run it and look at the result:

See? The content of the novel has been crawled down. The operation is simple: give the crawling framework the URL of the novel page, and it fetches the content automatically.
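One fragile spot in the framework above is the bare `except:`, which swallows every error, including KeyboardInterrupt and genuine programming bugs. A hedged variant (the name `gethtml_text_safe` is ours, not from the original) that catches only requests' own exception family:

```python
import requests


def gethtml_text_safe(url):
    """Same framework as gethtml_text, but catching only
    requests.RequestException so real programming errors still surface."""
    try:
        r = requests.get(url, timeout=20)
        r.raise_for_status()  # a non-2xx status raises HTTPError here
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as e:
        return 'An exception occurred: %s' % e


# An unresolvable host fails fast with a ConnectionError, which is caught:
print(gethtml_text_safe('http://nonexistent.invalid/')[:60])
```

This way a typo inside the `try:` block still crashes loudly instead of being silently reported as a network exception.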

Here's a second example:

1 " "Amazon Arcade Crawl '2 3 Import Requests4 5  6 7 def gethtml_text (URL):8 9 Try:Ten  One kv = {' user-agent ': ' mozilla/5.0 '} #headers头信息 A  - r = requests.get (url,timeout = 20,headers = kv) -  the r.raise_for_status () #如果状态不是200, an exception is generated -  - r.encoding = r.apparent_encoding -  + return R.text -  + except: A  at return ' produces an exception ' -  -   -  - if __name__ = = ' __main__ ': -  in url = ' https://www.amazon.cn/' -  to print (Gethtml_text (URL) [: +])

Let's look at the result of running it:

Quite a mess, but that is not the point. The point is that we received the full response. Extracting the useful information from it will come in later lectures, so don't worry for now; take it one bite at a time.

kv = {'user-agent': 'mozilla/5.0'}  # headers info

r = requests.get(url, timeout=20, headers=kv)

Remember that the get() method mentioned earlier has several parameters we have not yet covered; headers is one of them. After adding the header information, the crawler disguises itself as a Mozilla browser when it sends requests to the target URL. If the target server can identify an undisguised crawler, it may deny access in no time. Here we pass the kv dictionary as the header information, which to some extent strengthens the crawler. Ha ha, see the next example below.
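You can see exactly what User-Agent the server would receive without sending anything, by preparing the request locally (the example.com URL is just a placeholder). Without a custom header, requests announces itself honestly as python-requests, which is precisely the string a server can key on to reject crawlers:

```python
import requests

s = requests.Session()

# Default: requests identifies itself honestly as python-requests/x.y.
plain = s.prepare_request(requests.Request('GET', 'http://example.com/'))
print(plain.headers['User-Agent'])

# With the kv trick from the lecture, the UA claims to be a Mozilla browser.
kv = {'user-agent': 'mozilla/5.0'}
disguised = s.prepare_request(
    requests.Request('GET', 'http://example.com/', headers=kv))
print(disguised.headers['User-Agent'])  # mozilla/5.0
```

Preparing a request like this builds the final headers (session defaults merged with your overrides) but never touches the network, so it is a handy way to debug the disguise.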

Example 3:

1 " "Baidu Search" "2 3 ImportRequests4 5  6 7 defgethtml_text (URL):8 9     Try:Ten  OneKV = {'user-agent':'mozilla/5.0'} A  -WD = {'WD':'Beijing Weather'} -  theR = requests.get (url,timeout = 20,params = WD, headers =kv) -  -R.raise_for_status ()#If the status is not 200, an exception is generated -  +R.encoding =r.apparent_encoding -  +         returnR.text A  at     except: -  -         return 'Create an exception' -  -   -  in if __name__=='__main__': -  toURL ="http://www.baidu.com/s?" +  -     Print(Gethtml_text (URL) [: 5000])

Let's look at the result of running it:

I #@¥! @%! @#

Calm down, and change it:

1 " "Baidu Search" "2 3 ImportRequests4 5  6 7 defgethtml_text (URL):8 9     Try:Ten  OneWD = {'WD':'Beijing Weather'} A  -R = requests.get (url,timeout = 20,params =wd) -  theR.raise_for_status ()#If the status is not 200, an exception is generated -  -R.encoding =r.apparent_encoding -  +         returnR.text -  +     except: A  at         return 'Create an exception' -  -   -  - if __name__=='__main__': -  inURL ="http://www.baidu.com/s?" -  to     Print(Gethtml_text (URL) [: 5000])

The result of running it:

Now we get a result. The reason is that with the headers parameter we pretended to be a Mozilla browser, and Baidu's server refused the visit outright; with the header removed, the request went through. No big deal; here we care about the process, not these details.
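What params=wd actually does in the examples above is percent-encode the dictionary and append it to the URL as a query string. Internally requests does roughly what the standard library's urlencode does; a minimal sketch of the equivalent URL construction:

```python
from urllib.parse import urlencode

base = 'http://www.baidu.com/s?'
wd = {'wd': 'Beijing weather'}

# requests.get(url, params=wd) sends the request to this full URL:
full_url = base + urlencode(wd)
print(full_url)  # http://www.baidu.com/s?wd=Beijing+weather
```

This is why passing params is safer than gluing the query string together yourself: spaces and special characters are encoded for you.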

Well, after the three examples above, I believe you have a deeper understanding of the requests.get() method. Finally, let's summarize the keyword parameters of get():

Parameter        Description

data             dictionary, byte sequence, or file object, sent as the body of the request
json             data in JSON format, sent as the body of the request
headers          dictionary of custom HTTP headers
cookies          dictionary or CookieJar, the cookies to send with the request
auth             tuple, enables HTTP authentication
timeout          timeout for the request, in seconds
files            dictionary, for transferring files
proxies          dictionary, sets an access proxy; may include login authentication
allow_redirects  True/False, default True; switch for following redirects
stream           True/False, default False; by default the response content is downloaded immediately, set True to defer downloading
verify           True/False, default True; switch for SSL certificate verification
cert             path to a local SSL certificate
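Several of these parameters can be combined in one call. A sketch (the URL, credentials, and cookie value are made-up placeholders) that prepares a request locally, without sending it, so you can inspect what params, headers, auth, and cookies each contribute:

```python
import requests

s = requests.Session()
req = requests.Request(
    'GET', 'https://example.com/data',
    params={'q': 'demo'},                    # appended to the URL
    headers={'user-agent': 'mozilla/5.0'},   # custom HTTP header
    auth=('user', 'passwd'),                 # HTTP basic authentication tuple
    cookies={'session': 'abc123'},           # cookies for the request
)
p = s.prepare_request(req)

print(p.url)                       # https://example.com/data?q=demo
print(p.headers['Authorization'])  # Basic <base64 of user:passwd>
print(p.headers['Cookie'])         # session=abc123
```

The auth tuple becomes a standard Basic Authorization header, the cookies dictionary becomes a Cookie header, and the params dictionary is folded into the URL, all before a single byte goes over the network.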

Your understanding of these parameters will deepen gradually with the lessons to come. That is all for this installment.

