Lecture 2: Several examples of crawling web pages
This lecture walks through a few examples to show the basic operation of crawling web pages, and it also reviews the last lesson. I trust you still remember the general code framework; we will need it again in this lecture. Haha, you haven't forgotten it, have you?
All right, here we go!
Example 1:
1 " "The novel crawls" "2 3 ImportRequests4 5 6 7 defgethtml_text (URL):8 9 Try:Ten OneR = Requests.get (Url,timeout = 20) A -R.raise_for_status ()#If the status is not 200, an exception is generated - theR.encoding =r.apparent_encoding - - returnR.text - + except: - + return 'Create an exception' A at - - if __name__=='__main__': - -URL ='http://wanmeishijiexiaoshuo.org/book/9.html' - in #the above is one of the novels on the website of the novel content - to Print(Gethtml_text (URL) [: 5000])
Let's run it and look at the result:
See? The content of the novel has been crawled down. Understand the operation now? Simply put, you give it the URL of the novel page, and the crawler framework fetches the content automatically.
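If you want to keep what you crawled, here is a minimal sketch that reuses the gethtml_text() framework above to write the page to a local file; the filename chapter.html is just an illustration:

text = gethtml_text('http://wanmeishijiexiaoshuo.org/book/9.html')
with open('chapter.html', 'w', encoding='utf-8') as f:
    f.write(text)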
Example 2:
1 " "Amazon Arcade Crawl '2 3 Import Requests4 5 6 7 def gethtml_text (URL):8 9 Try:Ten One kv = {' user-agent ': ' mozilla/5.0 '} #headers头信息 A - r = requests.get (url,timeout = 20,headers = kv) - the r.raise_for_status () #如果状态不是200, an exception is generated - - r.encoding = r.apparent_encoding - + return R.text - + except: A at return ' produces an exception ' - - - - if __name__ = = ' __main__ ': - in url = ' https://www.amazon.cn/' - to print (Gethtml_text (URL) [: +])
Look at the result of running it:
It looks like quite a mess, but that is not the point. The point is that we got all of the response information; later on we will get into extracting the useful parts from it. Don't worry for now, take it slowly: you eat a meal one mouthful at a time.
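Before moving on, here is a small sketch of a few Response attributes you can inspect to see what "all the response information" actually contains; the URL is the same Amazon one as above, but any page works:

import requests

r = requests.get('https://www.amazon.cn/', timeout=20)
print(r.status_code)                    # HTTP status code of the response
print(r.headers.get('Content-Type'))    # one of the response headers
print(r.encoding, r.apparent_encoding)  # declared vs. guessed text encoding
print(len(r.text))                      # length of the decoded response body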
The two new lines in this example are:

kv = {'user-agent': 'Mozilla/5.0'}  # request header information
r = requests.get(url, timeout=20, headers=kv)
Remember that the get() method mentioned earlier has a few parameters we have not covered yet; headers is one of them. After adding the header information, the crawler disguises itself as a Mozilla browser when it sends requests to the target URL. If the target server can identify crawlers, an undisguised one can be denied access in minutes. Here we pass the kv dictionary as the header information, which strengthens the crawler to a certain extent. Haha, see the sketch and the next example below.
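To see the disguise in action, here is a hedged sketch comparing the User-Agent that requests sends by default with the one from our kv dictionary; httpbin.org is just an echo service chosen for illustration:

import requests

# without headers, the server sees something like 'python-requests/2.x'
r1 = requests.get('https://httpbin.org/get', timeout=20)
print(r1.request.headers['User-Agent'])

# with the kv dictionary, the server sees 'Mozilla/5.0' instead
kv = {'user-agent': 'Mozilla/5.0'}
r2 = requests.get('https://httpbin.org/get', timeout=20, headers=kv)
print(r2.request.headers['User-Agent'])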
Example 3:
1 " "Baidu Search" "2 3 ImportRequests4 5 6 7 defgethtml_text (URL):8 9 Try:Ten OneKV = {'user-agent':'mozilla/5.0'} A -WD = {'WD':'Beijing Weather'} - theR = requests.get (url,timeout = 20,params = WD, headers =kv) - -R.raise_for_status ()#If the status is not 200, an exception is generated - +R.encoding =r.apparent_encoding - + returnR.text A at except: - - return 'Create an exception' - - - in if __name__=='__main__': - toURL ="http://www.baidu.com/s?" + - Print(Gethtml_text (URL) [: 5000])
Look at the result of running it:
I #@¥! @%! @#
Calm down, and make a change:
1 " "Baidu Search" "2 3 ImportRequests4 5 6 7 defgethtml_text (URL):8 9 Try:Ten OneWD = {'WD':'Beijing Weather'} A -R = requests.get (url,timeout = 20,params =wd) - theR.raise_for_status ()#If the status is not 200, an exception is generated - -R.encoding =r.apparent_encoding - + returnR.text - + except: A at return 'Create an exception' - - - - if __name__=='__main__': - inURL ="http://www.baidu.com/s?" - to Print(Gethtml_text (URL) [: 5000])
The result:
Now we have a result. The reason is that the headers parameter we sent earlier simulated a Mozilla browser, and Baidu's server refused that visit outright. No matter, it is not a big problem; we care about the process here, so don't get hung up on these details.
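One way to check what params actually did is to print r.url, which shows the final URL after the wd dictionary has been encoded into the query string; a minimal sketch:

import requests

wd = {'wd': 'Beijing weather'}
r = requests.get('http://www.baidu.com/s?', timeout=20, params=wd)
print(r.url)  # roughly http://www.baidu.com/s?wd=Beijing+weather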
Well, after the three examples above, I believe you have a deeper understanding of how to use the requests.get() method. To finish, let's summarize the parameters of the get() method.
Parameter | Description
params | dictionary or byte sequence, appended to the URL as query parameters
data | dictionary, byte sequence, or file object, sent as the request body
json | JSON-formatted data, sent as the request body
headers | dictionary, custom HTTP headers
cookies | dictionary or CookieJar, cookies for the request
auth | tuple, supports HTTP authentication
timeout | sets the timeout, in seconds
files | dictionary, for transferring files
proxies | dictionary, sets access proxies; can include login authentication
allow_redirects | True/False, default True; switch for following redirects
stream | True/False, default True; switch for downloading content immediately
verify | True/False, default True; switch for SSL certificate verification
cert | path to a local SSL certificate
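To make the table concrete, here is a hedged sketch that exercises a few of these parameters; the httpbin.org URLs and the proxy addresses are placeholders for illustration, not endpoints this lesson depends on:

import requests

# allow_redirects: turn redirect following off and observe the raw 302
r = requests.get('https://httpbin.org/redirect/1', timeout=20,
                 allow_redirects=False)
print(r.status_code)  # 302 instead of the final 200

# verify: skip SSL certificate verification (for testing only)
r = requests.get('https://httpbin.org/get', timeout=20, verify=False)

# proxies: route requests through proxies (hypothetical addresses)
proxies = {'http': 'http://10.10.1.10:3128',
           'https': 'http://10.10.1.10:1080'}
# r = requests.get('http://www.baidu.com', timeout=20, proxies=proxies)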
Our understanding of these parameters will deepen gradually in later lessons. That is it for this installment.
Python simple crawler, installment 2!