Python web crawler's requests library

The requests library is an HTTP client written in Python. It is more convenient than urllib's urlopen and saves a lot of intermediate processing, so you can crawl web data directly. Let's look at a specific example:
import requests

def request_function_try():
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
    r = requests.get(url="http://www.baidu.com", headers=headers)
    print "status code:%s" % r.status_code
    print "headers:%s" % r.headers
    print "encoding:%s" % r.encoding
    print "cookies:%s" % r.cookies
    print "url:%s" % r.url
    print r.content.decode('utf-8').encode('mbcs')
Here requests.get() is called directly on the HTTP link, passing the URL and headers as parameters. The return value is the response object of the web page, from which you can read the status code, header information, encoding, cookie values, URL, and page content.
E:\python2.7.11\python.exe e:/py_prj/test3.py
status code:200
headers:{'content-encoding': 'gzip', 'transfer-encoding': 'chunked', 'set-cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'server': 'bfe/1.0.8.18', 'last-modified': 'Mon, 13:28:24 GMT', 'connection': 'keep-alive', 'pragma': 'no-cache', 'cache-control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'date': 'Sun, Sep 2017 02:53:11 GMT', 'content-type': 'text/html'}
encoding:ISO-8859-1
cookies:{'.baidu.com': {'/': {'BDORZ': Cookie(version=0, name='BDORZ', value='27315', port=None, port_specified=False, domain='.baidu.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1505702637, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)}}}
url:http://www.baidu.com/
Note that when printing the page content directly in Python 2, the Chinese characters cause a problem, so the content has to be decoded and re-encoded first. The encoding used here is mbcs (the Windows system encoding). The system's encoding can be obtained as follows:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')
type = sys.getfilesystemencoding()
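A related detail (an addition, not in the original): in the first output above, requests reported encoding:ISO-8859-1 even though Baidu's page is UTF-8, because r.encoding is derived from the HTTP headers alone. requests also exposes r.apparent_encoding, which guesses the encoding from the body itself; a minimal sketch:

import requests

r = requests.get('http://www.baidu.com')
print r.encoding              # from the Content-Type header, e.g. ISO-8859-1
print r.apparent_encoding     # detected from the body, e.g. utf-8
r.encoding = r.apparent_encoding
print r.text[:200].encode('mbcs')   # r.text now decodes with the detected encoding; encode for the Windows console as above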
Requests also has a built-in JSON decoder that can parse returned JSON data:
r = requests.get('https://github.com/timeline.json')
print r.json()
E:\python2.7.11\python.exe e:/py_prj/test3.py
{u'documentation_url': u'https://developer.github.com/v3/activity/events/#list-public-events', u'message': u'Hello there, wayfaring stranger. If you\u2019re reading this then you probably didn\u2019t see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.'}
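If the response body is not valid JSON, r.json() raises an exception (a ValueError in requests versions of this era), so it can be worth guarding the call; a minimal sketch, not from the original:

import requests

r = requests.get('https://github.com/timeline.json')
try:
    data = r.json()               # parse the body as JSON
    print data.get('message')     # fields are plain dict entries
except ValueError:
    print "response body is not valid JSON"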
What if you want to pass data with the request? Here we take Baidu search as an example: enter python in the search box and get the returned results.
import sys
import requests

def request_function_try1():
    reload(sys)
    sys.setdefaultencoding('utf-8')
    type = sys.getfilesystemencoding()
    print type
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
    payload = {'wd': 'python'}
    r = requests.get(url="http://www.baidu.com/s", params=payload, headers=headers)
    print r.status_code
    print r.content.decode('utf-8').encode(type)
    fp = open('search2.html', 'w')
    for line in r.content:
        fp.write(line)
    fp.close()
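One detail worth noting (an addition, not from the original): r.content is a byte string, and opening the file in text mode on Windows translates newlines, which can corrupt the saved bytes. A safer sketch writes the whole byte string at once in binary mode:

# write the raw response bytes in binary mode instead of character by character
fp = open('search2.html', 'wb')
fp.write(r.content)
fp.close()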
Why use http://www.baidu.com/s as the URL? Look at it in the browser: after entering python in the search box, the page actually jumps to the https://www.baidu.com/s interface, and wd=python and the other fields are the submitted query data.

The results of the implementation are as follows:
status code:200
headers:{'strict-transport-security': 'max-age=172800', 'bdqid': '0xeb453e0b0000947a', 'content-encoding': 'gzip', 'transfer-encoding': 'chunked', 'set-cookie': 'BDSVRTM=0; path=/, BD_HOME=0; path=/, H_PS_PSSID=1421_21078_17001_24394; path=/; domain=.baidu.com', 'expires': 'Sun, Sep 02:56:13 GMT', 'bduserid': '0', 'x-powered-by': 'HPHP', 'server': 'BWS/1.1', 'connection': 'keep-alive', 'cxy_all': 'baidu+2455763ad13223918d1e7f7431d4d18e', 'cache-control': 'private', 'date': 'Sun, Sep 02:56:43 GMT', 'vary': 'Accept-Encoding', 'content-type': 'text/html; charset=utf-8', 'bdpagetype': '1', 'x-ua-compatible': 'IE=Edge,chrome=1'}
encoding:utf-8
cookies:<RequestsCookieJar[<Cookie H_PS_PSSID=1421_21078_17001_24394 for .baidu.com/>, <Cookie BDSVRTM=0 for www.baidu.com/>, <Cookie BD_HOME=0 for www.baidu.com/>]>
url:https://www.baidu.com/
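As a side note (an addition, not in the original article), you can check the URL that requests builds from params without sending anything, using a PreparedRequest; a small sketch:

import requests

# build the request without sending it, to inspect the encoded query string
req = requests.Request('GET', 'http://www.baidu.com/s', params={'wd': 'python'})
prepared = req.prepare()
print prepared.url    # http://www.baidu.com/s?wd=python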
If the website we visit returns a status code other than 200, the requests library also provides exception handling via raise_for_status(), which raises an exception when an error response (4xx or 5xx) is returned:
url = 'http://www.baidubaidu.com/'
try:
    r = requests.get(url)
    r.raise_for_status()
except requests.RequestException as e:
    print e
The execution results are as follows, and the specific error code information is returned in the exception.
E:\python2.7.11\python.exe e:/py_prj/test3.py
409 Client Error: Conflict for url: http://www.baidubaidu.com/
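Beyond raise_for_status(), requests defines more specific exception classes. A minimal sketch (using only exception classes documented by requests) that also guards against connection failures and timeouts:

import requests

url = 'http://www.baidubaidu.com/'
try:
    r = requests.get(url, timeout=5)   # fail if no response arrives within 5 seconds
    r.raise_for_status()               # raise HTTPError for 4xx/5xx responses
except requests.exceptions.HTTPError as e:
    print "HTTP error: %s" % e
except requests.exceptions.ConnectionError as e:
    print "connection error: %s" % e
except requests.exceptions.Timeout:
    print "request timed out"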
Now let's look at how to simulate access to an HTTPS website, taking the CSDN website as an example. To simulate a login, first capture the web traffic for analysis; here Fiddler is used for the capture.

(a) Analyze the page jumps. The first is the login page, whose URL is https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn. It then automatically jumps to my.csdn.net.

(b) Analyze the data submitted by the page. On the right side of Fiddler, the data the page actually submits is shown. The box above is the header information that is sent; below it is the header information the server returns with its data. We use the data above to construct the headers we send.
(c) From the capture above we see that the data is submitted with POST, so we need to see what the POST data contains. Click WebForms to see the uploaded data, which has fields such as username, password, lt, execution, and _eventid. We record these fields so they are easy to construct in the code.

(d) The final step is to look at the request that jumps to the mycsdn interface. It uses the GET method and sends only header information, so we only need to construct the headers.

After the data flow analysis, we can begin to construct the code.
First construct the header information. The most important field is User-Agent; if it is not set, the site will block the request:
headers = {'Host': 'passport.csdn.net', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'}
headers1 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'}
Then construct the cookie values for the request:
cookies = {'JSESSIONID': '2222222222222220265c40d8a33cb.tomcat2',
           'uuid_tt_dd': '-411111111111119_20170926',
           'UN': 'xxxxx', 'UE': '[email protected]', 'BT': '334343481',
           'LSSC': 'LSSC-145514-7aaaaaaaaaaazggmhfvhfo9taaaaaaar-passport.csdn.net',
           'Hm_lvt_6bcd52f51bbbbbb2bec4a3997715ac': '15044213,150656493,15064444445,1534488843',
           'Hm_lpvt_6bcd52f51bbbbbbbe32bec4a3997715ac': '1506388843',
           'dc_tos': 'oabckz', 'dc_session_id': '15063aaaa027_0.7098840409889817',
           '__message_sys_msg_id': '0', '__message_gu_msg_id': '0', '__message_cnel_msg_id': '0',
           '__message_district_code': '000000', '__message_in_school': '0'}
Then set the URL and the POST data:
url = 'https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn'
data = {'username': 'xxxx', 'password': 'xxxxx', 'lt': 'lt-1522220-bsnh9fn6ycbbbbbqgssp2waaa1jvq', 'execution': 'e4ab', '_eventid': 'submit'}
Now prepare the connection. A Session is used so that subsequent requests keep shared state, such as cookie values:
r = requests.Session()
r.post(url=url, headers=headers, cookies=cookies, data=data)
This step fails and returns the following result, indicating that certificate verification failed:
File "E:\python2.7.11\lib\site-packages\requests\adapters.py", line 506, in send
Raise Sslerror (E, request=request)
requests.exceptions.SSLError:HTTPSConnectionPool (host= ' passport.csdn.net ', port=443): Max retries exceeded with URL:/account/login?from=http://my.csdn.net/my/mycsdn (caused by Sslerror (Sslerror (1, u ' [ssl:certificate_verify_ FAILED] Certificate Verify FAILED (_ssl.c:590),   ))
The reason for this error is that Python 2.7.9 introduced a new feature: SSL certificates are validated when an HTTPS URL is opened with urllib.urlopen. When the target site uses a self-signed certificate, a urllib2.URLError is raised with the message <urlopen error [ssl:certificate_verify_failed] CERTIFICATE VERIFY FAILED (_ssl.c:581)>.
To solve this problem, the PEP 476 documentation says:
"For users who wish to opt out of certificate verification on a single connection, they can achieve this by providing the context argument to urllib.urlopen."
That means the certificate verification requirement can be disabled. For urllib there are two ways. One is that urllib.urlopen() has a context parameter, which can be set to an unverified context with ssl._create_unverified_context:

import ssl
import urllib

# restore the pre-2.7.9 behaviour for this single connection
context = ssl._create_unverified_context()
urllib.urlopen("https://no-valid-cert", context=context)
But in requests there is a verify parameter; simply set it to False:
r.post(url=url, headers=headers, cookies=cookies, data=data, verify=False)
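Note that verify=False turns off certificate checking entirely (and newer requests versions emit an InsecureRequestWarning). If the site's CA certificate is available, requests also accepts a file path in verify, which keeps verification on; the path below is a placeholder, not from the original article:

# safer alternative: verify against a specific CA bundle instead of disabling checks
# ('/path/to/ca_bundle.pem' is a placeholder)
r.post(url=url, headers=headers, cookies=cookies, data=data,
       verify='/path/to/ca_bundle.pem')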
Next, visit the address of the mycsdn page. This completes a successful login to the CSDN website:
s = r.get('http://my.csdn.net/my/mycsdn', headers=headers1)
print s.status_code
print s.content.decode('utf-8').encode(type)
