Making URL requests in Python involves the modules urllib, urllib2, and requests. urllib and urllib2 are part of Python's own standard library for HTTP access, while requests is a third-party library that must be installed separately. In practice, requests is probably the most convenient of the three.
urllib and urllib2
Both the urllib and urllib2 modules deal with URL requests, but they provide different functionality. The most common urllib2 request looks like this:
response = urllib2.urlopen('http://www.baidu.com')
urlopen() accepts either a URL string or a Request object. A Request object lets you set the request headers, so the request can be disguised as coming from a browser (useful when a site monitors incoming requests), whereas urllib's urlopen() only accepts a URL. This is one difference between the two:
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
request = urllib2.Request(url, headers={'User-Agent': user_agent})
response = urllib2.urlopen(request)
On the other hand, urllib has some functions that were not carried over into urllib2, so in some cases urllib is still needed as a helper; for example, encoding work such as urllib.urlencode() and urllib.quote() is only available in urllib. This is another difference between the two.
urllib2.urlopen() takes several common parameters: url, data, and timeout (the timeout for blocking operations, in seconds). The url and data parameters are the same ones described below for the Request class.
The Request class takes 5 parameters: url, data, headers, origin_req_host, and unverifiable.
- url is, of course, the URL address we want to request.
- data is the additional data we want to submit to the server; it may be None. If data is provided, the request becomes a POST request. The data needs to be encoded in the standard form format (e.g. with urllib.urlencode()) before being passed to the Request object.
- headers is the request header, a dictionary. It tells the server information about the request, such as the requesting browser, the operating system, cookies, the accepted response format, caching behavior, whether compression is supported, and so on. Some anti-crawler sites monitor the type of request, so we need to disguise ourselves as a browser instead of issuing the request directly; see the User-Agent in the code above.
- origin_req_host is the request-host of the origin transaction, as defined by RFC 2965. It defaults to cookielib.request_host(self). This is the host name or IP address of the original request initiated by the user. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page containing the image.
- unverifiable indicates whether the request is unverifiable, also defined by RFC 2965. It defaults to False. An unverifiable request is one whose URL the user did not have the option of approving.
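Putting the first three parameters together, here is a minimal sketch. It uses the Python 3 names (urllib2's Request and urlopen moved to urllib.request, and urlencode moved to urllib.parse); the URL and form field are placeholders:

```python
import urllib.parse
import urllib.request

# Form data must be urlencoded (and, in Python 3, byte-encoded) first.
data = urllib.parse.urlencode({'q': 'python'}).encode('utf-8')

# Passing data makes this a POST request; the header disguises us as a browser.
request = urllib.request.Request(
    'http://www.example.com/search',
    data=data,
    headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'},
)

print(request.get_method())    # POST, because data was supplied
print(request.get_full_url())  # http://www.example.com/search
# urllib.request.urlopen(request) would actually send it.
```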
Not every request returns a page successfully; if the requested URL is wrong or the server returns an error, we also need to handle the failure properly.
try:
    response = urllib2.urlopen('http://www.baidu.com')
except urllib2.HTTPError as e:
    print e.code
    print e.reason
except urllib2.URLError as e:
    print e.reason
else:
    response.read()
When an error raises an exception, we can catch it to inspect the reason and get the status code of the request. The response's getcode() method also returns the request status code. Appendix:
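Note that HTTPError must be caught before URLError, because it is a subclass. Here is a quick offline check of this, using the Python 3 locations of the two classes in urllib.error, and constructing an HTTPError by hand purely for illustration:

```python
import urllib.error

# Build an HTTPError by hand, just to inspect it (no network involved).
err = urllib.error.HTTPError('http://www.example.com', 404, 'Not Found',
                             hdrs=None, fp=None)

print(isinstance(err, urllib.error.URLError))  # True: HTTPError subclasses URLError
print(err.code)    # 404
print(err.reason)  # Not Found
```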
# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
    100: ('Continue', 'Request received, please continue'),
    101: ('Switching Protocols', 'Switching to new protocol; obey Upgrade header'),
    200: ('OK', 'Request fulfilled, document follows'),
    201: ('Created', 'Document created, URL follows'),
    202: ('Accepted', 'Request accepted, processing continues off-line'),
    203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
    204: ('No Content', 'Request fulfilled, nothing follows'),
    205: ('Reset Content', 'Clear input form for further input.'),
    206: ('Partial Content', 'Partial content follows.'),
    300: ('Multiple Choices', 'Object has several resources -- see URI list'),
    301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
    302: ('Found', 'Object moved temporarily -- see URI list'),
    303: ('See Other', 'Object moved -- see Method and URL list'),
    304: ('Not Modified', 'Document has not changed since given time'),
    305: ('Use Proxy', 'You must use proxy specified in Location to access this resource.'),
    307: ('Temporary Redirect', 'Object moved temporarily -- see URI list'),
    400: ('Bad Request', 'Bad request syntax or unsupported method'),
    401: ('Unauthorized', 'No permission -- see authorization schemes'),
    402: ('Payment Required', 'No payment -- see charging schemes'),
    403: ('Forbidden', 'Request forbidden -- authorization will not help'),
    404: ('Not Found', 'Nothing matches the given URI'),
    405: ('Method Not Allowed', 'Specified method is invalid for this server.'),
    406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required', 'You must authenticate with this proxy before proceeding.'),
    408: ('Request Timeout', 'Request timed out; try again later.'),
    409: ('Conflict', 'Request conflict.'),
    410: ('Gone', 'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable', 'Cannot satisfy request range.'),
    417: ('Expectation Failed', 'Expect condition could not be satisfied.'),
    500: ('Internal Server Error', 'Server got itself in trouble'),
    501: ('Not Implemented', 'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service Unavailable', 'The server cannot process the request due to a high load'),
    504: ('Gateway Timeout', 'The gateway server did not receive a timely response'),
    505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
}
Requests
requests is built on urllib3 and inherits all the features of urllib2. It supports HTTP keep-alive and connection pooling, keeping sessions with cookies, file uploads, automatic decoding of response content, internationalized URLs, and automatic encoding of POST data.
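As a sketch of that automatic encoding, requests lets you build a request and inspect it without sending anything over the network; the URL, form field, and User-Agent string here are placeholders:

```python
import requests

# Build the request object without sending it -- no network needed.
prepared = requests.Request(
    'POST', 'http://www.example.com/api',
    data={'data': 'value'},
    headers={'User-Agent': 'my-crawler/1.0'},
).prepare()

print(prepared.method)                 # POST
print(prepared.body)                   # data=value -- form-encoded automatically
print(prepared.headers['user-agent'])  # header lookup is case-insensitive
```

requests.Session().send(prepared) would actually transmit it; requests.post() does all of this in one call.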
GET request
response = requests.get('http://www.baidu.com')
print response.text
POST request
response = requests.post('http://api.baidu.com', data={'data': 'value'})
Custom headers
response = requests.get('http://www.baidu.com', headers={'User-Agent': user_agent})
Common operations on the returned response object:
r.status_code  # the response status code
r.raw  # the raw response body (the underlying urllib3 response object); read it with r.raw.read()
r.content  # the response body as bytes; gzip and deflate compression are decoded automatically
r.text  # the response body as a string, decoded automatically using the character encoding from the response headers
r.headers  # the server's response headers as a dictionary-like object; its keys are case-insensitive, and get() returns None for a missing key
Special methods:
r.json()  # the JSON decoder built into requests
r.raise_for_status()  # raises an exception for a failed request (a 4xx or 5xx response)
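A small offline illustration of raise_for_status(), constructing a bare Response object by hand just for demonstration (normally you would get one back from requests.get()):

```python
import requests

# A hand-made Response, for illustration only.
r = requests.models.Response()
r.status_code = 404

try:
    r.raise_for_status()
except requests.HTTPError as e:
    print('caught:', e)   # 404 Client Error ...

r.status_code = 200
r.raise_for_status()      # a 2xx status raises nothing
```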
----------------to Be Continued---------------