# Python HTTP requests three ways: urllib2/urllib, httplib/urllib, and requests

## urllib2/urllib implementation

urllib2 and urllib are two built-in Python modules that implement HTTP functionality; urllib2 does most of the work, with urllib as a supplement.
### 1. Implementing a complete request-and-response model
- urllib2 provides the basic function urlopen, which fetches a page in a single call:

```python
import urllib2

response = urllib2.urlopen('http://www.cnblogs.com/guguobao')
html = response.read()
print html
```
- An improved version splits this into two steps: request and response.

```python
#!coding:utf-8
import urllib2

# request
request = urllib2.Request('http://www.cnblogs.com/guguobao')
# response
response = urllib2.urlopen(request)
html = response.read()
print html
```
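For readers on Python 3, urllib2's Request/urlopen pair was merged into urllib.request. A minimal sketch of the same two-step pattern, inspecting the Request object offline instead of actually fetching the page:

```python
from urllib import request

# build the Request first; normally it would then be passed to request.urlopen()
req = request.Request('http://www.cnblogs.com/guguobao')
print(req.full_url)      # http://www.cnblogs.com/guguobao
print(req.get_method())  # GET, since no body was attached
```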
- The examples above use a GET request; below it is changed to a POST request, with urllib doing the encoding.

```python
#!coding:utf-8
import urllib
import urllib2

url = 'http://www.cnblogs.com/login'
postdata = {'username': 'qiye',
            'password': 'qiye_pass'}
# the form data needs to be encoded into a format urllib2 understands; urllib does this
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html = response.read()
```
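In Python 3 the encoding helper moved to urllib.parse, and urlopen expects the POST body as bytes. A sketch of the same encoding step with the placeholder credentials from the text, checked offline:

```python
from urllib import parse, request

postdata = {'username': 'qiye', 'password': 'qiye_pass'}
# urlencode produces application/x-www-form-urlencoded text;
# the body must then be encoded to bytes before being sent
data = parse.urlencode(postdata).encode('utf-8')
req = request.Request('http://www.cnblogs.com/login', data=data)
print(data)              # b'username=qiye&password=qiye_pass'
print(req.get_method())  # attaching a body makes the default method POST
```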
- However, running this produces no output, because the server denies the access: it checks the request header information to determine whether the request came from a browser.
### 2. Request header processing
```python
#coding:utf-8
# request header processing: set the User-Agent and Referer fields in the request header
import urllib
import urllib2

url = 'http://www.xxxxxx.com/login'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
referer = 'http://www.xxxxxx.com/'
postdata = {'username': 'qiye',
            'password': 'qiye_pass'}
# write user_agent and referer into the header information
headers = {'User-Agent': user_agent, 'Referer': referer}
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
html = response.read()
```
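A Python 3 counterpart of the headers example, inspected offline (urllib normalizes stored header names, so they are read back with capitalize()-style keys):

```python
from urllib import parse, request

url = 'http://www.xxxxxx.com/login'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
referer = 'http://www.xxxxxx.com/'
postdata = {'username': 'qiye', 'password': 'qiye_pass'}

headers = {'User-Agent': user_agent, 'Referer': referer}
data = parse.urlencode(postdata).encode('utf-8')
req = request.Request(url, data=data, headers=headers)

# the headers are attached before the request is ever sent
print(req.get_header('User-agent'))
print(req.get_header('Referer'))
```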
### 3. Cookie processing
- urllib2 also handles cookies automatically, using the CookieJar class to manage them. If you need to get the value of a particular cookie item, you can do this:
```python
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.zhihu.com')
for item in cookie:
    print item.name + ':' + item.value
```
- Sometimes, however, we do not want urllib2 to handle cookies automatically and would rather add a cookie ourselves. This can be done by setting the Cookie field in the request header:
```python
import urllib2

opener = urllib2.build_opener()
# the cookie name and its value can be replaced with anything, but must not be empty
opener.addheaders.append(('Cookie', 'email=' + 'helloguguobao@gmail.com'))
req = urllib2.Request('http://www.zhihu.com')
response = opener.open(req)
print response.headers
retdata = response.read()
```
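In Python 3, cookielib became http.cookiejar. A sketch of both approaches from this section, automatic management through a CookieJar and a manually appended Cookie header, constructed offline without contacting a server:

```python
import http.cookiejar
from urllib import request

# automatic management: responses opened through this opener store cookies in the jar
jar = http.cookiejar.CookieJar()
auto = request.build_opener(request.HTTPCookieProcessor(jar))
print(len(jar))  # 0: nothing has been fetched yet

# manual management: this opener sends the Cookie header with every request it makes
manual = request.build_opener()
manual.addheaders.append(('Cookie', 'email=helloguguobao@gmail.com'))
print(('Cookie', 'email=helloguguobao@gmail.com') in manual.addheaders)  # True
```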
### 4. Setting timeouts
- In Python 2.6 and later, the urlopen function provides a timeout parameter:
```python
import urllib2

request = urllib2.Request('http://www.zhihu.com')
response = urllib2.urlopen(request, timeout=2)
html = response.read()
print html
```
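The timeout behavior can be demonstrated without an external site. In this Python 3 sketch, a local socket accepts the connection but never answers, so urlopen gives up after the timeout; the port number is chosen by the OS, not an assumption about any real service:

```python
import socket
import urllib.error
import urllib.request

# a local socket that accepts connections but never replies, to force a timeout
srv = socket.socket()
srv.bind(('127.0.0.1', 0))   # port 0 lets the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]

err = None
try:
    urllib.request.urlopen('http://127.0.0.1:%d/' % port, timeout=0.5)
except OSError as e:
    # depending on where the timeout hits, this is a socket timeout or a
    # urllib.error.URLError wrapping one; both are OSError subclasses
    err = e
finally:
    srv.close()

print('request failed as expected:', err is not None)
```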
### 5. Getting the HTTP response code
- For a successful response, the HTTP status code can be obtained simply by calling the getcode() method of the response object returned by urlopen. For other status codes, urlopen raises an exception, and the code attribute of the exception object must be checked instead:
```python
import urllib2

try:
    response = urllib2.urlopen('http://www.google.com')
    print response.getcode()
except urllib2.HTTPError as e:
    if hasattr(e, 'code'):
        print 'Error code:', e.code
```
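In Python 3 the exception lives in urllib.error. An HTTPError can be constructed directly to show the attributes the except branch reads; a real one would come out of urlopen on a 4xx/5xx response:

```python
import io
import urllib.error

# constructed by hand for illustration; urlopen raises one of these on an error status
e = urllib.error.HTTPError('http://www.google.com/missing', 404,
                           'Not Found', hdrs=None, fp=io.BytesIO(b''))
if hasattr(e, 'code'):
    print('Error code:', e.code)  # Error code: 404
print(e.reason)                   # Not Found
```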
### 6. Redirects
- urllib2 automatically follows HTTP 3xx redirect responses by default. To detect whether a redirect happened, just check whether the response URL and the request URL are the same:
```python
import urllib2

response = urllib2.urlopen('http://www.zhihu.cn')
isRedirected = response.geturl() != 'http://www.zhihu.cn'
```
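The same check can be exercised against a throwaway local server in Python 3 (the server, its paths, and the port are made up for the demonstration): urllib follows the 302 automatically, so geturl() no longer matches the request URL.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

# a tiny local server: "/" answers with a 302 to "/target", which answers 200
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/':
            self.send_response(302)
            self.send_header('Location', '/target')
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'ok')
    def log_message(self, *args):  # keep the output quiet
        pass

srv = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=srv.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d/' % srv.server_address[1]

resp = request.urlopen(base)          # the 302 is followed automatically
redirected = resp.geturl() != base    # geturl() now points at /target
print(resp.geturl())
srv.shutdown()
```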
- If you do not want automatic redirection, you can customize the HTTPRedirectHandler class:
```python
import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        result = urllib2.HTTPRedirectHandler.http_error_301(self, req, fp, code, msg, headers)
        result.status = code
        result.newurl = result.geturl()
        return result

opener = urllib2.build_opener(RedirectHandler)
opener.open('http://www.zhihu.cn')
```
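A Python 3 sketch of the opposite behavior, again against a throwaway local server: a custom HTTPRedirectHandler whose redirect_request returns None, which makes urllib stop following the 302 and surface it as an HTTPError instead:

```python
import threading
import urllib.error
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

class NoRedirect(request.HTTPRedirectHandler):
    # returning None makes urllib fall through to normal error handling,
    # so a 3xx surfaces as an HTTPError instead of being followed
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(302)
        self.send_header('Location', '/elsewhere')
        self.end_headers()
    def log_message(self, *args):  # keep the output quiet
        pass

srv = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=srv.serve_forever, daemon=True).start()

opener = request.build_opener(NoRedirect)
status = None
try:
    opener.open('http://127.0.0.1:%d/' % srv.server_address[1])
except urllib.error.HTTPError as e:
    status = e.code  # the 302 itself, not followed
srv.shutdown()
print(status)  # 302
```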
### 7. Proxy settings
- Proxies are often needed when developing crawlers. By default, urllib2 uses the environment variable http_proxy to set the HTTP proxy. Rather than relying on that, we can use ProxyHandler to set a proxy dynamically in the program:
```python
import urllib2

# at runtime, turn off the socks client's system-wide proxy and use port 1080,
# or quit the socks software entirely
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:1080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.zhihu.com/')
print response.read()
```
Note that urllib2.install_opener() sets the global opener for urllib2, after which all HTTP accesses use this proxy. That is convenient, but if you want to use two different proxies in one program, you cannot change the global setting with install_opener; instead, call the opener's open() method directly:
```python
import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:1080'})
opener = urllib2.build_opener(proxy)
response = opener.open("http://www.google.com/")
print response.read()
```
When running this, shut down the socks client's system proxy.
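In Python 3 the same per-opener trick looks like this (both proxy addresses are illustrative placeholders; nothing is actually opened here):

```python
from urllib import request

# two openers, each with its own proxy; neither touches the global state
# that install_opener would change
proxy_a = request.ProxyHandler({'http': '127.0.0.1:1080'})
proxy_b = request.ProxyHandler({'http': '127.0.0.1:1081'})
opener_a = request.build_opener(proxy_a)
opener_b = request.build_opener(proxy_b)

# each handler keeps its own proxy table, consulted per request by its opener
print(proxy_a.proxies)  # {'http': '127.0.0.1:1080'}
print(proxy_b.proxies)  # {'http': '127.0.0.1:1081'}
```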