Reposted from: http://www.cnblogs.com/kennyhr/p/4018668.html (if this infringes your rights, contact me and I will delete it)
New members of the technical group keep asking questions about urllib, urllib2, and cookielib. I am summarizing the answers here so that nobody has to answer the same questions over and over again.
This is a tutorial-style article; if you already know urllib2 and cookielib, feel free to skip it.
First, let's start with a piece of code:
# cookies
import urllib2
import cookielib
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
request = urllib2.Request(url='http://www.baidu.com/')
request.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')
response = opener.open(request)
for item in cookie:
    print item.value
Many students ask: I can pass the URL directly to opener.open(), so why use a Request at all? I wrote it this way to lay out the common steps for constructing a request with urllib2.
Basics: the common steps for constructing a request with urllib2 (refer to the code above):
1. Handler
Handlers are passed to urllib2.build_opener(handler). The commonly used built-in handlers are:
urllib2.HTTPHandler(): opens URLs via HTTP
urllib2.CacheFTPHandler(): FTP handler with persistent FTP connections
urllib2.FileHandler(): opens local files
urllib2.FTPHandler(): opens URLs via FTP
urllib2.HTTPBasicAuthHandler(): handles HTTP basic authentication
urllib2.HTTPCookieProcessor(): handles HTTP cookies
urllib2.HTTPDefaultErrorHandler(): handles HTTP errors by raising HTTPError exceptions
urllib2.HTTPDigestAuthHandler(): handles HTTP digest authentication
urllib2.HTTPRedirectHandler(): handles HTTP redirects
urllib2.HTTPSHandler(): opens URLs via secure HTTP (HTTPS)
urllib2.ProxyHandler(): routes requests through a proxy
urllib2.ProxyBasicAuthHandler(): handles basic proxy authentication
urllib2.ProxyDigestAuthHandler(): handles digest proxy authentication
urllib2.UnknownHandler(): handles all unknown URLs
2. Request
request = urllib2.Request(url='')
request.add_data(data): if the request is HTTP, the method changes to POST. Note that this does not append data to any previously set data; the new data replaces the old.
request.add_header(key, val): key is the header name and val is the header value; both are strings.
request.add_unredirected_header(key, val): same as above, but the header is not added to redirected requests.
request.set_proxy(host, type): prepares the request for connecting through a proxy server. The original host is replaced with host, and the original request type is replaced with type.
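The Request methods above can be tried without sending anything. The sketch below attaches form data through the constructor's data argument rather than add_data(), because add_data() no longer exists in Python 3. The endpoint, form fields, and the X-Trace header are hypothetical, and the import fallback is an assumption so the code runs on both Python 2 and 3.

```python
try:
    import urllib2
    from urllib import urlencode
except ImportError:  # Python 3
    import urllib.request as urllib2
    from urllib.parse import urlencode

# Hypothetical endpoint and form fields, used only for illustration.
form = urlencode({'user': 'alice', 'lang': 'en'}).encode('ascii')
request = urllib2.Request('http://example.com/login', data=form)
request.add_header('User-Agent', 'fake-client')
request.add_unredirected_header('X-Trace', 'first-hop-only')

print(request.get_method())               # POST, because data is attached
print(request.get_header('User-agent'))   # names are stored capitalized
```

Note that header names are stored via str.capitalize(), so lookups with get_header() need the form 'User-agent' rather than 'User-Agent'.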
3. Opener
The basic urlopen() function does not support authentication, cookies, or other advanced HTTP features. To use these features, you must create your own custom opener object, typically with the build_opener() function.
There are usually two ways to build a custom opener object:
a
opener = urllib2.OpenerDirector()
opener.add_handler(handler)
b
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)
install_opener(opener) installs opener as the global URL opener used by urlopen(), which means the installed opener object is used whenever urlopen() is called afterwards. opener is typically an object created by build_opener().
4. content_stream
content_stream = opener.open(request)
5. content_stream.read()
With these five steps you end up with code like the example at the beginning of this article, which completes a basic urllib2 usage pattern. You could also wrap the five steps in a class, though I don't think that would be any more concise.
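The five steps can be sketched as one small function. The function name fetch and its user_agent parameter are made up for illustration, and the demo reads a temporary file through a file:// URL (POSIX-style paths assumed) so that it does not depend on the network. The import fallback is an assumption for Python 3.

```python
import os
import tempfile
try:
    import urllib2
    import cookielib
except ImportError:  # Python 3
    import urllib.request as urllib2
    import http.cookiejar as cookielib


def fetch(url, user_agent='fake-client'):
    """Run the five steps and return (body, cookie_jar)."""
    cookie_jar = cookielib.CookieJar()
    handler = urllib2.HTTPCookieProcessor(cookie_jar)    # 1. handler
    request = urllib2.Request(url)                       # 2. request
    request.add_header('User-Agent', user_agent)
    opener = urllib2.build_opener(handler)               # 3. opener
    content_stream = opener.open(request)                # 4. content_stream
    try:
        body = content_stream.read()                     # 5. read()
    finally:
        content_stream.close()
    return body, cookie_jar


# Demo against a local file:// URL so nothing depends on the network.
fd, path = tempfile.mkstemp()
os.write(fd, b'hello urllib2')
os.close(fd)
body, jar = fetch('file://' + path)
os.remove(path)
print(body)
```

For a real page you would call fetch('http://www.baidu.com/') and then iterate over the returned cookie jar, as in the opening example.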
The urllib2 module lets you access web pages either with the urlopen() function or with a custom opener.
Note, however, that the urlretrieve() function lives in the urllib module and does not exist in urllib2. In practice urllib2 is rarely used without urllib, because POST data has to be encoded with the urllib.urlencode() function.
Advanced: more urllib2 usage details
1. Proxy settings
import urllib2
enable_proxy = True
proxy_handler = urllib2.ProxyHandler({'http': 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})
if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)
urllib2.install_opener(opener)
PS: urllib2.install_opener() sets the global opener of urllib2. This is convenient for later calls, but it does not allow finer control, for example when one program needs two different proxy settings. The better approach is to leave the global settings alone and call the opener's open() method directly instead of the global urlopen().
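A rough sketch of that advice: two independent openers, each with its own proxy, and no call to install_opener(). The proxy addresses and the helper proxies_of are hypothetical. The helper reads the handler's proxies attribute, which is how CPython's ProxyHandler stores the mapping but is not a documented interface. The import fallback is an assumption for Python 3.

```python
try:
    import urllib2
except ImportError:  # Python 3
    import urllib.request as urllib2

# Hypothetical proxy addresses, for illustration only.
opener_a = urllib2.build_opener(
    urllib2.ProxyHandler({'http': 'http://proxy-a.example:8080'}))
opener_b = urllib2.build_opener(
    urllib2.ProxyHandler({'http': 'http://proxy-b.example:3128'}))


def proxies_of(opener):
    """Return the proxy mapping held by the opener's ProxyHandler."""
    for handler in opener.handlers:
        if isinstance(handler, urllib2.ProxyHandler):
            return handler.proxies
    return {}


print(proxies_of(opener_a))
print(proxies_of(opener_b))
# Each request then goes through the opener that owns the wanted proxy,
# e.g. opener_a.open(url) or opener_b.open(url); urlopen() is untouched.
```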
2. Timeout setting
# < py2.6
import urllib2
import socket
socket.setdefaulttimeout(10)  # one way
urllib2.socket.setdefaulttimeout(10)  # another way
# >= py2.6
import urllib2
response = urllib2.urlopen('http://www.google.com', timeout=10)
3. Adding specific headers to the HTTP request
To add a header, you need to use a Request object:
import urllib2
request = urllib2.Request(url)
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
Pay special attention to some headers, because servers check them:
User-Agent: some servers or proxies use this value to decide whether the request was made by a browser
Content-Type: when using a REST interface, the server checks this value to decide how to parse the content of the HTTP body. Common values are:
application/xml: used in XML RPC calls, such as RESTful/SOAP calls
application/json: used in JSON RPC calls
application/x-www-form-urlencoded: used when the browser submits a web form
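As an illustration of the Content-Type header, here is a minimal sketch that prepares a JSON POST request. The endpoint and payload are hypothetical, and nothing is sent; the import fallback is an assumption for Python 3.

```python
import json
try:
    import urllib2
except ImportError:  # Python 3
    import urllib.request as urllib2

# Hypothetical REST endpoint and payload, for illustration only.
payload = json.dumps({'name': 'demo', 'count': 3}).encode('utf-8')
request = urllib2.Request('http://example.com/api/items', data=payload)
request.add_header('Content-Type', 'application/json')
request.add_header('User-Agent', 'fake-client')

# Nothing is sent until the request is passed to urlopen() or opener.open().
print(request.get_method())
print(request.get_header('Content-type'))
```

Without the explicit header, urllib2 would label a request body as application/x-www-form-urlencoded, which a JSON API is likely to reject or misparse.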
4. Redirect
By default, urllib2 automatically follows redirects for HTTP 3xx return codes, with no manual configuration. To detect whether a redirect happened, just check whether the URL of the response and the URL of the request are the same.
import urllib2
response = urllib2.urlopen('http://www.g.cn')
redirected = response.geturl() == 'http://www.google.cn'
If you do not want automatic redirects, then besides using the lower-level httplib library, you can also subclass HTTPRedirectHandler.
import urllib2
class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        pass
opener = urllib2.build_opener(RedirectHandler)
opener.open('http://www.google.cn')
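An alternative to overriding http_error_301 and http_error_302 one by one is to override redirect_request(), which all the 3xx handler methods consult. This is a sketch under that assumption, with a made-up class name; it only builds the opener and does not contact any server. With no follow-up request available, the 3xx response should fall through to the default error handling and surface as an HTTPError whose code you can inspect.

```python
try:
    import urllib2
except ImportError:  # Python 3
    import urllib.request as urllib2


class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    """Refuse to follow any 3xx response."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "no follow-up request".
        return None


opener = urllib2.build_opener(NoRedirectHandler)

# build_opener() drops the default HTTPRedirectHandler when a subclass is
# supplied, so only the custom handler deals with redirects.
redirect_handlers = [h for h in opener.handlers
                     if isinstance(h, urllib2.HTTPRedirectHandler)]
print([h.__class__.__name__ for h in redirect_handlers])
```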
5. Cookies
urllib2 also handles cookies automatically. If you need to read the value of a cookie, do the following:
import urllib2
import cookielib
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.google.cn')
for item in cookie:
    print item.value
6. Using the HTTP PUT and DELETE methods
urllib2 only supports the HTTP GET and POST methods, so for HTTP PUT and DELETE you would normally have to use the lower-level httplib library. Even so, urllib2 can be made to send a PUT or DELETE request by overriding get_method on the Request object:
import urllib2
request = urllib2.Request(url, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)
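If the lambda assignment feels too ad hoc, the same trick can be sketched as a small Request subclass. The class name MethodRequest, its http_method parameter, and the example URLs are made up for illustration; the import fallback is an assumption for Python 3. No request is sent here.

```python
try:
    import urllib2
except ImportError:  # Python 3
    import urllib.request as urllib2


class MethodRequest(urllib2.Request):
    """Request whose HTTP verb can be chosen explicitly."""

    def __init__(self, url, data=None, headers=None, http_method=None):
        urllib2.Request.__init__(self, url, data, headers or {})
        self._http_method = http_method

    def get_method(self):
        # Fall back to the usual GET/POST choice when no verb was given.
        if self._http_method is not None:
            return self._http_method
        return urllib2.Request.get_method(self)


put_request = MethodRequest('http://example.com/items/1', data=b'{}',
                            http_method='PUT')
delete_request = MethodRequest('http://example.com/items/1',
                               http_method='DELETE')
print(put_request.get_method())      # PUT
print(delete_request.get_method())   # DELETE
```

Either object can then be passed to urllib2.urlopen() or opener.open() like an ordinary Request.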
7. Getting the HTTP return code
For 200 OK, the HTTP return code can be obtained by calling the getcode() method of the response object returned by urlopen(). For other return codes, urlopen() raises an exception, and you should check the code attribute of the exception object.
import urllib2
try:
    response = urllib2.urlopen('http://www.google.cn')
except urllib2.HTTPError, e:
    print e.code
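Both cases can be folded into one helper that always returns the status code. The sketch below tries it against a throwaway local server, so it does not need internet access. The helper name status_of and the DemoHandler server are made up for illustration, and the import fallback is an assumption for Python 3.

```python
import threading
try:
    import urllib2
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
except ImportError:  # Python 3
    import urllib.request as urllib2
    from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Throwaway local server: 200 for /, 404 for anything else."""

    def do_GET(self):
        self.send_response(200 if self.path == '/' else 404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def status_of(opener, url):
    """Return the HTTP status code whether or not urllib2 raises."""
    try:
        response = opener.open(url, timeout=5)
        code = response.getcode()
        response.close()
        return code
    except urllib2.HTTPError as e:
        return e.code


server = HTTPServer(('127.0.0.1', 0), DemoHandler)  # port 0: any free port
worker = threading.Thread(target=server.serve_forever)
worker.daemon = True
worker.start()

base = 'http://127.0.0.1:%d' % server.server_address[1]
opener = urllib2.build_opener(urllib2.ProxyHandler({}))  # bypass any proxy
codes = [status_of(opener, base + '/'), status_of(opener, base + '/missing')]
server.shutdown()
server.server_close()
print(codes)
```

The empty ProxyHandler keeps environment proxy settings from intercepting the request to 127.0.0.1.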
8. Debug log
When using urllib2, you can turn on the debug log as follows, so that the data sent and received is printed to the screen:
import urllib2
http_handler = urllib2.HTTPHandler(debuglevel=1)
https_handler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(http_handler, https_handler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.cn')
Python's urllib and urllib2: common steps and advanced usage