Basic usage of the Python urllib2 package
1. urllib2.urlopen(request)
import urllib
import urllib2
url = "http://www.baidu.com"  # the URL can also use another protocol, such as ftp
values = {'name': 'Michael Foord', 'location': 'northampt', 'language': 'python'}
data = urllib.urlencode(values)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
request = urllib2.Request(url, data, headers)
# You can also set the header afterwards:
# request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
html = response.read()
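To see what the request above would carry without contacting the server, the Request object can be inspected before it is passed to urlopen(). This is only a minimal sketch: the try/except import is an assumption added so that it also runs on Python 3, where urllib2 became urllib.request, and the shortened values dictionary is illustrative.

```python
try:  # Python 2, as used in this article
    import urllib2
    from urllib import urlencode
except ImportError:  # Python 3 fallback (an assumption, not part of the article)
    import urllib.request as urllib2
    from urllib.parse import urlencode

values = {'name': 'Michael Foord', 'language': 'python'}
data = urlencode(values).encode('ascii')  # form data must be bytes on Python 3
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

request = urllib2.Request("http://www.baidu.com", data, headers)

# Nothing is sent until urllib2.urlopen(request) is called.
method = request.get_method()             # POST, because data was supplied
agent = request.get_header('User-agent')  # header names are stored capitalized
```

Supplying data is what switches the request from GET to POST; leaving it out gives a plain GET.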
The urlopen() method is the most basic way to open a URL with urllib2. It takes a request argument, which can be a URL string or a Request object. A Request object can carry a URL, data (sent to the server, such as ordinary form data), and headers (some servers reject robot requests that do not send the expected headers). The retrieved page must be read with the read() method of the response object; otherwise you only get the response object itself, which prints as a memory address.
2. Create an Opener object to implement cookies and other HTTP functions
2.1 Cookie processing
The urlopen() function does not support authentication, cookies, or other advanced HTTP features. To use them, create your own custom opener object with the build_opener() function.
To manage HTTP cookies, create an opener object that has an HTTPCookieProcessor handler added. By default, HTTPCookieProcessor uses a CookieJar object. Passing a different type of CookieJar object to HTTPCookieProcessor gives different cookie handling.
import cookielib
import urllib2
mcj = cookielib.MozillaCookieJar("cookies.txt")
cookiehand = urllib2.HTTPCookieProcessor(mcj)
opener = urllib2.build_opener(cookiehand)
u = opener.open("http://www.baidu.com")
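The same pattern can be wrapped in a small helper that accepts any kind of cookie jar. The helper name build_cookie_opener is hypothetical, and the Python 3 import fallback is an assumption so that the sketch runs on either version; no request is made here.

```python
try:  # Python 2, as used in this article
    import urllib2
    import cookielib
except ImportError:  # Python 3 fallback (an assumption, not part of the article)
    import urllib.request as urllib2
    import http.cookiejar as cookielib


def build_cookie_opener(jar=None):
    """Return (opener, jar); the opener stores received cookies in jar."""
    if jar is None:
        jar = cookielib.CookieJar()  # in-memory jar, nothing written to disk
    processor = urllib2.HTTPCookieProcessor(jar)
    return urllib2.build_opener(processor), jar


opener, jar = build_cookie_opener()
cookie_count = len(jar)  # stays 0 until opener.open(...) receives cookies
```

After a call such as opener.open(url), iterating over jar yields the cookies the server set, and a file-backed jar such as MozillaCookieJar can then write them out with its save() method.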
2.2 Authentication
import urllib2
# username and password must already hold your own credentials
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
top_level_url = "http://www.163.com/"
password_mgr.add_password(None, top_level_url, username, password)
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)
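A quick way to check which URLs the password manager will supply credentials for is to query it directly. In this sketch the helper name build_auth_opener and the credentials "alice" / "secret" are made up for illustration, and the Python 3 import fallback is an assumption.

```python
try:  # Python 2, as used in this article
    import urllib2
except ImportError:  # Python 3 fallback (an assumption, not part of the article)
    import urllib.request as urllib2


def build_auth_opener(top_level_url, username, password):
    """Return (opener, password_mgr) using HTTP basic authentication."""
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    # A realm of None means "use these credentials for any realm".
    password_mgr.add_password(None, top_level_url, username, password)
    handler = urllib2.HTTPBasicAuthHandler(password_mgr)
    return urllib2.build_opener(handler), password_mgr


opener, password_mgr = build_auth_opener("http://www.163.com/", "alice", "secret")

# Credentials apply to URLs below the registered top-level URL only.
inside = password_mgr.find_user_password(None, "http://www.163.com/news/index.html")
outside = password_mgr.find_user_password(None, "http://www.example.com/")
```

Returning the opener instead of installing it globally keeps the credentials scoped to the code that calls opener.open().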
2.3 Proxy
By default, urllib2 automatically detects proxy settings and uses the environment variable http_proxy to set the HTTP proxy. If you want to control the proxy explicitly in the program, without being affected by environment variables, create a ProxyHandler instance and pass it to build_opener().
import urllib2
enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})
if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)
urllib2.install_opener(opener)
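When a program needs more than one proxy configuration, one possible arrangement is to keep a separate opener per configuration rather than installing a single global one. The helper name build_proxy_opener is hypothetical, and the Python 3 import fallback is an assumption; nothing is fetched in this sketch.

```python
try:  # Python 2, as used in this article
    import urllib2
except ImportError:  # Python 3 fallback (an assumption, not part of the article)
    import urllib.request as urllib2


def build_proxy_opener(proxies):
    """Return (opener, handler) for an explicit proxy mapping.

    An empty mapping means "no proxy", regardless of environment variables.
    """
    handler = urllib2.ProxyHandler(proxies)
    return urllib2.build_opener(handler), handler


proxied_opener, proxied_handler = build_proxy_opener(
    {"http": "http://some-proxy.com:8080"})
direct_opener, direct_handler = build_proxy_opener({})

# Each opener is used through its own open() method, for example:
#     proxied_opener.open(url)   # goes through some-proxy.com
#     direct_opener.open(url)    # connects directly
```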
Note that urllib2.install_opener() sets the global opener of urllib2. This is convenient for later use, but it does not allow finer-grained control, for example when you want to use two different proxy settings in one program. A better approach is to call the opener's open() method directly instead of the global urlopen() function.
2.4 Timeout
In old versions of Python, the urllib2 API does not expose a timeout setting. To set a timeout, you can only change the global timeout value of the socket module.
import urllib2
import socket
socket.setdefaulttimeout(10)  # time out after 10 seconds
urllib2.socket.setdefaulttimeout(10)  # another way to do the same thing
Since Python 2.6, the timeout can be set directly through the timeout parameter of urllib2.urlopen().
import urllib2
response = urllib2.urlopen('http://www.google.com', timeout=10)
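Because the socket-level timeout is global, changing it affects every socket created afterwards. One way to limit that side effect, sketched here with a hypothetical helper named default_timeout, is to restore the previous value once the work is done.

```python
import contextlib
import socket


@contextlib.contextmanager
def default_timeout(seconds):
    """Temporarily set the global socket timeout, then restore the old value."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        socket.setdefaulttimeout(previous)


before = socket.getdefaulttimeout()
with default_timeout(10):
    # Calls such as urllib2.urlopen(url) made here use the 10 second timeout.
    during = socket.getdefaulttimeout()
after = socket.getdefaulttimeout()
```

Where the timeout parameter of urlopen() is available, it is the simpler choice because it applies to a single call only.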
2.5 Setting headers
In the basic usage of urlopen(), headers are passed when the request is created: request = urllib2.Request(url, data, headers). They can also be set after the Request object has been created:
import urllib2
# uri holds the address to request
request = urllib2.Request(uri)
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
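As a sketch of setting several headers on one request, the hypothetical helper json_request below builds a request with a JSON body and returns it without sending it. The example.com address and the payload are placeholders, and the Python 3 import fallback is an assumption.

```python
import json

try:  # Python 2, as used in this article
    import urllib2
except ImportError:  # Python 3 fallback (an assumption, not part of the article)
    import urllib.request as urllib2


def json_request(uri, payload, user_agent='fake-client'):
    """Build (but do not send) a request whose body is JSON."""
    body = json.dumps(payload).encode('utf-8')
    request = urllib2.Request(uri, data=body)
    request.add_header('User-Agent', user_agent)
    request.add_header('Content-Type', 'application/json')
    return request


request = json_request('http://www.example.com/api', {'q': 'python'})

# header_items() lists what would be sent; names are stored capitalized.
sent_headers = dict(request.header_items())
```

Passing the returned object to urllib2.urlopen() would send it as a POST, since a body is attached.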
Pay special attention to some headers, because the server checks them:
User-Agent: some servers or proxies check this value to decide whether the request was initiated by a browser.
Content-Type: when a REST interface is used, the server checks this value to decide how to parse the content of the HTTP body. Common values include:
application/xml: used for XML RPC calls, such as RESTful/SOAP calls
application/json: used for JSON RPC calls
application/x-www-form-urlencoded: used when a browser submits a web form
When you call a RESTful or SOAP service provided by a server, a wrong Content-Type setting causes the server to refuse the request.
2.6 Redirect
By default, urllib2 automatically follows redirects for 3xx HTTP return codes, without any manual configuration. To check whether a redirect has occurred, compare the response URL with the request URL.
import urllib2
response = urllib2.urlopen('http://www.google.cn')
# a redirect occurred if the final URL differs from the requested one
whether_redirected = response.geturl() != 'http://www.google.cn'
If you do not want automatic redirects, you can either use the lower-level httplib library or define a custom HTTPRedirectHandler subclass.
import urllib2
class RedirectHandler(urllib2.HTTPRedirectHandler):
    # Doing nothing here means 301 and 302 responses are not followed
    def http_error_301(self, req, fp, code, msg, headers):
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        pass
opener = urllib2.build_opener(RedirectHandler)
opener.open('http://www.google.cn')
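A variation on the handler above, offered only as a sketch, overrides redirect_request() instead, so that one method covers every redirect code and the refused target is recorded. The class name NoRedirectHandler and its refused list are hypothetical, and the Python 3 import fallback is an assumption. When redirect_request() returns None, the 3xx response is expected to surface as an HTTPError from opener.open().

```python
try:  # Python 2, as used in this article
    import urllib2
except ImportError:  # Python 3 fallback (an assumption, not part of the article)
    import urllib.request as urllib2


class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    """Refuse every redirect and remember where the server pointed."""

    def __init__(self):
        self.refused = []  # list of (status code, target URL)

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.refused.append((code, newurl))
        return None  # no follow-up request is created


handler = NoRedirectHandler()
opener = urllib2.build_opener(handler)

# Calling the hook directly shows its behaviour without any network access.
request = urllib2.Request('http://www.google.cn')
result = handler.redirect_request(
    request, None, 302, 'Found', {}, 'http://www.google.cn/new')
```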
2.7 Using the HTTP PUT and DELETE methods
urllib2 only supports the HTTP GET and POST methods directly. To use PUT or DELETE, you would normally have to use the lower-level httplib library. Even so, the following approach makes urllib2 send an HTTP PUT or DELETE request:
import urllib2
# uri and data hold the target address and the request body
request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)
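The same trick can be packaged as a small Request subclass so the method is chosen when the request is created. The class name MethodRequest and the example.com URL are hypothetical, and the Python 3 import fallback is an assumption; this is a sketch rather than anything urllib2 itself provides.

```python
try:  # Python 2, as used in this article
    import urllib2
except ImportError:  # Python 3 fallback (an assumption, not part of the article)
    import urllib.request as urllib2


class MethodRequest(urllib2.Request):
    """A Request whose HTTP method can be forced, for example to PUT or DELETE."""

    def __init__(self, url, data=None, headers=None, method=None):
        # Explicit base-class call: Request is an old-style class on Python 2.
        urllib2.Request.__init__(self, url, data, headers or {})
        self._forced_method = method

    def get_method(self):
        if self._forced_method is not None:
            return self._forced_method
        # Fall back to the default: POST when data is present, else GET.
        return urllib2.Request.get_method(self)


put_request = MethodRequest('http://www.example.com/item/1', b'x=1', method='PUT')
delete_request = MethodRequest('http://www.example.com/item/1', method='DELETE')
```

Either object can then be passed to urllib2.urlopen() or to an opener's open() method like any other request.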
Although this approach is a hack, it works without problems in practice.
2.8 Getting the HTTP return code
For 200 OK, you only need the getcode() method of the response object returned by urlopen() to get the HTTP return code. For error return codes, however, urlopen() raises an exception, and you need to check the code attribute of the exception object:
import urllib2
try:
    response = urllib2.urlopen('http://restrict.web.com')
except urllib2.HTTPError, e:
    print e.code
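Both cases can be folded into one helper that always returns the status code. The name fetch_status and its open_func parameter are hypothetical; the parameter exists only so the helper can be exercised without network access. The Python 3 import fallback is an assumption, and the `except ... as e` spelling is used because it works on Python 2.6 and later as well as Python 3.

```python
try:  # Python 2, as used in this article
    import urllib2
except ImportError:  # Python 3 fallback (an assumption, not part of the article)
    import urllib.request as urllib2


def fetch_status(url, open_func=None):
    """Return the HTTP status code for url, whether or not it is an error."""
    if open_func is None:
        open_func = urllib2.urlopen
    try:
        response = open_func(url)
    except urllib2.HTTPError as e:
        return e.code          # error responses arrive as exceptions
    return response.getcode()  # successful responses carry the code directly
```

Network failures that are not HTTP errors, such as an unresolved host name, still propagate as URLError and are deliberately not caught here.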
2.9 Debug log
When using urllib2, you can turn on the debug log as follows. The contents of the packets sent and received are then printed to the screen, which makes debugging easier and can sometimes save you the effort of capturing packets.
import urllib2
httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')
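If debug output is wanted for a few requests only, one option is to keep the debugging opener separate instead of installing it globally. The helper name build_debug_opener is hypothetical, and the Python 3 import fallback is an assumption; the sketch builds the opener but does not fetch anything.

```python
try:  # Python 2, as used in this article
    import urllib2
except ImportError:  # Python 3 fallback (an assumption, not part of the article)
    import urllib.request as urllib2


def build_debug_opener(level=1):
    """Return (opener, handlers) that print HTTP traffic for debugging."""
    handlers = [urllib2.HTTPHandler(debuglevel=level)]
    # HTTPSHandler is missing when Python was built without SSL support.
    if hasattr(urllib2, 'HTTPSHandler'):
        handlers.append(urllib2.HTTPSHandler(debuglevel=level))
    return urllib2.build_opener(*handlers), handlers


debug_opener, debug_handlers = build_debug_opener()
# debug_opener.open(url) prints the traffic; plain urllib2.urlopen stays quiet.
```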