A simple introduction to urllib2 was given earlier; the following describes how to use urllib2 in more detail.
1. Proxy Settings
By default, urllib2 reads the environment variable http_proxy to determine the HTTP proxy.
If you want to control the proxy explicitly in your program, unaffected by environment variables, you can use ProxyHandler.
Create test14 to implement a simple proxy demo:
import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)
Note that urllib2.install_opener() sets the global opener for urllib2.
This makes later calls convenient, but it does not allow fine-grained control, for example when you want to use two different proxy settings within one program.
A better way is to call the opener's open method directly instead of the global urlopen method.
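For instance, two openers with different proxy settings can coexist without touching the global state. This is a minimal sketch: the proxy hosts and target URL are placeholders, and the Python 3 import fallback is only there so the snippet stays runnable for illustration.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 fallback, for illustration

# Two independent openers, each with its own proxy configuration.
proxy_a = urllib2.build_opener(
    urllib2.ProxyHandler({'http': 'http://proxy-a.example.com:8080'}))
proxy_b = urllib2.build_opener(
    urllib2.ProxyHandler({'http': 'http://proxy-b.example.com:8080'}))

# Each request goes through whichever opener you call;
# the global urllib2.urlopen is left untouched.
# response = proxy_a.open('http://www.example.com/')
# response = proxy_b.open('http://www.example.com/')
```

Because neither opener is installed globally, code elsewhere in the program that calls urllib2.urlopen keeps the default (no-proxy or environment-based) behavior.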
2. Timeout Settings
In older versions of Python (before Python 2.6), the urllib2 API did not expose a timeout setting. To set a timeout, you could only change the global timeout of the socket module.
import urllib2
import socket

socket.setdefaulttimeout(10)  # timeout in seconds
urllib2.socket.setdefaulttimeout(10)  # another way
After Python 2.6, timeout can be directly set through the timeout parameter of urllib2.urlopen.
import urllib2

response = urllib2.urlopen('http://www.google.com', timeout=10)
3. Add a specific header to the HTTP request
To add a header, you need to use a Request object:
import urllib2

request = urllib2.Request('http://www.baidu.com/')
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
print response.read()
Pay special attention to the following headers, which servers commonly check:
User-Agent: some servers or proxies use this value to decide whether the request was sent by a browser.
Content-Type: for REST interfaces, the server checks this value to decide how to parse the content of the HTTP body. Common values include:
application/xml: used in XML RPC, e.g. RESTful/SOAP calls
application/json: used in JSON RPC calls
application/x-www-form-urlencoded: used when a browser submits a web form
When calling a RESTful or SOAP service provided by a server, a wrong Content-Type setting may cause the server to refuse the request.
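A minimal sketch of attaching a Content-Type to a request (the URL and payload are placeholders, and the Python 3 import fallback is only for illustration):

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 fallback, for illustration

import json

# Placeholder endpoint and body for a hypothetical JSON API.
payload = json.dumps({'name': 'test'}).encode('utf-8')
request = urllib2.Request('http://api.example.com/items', data=payload)
request.add_header('Content-Type', 'application/json')

# The header is attached to the request object (nothing is sent yet);
# note that urllib2 stores header names in capitalized form.
print(request.get_header('Content-type'))
```

Since data is supplied, urllib2 would send this as a POST; the server then relies on the Content-Type header to parse the JSON body.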
4. Redirect
By default, urllib2 automatically follows HTTP 3xx redirect codes; no manual configuration is needed. To check whether a redirect happened, simply compare the response URL with the request URL.
import urllib2

my_url = 'http://www.google.cn'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url  # True if a redirect occurred
print redirected

my_url = 'http://rrurl.cn/b1UZuP'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url
print redirected
If you do not want automatic redirects, besides using the lower-level httplib library, you can also customize an HTTPRedirectHandler class.
import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        print "301"
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        print "302"
        pass

opener = urllib2.build_opener(RedirectHandler)
opener.open('http://rrurl.cn/b1UZuP')
5. Cookie
urllib2 handles cookies automatically. To get the value of a particular cookie item, you can do this:
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
After running this, the cookie values set while visiting Baidu are printed.
6. Use the HTTP PUT and DELETE methods
urllib2 supports only the HTTP GET and POST methods. To use HTTP PUT and DELETE, you would normally have to turn to the lower-level httplib library. Even so, we can make urllib2 issue a PUT or DELETE request in the following way:
import urllib2

request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)
7. Get the HTTP return code
For 200 OK, you only need the getcode() method of the response object returned by urlopen to obtain the HTTP return code. For other return codes, however, urlopen raises an exception, and you need to check the code attribute of the exception object:
import urllib2

try:
    response = urllib2.urlopen('http://bbs.csdn.net/why')
except urllib2.HTTPError, e:
    print e.code
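To make the code attribute concrete, here is a self-contained sketch that constructs an HTTPError by hand instead of fetching anything; the URL and status code are made up, and the Python 3 import fallback is only for illustration.

```python
try:
    import urllib2  # Python 2
    HTTPError = urllib2.HTTPError
except ImportError:
    from urllib.error import HTTPError  # Python 3 fallback, for illustration

# Build the exception urlopen would raise for a 404 response,
# then read the same attribute the except-block above reads.
e = HTTPError('http://www.example.com/missing', 404, 'Not Found', {}, None)
print(e.code)
```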
8. Debug Log
When using urllib2, you can turn on the debug log as follows, so that the contents of packets sent and received are printed to the screen, which is convenient for debugging and sometimes saves you the work of packet capture.
import urllib2

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')
In this way, we can see the content of the transmitted packets.
9. Form Processing
What about logging in, where a form has to be filled in?
First, use a tool to capture the content of the form you need to submit.
For example, I usually use Firefox with the HttpFox plug-in to see what packets I actually sent.
Take VeryCD as an example: first find the POST request you send and its form items.
You can see that VeryCD requires username, password, continueURI, fk, and login_submit. fk is generated randomly (actually it is not that random; it looks like it is produced by simply encoding the epoch time), so you must first fetch the page and use regular expressions or similar tools to extract the fk item from the returned data. continueURI, as the name suggests, can be anything, and login_submit is fixed, as the page source shows. username and password are obvious:
# -*- coding: utf-8 -*-
import urllib
import urllib2

postdata = urllib.urlencode({
    'username': 'wangxiaoguang',
    'password': 'why888',
    'continueURI': 'http://www.verycd.com/',
    'fk': '',
    'login_submit': 'login'
})
req = urllib2.Request(
    url='http://secure.verycd.com/signin',
    data=postdata
)
result = urllib2.urlopen(req)
print result.read()
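The fk item above is left empty; as described, it first has to be extracted from the page. A minimal sketch, assuming the field appears as a hidden form input — both the HTML snippet and the regular expression here are hypothetical:

```python
import re

# Hypothetical page fragment containing the hidden fk field.
html = '<input type="hidden" name="fk" value="1355724123" />'

# Pull the value attribute out of the fk input (pattern is an assumption
# about the page markup, not the real VeryCD source).
match = re.search(r'name="fk"\s+value="([^"]*)"', html)
fk = match.group(1) if match else ''
print(fk)
```

The extracted fk string would then be placed into the postdata dict before sending the login request.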
10. Disguised as browser access
Some websites dislike crawler visits, so they reject requests from crawlers.
In this case, we need to pretend to be a browser, which can be achieved by modifying the header in the HTTP packet.
#...
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url='http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata,
    headers=headers
)
#...
11. Deal with "anti-leeching"
Some sites have so-called anti-leeching settings. Simply put, the server checks whether the Referer in the headers of the request you sent points to the site itself, so all we need to do is set the Referer in headers to that website. Take cnbeta as an example:
#...
headers = {
    'Referer': 'http://www.cnbeta.com/articles'
}
#...
headers is a dict data structure; you can put in any header you want for disguise.
For example, some websites like to read X-Forwarded-For from the headers to find the client's real IP address, so you can change X-Forwarded-For directly.
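A minimal sketch of attaching both headers to a request (the Referer, address, and URL are placeholders; the Python 3 import fallback is only for illustration):

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 fallback, for illustration

headers = {
    'Referer': 'http://www.cnbeta.com/articles',
    'X-Forwarded-For': '1.2.3.4',  # placeholder address
}
req = urllib2.Request('http://www.example.com/', headers=headers)

# Headers are stored on the request in capitalized form (nothing sent yet).
print(req.get_header('X-forwarded-for'))
```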