1. Opener and handler concepts of urllib2
1.1 Openers:
When you fetch a URL you use an opener (an instance of urllib2.OpenerDirector). Normally we use the default opener via urlopen, but you can also create custom openers with build_opener. This is useful for applications that need to process cookies or that do not want redirection: you create your own opener when you want to fetch URLs with specific handlers installed, for example an opener that handles cookies, or an opener that does not follow redirects.
The following snippet shows how a handler and an opener are used when logging in through a proxy IP (which requires cookie handling):
self.proxy = urllib2.ProxyHandler({'http': self.proxy_url})
self.cookie = cookielib.LWPCookieJar()
self.cookie_handler = urllib2.HTTPCookieProcessor(self.cookie)
self.opener = urllib2.build_opener(self.cookie_handler, self.proxy, urllib2.HTTPHandler)
1.2 Handlers:
Openers use handlers; all the "heavy" work is done by handlers. Each handler knows how to open URLs through a particular protocol, or how to handle a particular aspect of opening a URL, such as HTTP redirection or HTTP cookies.
More information on openers and handlers: http://www.voidspace.org.uk/python/articles/urllib2.shtml#openers-and-handlers
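To make the opener/handler relationship concrete, here is a minimal sketch (my own, not from the original article) that builds an opener with a cookie handler and lists the handlers installed in it. The try/except import is an assumption added only so the snippet runs on Python 3 as well, where urllib2 was merged into urllib.request:

```python
try:  # Python 2, as used throughout this article
    import urllib2 as urlreq
    import cookielib as cookiejar_mod
except ImportError:  # Python 3: urllib2 became urllib.request
    import urllib.request as urlreq
    import http.cookiejar as cookiejar_mod

# A CookieJar holds the cookies; HTTPCookieProcessor is the handler
# that reads and writes them on each request/response.
jar = cookiejar_mod.CookieJar()
cookie_handler = urlreq.HTTPCookieProcessor(jar)

# build_opener returns an OpenerDirector with our handler installed
# alongside the defaults (HTTPHandler, HTTPRedirectHandler, ...).
opener = urlreq.build_opener(cookie_handler)

handler_names = [type(h).__name__ for h in opener.handlers]
print(sorted(handler_names))
```

Note that build_opener keeps the default handlers, so adding a cookie handler does not remove redirect or HTTP support.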
2. urllib2 usage tips
2.1 Creating an opener with a proxy IP
Note: currently urllib2 does not support fetching HTTPS locations through a proxy, which can be a problem. See http://www.voidspace.org.uk/python/articles/urllib2.shtml#proxies
import urllib2
proxy_handler = urllib2.ProxyHandler({'http': '54.186.78.110:3128'})  # make sure the proxy IP is still usable; this example IP is in the US
opener = urllib2.build_opener(proxy_handler)
request = urllib2.Request(url, post_data, login_headers)  # this example also submits post_data and header information
response = opener.open(request)
print response.read().decode('utf-8')
2.2 Setting timeouts with the timeout parameter
import urllib2
response = urllib2.urlopen('http://www.google.com', timeout=10)
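Besides the per-call timeout parameter, the socket module offers a process-wide default timeout that urllib2 falls back to when no timeout is passed. This is a small offline-runnable sketch of that alternative (my addition, not from the original article); note the setting affects every new socket in the process, so use it with care:

```python
import socket

# Set a global default timeout (in seconds) for all new sockets.
# urllib2.urlopen uses this default when no timeout argument is given.
socket.setdefaulttimeout(10)
print(socket.getdefaulttimeout())

# Pass socket.setdefaulttimeout(None) to restore blocking behaviour.
```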
2.3 Masquerading as a browser
Some web servers check the headers of incoming requests, so when visiting certain sites you will see exceptions such as HTTPError: HTTP Error 403: Forbidden. This is because many sites now block crawlers, which put extra load on the server. One difference between a crawler's HTTP request and a browser's is that when a user sends a request, the browser's version information is included in the HTTP request message, while a bare crawler sends no such header. This information travels in the HTTP User-Agent field; when the server receives a page request without it, it cannot tell which browser, operating system, or hardware platform sent the request, and may treat the request as abnormal access. You can inspect a real browser's request headers with a tool such as Fiddler. To resolve this, pass a headers dictionary via urllib2's Request.
The following example submits User-Agent information in the headers, masquerading as a browser when sending the request. Viewing the User-Agent is easy: in Chrome, press F12 to open the developer tools and inspect the request headers under the Network tab.
To defeat anti-hotlinking measures, some sites check whether the Referer header points to the site itself, so you can set Referer in the same headers dictionary.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    'Referer': 'https://login.taobao.com/member/login.jhtml?redirectURL=https%3A%2F%2Fwww.taobao.com%2F'
}
request = urllib2.Request(
    url="https://login.taobao.com/member/login.jhtml?redirectURL=https%3A%2F%2Fwww.taobao.com%2F",
    data=postdata,
    headers=headers
)
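Building a Request does not touch the network, so you can verify your headers offline before sending anything. This sketch (my own; the example.com URL and header values are placeholders) also shows a quirk worth knowing: Request normalizes header names by capitalizing only the first letter, so you read back 'User-Agent' under the key 'User-agent'. The compatibility import is an assumption so the snippet runs on Python 3 too:

```python
try:  # Python 2
    import urllib2 as urlreq
except ImportError:  # Python 3
    import urllib.request as urlreq

headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'http://example.com/'}
req = urlreq.Request('http://example.com/page', headers=headers)

# Header keys are stored capitalized: 'User-Agent' becomes 'User-agent'.
print(req.get_header('User-agent'))
print(req.has_header('Referer'))
```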
More information about HTTP headers: http://rlog.cn/?p=521
2.4 Use of cookies
Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track the session. For example, some sites require you to log in before certain pages can be accessed, so scraping those pages before logging in is not possible. We can use the urllib2 library to save the cookies from our login and then crawl the other pages with them.
An example of using cookies is shown below.
import urllib2
import cookielib
# declare a CookieJar instance to hold the cookies
cookie = cookielib.CookieJar()
# use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# build an opener with the handler
opener = urllib2.build_opener(handler)
# the open method here works like urllib2's urlopen, and can also take a Request
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
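For the login-reuse scenario described above, an LWPCookieJar (the class used in the earlier proxy-login snippet) can persist cookies to disk between runs, so you only log in once. A minimal offline sketch of the save/load round trip (my addition; the temp-file path and the Python 3 fallback import are assumptions made so it runs anywhere):

```python
import os
import tempfile

try:  # Python 2
    import cookielib as cookiejar_mod
except ImportError:  # Python 3
    import http.cookiejar as cookiejar_mod

# LWPCookieJar can serialize its cookies in the libwww-perl format.
path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')
jar = cookiejar_mod.LWPCookieJar(path)

jar.save()  # write the jar (empty here; after a real login it holds the session)
jar.load()  # in a later run, this restores the saved session cookies
print(os.path.exists(path))
```

Pass ignore_discard=True to save() if you also want to keep session cookies that the server marked as discardable.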
2.5 Return codes of urllib2.urlopen
If no exception is raised, you can get the status code with the getcode() method; for failed requests urlopen raises an exception, so you need to handle it:
import urllib2
try:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    print response.read().decode('utf-8')
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
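The hasattr checks work because HTTPError (raised for error status codes like 403 or 404) is a subclass of URLError and carries a numeric code attribute, while a plain URLError (e.g. a DNS failure) may only have reason; on some older Pythons HTTPError lacked reason entirely, which is exactly why the guards are used. This offline sketch (my addition; the URL is a placeholder and the HTTPError is constructed by hand purely to show its attributes) illustrates the relationship. The compatibility import is an assumption for running under Python 3:

```python
try:  # Python 2
    import urllib2 as urlreq
except ImportError:  # Python 3: urllib.request re-exports URLError/HTTPError
    import urllib.request as urlreq

# HTTPError is a URLError subclass, so one except clause catches both;
# only HTTPError instances are guaranteed to carry the status in .code.
print(issubclass(urlreq.HTTPError, urlreq.URLError))

# Signature: HTTPError(url, code, msg, hdrs, fp)
err = urlreq.HTTPError('http://example.com/missing', 404, 'Not Found', None, None)
print(err.code, err.reason)
```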
Reference Links:
http://blog.csdn.net/pleasecallmewhy/article/details/8925978
Original address: http://www.cnblogs.com/wuwenyan/p/4749018.html
"Python Crawler Learning Notes (1)": a summary of knowledge points about the urllib2 library