"Python Crawler Learning Notes (1)" Summary of URLLIB2 library related knowledge points


1. Opener and handler concepts of urllib2

1.1 Openers

When you fetch a URL you use an opener (an instance of urllib2.OpenerDirector). Normally we use the default opener through urlopen, but you can also create customized openers with build_opener. This is mainly useful for applications that need to handle cookies or that do not want redirections (as the urllib2 documentation puts it: you will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections).

The following snippet shows how a handler and an opener are used for a simulated login through a proxy IP (which requires handling cookies).

self.proxy = urllib2.ProxyHandler({'http': self.proxy_url})
self.cookie = cookielib.LWPCookieJar()
self.cookie_handler = urllib2.HTTPCookieProcessor(self.cookie)
self.opener = urllib2.build_opener(self.cookie_handler, self.proxy, urllib2.HTTPHandler)
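
For reference, here is the same pattern written outside a class. This is only a minimal sketch: the proxy address and URL are placeholders, not values from the original article. urllib2.install_opener() additionally makes the custom opener the default one used by urllib2.urlopen().

import urllib2
import cookielib

cookie = cookielib.LWPCookieJar()
cookie_handler = urllib2.HTTPCookieProcessor(cookie)
proxy_handler = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})   # placeholder proxy address
opener = urllib2.build_opener(cookie_handler, proxy_handler, urllib2.HTTPHandler)

# install_opener makes this opener the global default, so plain urllib2.urlopen() will use it
urllib2.install_opener(opener)
response = urllib2.urlopen('http://example.com/login')             # placeholder URL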
1.2 Handlers

Openers use handlers; all the "heavy" work is done by handlers. Each handler knows how to open URLs for a particular protocol, or how to handle a particular aspect of opening a URL, such as HTTP redirections or HTTP cookies.

More information on openers and handlers: http://www.voidspace.org.uk/python/articles/urllib2.shtml#openers-and-handlers
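
As an illustration of the second case mentioned above (an opener that does not handle redirections), the default HTTPRedirectHandler can be overridden so that 3xx responses are not followed. This is only a sketch; the NoRedirectHandler name and the URL are placeholders, not part of the original article.

import urllib2

class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    # returning None tells urllib2 not to follow the redirect,
    # so the 3xx response surfaces as an HTTPError instead
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib2.build_opener(NoRedirectHandler())
try:
    response = opener.open('http://example.com/old-page')   # placeholder URL
    print response.getcode()
except urllib2.HTTPError, e:
    print 'redirect suppressed, status:', e.code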

2. urllib2 usage tips

2.1 Creating an opener with a proxy IP

Note: currently urllib2 does not support fetching HTTPS locations through a proxy, which can be a problem. http://www.voidspace.org.uk/python/articles/urllib2.shtml#proxies

import urllib2
proxy_handler = urllib2.ProxyHandler({'http': '54.186.78.110:3128'})   # make sure the proxy IP is actually usable; the example IP is in the US
opener = urllib2.build_opener(proxy_handler)
request = urllib2.Request(url, post_data, login_headers)               # this example also submits post_data and header information
response = opener.open(request)
print response.read().decode('utf-8')
2.2 Setting timeouts with the timeout parameter
import urllib2
response = urllib2.urlopen('http://www.google.com', timeout=10)
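
When the timeout expires, the request fails with an exception rather than hanging. A hedged sketch of handling it (the exact exception depends on where the timeout happens):

import socket
import urllib2

try:
    response = urllib2.urlopen('http://www.google.com', timeout=10)
    print response.read()
except urllib2.URLError, e:
    # a connection timeout is usually reported as URLError, with e.reason being a socket.timeout
    print 'request failed:', e.reason
except socket.timeout:
    # a timeout while reading an already-opened response can surface as socket.timeout directly
    print 'request timed out while reading'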
2.3 Disguising as a Browser

Some web servers check the header information of the request. When visiting certain sites you may see an exception such as HTTPError: HTTP Error 403: Forbidden. This happens because some sites now forbid crawler access, since crawlers put extra load on the server. One difference between a crawler's HTTP request and a browser's is that a browser includes its version information in the request message, in the User-Agent field of the HTTP protocol, while a naive crawler sends no such header. When the server receives a page request and cannot tell which browser, operating system or hardware platform it came from, it may treat the request as abnormal access. You can inspect the request information a real browser sends with a tool such as Fiddler, and you can pass your own headers through urllib2's Request class to get around the restriction.

The following example submits the User-Agent information in the headers, thereby masquerading as a browser when sending the request. Finding a User-Agent string is easy: in Chrome, press F12 to open the developer tools and look at the request headers under the Network tab.

Against "anti-hotlinking", some sites will check the header of the Referer is not the site itself, you can set the header when set.

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    'Referer': 'https://login.taobao.com/member/login.jhtml?redirectURL=https%3A%2F%2Fwww.taobao.com%2F'
}
request = urllib2.Request(
    url="https://login.taobao.com/member/login.jhtml?redirectURL=https%3A%2F%2Fwww.taobao.com%2F",
    data=postdata,
    headers=headers
)
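
Instead of passing a headers dict to the Request constructor, each header can also be set individually with Request.add_header(). A small equivalent sketch; the URL below is just a placeholder:

import urllib2

request = urllib2.Request('http://example.com/page')   # placeholder URL
request.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6')
request.add_header('Referer', 'http://example.com/')
response = urllib2.urlopen(request)
print response.getcode()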

More information about HTTP headers: http://rlog.cn/?p=521

2.4 Use of cookies

Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track the session. For example, some sites require you to log in before a certain page can be accessed; before logging in, crawling that page is not allowed. We can use the urllib2 library to keep the cookies obtained after logging in, and then crawl the other pages to achieve the goal.

An example of using cookies is shown below.

import urllib2
import cookielib
# declare a CookieJar object instance to hold the cookies
cookie = cookielib.CookieJar()
# use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# build an opener with the handler
opener = urllib2.build_opener(handler)
# the open method here works the same as urllib2's urlopen, and can also be passed a Request
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
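
If the cookies need to survive between runs (for example, to reuse a login in a later crawl), a file-backed jar such as cookielib.LWPCookieJar, the class used in section 1.1, can save them to disk and load them back. A sketch only; cookies.txt is an assumed filename:

import urllib2
import cookielib

filename = 'cookies.txt'                          # assumed filename for storing cookies
cookie = cookielib.LWPCookieJar(filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
# keep session cookies and expired cookies too, so they survive between runs
cookie.save(ignore_discard=True, ignore_expires=True)

# in a later run, load the saved cookies back before building the opener again
cookie2 = cookielib.LWPCookieJar()
cookie2.load(filename, ignore_discard=True, ignore_expires=True)
opener2 = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie2))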
2.5 Return codes of urllib2.urlopen

When no exception is thrown, you can use the getcode() method of the response to get the status code; when the request fails, you need to handle the exception.

import urllib2
try:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    print response.read().decode('utf-8')
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
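
Since HTTPError is a subclass of URLError, the two can also be caught separately to tell an HTTP status error apart from a connection failure; on success, response.getcode() returns the status code (200 for a normal page). A minimal sketch:

import urllib2

try:
    response = urllib2.urlopen('http://www.baidu.com')
    print response.getcode()              # 200 when the request succeeds
except urllib2.HTTPError, e:              # HTTP status errors such as 403 or 404
    print 'HTTPError:', e.code
except urllib2.URLError, e:               # connection problems, DNS failures, etc.
    print 'URLError:', e.reason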

Reference Links:

http://blog.csdn.net/pleasecallmewhy/article/details/8925978

  

Original address: http://www.cnblogs.com/wuwenyan/p/4749018.html

"Python Crawler Learning Notes (1)" Summary of URLLIB2 library related knowledge points

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.