1. Opener and handler concepts of urllib2
1.1 Openers:
When you fetch a URL you use an opener (an instance of urllib2.OpenerDirector). Normally we use the default opener via urlopen, but you can also create custom openers with build_opener. This is useful for applications that need to process cookies or that do not want redirection: you create your own opener when you want to fetch URLs with specific handlers installed, for example an opener that handles cookies, or an opener that does not follow redirects.
The following snippet shows how a handler and an opener are used when logging in through a proxy IP (which requires cookie handling):
self.proxy = urllib2.ProxyHandler({'http': self.proxy_url})
self.cookie = cookielib.LWPCookieJar()
self.cookie_handler = urllib2.HTTPCookieProcessor(self.cookie)
self.opener = urllib2.build_opener(self.cookie_handler, self.proxy, urllib2.HTTPHandler)
1.2 Handlers:
Openers use handlers; all the "heavy" work is done by handlers. Each handler knows how to open URLs through a particular protocol, or how to handle a particular aspect of opening a URL, such as HTTP redirection or HTTP cookies.
More information on openers and handlers: http://www.voidspace.org.uk/python/articles/urllib2.shtml#openers-and-handlers
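To make the opener/handler relationship concrete, here is a minimal sketch (my own, not from the original article) that builds an opener with a cookie handler and lists the handlers installed in it. The try/except import is an assumption added only so the snippet runs on Python 3 as well, where urllib2 was merged into urllib.request:

```python
try:  # Python 2, as used throughout this article
    import urllib2 as urlreq
    import cookielib as cookiejar_mod
except ImportError:  # Python 3: urllib2 became urllib.request
    import urllib.request as urlreq
    import http.cookiejar as cookiejar_mod

# A CookieJar holds the cookies; HTTPCookieProcessor is the handler
# that reads and writes them on each request/response.
jar = cookiejar_mod.CookieJar()
cookie_handler = urlreq.HTTPCookieProcessor(jar)

# build_opener returns an OpenerDirector with our handler installed
# alongside the defaults (HTTPHandler, HTTPRedirectHandler, ...).
opener = urlreq.build_opener(cookie_handler)

handler_names = [type(h).__name__ for h in opener.handlers]
print(sorted(handler_names))
```

Note that build_opener keeps the default handlers, so adding a cookie handler does not remove redirect or HTTP support.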
2. urllib2 usage tips
2.1 Creating an opener with a proxy IP
Note: currently urllib2 does not support fetching HTTPS locations through a proxy, which can be a problem. See http://www.voidspace.org.uk/python/articles/urllib2.shtml#proxies
import urllib2
proxy_handler = urllib2.ProxyHandler({'http': '54.186.78.110:3128'})  # make sure the proxy IP is still usable; this example IP is in the US
opener = urllib2.build_opener(proxy_handler)
request = urllib2.Request(url, post_data, login_headers)  # this example also submits post_data and header information
response = opener.open(request)
print response.read().decode('utf-8')
2.2 Setting timeouts with the timeout parameter
import urllib2
response = urllib2.urlopen('http://www.google.com', timeout=10)
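Besides the per-call timeout parameter, the socket module offers a process-wide default timeout that urllib2 falls back to when no timeout is passed. This is a small offline-runnable sketch of that alternative (my addition, not from the original article); note the setting affects every new socket in the process, so use it with care:

```python
import socket

# Set a global default timeout (in seconds) for all new sockets.
# urllib2.urlopen uses this default when no timeout argument is given.
socket.setdefaulttimeout(10)
print(socket.getdefaulttimeout())

# Pass socket.setdefaulttimeout(None) to restore blocking behaviour.
```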
2.3 Masquerading as a browser
Some web servers check the headers of incoming requests, so when visiting certain sites you will see exceptions such as HTTPError: HTTP Error 403: Forbidden. This is because many sites now block crawlers, which put extra load on the server. One difference between a crawler's HTTP request and a browser's is that when a user sends a request, the browser's version information is included in the HTTP request message, while a bare crawler sends no such header. This information travels in the HTTP User-Agent field; when the server receives a page request without it, it cannot tell which browser, operating system, or hardware platform sent the request, and may treat the request as abnormal access. You can inspect a real browser's request headers with a tool such as Fiddler. To resolve this, pass a headers dictionary via urllib2's Request.
The following example submits User-Agent information in the headers, masquerading as a browser when sending the request. Viewing the User-Agent is easy: in Chrome, press F12 to open the developer tools and inspect the request headers under the Network tab.
To defeat anti-hotlinking measures, some sites check whether the Referer header points to the site itself, so you can set Referer in the same headers dictionary.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    'Referer': 'https://login.taobao.com/member/login.jhtml?redirectURL=https%3A%2F%2Fwww.taobao.com%2F'
}
request = urllib2.Request(
    url="https://login.taobao.com/member/login.jhtml?redirectURL=https%3A%2F%2Fwww.taobao.com%2F",
    data=postdata,
    headers=headers
)
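Building a Request does not touch the network, so you can verify your headers offline before sending anything. This sketch (my own; the example.com URL and header values are placeholders) also shows a quirk worth knowing: Request normalizes header names by capitalizing only the first letter, so you read back 'User-Agent' under the key 'User-agent'. The compatibility import is an assumption so the snippet runs on Python 3 too:

```python
try:  # Python 2
    import urllib2 as urlreq
except ImportError:  # Python 3
    import urllib.request as urlreq

headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'http://example.com/'}
req = urlreq.Request('http://example.com/page', headers=headers)

# Header keys are stored capitalized: 'User-Agent' becomes 'User-agent'.
print(req.get_header('User-agent'))
print(req.has_header('Referer'))
```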
More information about HTTP headers: http://rlog.cn/?p=521
2.4 Use of cookies
Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track the session. For example, some sites require you to log in before certain pages can be accessed, so scraping those pages before logging in is not possible. We can use the urllib2 library to save the cookies from our login and then crawl the other pages with them.
An example of using cookies is shown below.
import urllib2
import cookielib
# declare a CookieJar instance to hold the cookies
cookie = cookielib.CookieJar()
# use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# build an opener with the handler
opener = urllib2.build_opener(handler)
# the open method here works like urllib2's urlopen, and can also take a Request
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
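For the login-reuse scenario described above, an LWPCookieJar (the class used in the earlier proxy-login snippet) can persist cookies to disk between runs, so you only log in once. A minimal offline sketch of the save/load round trip (my addition; the temp-file path and the Python 3 fallback import are assumptions made so it runs anywhere):

```python
import os
import tempfile

try:  # Python 2
    import cookielib as cookiejar_mod
except ImportError:  # Python 3
    import http.cookiejar as cookiejar_mod

# LWPCookieJar can serialize its cookies in the libwww-perl format.
path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')
jar = cookiejar_mod.LWPCookieJar(path)

jar.save()  # write the jar (empty here; after a real login it holds the session)
jar.load()  # in a later run, this restores the saved session cookies
print(os.path.exists(path))
```

Pass ignore_discard=True to save() if you also want to keep session cookies that the server marked as discardable.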
2.5 Return codes of urllib2.urlopen
If no exception is raised, you can get the status code with the getcode() method; for failed requests urlopen raises an exception, so you need to handle it:
import urllib2
try:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    print response.read().decode('utf-8')
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
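The hasattr checks work because HTTPError (raised for error status codes like 403 or 404) is a subclass of URLError and carries a numeric code attribute, while a plain URLError (e.g. a DNS failure) may only have reason; on some older Pythons HTTPError lacked reason entirely, which is exactly why the guards are used. This offline sketch (my addition; the URL is a placeholder and the HTTPError is constructed by hand purely to show its attributes) illustrates the relationship. The compatibility import is an assumption for running under Python 3:

```python
try:  # Python 2
    import urllib2 as urlreq
except ImportError:  # Python 3: urllib.request re-exports URLError/HTTPError
    import urllib.request as urlreq

# HTTPError is a URLError subclass, so one except clause catches both;
# only HTTPError instances are guaranteed to carry the status in .code.
print(issubclass(urlreq.HTTPError, urlreq.URLError))

# Signature: HTTPError(url, code, msg, hdrs, fp)
err = urlreq.HTTPError('http://example.com/missing', 404, 'Not Found', None, None)
print(err.code, err.reason)
```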
Reference Links:
http://blog.csdn.net/pleasecallmewhy/article/details/8925978
Original address: http://www.cnblogs.com/wuwenyan/p/4749018.html
"Python Crawler Learning Notes (1)": a summary of knowledge points about the urllib2 library