[Python] Web crawler (IV): Introduction to and application of openers and handlers

Source: Internet
Author: User

http://blog.csdn.net/pleasecallmewhy/article/details/8924889

A better learning site: http://www.voidspace.org.uk/python/articles/urllib2.shtml#openers-and-handlers


The following are my personal study notes.


Before we start, a word about two methods in urllib2: info() and geturl(). The response object returned by urlopen (or an HTTPError instance) has these two very useful methods.

1. geturl():

This returns the real URL of the page that was fetched. This is useful because urlopen (or the opener object used) may have followed a redirect, so the URL you end up with may differ from the URL you requested.

Take a hyperlink from Renren as an example.

Let's create urllib2_test10.py to compare the original URL with the redirected link:

    from urllib2 import Request, urlopen, URLError, HTTPError

    old_url = 'http://rrurl.cn/b1uzup'
    req = Request(old_url)
    response = urlopen(req)
    print 'Old URL: ' + old_url
    print 'Real URL: ' + response.geturl()

After running it, you can see the URL that the link really points to.
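The same comparison can be tried without network access. The following is only a sketch under stated assumptions, not the script above: it uses Python 3's urllib.request (where the urllib2 API lives in Python 3), and FakeHTTPHandler as well as the short.example and long.example URLs are made-up stand-ins. The handler (a concept introduced later in these notes) answers requests from memory, so the redirect is simulated rather than real.

```python
import io
import email.message
import urllib.request
import urllib.response


class FakeHTTPHandler(urllib.request.HTTPHandler):
    """Hypothetical stand-in: answers http:// requests from memory."""

    def http_open(self, req):
        url = req.get_full_url()
        headers = email.message.Message()
        if url == "http://short.example/abc":
            # Pretend the server redirects the short link.
            code, reason = 302, "Found"
            headers["Location"] = "http://long.example/real/page"
        else:
            code, reason = 200, "OK"
            headers["Content-Type"] = "text/html"
        resp = urllib.response.addinfourl(io.BytesIO(b"hello"), headers, url, code)
        resp.msg = reason  # the error/redirect machinery reads response.msg
        return resp


# Because FakeHTTPHandler subclasses HTTPHandler, it replaces the default one.
opener = urllib.request.build_opener(FakeHTTPHandler)

old_url = "http://short.example/abc"
response = opener.open(old_url)
real_url = response.geturl()

print("Old URL: ", old_url)
print("Real URL:", real_url)
```

The default HTTPRedirectHandler follows the simulated 302, so geturl() reports the URL that was finally fetched rather than the one that was requested.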



2. info():

This returns a dictionary-like object that describes the page fetched, in particular the headers sent by the server. It is currently an httplib.HTTPMessage instance. Typical headers include "Content-length", "Content-type", and so on.

Let's create urllib2_test11.py to try out info():

    from urllib2 import Request, urlopen, URLError, HTTPError

    old_url = 'http://www.baidu.com'
    req = Request(old_url)
    response = urlopen(req)
    print 'Info():'
    print response.info()

Running it prints the header information for the page.
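As an offline illustration, the sketch below calls info() on the response for a data: URL. It assumes Python 3's urllib.request (the successor of urllib2), where the returned object is an email-style message rather than httplib.HTTPMessage. The URL is an arbitrary example, and a real web page would return many more headers.

```python
import urllib.request

# A data: URL is resolved locally, so no network access is needed.
response = urllib.request.urlopen("data:text/plain,hello")

headers = response.info()
content_type = headers["Content-type"]
content_length = headers["Content-length"]

print("Info():")
print(headers)
```

Header lookup is case-insensitive, so headers["content-type"] returns the same value.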



Next come two important concepts in urllib2: openers and handlers.

1. Openers:

When you fetch a URL you use an opener (an instance of urllib2.OpenerDirector).

Normally we use the default opener, via urlopen.

But you can also create custom openers.

2. Handlers:

Openers use handlers; all the "heavy lifting" is done by the handlers.

Each handler knows how to open URLs for a particular protocol, or how to handle one aspect of opening a URL, such as HTTP redirection or HTTP cookies.


If you want to fetch URLs with specific handlers installed, you will want to create an opener: for example, an opener that handles cookies, or an opener that does not follow redirects.
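One way an opener that does not follow redirects might look is sketched below. This is an illustration under assumptions, not a method from the original notes: it is written for Python 3's urllib.request (the Python 3 home of the urllib2 API), NoRedirectHandler and FakeHTTPHandler are hypothetical names, and the fake handler replaces the network with a canned 302 response so that the example runs offline.

```python
import io
import email.message
import urllib.error
import urllib.request
import urllib.response


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Refuses every redirect by never producing a follow-up request."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


class FakeHTTPHandler(urllib.request.HTTPHandler):
    """Hypothetical stand-in that always answers with a 302 redirect."""

    def http_open(self, req):
        headers = email.message.Message()
        headers["Location"] = "http://long.example/real/page"
        resp = urllib.response.addinfourl(
            io.BytesIO(b""), headers, req.get_full_url(), 302)
        resp.msg = "Found"
        return resp


# Both classes subclass default handlers, so they replace those defaults.
opener = urllib.request.build_opener(NoRedirectHandler, FakeHTTPHandler)

try:
    opener.open("http://short.example/abc")
    outcome = "followed the redirect"
except urllib.error.HTTPError as err:
    # With no handler willing to follow it, the 302 surfaces as an error.
    outcome = "stopped at %d, Location: %s" % (err.code, err.headers["Location"])

print(outcome)
```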


To create an opener, you can instantiate an OpenerDirector

and then call .add_handler(some_handler_instance) repeatedly. Alternatively, you can use build_opener, a convenience function that creates an opener object with a single function call.
build_opener adds several handlers by default, but provides a quick way to add more handlers or override the default ones.
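The two ways of creating an opener can be compared side by side. The sketch below assumes Python 3, where OpenerDirector and build_opener live in urllib.request instead of urllib2; the choice of DataHandler and HTTPCookieProcessor is only for illustration.

```python
import urllib.request

# Manual route: start from an empty OpenerDirector and add handlers yourself.
manual = urllib.request.OpenerDirector()
manual.add_handler(urllib.request.DataHandler())
manual_body = manual.open("data:text/plain,hi").read()

# Convenient route: build_opener adds the default handlers plus the extras.
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
handler_names = [type(h).__name__ for h in opener.handlers]

print(manual_body)
print(handler_names)
```

The manual opener only understands data: URLs because that is the only handler it was given, while the opener from build_opener contains the cookie processor alongside defaults such as HTTPRedirectHandler.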

Other kinds of handlers you might want can deal with proxies, authentication, and other common but slightly specialised situations.


install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.

Opener objects have an open method.

It can be called directly to fetch URLs in the same way as the urlopen function, so there is no need to call install_opener, except as a convenience.
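The difference between opener.open and install_opener can be seen in a small experiment. This is a sketch assuming Python 3's urllib.request (the successor of urllib2); MemoHandler and the "memo:" scheme are invented for the example and are not part of any library.

```python
import io
import email.message
import urllib.error
import urllib.request
import urllib.response


class MemoHandler(urllib.request.BaseHandler):
    """Hypothetical handler for a made-up 'memo:' scheme."""

    def memo_open(self, req):
        # Return everything after "memo:" as the response body.
        body = req.full_url.split(":", 1)[1].encode("utf-8")
        return urllib.response.addinfourl(
            io.BytesIO(body), email.message.Message(), req.full_url, 200)


opener = urllib.request.build_opener(MemoHandler)

# The opener can be used directly, without installing it.
direct = opener.open("memo:hello").read()

# The default opener behind urlopen does not know the scheme yet.
try:
    urllib.request.urlopen("memo:hello")
    known_before_install = True
except urllib.error.URLError:
    known_before_install = False

# After installation, urlopen goes through our opener.
urllib.request.install_opener(opener)
via_urlopen = urllib.request.urlopen("memo:hello").read()

print(direct, known_before_install, via_urlopen)
```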


Having covered these two topics, let's look at basic authentication, which uses the openers and handlers described above.

Basic Authentication

To demonstrate creating and installing a handler, we will use HTTPBasicAuthHandler.

When basic authentication is required, the server sends a header (along with the 401 error code) requesting authentication. This specifies the authentication scheme and a 'realm', and looks like this: WWW-Authenticate: SCHEME realm="REALM". For example:
WWW-Authenticate: Basic realm="cPanel Users"

The client must then retry the request, including the correct username and password for the realm in a request header.
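For the Basic scheme, the header the client sends back is Authorization: Basic followed by the Base64 encoding of "username:password". A minimal sketch of building that value by hand follows; basic_auth_header is a hypothetical helper name, and the credentials are the well-known example from the HTTP authentication RFCs, not real ones.

```python
import base64


def basic_auth_header(username, password):
    """Return the value of the Authorization header for basic authentication."""
    raw = "%s:%s" % (username, password)
    token = base64.b64encode(raw.encode("utf-8")).decode("ascii")
    return "Basic " + token


header_value = basic_auth_header("Aladdin", "open sesame")
print("Authorization: " + header_value)
```

Base64 is an encoding, not encryption, so basic authentication only protects the password when the connection itself is encrypted.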

This is "basic authentication". To simplify the process, we can create an HTTPBasicAuthHandler instance and have an opener use that handler.

HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to usernames and passwords.

If you know what the realm is (from the authentication header sent by the server), you can use HTTPPasswordMgr.

Usually people don't care what the realm is. In that case, you can use the convenient HTTPPasswordMgrWithDefaultRealm.

This lets you specify a default username and password for a URL.

These are supplied whenever you have not provided a different combination for a specific realm.

We indicate this by passing None as the realm argument to add_password.


The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match.
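The lookup rules can be checked directly on a password manager through its find_user_password method. The sketch below assumes Python 3, where the class lives in urllib.request rather than urllib2; the realm names and the "admin"/"secret" credentials are invented for the example.

```python
import urllib.request

mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
top_level_url = "http://example.com/foo/"

# Default credentials (realm None) plus an override for one specific realm.
mgr.add_password(None, top_level_url, "why", "1223")
mgr.add_password("Admin Area", top_level_url, "admin", "secret")

# An unknown realm on a deeper URL falls back to the default credentials.
default_match = mgr.find_user_password(
    "Some Realm", "http://example.com/foo/bar/page.html")

# A realm with its own entry gets that entry.
realm_match = mgr.find_user_password("Admin Area", top_level_url)

# A URL outside the top-level URL matches nothing.
no_match = mgr.find_user_password("Some Realm", "http://example.com/other/")

print(default_match, realm_match, no_match)
```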

With all that said, let's use an example to illustrate the points above.

Let's create urllib2_test12.py to try out basic authentication with an opener and a handler:

    # -*- coding: utf-8 -*-
    import urllib2

    # Create a password manager
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password
    top_level_url = "http://example.com/foo/"

    # If we knew the realm, we could use it instead of None.
    # password_mgr.add_password(None, top_level_url, username, password)
    password_mgr.add_password(None, top_level_url, 'why', '1223')

    # Create a new handler
    handler = urllib2.HTTPBasicAuthHandler(password_mgr)

    # Create an "opener" (an OpenerDirector instance)
    opener = urllib2.build_opener(handler)

    a_url = 'http://www.baidu.com/'

    # Use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib2.urlopen will use our opener.
    urllib2.install_opener(opener)
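To watch the handler actually answer a 401 challenge without a real server, the whole exchange can be simulated. This is a sketch under assumptions: it targets Python 3's urllib.request instead of urllib2, and FakeProtectedHandler, the "Test Realm" realm and the page URL are made up so that the example runs offline.

```python
import io
import email.message
import urllib.request
import urllib.response

seen_auth = []  # Authorization values received by the fake server


class FakeProtectedHandler(urllib.request.HTTPHandler):
    """Hypothetical stand-in for a server that demands basic authentication."""

    def http_open(self, req):
        auth = req.get_header("Authorization")
        seen_auth.append(auth)
        headers = email.message.Message()
        if auth is None:
            # First attempt: challenge the client.
            code, reason, body = 401, "Unauthorized", b""
            headers["WWW-Authenticate"] = 'Basic realm="Test Realm"'
        else:
            code, reason, body = 200, "OK", b"secret page"
        resp = urllib.response.addinfourl(
            io.BytesIO(body), headers, req.get_full_url(), code)
        resp.msg = reason
        return resp


password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "http://example.com/foo/", "why", "1223")
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

opener = urllib.request.build_opener(auth_handler, FakeProtectedHandler)
body = opener.open("http://example.com/foo/page.html").read()

print(body)
print(seen_auth)
```

The fake server sees two requests: the first without credentials, and a retry in which HTTPBasicAuthHandler has added the Authorization header looked up from the password manager.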

Note: in the example above we only passed our HTTPBasicAuthHandler to build_opener.

By default, openers have handlers for the normal situations: ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor. The top_level_url in the code can be a full URL (including the "http:" scheme, the host name and, optionally, the port number).

For example: http://example.com/.

It can also be an "authority" (that is, the host name, optionally including the port number).

For example: "example.com" or "example.com:8080".

The latter includes the port number.
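Both forms can be tried on a password manager directly. The sketch below assumes Python 3's urllib.request (the successor of urllib2); the hosts, usernames and passwords are placeholders chosen for the example.

```python
import urllib.request

mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# Registered as an "authority" (host name plus port).
mgr.add_password(None, "example.com:8080", "port_user", "port_pass")
# Registered as a full URL.
mgr.add_password(None, "http://example.org/", "url_user", "url_pass")

by_authority = mgr.find_user_password(
    "Any Realm", "http://example.com:8080/any/page")
by_full_url = mgr.find_user_password(
    "Any Realm", "http://example.org/docs/index.html")
# The same host on a different port does not match.
other_port = mgr.find_user_password(
    "Any Realm", "http://example.com:9090/any/page")

print(by_authority, by_full_url, other_port)
```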
