Web crawler (iv): Introduction and application of opener and handler


Before we start, let's introduce two methods in urllib2: info() and geturl(). The response object returned by urlopen (or an HTTPError instance) has two very useful methods, info() and geturl().

1. geturl():

This returns the real URL retrieved, which is useful because urlopen (or the opener object) may follow redirects: the URL you end up with may differ from the URL you requested.

As an example, take a Renren short link that redirects.

Let's build a urllib2_test10.py to compare the original URL with the redirected link:

from urllib2 import Request, urlopen

old_url = 'http://rrurl.cn/b1UZuP'
req = Request(old_url)
response = urlopen(req)
print 'Old URL: ' + old_url
print 'Real URL: ' + response.geturl()

After running, you can see the URL that the real link points to:

2. info():

This returns a dictionary-like object describing the page that was fetched, built from the headers the server sends. The returned object is an httplib.HTTPMessage instance. Classic headers include Content-Length, Content-Type, and so on.

Let's build a urllib2_test11.py to test the use of info():

from urllib2 import Request, urlopen

old_url = 'http://www.baidu.com'
req = Request(old_url)
response = urlopen(req)
print 'Info():'
print response.info()

Running it prints the header information of the page:

Now for two important concepts in urllib2: openers and handlers.


When you fetch a URL you use an opener (an instance of urllib2.OpenerDirector).

So far we have been using the default opener, via urlopen.

urlopen is a special opener: it can be understood as a special instance of an opener whose arguments are simply url, data, and timeout.


Openers use processors (handlers); all the "heavy" work is done by the handlers.

Each handler knows how to open URLs over a particular protocol, or how to handle some aspect of opening a URL, such as HTTP redirects or HTTP cookies.

If you want to fetch URLs with a specific processor you will need to create an opener, for example an opener that handles cookies, or an opener that does not follow redirects.

To create an opener you can instantiate an OpenerDirector and then call .add_handler(some_handler_instance) repeatedly. Alternatively, you can use build_opener, a more convenient function for creating opener objects that needs only a single call.
build_opener adds several handlers by default, and provides a quick way to add more handlers or override the defaults.
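A minimal sketch of the two approaches (written with a compatibility import so it also runs on Python 3, where urllib2 became urllib.request):

```python
try:
    import urllib2 as request  # Python 2
except ImportError:
    import urllib.request as request  # Python 3

# 1) Instantiate an OpenerDirector and add handlers one by one.
#    (build_opener would also add sensible defaults for us.)
opener_manual = request.OpenerDirector()
opener_manual.add_handler(request.HTTPHandler())
opener_manual.add_handler(request.HTTPRedirectHandler())

# 2) Let build_opener create an opener in a single call:
#    defaults plus whatever handlers we pass in.
opener_quick = request.build_opener(request.HTTPRedirectHandler)

# Both are OpenerDirector instances with an open() method.
```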

Other handlers you might want can deal with proxies, authentication, and other common but slightly special situations.

install_opener is used to install a (global) default opener. This means that calls to urlopen will use the opener you installed.

The opener object has an open method, which can be used directly to fetch URLs just like the urlopen function: calling install_opener is never strictly necessary, only convenient.
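As a sketch of this relationship (using the same compatibility import so it also runs on Python 3):

```python
try:
    import urllib2 as request  # Python 2
except ImportError:
    import urllib.request as request  # Python 3

opener = request.build_opener()

# opener.open(url) can be used directly, just like urlopen(url).
# After install_opener, plain urlopen calls route through this opener too:
request.install_opener(opener)
```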

Having covered those two concepts, let's look at Basic authentication, which uses the openers and handlers described above.

2.1 Basic authentication

To demonstrate creating and installing a handler, we will use HTTPBasicAuthHandler.

When basic authentication is required, the server sends a header (along with a 401 error code) requesting authentication. The header specifies a scheme and a "realm" and looks like this: WWW-Authenticate: SCHEME realm="REALM". For example:
WWW-Authenticate: Basic realm="cPanel Users"

The client must then retry the request, including the correct user name and password in the request header.
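Concretely, for the Basic scheme the retried request carries an Authorization header whose value is the base64 encoding of "username:password". A small sketch (the credentials are the made-up ones used later in this article):

```python
import base64

# hypothetical credentials, matching the example later in this article
username, password = 'why', '1223'
credentials = base64.b64encode(('%s:%s' % (username, password)).encode('ascii'))
auth_header = 'Basic ' + credentials.decode('ascii')
# The client resends the request with a header line like:
#   Authorization: Basic d2h5OjEyMjM=
```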

This is "Basic authentication"; to simplify the process we can create an HTTPBasicAuthHandler instance and have an opener use this handler.

HTTPBasicAuthHandler uses a password manager object to handle the mapping of URLs and realms to user names and passwords.

If you know what the realm is (from the header sent by the server), you can use HTTPPasswordMgr.

Usually people don't care what the realm is; in that case you can use the convenient HTTPPasswordMgrWithDefaultRealm. It lets you specify a default user name and password for a URL, which will be supplied whenever no other combination has been provided for a particular realm. We indicate this by passing None as the realm argument to add_password.

The highest-level URL is the first one that requires authentication, and URLs deeper than the one you pass to .add_password() will match as well.
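This URL-prefix matching can be observed without any network traffic, using only the password manager (a sketch with made-up credentials; the compatibility import lets it run on Python 3 as well):

```python
try:
    import urllib2 as request  # Python 2
except ImportError:
    import urllib.request as request  # Python 3

password_mgr = request.HTTPPasswordMgrWithDefaultRealm()
# None as the realm means "use these credentials for any realm"
password_mgr.add_password(None, 'http://example.com/foo/', 'why', '1223')

# A deeper URL under the registered prefix matches too:
user, pw = password_mgr.find_user_password('Some Realm',
                                           'http://example.com/foo/bar.html')
```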

Enough talk; let's use an example to illustrate what was said above.

Let's build a urllib2_test12.py to test Basic authentication:


import urllib2

# Create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# Add a user name and password
top_level_url = "http://example.com/foo/"
# If you know the realm, you can use it instead of None:
# password_mgr.add_password(realm, top_level_url, username, password)
password_mgr.add_password(None, top_level_url, 'why', '1223')

# Create a new handler
handler = urllib2.HTTPBasicAuthHandler(password_mgr)

# Create an "opener" (an OpenerDirector instance)
opener = urllib2.build_opener(handler)

a_url = 'http://www.baidu.com/'

# Use the opener to fetch a URL
opener.open(a_url)

# Install the opener.
# Now all calls to urllib2.urlopen will use our opener.
urllib2.install_opener(opener)

Note: in the example above we only supplied our HTTPBasicAuthHandler to build_opener.

The default opener has handlers for the normal situations: ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, and HTTPErrorProcessor. The top_level_url in the code can actually be a full URL (including the "http:" scheme component, the host name, and an optional port number).

For example: http://example.com/. It can also be an "authority", that is, a host name and an optional port number: for example "example.com" or "example.com:8080" (the latter includes a port number).

2.2 Cookies

Getting cookies and saving them to a variable

First we use a CookieJar object to fetch cookies and store them in a variable; let's get a feel for it:


import urllib2
import cookielib

# Declare a CookieJar object to hold the cookies
cookie = cookielib.CookieJar()

# Create a handler that uses our CookieJar
handler = urllib2.HTTPCookieProcessor(cookie)

opener = urllib2.build_opener(handler)

# The open method here works like urllib2's urlopen; it also accepts a Request
response = opener.open('http://www.baidu.com')

print 'name' + ' | ' + 'value'
for item in cookie:
    print item.name, '|', item.value

With the method above we save the cookies in a variable and then print their values; the result looks like this:

Saving cookies to a file

In the method above we saved the cookies in a variable. What if we want to save them to a file? This is where the FileCookieJar object comes in; here we use its subclass MozillaCookieJar to save the cookies.

import cookielib
import urllib2

# Set the file that cookies are saved to, cookie.txt, in the current directory
filename = 'cookie.txt'
# Declare a MozillaCookieJar object instance to hold the cookies, then write the file
cookie = cookielib.MozillaCookieJar(filename)
handler = urllib2.HTTPCookieProcessor(cookie)
opener = urllib2.build_opener(handler)
# Make a request; the principle is the same as urlopen
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)

About the two parameters of the final save method: ignore_discard means save the cookies even if they are marked to be discarded, and ignore_expires means save the cookies even if they have already expired. Here we set both to True. After running, the cookies are saved to cookie.txt; open the file and you can see its contents:
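The effect of ignore_discard can be demonstrated offline by building a session cookie by hand and round-tripping it through a MozillaCookieJar (a sketch; the cookie name and value are made up, and the compatibility import lets it run on Python 3 as well):

```python
import os
import tempfile
try:
    import cookielib  # Python 2
except ImportError:
    import http.cookiejar as cookielib  # Python 3

# Build a session cookie by hand (discard=True, no expiry) so we can
# see the effect of ignore_discard without any network access.
c = cookielib.Cookie(
    version=0, name='BAIDUID', value='dummy', port=None, port_specified=False,
    domain='www.baidu.com', domain_specified=False, domain_initial_dot=False,
    path='/', path_specified=True, secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={}, rfc2109=False)

filename = os.path.join(tempfile.mkdtemp(), 'cookie.txt')
jar = cookielib.MozillaCookieJar(filename)
jar.set_cookie(c)
# Without ignore_discard=True, this session cookie would be skipped on save
jar.save(ignore_discard=True, ignore_expires=True)

jar2 = cookielib.MozillaCookieJar()
jar2.load(filename, ignore_discard=True, ignore_expires=True)
names = [ck.name for ck in jar2]
```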

Getting cookies from a file and using them to visit a site

Now that we have saved the cookies to a file, if we want to use them we can read the cookies back from the file and visit the site as follows:

import cookielib
import urllib2

cookie = cookielib.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
req = urllib2.Request("http://www.baidu.com")
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open(req)
print response.read()

Imagine: if our cookie.txt file stored the cookies from someone's Baidu login, then by extracting the contents of that cookie file we could use the method above to simulate logging in to Baidu with that person's account.

Using cookies to simulate a web site login

You can refer to this article: http://www.cnblogs.com/sysu-blackbear/p/3629770.html


Original link: http://blog.csdn.net/pleasecallmewhy/article/details/8924889
