Python urllib2 Module Usage in Detail

Source: Internet
Author: User
Tags: urlencode

Brief introduction:
urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function, which is capable of fetching URLs using a variety of protocols. It also offers a slightly more complex interface for handling common situations such as basic authentication, cookies, and proxies. These are handled by objects called openers and handlers.

Here's the easiest way to get a URL:

import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()

Many uses of urllib2 are this simple (note that instead of a URL starting with "http:" we could have used one starting with "ftp:" or "file:"). However, the purpose of this tutorial is to explain the more complicated cases, concentrating on HTTP. HTTP is based on requests and responses: the client makes a request and the server returns a response. urllib2 mirrors this with the Request object, which represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the requested URL. This response is a file-like object, which means you can, for example, call .read() on it:

import urllib2

req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()

Note that urllib2 uses the same Request interface to handle all URL schemes. For example, you can make an FTP request like so:

req = urllib2.Request('ftp://example.com/')

In the case of HTTP, the Request object allows you to do two extra things. First, you can pass data to be sent to the server. Second, you can pass extra information ("metadata") about the data or about the request itself to the server; this information is sent as HTTP headers. Let's look at each of these in turn.

Data:
Sometimes you want to send data to a URL (often the URL refers to a CGI script or another web application). With HTTP, this is often done using what is known as a POST request. This is what your browser does when you submit a form you have filled in on the web. Not all POSTs come from forms, though. The data needs to be encoded in a standard way and then passed to the Request object as the data argument. The encoding is done using a function from urllib, not from urllib2.

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}

data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()

If you do not pass the data argument, urllib2 uses a GET request. One way in which GET and POST requests differ is that POST requests often have side effects: they change the state of the system in some way (for example, placing an order through a web page to have a crate of canned beef delivered to your door). Though the HTTP standard makes it clear that POSTs are intended to always cause side effects and GET requests never to, nothing prevents a GET request from having side effects or a POST request from having none. Data can also be passed in a GET request by encoding it into the URL itself.

This is achieved as follows:
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)

Headers:
We will discuss one particular HTTP header here, to illustrate how to add headers to your HTTP request.
Some websites dislike being browsed by programs, or send different versions to different browsers. By default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or simply not work at all. The way a browser identifies itself is through the User-Agent header. When you create a Request object you can pass in a dictionary of headers. The following example makes the same request as above, but identifies itself as a version of Internet Explorer.

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()

The response also has two useful methods. We will look at them in the section on info and geturl, after we see what happens when things go wrong.

Exception handling:

When it cannot handle a response, urlopen raises URLError (though, as usual with Python APIs, built-in exceptions such as ValueError, TypeError, and so on may also be raised).
HTTPError is the subclass of URLError that is raised in the specific case of HTTP URLs.
URLError:
Often, URLError is raised because there is no network connection (no route to the specified server), or because the specified server does not exist. In this case, the exception raised has a reason attribute, a tuple containing an error code and a text error message.

e.g.
>>> req = urllib2.Request('http://www.pretend_server.org')
>>> try:
...     urllib2.urlopen(req)
... except urllib2.URLError, e:
...     print e.reason
...
(4, 'getaddrinfo failed')

HTTPError:
When an error occurs, the server responds by returning an HTTP error code and an error page. The HTTPError instance can be used as a response for the page returned. This means that, as well as the code attribute, it also has read, geturl, and info methods.

>>> req = urllib2.Request('http://www.python.org/fish.html')
>>> try:
...     urllib2.urlopen(req)
... except urllib2.URLError, e:
...     print e.code
...     print e.read()
...
404
... etc.

Fault tolerance:
If you want to be prepared for HTTP errors and URL errors, there are two basic approaches. I prefer the second.

1.
from urllib2 import Request, urlopen, URLError, HTTPError

req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    pass  # everything is fine

Note: the except HTTPError clause must come first, otherwise except URLError will also catch an HTTPError.
2.
from urllib2 import Request, urlopen, URLError

req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    pass  # everything is fine

Note: URLError is a subclass of the built-in exception IOError. This means that you can avoid importing URLError and write:

from urllib2 import Request, urlopen

req = Request(someurl)
try:
    response = urlopen(req)
except IOError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    pass  # everything is fine

In rare circumstances, urllib2 can also raise socket.error.
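As a hedged sketch of what catching that can look like in practice (the URL below is a placeholder):

import socket
import urllib2

try:
    response = urllib2.urlopen('http://www.example.com/')
except socket.error, e:
    # raised directly in the rare cases mentioned above
    print 'Socket error: ', e
except IOError, e:
    # covers URLError and HTTPError as well
    print 'IO error: ', e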

info and geturl
The response (or HTTPError instance) returned by urlopen has two useful methods: info and geturl.

geturl – this returns the real URL of the page fetched. It is useful because urlopen (or the opener object used) may have followed a redirect: the URL of the page fetched may not be the same as the URL requested.

info – this returns a dictionary-like object describing the page fetched, particularly the headers sent by the server. It is currently an instance of httplib.HTTPMessage.
Typical headers include 'Content-length', 'Content-type', and so on. See the Quick Reference to HTTP Headers for a listing of HTTP headers with brief explanations of their meaning and use.
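A minimal sketch of both methods, assuming only that the placeholder URL below is reachable:

import urllib2

response = urllib2.urlopen('http://www.example.com/')
print response.geturl()                # final URL, after any redirects
print response.info()                  # all headers, as an httplib.HTTPMessage
print response.info()['Content-Type']  # an individual header, by name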
Openers and handlers
When you fetch a URL you use an opener (an instance of the perhaps confusingly named urllib2.OpenerDirector). Normally we have been using the default opener, via urlopen, but you can create custom openers. Openers use handlers; all the heavy lifting is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, and so on), or how to handle some aspect of URL opening, such as HTTP redirections or HTTP cookies.

You will want to create an opener if you need to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or one that does not handle redirections.

To create an opener, instantiate an OpenerDirector and then call .add_handler(some_handler_instance) repeatedly.
Alternatively, you can use build_opener, a convenience function that creates an opener object with a single call. build_opener adds several handlers by default, but provides a quick way to add more and/or override the defaults.
Other kinds of handlers you might want can deal with proxies, authentication, and other common but slightly specialised situations.
install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.
Opener objects have an open method, which can be called directly to fetch URLs in the same way as the urlopen function: there is no need to call install_opener, except as a convenience.
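For instance, here is a minimal sketch of building an opener that handles cookies, one of the examples mentioned above; the cookie jar name and URL are illustrative:

import cookielib
import urllib2

cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# either call the opener directly ...
response = opener.open('http://www.example.com/')

# ... or install it so that plain urllib2.urlopen uses it too
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.example.com/')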

Basic Authentication:

To illustrate creating and installing a handler we will use the HTTPBasicAuthHandler. For more on this subject, including a detailed discussion of how Basic Authentication works, see the Basic Authentication Tutorial.

When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. It specifies an authentication scheme and a realm. The header looks like this:

WWW-Authenticate: SCHEME realm="REALM"
e.g.
WWW-Authenticate: Basic realm="cPanel Users"

The client then retries the request with the correct username and password for the realm included as a header. This is Basic Authentication. To simplify the process we can create an instance of HTTPBasicAuthHandler and an opener to use this handler.
HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to usernames and passwords. If you know what the realm is (from the authentication header sent by the server), then you can use an HTTPPasswordMgr. Most of the time people do not care what the realm is, in which case it is convenient to use HTTPPasswordMgrWithDefaultRealm. This allows you to specify a default username and password for a URL, which will be supplied when you do not provide a combination for a particular realm. We indicate this by providing None as the realm argument to the add_password method.
The top-level URL is the first URL that requires authentication. URLs that are "deeper" than the URL you pass to .add_password() will also match.

# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# add the username and password
# if we knew the realm, we could use it instead of None
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)

handler = urllib2.HTTPBasicAuthHandler(password_mgr)

# create "opener" (OpenerDirector instance)
opener = urllib2.build_opener(handler)

# use the opener to fetch a URL
opener.open(a_url)

# install the opener
# now all calls to urllib2.urlopen use our opener
urllib2.install_opener(opener)

Note: in the example above we only supplied our HTTPBasicAuthHandler to build_opener. By default, openers have the handlers for normal situations: ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.
The top-level URL is in fact either a full URL (including the "http:" scheme component and the hostname and optionally the port number), such as "http://example.com/", or an "authority" (i.e. the hostname, optionally including the port number), such as "example.com" or "example.com:8080" (the latter example includes a port number). The authority, if present, must NOT contain the userinfo component; for example, "joe:password@example.com" is not correct.
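As an illustrative sketch, both forms are accepted as the URI argument to add_password; the credentials here are placeholders:

import urllib2

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# a full URL, including the scheme ...
password_mgr.add_password(None, "http://example.com/foo/", "user", "secret")
# ... or a bare authority, optionally with a port number
password_mgr.add_password(None, "example.com:8080", "user", "secret")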
Proxies:
urllib2 will auto-detect your proxy settings and use them. This happens through the ProxyHandler, which is part of the normal handler chain. Normally that is a good thing, but occasionally it is not helpful.
One way around this is to set up our own ProxyHandler with no proxies defined. This is done using steps similar to setting up a Basic Authentication handler:

>>> proxy_support = urllib2.ProxyHandler({})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)
Note:
Currently urllib2 does not support fetching HTTPS locations through a proxy. This can be a problem.
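Conversely, a hedged sketch of routing HTTP traffic through an explicitly chosen proxy; the proxy address is a placeholder:

import urllib2

# map the scheme to the proxy that should carry it
proxy_support = urllib2.ProxyHandler({'http': 'http://proxy.example.com:3128'})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)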
Sockets and Layers
Python's support for fetching resources from the web is layered: urllib2 uses the httplib library, which in turn uses the socket library.
As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications that have to fetch web pages. By default the socket module has no timeout and can hang.
Currently, the socket timeout is not exposed at the urllib2 or httplib levels. However, you can set the default timeout globally for all sockets.

import socket
import urllib2

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)

