urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function, which is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations such as basic authentication, cookies, proxies, and so on. These are provided through handler and opener objects.
urllib2 supports fetching URLs for many "schemes" (identified by the string before the ":" in the URL; for example, "ftp" is the scheme of "ftp://python.org/"), using their associated network protocols (such as FTP and HTTP). This tutorial focuses on the most common case: HTTP.
For simple situations, urlopen is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol (HTTP).
The most authoritative reference for HTTP is of course RFC 2616 (http://rfc.net/rfc2616.html). It is a technical document and not intended to be easy to read. The purpose of this HOWTO is to show how to use urllib2, with enough detail about HTTP to help you along. It is not a replacement for the urllib2 documentation, but is supplementary to it.
Fetching URLs
The simplest way to use urllib2:
code example:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
Many uses of urllib2 are this simple (note that instead of an "http:" URL, you could use a URL starting with "ftp:", "file:", and so on). However, the purpose of this tutorial is to explain the more complicated cases, concentrating on HTTP.
HTTP is based on a request-and-response mechanism: the client makes requests and the server sends responses. urllib2 mirrors this with a Request object, which represents the HTTP request you are making. In its simplest form, you create a Request object specifying the URL you want to fetch; calling urlopen with this Request object returns a response object for the requested URL. The response is a file-like object, which means you can, for example, call .read() on it.
import urllib2
req = urllib2.Request('http://www.pythontab.com')
response = urllib2.urlopen(req)
the_page = response.read()
Remember that urllib2 uses the same Request interface to handle all URL schemes. For example, you can create an FTP request like so:
req = urllib2.Request('ftp://example.com/')
In the case of HTTP, there are two extra things that Request objects allow you to do. First, you can pass data to be sent to the server. Second, you can pass extra information about the data or about the request itself ("metadata") to the server; this information is sent as HTTP "headers".
Let's take a look at how these are sent.
Data
Sometimes you want to send data to a URL (often the URL refers to a CGI (Common Gateway Interface) script or some other web application). With HTTP, this is often done using what is known as a POST request. This is what your browser does when you submit an HTML form.
Not all POSTs come from forms: you can use POST to transmit arbitrary data to your own application. For ordinary HTML forms, the data needs to be encoded in a standard way and then passed to the Request object as the data argument. The encoding is done using a function from urllib, not urllib2.
code example:
import urllib
import urllib2

url = 'http://www.php.cn'
values = {'name': 'Michael Foord',
          'location': 'Pythontab',
          'language': 'Python'}

data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
Remember that other encodings are sometimes required (e.g. for file upload from HTML forms; see the HTML specification, Form Submission (http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13), for more details).
If you do not pass the data argument, urllib2 uses a GET request instead. One way in which GET and POST requests differ is that POST requests often have "side effects": they change the state of the system in some way (for example, by placing an order with the website for goods to be delivered to your door).
Though the HTTP standard makes it clear that POSTs are intended to always cause side effects and GET requests never to, nothing prevents a GET request from having side effects, nor a POST request from having none. Data can also be passed in an HTTP GET request by encoding it into the URL itself.
code example:
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Pythontab'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=Somebody+Here&language=Python&location=Pythontab
>>> url = 'http://www.pythontab.com'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
Headers
We will discuss one particular HTTP header here, to illustrate how to add headers to your HTTP request.
Some websites dislike being browsed by programs (non-human access), or send different versions of content to different browsers. By default, urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor versions of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or simply not work. A browser identifies itself through the User-Agent header; when you create a Request object, you can pass in a dictionary of headers. The following example makes the same request as above, but identifies itself as a version of Internet Explorer.
import urllib
import urllib2

url = 'http://www.php.cn'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Michael Foord',
          'location': 'Pythontab',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
The response object also has two useful methods. See the section on info and geturl below, which comes after we have a look at what happens when things go wrong.
Handling Exceptions
urlopen raises URLError when it cannot handle a response (though, as usual with Python APIs, built-in exceptions such as ValueError, TypeError, etc. may also be raised).
HTTPError is a subclass of URLError raised in the specific case of HTTP URLs.
URLError
Often, URLError is raised because there is no network connection (no route to the specified server), or the specified server doesn't exist. In this case, the exception raised has a "reason" attribute, which is a tuple containing an error code and a text error message.
For example
>>> req = urllib2.Request('http://www.php.cn')
>>> try:
...     urllib2.urlopen(req)
... except urllib2.URLError, e:
...     print e.reason
...
(4, 'getaddrinfo failed')
HTTPError
Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server was unable to fulfil the request. The default handlers will deal with some of these responses for you (for example, if the response is a "redirection" requesting the client to fetch the document from a different URL, urllib2 will handle that for you). For those it can't handle, urlopen raises an HTTPError. Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).
See section 10 of RFC 2616 for a reference on all the HTTP error codes.
When raised, an HTTPError instance has an integer "code" attribute, which corresponds to the error code sent by the server.
Error Codes
Because the default handlers deal with redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes, listing all the response codes used by RFC 2616 together with short messages.
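For example, you can look up the messages for a given code at the interactive prompt (a quick sketch; the exact message text may vary slightly between Python versions):

>>> import BaseHTTPServer
>>> BaseHTTPServer.BaseHTTPRequestHandler.responses[404]
('Not Found', 'Nothing matches the given URI')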
When an error is raised, the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response object for the page returned: as well as the code attribute, it also has read, geturl, and info methods.
>>> req = urllib2.Request('http://www.php.cn/fish.html')
>>> try:
...     urllib2.urlopen(req)
... except urllib2.HTTPError, e:
...     print e.code
...     print e.read()
...
404
Error 404: File Not Found
... etc...
Wrapping It Up
So if you want to be prepared for HTTPError or URLError, there are two basic approaches. I prefer the second.
The first one:
from urllib2 import Request, urlopen, URLError, HTTPError

req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print "The server couldn't fulfill the request."
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
    pass
Note: except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
The second one:
from urllib2 import Request, urlopen, URLError

req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print "The server couldn't fulfill the request."
        print 'Error code: ', e.code
else:
    # everything is fine
    pass
info and geturl
The response returned by urlopen (or an HTTPError instance) has two useful methods: info() and geturl().
geturl - this returns the real URL of the page fetched. It is useful because urlopen (or the opener object used) may have followed a redirect, so the URL of the page fetched may not be the same as the URL requested.
info - this returns a dictionary-like object describing the page fetched, in particular the headers sent by the server. It is currently an httplib.HTTPMessage instance.
Typical headers include "Content-length", "Content-type", and others. See the Quick Reference to HTTP Headers (http://www.cs.tut.fi/~jkorpela/http.html) for a useful listing of HTTP headers with brief explanations of their meaning and use.
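A minimal sketch of both methods (the URL here is just a placeholder):

import urllib2

response = urllib2.urlopen('http://www.python.org/')
print response.geturl()    # the URL actually fetched, after any redirects
print response.info()      # the headers, as an httplib.HTTPMessage
print response.info().getheader('Content-Type')    # look up a single header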
Openers and handlers
When you fetch a URL you use an opener (an instance of the perhaps confusingly-named urllib2.OpenerDirector). Normally we use the default opener, via urlopen, but you can create custom openers. Openers use handlers, and all the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular URL scheme, or how to handle some aspect of URL opening, such as HTTP redirections or HTTP cookies.
You will want to create openers if you want to fetch URLs with specific handlers installed, for example an opener that handles cookies, or an opener that does not handle redirections.
To create an opener, instantiate an OpenerDirector and then call .add_handler(some_handler_instance) repeatedly, as sketched below.
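A minimal sketch of the manual route, assuming plain HTTP handling is all you need:

import urllib2

# build an OpenerDirector by hand, registering handlers one by one
opener = urllib2.OpenerDirector()
opener.add_handler(urllib2.HTTPHandler())
opener.add_handler(urllib2.HTTPDefaultErrorHandler())
opener.add_handler(urllib2.HTTPErrorProcessor())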
Alternatively, you can use build_opener, which is a convenience function for creating opener objects with a single function call. build_opener adds several handlers by default, but provides a quick way to add more and/or override the default handlers.
Other kinds of handlers you might want can deal with proxies, authentication, and other common but slightly specialised situations.
install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.
Opener objects have an open method, which can be called directly to fetch URLs in the same way as the urlopen function: there is no need to call install_opener, except as a convenience.
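Putting these pieces together, here is a short sketch using build_opener with the standard HTTPCookieProcessor handler (the URL is a placeholder):

import cookielib
import urllib2

# build an opener that keeps cookies across requests
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# use the opener directly...
response = opener.open('http://www.example.com/')

# ...or install it, so that plain urllib2.urlopen uses it too
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.example.com/')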
Basic Authentication
To illustrate creating and installing a handler we will use HTTPBasicAuthHandler. For a more detailed discussion of this subject, including an explanation of how Basic Authentication works, see the Basic Authentication Tutorial (http://www.voidspace.org.uk/python/articles/authentication.shtml).
When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This header specifies an authentication scheme and a "realm", and looks like: WWW-Authenticate: SCHEME realm="REALM".
For example
WWW-Authenticate: Basic realm="cPanel Users"
The client should then retry the request with the appropriate username and password for the realm included as a header in the request. This is "basic authentication". To simplify this process, we can create an instance of HTTPBasicAuthHandler and an opener that uses this handler.
HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to usernames and passwords. If you know what the realm is (from the authentication header sent by the server), then you can use an HTTPPasswordMgr.
Frequently one doesn't care what the realm is. In that case, it is convenient to use HTTPPasswordMgrWithDefaultRealm. This allows you to specify a default username and password for a URL, which will be supplied unless you provide a different combination for a specific realm. We indicate this by providing None as the realm argument to the add_password method.
The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match.
# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# add the username and password
# if we knew the realm, we could use it instead of None
top_level_url = "http://php.cn/foo/"
password_mgr.add_password(None, top_level_url, username, password)

handler = urllib2.HTTPBasicAuthHandler(password_mgr)

# create an "opener" (OpenerDirector instance)
opener = urllib2.build_opener(handler)

# use the opener to fetch a URL
opener.open(a_url)

# install the opener
# now all calls to urllib2.urlopen use our opener
urllib2.install_opener(opener)
Note: in the above example we only supplied our HTTPBasicAuthHandler to build_opener. By default, openers have the handlers for normal situations: ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, and HTTPErrorProcessor.
top_level_url is in fact either a full URL (including the "http:" scheme component plus the hostname and optionally the port number), e.g. http://example.com/, or an "authority" (i.e. the hostname, optionally including the port number), e.g. "example.com" or "example.com:8080" (the latter example includes a port number). The authority, if present, must NOT contain the "userinfo" component: for example "joe@password:example.com" is not correct. Some of the acceptable forms are sketched below.
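For instance, each of these forms should be acceptable (a sketch reusing the password_mgr, username, and password names from the example above):

# a full URL, including the scheme
password_mgr.add_password(None, "http://example.com/", username, password)

# an authority: just the hostname...
password_mgr.add_password(None, "example.com", username, password)

# ...or the hostname plus a port number
password_mgr.add_password(None, "example.com:8080", username, password)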
Proxies
urllib2 will auto-detect your proxy settings and use them. This happens through the ProxyHandler, which is part of the normal handler chain. Normally that's a good thing, but there are occasions when it may not be helpful. One way to avoid it is to set up our own ProxyHandler, with no proxies defined. This is done using similar steps to setting up a Basic Authentication handler.
>>> proxy_support = urllib2.ProxyHandler({})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)
Note: currently urllib2 does not support fetching of https locations through a proxy. However, this can be enabled by extending urllib2.
Sockets and Layers
Python's support for fetching resources from the web is layered: urllib2 uses the httplib library, which in turn uses the socket library.
As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages; by default, the socket module has no timeout and can hang. Currently, the socket timeout is not exposed at the httplib or urllib2 levels. However, you can set the default timeout globally for all sockets, as sketched below.
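A short sketch (the URL is just a placeholder):

import socket
import urllib2

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.python.org/')
response = urllib2.urlopen(req)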