How to Use urllib2 to Obtain Network Resources in Python

urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function, which is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations, such as basic authentication, cookies, and proxies. These are provided through handler and opener objects.
urllib2 supports fetching URLs for many "schemes" (identified by the string before the ":" in the URL; for example, "ftp" is the scheme of "ftp://python.org/"), using their associated network protocols (such as FTP and HTTP). This tutorial focuses on HTTP, the most widely used one.
For straightforward applications, urlopen is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol (HTTP). The most authoritative HTTP reference is of course RFC 2616 (http://rfc.net/rfc2616.html). That is a technical document, and so not easy to read. This HOWTO aims to show how to use urllib2,
with enough detail about HTTP to help you understand it. It is not a reference for urllib2, but a companion to it.
Obtain URLs
The simplest use of urllib2 will be as follows:

Python code

import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()

Many uses of urllib2 will be this simple (remember that instead of "http:" the URL could also begin with "ftp:", "file:", and so on). However, this tutorial teaches the more complicated cases, concentrating on HTTP.
HTTP is based on a request-and-response mechanism: the client makes a request, and the server provides a response. urllib2 mirrors your HTTP request with a Request object. In its simplest form, you create a Request object from the URL you want to fetch. Calling urlopen and passing in the Request object returns a response object for the requested URL. This response object behaves like a file object, so you can, for example, call response.read() on it.

Python code

import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()

Remember that urllib2 uses the same interface to handle all URL schemes. For example, you can create an FTP request as follows.

Python code

req = urllib2.Request('ftp://example.com/')

In the case of HTTP, Request objects allow you to do two extra things. First, you can pass data to be sent to the server. Second, you can pass extra information ("metadata") about the data or about the request itself to the server; this information is sent as HTTP "headers".
Let's look at how to send each of these in turn.
Data
Sometimes you want to send data to a URL (often the URL refers to a CGI (Common Gateway Interface) script or another web application). With HTTP, this is often done using what is known as a POST request. This is often what your browser does when you submit an HTML form.
Not all POSTs come from forms: you can use POST to transmit arbitrary data to your own application. For ordinary HTML forms, the data needs to be encoded in a standard way and then passed to the Request object as the data argument. The encoding is done using a function from urllib, not from urllib2.

Python code

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()

Remember that other encodings are sometimes required (for example, for file upload from HTML forms -- see http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13 in the HTML specification for details on form submission).
If you do not pass the data argument, urllib2 uses a GET request. One way GET and POST requests differ is that POST requests often have "side effects": they change the state of the system in some way (such as placing an order for something to be delivered to your door). Although the HTTP standard makes it clear that POSTs are intended to always cause side effects and GET requests never should, nothing prevents a GET request from having side effects, or a POST request from having none. Data can also be encoded into the URL of a GET request itself.
See the following example.

Python code
 
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
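Whether a request goes out as GET or POST depends only on whether the Request carries a data argument, and this can be checked offline with Request.get_method(). The following is a minimal sketch of that behaviour; the URL is a placeholder and is never actually fetched, and the compatibility imports are there only so the snippet also runs on Python 3, where urllib2's features live in urllib.request.

```python
# Sketch: a Request with a data argument becomes POST, without one it is GET.
# The imports try Python 2 names first, then the Python 3 equivalents.
try:
    from urllib2 import Request              # Python 2
    from urllib import urlencode
except ImportError:
    from urllib.request import Request       # Python 3
    from urllib.parse import urlencode

url = 'http://www.example.com/example.cgi'   # placeholder URL, never fetched
values = {'language': 'Python'}
encoded = urlencode(values)                  # 'language=Python'

get_req = Request(url + '?' + encoded)       # data in the URL -> GET
post_req = Request(url, encoded.encode('ascii'))  # data argument -> POST

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```

This makes the GET/POST distinction above concrete without any network access: nothing about the URL itself decides the method, only the presence of data.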

Headers
We will discuss one specific HTTP header here, to illustrate how to add headers to your HTTP request.
Some websites dislike being browsed by programs (as opposed to humans), or send different versions of their content to different browsers. By default, urllib2 identifies itself as "Python-urllib/x.y" (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5),
and this identity may confuse the site or simply not work. The way a browser identifies itself is through the User-Agent header. When you create a Request object, you can pass it a dictionary of headers. The following example sends the same request as above,
but identifies itself as a version of Internet Explorer.

Python code

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()

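A Request object can be inspected before anything is sent, which is a handy way to check what headers will go out. The sketch below assumes only standard-library behaviour and a placeholder URL that is never fetched; note that Request normalizes header names with str.capitalize(), so 'User-Agent' is stored as 'User-agent'.

```python
# Sketch: inspecting the headers a Request will send, entirely offline.
try:
    from urllib2 import Request              # Python 2
except ImportError:
    from urllib.request import Request       # Python 3

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
req = Request('http://www.example.com/',     # placeholder URL, never fetched
              headers={'User-Agent': user_agent})

# Header names are stored capitalized: 'User-Agent' -> 'User-agent'.
print(req.has_header('User-agent'))   # True
print(req.get_header('User-agent'))   # the value we set above
```

Inspecting the object like this is useful when debugging why a site rejects your request: you can confirm exactly which identity string will be sent.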
The response object also has two useful methods, info and geturl, which we will look at in a later section, after we see what happens when things go wrong.
Handling Exceptions
urlopen raises URLError when it cannot handle a response (though, as usual with Python APIs, built-in exceptions such as ValueError and TypeError may also be raised).
HTTPError is the subclass of URLError that is raised in the specific case of HTTP URLs.
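The subclass relationship can be verified directly, with no network involved. A minimal sketch, using compatibility imports only because these exception classes moved to urllib.error in Python 3:

```python
# Sketch: HTTPError really is a subclass of URLError, which is why the order
# of except clauses matters when catching both.
try:
    from urllib2 import URLError, HTTPError      # Python 2
except ImportError:
    from urllib.error import URLError, HTTPError  # Python 3

print(issubclass(HTTPError, URLError))  # True
```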
URLError
Often, URLError is raised because there is no network connection (no route to the specified server), or the specified server does not exist. In this case, the exception raised has a 'reason' attribute, which is a tuple containing an error code and a text error message.
For example

Python code

>>> req = urllib2.Request('http://www.pretend_server.org')
>>> try: urllib2.urlopen(req)
>>> except urllib2.URLError, e:
>>>     print e.reason
>>>
(4, 'getaddrinfo failed')

HTTPError
Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server was unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a "redirection" asking the client to fetch the document from a different URL,
urllib2 will handle that for you). For those it cannot handle, urlopen raises an HTTPError. Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).
See section 10 of RFC 2616 for a reference on all the HTTP error codes.
The HTTPError instance raised has an integer 'code' attribute, which corresponds to the error code sent by the server.
Error Codes
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that lists all the response codes used by RFC 2616.
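The standard library also exposes the same code-to-reason-phrase mapping directly, which is convenient for looking up a status code without a server. A minimal sketch; in Python 2 it is httplib.responses, and in Python 3 the equivalent is http.client.responses.

```python
# Sketch: looking up RFC 2616 reason phrases from the standard library.
try:
    from httplib import responses            # Python 2
except ImportError:
    from http.client import responses        # Python 3

print(responses[200])  # OK
print(responses[301])  # Moved Permanently
print(responses[404])  # Not Found
```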
When an error is raised, the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response object for the page returned. This means that, as well as the code attribute, it also has read, geturl, and info methods.

Python code

>>> req = urllib2.Request('http://www.python.org/fish.html')
>>> try:
>>>     urllib2.urlopen(req)
>>> except urllib2.HTTPError, e:
>>>     print e.code
>>>     print e.read()
>>>
404
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd">
<?xml-stylesheet href="./css/ht2html.css"
 type="text/css"?>
<html> ... etc...
Wrapping It Up
So if you want to be prepared for HTTPError or URLError, there are two basic approaches. I prefer the second one.
First:

Python code

from urllib2 import Request, urlopen, URLError, HTTPError

req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine

Note: except HTTPError must come first, otherwise except URLError will also catch the HTTPError.
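The clause-ordering rule can be demonstrated without touching the network by raising an HTTPError by hand. This is only an illustration sketch: the URL is a placeholder, the helper function is hypothetical, and the HTTPError is constructed manually rather than coming from a real server.

```python
# Sketch: why 'except HTTPError' must come before 'except URLError'.
# Because HTTPError subclasses URLError, swapping the clauses would make
# the URLError branch swallow every HTTPError.
try:
    from urllib2 import URLError, HTTPError      # Python 2
except ImportError:
    from urllib.error import URLError, HTTPError  # Python 3

def classify(exc):
    """Hypothetical helper: report which clause catches the exception."""
    try:
        raise exc
    except HTTPError:        # must come first
        return 'http error'
    except URLError:
        return 'url error'

http_exc = HTTPError('http://www.example.com/', 404, 'Not Found', None, None)
url_exc = URLError('no route to host')

print(classify(http_exc))  # http error
print(classify(url_exc))   # url error
```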
Second:

Python code

from urllib2 import Request, urlopen, URLError

req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine

info and geturl
The response object returned by urlopen (or the HTTPError instance) has two useful methods: info() and geturl().
geturl -- returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect, so the URL of the page fetched may differ from the URL requested.
info -- returns a dictionary-like object describing the page fetched, in particular the headers sent by the server. It is currently an httplib.HTTPMessage instance.
Typical headers include "Content-length", "Content-type", and so on. See the Quick Reference to HTTP Headers (http://www.cs.tut.fi/~jkorpela/http.html)
for a useful listing of HTTP headers with brief explanations of their meaning and use.
Openers and handlers
When you fetch a URL you use an opener (an instance of the perhaps confusingly named urllib2.OpenerDirector). Normally we
use the default opener, via urlopen, but you can create custom openers. Openers use handlers; all the "heavy lifting" is done by the handlers. Each handler knows
how to open URLs for a particular protocol scheme, or how to handle some aspect of URL opening, such as HTTP redirects or HTTP cookies.
You will want to create openers if you want to fetch URLs with specific handlers installed, for example an opener that handles cookies, or one that does not handle redirections.
To create an opener, instantiate an OpenerDirector and then call .add_handler(some_handler_instance) repeatedly.
Alternatively, you can use build_opener, a convenience function for creating opener objects with a single call.
build_opener adds several handlers by default, but provides a quick way to add more and/or override the default handlers.
Other kinds of handlers you might want can handle proxies, authentication, and other common but slightly specialized situations.
install_opener can be used to make an opener the (global) default opener. This means that calls to urlopen will use the opener you have installed.
Opener objects have an open method, which can be called directly to fetch URLs just like the urlopen function; there is no need to call install_opener, except as a convenience.
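The construction steps above can be sketched without making any request. This is a minimal sketch using only handler classes that exist in the standard library; the compatibility import is there because urllib2 became urllib.request in Python 3, and the install_opener call is shown commented out because it changes global state.

```python
# Sketch: building a custom opener with build_opener. No URL is fetched;
# we only show construction and confirm the result is an OpenerDirector.
try:
    import urllib2 as request_mod            # Python 2
except ImportError:
    import urllib.request as request_mod     # Python 3

# build_opener installs the usual default handlers, then adds ours.
opener = request_mod.build_opener(request_mod.HTTPHandler)

print(isinstance(opener, request_mod.OpenerDirector))  # True

# To make it the global default used by urlopen, you would call:
# request_mod.install_opener(opener)
# Or use it directly, like urlopen:  opener.open(some_url)
```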
For more, see http://blog.csdn.net/b2b160/archive/2009/03/27/4030702.aspx
