Python Crawler Learning Tutorial: "HOWTO-URLLIB2"

Source: Internet
Author: User
Tags: urlencode

Part One: Introduction to the urllib2 library

urllib2 is a Python component for fetching URLs (Uniform Resource Locators). It offers a very simple interface in the form of the urlopen function.

It can fetch URLs using many different protocols, and it also offers a somewhat more complex interface for handling common situations such as basic authentication, cookies, and proxies.

These are provided through handler and opener objects.

Part Two: Using the urllib2 library

1. The simplest case: open a URL directly with urlopen to get the HTML

response = urllib2.urlopen('http://www.cnblogs.com/xin-xin/p/4297852.html')
html = response.read()

2. Use Request to wrap the request

HTTP is based on a request-and-response mechanism: the client makes a request and the server provides a response. urllib2 maps the HTTP request you make onto a Request object. In its simplest form, you create a Request object from the address you want to fetch; calling urlopen with this Request object returns a response object for the requested URL. The response behaves like a file object, so you can call read() on it.

request = urllib2.Request('http://www.cnblogs.com/xin-xin/p/4297852.html')
response = urllib2.urlopen(request)
html = response.read()

Remember that urllib2 uses the same interface to handle all URL schemes. For example, you can create an FTP request as follows.

req = urllib2.Request('ftp://example.com/')

There are two extra things you can do with an HTTP request. First, you can send form data; second, you can send extra information ("metadata") about the data or about the request itself to the server. This information is sent as HTTP "headers".

3. POST request: wrap the data with the request

url = 'http://www.cnblogs.com/xin-xin/p/4297852.html'
values = {'name': 'Michael Foord',
          'location': 'Pythontab',
          'language': 'Python'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
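As a side note, in Python 3 urllib2 was folded into urllib.request and urllib.urlencode moved to urllib.parse; there, the form data passed to a Request must be bytes. The sketch below (reusing the example page URL from above, without actually sending anything) shows the equivalent setup:

```python
import urllib.parse
import urllib.request

url = 'http://www.cnblogs.com/xin-xin/p/4297852.html'
values = {'name': 'Michael Foord',
          'location': 'Pythontab',
          'language': 'Python'}

# In Python 3 the encoded form data must be bytes, not str
data = urllib.parse.urlencode(values).encode('utf-8')

# Building the Request does not contact the server yet;
# a Request carrying a body defaults to the POST method
req = urllib.request.Request(url, data)
print(req.get_method())  # POST
```

Because the server is never contacted, this is a safe way to check how the request would be sent before running it for real.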

4. GET request: encode the data and put it in the URL

data = {}
data['name'] = 'Somebody Here'
data['location'] = 'Pythontab'
data['language'] = 'Python'
url_values = urllib.urlencode(data)
print url_values
url = 'http://www.baidu.com'
full_url = url + '?' + url_values
response = urllib2.urlopen(full_url)

About urlencode and urldecode:

http://blog.csdn.net/wuwenjunwwj/article/details/39522791
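To make the encode/decode round trip concrete, here is a small sketch using the Python 3 names (urllib.urlencode became urllib.parse.urlencode; the reverse direction is handled by functions such as urllib.parse.parse_qs):

```python
import urllib.parse

params = {'name': 'Michael Foord', 'language': 'Python'}

# urlencode: dict -> "application/x-www-form-urlencoded" string;
# spaces become '+' signs
encoded = urllib.parse.urlencode(params)
print(encoded)  # name=Michael+Foord&language=Python

# the reverse direction: parse_qs gives back a dict of lists
# (lists, because a query string may repeat a key)
decoded = urllib.parse.parse_qs(encoded)
print(decoded['name'])  # ['Michael Foord']
```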

5. Headers

We will discuss one specific HTTP header here to illustrate how to add headers to your HTTP request. Some sites dislike being visited by programs (non-human access), or send different content to different browsers. By default urllib2 identifies itself as "Python-urllib/x.y" (where x and y are the Python major and minor version numbers, e.g. Python-urllib/2.5), which can confuse the site, or simply not work. A browser identifies itself via the User-Agent header; when you create a Request object you can pass it a dictionary of header data. The following example sends the same request as above, but identifies itself as a version of Internet Explorer.

The three elements of a crawler: URL, data, and headers

url = 'http://www.cnblogs.com/xin-xin/p/4297852.html'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
values = {'name': 'Michael Foord',
          'location': 'Pythontab',
          'language': 'Python'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
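The header mechanics can be checked without touching the network. In Python 3 terms (urllib.request instead of urllib2), constructing the Request already records the header; the URL below is just the example page used throughout this tutorial:

```python
import urllib.parse
import urllib.request

url = 'http://www.cnblogs.com/xin-xin/p/4297852.html'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
values = {'name': 'Michael Foord', 'language': 'Python'}
data = urllib.parse.urlencode(values).encode('utf-8')

# nothing is sent yet; the Request object just stores
# the URL, the body, and the headers
req = urllib.request.Request(url, data, headers)

# note that urllib normalises stored header names to
# "Capitalised-lowercase" form, hence 'User-agent' here
print(req.get_header('User-agent'))
```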

6. Exception Handling

req = urllib2.Request(someurl)
try:
    response = urllib2.urlopen(req)
except urllib2.URLError, e:
    print e.reason

Handling Exceptions

When urlopen cannot handle a response, it raises URLError (though the usual Python exceptions, such as ValueError and TypeError, can of course also be raised). HTTPError is a subclass of URLError, raised in the specific case of HTTP URLs.

URLError: typically, URLError is raised when there is no network connection (no route to the particular server), or the server does not exist. In this case the exception has a "reason" attribute, a tuple containing an error number and an error message.

HTTPError: every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server cannot fulfil the request. The default handlers will deal with some of these responses for you (for example, if the response is a "redirect", the client is asked to fetch the document from a different address).

Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).
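The subclass relationship described above can be verified directly and offline; in Python 3 the two exceptions live in urllib.error. A sketch:

```python
import urllib.error

# HTTPError is a subclass of URLError, which is why an
# 'except URLError' clause placed first would also swallow
# HTTP errors such as 404 or 403
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True

# an HTTPError instance carries the numeric status code
# that the paragraphs above describe
err = urllib.error.HTTPError('http://example.com/', 404,
                             'Not Found', hdrs=None, fp=None)
print(err.code)  # 404
```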

Wrapping It Up

So if you want to be prepared for HTTPError or URLError, there are two basic approaches. I prefer the second one.

The first one:

from urllib2 import Request, urlopen, URLError, HTTPError

req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print "The server couldn't fulfill the request."
    print "Error code:", e.code
except URLError, e:
    print "We failed to reach a server."
    print "Reason:", e.reason
else:
    # everything is fine
    pass


Note: the except HTTPError clause must come first, otherwise except URLError will also catch HTTPError.

The second one:

from urllib2 import Request, urlopen, URLError

req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print "We failed to reach a server."
        print "Reason:", e.reason
    elif hasattr(e, 'code'):
        print "The server couldn't fulfill the request."
        print "Error code:", e.code
else:
    # everything is fine
    pass

7. info and geturl

The response object returned by urlopen (or an HTTPError instance) has two useful methods: info() and geturl().

geturl: returns the real URL that was fetched. This is useful because urlopen (or the opener object used) may have followed a redirect, so the URL you end up with may differ from the URL you requested.

info: returns a dictionary-like object describing the page that was fetched, typically the specific headers sent by the server. It is currently an httplib.HTTPMessage instance.
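Both methods can be exercised without a network connection, because urllib.request (the Python 3 successor to urllib2) handles data: URLs with its default handlers. A sketch, not a real web fetch:

```python
import urllib.request

# a data: URL embeds the document body in the URL itself,
# so urlopen never touches the network
response = urllib.request.urlopen('data:text/plain,Hello')

body = response.read()
print(body)              # b'Hello'
print(response.geturl()) # data:text/plain,Hello
print(response.info().get_content_type())  # text/plain
```

The response supports the same read()/info()/geturl() interface it would for an HTTP fetch, which makes this a convenient way to experiment.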

8. Openers and Handlers (not read carefully yet)

So far we have fetched URLs through urlopen, but you can also create custom openers. Openers use handlers, and all the "heavy lifting" is done by the handlers. Each handler knows how to open URLs using a particular protocol, or how to handle some aspect of opening URLs, such as HTTP redirects or HTTP cookies.

If you want to fetch URLs with specific handlers installed, you will create an opener, for example an opener that handles cookies, or one that does not follow redirects. To create an opener, instantiate an OpenerDirector and then call .add_handler(some_handler_instance) repeatedly. Alternatively, you can use build_opener, a convenience function for creating opener objects with a single call. build_opener adds several handlers by default, but provides a quick way to add more or override the defaults. Other handlers you might want can deal with proxies, authentication, and other common but slightly specialised situations.

install_opener can be used to make an opener the (global) default. That means calls to urlopen will use the opener you installed. Opener objects have an open method that can be used directly to fetch URLs just like the urlopen function, so there is no need to call install_opener except for convenience.
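As a sketch of the opener machinery described above (using the Python 3 names, where urllib2.build_opener became urllib.request.build_opener), an opener that carries a cookie jar can be assembled without opening anything:

```python
import http.cookiejar
import urllib.request

# build_opener installs the default handlers plus our extra one
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar))

# the result is an OpenerDirector, the same kind of object
# that urlopen uses internally
print(isinstance(opener, urllib.request.OpenerDirector))  # True

# install_opener makes it the global default, so plain
# urlopen calls will now go through it
urllib.request.install_opener(opener)
```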

9. Basic authentication (not read carefully yet)

HTTPBasicAuthHandler uses a password-manager object to map URLs and realms to user names and passwords. If you know what the realm is (from the authentication header sent by the server), you can use an HTTPPasswordMgr.

Often people don't care what the realm is. In that case it is convenient to use HTTPPasswordMgrWithDefaultRealm, which lets you specify a default user name and password for a URL, to be used whenever you have not provided a different combination for a particular realm. We indicate this by passing None as the realm argument to add_password.

The top-level URL is the first URL that requires authentication; URLs "deeper" than the URL you pass to .add_password() will also match.
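This mapping can also be demonstrated offline. In Python 3 terms (HTTPPasswordMgrWithDefaultRealm and HTTPBasicAuthHandler live in urllib.request), with a made-up host and credentials:

```python
import urllib.request

# a password manager that ignores the realm, as described above
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# None as the realm means "use this pair for any realm";
# the host, user name and password here are made-up examples
top_level_url = 'http://example.com/foo/'
password_mgr.add_password(None, top_level_url, 'alice', 's3cret')

handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(handler)

# URLs deeper than the one we registered also match
print(password_mgr.find_user_password('any realm',
                                      'http://example.com/foo/bar/'))
```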

10. Sockets and Layers

Python's support for fetching network resources is layered. urllib2 uses the httplib library, which in turn uses the socket library.

Since Python 2.3 you can specify how long a socket should wait for a response before timing out. This is useful for applications that have to fetch web pages; by default the socket module has no timeout and can hang. Currently, the socket timeout is not exposed at the httplib or urllib2 levels, but you can set a global default timeout for all sockets:

import socket
import urllib2

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)

