1. urllib2 Introduction
urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function, and it can fetch URLs using many different protocols.
It also offers a slightly more complex interface for handling common situations, such as basic authentication, cookies, proxies, and so on.
2. Fetching URLs
The simplest way to use urllib2 is as follows:
import urllib2

response = urllib2.urlopen('http://python.org/')
html = response.read()
print html
The output is the content of the fetched page.
urllib2 can fetch URLs in many formats: 'http:' can be replaced by 'ftp:', 'file:', and so on. HTTP is based on a request/response model, and urllib2 uses a Request object to represent an HTTP request. In the simplest form, you create a Request object that specifies the URL you want to fetch; calling urlopen with this Request object returns a response object for the requested URL.
The response is a file-like object, which means you can call .read() on it:
import urllib2

req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page
urllib2 can handle a variety of URL schemes; for example, you can make an FTP request:
req = urllib2.Request('ftp://example.com/')
3. Data
Sometimes you want to send data to a URL (often the URL refers to a CGI (Common Gateway Interface) script or some other web application).
With HTTP, this is typically done with a POST request: when you submit an HTML form that you have filled in, the browser usually sends it as a POST.
Not all POSTs come from forms: you can use POST to send arbitrary data to your own application.
In the common case of HTML forms, the data needs to be encoded in a standard way and then passed to the Request object as the data argument. The encoding is done with a function from the urllib library, not from urllib2.
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
If you do not pass the data argument, urllib2 uses a GET request. GET and POST requests differ in that POST requests often have "side effects": they change the state of the system in some way.
Although the HTTP standard makes it clear that POSTs may cause side effects while GET requests never should, data can also be passed in an HTTP GET request by encoding it into the URL itself.
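You can check which verb a Request will use by calling its get_method() method. A minimal sketch, written here with Python 3's urllib.request so it is easy to try (in Python 2, urllib2.Request has the same get_method() call):

```python
import urllib.request

# A Request without a data argument is sent as GET;
# supplying data switches it to POST.
get_req = urllib.request.Request('http://www.example.com/')
post_req = urllib.request.Request('http://www.example.com/',
                                  data=b'name=Michael')

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```

No network traffic happens here: get_method() only inspects how the request would be sent.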
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values  # The order may differ.
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
The complete URL is formed by appending a '?' to the URL, followed by the encoded values.
4. Headers
Here we discuss one specific HTTP header, to illustrate how to add headers to your HTTP request. Some websites dislike being browsed by programs, or send different versions of a page to different browsers.
By default, urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor Python version numbers, e.g. Python-urllib/2.5), which may confuse the site, or simply not work.
A browser identifies itself through the User-Agent header; when you create a Request object, you can pass in a dictionary containing headers.
The following example makes the same request as above, but identifies itself as a regular browser.
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
5. URLError
When urlopen cannot handle a response, it raises URLError (the usual Python exceptions such as ValueError, TypeError, etc. may also be raised).
HTTPError is the subclass of URLError that is raised for a specific HTTP URL.
Typically, URLError is raised because there is no network connection (no route to the specified server), or the specified server does not exist. In this case, the raised exception has a "reason" attribute, which is a tuple containing an error code and a text error message.
import urllib2

req = urllib2.Request('http://www.pretend_server.org')
try:
    urllib2.urlopen(req)
except urllib2.URLError as e:
    print e.reason
The output is:
[Errno -2] Name or service not known
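As noted above, HTTPError is a subclass of URLError, so an except clause for URLError alone will also catch HTTP errors. The relationship can be verified directly; a small check against Python 3's urllib.error (in Python 2, both classes live in the urllib2 module itself):

```python
import urllib.error

# HTTPError subclasses URLError, so catch HTTPError first
# whenever you need to tell the two cases apart.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
```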
6. HTTPError
Every HTTP response from the server carries a numeric "status code".
Sometimes the status code indicates that the server was unable to fulfil the request. The default handlers deal with some of these responses for you (for example, if the response is a "redirect" asking the client to fetch the document from a different URL, urllib2 handles it itself).
For those it cannot handle, urlopen raises an HTTPError. Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).
Here is the full table of error codes:
# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
    100: ('Continue', 'Request received, please continue'),
    101: ('Switching Protocols',
          'Switching to new protocol; obey Upgrade header'),

    200: ('OK', 'Request fulfilled, document follows'),
    201: ('Created', 'Document created, URL follows'),
    202: ('Accepted', 'Request accepted, processing continues off-line'),
    203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
    204: ('No Content', 'Request fulfilled, nothing follows'),
    205: ('Reset Content', 'Clear input form for further input.'),
    206: ('Partial Content', 'Partial content follows.'),

    300: ('Multiple Choices', 'Object has several resources -- see URI list'),
    301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
    302: ('Found', 'Object moved temporarily -- see URI list'),
    303: ('See Other', 'Object moved -- see Method and URL list'),
    304: ('Not Modified', 'Document has not changed since given time'),
    305: ('Use Proxy',
          'You must use proxy specified in Location to access this resource.'),
    307: ('Temporary Redirect', 'Object moved temporarily -- see URI list'),

    400: ('Bad Request', 'Bad request syntax or unsupported method'),
    401: ('Unauthorized', 'No permission -- see authorization schemes'),
    402: ('Payment Required', 'No payment -- see charging schemes'),
    403: ('Forbidden', 'Request forbidden -- authorization will not help'),
    404: ('Not Found', 'Nothing matches the given URI'),
    405: ('Method Not Allowed', 'Specified method is invalid for this server.'),
    406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required',
          'You must authenticate with this proxy before proceeding.'),
    408: ('Request Timeout', 'Request timed out; try again later.'),
    409: ('Conflict', 'Request conflict.'),
    410: ('Gone', 'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable', 'Cannot satisfy request range.'),
    417: ('Expectation Failed', 'Expect condition could not be satisfied.'),

    500: ('Internal Server Error', 'Server got itself in trouble'),
    501: ('Not Implemented', 'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service Unavailable',
          'The server cannot process the request due to a high load'),
    504: ('Gateway Timeout',
          'The gateway server did not receive a timely response'),
    505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
}
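You do not have to maintain this table yourself: in Python 2 it ships as BaseHTTPServer.BaseHTTPRequestHandler.responses, and in Python 3 a similar code-to-message mapping is available as http.client.responses:

```python
import http.client

# Standard-library mapping of status codes to their short messages
# (the long explanatory messages from the table above are not included).
print(http.client.responses[404])  # Not Found
print(http.client.responses[503])  # Service Unavailable
```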
When an error is raised, the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response object for the page returned. This means that, as well as the code attribute, it also has the read, geturl, and info methods.
import urllib2

req = urllib2.Request('http://www.python.org/fish.html')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError as e:
    print e.code
    print e.read()
Running it prints:
404
<!doctype html>
...
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">
<meta name="application-name" content="python.org">
...
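Because HTTPError is a subclass of URLError, a program that wants to handle both cases must catch HTTPError first. A sketch of the combined handler, written with Python 3's urllib.request and urllib.error (in Python 2, both names live in the urllib2 module), reusing the pretend server URL from above:

```python
import urllib.request
import urllib.error

req = urllib.request.Request('http://www.pretend_server.org')
try:
    urllib.request.urlopen(req, timeout=5)
except urllib.error.HTTPError as e:   # must come first: subclass of URLError
    print("The server couldn't fulfill the request.")
    print('Error code:', e.code)
except urllib.error.URLError as e:    # e.g. DNS failure, connection refused
    print('We failed to reach a server.')
    print('Reason:', e.reason)
```

Reversing the two except clauses would make the URLError branch swallow every HTTP error as well, so the order matters.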
Python Crawler Learning notes (i)