Writing Your First Crawler with Zero Python Basics

Source: Internet
Author: User
Tags: http cookie, http redirect

Note first that all of the code below was tested under Python 2.7.

I. The simplest crawler: downloading a web page

import urllib2
request = urllib2.Request("http://www.baidu.com")
response = urllib2.urlopen(request)
print response.read()
See how easy that is?
urllib2 is a standard library module in Python; the code above fetches a specific page and returns its content. The urllib2 library deserves closer study, since it is the foundation for everything that follows.

1. urllib2 is Python's component for fetching URLs (Uniform Resource Locators). It offers a very simple interface in the form of the urlopen function, which can fetch URLs over different protocols. It also provides a more complex interface for handling common situations, such as basic authentication, cookies, and proxies; these are exposed through handler and opener objects.

2. urllib2.urlopen(url[, data][, timeout])
Parameters
    url: the URL to open; may be a string (like the url parameter of urllib.urlopen) or a Request object (this is what makes it special)
    timeout: the timeout in seconds, as an integer. If the server has not responded within the given time (for example, due to network conditions), an exception is raised rather than waiting indefinitely. Works for HTTP, HTTPS, FTP, and FTPS.
Return value
    A file-like object, the same as the return value of urllib.urlopen; geturl() and info() can also be called on it.

3. The urllib2.Request class; the usual approach is to call its constructor to get a Request object: class urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
Represents a URL request.
Parameters
    url: a URL string
    data: additional data to send to the server; only HTTP requests use this parameter. When data is not None, the request is a POST rather than a GET. It should be a string, typically produced by passing a dict or a sequence of tuples to urllib.urlencode().
    headers: the request headers, as a dict. Headers can also be added after the Request object has been created by calling add_header(key, val) on it. A common trick is to add a User-Agent header that impersonates a browser, to placate servers that refuse requests from programs.
The last two parameters are rarely used and are not covered here.
Return value
    A Request object
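As a sketch of the data and headers parameters working together (the login URL below is hypothetical, and the try/except import lets the snippet also run on Python 3, where these names moved to urllib.request and urllib.parse; no request is actually sent):

```python
try:                                    # Python 2 module names
    import urllib2 as urlrequest
    from urllib import urlencode
except ImportError:                     # Python 3 moved both
    import urllib.request as urlrequest
    from urllib.parse import urlencode

# Encode the form fields; a sequence of tuples keeps the order deterministic
data = urlencode([('user', 'alice'), ('lang', 'python')])

# Because data is not None, this Request would be sent as a POST
req = urlrequest.Request('http://www.example.com/login',
                         data=data.encode('ascii'),  # body must be bytes on Python 3
                         headers={'User-Agent': 'Mozilla/5.0'})
# Headers can also be added after construction; keys are stored capitalized
req.add_header('Accept-Language', 'en')

print(data)                          # user=alice&lang=python
print(req.get_header('User-agent'))  # Mozilla/5.0
```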

4. The urllib2.OpenerDirector class
Fetching a URL requires an opener (an OpenerDirector). Normally we use the default opener, which is what urlopen uses, but you can also create custom openers. An opener uses handlers to do its work; all the heavy lifting is delegated to them. Each handler knows how to open URLs for a particular protocol, or how to handle some aspect of opening a URL, such as an HTTP redirect or an HTTP cookie.

5. urllib2.build_opener([handler, ...])
Creates an OpenerDirector object that can chain multiple handlers.
Parameters
    handler, ...: urllib2 provides many handlers for different kinds of requests; the common HTTPHandler and FTPHandler are self-explanatory. Worth highlighting is HTTPCookieProcessor, which handles cookies; cookies are essential in many requests that require authentication. Cookie handling in Python is done by the cookielib module, and this handler simply invokes its methods, adding cookies to outgoing requests and parsing cookies out of responses.
Return value
    An OpenerDirector object

6. urllib2.install_opener(opener)
install_opener sets a global opener object, meaning that subsequent calls to urlopen will use the opener you just installed.
Parameters
    opener: an OpenerDirector object
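A minimal sketch of installing a global opener (the try/except import lets it also run on Python 3, where the module is urllib.request; no request is made):

```python
try:
    import urllib2 as urlrequest         # Python 2
except ImportError:
    import urllib.request as urlrequest  # Python 3

# Build an ordinary opener and make it the process-wide default,
# so every later urlopen() call routes through it
opener = urlrequest.build_opener()
urlrequest.install_opener(opener)
```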

7. The urllib2.HTTPCookieProcessor class; its constructor is typically used to get an object that is a handler: class urllib2.HTTPCookieProcessor([cookiejar])
Parameters
    cookiejar: a cookielib.CookieJar object, obtained via the constructor cookielib.CookieJar()
Return value
    An HTTPCookieProcessor object, which is a handler
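Putting the pieces together, here is a sketch of a cookie-aware opener (the try/except import lets it also run on Python 3, where urllib2 became urllib.request and cookielib became http.cookiejar; no request is made):

```python
try:                                     # Python 2 module names
    import urllib2 as urlrequest
    import cookielib
except ImportError:                      # Python 3 renamed both modules
    import urllib.request as urlrequest
    import http.cookiejar as cookielib

jar = cookielib.CookieJar()
# The handler saves cookies from responses into jar and replays
# them on subsequent requests made through this opener
opener = urlrequest.build_opener(urlrequest.HTTPCookieProcessor(jar))

# No request has been made yet, so the jar starts out empty
print(len(jar))  # 0
```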

II. Catching download exceptions

import urllib2

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html

download('http://httpstat.us/500')

Here we import the urllib2 standard library and define a download function that uses try/except syntax to handle exceptions.

III. Retrying the download after server errors

import urllib2

def download(url, num_retries=2):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html

Errors encountered during a download are often temporary, such as the 503 Service Unavailable error returned when a server is overloaded.

Generally speaking, a 4xx error means there was a problem with the request, while a 5xx error means there was a problem on the server. So we only need to retry the download when a 5xx error occurs.
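The retry logic can be exercised without a network connection by injecting a fake fetcher. In this sketch, download_with_retry and FlakyServer are hypothetical names, and the fetch callable stands in for urllib2.urlopen(url).read():

```python
def download_with_retry(url, fetch, num_retries=2):
    # fetch is a hypothetical stand-in for urllib2.urlopen(url).read(),
    # injected so the retry logic can run without a network connection
    try:
        return fetch(url)
    except Exception as e:
        # Retry only server-side (5xx) failures, a bounded number of times
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download_with_retry(url, fetch, num_retries - 1)
        return None

class FlakyServer(object):
    # A fake server that fails with 503 for the first `failures` calls
    def __init__(self, failures):
        self.failures = failures
    def __call__(self, url):
        if self.failures > 0:
            self.failures -= 1
            error = Exception('503 Service Unavailable')
            error.code = 503
            raise error
        return '<html>ok</html>'

# Two failures with two retries allowed: the third attempt succeeds
print(download_with_retry('http://example.com', FlakyServer(2)))  # <html>ok</html>
```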

1xx-Informational
These status codes represent a temporary response. The client should be prepared to receive one or more 1xx responses before receiving a regular response.
100-Continue.
101-Switch protocol.
2xx-success
This type of status code indicates that the server successfully accepted the client request.
200-OK. The client request was successful.
201-created.
202-accepted.
203-Non-authoritative information.
204-no content.
205-Reset the content.
206-Partial content.
3xx-redirection
The client browser must take more action to implement the request. For example, the browser might have to request a different page on the server, or repeat the request through a proxy server.
301-The object has been permanently moved, that is, permanent redirection.
302-The object has been temporarily moved.
304-not modified.
307-Temporary redirection.
4xx-Client Error
An error occurred that appears to be the client's fault: for example, the client requested a page that does not exist, or did not provide valid authentication information.
400-Bad request.
401-Access denied. IIS defines a number of more specific 401 errors that indicate the exact cause. These specific error codes are displayed in the browser but do not appear in the IIS log:
401.1-Login failed.
401.2-server configuration caused logon failure.
401.3-not authorized due to ACL restrictions on resources.
401.4-Filter Authorization failed.
401.5-ISAPI/CGI application authorization failed.
401.7-Access denied by a URL authorization policy on the Web server. This error code is specific to IIS 6.0.
403-Forbidden: IIS defines a number of different 403 errors that indicate a more specific cause of the error:
403.1-execution access is forbidden.
403.2-Read access is forbidden.
403.3-Write access is forbidden.
403.4-Requires SSL.
403.5-Requires 128-bit SSL.
403.6-IP address rejected.
403.7-Requires a client certificate.
403.8-site access is denied.
403.9-Excessive number of users.
403.10-Invalid configuration.
403.11-Password change.
403.12-Deny access to the mapping table.
403.13-The client certificate is revoked.
403.14-Reject directory list.
403.15-Client access permission exceeded.
403.16-Client certificate is not trusted or invalid.
403.17-The client certificate has expired or is not yet valid.
403.18-The requested URL cannot be executed in the current application pool. This error code is specific to IIS 6.0.
403.19-CGI cannot be executed for clients in this application pool. This error code is specific to IIS 6.0.
403.20-Passport login failed. This error code is specific to IIS 6.0.
404-not found.
404.0-(None) No file or directory found.
404.1-Web site not accessible on the requested port.
404.2-Web service extension lockdown policy blocks this request.
404.3-MIME map policy blocks this request.
405-The HTTP verb used to access this page is not allowed (method not allowed)
406-The client browser does not accept the MIME type of the requested page.
407-proxy authentication is required.
412-Precondition failed.
413-Request entity too large.
414-Request URI too long.
415-Unsupported media type.
416-Requested range not satisfiable.
417-Expectation failed.
423-Locked.
5xx-Server Error
The server could not complete the request because it encountered an error.
500-Internal server error.
500.12-Application is busy restarting on the Web server.
500.13-Web server is too busy.
500.15-Direct requests for Global.asa are not allowed.
500.16-UNC authorization credentials are incorrect. This error code is specific to IIS 6.0.
500.18-URL authorization store cannot be opened. This error code is specific to IIS 6.0.
500.100-Internal ASP error.
501-Header values specify a configuration that is not implemented.
502-The Web server received an invalid response while acting as a gateway or proxy.
502.1-CGI application timeout.
502.2-Error in CGI application.
503-Service unavailable. This error code is specific to IIS 6.0.
504-Gateway timeout.
505-HTTP version not supported.
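The classes in the table above can be checked programmatically; here is a small sketch (the function names are illustrative, not part of urllib2):

```python
def status_class(code):
    # Map an HTTP status code to its class by its hundreds digit
    classes = {1: 'informational', 2: 'success', 3: 'redirection',
               4: 'client error', 5: 'server error'}
    return classes.get(code // 100, 'unknown')

def should_retry(code):
    # Per the rule above, only 5xx (server-side) errors are worth retrying
    return 500 <= code < 600

print(status_class(301))  # redirection
print(should_retry(503))  # True
print(should_retry(404))  # False
```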

