Python3 Web Crawler Learning: Basic Library Usage (1)

Source: Internet
Author: User

I recently began learning Python3 web crawler development. My starting textbook is Cui Qingcai's "Python3 Network Crawler Development Practice." As I review its contents, I will share my own experiences and points of confusion from working through it, so I have started this diary, which also serves to keep my own learning on track. Throughout this series I will add content not found in the book as a supplement to the knowledge learned.

(1) Using the urllib library

In Python3, Python2's urllib and urllib2 libraries were merged into a single built-in HTTP request library, urllib, which requires no additional installation. This library includes four modules:

request: The most basic HTTP request module, used to simulate sending requests. Given a URL and some additional parameters, it can access a site the way a browser does.

error: The exception handling module. When a request errors, we can catch these exceptions and retry, ensuring the program does not terminate unexpectedly.

parse: A utility module that provides URL processing methods such as splitting, parsing, and merging.

robotparser: Mainly used to parse a site's robots.txt file and determine which pages may be crawled; it is rarely used.

We will mainly cover the first three modules.
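Since the parse module's splitting, parsing, and merging operations come up repeatedly when crawling, here is a quick illustrative sketch of what they look like (the example URL is arbitrary):

```python
# Quick illustration of the parse module's parsing/merging helpers.
from urllib.parse import urlparse, urljoin

# Parsing: break a URL into its components
result = urlparse('https://www.python.org/doc/index.html;user?id=5#comment')
print(result.scheme)   # https
print(result.netloc)   # www.python.org
print(result.path)     # /doc/index.html

# Merging: resolve a relative link against a base URL
print(urljoin('https://www.python.org/doc/', 'faq.html'))
# https://www.python.org/doc/faq.html
```

urlparse() also exposes the params, query, and fragment fields, so a URL can be fully decomposed and reassembled.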

First, installation: urllib is part of the Python3 standard library, so nothing needs to be installed. (The similarly named urllib3 is a separate third-party package installed with pip install urllib3; don't confuse the two, and note the "3" and the lowercase "l"s when typing the name in CMD, or you will get an error.)

Next we use the request module to send requests. Some useful functions follow.

1. urlopen(): requests and crawls web content.

Requesting the contents of the Python website looks like this:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

Use the type() function to discover its type:

>>> print(type(response))
<class 'http.client.HTTPResponse'>

The response is an HTTPResponse object, so it exposes useful methods and attributes: read() returns the content of the web page, and the status attribute returns the status code, e.g. 200 for a successful request, 404 for page not found.

It is worth mentioning that if read() is called without decode('utf-8'), the output will start with b', because the response body is binary (bytes); you need to decode it into a UTF-8 string.

>>> print(response.read())
b'...'

Use the response object's attributes and methods to get the status code and response header information. Finally, getheader('Server') returns the server value; nginx indicates that the site is served by nginx.

>>> print(response.status)
200
>>> print(response.getheaders())
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('X-XSS-Protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '48812'), ('Accept-Ranges', 'bytes'), ('Date', 'Thu, ... 2018 02:31:55 GMT'), ('Via', '1.1 varnish'), ('Age', '595'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2126-IAD, cache-hnd18734-HND'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '109'), ('X-Timer', 'S1534386716.758723,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
>>> print(response.getheader('Server'))
nginx

You can also pass some parameters to urlopen(). Let's look at the function's API (signature):

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Below we look at these parameters one by one.

One thing to pay attention to here: when using the urllib-related libraries, your own file name must not reuse a standard module name such as http.py, otherwise importing the package will keep failing with: module 'urllib' has no attribute 'request'.

After the file is renamed, the error disappears. Of course, there is another cause for this error: importing urllib directly. In Python3, import urllib does not automatically import the submodules, so you need to import them specifically, e.g. import urllib.request. A final possible mistake is that you simply misspelled the word, a tale of sad tears. OK, now let's examine each of the parameters.
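The submodule-import pitfall can be demonstrated in a few lines; this is just an illustrative sketch of the behavior described above:

```python
# In Python 3, submodules must be imported explicitly:
import urllib.request  # correct: makes urllib.request available

# With only `import urllib`, accessing urllib.request can raise
# AttributeError: module 'urllib' has no attribute 'request'.
print(hasattr(urllib, 'request'))  # True, because we imported it above
```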

    • The data parameter

This parameter is optional. If you want to add it, it must be in byte-stream encoding format, i.e. the bytes type; if it isn't, you need to convert it with the bytes() method. Also, if this parameter is passed, the request is no longer a GET but a POST.

import urllib.request
import urllib.parse

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

The result is:

b'{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "word": "hello"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.6"
  },
  "json": null,
  "origin": "182.110.15.26",
  "url": "http://httpbin.org/post"
}'

The parameter we passed appears in the form field, indicating that the data was submitted as a form, simulating a POST request.

    • The timeout parameter

The timeout parameter sets the timeout, in seconds: if the request exceeds the set time without a response, an exception is thrown. If this parameter is not specified, the global default time is used.

It supports HTTP, HTTPS, and FTP requests.

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/post', timeout=1)
print(response.read())

Here the timeout is set to 1 second; if after one second the server has not responded, a URLError exception is thrown. This exception class belongs to the urllib.error module, and the cause of the error is the timeout.

So we can set a timeout to handle pages that do not respond for a long time, i.e. skip their crawl. This can be implemented with a try...except statement.

The API of the isinstance() function is isinstance(object, classinfo), where object is an instance object and classinfo is a direct or indirect class. It is worth mentioning its difference from type(): type() does not consider a subclass instance to be of the parent class's type, but isinstance() does.

socket.timeout is the timeout exception type.
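Putting the pieces together, here is a minimal sketch of skipping a slow page with try...except. It assumes http://httpbin.org/get is reachable, and uses an unrealistically short timeout (0.01 s) just to force the exception:

```python
import socket
import urllib.error
import urllib.request

try:
    # 0.01 s is deliberately too short, so the request should time out
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.01)
    print(response.status)
except urllib.error.URLError as e:
    # e.reason holds the underlying cause; check whether it was a timeout
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
```

In a real crawler, the except branch would log the URL and move on to the next one instead of just printing.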

    • Other parameters

There are other parameters, such as the context parameter, which must be an ssl.SSLContext object and is used to specify SSL settings. (SSL: Secure Sockets Layer, a protocol that sits between TCP/IP and the various application-layer protocols, providing security support for data communication. SSL enables secure communication between clients and servers by authenticating each other, using digital signatures to ensure integrity, and using encryption to ensure confidentiality. The protocol consists of two layers: the SSL record protocol and the SSL handshake protocol.)

The cafile parameter specifies a CA certificate file, and the capath parameter specifies its directory path; these take effect when requesting HTTPS links.
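As a sketch of how the context parameter might be used (assuming the standard library's ssl.create_default_context(), which loads the system's trusted CA certificates; cafile/capath achieve the same thing by pointing at a specific certificate bundle):

```python
import ssl
import urllib.request

# Build an SSLContext that verifies certificates against the system CAs
context = ssl.create_default_context()

response = urllib.request.urlopen('https://www.python.org', context=context)
print(response.status)  # 200 if the TLS handshake and request succeed
```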
