Before the Crawler
Before you start writing a crawler, make sure you have the necessary background knowledge.
First: Understand the basics of the HTTP protocol
HTTP is short for HyperText Transfer Protocol. It was developed through collaboration between the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF), which together published a series of RFCs. RFC 1945 defines HTTP/1.0, and the most notable of the series is RFC 2616, which defines HTTP/1.1, the version in common use today.
HTTP is the transfer protocol used to deliver hypertext from WWW servers to local browsers. It makes browsers more efficient and reduces network traffic. It not only ensures that computers transmit hypertext documents correctly and quickly, but also determines which parts of a document are transmitted and which content is displayed first (for example, text before graphics), and so on.
The HTTP request-response model
In HTTP, the client always initiates the request and the server returns the response.
This limits how the protocol can be used: the server cannot push a message to a client unless the client first initiates a request.
HTTP is also a stateless protocol: there is no inherent connection between one request and an earlier request from the same client.
Workflow
An HTTP operation is called a transaction, and its working process can be divided into four steps:
1) First, the client and the server establish a connection. Simply clicking a hyperlink is enough to set HTTP in motion.
2) Once the connection is established, the client sends a request to the server, consisting of a Uniform Resource Identifier (URI) and a protocol version number, followed by MIME-style information including request modifiers, client information, and possibly a body.
3) On receiving the request, the server responds with a status line containing the protocol version number and a success or error code, followed by MIME-style information including server information, entity metadata, and possibly a body.
4) The client receives the information returned by the server, the browser displays it on the user's screen, and the client then disconnects from the server.
If an error occurs at any of these steps, information describing the error is returned to the client and displayed. For the user, all of this is handled by HTTP itself; the user simply clicks the mouse and waits for the information to appear.
Second: Learn about the urllib library in Python
Python 2 used urllib2; Python 3 merged everything into urllib. What we need to learn are a few commonly used functions; details can be found in the official documentation.
Third: Development tools
Python ships with IDLE, a very simple IDE; PyCharm is a more interactive Python IDE; Fiddler is a web debugging proxy that lets us inspect in detail what happens when a user triggers a page request.
Simple web crawler
Code
"The first example: a simple web crawler crawling Watercress homepage" import urllib.request# URL url = "https://www.douban.com/" #请求request = urllib.request.Request (URL) #爬取结果response = urllib.request.urlopen (request) data = Response.read () #设置解码方式data = Data.decode (' Utf-8 ') # Printing results print (data) #打印爬取网页的各类信息print (type (response)) print (Response.geturl ()) print (Response.info ()) Print ( Response.getcode ())
Results
A portion of the captured output is as follows:
Python 3 Crawler Example (1): Simple Web Crawling