Before the Crawler
Before you start writing a crawler, make sure you have the necessary background knowledge.
First: Understand the basics of the HTTP protocol
HTTP is short for HyperText Transfer Protocol. It was developed through collaboration between the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF), which together published a series of RFCs. RFC 1945 defines HTTP/1.0, and the most notable of the series is RFC 2616, which defines HTTP/1.1, the version in common use today.
HTTP is the transfer protocol used to deliver hypertext from WWW servers to local browsers. It makes browsers more efficient and reduces network traffic. It not only ensures that computers transmit hypertext documents correctly and quickly, but also determines which parts of a document are transmitted and which content is displayed first (for example, text before graphics), and so on.
The HTTP request-response model
In HTTP, the client always initiates the request and the server returns the response.
This limits how the protocol can be used: the server cannot push a message to a client unless the client first initiates a request.
HTTP is also a stateless protocol: there is no inherent connection between one request and an earlier request from the same client.
Workflow
An HTTP operation is called a transaction, and its working process can be divided into four steps:
1) First, the client and the server establish a connection. Simply clicking a hyperlink is enough to set HTTP in motion.
2) Once the connection is established, the client sends a request to the server, consisting of a Uniform Resource Identifier (URI) and a protocol version number, followed by MIME-style information including request modifiers, client information, and possibly a body.
3) On receiving the request, the server responds with a status line containing the protocol version number and a success or error code, followed by MIME-style information including server information, entity metadata, and possibly a body.
4) The client receives the information returned by the server, the browser displays it on the user's screen, and the client then disconnects from the server.
If an error occurs at any of these steps, information describing the error is returned to the client and displayed. For the user, all of this is handled by HTTP itself; the user simply clicks the mouse and waits for the information to appear.
Second: Learn about the urllib library in Python
Python 2 used urllib2; Python 3 merged everything into urllib. What we need to learn are a few commonly used functions; details can be found in the official documentation.
Third: Development tools
Python ships with IDLE, a very simple IDE; PyCharm is a more interactive Python IDE; Fiddler is a web debugging proxy that lets us inspect in detail what happens when a user triggers a page request.
Simple web crawler
Code
"The first example: a simple web crawler crawling Watercress homepage" import urllib.request# URL url = "https://www.douban.com/" #请求request = urllib.request.Request (URL) #爬取结果response = urllib.request.urlopen (request) data = Response.read () #设置解码方式data = Data.decode (' Utf-8 ') # Printing results print (data) #打印爬取网页的各类信息print (type (response)) print (Response.geturl ()) print (Response.info ()) Print ( Response.getcode ())
Results
A portion of the captured output is as follows:
Python 3 Crawler Example (1): Simple Web Crawling