Python Crawler Tutorial: A Quick Introduction to the HTTP Protocol (1)
Preface
The basic principle of a crawler is to simulate a browser making HTTP requests, so understanding the HTTP protocol is an essential foundation for writing crawlers. Job postings for crawler developers on recruitment websites also list familiarity with the HTTP protocol specification as a requirement, so writing a crawler has to start with HTTP.
What is HTTP?
Every web page you browse is delivered over HTTP. HTTP is a protocol for data communication between clients (such as browsers) and servers in Internet applications. The protocol specifies the format in which the client sends a request to the server and the format of the response the server returns.
As long as requests are sent and responses are returned according to the protocol, anyone can implement their own web client (a browser or a crawler) or web server (Nginx, Apache, etc.) on top of HTTP.
HTTP itself is very simple. It stipulates that only the client can initiate a request; the server processes the request and then returns a response. HTTP is also a stateless protocol: the protocol itself keeps no record of a client's previous requests.
So how does HTTP specify the request and response formats? In other words, what format must a request follow for the client to issue it correctly, and what format must a response follow for the client to parse it correctly?
HTTP Request
An HTTP request consists of three parts: the request line, the request headers, and the request body. The headers and body are largely optional: not every request needs them, although HTTP/1.1 does require the Host header.
Request Line
The request line is the mandatory part of a request. It consists of three parts separated by spaces: the request method, the request URL, and the HTTP protocol version.
The most common request methods in HTTP are GET, POST, PUT, and DELETE. The GET method retrieves a resource from the server, and about 90% of crawlers fetch their data with GET requests.
The request URL is the path to the resource on the server. For example, if the client wants the resource index.html located in the root directory (/) of the foofish.net server, the request URL is /index.html.
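As a minimal sketch of the structure described above, the snippet below splits the request line for the index.html example into its three parts. The function name parse_request_line is hypothetical and chosen only for illustration.

```python
def parse_request_line(line):
    """Split a request line into (method, url, version)."""
    # The three parts are separated by single spaces.
    method, url, version = line.split(" ", 2)
    return method, url, version


# Request line for index.html in the root directory of the server.
method, url, version = parse_request_line("GET /index.html HTTP/1.1")
print(method)   # request method
print(url)      # request URL (path of the resource)
print(version)  # HTTP protocol version
```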
Request Header
Because the request line carries very little information, the client puts most of what it needs to tell the server into the request headers. Request headers give the server additional information. For example, User-Agent identifies the client, letting the server know whether the request comes from a browser or a crawler, and whether the browser is Chrome or Firefox. HTTP/1.1 specifies 47 header fields. An HTTP header field looks much like an entry in a Python dictionary: a key-value pair separated by a colon. For example:
User-Agent: Mozilla/5.0
When a client sends a request, the data it sends (the message) is just a sequence of characters. To separate the end of the request headers from the beginning of the request body, a blank line is used: the first blank line marks the end of the headers and the start of the body.
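The dictionary comparison and the blank-line rule can be sketched together. The example below is only an illustration: the request itself (the /login path and the form data) is made up, and split_request is a hypothetical helper, not part of any library.

```python
# Hypothetical request used only for illustration; the path and body are made up.
raw_request = (
    b"POST /login HTTP/1.1\r\n"
    b"Host: foofish.net\r\n"
    b"User-Agent: Mozilla/5.0\r\n"
    b"Content-Length: 9\r\n"
    b"\r\n"
    b"name=demo"
)


def split_request(raw):
    """Return (request_line, headers, body) from raw request bytes."""
    # The first blank line (CRLF CRLF) separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    request_line, *header_lines = head.decode("ascii").split("\r\n")
    headers = {}
    for line in header_lines:
        # Each header is a key-value pair separated by a colon.
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lowercase.
        headers[name.strip().lower()] = value.strip()
    return request_line, headers, body


request_line, headers, body = split_request(raw_request)
print(request_line)
print(headers)
print(body)
```

A plain dict keeps only one value per header name, which is enough for this sketch but would lose information if a header were repeated.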
Request Body
The request body is the actual content the client submits to the server, such as the user name and password for a login, the data of an uploaded file, or the form fields submitted during user registration.
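To make the login case concrete, here is a rough sketch of how a form submission could be assembled into a request message. The function build_post_request, the /login path, and the credentials are all assumptions made up for this illustration; urlencode comes from the Python standard library.

```python
from urllib.parse import urlencode


def build_post_request(host, path, form):
    """Build the raw bytes of a form-encoded POST request (illustrative only)."""
    # The request body: form fields encoded as key=value pairs joined by "&".
    body = urlencode(form).encode("ascii")
    head = "\r\n".join([
        f"POST {path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        # Tells the server how many bytes of body follow the blank line.
        f"Content-Length: {len(body)}",
        "",  # the two empty items produce the blank line
        "",  # that ends the headers
    ]).encode("ascii")
    return head + body


# Hypothetical login form; the path and credentials are invented.
message = build_post_request(
    "foofish.net", "/login", {"username": "alice", "password": "s3cret"}
)
print(message.decode("ascii"))
```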
Now let's use the low-level socket module provided by Python to send an HTTP request to a server.
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    # 1. Establish a connection with the server
    s.connect(("www.seriot.ch", 80))
    # 2. Build the request line; the requested resource is index.php
    request_line = b"GET /index.php HTTP/1.1"
    # 3. Build the request header, specifying the host name
    headers = b"Host: seriot.ch"
    # 4. Use a blank line to mark the end of the request headers
    blank_line = b"\r\n"

    # Join the request line, header, and blank line with line breaks
    # to form the request message, then send it to the server
    message = b"\r\n".join([request_line, headers, blank_line])
    s.sendall(message)

    # The response returned by the server is analyzed later
    response = s.recv(1024)
    print(response)
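The socket example above calls recv(1024) once, so it reads at most 1024 bytes. One possible way to read a whole response is to keep calling recv until it returns empty bytes, which happens only after the other side closes the connection. The sketch below assumes that behavior (with a real HTTP/1.1 server this usually means also sending a "Connection: close" header); recv_all is a hypothetical helper, and the demonstration uses a local socket pair rather than a real server.

```python
import socket


def recv_all(sock, chunk_size=1024):
    """Read from sock until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:
            # Empty bytes means the other side has closed the connection.
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Local demonstration with a connected socket pair instead of a real server.
server_end, client_end = socket.socketpair()
payload = b"HTTP/1.1 200 OK\r\n\r\n" + b"x" * 3000  # longer than one chunk
server_end.sendall(payload)
server_end.close()

data = recv_all(client_end)
client_end.close()
print(len(data))
```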
HTTP Response
After the server receives and processes the request, it returns a response to the client. The response must likewise follow a fixed format so that the browser can parse it correctly. An HTTP response also has three parts, mirroring the request format: the response line, the response headers, and the response body.
Response Line
The response line also consists of three parts: the HTTP protocol version supported by the server, the status code, and a brief description of the status code.
The status code is the key field in the response line. It tells the client whether the server handled the request properly. A status code of 200 means the request was processed successfully, 500 means the server hit an exception while processing the request, and 404 means the requested resource could not be found on the server. HTTP defines many other status codes, but they are beyond the scope of this article.
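As a small illustration of the response line, the sketch below splits it into its three parts and looks up the standard description of the three status codes mentioned above using HTTPStatus from the Python standard library. The function name parse_response_line is hypothetical.

```python
from http import HTTPStatus


def parse_response_line(line):
    """Split a response line into (version, status_code, description)."""
    # The description may contain spaces, so split at most twice.
    version, code, description = line.split(" ", 2)
    return version, int(code), description


version, status_code, description = parse_response_line("HTTP/1.1 200 OK")
print(version, status_code, description)

# Standard descriptions for the status codes discussed in the text.
for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase)
```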
Response Header
Response headers are similar to request headers: they supply additional information about the response. For example, they can tell the client the data type of the response body, when the response was generated, whether the body is compressed, and when the body was last modified.
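Because HTTP headers use the same "Name: value" layout as email headers, one convenient way to inspect them in Python is the standard library's email parser. The sketch below is an illustration only; the header values are copied from the sample response shown later in this article.

```python
from email import message_from_string

# Header block taken from the sample response in this article.
header_block = (
    "Date: Tue, 04 Apr 2017 16:22:35 GMT\r\n"
    "Server: Apache\r\n"
    "Set-Cookie: PHPSESSID=66bea0a1f7cb572584745f9ce6984b7e; path=/\r\n"
    "Transfer-Encoding: chunked\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "\r\n"
)

headers = message_from_string(header_block)

# Lookups by header name are case-insensitive.
print(headers["date"])                  # when the response was generated
print(headers.get_content_type())       # data type of the response body
print(headers.get_content_charset())    # character set of the response body
# Content-Encoding is absent here, which suggests the body is not compressed.
print(headers.get("Content-Encoding"))
```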
Response Body
The response body is the actual content returned by the server. It can be an HTML page, an image, a video, and so on.
Let's return to the earlier example and look at the response the server returned. Because only the first 1024 bytes were received, part of the response is missing.
b'HTTP/1.1 200 OK\r\nDate: Tue, 04 Apr 2017 16:22:35 GMT\r\nServer: Apache\r\nExpires: Thu, 19 Nov 1981 08:52:00 GMT\r\nSet-Cookie: PHPSESSID=66bea0a1f7cb572584745f9ce6984b7e; path=/\r\nTransfer-Encoding: chunked\r\nContent-Type: text/html; charset=UTF-8\r\n\r\n118d\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n\n
The result matches the standard format defined by the protocol. The first line is the response line, and its status code of 200 indicates that the request succeeded. The second part is the response headers, made up of several header fields, including the server's response time and cookie information. The third part is the actual response body, the HTML text.
By now you should have a general understanding of the HTTP protocol. In essence, a crawler simulates a browser sending HTTP requests, which is why understanding HTTP matters for anyone writing crawlers.
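One detail of the sample response is worth a closer look: the body begins with the line 118d before the HTML. Since the headers include "Transfer-Encoding: chunked", this is most likely the size of the first chunk written in hexadecimal. The sketch below uses an abbreviated copy of the sample (several headers are omitted and the HTML is cut short) to split the three parts and decode that size. It is an illustration, not a full chunked-encoding parser.

```python
# Abbreviated copy of the sample response; some headers are omitted
# and the HTML is cut short on purpose.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Apache\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"\r\n"
    b"118d\r\n"
    b"<!DOCTYPE html PUBLIC"
)

# Part 1 and part 2 (response line and headers) end at the first blank line.
head, _, body = sample.partition(b"\r\n\r\n")
response_line, *header_lines = head.split(b"\r\n")

# In a chunked body, each chunk starts with its size in hexadecimal.
size_line, _, chunk_start = body.partition(b"\r\n")
chunk_size = int(size_line.decode("ascii"), 16)

print(response_line)
print(len(header_lines))
print(chunk_size)
print(chunk_start)
```

Here the hexadecimal value 118d corresponds to 4493 bytes, which is more than the 1024 bytes read in the example and is consistent with the body being cut off.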