Python Crawler Primer (1)--A Quick Look at the HTTP Protocol

HTTP is one of the most important and most fundamental protocols on the Internet, and our crawlers have to deal with it constantly. This article is a quick introduction to the HTTP protocol for Python crawler writers. It is quite detailed, so friends who need it can refer to it; let's take a look.

Objective

The basic principle of a crawler is to simulate a browser making HTTP requests, so understanding the HTTP protocol is a necessary foundation for writing crawlers. Crawler job postings conspicuously list mastery of the HTTP protocol specification as a requirement, so writing crawlers has to start with the HTTP protocol.

What is the HTTP protocol?

Every page you browse is rendered based on the HTTP protocol, the protocol for data communication between the client (browser) and the server in Internet applications. The protocol specifies the format in which the client should send requests to the server, as well as the format in which the server should return its responses.

Anyone can implement their own web client (browser, crawler) or web server (Nginx, Apache, etc.) based on the HTTP protocol, as long as they initiate requests and return responses as the protocol prescribes.

The HTTP protocol itself is very simple. It states that a request can only be initiated by the client, and the server returns a response once the request has been processed. HTTP is also a stateless protocol: the protocol itself does not record the client's past requests.

How does the HTTP protocol specify the request format and the response format? In other words, in what format must the client send an HTTP request, and in what format must the server return the response so that the client can parse it correctly?

HTTP Request

An HTTP request consists of 3 parts: the request line, the request headers, and the request body. The headers and the body are optional; not every request requires them.

Request Line

The request line is an essential part of every request. It consists of 3 parts: the request method, the request URL (URI), and the HTTP protocol version, separated by spaces.

The most common request methods in the HTTP protocol are GET, POST, PUT, and DELETE. The GET method is used to fetch resources from the server, and 90% of the requests a crawler makes to crawl data are GET requests.

The request URL refers to the path of the resource on the server. For example, if the client wants to get index.html, which sits under the root directory (/) of the server foofish.net, the request URL is /index.html.

Request Header

Because the request line carries only a very limited amount of information, everything else the client wants to tell the server has to go in the request headers. The request headers give the server extra information; for example, User-Agent identifies the client, letting the server know whether you are a browser or a crawler, and whether you come from Chrome or Firefox. HTTP/1.1 specifies 47 header fields. An HTTP header field is formatted like an entry in a Python dictionary: a key-value pair separated by a colon. For example:

User-Agent: Mozilla/5.0
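To illustrate the dictionary analogy, here is a minimal sketch that splits a raw header line on its first colon to recover the key-value pair (the header value here is just the example above):

```python
# A header field is a key-value pair separated by a colon,
# much like an entry in a Python dict.
raw_header = "User-Agent: Mozilla/5.0"

# partition() splits on the first colon only, so values that
# themselves contain colons are preserved intact.
key, _, value = raw_header.partition(":")
headers = {key.strip(): value.strip()}

print(headers)  # {'User-Agent': 'Mozilla/5.0'}
```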

Because the request the client sends is just a string of bytes, something must mark where the request headers end and the request body begins: a blank line. When a blank line is encountered, the headers have ended and the request body begins.
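The assembly of a full request message can be sketched as follows. The host, path, and form data are hypothetical, chosen only for illustration; the point is the empty string in the join, which produces the blank line ("\r\n\r\n") separating headers from body:

```python
# Sketch: assembling a raw HTTP request message from its 3 parts.
request_line = "POST /login HTTP/1.1"
header_lines = [
    "Host: example.com",  # hypothetical host
    "Content-Type: application/x-www-form-urlencoded",
]
body = "user=alice&password=secret"  # hypothetical form data

# The empty string in the list becomes the blank line that
# marks the end of the headers and the start of the body.
message = "\r\n".join([request_line] + header_lines + ["", body])
print(message)
```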

Request Body

The request body is the actual content the client submits to the server, such as the user name and password entered when a user logs in, the data of an uploaded file, or the form information submitted when registering a user.
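For a form submission, the browser typically encodes the fields into a url-encoded request body. A small sketch using the standard library (the field names and values are made up for illustration):

```python
from urllib.parse import urlencode

# Hypothetical registration form fields; urlencode() produces the
# application/x-www-form-urlencoded body a browser would send.
form = {"username": "alice", "email": "alice@example.com"}
body = urlencode(form)

print(body)  # username=alice&email=alice%40example.com
```

Note that reserved characters such as "@" are percent-encoded, which is why the body differs slightly from the raw field values.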

Now let's use the most primitive API Python provides, the socket module, to simulate an HTTP request to a server:

import socket

# 1. Establish a connection with the server
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect(("www.seriot.ch", 80))
    # 2. Build the request line; the requested resource is index.php
    request_line = b"GET /index.php HTTP/1.1"
    # 3. Build the request headers, specifying the host name
    headers = b"Host: seriot.ch"
    # 4. A blank line marks the end of the request headers
    blank_line = b"\r\n"
    # The request line, headers, and blank line are joined with "\r\n",
    # forming the request message that is sent to the server
    message = b"\r\n".join([request_line, headers, blank_line])
    s.send(message)
    # The response content returned by the server is analyzed later
    response = s.recv(1024)
    print(response)

HTTP response

After the server receives the request and processes it, it returns the response content to the client. Similarly, the response content must follow a fixed format for the browser to parse it correctly. An HTTP response is also composed of 3 parts: the response line, the response headers, and the response body, mirroring the structure of the HTTP request.

Response Line

The response line is also composed of 3 parts: the version of the HTTP protocol the server supports, the status code, and a brief reason phrase describing the status code.

The status code is a very important field in the response line. Through the status code, the client can tell whether the server processed the request properly. If the status code is 200, the client's request was handled successfully; 500 means an exception occurred while the server was processing the request; 404 means the requested resource could not be found on the server. The HTTP protocol defines many other status codes, but they are beyond the scope of this article.
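The standard library enumerates the status codes and reason phrases defined by the HTTP protocol, so the codes mentioned above can be inspected without a server:

```python
from http import HTTPStatus

# Look up the reason phrase for each status code discussed above.
for code in (200, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)

# Output:
# 200 OK
# 404 Not Found
# 500 Internal Server Error
```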

Response header

The response headers are similar to the request headers; they supplement the response content, telling the client, for example, what type of data the response body contains, when the response was returned, whether the response body is compressed, and when the response content was last modified.

Response body

The response body is the real content returned by the server, which can be an HTML page, an image, a video, and so on.

Let's continue the previous example and see what response the server returns. Because only the first 1024 bytes were received, part of the response is not visible.

b'HTTP/1.1 200 OK\r\nDate: Tue, Apr 16:22:35 GMT\r\nServer: Apache\r\nExpires: Thu, 19 Nov 1981 08:52:00 GMT\r\nSet-Cookie: PHPSESSID=66bea0a1f7cb572584745f9ce6984b7e; path=/\r\nTransfer-Encoding: chunked\r\nContent-Type: text/html; charset=utf-8\r\n\r\n118d\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n\n

From the result, the response follows the canonical format in the protocol: the first line is the response line, with status code 200 indicating that the request succeeded. The second part is the response headers, consisting of several header fields, such as the time the server returned the response and cookie information. The third part is the actual response body, the HTML text.
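The three parts can be pulled apart programmatically. A minimal sketch, using a simplified response string (not the exact bytes from the example above): split on the first blank line to separate head from body, then split the head into the response line and the header fields:

```python
# A simplified raw response, for illustration only.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Apache\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<html><body>hello</body></html>"
)

# The first blank line separates the headers from the body.
head, _, body = raw.partition("\r\n\r\n")
# The first line of the head is the response line; the rest are headers.
status_line, *header_lines = head.split("\r\n")
headers = dict(line.split(": ", 1) for line in header_lines)

print(status_line)        # HTTP/1.1 200 OK
print(headers["Server"])  # Apache
print(body)               # <html><body>hello</body></html>
```

Real responses need more care (chunked transfer encoding, multi-valued headers), which is one reason crawlers normally rely on an HTTP library rather than raw parsing.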

At this point, you should have a general understanding of the HTTP protocol. A crawler's behavior is essentially simulating a browser sending HTTP requests, so to dig deep in the crawler domain, understanding the HTTP protocol is necessary.

"Recommended"

1. Python Crawler Primer (4)--Detailed HTML text parsing library BeautifulSoup

2. Python Crawler Introduction (3)--using requests to build a knowledge API

3. Python crawler Primer (2)--http Library requests

4. Summarize Python's logical operators and
