Introduction to HTTP

Source: Internet
Author: User

What is the HTTP protocol?

Every page you browse is delivered over HTTP, a protocol for data communication between a client (such as a browser) and a server in Internet applications. The protocol specifies the format in which the client sends a request to the server and the format in which the server returns its response.

Anyone can implement their own web client (a browser or a crawler) or web server (such as Nginx or Apache) on top of HTTP, as long as requests are sent and responses are returned in the agreed format.

The HTTP protocol itself is simple. A request can only be initiated by the client, and the server returns a response once it has processed the request. HTTP is also a stateless protocol: the protocol itself keeps no record of a client's earlier requests.

So how does HTTP specify the request and response formats? In other words, what format must the client follow to issue a valid HTTP request, and what format must the server's response follow so that the client can parse it correctly?

HTTP Request

An HTTP request consists of three parts: the request line, the request headers, and the request body. The headers and the body are optional in the message format, so not every request carries them (HTTP/1.1 does, however, require a Host header).
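As a minimal sketch of how the three parts fit together, the helper below assembles a raw request message. The name `build_request` and its parameters are illustrative assumptions, not part of any library.

```python
def build_request(method, target, headers=None, body=b""):
    """Assemble a raw HTTP/1.1 request from its three parts (illustrative helper)."""
    # Part 1: the request line, e.g. "GET /index.html HTTP/1.1"
    lines = [f"{method} {target} HTTP/1.1"]
    # Part 2: zero or more "Name: value" header lines
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # A blank line (CRLF CRLF) closes the header section
    head = "\r\n".join(lines) + "\r\n\r\n"
    # Part 3: the optional body follows the blank line
    return head.encode("ascii") + body


message = build_request("GET", "/index.html", {"Host": "foofish.net"})
print(message)
```

Calling it without headers or a body still produces the request line followed by the terminating blank line.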

Request Line

The request line is required in every request. It consists of three fields separated by spaces: the request method, the request URL (URI), and the HTTP protocol version.
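Because the three fields are separated by spaces, a request line can be taken apart with a simple split. The function name `parse_request_line` is a hypothetical helper used only for illustration.

```python
def parse_request_line(line):
    """Split a request line into (method, target, version)."""
    # The three fields are separated by single spaces
    method, target, version = line.split(" ")
    return method, target, version


print(parse_request_line("GET /index.html HTTP/1.1"))
```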

The most common request methods in HTTP are GET, POST, PUT, and DELETE. The GET method fetches a resource from the server and, as a rough figure, about 90% of crawlers collect their data with GET requests.

The request URL is the path of the resource on the server. For example, a request URL of /index.html indicates that the client wants the index.html resource, which is located under the root directory (/) of the server http://foofish.net.
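To illustrate how a full address relates to the request URL, the sketch below splits a URL with the standard library's `urlsplit`. The full URL is assembled here from the host and path mentioned above, purely for illustration.

```python
from urllib.parse import urlsplit

# Full URL built from the server and path in the example above
parts = urlsplit("http://foofish.net/index.html")

print(parts.scheme)  # protocol used to reach the server
print(parts.netloc)  # host name, sent in the Host header
print(parts.path)    # path, sent as the request URL in the request line
```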

Request Header

The request line carries very little information, so anything else the client needs to tell the server goes into the request headers. For example, the User-Agent header identifies the client, letting the server know whether the request comes from a browser or a crawler, and from Chrome or Firefox. HTTP/1.1 specifies 47 header fields. Header fields are formatted much like a Python dictionary: each is a key-value pair separated by a colon. For example:

User-Agent: Mozilla/5.0

Because the message the client sends is a single string, a blank line is used to mark where the request headers end and the request body begins: when a blank line is encountered, the headers are over and the body starts.
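The blank-line rule and the key-value format can be sketched as follows. The raw message is a made-up sample, and this simplified parsing ignores details of real traffic such as repeated header names and case-insensitive matching.

```python
raw = b"GET /index.php HTTP/1.1\r\nHost: seriot.ch\r\nUser-Agent: Mozilla/5.0\r\n\r\n"

# The first blank line (CRLF CRLF) ends the header section
head, _, body = raw.partition(b"\r\n\r\n")

# The first line is the request line; the rest are header lines
request_line, *header_lines = head.decode("ascii").split("\r\n")

headers = {}
for line in header_lines:
    # Split on the first colon only, since values may contain colons
    name, _, value = line.partition(":")
    headers[name.strip()] = value.strip()

print(request_line)
print(headers)
print(body)
```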

Request Body

The request body is the actual content the client submits to the server, such as the user name and password sent at login, the data of an uploaded file, or the form fields submitted when registering an account.
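As a sketch of a request that carries a body, the example below encodes a login form and places it after the blank line. The form fields, the /login path, and the example.com host are hypothetical values chosen for illustration.

```python
from urllib.parse import urlencode

# Hypothetical login form fields, for illustration only
form = {"username": "alice", "password": "secret"}
body = urlencode(form).encode("ascii")

head = "\r\n".join([
    "POST /login HTTP/1.1",
    "Host: example.com",
    "Content-Type: application/x-www-form-urlencoded",
    # Content-Length tells the server how many body bytes follow
    f"Content-Length: {len(body)}",
]) + "\r\n\r\n"

message = head.encode("ascii") + body
print(message.decode("ascii"))
```

The Content-Length header matters here: without it (or chunked encoding) the server cannot tell where the body ends.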

Now let's use the socket module, the most low-level networking API that Python provides, to send an HTTP request to a server:

    import socket

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # 1. Establish a connection to the server
        s.connect(("www.seriot.ch", 80))
        # 2. Build the request line; the requested resource is index.php
        request_line = b"GET /index.php HTTP/1.1"
        # 3. Build the request header, specifying the host name
        headers = b"Host: seriot.ch"
        # 4. Mark the end of the request headers with a blank line
        blank_line = b"\r\n"
        # The request line, header, and blank line are joined with
        # CRLF to form the request message sent to the server
        message = b"\r\n".join([request_line, headers, blank_line])
        s.sendall(message)
        # The response returned by the server is parsed later
        response = s.recv(1024)
        print(response)

HTTP Response

After the server receives and processes the request, it returns a response to the client. The response must likewise follow a fixed format so that the browser can parse it correctly. Mirroring the request format, an HTTP response also has three parts: the response line, the response headers, and the response body.

Response Line

The response line also has three fields: the HTTP protocol version supported by the server, the status code, and a brief reason phrase describing the status code.

The status code is a very important field in the response line. It tells the client whether the server handled the request properly. A status code of 200 means the request was handled successfully, 500 means an error occurred while the server was processing the request, and 404 means the requested resource could not be found on the server. HTTP defines many other status codes, but they are beyond the scope of this article.
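The three status codes mentioned above can be looked up with the standard library's `http.HTTPStatus`. The helper name `describe` is a hypothetical choice, and the sketch assumes well-formed response lines.

```python
from http import HTTPStatus


def describe(status_line):
    """Return (version, code, phrase, class) for a response line."""
    # Split into at most three fields; the reason phrase may contain spaces
    version, code, reason = status_line.split(" ", 2)
    status = HTTPStatus(int(code))
    # The first digit gives the class: 2xx success, 4xx client error, 5xx server error
    return version, status.value, status.phrase, f"{status.value // 100}xx"


for line in (
    "HTTP/1.1 200 OK",
    "HTTP/1.1 404 Not Found",
    "HTTP/1.1 500 Internal Server Error",
):
    print(describe(line))
```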

Response Header

The response headers are similar to the request headers. They supplement the response with extra information, telling the client, for example, what type of data the response body contains, when the response was returned, whether the body is compressed, and when the body was last modified.
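HTTP header lines share their "Name: value" syntax with e-mail headers, so one way to read them is the standard library's `email.parser.Parser`. This is a sketch of one possible approach, not the only one; the header values are taken from the sample response in this article.

```python
from email.parser import Parser

raw_headers = (
    "Date: Tue, 04 Apr 2017 16:22:35 GMT\r\n"
    "Server: Apache\r\n"
    "Transfer-Encoding: chunked\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "\r\n"
)

# headersonly=True stops the parser from looking for a message body
headers = Parser().parsestr(raw_headers, headersonly=True)

print(headers["content-type"])        # header lookup is case-insensitive
print(headers.get_content_type())     # media type of the response body
print(headers.get_content_charset())  # character set of the response body
```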

Response Body

The response body is the actual content returned by the server, which can be an HTML page, an image, a video, and so on.

Let's continue with the previous example and look at the response the server returned. Because only the first 1024 bytes were received, part of the response is not shown.

b'HTTP/1.1 200 OK\r\nDate: Tue, 04 Apr 2017 16:22:35 GMT\r\nServer: Apache\r\nExpires: Thu, 19 Nov 1981 08:52:00 GMT\r\nSet-Cookie: PHPSESSID=66bea0a1f7cb572584745f9ce6984b7e; path=/\r\nTransfer-Encoding: chunked\r\nContent-Type: text/html; charset=UTF-8\r\n\r\n118d\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n\n
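The output above is cut off because a single recv call returns at most the number of bytes requested. A minimal sketch of reading until the peer closes the connection follows. The helper name `recv_all` is hypothetical, and the demonstration uses a local socket pair rather than a real server. With a real HTTP/1.1 server the connection usually stays open, so a client would also need to send "Connection: close" or honor Content-Length or chunked encoding.

```python
import socket


def recv_all(sock, chunk_size=1024):
    """Read from sock until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:  # empty bytes means the peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Local demonstration with a connected socket pair instead of a real server
server_side, client_side = socket.socketpair()
payload = b"HTTP/1.1 200 OK\r\nContent-Length: 3000\r\n\r\n" + b"x" * 3000
server_side.sendall(payload)
server_side.close()

data = recv_all(client_side)
client_side.close()
print(len(data))
```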

The result matches the format laid out by the protocol. The first line is the response line, whose status code of 200 indicates that the request succeeded. The second part is the response headers, several header fields carrying the response time, cookie information, and so on. The third part is the actual response body, which is HTML text.
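To make the three parts concrete, the sketch below splits an abbreviated copy of the captured response (the Expires and Set-Cookie headers are omitted and the body is cut short). It also shows that the "118d" at the start of the body is not HTML: with Transfer-Encoding: chunked, each chunk of the body is preceded by its size in hexadecimal.

```python
# Abbreviated copy of the response captured above (body truncated)
response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Date: Tue, 04 Apr 2017 16:22:35 GMT\r\n"
    b"Server: Apache\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"\r\n"
    b"118d\r\n<!DOCTYPE html"
)

# The blank line separates the response line and headers from the body
head, _, body = response.partition(b"\r\n\r\n")
status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")

print(status_line)
print(len(header_lines), "header lines")

# The chunked body starts with the chunk size in hex, then CRLF, then data
size_line, _, rest = body.partition(b"\r\n")
chunk_size = int(size_line, 16)
print(chunk_size)
print(rest)
```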

By now you should have a general understanding of HTTP. A crawler essentially imitates a browser sending HTTP requests, so understanding HTTP is a prerequisite for working seriously with crawlers.

Of course, there is far more to HTTP than this, and no single article can cover it all. This article is only a starting point; to learn more about HTTP, see the extended reading recommended by "Python Zen".
