HTTP protocol: notes on "HTTP: The Definitive Guide"

Last Update:2015-10-12 Source: Internet

Author: User

Tags error status code


1. HTTP Basics

Features of HTTP:
Supports the B/S (browser/server) model; stateless.
HTTP message: consists of 2 parts, a start line (request line or status line) and MIME-style information (headers and body).
HTTP intermediaries: there are 3 kinds: proxies (including caching proxies), gateways, and tunnels. A proxy receives requests in absolute-URL form, rewrites all or part of the message, and forwards the reformatted request to the server identified by the URL. A gateway is a receiving agent that sits in front of other servers and, if necessary, translates the request into the protocol of the underlying server. A tunnel acts as a relay point between 2 connections and does not change the messages; it is often used when communication has to pass through an intermediary (such as a firewall) or when the intermediary cannot understand the content of the messages.

A URL consists of 9 parts:
<scheme>://<user>:<password>@<host>:<port>/<path>;<params>?<query>#<frag>
The characters ://, :, @, /, ;, ? and # delimit the fields. The path and the components after it, except the fragment, are sent in the request line of the HTTP message.
Scheme: the protocol to use, such as HTTP, HTTPS, or FTP
Parameters: some schemes use this component to specify input parameters
ftp://prep.ai.mit.edu/pub/gnu.giz;type=d carries one parameter, which selects the FTP transfer type
type is the parameter name and d is the value
Query: some schemes use this component to pass query conditions to an application.
http://www.joes.com/check.cgi?item=12731&color=red passes the conditions item=12731 and color=red in the query field
Fragment: the name of a piece of a resource. The browser does not send this field to the server. Servers deal with whole resources, so if the user wants only part of a page, the browser requests the entire resource and then displays the part named by the fragment.
http://www.joes.com/tools.html#drills requests the whole /tools.html page, and the browser displays the drills part of it
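The URL components can be pulled apart with Python's standard urllib.parse module. This is only a small sketch using the example URLs from these notes; the helper name split_url is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

def split_url(url):
    """Return the URL components described in the notes as a dict (illustrative helper)."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,                # None when the URL gives no port
        "path": parts.path,
        "params": parts.params,            # the ";type=d" style parameters
        "query": parse_qs(parts.query),    # query string as a dict of lists
        "fragment": parts.fragment,        # never sent to the server
    }

query_example = split_url("http://www.joes.com/check.cgi?item=12731&color=red")
frag_example = split_url("http://www.joes.com/tools.html#drills")
param_example = split_url("ftp://prep.ai.mit.edu/pub/gnu.giz;type=d")
```

The fragment stays on the client side: only the path, parameters, and query would appear in the request line.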
Messages flow downstream: both request messages and response messages flow downstream, from the sender toward the receiver

2. HTTP Messages: Requests and Responses

End-of-headers marker: the header section ends with an empty line (a bare CRLF)
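A minimal sketch of how a receiver could use that empty line to split a message into start line, headers, and body. The function name parse_message and the sample message are made up for illustration, and header folding and repeated headers are ignored.

```python
def parse_message(raw: bytes):
    """Split a raw HTTP message at the first empty line (CRLF CRLF)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]                  # request line or status line
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body

raw = (b"HTTP/1.0 200 OK\r\n"
       b"Content-Type: text/plain\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"hello")
start_line, headers, body = parse_message(raw)
```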
HTTP methods: the following 7 methods are commonly used
* GET: request a document from the server
* PUT: store the body of the request on the server
* HEAD: get only the headers of a document from the server
* POST: send data to the server for processing
* TRACE: trace a message that may pass through proxy servers on its way to the server. Each intermediary that receives a TRACE request adds its own host address to the Via header of the message. The destination server sends back the message it received in the entity of its response, so that the client can determine whether the request was modified by intermediaries.
* OPTIONS: ask the server which methods it supports, in general or for a particular resource
* DELETE: delete a document from the server
HTTP header categories:
* General headers: Date, Connection, MIME-Version, Trailer, Via, Pragma, Upgrade, Transfer-Encoding ...
* Request headers: Client-IP, From, Host, Referer, UA-Color, UA-CPU, UA-Disp, UA-OS, UA-Pixels, User-Agent, Accept, Accept-Charset, Accept-Encoding, Accept-Language, TE, If-Match, If-Modified-Since, If-None-Match, If-Range, If-Unmodified-Since, Range, Authorization, Cookie, Max-Forwards, Proxy-Authorization ...
* Response headers: Server, Age, Public, Retry-After, Title, Warning ...
* Entity headers: Content-Type, Content-Encoding, Content-Length, Content-Language, Content-Location, Allow, Location ...
* Extension headers
Entities can carry many kinds of information: images, video, HTML documents, e-mail, and so on.
Status code classes:
* 100~199: informational status codes
* 200~299: success status codes
* 300~399: redirection status codes
* 400~499: client error status codes
* 500~599: server error status codes
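Because the class is determined by the first digit, a client can classify any status code with one integer division. A small sketch (the function name status_class is made up):

```python
def status_class(code: int) -> str:
    """Map an HTTP status code to its class, based on the hundreds digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    if not 100 <= code <= 599:
        raise ValueError(f"not an HTTP status code: {code}")
    return classes[code // 100]
```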

3. TCP Connection Management in HTTP
A TCP connection is uniquely identified by 4 values, <src_ip, src_port, dst_ip, dst_port>. Two different connections cannot have exactly the same 4 values, so a source host cannot use the same src_port for several simultaneous connections to the same destination IP and port.
General flow of an HTTP transaction between client and server:
* The client extracts the host name and port number from the URL
* The client looks up the IP address of the host name through DNS
* The client opens a TCP connection to the host IP and port (three-way handshake)
* The client sends the request
* The server sends the response, and the client reads the response
* The server and the client each close their end (connection closed)
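The steps above can be sketched with plain sockets. To keep the example runnable without Internet access, it talks to a throwaway local server (HelloHandler, a stand-in that is not part of the original notes); the function name fetch is also made up.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit

class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server (answers HTTP/1.0, then closes the connection)."""
    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

def fetch(url: str) -> bytes:
    parts = urlsplit(url)
    host, port = parts.hostname, parts.port or 80            # 1. host and port from the URL
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM)[0]              # 2. name lookup
    with socket.socket(family, socktype, proto) as sock:
        sock.connect(addr)                                    # 3. TCP three-way handshake
        request = f"GET {parts.path or '/'} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))                 # 4. send the request
        chunks = []
        while True:                                           # 5. read until the server closes
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)                                   # 6. socket closed by the with block

server = HTTPServer(("127.0.0.1", 0), HelloHandler)           # port 0: let the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
response = fetch(f"http://127.0.0.1:{server.server_port}/")
server.shutdown()
server.server_close()
```

Every call to fetch pays for the name lookup and the handshake again, which is exactly the per-transaction cost discussed next.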
HTTP transaction delays (they follow from the transaction flow)
* If the server host has not been visited recently, resolving its IP address through DNS may take 10 s or more
* TCP connection setup (three-way handshake) takes up to about 2 s, and the delay adds up when many connections are set up one after another (serial processing)
* The server may need some time to process the request
* Sending the request and returning the response take some time
Common TCP delays
* The three-way handshake of the TCP connection (invisible to HTTP programmers): for a small HTTP transaction, setting up the connection can take a large share of the total time
* TCP slow start, used for congestion control
* The Nagle algorithm, used for data aggregation
* The TCP delayed acknowledgment algorithm, used for piggybacked acknowledgments
* TIME_WAIT delays and client port exhaustion
Three-way handshake delay of TCP connections
* The client's request can be placed in the third packet of the handshake, the client's acknowledgment ("piggybacking")
TCP delayed acknowledgment algorithm
* Each TCP segment carries a sequence number and a checksum. When the receiver gets an intact segment, it returns a small acknowledgment packet; if the sender does not receive the acknowledgment within a specified window of time, it concludes that the segment was lost or corrupted and resends the data. Because acknowledgments are small, TCP lets them "piggyback" on data packets heading in the same direction. To increase the chance of finding such a data packet, many TCP stacks implement a "delayed acknowledgment" algorithm: outgoing acknowledgments are held in a buffer for a window of 100~200 ms while waiting for a data packet to piggyback on; if none appears, the acknowledgment is sent in a packet of its own.
* HTTP's bimodal request-response behavior reduces the chance of piggybacking, so the delayed acknowledgment algorithm often adds significant delay to HTTP. Depending on the operating system, the delayed acknowledgment algorithm can be adjusted or disabled
TCP slow start (congestion control)
* The performance of TCP data transfer also depends on the age of the TCP connection. The maximum sending rate is limited at first; each time data is transmitted successfully (acknowledgments received), the sender is allowed to send more. This is known as TCP slow start: the congestion window opens gradually. Because of this congestion control, connections that are already tuned transfer data faster than new connections.
Nagle algorithm (data aggregation)
* The Nagle algorithm tries to bundle a large amount of TCP data bound for the same destination into a single packet before sending, to improve network efficiency and reduce the number of packets on the network. It discourages sending segments that are not full size (about 1500 bytes on Ethernet)
* The Nagle algorithm causes HTTP performance problems: a small HTTP message may not fill a packet, so it is held back, which adds delay. The Nagle algorithm also holds a partial packet until the previous packet has been acknowledged, while the delayed acknowledgment algorithm at the destination holds acknowledgments for 100~200 ms, so the two together can add a large delay between 2 packets.
* Setting the TCP_NODELAY parameter disables the Nagle algorithm and improves performance, but the application must then write large chunks of data to TCP so that it does not produce a flurry of small packets
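In Python, TCP_NODELAY is set per socket with setsockopt. A minimal sketch; the helper names are made up, and send_in_one_write only illustrates the "write large chunks" advice.

```python
import socket

def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

def send_in_one_write(sock: socket.socket, header: bytes, body: bytes) -> None:
    """Hand headers and body to TCP together instead of in many small writes."""
    sock.sendall(header + body)
```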
Port exhaustion (caused by TIME_WAIT)
* TIME_WAIT port exhaustion is a serious problem that affects performance benchmarking. Cause: when a TCP endpoint closes a TCP connection, it keeps a small control block in memory that records the IP address and port number of the recently closed connection. This record is kept for twice the maximum segment lifetime (the longest time a TCP segment can survive in the network), called 2MSL and often about 2 minutes, so that no new connection with the same IP addresses and port numbers is created during this period (otherwise, delayed data from the old connection still in the network could end up in the buffers of the new TCP connection)
* Each time a TCP client connects to the server, it obtains a new source port so that the connection is unique. The number of source ports is limited. With, say, 60,000 ports and connections that cannot be reused for 2MSL (120 s), the client can open at most 60,000/120 = 500 connections per second; staying below this rate ensures that port exhaustion is not encountered
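The same arithmetic as a small function, assuming 60,000 usable source ports and an MSL of 60 seconds (so 2MSL = 120 seconds). The function name is made up.

```python
def max_connection_rate(source_ports: int, msl_seconds: float) -> float:
    """Upper bound on new connections per second to one server before ports run out."""
    time_wait = 2 * msl_seconds        # each port stays unusable for 2MSL after close
    return source_ports / time_wait

rate = max_connection_rate(60_000, 60)   # 60,000 ports / 120 s
```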
Handling HTTP connections
* Connection: close indicates that the connection will be closed once the current request/response exchange is complete
* Serial transaction delays: if transactions are processed serially and each one sets up its own connection, the TCP connection delays and slow-start delays add up. Remedies: parallel connections (concurrent requests over multiple connections), persistent connections (reusing TCP connections), and pipelined connections (concurrent HTTP requests over a shared TCP connection).
* Parallel connections: overlapping connection setup and transaction processing may speed up page loading, but the connections still share the limited network bandwidth.
* There are 2 types of persistent connections: HTTP/1.0+ keep-alive and HTTP/1.1 persistent connections. With keep-alive, the client sends Connection: Keep-Alive to ask that the connection stay open; if the server's response does not carry the same header, the client assumes the server does not support keep-alive. HTTP/1.1 persistent connections are similar to keep-alive, except that they are enabled by default
* Pipelined connections: on a persistent connection, requests can be pipelined, that is, queued and sent before the earlier responses have arrived. Persistent connections eliminate TCP connection delays, and pipelining overlaps the transfer delays. Responses must be returned in the same order as the requests. If the connection fails, the client cannot tell which pipelined requests were executed and which were not, so non-idempotent requests (requests whose effect accumulates when they are repeated) should not be sent over a pipelined connection.
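A small sketch of connection reuse with Python's http.client, again against a throwaway local server (KeepAliveHandler is a stand-in invented for this example). The server records the client address of each request; if both requests arrive from the same address and port, they travelled over the same TCP connection.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_clients = []   # (ip, port) of the client socket for every request served

class KeepAliveHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"    # lets http.server keep the connection open

    def do_GET(self):
        seen_clients.append(self.client_address)
        body = self.path.encode("ascii")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))  # needed so the client finds the end
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

server = HTTPServer(("127.0.0.1", 0), KeepAliveHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
bodies = []
for path in ("/first", "/second"):
    conn.request("GET", path)                  # second request reuses the open connection
    bodies.append(conn.getresponse().read())   # read fully before sending the next request
conn.close()
server.shutdown()
server.server_close()
```

Note that this is a persistent connection, not pipelining: each request is sent only after the previous response has been read.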
Closing HTTP connections
* Both the server and the client can close a TCP connection at any time
* Each HTTP response should carry an accurate Content-Length header; otherwise the client can only rely on the server closing the connection to mark the true end of the data
* TCP close and reset errors: closing the input (read) side of a connection is risky. If side B writes data to a connection whose input side has been closed by side A, A's TCP stack sends a TCP "connection reset by peer" message to B. B's TCP stack then clears its input and output buffers, and when B tries to read the buffered data it gets a "connection reset by peer" error
* Graceful close: a side that has finished sending data should first close its output; once the other side has also closed its output, the TCP stack shuts the connection down completely and no reset error occurs.
What an HTTP server does
* Set up the connection: handle the new connection, identify the client host name through reverse DNS, and determine the client user through ident
* Receive the request: parse the request line, read the message headers, detect the empty line (CRLF) that ends the headers, and read the request body, whose length is given by the Content-Length header.
Server architectures: single-threaded, multi-process and multi-threaded, multiplexed I/O, and multiplexed multi-threaded servers
* Process the request
* Access the resource: map the request to a resource under the docroot and apply access control
* Construct the response
* Send the response
* Log the transaction

4. HTTP Proxies
How a proxy works: a proxy is both a client and a server. The client sends its request message to the proxy, and the proxy must handle the request and the connection properly and return a response, just as a server would. At the same time the proxy itself sends requests to servers and receives their responses, so it must also behave like a proper HTTP client. Anyone who builds an HTTP proxy has to follow the rules for both HTTP clients and HTTP servers.
Proxy functions
Child filters, document access control, security firewalls, web caches, reverse proxies, content routers (directing requests to specific servers based on network traffic or content type), anonymizers (actively removing identifying information such as the client IP address, From, Referer, session IDs in URIs, and cookies from messages)
The difference between a proxy and a gateway
A proxy connects 2 or more applications that use the same protocol, whereas a gateway connects 2 or more endpoints that use different protocols and acts as a protocol converter.
Deployment of proxy servers in the network
* Egress proxy: deployed at the exit point where a local network connects to the Internet, to control the traffic between the local network and the Internet and to provide firewall protection.
* Access (ingress) proxy: often placed at ISP access points to process the aggregate requests from customers.
* Reverse proxy: typically deployed at the edge of the network, in front of the web servers; it assumes the server's name and IP address so that all requests are sent to the proxy instead of the server
* Network exchange proxy: deployed at Internet peering points to relieve congestion on the Internet through caching and to monitor traffic
Client proxy configuration
Browsers can be configured manually or through proxy auto-configuration (PAC)

5. HTTP Caching
Benefits of caching
* Reduces redundant data transfers (requests are served from the cache)
* Relieves network bandwidth bottlenecks (a request crosses several networks of different speeds on its way to the server, and the slowest one determines the effective bandwidth, so a nearby cache can improve the effective bandwidth to some extent)
* Reduces the demand on the origin server
* Reduces distance delays: caching helps both with flash crowds (a large number of requests reaching the server in the same period) and with delays caused by distance.
Cache hits vs. misses
When a request that arrives at the cache can be served from a copy already in the cache, it is called a cache hit; if no copy is available and the request is forwarded to the server, it is called a cache miss.
Revalidation
To make sure a cached copy is not out of date, the cache has to check its freshness; HTTP calls this revalidation. The cache can revalidate with the If-Modified-Since header: if the resource has not been modified, the server sends a 304 Not Modified response, otherwise it sends a 200 OK response. If the resource has been deleted on the server, the server sends a 404 Not Found response. The If-None-Match header, which validates against the unique tag of the entity (its ETag), is another good method. Revalidation generally uses the conditional If-* headers.
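The server side of revalidation boils down to a small decision. The sketch below assumes a resource is represented as a dict with an "etag" string and a timezone-aware "last_modified" datetime (a made-up representation), and it ignores weak validators and the "*" form of If-None-Match.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def revalidate(request_headers: dict, resource) -> int:
    """Return the status code for a conditional GET (simplified sketch)."""
    if resource is None:                       # resource was deleted
        return 404
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:              # entity tag comparison
        tags = [tag.strip() for tag in if_none_match.split(",")]
        return 304 if resource["etag"] in tags else 200
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:          # date comparison
        since = parsedate_to_datetime(if_modified_since)
        return 304 if resource["last_modified"] <= since else 200
    return 200                                 # unconditional request

modified = datetime(2015, 10, 1, 12, 0, tzinfo=timezone.utc)
doc = {"etag": '"v1"', "last_modified": modified}
unchanged = revalidate({"If-Modified-Since": format_datetime(modified, usegmt=True)}, doc)
changed = revalidate({"If-None-Match": '"v0"'}, doc)
deleted = revalidate({}, None)
```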
Cache processing steps
* Receive: read the incoming request message from the network
* Parse: extract the URL and the headers
* Look up: check whether a local copy exists; if not, fetch a copy and store it locally
* Freshness check: check whether the cached copy is fresh enough; if not, ask the server whether there are any updates (revalidation)
* Create response: build a response message from the new headers and the cached body
* Send: send the response to the client
* Log: create a log entry describing the transaction
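A toy cache that follows the lookup, freshness check, and logging steps. Everything here is an assumption made for illustration: the class name TinyCache, the fetch callable standing in for the origin server, and the fixed max_age. For simplicity a stale entry is refetched in full rather than revalidated.

```python
import time

class TinyCache:
    def __init__(self, fetch, max_age=60.0, clock=time.monotonic):
        self.fetch = fetch        # callable url -> body, stands in for the origin server
        self.max_age = max_age    # seconds a stored copy counts as fresh
        self.clock = clock
        self.store = {}           # url -> (body, time stored)
        self.log = []             # (url, outcome) entries

    def handle(self, url):
        entry = self.store.get(url)                          # look up a local copy
        if entry is None:
            outcome = "miss"
        elif self.clock() - entry[1] > self.max_age:         # freshness check
            outcome = "stale"
        else:
            outcome = "hit"
        if outcome != "hit":                                 # go to the origin and store the copy
            entry = (self.fetch(url), self.clock())
            self.store[url] = entry
        self.log.append((url, outcome))                      # log the transaction
        return entry[0]                                      # body used for the response
```

A clock function can be passed in so that expiry can be exercised without waiting, for example TinyCache(origin, max_age=10, clock=fake_clock).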
How the server controls caching
* Cache-Control: no-store: the cache forwards the response to the client but must not store a copy.
* Cache-Control: no-cache: the cache forwards the response to the client and may store a copy, but must not serve that copy until it has been revalidated with the server.
* Cache-Control: max-age=<s>: the number of seconds, counted from when the server sent the document, during which the document is considered fresh.
* Expires: the absolute date on which the document stops being fresh (a separate header, not a Cache-Control directive)
* Cache-Control: must-revalidate: the cache must revalidate a stale copy with the server before serving it
How the client controls caching
* Cache-Control: min-fresh=<s>: the cached document must stay fresh for at least the next s seconds
* Cache-Control: no-cache: the cache must revalidate the resource before returning it
* Cache-Control: no-store: the cache must not keep a copy
* Cache-Control: only-if-cached: the cache responds only if it already has a copy; otherwise it does not contact the server.
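Cache-Control carries a comma-separated list of directives, some with a value. A minimal parsing sketch (function names are made up; quoted values that themselves contain commas are not handled):

```python
def parse_cache_control(value: str) -> dict:
    """Turn 'max-age=3600, must-revalidate' into {'max-age': '3600', 'must-revalidate': True}."""
    directives = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, arg = item.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else True
    return directives

def may_store(directives: dict) -> bool:
    """A copy may be kept unless no-store is present."""
    return "no-store" not in directives

def must_revalidate_first(directives: dict) -> bool:
    """With no-cache, a stored copy has to be revalidated before it is served."""
    return "no-cache" in directives
```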

6. Integration points: gateways, tunnels, and relays
The web is a powerful content publishing tool. A web browser is an HTTP application that transfers HTML markup text over HTTP; the HTML markup tells the browser what to display and how to display it. Besides web pages, HTTP can also carry other content.
HTML pages can be created with tools such as FrontPage and use the suffix .htm or .html.
HTML: HyperText Markup Language, designed to display data. XML: Extensible Markup Language, whose tags are not predefined; it complements HTML and is designed to transport data rather than display it.

Gateways: let HTTP communicate with other protocols or applications in order to reach non-HTTP resources. There are many kinds of gateways, in 2 main categories: protocol gateways (HTTP/FTP, HTTPS/HTTP) and resource gateways (application server gateways, database query gateways).
HTTP traffic is directed to a gateway in the same way it is directed to a proxy: most commonly by explicitly configuring the browser with the gateway address, or by intercepting the traffic transparently.
The most common resource gateway is the application server. The flow is: the client connects to the application server over HTTP, and the application server passes the request through a gateway application programming interface (API) to an application running on the server. (The application server and the gateway are on the same host.)
The first popular application gateway API was the Common Gateway Interface (CGI), a standardized set of interfaces for launching applications in response to requests for particular URLs and passing parameters to them. Applications written in any language can be invoked through CGI.
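With CGI, the server passes request data to the program through environment variables such as QUERY_STRING and reads the response (headers, empty line, body) from the program's standard output. The sketch below keeps that logic in a function that takes the environment as a dict so it can be tried without a web server; the function name and the output format are made up, and the query matches the check.cgi example earlier in these notes.

```python
from urllib.parse import parse_qs

def cgi_response(environ) -> str:
    """Build the text a CGI program would write to standard output."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    item = query.get("item", ["?"])[0]
    color = query.get("color", ["?"])[0]
    body = f"item={item} color={color}\n"
    # CGI output: header lines, an empty line, then the body
    return "Content-Type: text/plain\r\n\r\n" + body

output = cgi_response({"REQUEST_METHOD": "GET",
                       "QUERY_STRING": "item=12731&color=red"})
```

In a real CGI script the same function would be called with os.environ and its result written to sys.stdout.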
An HTTP request sent to a gateway:
GET ftp://ftp.irs.gov/pub/00-index.txt HTTP/1.0
Host: ftp.irs.gov
User-agent: SuperBrowser 4.2 # browser model

Tunnels: let users send non-HTTP traffic over an HTTP connection, so that data of other protocols can ride on top of HTTP and pass through firewalls that allow only web traffic.
A tunnel requires a tunnel gateway, which forwards the non-HTTP traffic. The tunnel between the client and the tunnel gateway is set up with the CONNECT method, as follows: the client sends a CONNECT request to the tunnel gateway; the tunnel gateway opens a TCP connection to the destination server; the tunnel gateway returns a response to the client, which establishes the HTTP tunnel between the client and the gateway. From then on, all data the client sends through the HTTP tunnel is relayed directly over the gateway's TCP connection to the server, and all data the server sends is relayed to the client through the HTTP tunnel.
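The two messages that set up a tunnel are simple enough to sketch by hand. The function names are made up for illustration, and the example host in the usage note is hypothetical.

```python
def build_connect_request(host: str, port: int) -> bytes:
    """Build the CONNECT request a client sends to the tunnel gateway."""
    target = f"{host}:{port}"
    return f"CONNECT {target} HTTP/1.1\r\nHost: {target}\r\n\r\n".encode("ascii")

def tunnel_established(reply: bytes) -> bool:
    """A 2xx status in the gateway's reply means raw data can now flow through the tunnel."""
    status_line = reply.split(b"\r\n", 1)[0].decode("iso-8859-1")
    parts = status_line.split(" ", 2)
    return len(parts) >= 2 and parts[1].isdigit() and 200 <= int(parts[1]) < 300
```

For example, build_connect_request("server.example.com", 443) produces the request for a tunnel to port 443 of that (hypothetical) host; after a 200 reply, the client would start speaking the tunneled protocol on the same socket.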

Web robots: like browsers, web robots are HTTP clients, but they generally run on high-speed computers. They have to comply with the HTTP specification as well. Web robots also have to deal with additional issues, such as crawling basics and loop avoidance.

7. HTTP Client Identification Methods:
1). HTTP headers that carry user identity information: From, User-Agent, Referer, Authorization, Client-IP, Cookie, and others.
2). Client IP address tracking: the client IP can be obtained from the socket connection, but an IP address does not necessarily correspond to one user
3). User login: once user authentication succeeds, the browser sends the credentials in the Authorization header with every later request to the site.
4). Fat URLs, which embed identification information: when a user first visits a site (without an ID), the server generates a unique ID, appends it to the URL, and redirects the client to this fat URL; it also rewrites the site's links for that client into fat URLs. When the server receives a request for a fat URL, it looks up the user information that belongs to the ID in the URL.
5). Cookies: used to identify the user; the browser stores them as records in its cookie database. There are 2 kinds: session cookies (deleted when the user exits the browser) and persistent cookies (saved as files on the hard disk). The server sends a Set-Cookie header to ask the client to store a cookie.
Cookies raise privacy concerns; browsers can be configured to disable them.
How a cookie is created and used:
Client request:
GET /index.html HTTP/1.0
Host: www.joes-hardware.com
Server response, asking the client to store a cookie:
HTTP/1.0 200 OK
Set-Cookie: id="34294"; domain="joes-hardware.com"
Content-Type: text/html
Content-Length: 1903
...
The client sends its next request with the stored cookie:
GET /index.html HTTP/1.0
Host: www.joes-hardware.com
Cookie: id="34294"
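The client side of this exchange can be sketched with Python's http.cookies module: parse the Set-Cookie value from the response, then build the Cookie header for the next request. The function name is made up, and using joes-hardware.com as the cookie domain is an assumption based on the Host header of the example. A real client would also check the domain, path, and expiry before sending a cookie back.

```python
from http.cookies import SimpleCookie

def cookie_header_from(set_cookie_value: str) -> str:
    """Build the value of a Cookie request header from a Set-Cookie response header value."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    # Only name=value pairs go back to the server; attributes such as domain stay with the client
    return "; ".join(f"{name}={morsel.coded_value}" for name, morsel in jar.items())

jar = SimpleCookie()
jar.load('id="34294"; domain="joes-hardware.com"')
cookie_value = jar["id"].value        # the value with the quotes removed
cookie_domain = jar["id"]["domain"]   # attribute kept by the client
```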

This article comes from the "Tech record" blog; please do not repost it.
