Basic working principle of TCP/IP protocol stack

Source: Internet
Author: User


TCP/IP is the core protocol suite of the Internet and underlies most network applications. This is a short summary of the TCP/IP questions that came up in my previous interviews. TCP is defined by RFC 793, RFC 1122, RFC 1323, RFC 2001, RFC 2018, and RFC 2581.

(1) TCP overview

a. TCP provides a connection-oriented, full-duplex service. Every TCP segment belongs to a connection identified by source address, destination address, source port, and destination port. A TCP connection is a resource that must be established first, through the handshake described later. UDP, by contrast, is a best-effort protocol: no connection resource is set up, and any such state is usually managed by the application-layer protocol.

b. TCP is a reliable service. An acknowledgment mechanism ensures that segments arrive. A checksum (a 16-bit ones' complement sum, not a CRC) detects corrupted segments; in UDP over IPv4 the checksum is optional. TCP reorders out-of-order segments and discards duplicates. It provides flow control through the sliding window algorithm, and it provides congestion control and recovery, for which several models exist. The two ends can also negotiate the maximum size of the segments they send.

TCP header:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Acknowledgment Number                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |           |U|A|P|R|S|F|                               |
| Offset| Reserved  |R|C|S|S|Y|I|            Window             |
|       |           |G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |         Urgent Pointer        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Data                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

TCP Header Format

Among the header flags, SYN is set only during connection establishment (the three-way handshake, or the four-segment exchange of a simultaneous open). ACK is set on every segment after the initial SYN, with a few exceptions: some RST segments, for example, do not carry ACK. These rules are useful when configuring complex ACLs.

(2) State machine of the TCP protocol stack (from RFC 793)

a. Establishing a TCP connection. A connection can be opened actively, passively, or simultaneously from both ends. The three-way handshake itself is well known; the point worth emphasizing is the choice of the initial sequence number (ISN). Sequence numbers are 32 bits wide, and how the ISN is chosen differs between operating systems, usually following a regular pattern.

The maximum segment size (MSS) is negotiated during the three-way handshake, and only in SYN segments: MSS = MTU - ip_header_len - tcp_header_len. Choosing the MSS this way also avoids IP fragmentation and improves bandwidth utilization.

The final ACK of the three-way handshake is not itself acknowledged. If it is dropped in the network, other mechanisms in the TCP stack recover from the loss.

Besides the three-way handshake there is a special case, simultaneous open, in which both ends send a SYN at the same time. In the RFC 793 state machine this is the path from SYN_SENT to SYN_RCVD rather than the usual path of an active open.
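To make the header layout and the MSS formula above concrete, here is a minimal Python sketch that packs and decodes a SYN segment. The function names parse_tcp_header and mss_for_mtu, the port numbers, and the 1500-byte MTU are illustrative assumptions, and the MSS formula assumes 20-byte IP and TCP headers without options.

```python
import struct

# Flag bits in the low 6 bits of the 16-bit offset/flags word (RFC 793).
FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def mss_for_mtu(mtu, ip_header_len=20, tcp_header_len=20):
    """MSS = MTU - ip_header_len - tcp_header_len (option-free headers assumed)."""
    return mtu - ip_header_len - tcp_header_len


def parse_tcp_header(segment):
    """Decode the fixed 20-byte header and the MSS option if one is present."""
    src, dst, seq, ack, off_flags, window, checksum, urgent = struct.unpack(
        "!HHIIHHHH", segment[:20]
    )
    header_len = (off_flags >> 12) * 4  # Data Offset counts 32-bit words
    header = {
        "src_port": src, "dst_port": dst, "seq": seq, "ack": ack,
        "header_len": header_len,
        "flags": {name for name, bit in FLAGS.items() if off_flags & bit},
        "window": window, "checksum": checksum, "urgent": urgent, "mss": None,
    }
    options = segment[20:header_len]
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:            # end of option list
            break
        if kind == 1:            # no-operation (one byte of padding)
            i += 1
            continue
        if i + 1 >= len(options) or options[i + 1] < 2:
            break                # malformed option, stop parsing
        length = options[i + 1]
        if kind == 2 and length == 4 and i + 4 <= len(options):
            # Maximum segment size option: kind 2, length 4, 16-bit value
            header["mss"] = struct.unpack("!H", options[i + 2:i + 4])[0]
        i += length
    return header


# A SYN from port 8888 to port 7777 that announces the MSS for a 1500-byte MTU.
# The checksum is left at 0 because this sketch does not compute it.
syn = struct.pack("!HHIIHHHH", 8888, 7777, 1000, 0,
                  (6 << 12) | FLAGS["SYN"], 65535, 0, 0)
syn += bytes([2, 4]) + struct.pack("!H", mss_for_mtu(1500))
print(parse_tcp_header(syn))
```

The same flag masks are what an ACL rule matches on: a segment with SYN set and ACK clear is a new connection attempt, while ACK set means the segment belongs to an exchange already in progress.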
 
As an example of simultaneous open, A opens a connection from source port 8888 to port 7777 on B while B opens a connection from source port 7777 to port 8888 on A.

b. Closing a TCP connection. Closing can likewise be active, passive, or simultaneous, and all three cases appear in the TCP state machine. Closing takes four segments: because TCP is full duplex, each direction is shut down separately, and the connection is fully torn down only after both directions are closed.
In the state machine, both active close and simultaneous close end in the TIME_WAIT state. The last segment sent by the actively closing side is an ACK that acknowledges the peer's FIN. In TIME_WAIT the connection's resources are not yet fully released, because the closing side must make sure that this final ACK reaches the peer: if it is lost, the peer retransmits its FIN and the ACK has to be sent again. The wait lasts 2MSL, which allows for one segment to travel out and one to come back. MSL is the maximum segment lifetime; RFC 793 specifies 2 minutes, while implementations commonly use 30 seconds or 1 minute. While a connection is in TIME_WAIT, its port and connection resources cannot be reused. Many TCP implementations relax this restriction, however, as long as the ISN of the new connection is greater than the last sequence number used by the connection in TIME_WAIT:

New ISN = last sequence number in TIME_WAIT + 128000
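The 2MSL wait and the ISN rule above can be written out as a short Python sketch. The function names are illustrative. The modular comparison in seq_gt is an assumption added here so that "greater than" still works when the 32-bit sequence space wraps around, which the text does not discuss.

```python
SEQ_MOD = 2 ** 32  # sequence numbers are 32 bits wide


def time_wait_seconds(msl_seconds):
    """TIME_WAIT lasts twice the maximum segment lifetime."""
    return 2 * msl_seconds


def next_isn(last_seq_in_time_wait, increment=128000):
    """New ISN = last sequence number in TIME_WAIT + 128000, modulo 2**32."""
    return (last_seq_in_time_wait + increment) % SEQ_MOD


def seq_gt(a, b):
    """True if a comes after b in 32-bit modular order (assumed comparison)."""
    diff = (a - b) % SEQ_MOD
    return 0 < diff < 2 ** 31


# The three MSL values mentioned in the text: 30 s, 1 minute, 2 minutes.
for msl in (30, 60, 120):
    print(msl, "->", time_wait_seconds(msl))

# Even when the old connection ended close to the top of the sequence space,
# the new ISN still compares as greater than the old sequence number.
old_seq = SEQ_MOD - 296
print(next_isn(old_seq), seq_gt(next_isn(old_seq), old_seq))
```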
The lifetime of an IP packet is bounded by its TTL, and that of a TCP segment by the MSL. Layer 2 frames have no such lifetime limit, which is why storms can occur at that layer.

(3) TCP sliding window and timers

a. TCP acknowledgment mechanism. TCP sends its data stream through a sliding window, so it can transmit several segments in a row without waiting for the peer to acknowledge each one. Segments and acknowledgments are therefore not paired one to one. Acknowledgments are usually delayed: typically one ACK covers two data segments, provided the delayed-ACK timer has not expired; if the timer expires first, an ACK is sent. For applications that exchange a large number of small packets, however, handling each one separately uses the network inefficiently, so TCP also supports the Nagle algorithm.

b. Delayed-ACK timer. The timer starts when TCP receives a segment and runs for, say, 200 ms.

c. Nagle algorithm. A TCP connection may have only one unacknowledged small segment outstanding (for example a 41-byte packet: 20 bytes of IP header, 20 bytes of TCP header, and 1 byte of data). Until the ACK arrives, TCP only collects further small writes; once it arrives, they are sent together as one segment. Some applications need to disable the Nagle algorithm.

d. Sliding window mechanism.

Window closing (left edge moves right): when data arrives from the peer and is verified, it is stored in the receive buffer to wait for the application to read it. Because the data has been verified, an ACK must be sent to the peer; because the application has not yet taken the data, the window must close, that is, the left edge of the window slides to the right. The ACK number in this response is the sequence number of the next byte expected from the peer. Because later window updates carry the same value, the same ACK number may be sent more than once.
Window opening (right edge moves right): once the application reads data from the buffer after the window has closed, the sliding window needs to grow again, and its right edge moves to the right. The window is in effect a circular buffer, so the space gained at the right edge is the space the application has just freed. After the window opens, the receiver must notify the peer with an ACK; its ACK number is the same as in the previous ACK, since no new data has arrived.

Window shrinking: the right edge of the window moves to the left. The Host Requirements RFC strongly discourages this, but a TCP must still be able to cope with a peer that does it.
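The window movements above can be modeled with a few lines of Python. This is a simplified sketch from the receiver's point of view: the class name ReceiveWindow and its fields are hypothetical, data is assumed to arrive in order, and sequence number wraparound is ignored.

```python
class ReceiveWindow:
    """Toy model of a receiver's advertised window over a fixed-size buffer."""

    def __init__(self, capacity, initial_seq=0):
        self.capacity = capacity      # size of the receive buffer in bytes
        self.rcv_nxt = initial_seq    # left edge: next byte expected from the peer
        self.buffered = 0             # bytes acknowledged but not yet read

    @property
    def window(self):
        """Advertised window: free space left in the buffer."""
        return self.capacity - self.buffered

    @property
    def right_edge(self):
        return self.rcv_nxt + self.window

    def receive(self, nbytes):
        """Data arrives: the left edge moves right and the window closes."""
        accepted = min(nbytes, self.window)
        self.rcv_nxt += accepted
        self.buffered += accepted
        return self.rcv_nxt, self.window   # (ACK number, advertised window)

    def app_read(self, nbytes):
        """Application reads: the right edge moves right and the window opens."""
        taken = min(nbytes, self.buffered)
        self.buffered -= taken
        return self.rcv_nxt, self.window   # window update with the same ACK number


rw = ReceiveWindow(capacity=4096, initial_seq=1)
print(rw.receive(1024), rw.right_edge)    # window closes, right edge unchanged
print(rw.app_read(1024), rw.right_edge)   # window opens, ACK number unchanged
```

In this model the window never shrinks: the right edge only stays put or moves right, which matches the recommendation of the Host Requirements RFC.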
e. Retransmission timer. This timer waits for the peer's acknowledgment. If it expires, the segment is retransmitted; if the data is still unacknowledged after retransmission, a reset (RST) is sent.

Consider the TCP three-way handshake again:

A (initiator) ---> syn     ---> B (server)
A (initiator) <--- syn/ack <--- B (server)
A (initiator) ---> ack     ---> B (server)

What happens if the final ACK from client A is lost and server B never receives it? A has already entered the ESTABLISHED state, but B is still in SYN_RCVD, so B keeps retransmitting its SYN/ACK until the connection is finally established. Because A is already in ESTABLISHED, it may start sending data to B in the meantime. The state machines at the two ends can therefore be temporarily inconsistent.
The retransmission and congestion control mechanisms will be detailed later.
f. Persist timer. TCP does not acknowledge ACKs. So when the receiver's window recovers from 0 to some value and the ACK that announces the new window size is lost, the sender would never learn that the window has reopened. The sender therefore periodically sends a one-byte probe segment and reads the window size from the receiver's acknowledgment.

g. Keep-alive timer. When the peer of an idle TCP connection crashes for physical reasons, TCP's keep-alive mechanism can determine whether the peer is still working. The design is controversial; arguably the application layer should implement this function. In RFC 1122 the keep-alive timer is off by default. The relevant text reads:

Implementors MAY include "keep-alives" in their TCP implementations, although this practice is not universally accepted. If keep-alives are included, the application MUST be able to turn them on or off for each TCP connection, and they MUST default to off.
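Two of the mechanisms in this section are exposed to applications as per-socket options in the BSD sockets API: disabling the Nagle algorithm (TCP_NODELAY) and enabling keep-alives (SO_KEEPALIVE). The following sketch uses Python's standard socket module. The helper name configure is hypothetical, and the socket is never connected; it only shows how the options are set and read back.

```python
import socket


def configure(sock, nodelay=True, keepalive=True):
    """Set the Nagle and keep-alive options and return them as read back."""
    # TCP_NODELAY = 1 disables the Nagle algorithm on this connection.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, int(nodelay))
    # SO_KEEPALIVE = 1 turns on keep-alive probes for this connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, int(keepalive))
    return (
        sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0,
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0,
    )


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    # Keep-alive starts out disabled, as RFC 1122 requires.
    print("keep-alive by default:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    print("after configure:", configure(s))
```

Interactive protocols that send many small writes are the usual candidates for TCP_NODELAY; bulk transfers generally leave Nagle enabled.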
(4) TCP congestion control algorithms: slow start, congestion avoidance, fast retransmit, and fast recovery

There are four main congestion control models: TCP Tahoe, TCP Reno, TCP NewReno, and TCP SACK. TCP Tahoe is one of the earliest, proposed by Jacobson.

Jacobson observed that TCP segments are lost for two reasons: segment corruption and network congestion. Networks at the time were mostly wired, so corruption was rare and congestion was the main cause of loss. TCP Tahoe improved on the original protocol accordingly. It detects loss through two mechanisms, expiry of the retransmission timer and receipt of duplicate acknowledgments (dupacks), and uses them to decide when to apply congestion control. Under congestion control it uses the slow start and congestion avoidance algorithms to govern the sending rate. TCP Reno, which appeared in 1990, combined fast retransmit with the new fast recovery algorithm, which avoids shrinking the send window too far through slow start when congestion is not severe. TCP congestion control thus consists mainly of these four core algorithms.
a. Timeout and retransmission: RTT measurement and RTO calculation.
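The text gives only the heading for this item. As a hedged illustration, the sketch below follows the widely used Jacobson-style estimator standardized in RFC 6298: a smoothed RTT plus four times the RTT variation, with the timeout doubled on each expiry. The class name and the min_rto and max_rto defaults are assumptions for the example, not values taken from this article.

```python
class RtoEstimator:
    """Retransmission timeout estimator in the style of RFC 6298."""

    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variation

    def __init__(self, min_rto=1.0, max_rto=60.0):
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.srtt = None      # smoothed round-trip time, in seconds
        self.rttvar = None    # round-trip time variation, in seconds
        self.rto = 1.0        # initial RTO before any measurement

    def _clamp(self, value):
        return max(self.min_rto, min(self.max_rto, value))

    def on_rtt_sample(self, r):
        """Update the estimate with one measured round-trip time r."""
        if self.srtt is None:
            # First measurement initializes both state variables.
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        self.rto = self._clamp(self.srtt + 4 * self.rttvar)
        return self.rto

    def on_timeout(self):
        """Exponential backoff: double the RTO after the timer expires."""
        self.rto = self._clamp(self.rto * 2)
        return self.rto


est = RtoEstimator(min_rto=0.0)   # floor removed so small values stay visible
for sample in (0.100, 0.200, 0.120):
    print(round(est.on_rtt_sample(sample), 4))
print(round(est.on_timeout(), 4))
```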
b. Slow start and congestion avoidance. Slow start aims to make the rate at which the sender injects segments into the network match the rate at which acknowledgments come back from the peer; this design suits WAN paths that include low-speed links. To implement it, TCP adds a second window to each connection: the congestion window cwnd, which is initialized to one segment, that is, one maximum segment size (MSS) rather than one byte.

A TCP connection therefore has two windows: the receive window, through which the receiver controls the flow, and the congestion window, through which the sender controls it. The sender uses the smaller of the two as its upper limit.

Slow start is exponential. cwnd starts at 1. When the ACK for that segment arrives, cwnd becomes 2; when the next two ACKs arrive, cwnd becomes 4, then 8, and so on.

Congestion avoidance aims to prevent the timeouts and packet loss that congestion at intermediate routers causes. It needs two variables: the congestion window cwnd and the slow start threshold ssthresh. For a new connection, cwnd is 1 and ssthresh is 65535.

When congestion is detected (by a timeout or by duplicate ACKs), ssthresh is set to half of the smaller of cwnd and the receive window. If the congestion was detected by a timeout, cwnd is also set to 1.

In congestion avoidance, that is, while cwnd is greater than ssthresh, each ACK of new data sets cwnd = cwnd + 1/cwnd, where cwnd is still measured in MSS units.

In practice congestion avoidance and slow start are used together, and both cwnd and ssthresh change over time even though they start at 1 and 65535. When congestion occurs, ssthresh is set to half of the smaller of cwnd and the receive window; if the congestion was detected by a timeout, cwnd drops to 1 and slow start runs until cwnd exceeds ssthresh, after which congestion avoidance takes over. The sending rate grows under both algorithms, exponentially in slow start and linearly in congestion avoidance.
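The interplay of the two algorithms can be traced with a small Python simulation. It is a sketch under stated assumptions: cwnd and ssthresh are counted in MSS units, every segment in a round trip is acknowledged, the function names are illustrative, and the floor of 2 segments for ssthresh comes from common practice rather than from the text above.

```python
def effective_window(cwnd, rwnd):
    """The sender may have at most min(cwnd, rwnd) outstanding."""
    return min(cwnd, rwnd)


def on_ack(cwnd, ssthresh):
    """Grow cwnd (in MSS units) for one ACK of new data."""
    if cwnd < ssthresh:
        return cwnd + 1        # slow start: cwnd doubles every round trip
    return cwnd + 1 / cwnd     # congestion avoidance: about +1 MSS per round trip


def on_congestion(cwnd, rwnd, timeout):
    """Return (cwnd, ssthresh) after a timeout or duplicate ACKs."""
    ssthresh = max(min(cwnd, rwnd) / 2, 2)   # floor of 2 segments is an assumption
    if timeout:
        cwnd = 1                             # a timeout restarts slow start
    return cwnd, ssthresh


def simulate(rounds, ssthresh=8.0):
    """Return cwnd at the start of each round trip, with no losses."""
    cwnd = 1.0
    history = [cwnd]
    for _ in range(rounds):
        for _ in range(int(cwnd)):           # one ACK per segment sent this round
            cwnd = on_ack(cwnd, ssthresh)
        history.append(cwnd)
    return history


# Exponential growth up to ssthresh, then roughly one MSS per round trip.
print([round(c, 2) for c in simulate(6)])
print(on_congestion(cwnd=16, rwnd=64, timeout=True))
```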
c. Fast retransmit and fast recovery. Two situations on a TCP connection produce duplicate ACKs: out-of-order segments and segment loss.

Fast retransmit: when the sender receives three duplicate ACKs, it does not fall back to slow start but immediately retransmits the missing segment. A receiver generates a duplicate ACK only when another segment arrives, so duplicate ACKs show that data is still flowing on the connection, and there is no need to cut the rate through slow start.

Fast recovery:
Step 1: when the third duplicate ACK arrives, set ssthresh to half of the current cwnd and retransmit the missing segment. Then set cwnd to ssthresh plus 3 segments (cwnd = cwnd/2 + 3).
Step 2: for each further duplicate ACK, increase cwnd by 1 and send a segment.
Step 3: when the next ACK that acknowledges new data arrives, set cwnd to the ssthresh value from step 1. This ACK should acknowledge the retransmitted segment as well as the segments sent between the lost segment and the retransmission.

The net effect is that the sending rate is halved when three duplicate ACKs are received.
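The three steps of fast recovery map directly onto a small state machine. The following Python sketch tracks only the window arithmetic; the class name RenoSender and its counters are hypothetical, windows are in MSS units, and normal cwnd growth outside recovery is left out to keep the example short.

```python
class RenoSender:
    """Window bookkeeping for Reno fast retransmit and fast recovery."""

    def __init__(self, cwnd, ssthresh):
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.dupacks = 0
        self.in_fast_recovery = False
        self.retransmits = 0

    def on_dup_ack(self):
        self.dupacks += 1
        if self.in_fast_recovery:
            # Step 2: each further duplicate ACK inflates the window by one segment.
            self.cwnd += 1
        elif self.dupacks == 3:
            # Step 1: halve the threshold, retransmit, set cwnd = ssthresh + 3.
            self.ssthresh = max(self.cwnd / 2, 2)
            self.retransmits += 1
            self.cwnd = self.ssthresh + 3
            self.in_fast_recovery = True

    def on_new_ack(self):
        if self.in_fast_recovery:
            # Step 3: an ACK for new data deflates the window back to ssthresh.
            self.cwnd = self.ssthresh
            self.in_fast_recovery = False
        self.dupacks = 0


s = RenoSender(cwnd=16, ssthresh=64)
for _ in range(4):
    s.on_dup_ack()
print(s.cwnd, s.ssthresh, s.retransmits)   # inflated window during recovery
s.on_new_ack()
print(s.cwnd, s.in_fast_recovery)          # half of the original cwnd
```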
Fast retransmit first appeared in the 4.3BSD Tahoe release, and fast recovery first appeared in the 4.3BSD Reno release; the combination is also called the Reno TCP congestion control algorithm.

Reno's fast retransmit is designed for the loss of a single segment. In practice, however, one loss event can require many segments to be retransmitted, so problems arise when several segments are lost from one window of data and fast retransmit and fast recovery are triggered. NewReno addresses this with a small modification to Reno's fast recovery so that it can recover from multiple losses in one window. Specifically, Reno leaves fast recovery as soon as an ACK for new data arrives, whereas NewReno stays in fast recovery until all segments in that window have been acknowledged, which further improves throughput.
SACK changes TCP's acknowledgment mechanism. Originally TCP acknowledged only data received contiguously; with SACK the receiver also reports what it has received out of order, so the sender no longer retransmits blindly. For example, if segments 1, 2, 3, 5, and 7 have been received, an ordinary ACK can only carry acknowledgment number 4, whereas the SACK option also tells the peer that 5 and 7 have arrived. When SACK is in use, the NewReno algorithm may be unnecessary, because the SACK information alone tells the sender which segments need to be retransmitted and which do not.
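The example with segments 1, 2, 3, 5, and 7 can be worked through in Python. This sketch numbers whole segments, as the text does, instead of bytes; the function names are illustrative, and each SACK block is written as (left edge, right edge) with the right edge being the number just past the block, following the RFC 2018 convention.

```python
def ack_and_sack(received, first=1):
    """Return the cumulative ACK number and the SACK blocks for a set of segments."""
    got = set(received)
    ack = first
    while ack in got:          # advance over the contiguous prefix
        ack += 1
    blocks = []
    for n in sorted(x for x in got if x > ack):
        if blocks and blocks[-1][1] == n:
            blocks[-1] = (blocks[-1][0], n + 1)   # extend the current block
        else:
            blocks.append((n, n + 1))             # start a new block
    return ack, blocks


def missing(ack, blocks):
    """Segments the sender can tell are missing from the ACK and SACK blocks."""
    holes = []
    edge = ack
    for left, right in blocks:
        holes.extend(range(edge, left))
        edge = right
    return holes


ack, blocks = ack_and_sack([1, 2, 3, 5, 7])
print(ack, blocks)            # cumulative ACK 4, blocks covering 5 and 7
print(missing(ack, blocks))   # segments 4 and 6 need retransmission
```

Without the blocks, the sender knows only that segment 4 is missing; with them, it also learns that 6 is missing and that 5 and 7 do not need to be sent again.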
(5) TCP application

A few days ago I talked with colleagues at my company who work on firewall rate limiting. Our new firewall implements rate limiting through the TCP window mechanism. QoS normally involves classification, metering, queuing, and scheduling, and limits the rate by buffering or dropping packets. It is better, though, to lower the actual end-to-end sending rate of TCP; otherwise the limit easily triggers a series of TCP congestion control reactions. Our new design modifies the advertised window in ACK segments to control the sending rate, so the sender itself slows down to the configured limit.
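Limiting the rate through the advertised window works because a sender can have at most one window of data in flight per round trip, so throughput is roughly window / RTT. The Python sketch below computes the window a middlebox would write into an ACK for a target rate. It is not the firewall's actual implementation: the function name, the one-MSS floor, and the example numbers are assumptions, and TCP window scaling is ignored.

```python
def clamp_window(advertised_window, rate_bytes_per_s, rtt_ms, mss=1460):
    """Return the window to advertise so that window / RTT stays near the limit."""
    # Bytes the sender may have in flight during one round trip at the target rate.
    target = rate_bytes_per_s * rtt_ms // 1000
    # Never go below one segment, so the connection cannot stall (assumption).
    target = max(target, mss)
    # Only ever reduce the window that the real receiver advertised.
    return min(advertised_window, target)


# 1,000,000 bytes/s (8 Mbit/s) over a 50 ms round trip.
print(clamp_window(advertised_window=65535, rate_bytes_per_s=1_000_000, rtt_ms=50))
# A receiver that already advertises less than the target is left alone.
print(clamp_window(advertised_window=16384, rate_bytes_per_s=1_000_000, rtt_ms=50))
```

Because the sender never sees loss or extra delay, its congestion control is not triggered; it simply runs out of window and waits for the next ACK.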
 
