How the TCP/IP protocol stack works
2010-11-14 21:05:42
Tags: TCP, congestion control, slow start, retransmission, recovery

This is an original work. Reprinting is allowed, but you must credit the original source, the author, and this statement with a hyperlink; otherwise legal liability will be pursued. http://jasonccie.blog.51cto.com/2143955/422966

TCP/IP is the core protocol suite of the Internet and underlies most network applications. This post briefly summarizes the TCP/IP questions I was asked in a previous interview. TCP is defined by RFC 793, RFC 1122, RFC 1323, RFC 2001, RFC 2018, and RFC 2581.

(1) TCP Overview

A. TCP provides a connection-oriented, full-duplex service. Every TCP segment belongs to a TCP connection identified by a source address, a destination address, a source port, and a destination port. A TCP connection is a resource that has to be set up, which is done with the handshake mechanism described later. UDP, by contrast, is a best-effort protocol: no connection resource is established, and any such state is usually handled by the application-layer protocol.

B. TCP provides a reliable service. An acknowledgement mechanism ensures that segments arrive. A checksum detects corrupted segments; it is a 16-bit ones' complement checksum rather than a CRC, and it is mandatory in TCP, whereas the UDP checksum is optional over IPv4. TCP reorders out-of-order segments and discards duplicate data. TCP provides flow control using a sliding window algorithm. TCP provides congestion control and recovery, for which there are several models. TCP also negotiates the maximum segment size (MSS) that each side may send.

TCP header:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Acknowledgment Number                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |           |U|A|P|R|S|F|                               |
| Offset| Reserved  |R|C|S|S|Y|I|            Window             |
|       |           |G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |         Urgent Pointer        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Data                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
TCP Header Format

Regarding the flag bits in the TCP header: the SYN flag is set only in the segments of the three-way handshake (or the four-segment handshake of a simultaneous open), and the ACK flag is set in every segment after the initial SYN. There are a few special cases; for example, some RST segments do not carry ACK. These rules can be useful when configuring complex ACLs.

(2) TCP protocol stack state machine (from RFC 793)

A. TCP connection establishment. A connection can be established in three ways: active open, passive open, and simultaneous open. The three-way handshake itself is well known. What deserves emphasis is the choice of the initial sequence number (ISN). The sequence number is 32 bits wide, and each operating system tends to follow its own recognizable pattern when choosing the ISN.

The maximum segment size (MSS) for the connection is also announced during the three-way handshake, and only in SYN segments: MSS = MTU - ip_header_len - tcp_header_len. For example, with an Ethernet MTU of 1500 bytes and 20-byte IP and TCP headers without options, MSS = 1500 - 20 - 20 = 1460 bytes. The MSS exists to avoid IP fragmentation and to improve bandwidth utilization.

The final ACK of the three-way handshake is not itself acknowledged. If it is dropped in the network, the TCP stack relies on other mechanisms to recover.

Besides the three-way handshake there is one special case: both ends open the connection at the same time by sending SYN to each other. RFC 793 handles this case with the transition from SYN_SENT to SYN_RCVD. For example, A connects from source port 7777 to port 8888 on B while B connects from source port 8888 to port 7777 on A.

B. TCP connection shutdown. Shutdown likewise has three cases: active close, passive close, and simultaneous close, each of which appears in the RFC 793 state machine.
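The state machine figure itself is not reproduced in this text. As a rough sketch only, the main RFC 793 transitions for opening and closing a connection can be written as a lookup table. The event names (`active_open`, `recv_syn_ack`, and so on) and the `run` helper are hypothetical labels chosen for this illustration, and the table leaves out resets, timeouts other than 2MSL, and several less common transitions.

```python
# Simplified subset of the RFC 793 connection state machine.
# Event names are hypothetical labels invented for this sketch.
TRANSITIONS = {
    ("CLOSED", "passive_open"): "LISTEN",
    ("CLOSED", "active_open"): "SYN_SENT",        # send SYN
    ("LISTEN", "recv_syn"): "SYN_RCVD",           # send SYN+ACK
    ("SYN_SENT", "recv_syn_ack"): "ESTABLISHED",  # send ACK
    ("SYN_SENT", "recv_syn"): "SYN_RCVD",         # simultaneous open
    ("SYN_RCVD", "recv_ack"): "ESTABLISHED",      # ACK of our SYN
    ("ESTABLISHED", "close"): "FIN_WAIT_1",       # active close, send FIN
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",    # passive close, send ACK
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",
    ("FIN_WAIT_1", "recv_fin"): "CLOSING",        # simultaneous close
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",      # send the final ACK
    ("CLOSING", "recv_ack"): "TIME_WAIT",
    ("CLOSE_WAIT", "close"): "LAST_ACK",          # send FIN
    ("LAST_ACK", "recv_ack"): "CLOSED",
    ("TIME_WAIT", "timeout_2msl"): "CLOSED",
}


def run(events, state="CLOSED"):
    """Apply a sequence of events and return the resulting state."""
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state


# Active open, active close, and simultaneous close.
print(run(["active_open", "recv_syn_ack"]))
print(run(["close", "recv_ack", "recv_fin"], "ESTABLISHED"))
print(run(["close", "recv_fin", "recv_ack"], "ESTABLISHED"))
```

Both the active closer and the two sides of a simultaneous close pass through TIME_WAIT in this table, while the passive closer goes through CLOSE_WAIT and LAST_ACK straight to CLOSED.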
Closing a TCP connection takes four segments because TCP is full duplex: the connection is fully torn down only after each direction has been closed separately.

In the state machine, the side that closes actively, and both sides of a simultaneous close, end up in the TIME_WAIT state. The last segment sent by the active closer is an ACK that acknowledges the peer's FIN. In TIME_WAIT the resources of the connection are not yet fully released, because the closer must make sure that this last ACK reaches the peer; if it is lost, the peer retransmits its FIN and the ACK has to be sent again. This wait lasts 2MSL, enough for one segment to travel in each direction. MSL is the maximum segment lifetime: 2 minutes in RFC 793, and typically 30 seconds or 1 minute in actual TCP implementations.

While a connection is in TIME_WAIT, its ports and connection resources therefore cannot be reused. Many TCP implementations relax this restriction, however, and accept a new connection as long as its initial sequence number is greater than the last sequence number used by the connection in TIME_WAIT. Implementations tend to choose new ISN = last sequence number of the TIME_WAIT connection + 128000.

The lifetime of an IP packet is bounded by its TTL, and the lifetime of a TCP segment by the MSL. Layer 2 has no notion of a maximum frame lifetime, which is why broadcast storms are possible there.

(3) TCP's sliding window and timers

A. TCP acknowledgement mechanism. TCP sends data using a sliding window, so it can send several segments back to back without waiting for the peer to acknowledge each one. Segments sent and acknowledgements received are therefore not in one-to-one correspondence. Acknowledgements are usually delayed: typically one ACK covers two data segments, provided the delay timer has not expired.
If the delay timer expires, the ACK is sent regardless. For applications that exchange large numbers of small packets, sending and acknowledging each one separately uses the network inefficiently, so TCP also supports the Nagle algorithm.

B. Delay timer. When TCP receives a segment it starts a delay timer, for example 200 ms.

C. Nagle algorithm. A TCP connection may have only one unacknowledged tiny segment outstanding (for example a 41-byte packet: 1 byte of data plus 40 bytes of headers). Until the acknowledgement arrives, TCP only collects further small writes, and when the acknowledgement arrives it sends them together in one segment. Some applications need to turn the Nagle algorithm off.

D. Sliding window mechanism.

Window closes (the left edge moves right): when data arrives from the peer and has been verified, it is stored in the receive buffer until the application reads it. Because the data has been verified, an ACK must be sent to the peer; because the application has not yet consumed the data, the window must close, that is, the left edge of the window slides to the right. Note that the acknowledgement number in the ACK is the next sequence number expected from the sender. The same acknowledgement number may be sent more than once, because later window openings are announced with it as well.

Window opens (the right edge moves right): after the window has closed, the TCP sliding window needs to grow again once the application reads data out of the buffer. The right edge of the window then moves to the right. The window is in effect a ring buffer, so the space gained at the right edge is the space the application has just freed. After the window has opened, an ACK notifies the peer; its acknowledgement number is the same as in the last acknowledgement.
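The closing and opening cases just described can be sketched as a tiny model of the receiver's window edges. This is an illustration under simplifying assumptions, not how any particular stack is written: the class name `ReceiveWindow` and its methods are hypothetical, data always arrives in order, and sequence number wraparound is ignored.

```python
class ReceiveWindow:
    """Tracks the two edges of a receiver's advertised window, in bytes."""

    def __init__(self, buffer_size, initial_seq=0):
        self.buffer_size = buffer_size
        self.left = initial_seq                 # next byte expected (ACK number)
        self.right = initial_seq + buffer_size  # first byte beyond the window

    @property
    def window(self):
        return self.right - self.left

    def receive(self, nbytes):
        """In-order data arrives: the left edge moves right, the window closes."""
        if nbytes > self.window:
            raise ValueError("data exceeds the advertised window")
        self.left += nbytes
        return self.left, self.window           # (ACK number, advertised window)

    def app_read(self, nbytes):
        """The application drains the buffer: the right edge moves right."""
        buffered = self.buffer_size - self.window
        nbytes = min(nbytes, buffered)
        self.right += nbytes
        return self.left, self.window           # window update, same ACK number


w = ReceiveWindow(4096)
print(w.receive(4096))   # window closes completely
print(w.app_read(2048))  # window reopens; the ACK number does not change
```

The second call returns the same acknowledgement number as the first with a larger window, which is the window update described above.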
Window shrinks (the right edge moves left): this is called window shrinking. The Host Requirements RFC strongly discourages it, but a TCP must be able to cope when its peer does it.

E. Retransmission timer. The retransmission timer is used while waiting for the peer's acknowledgement. If the data is still unacknowledged after several retransmissions, a reset (RST) is sent. Consider the TCP three-way handshake:

A (initiator) ---> SYN ---> B (server)
A (initiator) <--- SYN/ACK <--- B (server)
A (initiator) ---> ACK ---> B (server)

Suppose the final ACK from client A is lost and server B never receives it. A has already entered the ESTABLISHED state, but B is still in SYN_RCVD, so B retransmits its SYN/ACK until the connection is finally established. Because A is already in ESTABLISHED, it may start sending data to B. The state machines at the two ends of a TCP connection can therefore be temporarily inconsistent. Retransmission and congestion control are described in detail later.

F. Persist timer. ACK segments are not themselves acknowledged. When the receiver's window reopens from 0 and the ACK carrying the new window size is lost, the sender would never learn that the window has reopened. The sender therefore periodically sends a probe carrying one byte of data and reads the window size from the acknowledgement that the receiver sends back.

G. Keepalive timer. If one end of an idle TCP connection crashes, for example because of a hardware failure, the other end cannot tell. TCP has a keepalive mechanism to determine whether the peer is still alive. The design is controversial, and arguably the application layer should implement this function. As described in RFC 1122, keepalive is off by default. The relevant RFC text is quoted below.
Implementors MAY include "keep-alives" in their TCP implementations, although this practice is not universally accepted. If keep-alives are included, the application MUST be able to turn them on or off for each TCP connection, and they MUST default to off.

(4) TCP congestion control algorithms: slow start, congestion avoidance, fast retransmit, and fast recovery

There are four congestion control models: TCP Tahoe, TCP Reno, TCP NewReno, and TCP SACK. TCP Tahoe is one of the earliest and was proposed by Jacobson. Jacobson observed that TCP segments are lost for two reasons: packet corruption and network congestion. Networks at the time were mostly wired, where corruption is rare, so congestion was the main cause of segment loss. TCP Tahoe therefore tuned the original protocol for performance. In normal operation it detects loss through two mechanisms, expiry of the retransmission timer and receipt of duplicate acknowledgements (dupacks), and starts congestion control when loss is detected. During congestion control it uses the slow start and congestion avoidance algorithms to regulate the sending rate. TCP Reno, which appeared in 1990, added the fast retransmit and fast recovery algorithms, which avoid falling back to slow start, and shrinking the send window too far, when congestion is not severe. TCP congestion control thus consists mainly of these four core algorithms.
A. Timeout and retransmission: measuring the RTT and calculating the RTO
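The heading above names the calculation without spelling it out. As a hedged sketch, one widely used estimator is the one standardized in RFC 6298, which is an assumption on my part since that RFC is not among those this post cites. It keeps a smoothed RTT (SRTT) and an RTT variance (RTTVAR), updated from each RTT sample R with gains alpha = 1/8 and beta = 1/4, and sets RTO = SRTT + max(G, 4 * RTTVAR), where G is the clock granularity. The class and method names below are hypothetical.

```python
class RtoEstimator:
    """RTO estimation in the style of RFC 6298 (times in seconds)."""

    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variance
    K = 4           # variance multiplier

    def __init__(self, granularity=0.001, min_rto=1.0):
        self.granularity = granularity
        self.min_rto = min_rto   # RFC 6298 suggests rounding the RTO up to 1 s
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0           # initial RTO before any RTT sample exists

    def sample(self, r):
        """Feed one RTT measurement and return the new RTO."""
        if self.srtt is None:
            # First measurement.
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        rto = self.srtt + max(self.granularity, self.K * self.rttvar)
        self.rto = max(self.min_rto, rto)
        return self.rto

    def on_timeout(self):
        """Exponential backoff: double the RTO when the timer expires."""
        self.rto *= 2
        return self.rto


est = RtoEstimator(min_rto=0.0)  # floor disabled to show the raw formula
print(est.sample(0.100))
print(est.sample(0.200))
```

With the default 1-second floor, a path with a 100 ms RTT still gets an RTO of 1 second, and each timeout doubles it.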
B. Slow start and congestion avoidance. The slow start algorithm ensures that the rate at which a TCP sender injects segments matches the rate at which acknowledgements come back, which makes TCP usable over low-speed WAN links. To implement slow start, each TCP connection gets an additional window, the congestion window cwnd, initialized to one segment (counted not in bytes but in units of the TCP maximum segment size, MSS). Each direction of a TCP connection therefore has two windows: the receive window, which is the receiver's flow control, and the congestion window, which is the sender's own flow control. The sender may send up to the smaller of the two.

Slow start is exponential: cwnd starts at 1; when the ACK for that segment arrives, cwnd grows to 2; when the two ACKs for those segments arrive, cwnd grows to 4, then 8, and so on.

The purpose of congestion avoidance is to prevent timeouts and drops at intermediate routers caused by network congestion. It needs two variables: the congestion window cwnd and the slow start threshold ssthresh. For a new connection, cwnd is 1 and ssthresh is 65535. When congestion is detected (by a timeout or by duplicate acknowledgements), ssthresh is set to half of the smaller of cwnd and the receive window; if the congestion was signalled by a timeout, cwnd is also set to 1.

Congestion avoidance: when cwnd is greater than ssthresh, each acknowledgement of a data segment sets cwnd = cwnd + 1/cwnd, with cwnd still measured in MSS units, so cwnd grows by about one segment per round trip. Congestion avoidance is in practice used together with slow start. cwnd and ssthresh are dynamic values, even though they start at 1 and 65535.
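The two growth rules above can be put side by side in a short sketch. It counts cwnd in segments and makes a simplifying assumption that every segment in the current window is acknowledged exactly once per round trip, with no loss. The function names are hypothetical, and the floor of two segments for ssthresh follows RFC 2001 rather than anything stated in this post.

```python
def on_ack(cwnd, ssthresh):
    """Return the new cwnd (in segments) after one ACK for new data."""
    if cwnd <= ssthresh:
        return cwnd + 1        # slow start: +1 per ACK, doubles each RTT
    return cwnd + 1 / cwnd     # congestion avoidance: about +1 per RTT


def on_timeout(cwnd, rwnd):
    """Return (cwnd, ssthresh) after a retransmission timeout."""
    ssthresh = max(min(cwnd, rwnd) / 2, 2)   # at least two segments
    return 1, ssthresh


def simulate_rtts(cwnd, ssthresh, rounds):
    """Record cwnd at the end of each round trip, assuming no loss."""
    history = [cwnd]
    for _ in range(rounds):
        for _ in range(int(cwnd)):           # one ACK per segment in flight
            cwnd = on_ack(cwnd, ssthresh)
        history.append(cwnd)
    return history


history = simulate_rtts(cwnd=1, ssthresh=8, rounds=6)
print([round(c, 2) for c in history])
```

The first entries double each round trip; once cwnd passes ssthresh the growth flattens to roughly one segment per round trip.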
When congestion actually occurs, whether signalled by a timeout or by duplicate ACKs, ssthresh is set to half of the smaller of cwnd and the receive window. In Tahoe, cwnd then drops to 1 and slow start runs until cwnd exceeds ssthresh, after which congestion avoidance takes over. The sending rate grows during both slow start and congestion avoidance; the former grows exponentially and the latter linearly.

C. Fast retransmit and fast recovery. Two situations on a TCP connection produce duplicate ACKs: out-of-order segments and packet loss.

Fast retransmit: when the sender receives three duplicate ACKs, it does not enter slow start but immediately retransmits the missing segment. A receiver sends a duplicate ACK only when it has received another segment, so duplicate ACKs show that data is still flowing on the connection, and dropping back to slow start should be avoided.

Fast recovery: In the first step, when the third duplicate ACK arrives, ssthresh is set to half of the current cwnd and the missing segment is retransmitted. cwnd is then set to ssthresh plus three segments (cwnd = cwnd/2 + 3). In the second step, each additional duplicate ACK increases cwnd by 1, and a segment is sent if the new window allows. In the third step, when the next ACK that acknowledges new data arrives, cwnd is set to the ssthresh value from the first step. This ACK should acknowledge the retransmitted segment as well as the segments sent between the lost segment and the first duplicate ACK. The net effect is that three duplicate ACKs halve the sending rate.

The fast retransmit algorithm first appeared in the Tahoe release of 4.3BSD and fast recovery in the Reno release of 4.3BSD, hence the name Reno for this version of the TCP congestion control algorithm.
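The three steps of fast recovery can be traced with a small sketch of a Reno-style sender. It is a simplified model, not a real stack: cwnd is counted in segments, actual transmission is reduced to a retransmission counter, timeouts are not modelled, and the class name `RenoSender` and its methods are hypothetical.

```python
class RenoSender:
    """Window bookkeeping for Reno fast retransmit and fast recovery."""

    DUPACK_THRESHOLD = 3

    def __init__(self, cwnd, ssthresh):
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.dupacks = 0
        self.in_recovery = False
        self.retransmissions = 0

    def on_dup_ack(self):
        self.dupacks += 1
        if self.in_recovery:
            # Step 2: each further duplicate ACK inflates the window by one.
            self.cwnd += 1
        elif self.dupacks == self.DUPACK_THRESHOLD:
            # Step 1: halve ssthresh, retransmit, set cwnd = ssthresh + 3.
            self.ssthresh = max(self.cwnd / 2, 2)
            self.retransmissions += 1
            self.cwnd = self.ssthresh + 3
            self.in_recovery = True

    def on_new_ack(self):
        self.dupacks = 0
        if self.in_recovery:
            # Step 3: deflate the window back to ssthresh.
            self.cwnd = self.ssthresh
            self.in_recovery = False
        elif self.cwnd <= self.ssthresh:
            self.cwnd += 1               # slow start
        else:
            self.cwnd += 1 / self.cwnd   # congestion avoidance


s = RenoSender(cwnd=16, ssthresh=64)
for _ in range(4):
    s.on_dup_ack()
print(s.cwnd, s.ssthresh)   # inflated window during recovery
s.on_new_ack()
print(s.cwnd)               # half of the original window
```

Starting from a window of 16 segments, the sender ends at 8 without ever returning to a window of 1, which is the halving described above.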
As can be seen, Reno's fast retransmit repairs a single lost segment. In practice, however, a retransmission timeout can cause many segments to be retransmitted, so a problem arises when several segments from the same window are lost and trigger fast retransmit and fast recovery. NewReno was introduced for this reason. It modifies Reno's fast recovery slightly so that several segments lost within one window can be recovered. Specifically, Reno leaves fast recovery as soon as an ACK for new data arrives, whereas NewReno stays in fast recovery until every segment that was outstanding in the window has been acknowledged, which improves throughput further.

SACK changes TCP's acknowledgement mechanism. Originally TCP acknowledged only data received contiguously; with SACK the receiver also tells the sender about the out-of-order data it holds, which makes the sender's retransmissions less blind. For example, if segments 1, 2, 3, 5, and 7 have been received, an ordinary ACK can only acknowledge up to segment 3 (acknowledgement number 4), whereas SACK reports in the SACK option that 5 and 7 have also arrived, which improves performance. When SACK is in use, the NewReno algorithm is not needed, because the SACK information by itself tells the sender which segments need to be retransmitted and which do not.

(5) TCP applications

A few days ago I was chatting with a colleague who works on firewall rate limiting, and learned that the rate limiting in our company's new firewall is built on the TCP window mechanism. As is well known, QoS offers classification, rate limiting, queueing, and scheduling, mostly hardware-based algorithms that limit the rate by buffering or dropping packets. It is better still to lower the true end-to-end TCP sending rate; otherwise the drops easily set off TCP's whole series of congestion control actions.
Our new software design instead controls the sending rate by modifying the advertised window size in segments travelling in the ACK direction, which reduces the sender's sending speed.
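The post gives no details of the firewall's implementation, so the following is only a back-of-the-envelope illustration of why rewriting the advertised window limits the rate. A sender can have at most one window of data in flight per round trip, so throughput is bounded by window / RTT, and a target rate translates into a window of rate * RTT. The function name, its parameters, and the one-MSS floor are all assumptions made for this sketch; window scaling and the checksum update a real middlebox would need are ignored.

```python
def clamp_window(advertised_window, target_rate_bps, rtt_ms, mss=1460):
    """Return the window (in bytes) to write into a segment in the ACK direction.

    advertised_window: window the receiver originally advertised, in bytes
    target_rate_bps:   desired sending rate, in bits per second
    rtt_ms:            estimated round-trip time, in milliseconds
    """
    # Bytes that may be in flight per RTT at the target rate:
    # (bits/s / 8) * (ms / 1000) = bits/s * ms / 8000
    limit = target_rate_bps * rtt_ms // 8000
    limit = max(limit, mss)   # keep at least one full segment moving
    # The window field is 16 bits wide when window scaling is not used.
    return min(advertised_window, limit, 65535)


# Limit a flow to about 1 Mbit/s on a path with a 100 ms RTT.
print(clamp_window(65535, 1_000_000, 100))
```

The clamp only ever lowers the window, so a receiver that advertises less than the limit is left untouched.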
This article is from the "Jasonccie" blog. Please keep this source link: http://jasonccie.blog.51cto.com/2143955/422966