[Reprint] Those Things About TCP (Part 1)

Last Update:2015-06-09 Source: Internet

Author: User

Original: http://coolshell.cn/articles/11564.html

TCP is a complex protocol because it has to solve many problems, and those problems bring out many sub-problems and dark corners. Learning TCP is therefore a fairly painful process, but one that pays off well. For the details of the protocol, I still recommend W. Richard Stevens's "TCP/IP Illustrated, Volume 1: The Protocols" (you can of course also read RFC 793 and the many RFCs that followed it). I will use the English terminology throughout this article so that you can find the relevant technical documents by searching for these keywords.

I had three purposes in writing this article:

One is to practice whether I can describe a protocol as complex as TCP in a short space.
Another is that many programmers today rarely read books seriously and prefer a fast-food culture, so I hope this fast-food article lets you learn the classic techniques of TCP, appreciate the difficulties of software design, and take away some design lessons of your own.
The most important is the hope that these basics clear up many things you previously only half understood, and make you aware of how important the basics are.

This article is therefore not exhaustive; it is only a popular introduction to the TCP protocol, its algorithms, and its principles.

I originally wanted to write a single article, but TCP is truly complicated, more so than C++, and over more than 30 years its many optimizations and variants have been debated and revised. As I kept writing, I found I had to split it into two parts.

In this first part, we mainly cover the definition of the TCP protocol and the retransmission mechanism for lost packets.
In the next part, we focus on TCP's flow control and congestion handling.

Before anything else, we need to know that in the seven-layer OSI model TCP sits at layer 4, the transport layer; IP sits at layer 3, the network layer; and ARP sits at layer 2, the data link layer. Data at layer 2 is called a frame, data at layer 3 is called a packet, and data at layer 4 is called a segment.

We also need to know that our program's data is first placed into a TCP segment, the TCP segment is then placed into an IP packet, and the IP packet is placed into an Ethernet frame. On the receiving end, each layer parses its own protocol and then hands the data up to the next higher layer.

TCP Header Format

Next, let's look at the format of the TCP header.

[Figure: TCP header format]

You need to pay attention to these points:

A TCP packet carries no IP addresses; those belong to the IP layer. It does carry a source port and a destination port.
A TCP connection is identified by a four-tuple (src_ip, src_port, dst_ip, dst_port). Strictly speaking it is a five-tuple, because the protocol is part of it too, but since we are only discussing TCP here I will just say four-tuple.
Note four very important fields:
- Sequence Number is the sequence number of the packet, used to solve the problem of packets arriving out of order (reordering).
- Acknowledgement Number is the ACK, used to confirm receipt and so solve the problem of lost packets.
- Window, also called the Advertised Window, is the famous sliding window, used for flow control.
- TCP Flags give the type of the packet and are mainly used to drive TCP's state machine.
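
To make the four fields above concrete, here is a minimal Python sketch that decodes the fixed 20-byte part of a TCP header with the standard `struct` module. The function name `parse_tcp_header` and the sample SYN segment (ports, ISN, window) are made up for illustration; options and checksum verification are deliberately left out.

```python
import struct

def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte part of a TCP header (RFC 793 layout)."""
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    # The six classic flag bits, lowest bit first.
    flag_names = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]
    flags = [name for bit, name in enumerate(flag_names) if offset_flags & (1 << bit)]
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,                               # Sequence Number
        "ack": ack,                               # Acknowledgement Number
        "header_len": (offset_flags >> 12) * 4,   # data offset counts 32-bit words
        "flags": flags,                           # TCP Flags
        "window": window,                         # Advertised Window
    }

# Hypothetical SYN segment: port 54321 -> 80, ISN 1000, window 65535.
syn = struct.pack("!HHIIHHHH", 54321, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
header = parse_tcp_header(syn)
print(header)
```

Note that nothing in the decoded header is an IP address, which matches the first point above: the addresses live in the IP header that wraps the segment.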

For the other fields, refer to the illustration below.

[Figure: other TCP header fields]

State Machine for TCP

In fact, transmission over the network is connectionless, and TCP is no exception. TCP's so-called "connection" is really just a "connection state" maintained by the two communicating sides, which makes it look as if a connection exists. That is why TCP's state transitions are so important.

Below are the TCP state machine diagram and a side-by-side diagram of TCP connection setup, connection teardown, and data transfer. I have placed the two diagrams next to each other so that you can compare them easily. These two diagrams are very, very important, and you should commit them to memory. (A gripe: one look at a state machine this complex tells you how complex the protocol is, and complex things always come with plenty of pitfalls, so TCP can be quite a trap.)

[Figures: TCP state machine; TCP connection setup, teardown, and data transfer]

Many people ask why setting up a connection takes a 3-way handshake while tearing it down takes a 4-way wave.

For the 3-way handshake, the main job is to initialize the Sequence Number. The two sides must tell each other their Initial Sequence Number (ISN), which is why the packet is called SYN, short for Synchronize Sequence Numbers. These are the x and y in the diagram. The number is then used as the starting sequence number for the data exchanged afterwards, so that the data the application layer receives is not scrambled by reordering on the network (TCP uses the sequence numbers to reassemble the data).

The 4-way wave is really two waves if you look closely. Because TCP is full-duplex, the sender and the receiver each need their own FIN and ACK. It looks like four steps only because one side is passive. If both sides close at the same time, each enters the CLOSING state and then reaches TIME_WAIT. The figure below shows both sides closing at the same time (you can also check this against the TCP state machine):

[Figure: simultaneous close at both ends]
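
The teardown paths described above can be sketched as a small lookup table. This is a simplified model of only the closing part of the RFC 793 state machine; the event names (`close`, `recv_fin`, `recv_ack`, `timeout_2msl`) are labels chosen for this sketch, not identifiers from any real TCP stack.

```python
# (state, event) -> next state, for the connection-teardown part of RFC 793.
TRANSITIONS = {
    ("ESTABLISHED", "close"): "FIN_WAIT_1",      # we send FIN first (active close)
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",
    ("FIN_WAIT_1", "recv_fin"): "CLOSING",       # both sides sent FIN at the same time
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",
    ("CLOSING", "recv_ack"): "TIME_WAIT",
    ("TIME_WAIT", "timeout_2msl"): "CLOSED",
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",   # peer closed first (passive close)
    ("CLOSE_WAIT", "close"): "LAST_ACK",         # we send our own FIN
    ("LAST_ACK", "recv_ack"): "CLOSED",
}

def run(events, state="ESTABLISHED"):
    """Return the list of states visited while applying the events in order."""
    path = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        path.append(state)
    return path

active_close = run(["close", "recv_ack", "recv_fin", "timeout_2msl"])
passive_close = run(["recv_fin", "close", "recv_ack"])
simultaneous_close = run(["close", "recv_fin", "recv_ack", "timeout_2msl"])
print(simultaneous_close)
```

In this model only the side that sends the first FIN (or both sides, in a simultaneous close) passes through TIME_WAIT; the passive side goes from LAST_ACK straight to CLOSED.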

In addition, there are a few things to note:

On the SYN timeout during connection setup. Suppose the server receives a SYN from the client and replies with a SYN-ACK, and then the client goes offline. The server never receives the client's ACK, so the connection sits in an intermediate state: it has neither succeeded nor failed. If the server receives nothing within a certain time, TCP resends the SYN-ACK. On Linux the default number of retries is 5, and the retry interval starts at 1s and doubles each time, so the five retries are sent after 1s, 2s, 4s, 8s, and 16s, 31s in total. After the fifth retry the server still has to wait 32s before it knows that the fifth one has timed out too. In total it takes 1s + 2s + 4s + 8s + 16s + 32s = 2^6 - 1 = 63s before TCP drops the connection.
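
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the doubling schedule described in the text (5 retries, first interval 1s); the function name `synack_timeline` is made up, and real kernels may clamp or jitter these timers.

```python
def synack_timeline(retries=5, first_interval=1):
    """Waits (in seconds) before each retry, plus the final wait after the last retry."""
    waits = [first_interval * 2 ** i for i in range(retries + 1)]
    return waits, sum(waits)

waits, total = synack_timeline()
print(waits, total)  # [1, 2, 4, 8, 16, 32] 63
```

Lowering the retry count shortens the wait quickly: with 2 retries the same schedule gives 1 + 2 + 4 = 7 seconds.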

On SYN flood attacks. Some malicious people built the SYN flood attack on exactly this behavior: the attacker sends a SYN to the server and then goes offline, so the server waits the default 63s before dropping the connection. This lets the attacker exhaust the server's SYN queue so that legitimate connection requests cannot be processed. Linux therefore provides a parameter called tcp_syncookies to deal with this. When the SYN queue is full, TCP builds a special Sequence Number from the source address and port, the destination address and port, and a timestamp, and sends it back (this is the SYN cookie). An attacker will not respond to it, while a legitimate client will send the SYN cookie back, and the server can then establish the connection from the cookie even though the connection is not in the SYN queue. Please note: do not use tcp_syncookies to handle legitimately heavy connection load, because SYN cookies are a compromise that does not strictly follow the TCP protocol. For legitimate requests there are three TCP parameters you can tune instead. The first is tcp_synack_retries, which reduces the number of retries. The second is tcp_max_syn_backlog, which increases the number of SYN connections that can be queued. The third is tcp_abort_on_overflow, which simply refuses connections outright when the server cannot keep up.

On ISN initialization. The ISN cannot be hard-coded, or things go wrong. For example, suppose every connection always used 1 as its ISN. The client sends 30 segments, then the network breaks, so the client reconnects and again uses 1 as the ISN. Packets from the previous connection that arrive late are then treated as packets of the new connection. At that point the client's Sequence Number may be 3 while the server believes it is 30, and everything is a mess. RFC 793 says the ISN is tied to a fictitious clock that increments the ISN every 4 microseconds and wraps back to 0 once it passes 2^32, and it gives the length of one ISN cycle as about 4.55 hours. Because we assume that a TCP segment survives on the network no longer than the Maximum Segment Lifetime (MSL, see Wikipedia), as long as the MSL is smaller than 4.55 hours an ISN will not be reused while old segments are still alive.
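
As a quick sanity check of the ISN clock, the sketch below computes how long a 32-bit counter that ticks every 4 microseconds takes to wrap. Note that the plain arithmetic gives roughly 4.77 hours, slightly more than the 4.55 hours quoted in the RFC 793 text; either way the cycle is far longer than an MSL of 2 minutes. The helper `isn_at` is a hypothetical name for this illustration.

```python
TICK_MICROSECONDS = 4      # the ISN clock advances once every 4 microseconds
SEQ_SPACE = 2 ** 32        # sequence numbers are 32-bit and wrap around

cycle_seconds = SEQ_SPACE * TICK_MICROSECONDS / 1_000_000
cycle_hours = cycle_seconds / 3600

def isn_at(elapsed_microseconds: int) -> int:
    """ISN produced by the fictitious clock after the given elapsed time."""
    return (elapsed_microseconds // TICK_MICROSECONDS) % SEQ_SPACE

msl_seconds = 120  # the RFC 793 MSL of 2 minutes
print(round(cycle_hours, 2), cycle_seconds > msl_seconds)
```

For instance, `isn_at(40)` is 10, and the counter is back at 0 exactly `SEQ_SPACE * 4` microseconds after it started.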

On MSL and TIME_WAIT. From the description of the ISN above, you know where the MSL comes from. In the TCP state diagram there is a timeout on the transition from TIME_WAIT to CLOSED, set to 2*MSL (RFC 793 defines the MSL as 2 minutes; Linux uses 30s). Why is there a TIME_WAIT state at all? Why not go straight to CLOSED? There are two main reasons: 1) TIME_WAIT gives the peer enough time to receive the final ACK. If the passively closing side does not receive the ACK, it resends its FIN, and one round trip of that is exactly 2 MSL. 2) It leaves enough time so that this connection does not get mixed up with a later one (some routers take it upon themselves to cache IP packets, and if the connection is reused, those delayed packets may get mixed into the new connection). You can read the article "TIME_WAIT and its design implications for protocols and scalable client server systems".

On having too many TIME_WAIT connections. From the description above we know that TIME_WAIT is a very important state, but under highly concurrent short-lived connections there will be too many TIME_WAIT sockets, and they consume a lot of system resources. If you search the web, nine out of ten answers tell you to set two parameters, tcp_tw_reuse and tcp_tw_recycle. Both are off by default. recycle is the more aggressive of the two and reuse is gentler. Also, tcp_tw_reuse only takes effect if tcp_timestamps=1 is set. Be aware that turning these two parameters on opens a rather large pitfall: TCP connections may run into strange problems (because, as described above, if a connection is reused without waiting for the timeout, the new connection may fail to be established. As the official documentation says, "It should not be changed without advice/request of technical experts").

On tcp_tw_reuse. The official documentation says that tcp_tw_reuse together with tcp_timestamps (also known as PAWS, Protection Against Wrapped Sequence Numbers) keeps things safe at the protocol level, but tcp_timestamps must be enabled on both sides (you can read the source code of tcp_twsk_unique). My personal guess is that there are still some scenarios where it can go wrong.

On tcp_tw_recycle. When tcp_tw_recycle is turned on, it assumes that the peer has tcp_timestamps enabled and compares timestamps; if the timestamp has increased, the connection can be reused. However, if the peer is behind NAT (for example, a company that reaches the public network through a single IP address), or the peer's IP address has been reused by another host, things get complicated. The SYN that sets up the connection may simply be dropped (you may see a connection timed out error). If you want to look at the Linux kernel code, see the function tcp_timewait_state_process.

On tcp_max_tw_buckets. This controls the maximum number of concurrent TIME_WAIT sockets. The default is 180000. If the limit is exceeded, the system destroys the excess and writes a warning to the log (such as: time wait bucket table overflow). The official documentation says this parameter exists to defend against DDoS attacks, and also says that the default of 180000 is not small. You still need to judge this against your actual situation.

Again, using tcp_tw_reuse and tcp_tw_recycle to solve the TIME_WAIT problem is very, very dangerous, because these two parameters violate the TCP protocol (RFC 1122).

In fact, TIME_WAIT means that you were the side that actively closed the connection, so this is a case of "if you don't ask for trouble, you won't get it". If you let the peer close the connection instead, the problem becomes the peer's. In addition, if your server is an HTTP server, it is very important to set up HTTP KeepAlive (so the browser reuses one TCP connection for multiple HTTP requests) and then let the client close the connection (be careful, though: browsers can be very greedy and will not actively close a connection unless they absolutely have to).

Sequence Number in data transfer

The figure below is a capture I took with Wireshark while visiting coolshell.cn, showing how SeqNum changes during data transfer (use Statistics -> Flow Graph... in the Wireshark menu).

[Figure: Wireshark flow graph of the capture]

As you can see, the increase in SeqNum depends on the number of bytes transferred. In the capture, after the three-way handshake, two packets with Len: 1440 arrive, and the SeqNum of the second packet is 1441. The first ACK that comes back is 1441, indicating that the first 1440 bytes were received.
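
The numbers in the capture can be reproduced with a short sketch. It assumes the relative SeqNum is 1 once the handshake has completed, and that each segment is acknowledged with the number of the next byte expected; the function name `walk_sequence` is made up for this illustration.

```python
def walk_sequence(start_seq, payload_lengths):
    """For each segment, return (seq, length, ack the receiver sends back)."""
    rows = []
    seq = start_seq
    for length in payload_lengths:
        expected_ack = seq + length   # ACK names the next byte the receiver wants
        rows.append((seq, length, expected_ack))
        seq = expected_ack            # the next segment starts where this one ended
    return rows

rows = walk_sequence(1, [1440, 1440])
for seq, length, ack in rows:
    print(f"seq={seq} len={length} -> ack={ack}")
```

The first row gives ack=1441, which is also the SeqNum of the second segment, matching the capture described above.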

Note: If you look at the three-way handshake in a Wireshark capture, you will find that SeqNum is always 0. That is not the real value: to be friendlier to read, Wireshark displays a Relative SeqNum. To see the Absolute SeqNum, turn that option off under Protocol Preferences in the right-click menu.

TCP retransmission mechanism

TCP has to guarantee that every packet arrives, so it needs a retransmission mechanism.

Note that the ACK from the receiver only acknowledges the last contiguous packet. For example, the sender sends five pieces of data, 1 through 5. The receiver gets 1 and 2 and replies with ACK 3, then gets 4 (note that 3 has not arrived yet). What does TCP do now? As mentioned above, SeqNum and ACK count bytes, so an ACK cannot skip ahead: it can only acknowledge the largest contiguous run of data received. Otherwise the sender would assume that everything before the acknowledged number had arrived.
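
A minimal sketch of this cumulative-ACK rule follows. Like the text, it numbers whole packets 1, 2, 3, ... instead of bytes, and the ACK value is the next packet the receiver expects; `cumulative_acks` is a hypothetical helper, not part of any TCP stack.

```python
def cumulative_acks(arrivals):
    """Return the ACK sent after each arriving packet (ACK = next packet expected)."""
    received = set()
    next_expected = 1
    acks = []
    for packet in arrivals:
        received.add(packet)
        # Advance only over a contiguous run; a gap stops the ACK from moving.
        while next_expected in received:
            next_expected += 1
        acks.append(next_expected)
    return acks

acks = cumulative_acks([1, 2, 4])
print(acks)  # [2, 3, 3]
```

Packet 4 arriving does not move the ACK past 3. Once the gap is filled, the ACK jumps over everything already buffered: `cumulative_acks([1, 2, 4, 5, 3])` ends with 6.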

Timeout retransmission mechanism

One option is not to ACK at all and wait stubbornly for 3. When the sender notices that the ACK for 3 has timed out, it retransmits 3. Once the receiver gets 3, it replies with ACK 5 (the next packet it expects), meaning that both 3 and 4 have been received.

However, this approach has a serious problem. Because the receiver is stubbornly waiting for 3, even if 4 and 5 have already arrived, the sender has no idea what happened to them: it has received no ACK for them, so it may pessimistically assume they were lost too, which may lead to 4 and 5 being retransmitted as well.

There are two options here:

One is to retransmit only the packet that timed out, that is, packet 3.
The other is to retransmit everything from the timed-out packet onward, that is, packets 3, 4, and 5.

Both approaches have pros and cons. The first saves bandwidth but is slow; the second is faster but wastes bandwidth and may do useless work. Overall, neither is good, because both have to wait for the timeout, and the timeout may be long (the next part explains how TCP computes the timeout dynamically).

Fast retransmission mechanism

TCP therefore introduced an algorithm called Fast Retransmit, which is driven by data rather than by time. If packets arrive out of order, the receiver keeps ACKing the last packet that may have been lost, and if the sender receives the same ACK three times in a row, it retransmits. The advantage of Fast Retransmit is that it does not wait for the timeout before retransmitting.

For example, suppose the sender sends data 1, 2, 3, 4, 5. Packet 1 arrives first, so the receiver replies with ACK 2. Packet 2 is not received for some reason, and 3 arrives, so the receiver still replies with ACK 2. Then 4 and 5 arrive, and the reply is still ACK 2, because 2 is still missing. The sender has now received three duplicate ACKs with ack=2 and knows that 2 has not arrived, so it retransmits 2 immediately. The receiver then gets 2, and because 3, 4, and 5 have all been received by now, it replies with ACK 6.

[Figure: Fast Retransmit example]
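
The example can be replayed with a small simulation. It is a deliberately simplified sketch: packets are numbered 1, 2, 3, ... rather than by byte, there is no latency, and the retransmitted packet is assumed to arrive immediately. The names `simulate_fast_retransmit` and `dup_threshold` are made up for this illustration.

```python
def simulate_fast_retransmit(arrivals, dup_threshold=3):
    """Replay packet arrivals; retransmit once an ACK has been repeated dup_threshold times."""
    received = set()
    next_expected = 1
    last_ack = None
    dup_acks = 0
    log = []

    def deliver(packet):
        """Receiver side: store the packet and return the cumulative ACK."""
        nonlocal next_expected
        received.add(packet)
        while next_expected in received:
            next_expected += 1
        return next_expected

    for packet in arrivals:
        ack = deliver(packet)
        log.append(("ack", ack))
        # Sender side: count how often the same ACK is repeated.
        dup_acks = dup_acks + 1 if ack == last_ack else 0
        last_ack = ack
        if dup_acks == dup_threshold:
            log.append(("retransmit", ack))
            last_ack = deliver(ack)   # assume the retransmitted copy arrives at once
            log.append(("ack", last_ack))
            dup_acks = 0
    return log

log = simulate_fast_retransmit([1, 3, 4, 5])
print(log)
```

Packet 2 never shows up in the arrival list, so ACK 2 is sent for packet 1 and then repeated for 3, 4, and 5; the third repeat triggers the retransmission, after which the ACK jumps to 6.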

Fast Retransmit solves only one problem, the timeout problem. It still faces the same difficult choice: retransmit just the one missing packet, or retransmit everything after it? In the example above, should the sender retransmit #2 only, or #2, #3, #4, and #5? The sender does not know which packets triggered the three consecutive ACK(2)s. Perhaps the sender has sent 20 pieces of data and the ACKs were triggered by, say, #10 and #20. In that case the sender may well retransmit the whole pile from #2 to #20 (which is what some TCP implementations actually do). Clearly, this is a double-edged sword.

SACK method

A better approach is Selective Acknowledgment (SACK) (see RFC 2018). It adds a SACK option to the TCP header: the ACK field is still the cumulative ACK used by Fast Retransmit, while SACK reports the non-contiguous blocks of data that have been received.

[Figure: SACK example]

With this, the sender can tell from the SACK which data has arrived and which has not, which improves on the Fast Retransmit algorithm. Of course, this has to be supported on both sides. On Linux the feature is enabled with the tcp_sack parameter (on by default since Linux 2.4).
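
To illustrate what the receiver reports, the sketch below derives the cumulative ACK and the SACK blocks from a set of received byte ranges. It is an illustration only: ranges are half-open `[start, end)`, the byte values are hypothetical, and it ignores details of the real option such as the ordering of blocks and the limit on how many blocks fit in a header. `ack_and_sack` is a made-up name.

```python
def ack_and_sack(received_ranges, initial_ack):
    """Return (cumulative ACK, SACK blocks) for half-open byte ranges [start, end)."""
    # Merge overlapping or adjacent ranges into maximal contiguous blocks.
    merged = []
    for start, end in sorted(received_ranges):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    ack = initial_ack
    sack_blocks = []
    for start, end in merged:
        if start <= ack:
            ack = max(ack, end)               # contiguous with the ACK point: advance it
        else:
            sack_blocks.append((start, end))  # beyond a hole: report it via SACK
    return ack, sack_blocks

# Bytes 1000-1499 are missing, everything around them has arrived.
result = ack_and_sack([(500, 1000), (1500, 2000), (2000, 2500)], initial_ack=500)
print(result)  # (1000, [(1500, 2500)])
```

From this reply the sender can see that only the hole between 1000 and 1500 needs retransmitting. Once that range arrives, the same function returns an ACK of 2500 and no SACK blocks.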

There is one more issue to be aware of: receiver reneging. Reneging means that the receiver has the right to discard data it has already reported to the sender in a SACK. Doing so is discouraged because it complicates matters, but the receiver may do it in extreme situations, for example when it has to give the memory to something more important. The sender therefore cannot rely on SACK alone. It still has to rely on the ACK and maintain its timeout: if later ACKs do not advance, the SACKed data still has to be retransmitted. Likewise, data that has only been SACKed must never be marked as ACKed.

Note: SACK costs the sender resources. Imagine an attacker sending a pile of SACK options to the data sender: this makes the sender start retransmitting, or even walk through the data it has already sent, which consumes a lot of the sender's resources. See "TCP SACK Performance Tradeoffs" for details.

Duplicate SACK: the problem of receiving duplicate data

Duplicate SACK, also called D-SACK, uses SACK to tell the sender which data has been received more than once. RFC 2883 has a detailed description and examples. A few of those examples (from RFC 2883) follow.

D-SACK uses the first block of the SACK as its marker:

If the range of the first SACK block is covered by the ACK, it is a D-SACK.

If the range of the first SACK block is covered by the second SACK block, it is a D-SACK.
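
The two rules above translate almost directly into code. The following is a minimal sketch, assuming SACK blocks are given as half-open `(start, end)` byte ranges in the order they appear in the option; `is_dsack` is a hypothetical helper, and the sample values are chosen only to exercise each rule.

```python
def is_dsack(ack, sack_blocks):
    """Apply the two D-SACK rules to a cumulative ACK and its SACK blocks."""
    if not sack_blocks:
        return False
    first_start, first_end = sack_blocks[0]
    # Rule 1: the first block lies entirely below the cumulative ACK.
    if first_end <= ack:
        return True
    # Rule 2: the first block lies entirely inside the second block.
    if len(sack_blocks) > 1:
        second_start, second_end = sack_blocks[1]
        return second_start <= first_start and first_end <= second_end
    return False

print(is_dsack(4000, [(3000, 3500)]))                 # True: covered by the ACK
print(is_dsack(1000, [(1500, 2000)]))                 # False: an ordinary SACK
print(is_dsack(4000, [(5000, 5500), (4500, 6000)]))   # True: covered by the second block
```

An ordinary SACK block always sits above the cumulative ACK and outside the other blocks, which is why either kind of overlap can safely be read as a report of duplicate data.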

Example one: ACK packet loss

In the example below, two ACKs are lost, so the sender retransmits the first packet (3000-3499). The receiver sees the duplicate and replies with SACK=3000-3500. Because the ACK has already reached 4000, meaning that all data before 4000 has been received, this SACK is a D-SACK: it exists to tell the sender that duplicate data was received. The sender now also knows that the data packet was not lost; what was lost was the ACK.

Transmitted   Received    ACK Sent
Segment       Segment     (Including SACK Blocks)

3000-3499     3000-3499   3500 (ACK dropped)
3500-3999     3500-3999   4000 (ACK dropped)
3000-3499     3000-3499   4000, SACK=3000-3500
                                      ---------

Example two: network delay

In the example below, the packet 1000-1499 is delayed by the network, so the sender receives no ACK for it. The three packets that arrive after it trigger the Fast Retransmit algorithm, so the sender retransmits. By then, however, the delayed packet has arrived as well, so the receiver replies with SACK=1000-1500. Because the ACK has already reached 3000, this SACK is a D-SACK: it signals that a duplicate packet was received.

In this case, the sender knows that the retransmission triggered by Fast Retransmit was caused neither by the loss of the packet it sent nor by the loss of the ACK, but by a delay in the network.

Transmitted   Received    ACK Sent
Segment       Segment     (Including SACK Blocks)

500-999       500-999     1000
1000-1499     (delayed)
1500-1999     1500-1999   1000, SACK=1500-2000
2000-2499     2000-2499   1000, SACK=1500-2500
2500-2999     2500-2999   1000, SACK=1500-3000
1000-1499     1000-1499   3000
              1000-1499   3000, SACK=1000-1500
                                      ---------

As you can see, introducing D-SACK has several benefits:

1) It lets the sender know whether the packet it sent was lost or the returning ACK was lost.

2) It lets the sender tell whether its own timeout is too small and is causing retransmissions.

3) It reveals cases where a packet sent earlier arrives later on the network (also known as reordering).

4) It reveals whether the network has duplicated my data packets.

Knowing these things helps TCP understand the state of the network much better, and so do better flow control.

On Linux, the tcp_dsack parameter enables this feature (on by default since Linux 2.4).

That is the end of this part. If you found it reasonably easy to follow, you are welcome to move on to the next part of this TCP series.
