A long-standing puzzle: real-time detection of TCP connection interruption

Last Update:2018-07-26 Source: Internet

Author: User

TCP/IP has become the dominant networking technology. By analyzing the underlying implementation of TCP, this article examines a problem that has long puzzled TCP/IP programmers, the real-time detection of network connection interruption, and presents corresponding solutions.
0 Introduction
As the dominant technology in modern networks, TCP/IP looks very simple to program. After rapid early progress, however, developers often get bogged down in the details, usually because they do not understand the low-level behavior of the TCP protocol.
TCP is a connection-oriented protocol, whereas UDP is connectionless. Many beginners are surprised to find that no data flows through an idle TCP connection: if neither side sends data to the other, the two TCP modules exchange no information at all. This means that a client can open a connection to a server, walk away for a few hours or even a few weeks, and find the connection still in place on returning. Intermediate routers can crash and reboot, and telephone lines can be hung up and reconnected; as long as the hosts at both ends are not restarted, the connection remains established.
As a result, programmers new to the TCP/IP protocol suite are often puzzled: TCP has none of the connection-state polling found in other network protocols, and it does not even notify the application when the network connection is interrupted. Some programmers conclude from this that TCP is unsuitable for general application-to-application communication. Why does TCP not provide such notification?
1 Principle Analysis
TCP is often described as a reliable protocol, that is, "TCP guarantees the delivery of data." This description is misleading, because it suggests that TCP cannot fail. In fact, TCP guarantees correct delivery only as long as the two sides remain connected. Connectivity can be lost for three reasons: (1) a permanent or temporary network outage, (2) a crash of the peer application, or (3) a crash of the peer host. When any of these occurs, the two applications can no longer communicate, and one of them may not be aware of it immediately. An application that is sending data to its peer may not learn of the interruption until TCP gives up retransmitting. An application that is not sending data may never find out that the network has been interrupted. For example, the application might be a server waiting for its peer to issue the next request. Because the client cannot reach the server, the next request never arrives. Even when the client's TCP eventually gives up and drops the connection, causing the client to abort, the server remains unaware of it.
Other communication protocols, such as SNA and X.25, do notify the application when a connection is interrupted. To do so, any protocol that runs over anything more complex than a simple direct point-to-point dedicated link must use a polling protocol to test continuously whether the peer is still present. The polling may take the form of explicit "Do you have any data to send to me?" messages, as in poll-select protocols, or of background idle frames that check whether the virtual circuit still exists. Either way, each polling message consumes network resources that could otherwise carry payload data.
This cost in available network bandwidth is one reason TCP does not provide immediate notification of network outages. Most applications do not need immediate notification, so there is no reason to make every application pay for it with reduced bandwidth. Applications that do need to learn promptly that a peer is unreachable can implement their own mechanism for detecting outages, as described later.
One of the basic principles of TCP/IP design is the end-to-end argument [Saltzer et al. 1984]. Applied to networking, it says that intelligence should sit as close as possible to the endpoints of a connection, and that the network itself should be relatively simple. This is why TCP handles error control itself rather than relying on the network to provide it. Applied to monitoring the connectivity between peer applications, the principle says that the functionality should be provided by the applications that need it, rather than by a lower layer for all applications whether they need it or not.
The most important reason TCP does not provide prompt notification of connection interruption is one of its main design goals: the ability to maintain communication when the network is suddenly disrupted. TCP grew out of research sponsored by the U.S. Department of Defense, which needed a network protocol that could maintain reliable communication between computers through network outages caused by war or natural disaster. Network disruptions are usually temporary, and routers can often find another path for the connection. By tolerating a temporary loss of connectivity, TCP can often recover from the disruption before the applications at either end even notice it.
2 Solutions
2.1 Scenario One: Using the TCP keep-alive mechanism
Because people do want to know when a connection has been interrupted, many TCP implementations provide a mechanism called keep-alive for detecting dead connections, although applications seldom use it. When an application enables keep-alive, TCP sends a special segment to the peer after the connection has been idle for a certain interval. If the peer host is reachable and the peer application is still running, the peer TCP responds with an ACK. In this case, the TCP that sent the keep-alive probe resets its idle timer to zero, and the application receives no indication that the exchange took place.
If the peer host is reachable but the peer application is not running, the peer TCP responds with an RST segment. The TCP that sent the keep-alive probe then drops the connection and returns an ECONNRESET error to the application. This is usually the result of the peer host crashing and rebooting, because if only the peer application had terminated or crashed, the peer TCP would have sent a FIN.
If the peer host responds with neither an ACK nor an RST, the TCP that sent the keep-alive probe keeps sending probes until it decides that the peer is unreachable or has crashed. It then drops the connection and reports an ETIMEDOUT error to the application, or an EHOSTUNREACH or ENETUNREACH error if a router has returned an ICMP host-unreachable or network-unreachable message.
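The three outcomes described above reach the application as error codes on a later socket call. The following is a minimal sketch in C of how an application might tell them apart. The enum and the function name are hypothetical, introduced only for illustration, and the mapping simply follows the description above; the same error codes can also arise from causes other than keep-alive.

```c
#include <errno.h>

/* Hypothetical classification of the errors that keep-alive can surface. */
typedef enum {
    KA_OK,           /* no error pending on the socket */
    KA_PEER_RESET,   /* peer TCP answered with RST (ECONNRESET) */
    KA_PEER_SILENT,  /* probes went unanswered until TCP gave up (ETIMEDOUT) */
    KA_UNREACHABLE,  /* a router reported the host or network unreachable */
    KA_OTHER         /* any other error, unrelated to this discussion */
} ka_result_t;

/* Map an errno value from a failed socket call to one of the cases above. */
ka_result_t classify_keepalive_error(int err)
{
    switch (err) {
    case 0:
        return KA_OK;
    case ECONNRESET:
        return KA_PEER_RESET;
    case ETIMEDOUT:
        return KA_PEER_SILENT;
    case EHOSTUNREACH:
    case ENETUNREACH:
        return KA_UNREACHABLE;
    default:
        return KA_OTHER;
    }
}
```

An application could pass errno to this function after a failed recv() or send(), or pass the value obtained from getsockopt() with SO_ERROR, and then decide whether to reconnect or give up.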
With keep-alive, TCP does provide protocol-level notification of network interruption, but the mechanism has several problems, which is why applications rarely use it.
First, keep-alive is not a required part of the TCP specification, and whether TCP should offer it at all has long been controversial. As a result, not every TCP implementation provides keep-alive, and the implementation details differ among those that do.
The second problem, for applications that need prompt notification of network outages, concerns the time interval. RFC 1122 [Braden 1989] states that if a TCP implements keep-alive, the keep-alive interval must be configurable and must default to no less than two hours of idle time before the first probe is sent. Because the peer's ACK is not delivered reliably, the sender must repeat the probe several times before giving up on the connection. The 4.4BSD implementation sends 9 probes at 75-second intervals before dropping the connection.
This means that a BSD-derived implementation takes approximately 2 hours, 11 minutes, and 15 seconds to discover that a connection has been interrupted. The value makes sense only when we recognize that keep-alive is intended to release the resources held by dead connections. Such a connection arises, for example, when a client connects to a server and the client host then crashes. Without keep-alive, the server waits forever for the client's next request, because it never receives a FIN. (This situation is increasingly common, because PC users often simply switch off the computer and modem instead of shutting the application down properly.)
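The figure of 2 hours, 11 minutes, and 15 seconds follows directly from the numbers quoted above: two hours of idle time plus 9 probes at 75 seconds each. A small sketch of that arithmetic in C, where the function name is hypothetical and chosen only for this illustration:

```c
#include <assert.h>

/* Worst-case time, in seconds, before keep-alive gives up on a silent peer:
 * the idle period before the first probe, plus one probe interval for each
 * unanswered probe. */
long keepalive_detection_secs(long idle_secs, int probes, long probe_interval_secs)
{
    assert(idle_secs >= 0 && probes >= 0 && probe_interval_secs >= 0);
    return idle_secs + (long)probes * probe_interval_secs;
}
```

With the 4.4BSD values, keepalive_detection_secs(7200, 9, 75) gives 7200 + 675 = 7875 seconds, which is 2 hours, 11 minutes, and 15 seconds.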
Because two hours is almost meaningless for real-time detection, some implementations allow one or both of these intervals to be changed. The keep-alive interval is a system-wide variable, however, so changing it affects every TCP connection on the system. This is the main reason keep-alive is not actually used as a connection-monitoring mechanism: the default period is too long, and if the default is shortened, keep-alive loses its original purpose of cleaning up long-dead connections.
Another problem with keep-alive is that it does not merely detect dead connections but also drops them. That may or may not be what the application wants.
2.2 Scenario Two: Using heartbeat detection
The second approach is to detect connection interruption at the application layer. The basic idea is the same as keep-alive, namely to send a probe to the peer periodically, but because it is implemented at the application layer, the application has full control over it. The Border Gateway Protocol (BGP), for example, detects link or host failures on its TCP connections by periodically sending KEEPALIVE messages to its neighbors, typically at 30-second intervals. Connection monitoring implemented at the application layer in this way is often called "heartbeat detection."
Although the principle above has been discussed in terms of TCP, it applies equally to UDP. The greatest advantage of this approach is its flexibility.
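The bookkeeping behind heartbeat detection can be sketched in a few lines of C. This is only one possible design, not a prescribed one: the type, the function names, and the rule "declare the peer dead after max_missed silent intervals" are all assumptions made for this illustration. Sending and receiving the heartbeat messages themselves is left to the application.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-connection heartbeat state (all names are illustrative). */
typedef struct {
    time_t last_heard;    /* when data of any kind last arrived from the peer */
    int    interval_secs; /* silence after which a heartbeat should be sent */
    int    max_missed;    /* silent intervals tolerated before giving up */
} heartbeat_t;

void hb_init(heartbeat_t *hb, time_t now, int interval_secs, int max_missed)
{
    assert(interval_secs > 0 && max_missed > 0);
    hb->last_heard = now;
    hb->interval_secs = interval_secs;
    hb->max_missed = max_missed;
}

/* Call whenever anything (payload or heartbeat reply) arrives from the peer. */
void hb_on_receive(heartbeat_t *hb, time_t now)
{
    hb->last_heard = now;
}

/* True once the peer has been silent for at least one interval. */
bool hb_should_probe(const heartbeat_t *hb, time_t now)
{
    return difftime(now, hb->last_heard) >= hb->interval_secs;
}

/* True once the peer has been silent for max_missed whole intervals. */
bool hb_peer_dead(const heartbeat_t *hb, time_t now)
{
    return difftime(now, hb->last_heard) >=
           (double)hb->interval_secs * hb->max_missed;
}
```

An event loop would call hb_on_receive() for every message read from the socket and check hb_should_probe() and hb_peer_dead() on each timer tick. Unlike TCP keep-alive, nothing here drops the connection: what to do with a peer judged dead remains the application's decision.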
2.3 Scenario Three: Using the TCP_KEEPALIVE socket option
The third approach is to use the newer POSIX 1003.1g socket option TCP_KEEPALIVE, which allows the timeout interval to be specified per connection. It is not widely implemented, however, so few applications use it. Linux kernels from 2.4 onward expose equivalent per-socket settings under the names TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT. Where TCP_KEEPALIVE itself is defined, the idle interval for a socket can be set like this:
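#ifdef TCP_KEEPALIVE
int on = 1;
int secs = 120; /* idle time before the first probe: 2 minutes */
/* Keep-alive must be enabled on the socket before the interval has any effect. */
setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
setsockopt(s, IPPROTO_TCP, TCP_KEEPALIVE, &secs, sizeof(secs));
#endif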
The drawback of this approach is portability: because the TCP_KEEPALIVE socket option is a relatively new POSIX feature, not every TCP implementation supports it. In addition, because detection takes place at the transport layer, it is less flexible than an application-layer implementation.
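On Linux, per-connection control of this kind is available through the socket options TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT. The sketch below shows one way to wrap them in C. The function name enable_keepalive is hypothetical, and the #ifdef guards reflect the portability problem just described: on a platform that lacks one of the options, the corresponding system-wide default simply stays in force.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable keep-alive on one socket and, where the platform allows it,
 * override the system-wide timers for that socket only.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
int enable_keepalive(int s, int idle_secs, int interval_secs, int probes)
{
    int on = 1;

    /* Avoid unused-parameter warnings where an option is not defined. */
    (void)idle_secs;
    (void)interval_secs;
    (void)probes;

    if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
#ifdef TCP_KEEPIDLE  /* idle time before the first probe */
    if (setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof(idle_secs)) < 0)
        return -1;
#endif
#ifdef TCP_KEEPINTVL /* time between unanswered probes */
    if (setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL,
                   &interval_secs, sizeof(interval_secs)) < 0)
        return -1;
#endif
#ifdef TCP_KEEPCNT   /* unanswered probes before the connection is dropped */
    if (setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT,
                   &probes, sizeof(probes)) < 0)
        return -1;
#endif
    return 0;
}
```

A call such as enable_keepalive(s, 120, 10, 5) would, on a platform that supports all three options, start probing after 2 minutes of idle time and give up after 5 unanswered probes sent 10 seconds apart, without touching any other connection on the system.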
3 Summary

To sum up, every approach to detecting TCP connection interruption rests on the same principle: periodically send probe data to the peer. The approaches differ in the layer at which they are implemented, and an application can choose among them according to its needs. The key is a thorough understanding of how TCP connections work.

Original address: http://www.guigu.org/news/guiguvip/201206117802.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
