Analysis of TCP connection disconnection


Zheng Yong(Zhengyzy@cn.ibm.com), Software engineer, IBM
Feng Rui(Fengrui@cn.ibm.com), Software engineer, IBM

August 21, 2008

In principle, keeping a TCP connection alive requires no extra operations, but in practice many factors affect whether a connection survives for a long time. This article describes several common causes of TCP disconnection. Then, taking an abnormal TCP disconnection on an AIX system as an example, it uses network analysis tools to uncover the cause step by step and offers two feasible solutions.

Introduction

In official documents, the TCP/IP protocol suite is also known as the Internet protocol suite. It is currently the most widely used internetworking technology in the world. Its layered structure is shown in Figure 1.

Figure 1. Layers of the TCP/IP protocol suite

As shown in Figure 1, the data link layer mainly handles the details of the transmission media and other physical interfaces. The network layer handles the movement of packets through the network, including fragmentation of upper-layer data and routing. The transport layer provides end-to-end communication between two hosts. The application layer handles the details of specific applications. IP is the core protocol of the network layer and provides an unreliable, connectionless data delivery service. TCP sits at the transport layer and, on top of the unreliable, connectionless IP protocol, provides connection-oriented, reliable communication between two hosts.

Because TCP is a connection-oriented protocol, a connection must be established before two hosts communicate. Next we will briefly introduce the establishment of the TCP connection and how the communication parties maintain the established TCP connection.

Establish and maintain TCP connections

A TCP connection is established through the well-known "three-way handshake". The following example shows how a TCP connection is established.

In the following description, the client host is testclient.cn.ibm.com (Linux) and the server host is testserver.cn.ibm.com (AIX). On one terminal of the testclient host, run the command tcpdump -i eth0 host testserver to start tcpdump and capture network traffic (eth0 is the NIC that the client host uses to communicate with the external network). At the same time, on another terminal of the client host, run the command (root@testclient/)> telnet testserver. The tcpdump output on the client host is shown in Listing 1.

Listing 1. Three-way handshake that establishes a TCP connection

# tcpdump -S -i en0 host testServer
14:02:38.384918 IP testClient.cn.ibm.com.43370 > testServer.cn.ibm.com.telnet: S 3392458353:3392458353(0) …
14:02:38.629578 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.43370: S 881279296:881279296(0) ack 3392458354 …
14:02:38.629592 IP testClient.cn.ibm.com.43370 > testServer.cn.ibm.com.telnet: . ack 881279297 …

Note: We removed some irrelevant information from the tcpdump output. For ease of understanding, the output above is redrawn as the sequence diagram in Figure 2.

Figure 2. Actual sequence of the TCP three-way handshake

As shown in Figure 2, testclient and testserver exchange three segments to establish the connection:

  • Testclient sends a SYN segment to testserver. Its sequence number is 3392458353, and it carries no data (length 0), although the SYN itself consumes one sequence number.
  • Testserver sends its own SYN segment to testclient. Its sequence number is 881279296, and it also carries no data. In the same segment it returns ack 3392458354 as the response to the segment with sequence number 3392458353 sent by testclient.
  • Testclient returns ack 881279297 to testserver as the response to the segment with sequence number 881279296 sent by testserver.
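The acknowledgment numbers in the handshake follow a simple rule that can be checked in a few lines of Python. This is only an illustrative sketch: the function name `expected_ack` is hypothetical, and the sequence numbers are the ones shown in Listing 1.

```python
SEQ_MODULUS = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def expected_ack(syn_seq: int) -> int:
    """Return the acknowledgment number that answers a SYN with sequence syn_seq.

    A SYN carries no data but consumes one sequence number, so the
    acknowledgment is the SYN's sequence number plus one (modulo 2**32).
    """
    return (syn_seq + 1) % SEQ_MODULUS


# Sequence numbers taken from Listing 1.
client_syn = 3392458353
server_syn = 881279296

print(expected_ack(client_syn))  # 3392458354
print(expected_ack(server_syn))  # 881279297
```

The two printed values match the ack fields in the second and third lines of the tcpdump output.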

Once these three steps are complete, the TCP connection is established and the two ends can exchange information. A TCP connection can therefore be seen as a communication channel identified by the IP address and port of each end, and establishing the connection is the process of registering that channel on both hosts. Once a TCP connection is established, it remains in place until either side actively closes it, as long as the intermediate nodes between the two hosts (gateways, switches, routers, and other network devices) work normally.

This property allows an idle connection, one that exchanges no information for a long time, to be kept for hours, days, or even months. An intermediate router can crash and restart, and a network cable can be unplugged and reconnected. The TCP connection survives as long as the hosts at both ends are not restarted.




Causes of TCP connection disconnection

Ideally, a TCP connection can be kept for a long time. In practice, however, a connection that looks normal on the client or the server may already have been broken. Two kinds of nodes can cause this: the intermediate network nodes, and the two end nodes of the communication (the client and the server).

In real networks, traffic between two hosts usually crosses several intermediate nodes, such as routers, gateways, and firewalls, so a TCP connection between two hosts is also affected by those nodes, especially firewalls. A firewall may be implemented in software, in hardware, or as a combination of both. It inspects inbound and outbound traffic against a set of rules, allowing traffic that complies with the rules and blocking traffic that violates them. Because of the way it works, a firewall spends considerable resources on every connection it tracks, and an enterprise firewall usually sits at the entry and exit point of the enterprise network, so keeping inactive TCP connections for a long time inevitably degrades network performance. For this reason, most firewalls by default drop connections that have been inactive for a long time, which breaks the TCP connection. Similarly, if an intermediate node fails and the client's request to close the connection never reaches the server, the server is left holding a connection that is in fact already broken.

On the other hand, each host at the ends of a TCP connection spends some system resources on it. When a connection is no longer needed, we expect the two hosts to close it actively and release those resources. However, if the client fails abnormally (for example, it crashes or restarts unexpectedly) and the connection is not closed properly, the server is again left with a broken connection.

On either the client or the server, a broken TCP connection can no longer carry any information, so keeping a large number of broken connections wastes system resources. This waste may not matter much on a client, but on a server it can exhaust system resources (especially memory and sockets) so that new user requests are refused. In practice, therefore, the server needs a suitable way to detect whether a TCP connection has been broken.




Three common methods for detecting TCP connection disconnection

The principle for detecting whether a TCP connection is broken or still working is fairly simple: periodically send a message in an agreed format to the remote node and wait for its response. If the correct response arrives within the specified time, the connection is working. Otherwise, the connection is considered broken. Three detection methods based on this principle are in common use.

Application self-detection

The application itself checks its own TCP connections. This method is very flexible, because the detection mechanism and its implementation can be chosen to suit the application. In practice, however, most applications do not include such self-detection.
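As a rough illustration of application self-detection, the following Python sketch sends a heartbeat message and waits a limited time for the reply. Everything in it is an assumption made for the example: the PING/PONG message format, the function names `peer_is_alive` and `answer_one_heartbeat`, and the timeout values are hypothetical and do not come from any particular application.

```python
import socket
import threading


def peer_is_alive(conn: socket.socket, timeout: float = 2.0) -> bool:
    """Send one heartbeat and report whether the peer answered in time."""
    conn.settimeout(timeout)
    try:
        conn.sendall(b"PING\n")
        # A real protocol needs message framing; these tiny messages are
        # assumed to arrive in a single recv() call.
        return conn.recv(16) == b"PONG\n"
    except OSError:  # covers timeouts and connection resets
        return False


def answer_one_heartbeat(conn: socket.socket) -> None:
    """Peer side: reply to a single PING with a PONG."""
    if conn.recv(16) == b"PING\n":
        conn.sendall(b"PONG\n")


# Demo on a local socket pair: one peer answers, the other stays silent.
local, remote = socket.socketpair()
responder = threading.Thread(target=answer_one_heartbeat, args=(remote,))
responder.start()
print(peer_is_alive(local))  # True
responder.join()

quiet, silent_peer = socket.socketpair()
print(peer_is_alive(quiet, timeout=0.2))  # False

for s in (local, remote, quiet, silent_peer):
    s.close()
```

The cost of this approach is visible even in the sketch: both sides must agree on the heartbeat format, which is why it only works when the application is designed for it.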

Third-party application detection

In this method, a third-party application is installed on the server node to check whether each TCP connection on that node is working or broken. Its biggest drawback is that every client taking part in the detection must be able to recognize the packets sent by that application. For this reason, the method is rare in practice.

TCP protocol layer keepalive detection

The most common method is to use the keepalive function provided by the TCP protocol layer, that is, the keepalive timer of the TCP connection. Although this function is not a required part of the TCP specification (RFC 1122 treats it as optional), almost all UNIX-like systems implement it.

The following sections focus on keepalive detection at the TCP protocol layer.




TCP keepalive timer on UNIX-like systems

The keepalive timer for TCP connections can be implemented either in the application layer or in TCP itself. Which is better is a matter of debate, so keepalive detection is not a mandatory part of the TCP specification. For convenience, almost all UNIX-like systems provide it in TCP.

Listing 2. Keepalive timer parameters on common UNIX systems

Operating system    Keepalive timer
AIX                 # no -a | grep keep
                    tcp_keepcnt = 8
                    tcp_keepidle = 14400
                    tcp_keepintvl = 150
Linux               # sysctl -a | grep keep
                    net.ipv4.tcp_keepalive_intvl = 75
                    net.ipv4.tcp_keepalive_probes = 9
                    net.ipv4.tcp_keepalive_time = 7200
FreeBSD             # sysctl -a | grep net.inet.tcp
                    net.inet.tcp.keepidle = ...
                    net.inet.tcp.keepintvl = ...

The time units of these parameters differ between systems. On AIX, tcp_keepidle, tcp_keepinit, and tcp_keepintvl are expressed in units of 0.5 seconds. On Linux, net.ipv4.tcp_keepalive_intvl and net.ipv4.tcp_keepalive_time are expressed in seconds. In addition, these parameters apply only to the connections of server applications running on that host.
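Because of the half-second unit, AIX values are easy to misread. The following Python sketch converts the AIX values from Listing 2 to seconds; the helper name `aix_ticks_to_seconds` is hypothetical and only serves this example.

```python
AIX_TICKS_PER_SECOND = 2  # AIX keepalive values are in half-second units


def aix_ticks_to_seconds(ticks: int) -> float:
    """Convert an AIX keepalive setting (half-second units) to seconds."""
    return ticks / AIX_TICKS_PER_SECOND


# AIX values taken from Listing 2.
aix_values = {"tcp_keepidle": 14400, "tcp_keepintvl": 150}
for name, ticks in aix_values.items():
    print(name, aix_ticks_to_seconds(ticks), "seconds")
# tcp_keepidle 7200.0 seconds
# tcp_keepintvl 75.0 seconds
```

Converted this way, the AIX values in Listing 2 correspond to the same real times as the Linux values shown there (7200 seconds idle, 75 seconds between probes).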

Note: On Solaris, run the command "ndd /dev/tcp \?" to display similar parameter information. On HP-UX, use the nettune or ndd command to query it.

Because all of these UNIX-like systems support the function, the following sections use AIX to explain the meaning of the parameters and the mechanism behind them.




Mechanism and principle of TCP keepalive detection on AIX

As Listing 2 shows, the keepalive probe mechanism on AIX is controlled by a small set of parameters. The three that matter here are described in Listing 3:

Listing 3. Keepalive timer control parameters on AIX

Control parameter   Description
tcp_keepcnt         Maximum number of probes sent before an inactive connection is closed. The default value is 8.
tcp_keepidle        Maximum time a connection may stay inactive before it is probed for validity. The default value is 14400 (2 hours).
tcp_keepintvl       Interval between two probes. The default value is 150 (75 seconds).

Let's look at a concrete example. On testserver (the AIX host), set tcp_keepidle = 240 (2 minutes), tcp_keepcnt = 8, and tcp_keepintvl = 150 (75 seconds), and start tcpdump on testserver to watch the network packets. Then open a telnet connection from testclient to testserver. After the connection is established, unplug the network cable of testclient and observe the output on the server (see Listing 4).

Listing 4. tcpdump output of Telnet connection on the server

 1 # tcpdump -i en1 host testServer.cn.ibm.com
 2 04:51:51.379716 IP testClient.cn.ibm.com.telnet.40621 > testServer.cn.ibm.com.telnet: S 4097149880:4097149880(0)
 3 04:51:51.379755 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.40621: S 2543529892:2543529892(0) ack 4097149881
 4 04:51:51.380609 IP testClient.cn.ibm.com.telnet.40621 > testServer.cn.ibm.com.telnet: . ack 1
 5 ...
 6 04:51:54.924058 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.40621: P 676:696(20) ack 87
 7 04:51:54.924909 IP testClient.cn.ibm.com.telnet.40621 > testServer.cn.ibm.com.telnet: . ack 696
 8 04:53:54.550192 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.40621: . 695:696(1) ack 86
 9 04:55:09.550997 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.40621: . 695:696(1) ack 86
10 04:56:24.552053 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.40621: . 695:696(1) ack 86
11 04:57:39.552615 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.40621: . 695:696(1) ack 86
12 04:58:54.553446 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.40621: . 695:696(1) ack 86
13 05:00:09.554287 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.40621: . 695:696(1) ack 86
14 05:01:24.555117 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.40621: . 695:696(1) ack 86
15 05:02:39.555958 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.40621: . 695:696(1) ack 86
16 05:03:54.557282 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.40621: . 695:696(1) ack 86
17 05:05:09.559795 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.40621: R 696:696(0) ack 87

Listing 4 shows that the packet in line 6 is the last data sent on the connection, and line 7 is the acknowledgment of that data. After that, no data is exchanged and the connection stays idle. Two minutes later (the difference between the timestamp of line 8, 04:53:54, and that of line 7, 04:51:54, which equals the value of tcp_keepidle), the server sends its first keepalive probe in line 8. Because the server receives no response from the client, it sends another probe in line 9 after the tcp_keepintvl interval (75 seconds). The server keeps sending probes (the output shows that AIX actually sends tcp_keepcnt + 1 probes in a row, lines 8 through 16) but still receives no response, so in line 17 it sends a reset packet to the client and closes the connection on the server side.
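The timing in Listing 4 can be reproduced with simple arithmetic. The Python sketch below is only a model of what the capture shows, not of the AIX implementation: the function name `keepalive_timeline` is hypothetical, and the probe count of 9 (tcp_keepcnt + 1) is taken from the observed output rather than from any specification.

```python
from datetime import datetime, timedelta


def keepalive_timeline(last_activity, idle_s, interval_s, probes):
    """Return (probe_times, reset_time) for a peer that never answers.

    The first probe goes out idle_s seconds after the last activity, each
    further probe follows interval_s seconds after the previous one, and
    the reset is sent one more interval after the last probe.
    """
    first = last_activity + timedelta(seconds=idle_s)
    probe_times = [first + timedelta(seconds=i * interval_s) for i in range(probes)]
    reset_time = probe_times[-1] + timedelta(seconds=interval_s)
    return probe_times, reset_time


# Last activity is line 7 of Listing 4; 240 and 150 half-seconds are 120 s and 75 s.
last = datetime.strptime("04:51:54", "%H:%M:%S")
probe_times, reset_time = keepalive_timeline(last, idle_s=120, interval_s=75, probes=9)

print(probe_times[0].strftime("%H:%M:%S"))   # 04:53:54
print(probe_times[-1].strftime("%H:%M:%S"))  # 05:03:54
print(reset_time.strftime("%H:%M:%S"))       # 05:05:09
```

The three printed times match lines 8, 16, and 17 of Listing 4 to the second, so with these settings a dead peer is detected roughly 795 seconds after the last activity.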

Note that keepalive probes do not disturb a normal TCP connection. Listing 4 shows that each probe sent from line 8 onward carries 1 byte of data at sequence number 695 (695:696), a byte that the server already sent in line 6 and the client already acknowledged in line 7. On a healthy connection, the client answers the probe with an ACK like the one in line 7, which tells the server that the connection is still working.

Next, we use a real TCP disconnection case to analyze how this mechanism affects long-lived TCP connections, and then propose two solutions for applications that need to keep TCP connections open for a long time.




TCP disconnection and data analysis on AIX

Figure 3. Network topology in which the TCP disconnection occurred

All server hosts are grouped into one LAN and placed behind firewall B. For work purposes, the testclient host in the work-area LAN needs to connect over TCP/IP to the database on testserver in the server LAN, and the upper-layer applications on testclient operate on the database on testserver through that connection.

In actual testing, we found that even when both testclient and testserver were working normally, the connection could be broken unexpectedly without the client on testclient receiving any error in advance (a "connection is reset by foreign host" error was reported on the next attempt to operate on the database through the connection).

Because the problem kept recurring while the intermediate nodes in the network (such as routers and switches) were working normally, physical causes (such as power loss and downtime) could be ruled out. To analyze the cause of the disconnection, we first checked the default keepalive settings on the testserver machine:

# no -a | grep keep
tcp_keepcnt = 8
tcp_keepidle = 14400
tcp_keepintvl = 150

The value of tcp_keepidle on testserver is 14400, that is, 2 hours. Since the intermediate nodes work normally, why does the keepalive mechanism not keep the connection alive? To find out, we used tcpdump to capture the packets on testclient and testserver, as shown in Listings 5 and 6.

Listing 5. tcpdump output on the server

1 10:18:58.881950 IP testClient.cn.ibm.com.59098 > testServer.cn.ibm.com.telnet: S 1182666808:1182666808(0) ...
2 10:18:58.882001 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.59098: S 3333341833:3333341833(0) ack 1182666809 ...
3 10:18:58.882845 IP testClient.cn.ibm.com.59098 > testServer.cn.ibm.com.telnet: . ack 1 ...
4 ...
5 10:19:03.165568 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.59098: P 1010:1032(22) ack 87 ...
6 10:19:03.166457 IP testClient.cn.ibm.com.59098 > testServer.cn.ibm.com.telnet: . ack 1032 ...
7 12:19:05.445336 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.59098: . 1031:1032(1) ack 86 ...
8 12:19:05.445464 IP testClient.cn.ibm.com.59098 > testServer.cn.ibm.com.telnet: R 86:87(1) ack 1031 ...

Listing 6. tcpdump output on the client

# tcpdump -e -i eth0 host testServer.cn.ibm.com
10:18:55.800553 IP testClient.cn.ibm.com.59098 > testServer.cn.ibm.com.telnet: S 1182666808:1182666808(0) ...
10:18:55.801778 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.59098: S 3333341833:3333341833(0) ack 1182666809 ...
10:18:55.801799 IP testClient.cn.ibm.com.59098 > testServer.cn.ibm.com.telnet: . ack 1 ...
...
10:19:00.084662 IP testServer.cn.ibm.com.telnet > testClient.cn.ibm.com.59098: P 1010:1032(22) ack 87 ...
10:19:00.084678 IP testClient.cn.ibm.com.59098 > testServer.cn.ibm.com.telnet: . ack 1032 ...

Listing 5 shows that when the connection has been idle for the 2 hours set by tcp_keepidle, the server sends its first keepalive probe (line 7 in Listing 5). The server then receives a connection reset packet that appears to come from testclient (line 8 in Listing 5) and closes the connection (this can be checked with netstat). However, the tcpdump data in Listing 6 shows that testclient did not send any such packet. So who sent the reset packet to testserver?

To identify the sender of the reset packet, we ran tcpdump again to capture packets on the server and on firewall B (note: on the firewall host, capture the traffic on both the ingress NIC and the egress NIC). The result showed that firewall B sent a reset packet to testserver immediately after receiving the first probe from testserver.

This analysis shows that firewall B had already terminated the connection at some point between the last data exchange and the server's first keepalive probe. From then on, any packet sent on the connection is discarded when it tries to pass through the firewall, and the firewall answers it with a reset packet.




Two common solutions

There are two common solutions to this kind of TCP disconnection:

Solution 1: Extend the time after which the firewall terminates inactive TCP connections. In the case above, for example, the firewall timeout can be set to a value longer than the 2 hours configured on the server.

Solution 2: Shorten the TCP keepalive idle time on the server, so that a keepalive probe is sent before the firewall terminates the connection. The probe both checks the state of the client and makes the connection look active to the firewall.

Each solution has a cost. In the first, lengthening the firewall's idle timeout may degrade firewall performance, especially when many connections stay inactive for a long time. In the second, shortening the keepalive idle time on the server increases the number of packets on the network and consumes extra bandwidth. The choice between them depends on the actual application.
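The article applies the second solution system-wide through the server's TCP parameters. Where changing the system-wide value is not desirable, many platforms also let an application request keepalive on an individual socket. The Python sketch below assumes a platform where the `socket` module exposes `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` (as it does on Linux); whether a given AIX Python build exposes them is not something the article confirms, so the code skips any option that is missing. The function name `enable_keepalive` and the 600-second idle time are hypothetical; the idle time must be chosen to be shorter than the firewall's idle timeout.

```python
import socket


def enable_keepalive(sock, idle_s=600, interval_s=75, probes=8):
    """Turn on TCP keepalive for one socket and tune it where supported."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    per_socket_options = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in per_socket_options:
        option = getattr(socket, name, None)  # not defined on every platform
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock, idle_s=600)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)  # True
sock.close()
```

Note that these per-socket values are in seconds, unlike the half-second units of the AIX system parameters.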




Summary

This article introduced how TCP connections are established and kept alive, and the common factors that break long-lived connections. It listed the keepalive configuration parameters on common UNIX-like systems and used tcpdump to analyze a real TCP disconnection case on AIX. Finally, it offered two feasible solutions to the disconnection problem.


Author Profile

Zheng Yong is a software engineer currently engaged in AIX performance testing at the IBM Development Center.

Feng Rui is a software engineer currently engaged in AIX performance testing at the IBM Development Center.
