Network troubleshooting ideas from a TSO/NAT bug in the Linux 2.6.8 kernel (with an skb optimization point)

Source: Internet
Author: User
Tags ack network troubleshooting ssl connection

In dreams there is no right or wrong, no hatred or regret... Better to keep my mouth shut; that is more endearing... I will not say "this is unfair, I cannot accept it." I will simply record things in plain words: up a little after 4 in the morning, writing down in one go my recent gains and assessments, my indignation and regrets.

An episode from four years ago

Around 2010, I was troubleshooting a problem.

The problem was as follows:

Server: Linux kernel 2.6.8
Client: Windows XP
Business process (simplified):
1. The client initiates an SSL connection to the server
2. Data transfer

Symptom: during the SSL handshake, the server sends its certificate extremely slowly.

I have forgotten exactly how my thinking went at the time, but I remember the conclusion: I fixed a bug in the NAT module of Linux 2.6.8.
After capturing a lot of packets, I found that the machine kept sending itself ICMP need-frag error messages. The server certificate was too large and exceeded the MTU of the outgoing network interface. What follows is the step-by-step reasoning that finally led to the fix:

1. Verify that the server program sets the DF flag.

It obviously does: ICMP need-frag messages are triggered only by packets that have the DF flag set.

2. Question: when TCP hands data to IP, it checks the MTU and derives the MSS from it. Knowing the MSS, how could it send an oversized packet? A calculation error is unlikely; Linux is, after all, close to industrial grade.

3. Question: fortunately I knew a few terms back then, so I thought of TCP Segmentation Offload.

TCP Segmentation Offload, TSO for short, is hardware segmentation for TCP, not IP fragmentation; the difference matters. TSO therefore has nothing to do with the DF flag in the IP header.

With IP fragmentation, only the first fragment carries the complete upper-layer header (provided the header fits in one fragment). With TSO, every resulting IP packet carries a standard TCP header: the NIC hardware computes the checksum and the header fields, such as the sequence number, for each segment, and adds the IP header itself.

It is designed to improve TCP performance.
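To make the difference concrete, here is a minimal Python sketch that splits the same TCP data both ways. It is a toy model, not NIC or kernel code: the `Packet` class, the function names, and the fixed 20-byte headers without options are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

IP_HEADER = 20   # bytes, assuming no IP options
TCP_HEADER = 20  # bytes, assuming no TCP options


@dataclass
class Packet:
    ip_len: int              # total length of the IP packet in bytes
    has_tcp_header: bool     # does the packet begin with a TCP header?
    seq: Optional[int]       # TCP sequence number when a TCP header is present
    frag_offset: int         # IP fragment offset in bytes (0 when unfragmented)


def tso_segment(data_len: int, mss: int, first_seq: int = 0) -> List[Packet]:
    """Toy TSO: every output packet is a complete TCP segment."""
    packets = []
    offset = 0
    while offset < data_len:
        chunk = min(mss, data_len - offset)
        packets.append(Packet(IP_HEADER + TCP_HEADER + chunk, True,
                              first_seq + offset, 0))
        offset += chunk
    return packets


def ip_fragment(data_len: int, mtu: int, first_seq: int = 0) -> List[Packet]:
    """Toy IP fragmentation: only the first fragment holds the TCP header."""
    ip_payload = TCP_HEADER + data_len
    # Fragment payloads (except the last) must be a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    packets = []
    offset = 0
    while offset < ip_payload:
        chunk = min(per_fragment, ip_payload - offset)
        first = offset == 0
        packets.append(Packet(IP_HEADER + chunk, first,
                              first_seq if first else None, offset))
        offset += chunk
    return packets
```

For 3000 bytes of data, `tso_segment(3000, 1460)` yields three packets that each carry their own TCP header and sequence number, while `ip_fragment(3000, 1500)` yields three fragments of which only the first contains the TCP header.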

4. Verify: sure enough, TSO was enabled on the server.

5. Doubt: an IP packet larger than the MTU reached the IP layer, and it carried TCP segment data. This shows that TCP already knew the machine supported TSO. Otherwise, for locally originated packets, TCP would segment strictly by the MSS; it would not build one large packet and leave IP to fragment it, because for locally originated traffic the TCP MSS is derived from the MTU.

Forwarded packets are a different matter, but here the packet clearly originated locally, and TCP was aware of TSO.

6. Guess: TCP was aware of TSO, yet by the time IP sent the packet that knowledge was gone. Somewhere between the entry from TCP into IP and the IP fragmentation decision, something serious must have happened to wipe out the TSO information.

7. Question: I had reproduced the fault in the office. From the reporter's information I learned that it did not happen in every deployment. In fact, I was never convinced that the Linux protocol stack itself was at fault, otherwise it would have been fixed already; I kept suspecting an external module or some external behavior, such as packet capture.

8. Available information: the one other thing I knew was that the problem appeared only when the NAT module was loaded (this was my own deduction; the reporter did not know about the NAT module as such, only about NAT rules). So the target was clear: keep my eyes fixed on the NAT module.

9. Start debugging: the Linux netfilter NAT module is relatively simple, so there was no need for heavyweight memory-level tools; printk was enough. The question was where to print.

10. Error point: before the call to ip_fragment (the function inside which the ICMP need-frag message is sent) there is a check (irrelevant parts omitted):
if (skb->len > dst_pmtu(skb->dst) && !skb_shinfo(skb)->tso_size) {
    return ip_fragment(skb, ip_finish_output);
}

The first condition was clearly true. For ip_fragment to be called, the second condition must hold as well, which means tso_size is 0. Yet with TSO turned on, tso_size should be nonzero and ip_fragment should never be called.
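The effect of that check can be modelled in a few lines of Python. This is only a sketch of the decision described above, not the kernel code: the function name and the returned labels are hypothetical.

```python
def output_action(skb_len: int, path_mtu: int, tso_size: int, df_set: bool) -> str:
    """Toy model of the fragmentation decision quoted above.

    Returns a label describing what happens to the packet.
    """
    if skb_len > path_mtu and tso_size == 0:
        # This is the ip_fragment path. With DF set the packet cannot be
        # fragmented, so an ICMP need-frag error is generated instead.
        return "icmp_need_frag" if df_set else "fragment"
    # Either the packet fits, or TSO is in effect and the NIC will segment it.
    return "transmit"
```

With TSO intact, `output_action(3000, 1500, 1460, True)` returns `"transmit"`. Once `tso_size` has been reset to 0, the same packet gives `"icmp_need_frag"`, which matches the symptom seen in the capture.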

11. Find the tso_size field: it was now obvious that something was setting tso_size to 0!

And it had to be in the NAT module (more than 98% likely...). So the task was to find where the NAT module touches tso_size.

12. Tracing ip_nat_fn: this is the NAT entry point. On entry tso_size is nonzero, but after the call to skb_checksum_help it is 0, so the problem had to be in that function. Note that the helper is invoked only on one condition: the checksum is handled by the hardware. Inside the helper there is an skb_copy, and it is exactly after this copy that tso_size becomes 0. Looking further into skb_copy, I finally pinned it down: copy_skb_header does not copy the original skb's tso_size to the new skb. That is where the problem lies.
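The bug pattern is easy to reproduce in miniature. The sketch below is a toy model in Python, not the kernel structure: `ToySkb` and both copy functions are hypothetical names, and the "fixed" variant is only my assumption of what the correction looks like, since the actual patch is not shown here.

```python
from dataclasses import dataclass


@dataclass
class ToySkb:
    length: int          # packet length in bytes
    ip_summed: str       # checksum state, kept as a plain label in this model
    tso_size: int = 0    # 0 means "TSO not in effect"


def copy_header_buggy(old: ToySkb) -> ToySkb:
    """Copies the fields one by one but forgets tso_size, as described above."""
    return ToySkb(length=old.length, ip_summed=old.ip_summed)


def copy_header_fixed(old: ToySkb) -> ToySkb:
    """Assumed fix: carry tso_size over to the new buffer as well."""
    return ToySkb(length=old.length, ip_summed=old.ip_summed,
                  tso_size=old.tso_size)
```

A field-by-field copy fails silently when a field is added later and the copy routine is not updated: the new buffer simply gets the default value, here 0, which the output path then reads as "no TSO".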

13. Trigger condition: when is skb_copy called? Simple: when the skb is not exclusively owned by the current code path, copy-on-write demands a copy.

The symptom was slowness, the data originated locally, and it was TCP. As we know, TCP cannot free an skb until it has been ACKed, so the skb travelling down the stack is necessarily only a copy of the one TCP keeps queued. Writing to it therefore requires a real copy.

14. Impact: for such a low-level function, a search of the code shows that the impact is huge. All kinds of slowness! In this case the slow path is: socket sends DF data -> TSO in effect -> TSO lost -> ICMP need frag -> TCP cuts the data into smaller segments and sends again ... If ICMP on lo is blocked, it is slower still, because TCP then falls back on timeout retransmission instead of the resegmenting suggested by ICMP, and the retransmissions fail too, until the user program notices and reduces its send length itself.

There are two reasons for bringing up this story. First, I kept no record of the whole process at the time, though the patch may well still be in use; in the end I am not sure. Second, that analysis, seen with today's understanding, points to an optimization opportunity in the Linux protocol stack. With TCP, the data skb stays queued until it is ACKed, so every skb may undergo at least one skb_copy on its way down the stack. Can such copies be avoided? If some netfilter hooks are loaded and need to write to the skb, this serializing behavior can seriously hurt the efficiency of the Linux network stack; it is one of netfilter's common problems.

Appendix: optimization points for skb operations
1. Would it be better to separate data and metadata completely?
2. Make the granularity of write operations finer

Some writes apply to every packet and must involve copying. But could the copy be partial, with scatter/gather I/O used to stitch the pieces together? Manipulating pointers instead of copying the data itself is the essence of COW in the Unix fork model and in virtual address spaces. If the skb's space were divided at fine granularity, this would be possible: only the part being written needs COW, and no copy of the whole buffer is triggered.
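The idea of fine-grained COW can be sketched in Python as follows. This is an illustration of the proposal only, not of how the kernel manages skb memory; the `ChunkedBuffer` class and its methods are hypothetical.

```python
class ChunkedBuffer:
    """A buffer stored as a list of immutable chunks.

    A clone shares every chunk with its source. A write replaces only the
    chunk it touches, so the untouched chunks are never copied.
    """

    def __init__(self, chunks):
        # The list is private to this buffer; the bytes objects are shared.
        self.chunks = list(chunks)

    def clone(self) -> "ChunkedBuffer":
        # Copies the list of references only, not the data.
        return ChunkedBuffer(self.chunks)

    def write(self, chunk_index: int, offset: int, data: bytes) -> None:
        old = self.chunks[chunk_index]
        if offset < 0 or offset + len(data) > len(old):
            raise ValueError("write does not fit inside the chunk")
        # Copy-on-write at chunk granularity: build a new chunk and swap it in.
        self.chunks[chunk_index] = old[:offset] + data + old[offset + len(data):]

    def to_bytes(self) -> bytes:
        # Gather the chunks, in the spirit of scatter/gather I/O.
        return b"".join(self.chunks)
```

If a buffer is split into an IP header chunk, a TCP header chunk, and a payload chunk, a NAT-style rewrite of the IP header on a clone replaces one small chunk, while the payload stays shared with the copy that TCP keeps for retransmission.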

A TCP troubleshooting case from a few days ago: symptoms and process

I have grown used to the thrilling "three prescribed" way of working (a prescribed time, a prescribed place, and prescribed people solving the problem together), but not to proceeding step by step.

Here is what happened.

It was the weekend. At noon, while I was chatting with friends over a meal, I received a message from the company saying that there was a possibly TCP/IP-related failure that needed locating. I did not reply immediately, because such matters usually require a lot of information, and by the time a message reaches me it has typically passed through N hands. So, to avoid wasted effort, I waited for someone to call me.


(The description below is simplified.)
Our server side: Linux, IP uncertain (on an intranet; the NAT policy is unknown, as is whether there are proxies or other layer-7 processing)
Intermediate link: public Internet
Available access modes: 3G / wired dial-up
Server-side devices: third-party load balancers, firewalls, etc.
Business process: the client establishes an SSL connection with the server
Symptom: when the client connects over the wireless link with a 3G network card, the service works normally; over the wired link the SSL handshake fails, specifically the transmission of the client certificate during the handshake.


1. Packet capture on the wired link shows that after the client certificate (longer than 1500 bytes) is sent, an ICMP need-frag message comes back reporting a length overrun: the link MTU is 1480, while the packet actually sent was 1500.

On the wireless link the same ICMP need-frag message is received; only the reported MTU differs. For the wireless link it is 1400.
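The MTU reported in such a message can be read straight out of the ICMP header. The following Python sketch assumes the RFC 1191 layout (type 3, code 4, next-hop MTU in the last two bytes of the 8-byte header); the function name is hypothetical, and the checksum is not verified.

```python
import struct
from typing import Optional

ICMP_DEST_UNREACH = 3   # ICMP type: destination unreachable
ICMP_FRAG_NEEDED = 4    # ICMP code: fragmentation needed and DF set


def next_hop_mtu(icmp_message: bytes) -> Optional[int]:
    """Return the next-hop MTU from an ICMP need-frag message, else None.

    icmp_message starts at the ICMP header (the IP header already stripped).
    """
    if len(icmp_message) < 8:
        return None
    icmp_type, code, _checksum, _unused, mtu = struct.unpack(
        "!BBHHH", icmp_message[:8])
    if icmp_type != ICMP_DEST_UNREACH or code != ICMP_FRAG_NEEDED:
        return None
    return mtu
```

For example, `next_hop_mtu(struct.pack("!BBHHH", 3, 4, 0, 0, 1480))` gives 1480, the value seen on the wired link. The checksum field is left at 0 in this example because the function does not check it.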

2. On the wired link, after receiving the ICMP need-frag message, the client sends again with the length cut by 20 bytes. The capture shows, however, that the client keeps retransmitting this packet without ever receiving the server's ACK for it. Meanwhile, because the client's data has failed to arrive for so long, the server replies with duplicate ACKs to urge it on.

3. Conjecture: at first I suspected timestamps. With TCP timestamps not enabled at both ends, the RTT and retransmission-interval estimates can be off. But that does not explain a 100% failure rate: a timestamp calculation problem would not fail every time, because the computed values fluctuate.

4. The only difference between the wireless link and the wired link is the MTU reported in the ICMP message.

5. Interim summary:
5.1. At this point I did not steer the investigation toward the operator's link, because I felt it was fine. Likewise, I did not think SSL was the problem, because the error always appeared after a large packet was sent. In fact, once the ICMP need-frag message has been received, the oversized packet sent earlier has been discarded and a smaller packet is sent in its place, which is completely normal from the point of view of the TCP at the other end.
5.2. There was no need to look at the service logs at all, because the data had not reached that level yet.

The capture results are very clear: the large packet does not get through, and even when it is resent at the size given by MTU discovery it still does not get through, while the wireless link works. So it should not be an MTU problem.

5.3. Apart from the operator link, the MTU, and server-side processing, what else could be wrong? A program bug is not impossible, nor is some unknown behavior. Either way, the problem needed to be isolated.

6. Guess: some device in the middle cannot handle large packets. This has nothing to do with the MTU; the device may be unable, or simply unwilling, to handle large packets. How large? At any rate it cannot handle a 1480-byte packet; subtracting the IP header and the TCP header leaves 1440 bytes of pure data.

So I wrote a simple TCP client that sends 1440 bytes of data immediately after the TCP handshake, to verify the guess. (It must send immediately so that the server does not actively disconnect for lack of a Client Hello. The point is only to observe the TCP ACK for the large packet, which at this stage has nothing to do with the service.)
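A probe of this kind might look like the Python sketch below. It is not the original program: the function names, the filler byte, and the `hold_seconds` parameter are assumptions, and the ACK itself is still observed in a packet capture rather than by the script.

```python
import socket
import time


def build_payload(length: int) -> bytes:
    """Arbitrary filler bytes; the content does not matter for this test."""
    return b"x" * length


def send_probe(host: str, port: int, payload_len: int,
               hold_seconds: float = 0.0, timeout: float = 5.0) -> None:
    """Connect, send payload_len bytes at once, then keep the socket open.

    Run a packet capture alongside to see whether the ACK for the data
    arrives and how long it takes.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Disable Nagle so the data leaves immediately after the handshake.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        sock.sendall(build_payload(payload_len))
        # Stay connected for a while so retransmissions show up in the capture.
        time.sleep(hold_seconds)
```

A call such as `send_probe(server_address, 443, 1440, hold_seconds=30)` would reproduce the test, where `server_address` stands for the real target, which is not given here. Note that the payload goes out as a single segment only if it does not exceed the MSS negotiated for the connection.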

7. No ACK comes back promptly, and the client keeps retrying the 1440-byte packet (after 10 to 20 seconds an ACK sometimes arrives, but not every time, which is clearly abnormal). To validate the method, I sent data sized to the wireless link's MTU limit, that is 1400-20-20=1360 bytes, and the ACK came back instantly.

Therefore, the critical packet length for the intermediate device lies between 1360 and 1440.
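The sizes used in these steps follow from simple header arithmetic, sketched below in Python. The constants assume a 20-byte IP header and a 20-byte TCP header with no options, as in the calculation above; the function names are hypothetical.

```python
IP_HEADER = 20   # bytes, assuming no IP options
TCP_HEADER = 20  # bytes, assuming no TCP options


def max_payload(mtu: int) -> int:
    """Largest amount of pure TCP data that fits in one packet of size mtu."""
    return mtu - IP_HEADER - TCP_HEADER


def mtu_for_payload(payload: int) -> int:
    """Smallest MTU that carries payload bytes of TCP data in one packet."""
    return payload + IP_HEADER + TCP_HEADER
```

`max_payload(1480)` gives 1440 for the wired link and `max_payload(1400)` gives 1360 for the wireless link, which are the two bounds of the interval. If TCP options such as timestamps were in use, the TCP header would be larger and the payload correspondingly smaller.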

8. After repeated tests, bisecting for the critical point, 1380 turned out to be the threshold: sending 1380 bytes of pure data works, sending 1381 bytes does not.
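The bisection can be written down as a small Python helper. This is a sketch under one assumption: every size up to some threshold gets through and every larger size does not. The `probe` callback is hypothetical and stands for "send this many bytes and report whether the ACK came back promptly".

```python
from typing import Callable


def find_largest_ok(probe: Callable[[int], bool], low_ok: int, high_bad: int) -> int:
    """Bisect for the largest payload size that still gets through.

    low_ok is a size known to work, high_bad a size known to fail.
    """
    while high_bad - low_ok > 1:
        mid = (low_ok + high_bad) // 2
        if probe(mid):
            low_ok = mid      # mid works, so the threshold is at or above it
        else:
            high_bad = mid    # mid fails, so the threshold is below it
    return low_ok
```

Starting from the known bounds 1360 and 1440, the search needs only a handful of probes instead of up to 80 sequential tests.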

The destination address in the capture, referred to as MA, is of uncertain ownership: is it our equipment or the other side's? If it is ours, the troubleshooting continues; if it is not, the troubleshooting ends here. In short, the 1380 threshold is suspicious: it is generally not normal, but a legitimate reason for such a limit cannot be ruled out. The wireless link has no problem because its MTU is smaller: its maximum pure data length, 1360, is below the critical value of 1380.

9. One more test: on a machine simulating the problem client, change the local MTU to 1380+20+20=1420. Transmission is normal; with 1421 it is not.

(Note: only changing the local MTU is effective, because the MSS is tied to the MTU only on the device where the TCP data originates.)


1x. After step 9 I no longer took part in the investigation. In the end, though, our equipment never received the client certificate sent during the SSL handshake, which shows plainly that an intermediate device was blocking the transmission of this "large packet". Who it was and what was going on had nothing to do with us. Personally, though, I was quite curious about it.

Summary of this investigation: this is a typical network problem. IP and TCP are both involved, not in much detail, but it is typical enough. The problem ultimately had nothing to do with business logic, yet it is often only when the business logic misbehaves that such underlying problems are exposed; that is the nature of the TCP/IP protocol stack.

The key to this kind of problem is to isolate it from the higher-level protocols as quickly as possible and not get bogged down in any details.
TCP details: why not consider them? This scenario is neither special nor complicated. Getting stuck in TCP details obscures or hides many lateral problems: you end up studying the TCP retransmission mechanism or the RTT calculation meticulously and reach no conclusion. In other words, you must trust that TCP is working normally.

Server program details: these must be isolated too. The service had not actually begun processing, and the failure was 100% reproducible, so one can conclude that this is not a complex problem. Truly complex problems are often not 100% reproducible, and even once you have dug out the pattern of recurrence they still give you plenty of trouble.

TCP and IP problems are different: although both are part of the network protocol stack, they are used in very different ways.

TCP raises the bar for its users. In general TCP is used by programs, so to get TCP running you need at least to understand the general principles, or the socket mechanism. When you surf the web, TCP is in use and running, but the user is not you; it is your browser.

IP is different: whoever configures IP can be a complete novice, and an arbitrary configuration raises no error.

Further down are cabling problems and topology problems. They have almost no barrier to entry but are even easier to get wrong.

So this kind of problem is the first thing to rule out.

Firewall policy or program bug: the first step should really be to ask the administrator whether a special firewall policy is the cause, but if you cannot get that information you cannot start there. The next step is to suspect a bug in the processing program. At this point it is important to strip away the details of the original business logic: the symptom is that large packets get no ACK, so ignore the content and context of the large packet and simply send an arbitrary large packet as a test.

So troubleshooting this type of problem is a process of step-by-step isolation, technically easier than the NAT bug four years ago. All the complexity and delay lie in coordinating and communicating with people; information that gets distorted or lost between people is a hard problem too. The NAT bug four years ago was technically deeper, going down to the kernel stack's code. I had to find the fault all by myself, but the easy part was that the problem involved only me and was 100% reproducible.

Heaven and earth are precious for having no memory: all wounds will be washed away in time, and all glory will vanish without a trace...
