Network troubleshooting ideas from a TSO/NAT bug in the Linux 2.6.8 kernel (with an skb optimization point)

Last Update:2015-07-08 Source: Internet

Author: User

In dreams there is no right or wrong, no hatred or regret ... Better to keep your mouth shut; it looks cuter that way ... I will not say: this is unfair, I cannot accept it. I will record these bits and pieces in plain words. I got up a little after 4 in the morning to set down, in one go, my recent gains and reflections, my anger and my regrets.

A case from four years ago

Around 2010 I was troubleshooting a problem, described as follows:

Server: Linux kernel 2.6.8 / 192.168.188.100
Client: Windows XP / 192.168.40.34
Business process (simplified):
1. The client initiates an SSL connection to the server
2. Data is transferred

Symptom: during the SSL handshake, the server sends its certificate extremely slowly.

Analysis:
I have forgotten my exact train of thought at the time, but I remember the conclusion: the fix was for a bug in the NAT module of Linux 2.6.8.
After capturing a lot of packets, I found that the machine kept sending itself ICMP need frag error messages, and that the server's certificate was larger than the MTU of the machine's outgoing network interface. The step-by-step reasoning below eventually led to the fix:

1. Verify that the server program sets the DF flag. It obviously does, because only packets with the DF flag set can trigger an ICMP need frag message.

2. Question: when TCP hands data to IP, it looks up the MTU and derives the MSS from it. Knowing the MSS, how can it send an oversized packet? A calculation error is unlikely; Linux is, after all, close to industrial grade.

3. Idea: fortunately I knew some of the terminology, so TCP segmentation offload came to mind.

TCP segmentation offload, TSO for short, is a hardware segmentation technique for TCP. It is not IP fragmentation, so it has nothing to do with the DF flag in the IP header. With IP fragmentation, only the first fragment carries the complete upper-layer header (provided the header fits in one fragment). With TSO, every resulting IP packet carries a standard TCP header: the NIC hardware fills in the checksum, sequence number, and other header fields for each segment and encapsulates the IP header automatically. TSO exists to improve TCP performance.

4. Verification: sure enough, TSO was enabled on the server.

5. Doubt: an IP packet larger than the MTU was handed to the IP layer, and it carried TCP segment data. This shows that TCP knew the machine supported TSO. Otherwise, for locally originated packets, TCP would build packets strictly according to the MSS; it would not build one large packet and leave it to IP to fragment, because for locally originated traffic the TCP MSS is derived from the MTU. Forwarded traffic is different, but the traffic here clearly originated on this machine, so TCP was aware of TSO.

6. Guess: TCP was aware of TSO, yet by the time IP sent the packet that knowledge had been lost. Somewhere between the handoff from TCP to IP and the final IP fragmentation decision, something must have made the packet lose its TSO information.

7. Doubt: I reproduced this failure in the office. From what the reporter told me, I understood that it did not happen in every environment. In fact I was reluctant to believe that the Linux stack itself was at fault, since such a bug would have been fixed long ago, and I kept suspecting external modules or external actions such as packet capture.

8. Available information: by now I had one more piece of information. The symptom appeared whenever the NAT module was loaded. (This was my own inference; the reporter knew nothing about a NAT module, only about NAT rules.) So the target was clear: keep a close eye on the NAT module.

9. Start debugging: the Linux netfilter NAT module is relatively simple, so there was no need for heavyweight memory-level tools; printk was enough. The question was where to print.

10. Error point: before the call to ip_fragment (the function inside which the ICMP need frag is sent), there is a check (irrelevant parts omitted):

if (skb->len > dst_pmtu(skb->dst) && !skb_shinfo(skb)->tso_size) {
    return ip_fragment(skb, ip_finish_output);
}

The first condition is obviously true. For ip_fragment to be called, the second condition must hold as well, which means tso_size is 0. With TSO enabled, tso_size should be non-zero and ip_fragment should never be called.

11. Find the tso_size field: clearly tso_size was being set to 0 somewhere, almost certainly in the NAT module (I put the odds above 98%), so I searched the NAT module for the place where tso_size is set.

12. Trace ip_nat_fn: this is the entry point of NAT. On entry tso_size is non-zero, but after the call to skb_checksum_help it is 0, so the problem had to be in that function. Note that this helper is called only on one condition: the checksum is handled by the hardware. Inside the helper there is an skb_copy, and it is during this copy that tso_size becomes 0. Following skb_copy further down leads to copy_skb_header, which does not copy the original skb's tso_size to the new skb. That is where the problem lies!

13. Trigger condition: when is skb_copy called? Simply put, when the skb does not belong exclusively to the current execution path, it has to be copied before being modified, following the copy-on-write principle. In this case the symptom is slowness, and the data is locally originated TCP. A TCP skb cannot be freed until it has been ACKed, so the skb being sent is necessarily a shared copy of the queued one, and therefore it has to be copied.

14. Impact: this is a very low-level function. A search of the code shows that it is used widely, so the impact is huge and all sorts of things become slow. The slow path in this case is: the socket sends DF data, TCP detects TSO, the TSO information is lost, an ICMP need frag is generated, and TCP cuts the data into smaller segments and sends again ... If ICMP is blocked on lo, things get even slower, because TCP then falls back on retransmission timeouts instead of the resegmentation suggested by the ICMP message, and the retransmissions keep failing until the user program notices and reduces its send length by itself.
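The interaction between the check in step 10 and the lossy header copy in step 12 can be modelled in a few lines of C. This is only a sketch under stated assumptions: struct toy_skb and the three functions are hypothetical stand-ins invented for illustration, not the kernel's sk_buff or its real functions, and the corrected copy simply shows the obvious remedy of carrying tso_size over, not the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the kernel's sk_buff. */
struct toy_skb {
    size_t len;        /* total packet length in bytes */
    unsigned tso_size; /* non-zero: the NIC will segment this packet (TSO) */
};

/* Models the faulty header copy: tso_size is not carried over. */
static struct toy_skb copy_header_buggy(const struct toy_skb *old)
{
    struct toy_skb n = { .len = old->len, .tso_size = 0 };
    return n;
}

/* Models a corrected header copy: tso_size is preserved. */
static struct toy_skb copy_header_fixed(const struct toy_skb *old)
{
    struct toy_skb n = { .len = old->len, .tso_size = old->tso_size };
    return n;
}

/* Mirrors the check in step 10: fragment only if oversized and not TSO. */
static bool takes_fragment_path(const struct toy_skb *skb, size_t pmtu)
{
    assert(pmtu > 0);
    return skb->len > pmtu && !skb->tso_size;
}
```

With a 4000-byte TSO packet and a path MTU of 1500, the original and the correctly copied skb skip the fragment path, while the skb produced by the buggy copy enters it. With DF set, that is the path on which the ICMP need frag is generated.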

There are two reasons for bringing up this old story. First, I never recorded the whole process at the time; the patch has been in use ever since, but eventually I could no longer remember why it was needed. Second, the analysis, as I understand it today, points to a possible optimization in the Linux protocol stack. With TCP, the data skb stays queued until it is ACKed, so any later processing that writes to the skb has to go through skb_copy at least once. Can this copy be avoided? If netfilter hooks that need to write to the skb are loaded, this serializing behavior can seriously hurt the efficiency of the Linux network stack, which is one of netfilter's common problems.

Appendix: optimization points for skb operations
1. Would it be better to separate the data from the metadata completely?
2. Make the granularity of write operations finer
Some write operations apply to every packet and do have to copy, but could the copy be local, with scatter/gather I/O used to stitch the pieces together? Prefer pointer manipulation to copying the data itself; this is the copy-on-write (COW) idea behind the Unix fork model and virtual address spaces. If the skb space were divided at a fine granularity, only the part being written would need COW, and a write would no longer force a copy of the whole buffer.
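A rough sketch of what such fine-grained copy-on-write could look like is given below. It is a toy model and not kernel code: struct seg, struct pkt, the pkt_* functions, and the sizes SEG_COUNT and SEG_SIZE are all hypothetical names chosen for illustration. A packet is a table of pointers to reference-counted regions, sharing a packet copies only pointers, and a write duplicates only the region it touches.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SEG_COUNT 4   /* hypothetical: a packet split into 4 regions */
#define SEG_SIZE  64  /* hypothetical region size in bytes */

/* One reference-counted region of packet data. */
struct seg {
    int refs;
    unsigned char bytes[SEG_SIZE];
};

/* A packet is only a table of pointers to (possibly shared) regions. */
struct pkt {
    struct seg *segs[SEG_COUNT];
};

/* Drop this packet's references; free regions nobody else uses. */
static void pkt_release(struct pkt *p)
{
    for (int i = 0; i < SEG_COUNT; i++) {
        if (p->segs[i] && --p->segs[i]->refs == 0)
            free(p->segs[i]);
        p->segs[i] = NULL;
    }
}

/* Create a packet with freshly allocated, zeroed regions. */
static int pkt_init(struct pkt *p)
{
    for (int i = 0; i < SEG_COUNT; i++)
        p->segs[i] = NULL;
    for (int i = 0; i < SEG_COUNT; i++) {
        p->segs[i] = calloc(1, sizeof(struct seg));
        if (!p->segs[i]) {
            pkt_release(p);
            return -1;
        }
        p->segs[i]->refs = 1;
    }
    return 0;
}

/* Share all regions with a second packet: pointer copies only. */
static void pkt_clone(struct pkt *dst, const struct pkt *src)
{
    for (int i = 0; i < SEG_COUNT; i++) {
        dst->segs[i] = src->segs[i];
        dst->segs[i]->refs++;
    }
}

/* Write one byte, copying only the touched region if it is shared.
 * Returns the number of bytes copied (0 or SEG_SIZE), or -1 on failure. */
static int pkt_write(struct pkt *p, size_t off, unsigned char v)
{
    assert(off < (size_t)SEG_COUNT * SEG_SIZE);
    size_t idx = off / SEG_SIZE;
    struct seg *s = p->segs[idx];
    int copied = 0;

    if (s->refs > 1) {                  /* shared: make a private copy */
        struct seg *priv = malloc(sizeof(*priv));
        if (!priv)
            return -1;
        memcpy(priv->bytes, s->bytes, SEG_SIZE);
        priv->refs = 1;
        s->refs--;
        p->segs[idx] = priv;
        s = priv;
        copied = SEG_SIZE;
    }
    s->bytes[off % SEG_SIZE] = v;
    return copied;
}
```

In this model a write to a shared packet costs one SEG_SIZE copy instead of a copy of all SEG_COUNT regions, and a second write to the same region costs nothing. The trade-off is extra bookkeeping per region, which is why the granularity would have to be chosen with care.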

A TCP troubleshooting case from a few days ago: symptoms and process

I have grown used to the nerve-racking "three prescribed" way of working (a prescribed time, a prescribed place, and prescribed people to solve the problem) and am no longer used to taking things step by step. That is just how things are.

It was the weekend, around noon, and I was having lunch and chatting with friends when I received a text message from the company saying there was a possibly TCP/IP-related fault that needed to be located. I did not reply immediately, because this kind of thing usually requires a lot of information, and such messages have usually passed through N hands before they reach me. So, to avoid wasted effort, I waited for someone to call me.

...

(simplified description below)
Our server: Linux / IP unknown (inside an intranet; the NAT policy and any proxies or other layer-7 processing are unknown)
Test client: Windows / 192.168.2.100 / GW 192.168.2.1
Intermediate link: public Internet
Available access methods: 3G / wired dial-up
Server-side devices: third-party load balancers, firewalls, etc.
Business process: the client establishes an SSL connection with the server
Fault:
When the client connects over the wireless link through a 3G network card, the business works normally. When the client uses the wired link, the SSL handshake fails; the failing step is the certificate transfer during the handshake.

Analysis:

1. Packet capture shows that on the wired link, when the client certificate (longer than 1500 bytes) is sent, an ICMP need frag message comes back: the packet is too long, the link MTU is 1480, and the packet actually sent was 1500 bytes. On the wireless link the same ICMP need frag is received; only the reported MTU differs, 1400 for the wireless link.

2. On the wired link, after receiving the ICMP need frag, the client resends with the length reduced by exactly 20 bytes. However, the capture shows that the client keeps retransmitting this packet and never receives an ACK for it from the server. Meanwhile, because the client's data never reaches the server, the server replies with DUP ACKs to prompt for it.

3. Conjecture: at first I suspected timestamps. Neither end uses TCP timestamps, so the RTT and retransmission interval estimates could be off. But that does not explain a 100% failure rate: if timestamp-related estimation were the cause, it would not fail every time, because the estimates vary considerably with fluctuations in the measurements.

4. The only difference between the wireless link and the wired link is that the ICMP report has a different MTU.

5. Intermediate Summary:
5.1. At this point I had not turned my attention to the carrier's link, because I always assume it is fine. Nor did I think it was an SSL problem, because the failure always follows the sending of a large packet. In fact, after the ICMP need frag is received, the oversized packet sent earlier has already been dropped, and what is resent is a smaller packet, which looks perfectly normal to the TCP at the other end.
5.2. There is no need to look at the service logs at all, since the traffic never gets that far. The capture is very clear: large packets do not get through. The client already sends according to the discovered MTU and it still fails, whereas on the wireless link it gets through. So it should not be an MTU problem.

5.3. Apart from the carrier link, the MTU, and server-side processing, what else could be wrong? A bug in some program is not impossible, nor is some unknown action by a device. In any case, the problem needs to be isolated.

6. Guess: a device in the middle cannot handle large packets. This has nothing to do with the MTU; the device may be unable, or simply unwilling, to handle large packets. How large? At any rate, a 1480-byte packet is not handled; minus the IP header and the TCP header, that leaves 1440 bytes of pure data. So I wrote a simple TCP client program that sends 1440 bytes of data immediately after the TCP handshake, to verify. (The data must be sent immediately so that the server does not close the connection for lack of a Client Hello. The goal is only to observe the TCP ACK for a large packet, which at this point has nothing to do with the service.)

7. No ACK comes back promptly, and the client keeps retrying the 1440-byte packet. (After 10 to 20 seconds an ACK sometimes arrives, but not every time, which is clearly abnormal.) To validate the method, I sent the payload size allowed by the wireless link's MTU, that is 1400-20-20=1360 bytes, and the ACK came back instantly. So the intermediate device's limit on packet length lies somewhere between 1360 and 1440.

8. Repeated testing with a binary search for the critical point showed that 1380 is the limit: sending 1380 bytes of pure data works, sending 1381 bytes does not. The destination address in the capture is 12.23.45.67, referred to as MA below. It is not yet clear what MA is, whether it is our equipment or the other party's. If it is ours, the troubleshooting continues; if not, it ends here. In short, the 1380 threshold is suspicious. It is generally not normal, but a legitimate reason for such a limit cannot be ruled out either. The wireless link has no problem because its MTU is smaller: its maximum payload is 1360 bytes, below the 1380 threshold.

9. Supplementary test: on a machine simulating the problem client, I set the local MTU to 1380+20+20=1420 and transmission was normal; with 1421 it failed. (Note that only changing the local MTU has this effect, because the MSS is tied to the MTU only on the device where the TCP data originates.)

.....

1x. I did not take part in the troubleshooting after step 9. In the end it was confirmed that our equipment never received the certificate sent by the client during the SSL handshake: an intermediate device really was blocking the "big packets". Who it belongs to and what it was doing is no longer our concern, although personally I am still quite curious about it.
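The payload arithmetic and the binary search from steps 6 to 9 can be sketched as follows. The code assumes 20-byte IP and TCP headers without options, as the article's own sums do. The names payload_for_mtu, largest_passing, probe_fn, and simulated_probe are hypothetical: in the real test the probe was a TCP client sending that many bytes and watching for a prompt ACK, whereas here a stand-in that drops anything above 1380 bytes takes its place.

```c
#include <assert.h>
#include <stdbool.h>

#define IP_HDR  20  /* IPv4 header without options */
#define TCP_HDR 20  /* TCP header without options */

/* Largest TCP payload that fits in one packet for a given MTU. */
static int payload_for_mtu(int mtu)
{
    assert(mtu > IP_HDR + TCP_HDR);
    return mtu - IP_HDR - TCP_HDR;
}

/* Hypothetical probe: true if a payload of `len` bytes is ACKed promptly.
 * A real probe would send the data over a fresh TCP connection. */
typedef bool (*probe_fn)(int len);

/* Bisect between a length known to pass and one known to fail;
 * returns the largest passing length. */
static int largest_passing(int good, int bad, probe_fn probe)
{
    assert(good < bad);
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        if (probe(mid))
            good = mid;
        else
            bad = mid;
    }
    return good;
}

/* Stand-in for the path described in the text: drops above 1380 bytes. */
static bool simulated_probe(int len)
{
    return len <= 1380;
}
```

Starting from the known-good 1360 and the known-bad 1440, the search needs seven probes to settle on 1380, which is far fewer than stepping through every length in between.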

Summary: this is a typical network problem involving IP and TCP, light on detail but typical enough. The problem has nothing to do with the business logic, yet this kind of low-level problem is only exposed when the business logic misbehaves; that is in the nature of the TCP/IP protocol stack. The key to this kind of problem is to isolate it from the higher-level protocols as quickly as possible and not get bogged down in any detail.
TCP details: why not dig into TCP details? Scenarios like this are neither special nor complex. If you sink into TCP details, you will obscure or overlook a large number of problems in other directions: you end up studying the TCP retransmission mechanism in depth or poring over the RTT calculation, and may reach no conclusion at all. In other words, you have to trust that TCP is working normally.
Service program details: these must be isolated too. The server never actually started serving, and the fault is 100% reproducible, so this is not a complex problem. Truly complex problems are often not 100% reproducible, and even once you work out the pattern for reproducing them, they still give you plenty of trouble.
TCP versus IP problems: although both are part of the network protocol stack, they are used very differently. TCP raises the bar for its users. In general TCP is used by programs, so to get TCP running you need at least a rough understanding of its principles or of the socket mechanism. When you browse the web, TCP is in use and does run, but the user is not you; it is your browser. IP is different: whoever configures IP can be a complete novice, and an arbitrary configuration raises no error. Further down, cabling and topology have almost no barrier to entry and are even more error-prone. So this is the kind of problem to rule out first.
Firewall policy or program bug: the first step would really be to ask the administrator whether a special firewall policy is the cause. But when you cannot get that information, you cannot start from there. Next, with equal weight, suspect a bug in some processing program. Here it is important to strip away the details of the original business logic: the symptom is that large packets get no ACK, so ignore the content and context of the large packet and send an arbitrary large packet as a test.

So this kind of problem is a process of gradual isolation. Compared with the NAT bug four years ago, this fault is technically easier; all the complexity and delay lie in coordinating and communicating with people, where information gets distorted or lost between them. The NAT bug four years ago was technically deeper, reaching down to the kernel stack code, and I first had to find my way to that point. On the other hand, it was easier in that it involved only me and was 100% reproducible.

Heaven and earth are precious because they keep no memory: every wound is eventually washed away, and every glory eventually leaves no trace ...
