Do not blindly increase ip_conntrack_max: understand Linux kernel memory first

1. Linux memory mapping, as seen through ip_conntrack

There are many articles discussing packets being dropped when the ip_conntrack table is full. From them we learn that Linux has a kernel parameter, ip_conntrack_max, which defaults to 65536. So on a machine with plenty of memory, the temptation is to crank the parameter up like crazy, say to 10000...00: as long as setting it reports no error, surely it is fine to set it as high as possible? If software really worked that way it would be magical indeed, but then its technical content would be lower than a boiler's!
A more thorough person, such as an experienced programmer or network administrator, will think of memory: all connection-tracking entries are stored in memory, so enlarging ip_conntrack_max enlarges the memory it occupies, and that cost has to be weighed. If the system does not have much memory, they will not set the value too high.
But what if your system has plenty of memory? If you have 8 GB of RAM, spending 1 GB on connection tracking sounds perfectly reasonable. Yet on a traditional 32-bit architecture, Linux cannot do it. Why? Because you may not understand how the Linux kernel maps memory at all.
Memory is cheap these days, and the Linux memory-mapping scheme does look a little dated. But the facts stand: ip_conntrack lives in kernel space, so the memory it needs must be mapped into the kernel address space, and under the traditional 32-bit layout only 1 GB of the address space belongs to the kernel. Within that 1 GB, the first 896 MB is linearly mapped to physical memory; after a few holes comes the vmalloc area, which is tiny compared with the one-to-one region. However you carve up the 4 GB address space, the kernel can use no more than 1 GB of it. As for ip_conntrack, because it allocates from the slab allocator, it must use the one-to-one mapped address space, which means it can use at most 896 MB of memory!
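A back-of-the-envelope calculation makes the point. Using the per-entry size measured later in this article (102628 bytes with a 102400-byte pad added, so about 228 bytes for the unmodified struct, ignoring hash-bucket and slab overhead):

    65536 entries * 228 bytes ≈  14 MB    fits easily in the 896 MB linear mapping
  4000000 entries * 228 bytes ≈ 912 MB    already beyond the one-to-one region
 10000000 entries * 228 bytes ≈ 2.3 GB    hopeless on 32-bit, however much RAM is installed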

Why hasn't Linux modernized this "outdated" memory-mapping scheme after all these years? In fact the 64-bit architecture greatly improves the kernel's address space, yet the underlying issue remains: even on 64-bit, the kernel cannot transparently access all physical memory; physical pages must be mapped into the kernel address space before they can be touched. For the one-to-one region that mapping is fixed in advance; for the non-one-to-one regions, which are limited (in fact very small) in size, page tables and page directories must be created dynamically. Another way to put it is that "the kernel should not be doing ip_conntrack at all", since this is protocol-stack work. Unfortunately the Linux protocol stack is implemented entirely in the kernel: ip_conntrack may run in the softirq path that receives an skb, where it cannot sleep, so the work cannot be handed to a process and cannot use a process address space (a process address space, user state plus kernel state, can reach all physical memory).

Linux is this strict about kernel memory because it is precious: do not use it carelessly, and cherish it.
2. Experiment on a 32-bit Linux System

The following experiment was designed to prove the facts above. Some of the methods may look unorthodox; a trick like this will never appear in a company's standard documentation, and it may teach people to cut corners, or be called lazy. But it is worth keeping around just in case, so I am writing it up as a blog post.
Another parameter affects the time and space complexity of looking up connection-tracking entries: ip_conntrack_buckets, the number of hash buckets. In theory, the larger this value, the fewer hash collisions and the faster lookups become; the trade-off is that each bucket needs a small amount of pre-allocated memory, so a huge number of buckets consumes a large amount of it. That memory, too, comes from the precious "kernel space of only 1 GB". Unlike the ip_conntrack structs, however, the hash bucket array can be allocated from vmalloc space; it does not have to sit in the one-to-one linear mapping.
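Some rough numbers, assuming each bucket is a struct list_head of two 4-byte pointers on 32-bit x86, and using the 2.6.x convention (visible in the kernel log later in this article) that ip_conntrack_max defaults to 8 times the bucket count:

    8192 buckets * 8 bytes =  64 KB    the default pair: 8192 buckets, 65536 max
 1048576 buckets * 8 bytes =   8 MB    a noticeable bite out of the small vmalloc area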
2.1. A method for quickly filling up ip_conntrack

LoadRunner would certainly work. But I have my own preferences and I hate everything on Windows, so I needed another way. I was at home after work, alone; I also did not want to reach for a "Swiss Army knife" like netcat, for fear of exhausting local ports or driving my MacBook crazy. So I had to improvise. The goal was only to measure the maximum memory ip_conntrack can occupy; in fact I had known the answer for a long time and just wanted to confirm it in one shot. The trick is to enlarge the ip_conntrack struct itself, which is easy: just append a large field to the struct. The changes below are based on the 2.6.18 kernel of Red Hat Enterprise Linux 5.
2.2. Changes to the ip_conntrack kernel module before testing

Edit the file $build/include/linux/netfilter_ipv4/ip_conntrack.h and add the following member at the end of struct ip_conntrack:

char aaa[102400]; /* 102400 was found by binary search: with a value around 2xxxxx the
                     kernel crashes at module load, because this array is part of the
                     struct itself (allocated much like stack space) rather than
                     dynamically allocated, and a huge one can easily overwrite critical
                     kernel data. Pick a value that works; each tracked connection now
                     costs at least 100000 more bytes anyway ~~ */

Then enter $src/net/ipv4/netfilter and run:

make -C /lib/modules/2.6.18-92.el5/build SUBDIRS=`pwd` modules

After the rebuilt ip_conntrack.ko is loaded, the kernel log prints the following:
ip_conntrack version 2.4 (8192 buckets, 65536 max) - 102628 bytes per conntrack
From this we can see that the ip_conntrack struct has indeed grown: 102628 bytes is the original 228-byte struct plus the 102400-byte pad. The connection load needed to exhaust all usable memory is therefore drastically reduced, and nothing like LoadRunner is needed. To fill up the usable memory as quickly as possible, we also give ip_conntrack some very long timeouts:

sysctl -w net.ipv4.netfilter.ip_conntrack_generic_timeout=600000
sysctl -w net.ipv4.netfilter.ip_conntrack_icmp_timeout=300000
sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_timeout_close=1000000
sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait=120000
sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_timeout_last_ack=300000
sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait=60000
sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait=120000
sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_timeout_established=432000
sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_recv=600000
sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent=120000

In this way some flows are kept alive "almost forever", each pinning an ip_conntrack struct, until the usable memory is exhausted.
Once the ip_conntrack module is loaded, every packet passing through is tracked automatically. I then wrote the following script:

for (( i=1; i<255; i++ )); do
    for (( j=1; j<255; j++ )); do
        ping 192.168.$i.$j -c 1 -W 1
        curl --connect-timeout 1 http://138.$i.$j.80/
        curl --connect-timeout 1 http://38.$i.$j.80/
        curl --connect-timeout 1 http://$i.1.$j.80/
        curl --connect-timeout 1 http://$j.$i.9.8/
    done
done
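While the script runs, the fill level can be watched from another terminal. A minimal sketch, assuming the proc paths that the 2.6.18 ip_conntrack module exposes:

watch -n 5 'cat /proc/sys/net/ipv4/netfilter/ip_conntrack_count /proc/sys/net/ipv4/netfilter/ip_conntrack_max'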
2.3. Test process

Local Configuration:
Memory: 3032 MB (the free command sees 3003 MB)
Run the script above, go smoke a cigarette, take a break, and come back to the following data:
Maximum number of connections: 6149
Memory usage: 886 MB
At this point even ping 127.0.0.1 on the local machine fails, which shows that ip_conntrack has hit its limit. Meanwhile, a print statement inserted at the spot where the ip_conntrack struct is allocated (it has to print a pile of '#' characters so it stands out; see the sketch below) shows the kernel reporting memory-allocation failures. Out of 3 GB of total memory, only 886 MB could be used (and I kept clearing the caches with sysctl -w vm.drop_caches=3); the rest cannot be used by ip_conntrack.
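That print statement is nothing clever. A minimal sketch of the instrumentation, assuming the 2.6.18 allocation site in ip_conntrack_core.c (the cache variable name is from memory and may differ slightly across builds):

/* at the point where a new conntrack entry is allocated */
conntrack = kmem_cache_alloc(ip_conntrack_cachep, GFP_ATOMIC);
if (!conntrack) {
        /* a run of '#' so the failure stands out in the kernel log */
        printk(KERN_WARNING "######## ip_conntrack: struct allocation failed\n");
        return NULL;
}

To make the result more convincing, I also inserted the following code into the initialization function of the ip_conntrack module: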

for (j = 0; j < 400; j++)
        __get_free_pages(GFP_KERNEL, 8);  /* order 8 = 256 pages = 1 MB each, 400 MB total */

This grabs about 400 MB of kernel memory up front (400 allocations of 2^8 pages, i.e. 1 MB each), to check whether the maximum number of connection-tracking entries decreases accordingly. The data obtained:
Maximum number of connections: 3421
Memory usage: 879 MB
It can be seen that with kernel memory occupied, fewer entries can be allocated to ip_conntrack. Going further, keep the __get_free_pages(GFP_KERNEL, ...) loop above unchanged and add the following code:

for (j = 0; j < 400; j++)
        __get_free_pages(GFP_HIGHUSER, 8);  /* another 400 MB, this time from high memory */

The final result is as follows:
Maximum number of connections: 3394
Memory usage: 1203 MB
It can be seen that GFP_HIGHUSER memory does not eat into kernel memory. Keep in mind that user-process memory is almost entirely allocated with this flag. If the GFP_KERNEL loop is removed and only the GFP_HIGHUSER one is kept, the result is:
Maximum number of connections: 6449
Memory usage: 1312 MB
It can be seen that GFP_HIGHUSER allocations sit at the high end of physical memory and do not affect the kernel's one-to-one mapping space.
2.4. Test results

To sum up: on 32-bit Linux, ip_conntrack uses only the one-to-one mapped part of the kernel address space, in other words only the first 896 MB of physical memory. The same result holds after removing the char aaa[] pad from the ip_conntrack struct; it is just no longer easy to fill all the usable memory, and you would need several machines plus a stress tool such as LoadRunner.
3. Experiment on a 64-bit Linux System

Damn! I tweaked the kernel data structures, but the module still will not load. I am still debugging and troubleshooting; it has been several hours already...
4. Conclusion

Finally, note that the initial value of ip_conntrack_max is computed by the kernel from the amount of memory in your machine, and so is ip_conntrack_buckets (a sketch of the calculation follows below). These kernel-chosen defaults are well-tested empirical values, so do not raise them unless you truly must. If your machine does face a huge number of connections and you raise ip_conntrack_max, the price is a large chunk of precious kernel memory, which can make other kernel memory allocations fail; and once kernel memory usage crosses the limit of the kernel's memory mapping, the system will silently drop your packets without even reporting the familiar
ip_conntrack: table full, dropping packet
error. That is a sad thing.
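For reference, the 2.6.x initialization computes the defaults roughly as follows (paraphrased from memory of ip_conntrack_core.c; treat the exact constants as approximate):

/* use about 1/16384 of physical memory for the hash table,
 * capped at 8192 buckets on machines with 1 GB of RAM or more */
ip_conntrack_htable_size = (num_physpages << PAGE_SHIFT) / 16384
                           / sizeof(struct list_head);
if (num_physpages > (1024 * 1024 * 1024 / PAGE_SIZE))
        ip_conntrack_htable_size = 8192;
if (ip_conntrack_htable_size < 16)
        ip_conntrack_htable_size = 16;

/* at most 8 conntrack entries per bucket by default */
ip_conntrack_max = 8 * ip_conntrack_htable_size;

This matches the "8192 buckets, 65536 max" line seen in the kernel log earlier.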

The operating system kernel's role in protocol-stack behavior creates an illusion: "I have all this memory, why won't you let me use it?!" The point is that it is not you who uses it but the kernel; what you control are processes, and programmers do not control the kernel. Of course you can recompile or even modify the kernel, even change how ip_conntrack allocates memory, say by replacing the buddy-system-backed slab memory with memory redirected from user space. But such development has a cost, several hundred yuan a day, and few companies are willing to pay it. So the final conclusion echoes the title: do not blindly increase ip_conntrack_max.

Appendix: 1. About GFP_xxx

GFP_ATOMIC: allocate from the kernel's one-to-one mapped region, dipping into the emergency pool if needed; on failure it returns NULL immediately. It does not try to free other memory, never kills processes, and never sleeps on the allocation path. The ip_conntrack struct is allocated with this flag (mostly from softirq paths).
GFP_KERNEL: allocate from the kernel's one-to-one mapped region; on failure it tries to release whatever memory can be released, may invoke the oom_killer, and may sleep.
GFP_HIGHUSER: may be satisfied from all of physical memory (except a small fixed portion and the memory reserved for bus I/O, which is obvious on x86; see /proc/iomem for the detailed physical memory map). User-process memory is generally allocated with this flag. It prefers the high memory that is "hard" for the kernel to use (physical memory beyond 896 MB on 32-bit, which the kernel must map dynamically via page tables), so that the first 896 MB is left to the kernel as much as possible.
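A minimal sketch contrasting the three flags (illustrative only; the function and variable names here are made up for the demo):

#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/slab.h>

static void gfp_demo(void)
{
        /* GFP_ATOMIC: no sleeping, no reclaim; usable in softirq paths,
         * which is why conntrack entries are allocated this way */
        void *entry = kmalloc(300, GFP_ATOMIC);

        /* GFP_KERNEL: may sleep and reclaim, but on 32-bit the result
         * still lives in the low, one-to-one mapped 896 MB */
        unsigned long buf = __get_free_pages(GFP_KERNEL, 0);

        /* GFP_HIGHUSER: may come from high memory; the kernel needs a
         * temporary kmap() before it can touch the page directly */
        struct page *pg = alloc_pages(GFP_HIGHUSER, 0);
        if (pg) {
                void *va = kmap(pg);
                (void)va;
                kunmap(pg);
                __free_pages(pg, 0);
        }
        kfree(entry);
        free_pages(buf, 0);
}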

2. Hash of ip_conntrack

In Linux, ip_conntrack uses the jhash function for its hash calculation. For the implementation of that function, see:
http://burtleburtle.net/bob/hash/doobs.html
The author's explanation there is very clear.
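The conntrack hash itself is tiny. Roughly, from memory of the 2.6.x source (the field names are approximate):

#include <linux/jhash.h>
#include <linux/netfilter_ipv4/ip_conntrack_tuple.h>

/* hash a connection tuple into one of 'size' buckets; 'rnd' is a boot-time
 * random seed that makes the hash hard to attack remotely */
static u_int32_t hash_conntrack(const struct ip_conntrack_tuple *tuple,
                                unsigned int size, unsigned int rnd)
{
        return jhash_3words(tuple->src.ip,
                            tuple->dst.ip ^ tuple->dst.protonum,
                            tuple->src.u.all | (tuple->dst.u.all << 16),
                            rnd) % size;
}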
