Our game recently went live, and while optimizing server memory I ran into a very strange problem. Our authentication server (AuthServer) talks to the third-party channel SDKs (login and recharge). Because it uses curl in blocking mode, it runs 128 threads. Strangely, the process starts with 2.3G of virtual memory, grows by 64M each time it handles a message, and stops growing once it reaches 4.4G. We pre-allocate memory and the threads themselves make no large allocations, so where was this memory coming from? I was baffled.
1. Explore
The first step was to rule out a memory leak. A leak of exactly 64M every time would be too much of a coincidence, but to prove the point I ran the server under valgrind:
    valgrind --leak-check=full --track-fds=yes --log-file=./authserver.vlog ./authserver &
I then ran the test until memory stopped growing. Sure enough, valgrind reported no memory leaks. I repeated the test many times with the same result.
After several valgrind runs, I began to wonder whether the program was calling mmap or something similar, so I used strace to trace system calls such as mmap and brk:
    strace -f -e "brk,mmap,munmap" -p $(pidof authserver)
The results are as follows:
    [pid 19343] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f53c8ca9000
    [pid 19343] munmap(0x7f53c8ca9000, 53833728) = 0
    [pid 19343] munmap(0x7f53d0000000, 13275136) = 0
    [pid 19343] mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f53d04a8000
    Process 19495 attached
I went through the trace file and did not find many mmap calls, and the memory growth attributable to brk was small as well. At this point I felt I had lost direction. I then suspected that the file cache was consuming virtual memory, so I commented out all the code that reads and writes logs. Virtual memory still grew, which ruled that out too.
2. An epiphany
Later, I started testing with fewer threads and stumbled on a very strange phenomenon: if a process creates a thread and that thread allocates even a tiny amount of memory, say 1k, the virtual memory of the whole process immediately grows by 64M. Further allocations do not increase it. The test code is as follows:
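    #include <iostream>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    using namespace std;

    // Set by main() to ask the worker thread for one allocation.
    volatile bool start = 0;

    void *thread_run(void *)
    {
        while (1)
        {
            if (start)
            {
                cout << "Thread malloc" << endl;
                // Deliberately never freed: we only watch virtual memory.
                char *buf = new char[1024];
                (void)buf;
                start = 0;
            }
            sleep(1);
        }
        return 0;
    }

    int main()
    {
        pthread_t th;
        // Wait for one line of input (a character plus Enter)
        // before creating the thread.
        getchar();
        getchar();
        pthread_create(&th, 0, thread_run, 0);
        // Every further input triggers one allocation in the thread.
        while (getchar() != EOF)
        {
            start = 1;
        }
        return 0;
    }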
The results: at startup the process uses 14M of virtual memory. After entering 0, the child thread is created and the process reaches 23M; this increase of roughly 10M is the thread's stack (the thread stack size can be viewed and set with ulimit -s). After entering 1 for the first time, the program allocates 1k and the whole process gains 64M of virtual memory. Entering 2 and 3 allocates another 1k each time, and memory no longer changes.
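The numbers above can also be read from inside the process instead of from an external tool. The following is a minimal sketch, assuming a Linux system where /proc/self/status exposes a VmSize line in kB; the helper names vm_size_kb and vm_growth_kb are hypothetical and not part of the original test program.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Returns the current virtual memory size of this process in kB,
// or -1 if /proc/self/status is unavailable or has no VmSize line.
// Assumption: Linux procfs layout ("VmSize:   <number> kB").
long vm_size_kb()
{
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line))
    {
        if (line.compare(0, 7, "VmSize:") == 0)
        {
            std::istringstream fields(line.substr(7));
            long kb = -1;
            fields >> kb;
            return kb;
        }
    }
    return -1;
}

// Difference between two readings taken with vm_size_kb().
// Both readings must be valid (vm_size_kb() returns -1 on failure).
long vm_growth_kb(long before_kb, long after_kb)
{
    assert(before_kb >= 0 && after_kb >= 0);
    return after_kb - before_kb;
}
```

Taking one reading before the thread's first allocation and one after it should show the 64M jump (65536 kB) described above, if the process behaves as in the original experiment.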
This result thrilled me. I had previously studied Google's tcmalloc, in which each thread has its own cache to reduce contention in multithreaded memory allocation, and I guessed that newer versions of glibc had adopted the same technique. So I looked at the memory map with pmap $(pidof main):
Note the 65404 line. All the signs indicate that this line plus the one above it (132 here) makes up the added 64M: 65404K + 132K = 65536K. As the number of threads grows, a new 65404 block appears for each new thread.
3. Inquiring
After some googling and code reading, I finally learned that glibc's malloc was behind this. glibc 2.11 and later behave this way. From the official Red Hat documentation: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/6.0_Release_Notes/compiler.html
Red Hat Enterprise Linux 6 features version 2.11 of glibc, providing many features and enhancements, including ... an enhanced dynamic memory allocation (malloc) behaviour enabling higher scalability across many sockets and cores. This is achieved by assigning threads their own memory pools and by avoiding locking in some situations. The amount of additional memory used for the memory pools (if any) can be controlled using the environment variables MALLOC_ARENA_TEST and MALLOC_ARENA_MAX. MALLOC_ARENA_TEST specifies that a test for the number of cores is performed once the number of memory pools reaches this value. MALLOC_ARENA_MAX sets the maximum number of memory pools used, regardless of the number of cores.
The developer, Ulrich Drepper, has a much deeper explanation on his blog: http://udrepper.livejournal.com/20948.html
Before, malloc tried to emulate a per-core memory pool. Every time when contention for all existing memory pools was detected a new pool is created. Threads stay with the last used pool if possible ... This never worked 100% because a thread can be descheduled while executing a malloc call. When some other thread tries to use the memory pool used in the call it would detect contention. A second problem is that if multiple threads on multiple cores/sockets happily use malloc without contention, memory from the same pool is used by different cores/on different sockets. This can lead to false sharing and definitely additional cross traffic because of the meta information updates. There are more potential problems not worth going into here in detail.

The changes which are in glibc now create per-thread memory pools. This can eliminate false sharing in most cases. The meta data is usually accessed only in one thread (which hopefully doesn't get migrated off its assigned core). To prevent the memory handling from blowing up the address space use too much, the number of memory pools is capped. By default we create up to two memory pools per core on 32-bit machines and up to eight memory pools per core on 64-bit machines. The code delays testing for the number of cores (which is not cheap, we have to read /proc/stat) until there are already two or eight memory pools allocated, respectively.

While these changes might increase the number of memory pools which are created (and thus increase the address space they use) the number can be controlled. Because using the old mechanism there could be a new pool being created whenever there are collisions, the total number could in theory be higher. Unlikely but true, so the new mechanism is more predictable.

... Memory use is not that much of a premium anymore and most of the memory pool doesn't actually require memory until it is used, only address space ... We have done internally some measurements of the effects of the new implementation and they can be quite dramatic.
New versions of glibc present in RHEL6 include a new arena allocator design. In several clusters we've seen this new allocator cause huge amounts of virtual memory to be used, since when multiple threads perform allocations, they each get their own memory arena. On a 64-bit system, these arenas are 64M mappings, and the maximum number of arenas is 8 times the number of cores. We've observed a DN process using 14GB of vmem for only 300M of resident set. This causes all kinds of nasty issues for obvious reasons. Setting MALLOC_ARENA_MAX to a low number will restrict the number of memory arenas and bound the virtual memory, with no noticeable downside in performance, we've been recommending MALLOC_ARENA_MAX=4. We should set this in hadoop-env.sh to avoid this issue as RHEL6 becomes more and more common.
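With the 64M figure in hand, the strace output from section 1 can be reread. The sketch below checks one interpretation of it: glibc reserves 128M, keeps the 64M-aligned 64M region inside that reservation, and unmaps the rest. That matches my understanding of how glibc sets up a new arena heap, but treat it as an assumption rather than something the original article states; the function names are hypothetical.

```cpp
#include <cstdint>

// Assumption: on 64-bit systems a thread arena lives in a 64M region
// that is aligned to a 64M boundary.
constexpr std::uint64_t kHeapMax = 64ull * 1024 * 1024;

// First 64M-aligned address at or above addr.
constexpr std::uint64_t aligned_start(std::uint64_t addr)
{
    return (addr + kHeapMax - 1) & ~(kHeapMax - 1);
}

// Bytes unmapped in front of the aligned region.
constexpr std::uint64_t head_trim(std::uint64_t addr)
{
    return aligned_start(addr) - addr;
}

// Bytes unmapped behind the aligned region, given that the original
// reservation was 2 * kHeapMax (the 134217728-byte mmap in the trace).
constexpr std::uint64_t tail_trim(std::uint64_t addr)
{
    return kHeapMax - head_trim(addr);
}
```

For the address in the trace, 0x7f53c8ca9000, this gives a head trim of 53833728 bytes and a tail trim of 13275136 bytes, the exact sizes of the two munmap calls, and the second munmap starts at 0x7f53d0000000, exactly 64M above the aligned start. So the single mmap/munmap group in the trace already accounted for the 64M step.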
To sum up: to improve memory allocation performance, glibc uses memory pools called arenas. In the default 64-bit configuration each arena is 64M, and a process can have at most cores * 8 arenas. On a 4-core machine that means up to 4 * 8 = 32 arenas, or 32 * 64M = 2048M of virtual memory. You can change the number of arenas by setting an environment variable, for example: export MALLOC_ARENA_MAX=1
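The arithmetic in this summary can be written down as a small sketch. The constants are the 64-bit defaults quoted above (64M per arena, 8 arenas per core); the function names are hypothetical.

```cpp
#include <cstdint>

// Defaults quoted in the text for 64-bit systems.
constexpr std::uint64_t kArenaSizeMiB = 64;
constexpr std::uint64_t kArenasPerCore = 8;

// Upper bound on the number of arenas for a given core count.
constexpr std::uint64_t max_arenas(std::uint64_t cores)
{
    return cores * kArenasPerCore;
}

// Upper bound on the address space (in MiB) those arenas can reserve.
constexpr std::uint64_t max_arena_address_space_mib(std::uint64_t cores)
{
    return max_arenas(cores) * kArenaSizeMiB;
}
```

For 4 cores this yields 32 arenas and 2048M. The growth from 2.3G to 4.4G described in the introduction is close to that figure, which would fit a 4-core machine, although the article does not state how many cores the server had.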
Hadoop recommends setting this value to 4. Since arenas were introduced to reduce allocation contention between threads on multi-core machines, setting it to the number of CPU cores is also a reasonable choice. After changing this value, it is best to stress test your program to see whether the number of arenas affects its performance.
If you want to set this in program code, you can call mallopt(M_ARENA_MAX, xxx). Because our AuthServer pre-allocates its memory and no thread allocates memory on its own, we do not need this optimization, so we turn it off at initialization with mallopt(M_ARENA_MAX, 1). Setting the value to 0 lets the system choose the limit automatically based on the number of CPUs.
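A minimal sketch of that initialization call, assuming a glibc system where <malloc.h> provides mallopt and M_ARENA_MAX. The wrapper name limit_arenas is hypothetical; the article only says that the call is made at initialization.

```cpp
#include <cassert>
#include <malloc.h>  // glibc-specific: mallopt, M_ARENA_MAX

// Caps the number of malloc arenas for this process.
// Returns true if glibc accepted the setting (mallopt returns 1 on success).
bool limit_arenas(int max_arenas)
{
    // This sketch only handles explicit positive limits.
    assert(max_arenas > 0);
    return mallopt(M_ARENA_MAX, max_arenas) == 1;
}
```

Calling limit_arenas(1) early in main(), before any worker threads are started, corresponds to the mallopt(M_ARENA_MAX, 1) setting described above.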
4. Accidental discovery
tcmalloc allocates small objects from the thread's own memory pool, while large allocations still come from the central allocation area. I did not know how glibc was designed, so I changed the allocation size in the program above from 1k to 1M. As expected, after the initial 64M was added, each allocation still increased virtual memory by 1M. (glibc serves requests at or above its mmap threshold, 128K by default, with a separate mmap instead of from the arena.) So it appears that the new glibc borrowed heavily from tcmalloc's design.
After several busy days the problem is finally solved, and I am in a good mood. This problem taught me that a server programmer who does not understand the compiler and the operating system kernel is simply not qualified. I need to strengthen my learning in this area.
- This article is from: Linux Tutorial Network
Why multithreaded programs on Linux consume so much virtual memory