Tracing memory leaks using information from the running process
Abstract: Memory leaks are a problem frequently encountered in long-running background server programs. Many tools can locate a memory leak, Valgrind among them, but they require the process to be restarted. In some cases the same leak is difficult or slow to reproduce after a restart. This article discusses a method for analyzing a process instance that is already leaking memory, in order to find the leak point.
I. Symptom
Bigpipe is a distributed transmission system at Baidu. Its server module, Broker, is built on an asynchronous programming framework and uses reference counting to manage the lifetime and release of object resources. During stress testing of the Broker module, we found that the Broker's memory usage grew gradually over a long run, which indicated a memory leak.
II. Preliminary Analysis
Starting from the Broker's recent upgrades and changes, we narrowed down the objects that might be leaking. The Broker had gained a monitoring feature, part of which collects statistics on the server's parameters and therefore has to read the parameter objects. Each read increments the object's reference count by 1 and decrements it by 1 once the read completes. There are several such parameter objects, so the next step is to determine which one is leaking.
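The counting discipline described above can be sketched as follows. This is only an illustration under assumptions, not the Broker's code: the names RefCounted, add_ref, and release are hypothetical, and std::atomic stands in for whatever primitive the framework actually uses.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical reference-counted object; not taken from the Broker source.
class RefCounted {
public:
    // A new object starts with one reference, held by its creator.
    RefCounted() : count_(1) {}

    // Called before an operation (for example a monitoring read) uses the object.
    void add_ref() { count_.fetch_add(1); }

    // Called when the operation completes. Returns true if this call dropped
    // the last reference, in which case the caller must free the object.
    bool release() {
        int previous = count_.fetch_sub(1);  // value BEFORE the subtraction
        assert(previous > 0);                // releasing more often than acquiring is a bug
        return previous == 1;
    }

    int count() const { return count_.load(); }

private:
    std::atomic<int> count_;
};

// Usage: one read by the monitoring code, then the owner lets go.
void example() {
    RefCounted params;
    params.add_ref();           // read starts: count 1 -> 2
    assert(!params.release());  // read ends: count 2 -> 1, object stays alive
    assert(params.release());   // owner releases: count 1 -> 0, object must be freed
}
```

If any release is skipped, or its result is misread, the count never reaches the point at which the object is freed, and the object stays in memory.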
III. Code & Business Analysis
1. To verify the preliminary analysis, one option is to run the Broker under Valgrind and start the stress program to reproduce the suspected leak. This approach has two problems:
1) The conditions that trigger the leak are not simple, so reproduction may take a long time, and the same leak may not be reproduced at all;
2) The leaked objects are stored in a container, so after the process exits normally under Valgrind, no leak is reported.
A short trial run on another test cluster bore this out: Valgrind reported nothing abnormal.
2. Analyze the existing conditions: fortunately, the Broker process showing the memory leak is still running, and the truth lies inside that process. The live process should be exploited fully to locate the problem. We would like to debug it with GDB.
3. Challenge: attaching GDB to the process (attach {PID}) suspends it. By design, when the paired master and slave Brokers stop exchanging heartbeats, the Broker exits automatically, and the state of the process cannot be preserved after it exits. This means there is only one chance to use GDB.
4. Solution: use gdb to dump the memory contents and look for likely leak points in the dump.
5. Step 1: Run pmap -x {PID} to view the memory map (for example: pmap -x 24671). The output looks like the following; note the region marked as anon:
    24671:   ./bin/broker
    Address            Kbytes  RSS  Anon  Locked  Mode   Mapping
    0000000000400000    11508    -     -       -  r-x--  broker
    201700000103c000      388    -     -       -  rw---  broker
    0000000000000d000  144508    -     -       -  rw---  [anon]
    Rj7fb3f583b000          4    -     -       -  rw---  libgcc_s-3.4.5-20051201.so.1
    ----------------------------------------
    Total kB           610180    -     -       -
6. Step 2: Start gdb with gdb ./bin/broker and use the attach {PID} command to attach to the existing process. For example, if the process ID is 24671, use: attach 24671
7. Step 3: Use set height 0 and set logging on to turn on gdb logging; the log is saved to the file gdb.txt.
8. Step 4: Use x/{count}a {memory address} to print a region of memory, where {count} is the number of address-sized words to print (8 bytes each in a 64-bit process). For example, if the anon region above is the heap and occupies 144508 KB, that is 144508 × 1024 / 8 = 18497024 words, so use: x/18497024a 0x0000000000000d000. Alternatively, put the commands in a file and execute them with source command.txt. The content of command.txt is as follows:
    set height 0
    set logging on
    x/18497024a 0x0000000000000d000
    x/23552a 0x000000317ae09000
    x/2048a 0x000000317b65e000
    x/512a 0x000000318a821000
    x/2560a 0x000000318b18d000
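The counts in these x commands are numbers of words rather than bytes, so each one has to be derived from the Kbytes column of the pmap output. A small helper can do the conversion. This is a sketch that assumes a 64-bit process (8-byte words); the function names words_in_region and dump_command are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Number of 8-byte words in a region whose size pmap reports in KB.
// Assumes a 64-bit process, where gdb's "a" format prints 8-byte words.
std::uint64_t words_in_region(std::uint64_t kbytes) {
    assert(kbytes <= UINT64_MAX / 1024);  // guard against overflow
    return kbytes * 1024 / 8;
}

// Builds one gdb "x" command for a region that starts at `address`.
std::string dump_command(std::uint64_t address, std::uint64_t kbytes) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "x/%llua 0x%llx",
                  static_cast<unsigned long long>(words_in_region(kbytes)),
                  static_cast<unsigned long long>(address));
    return buf;
}

// Usage: the 144508 KB anon region and a 4 KB region.
void example() {
    assert(words_in_region(144508) == 18497024);
    assert(dump_command(0x318a821000ULL, 4) == "x/512a 0x318a821000");
}
```

Writing one such line per region of interest into command.txt produces a file of the same shape as the one shown above.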
9. Step 5: Analyze the information in the gdb.txt file. The content of gdb.txt is as follows:
    0x1071000 <_ZN7bigpipe13bmq_handler_t16_heart_beat_bodyE+832>: 0x0 0x0
    0x1071010 <_ZN7bigpipe13bmq_handler_t16_heart_beat_bodyE+848>: 0x0 0x0
    ...
    0x000010c0 <_zgvz5getippce4lock>: 0x0 0x0
    0x000010d0 <_zgvzn7bigpipe13bmq_handler_t14get_heart_beaterie4__sl>: 0x0 0x0
    0x000010e0 <_zst8__ioinit>: 0x0 0x0
    0x000010f0 <_zgvz5getippce4lock>: 0x0 0x0
    ...
    0x22c2f00: 0x10200d0 <_ZTVN7bigpipe15BigpipeDIEngineE+16> 0x4600000001
    0x22c2f10: 0x1 0x117087b
    0x22c2f20: 0x0 0x1214495
    ...
    0x22c2f70: 0x0 0x0
    0x22c2f80: 0x0 0x0
    0x22c2f90: 0x0 0x0
    ...

Description and analysis of gdb.txt: the first column is the current memory address, for example 0x22c2f00. The remaining columns are the values stored at that address (in hexadecimal) together with gdb's debug information. For example, in 0x10200d0 <_ZTVN7bigpipe15BigpipeDIEngineE+16> 0x4600000001 the three fields are the "first 8 bytes", the "symbol information (note the +16 offset)", and the "last 8 bytes". Not every address is printed with symbol information, and the symbol information sometimes appears in the third column and sometimes in the second.

The memory at address 0x22c2f00 in the line above stores the virtual function table pointer (vptr) of an object of the bigpipe::BigpipeDIEngine class. The vptr points at the slot of the virtual function table that holds the address of the class's virtual destructor. The +16 offset is there because, on 64-bit GCC, the table's symbol begins with two 8-byte header entries before the first function slot. The memory around address 0x10200d0 should therefore be the virtual function table (vtbl) of the BigpipeDIEngine class, as follows:
    (gdb) x/a 0x10200d0
    0x10200d0 <_ZTVN7bigpipe15BigpipeDIEngineE+16>: 0x53e2c6
    (gdb) x/i 0x53e2c6
    0x53e2c6: push %rbp
    (gdb) x/a 0x53e2c6
    0x53e2c6: 0xec834853e5894855

The value stored at address 0x10200d0 is the address of the destructor of the BigpipeDIEngine class, that is, the real address at which the destructor's code begins: 0x53e2c6. The output above shows that the instruction at 0x53e2c6 is push, the first instruction of a function. The value stored at 0x22c2f00 is therefore the vptr of an object, and gdb annotates it with the symbol information BigpipeDIEngine.

Based on this, the number of instances of a class (one with a virtual destructor) can be estimated by counting how often its symbol appears. Sort gdb.txt and process it to obtain the number of times each symbol (class name or function name) appears. For example, to extract the symbol information in angle brackets and sort it by number of occurrences, use a command similar to the following:

    cat gdb.txt | grep "<" | awk -F'<' '{print $2}' | awk -F'>' '{print $1}' | sort | uniq -c | sort -rn > result.txt

Then keep only the entries with project-related prefixes (such as bmq, Bigpipe, and bmeta) that refer to virtual function tables:

    cat result.txt | grep -P "bmq|Bigpipe|bigpipe|bmeta" | grep "_ZTV" > result2.txt

This produces a list similar to the following:
    35782 _ZTVN7bigpipe14CConnectE+16
      282 _ZTVN3bsl3var4IVarE+16
      179 _ZTVN7bigpipe19bmeta_stripe_info_tE+16
       26 _ZTV13AutoKylinLockI5MutexE+16
       21 _ZTVN6google8protobuf8internal26GeneratedMessageReflectionE+16
        8 _ZTVN6comcfg17ConstraintLibrary12WrapFunctionE+16
        8 _ZTVN3bsl3var11BasicStringINS_12basic_stringIcNS_14pool_allocatorIcEEEEEE+16
        6 _ZTVN7bigpipe19bmeta_broker_info_tE+16
        6 _ZTVN7bigpipe15BigpipeDIEngineE+16
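The same tally can be produced by a small program instead of a shell pipeline. The sketch below is an illustration, not the tool used in this investigation, and count_symbols is a hypothetical helper. It assumes gdb printed mangled names, which contain no angle brackets of their own. Unlike the awk pipeline, which keeps only the first symbol on each line, it counts every symbol on a line.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Counts every "<symbol>" annotation in a gdb memory dump read from `in`.
// Only symbols starting with `prefix` are kept ("_ZTV" marks a vtable).
std::map<std::string, int> count_symbols(std::istream& in, const std::string& prefix) {
    std::map<std::string, int> counts;
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type pos = 0;
        while ((pos = line.find('<', pos)) != std::string::npos) {
            std::string::size_type end = line.find('>', pos);
            if (end == std::string::npos) break;  // unterminated annotation
            assert(end > pos);                    // '>' was searched starting at '<'
            std::string symbol = line.substr(pos + 1, end - pos - 1);
            if (symbol.compare(0, prefix.size(), prefix) == 0) ++counts[symbol];
            pos = end + 1;
        }
    }
    return counts;
}

// Usage: two objects of one class plus an unrelated function symbol.
void example() {
    std::istringstream dump(
        "0x22c2f00: 0x10200d0 <_ZTVN7bigpipe15BigpipeDIEngineE+16> 0x4600000001\n"
        "0x22c2f10: 0x1 0x117087b\n"
        "0x22c3000: 0x10200d0 <_ZTVN7bigpipe15BigpipeDIEngineE+16> 0x1\n"
        "0x1071000 <_ZN7bigpipe13bmq_handler_t16_heart_beat_bodyE+832>: 0x0 0x0\n");
    std::map<std::string, int> counts = count_symbols(dump, "_ZTV");
    assert(counts.size() == 1);
    assert(counts["_ZTVN7bigpipe15BigpipeDIEngineE+16"] == 2);
}
```

Sorting the resulting map by count gives the same ranking as sort -rn in the pipeline.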
10. Find the project-related object that appears most often: CConnect. With the likely leaking object identified, locate the places in the asynchronous framework where a CConnect object is deleted and released.
11. Tracing the new monitoring code related to CConnect leads to the following:
    if (atomic_add(&_count, -1) == 0) {
        _free(_conn);
    }
IV. Root Cause
Checking the implementation of atomic_add (shown below), we see that it returns the value before the addition (or subtraction). The name atomic_add does not convey this, so the caller misused the function, assuming that it returns the value after the addition. As a result, the reference count was wrongly judged to be non-zero, _free was never called, and the memory leaked. In general, __sync_fetch_and_add has a counterpart, __sync_add_and_fetch. The difference between the two is "fetch the value first, then add" versus "add first, then fetch the value".
    int atomic_add(volatile int *count, int add) {
        register int __res;
        /* __sync_fetch_and_add returns the value held BEFORE the addition */
        __res = __sync_fetch_and_add(count, add);
        return __res;
    }
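The difference between the two builtins can be seen in a short sketch. It assumes GCC or Clang, which provide the __sync builtins. The wrapper names fetch_then_add, add_then_fetch, should_free_buggy, and should_free_fixed are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Returns the value held BEFORE the addition (the behavior of atomic_add above).
int fetch_then_add(volatile int* count, int add) {
    return __sync_fetch_and_add(count, add);
}

// Returns the value held AFTER the addition.
int add_then_fetch(volatile int* count, int add) {
    return __sync_add_and_fetch(count, add);
}

// Both functions drop one reference and report whether the object should be
// freed. They differ only in which value they compare with 0.
bool should_free_buggy(volatile int* count) { return fetch_then_add(count, -1) == 0; }
bool should_free_fixed(volatile int* count) { return add_then_fetch(count, -1) == 0; }

// Usage: a single remaining reference is released.
void example() {
    volatile int a = 1, b = 1;
    assert(!should_free_buggy(&a));  // old value 1 is compared with 0: never freed
    assert(should_free_fixed(&b));   // new value 0 is compared with 0: freed
    assert(a == 0 && b == 0);        // the counter itself is correct in both cases
}
```

In both variants the counter ends at 0; only the returned value differs. That is why the leak does not show up when the count itself is inspected.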
V. Solution
The program is therefore fixed as follows:
    if (atomic_add_and_fetch(&_count, -1) == 0) {
        _free(_conn);
    }
VI. Summary
1. Problems in programs built on an asynchronous framework are hard to locate and trace, so logs, gdb, pmap, and other means need to be combined to reproduce and locate them;
2. Valgrind is not the only way to detect memory leaks, and it has limitations;
3. Function names should be as intuitive as possible, so that callers are not misled into errors;
4. Read the documentation of library functions carefully to learn how to use them correctly.
Scenarios and limitations of this method:
1) When gdb is used to dump memory, there must be a one-to-one relationship between instances and the symbols that appear in the dump. In the practice above, the CConnect class has a virtual destructor, so each instance carries a virtual function table pointer that shows up in the dump as a symbol. The symbol count can therefore serve as evidence when inferring a leak. If the leaked memory leaves no such "trace" in the dump, this method yields no useful information about the leak.
2) After a memory leak occurs online, the leaking process (the live scene) must still exist; the method works on that existing process.
3) The method analyzes a process that is already leaking and makes full use of the problematic process.
4) The method can supplement other memory leak debugging methods. It is worth trying and can serve as a reference.
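The one-to-one relationship in point 1) can be checked with a small experiment. The sketch below is illustrative only and assumes the object layout used by GCC and Clang on Linux, where the first word of an object with virtual functions is its vptr. The classes Connection and Engine and the helper first_word are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Two unrelated hypothetical classes, each with a virtual destructor.
struct Connection { virtual ~Connection() {} int fd = 0; };
struct Engine     { virtual ~Engine() {}     int id = 0; };

// Reads the first pointer-sized word of an object. With the layout assumed
// here, that word is the vptr of a polymorphic object.
template <typename T>
std::uintptr_t first_word(const T& object) {
    static_assert(sizeof(T) >= sizeof(std::uintptr_t), "object too small");
    std::uintptr_t word;
    std::memcpy(&word, &object, sizeof(word));
    return word;
}

// Usage: every instance of a class stores the same vptr value, so each
// occurrence of that value in a memory dump corresponds to one instance.
void example() {
    Connection a, b;
    Engine e;
    assert(first_word(a) == first_word(b));  // same class, same vptr
    assert(first_word(a) != first_word(e));  // different class, different vptr
}
```

A class without virtual functions has no such word, which is why its instances leave no trace that this method could count.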