Tracking down a segmentation fault in a server process

Last Update:2018-12-04 Source: Internet

Author: User

Tags valgrind


Recently, I took over a server project. The server program had been online for nearly two months and crashed from time to time, but no core dump was generated when it crashed. According to the daemon that supervises the server process, the process crashed on receiving SIGSEGV. From the log output, the crash could only be traced to the company's base library used by the code, and that library had long been proven correct.

Because the crashes were not severe, the colleague in charge of the investigation concluded that the server had received a malformed message, and the problem was left unresolved when that colleague departed.

After taking over the service, I did further development on it because new features were needed. Once development was finished, the intranet tests showed no problems. Once it went online, however, it crashed without fail after 1-2 hours of operation, still without producing a core file. Judging from the log output, the earlier crash had resurfaced.

Because the crash was this serious, further work could not proceed, so management set a deadline: find the cause of the crash within two weeks, then move on to developing the subsequent features. Unfortunately, the colleague who had written this code was already gone, and he was the one most likely to know where it would crash, so I had no choice but to read the code myself.

Strangely, when my new feature ran on its own, the process did not crash at all, and the old features were also stable on their own. With both enabled, the process crashed within two hours. At first I suspected that my new code was writing memory out of bounds, so I commented out every memory write in the newly added code, but the problem remained. Then I wondered whether the new code wrote out of bounds when interacting with the old features, so I analyzed all the code that my changes might affect. I found nothing wrong. Along the way I made several hypotheses and added strict checks on every boundary condition, but none of this solved the problem.

Next I concentrated on checking whether the threads' stack space was insufficient, how many files the process had open at the same time, the non-thread-safe function (strerror) used in the process, and a third-party logging library used by the threads. In the end, all of these were ruled out.
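
For the strerror check mentioned above, one common way to take the function out of the picture is to switch to strerror_r, which writes into a caller-supplied buffer. The following is only a sketch, not the code from this project: describe_errno and pick_result are hypothetical names. Two variants of strerror_r exist (the GNU one returns char*, the XSI one returns int), so the sketch uses overloads to accept either.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

namespace {
// XSI variant: returns 0 on success and fills the caller's buffer.
inline const char* pick_result(int rc, const char* buf) {
    return rc == 0 ? buf : "unknown error";
}
// GNU variant: returns the message pointer (it may or may not use the buffer).
inline const char* pick_result(const char* msg, const char* /*buf*/) {
    return msg;
}
}  // namespace

// Hypothetical helper: build an error message without relying on the
// shared static buffer that plain strerror() is allowed to use.
std::string describe_errno(int err) {
    char buf[256] = {0};
    return pick_result(strerror_r(err, buf, sizeof buf), buf);
}
```

A call such as describe_errno(ENOENT) returns a std::string owned by the caller, so two threads formatting errors at the same time do not share any storage.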

After a week, there was no progress.

In my first bout of despair, I thought of core dumps. Since the process crashed without producing a core file, I wondered whether something was wrong with its signal handling, so I analyzed the signal-handling code. It was, frankly, a mess. In this multi-threaded program, each thread set its own signal mask, some threads also used sigwait, and different threads modified the signal handlers. Worse, the multi-threaded process actually called system("curl") in one place: a thread forks a child shell, and that shell in turn forks a child to run the curl command, which delivers a SIGCHLD signal when it exits. This was the ugliest part of the entire code base. I rewrote the signal handling for the whole process and blocked SIGCHLD and SIGPIPE. I had suspected that these two signals were interrupting non-reentrant new or malloc operations in the process and corrupting later memory writes, but the results showed that these two signals were not the problem.
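
The blocking step described above can be sketched as follows. This is a minimal illustration under assumptions, not the rewritten code from this project: block_async_signals and is_blocked are hypothetical names. The idea is to set the mask once, early in the main thread, because threads created afterwards inherit the mask of the thread that creates them. SIGSEGV is deliberately left alone so that a crash still produces a core dump.

```cpp
#include <csignal>
#include <pthread.h>
#include <signal.h>

// Block SIGCHLD and SIGPIPE in the calling thread. Call this before any
// worker thread is started so that every thread inherits the same mask.
// Returns 0 on success, or an error number from pthread_sigmask.
int block_async_signals() {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGCHLD);
    sigaddset(&set, SIGPIPE);
    return pthread_sigmask(SIG_BLOCK, &set, nullptr);
}

// Report whether `signo` is blocked in the calling thread. Passing a null
// new set only queries the current mask; nothing is changed.
bool is_blocked(int signo) {
    sigset_t current;
    sigemptyset(&current);
    if (pthread_sigmask(SIG_BLOCK, nullptr, &current) != 0) {
        return false;
    }
    return sigismember(&current, signo) == 1;
}
```

With SIGPIPE blocked, a write to a closed socket fails with EPIPE instead of terminating the process, so the return value of every write still has to be checked.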

Still, once the signal handling was corrected, the process could generate core dumps normally. After the service went online, it produced plenty of core files, which let me analyze the internal state of the code at the moment of the crash. During this period I pored over variables and object memory in GDB with the print and x commands. I found that the crash point really was in the company's base library. My code merely called this library's interface, but the memory pointed to by a pointer inside the library was often invalid. I again began to doubt my new code and suspected that an out-of-bounds write somewhere in it had overwritten the pointer value stored on the stack, but a careful review of the code turned up nothing. The pointer referred to different content on each crash, and sometimes the content looked like an object that had already been destructed. I then began to suspect a bug in a system call used by the library, but I found no useful information online. So I rewrote the class's destructor to write specific marker values, in order to verify whether the object the pointer referred to had been destructed and was then used again elsewhere. This hypothesis did not hold either.

Because my attention had been stuck near the crash point, I kept trying to verify improbable causes. In the process, though, I gradually noticed that in most cases the object the pointer referred to was largely intact, with only part of it overwritten. The pointer itself lives on the stack, and it refers to memory on the heap. The problem was not that the pointer had been overwritten, but that the heap memory it pointed to had been. With that, the situation became clearer: I was looking for a faulty memory write somewhere in the program.

I turned to third-party tools, studied valgrind in detail, and ran the program under it. However, the server allocates a large amount of memory at run time and does not free it on exit, so valgrind reported many memory leaks. By my analysis, the process does not leak memory while running; it simply does not free memory at exit and leaves the operating system to reclaim it. valgrind also reported some uses of uninitialized memory, but I ruled them all out as causes of the crash. The legendary valgrind did not help me find an out-of-bounds write. During this period I also considered running the program under GDB, but since the program was serving tens of thousands of online users, I had to give up that idea.

By this time I had gradually come to understand the whole program framework, and analyzing the core dumps had given me a clear picture of the program's run-time state. I then noticed a common pattern. The server contains a protocol module kept for compatibility with an earlier version, and that module supports an old login protocol, which is what leads to the ugly system() call. Very few users still use this login protocol: only a few hundred out of the millions who log in every day. Yet on every crash, the thread handling the old protocol was executing the system() call, with the sys_wait kernel interface at the top of its call stack. At that point I still trusted the system() function, because the man page says it handles errors in the command it executes. Trusting that obscure code, I went on searching all the memory writes in the code instead.

I analyzed every snprintf() call and every other place in the code that writes memory. Here I was tormented by the string objects scattered throughout the code. Because of string's copy-on-write behavior, I had to analyze the use and referencing of every string object carefully. A server program should put efficiency first when processing protocol packets, yet all protocol messages in this code are handled with string objects: resize, data, and c_str calls everywhere, with the string's internal pointer passed all over the place. Worst of all, after a const pointer has been passed along, the memory it refers to gets modified. The designers of the string class have repeatedly stressed that the memory returned by data() and c_str() must not be modified, but the code clearly ignored this. After analyzing all the memory reads and writes, I still had not found an out-of-bounds write. I had lost my direction and did not know where to look next.
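
The point about data() and c_str() can be illustrated with a short sketch. It is not taken from this code base: patch_field is a hypothetical helper, and the packet layout in the usage note is invented. Instead of casting away const and writing through the pointer returned by c_str(), the field is changed through the string's own replace() member, so the string keeps track of its length and any copy keeps its own content.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: overwrite `len` bytes of `packet` starting at
// `offset` with the first `len` bytes of `src`, using only std::string's
// own interface. The string grows if the field extends past its end.
void patch_field(std::string& packet, std::size_t offset,
                 const char* src, std::size_t len) {
    if (offset > packet.size()) {
        packet.resize(offset, '\0');  // pad so that `offset` is a valid position
    }
    // replace() clips the replaced range to the current end of the string.
    packet.replace(offset, len, src, len);
}
```

For example, calling patch_field(packet, 6, "BODY", 4) on a string holding "HEADERxxxxTAIL" leaves "HEADERBODYTAIL" in packet, while a copy of the original string taken beforehand is unaffected.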

At this stage I rewrote the program's memory pool module to simplify its allocation and write operations; the main purpose, of course, was to rule out overwrites caused by the memory pool. With nothing else left to check, I finally turned to the system() call. I removed the call to system() and implemented its function directly in the thread with a third-party library. After recompiling and deploying, a miracle happened: the program ran stably and never crashed again.

The problem was finally solved. If I have to explain the cause of the crash, I would say: (1) do not perform fork-related operations in multi-threaded programs; (2) do not pass a string's internal pointer to another process; (3) block most signals and do not neglect them, otherwise they will interrupt some of your system calls and cause errors.

I learned a lot in the process of solving this problem. I had never disliked the string class so much before. Of course, I also understand the behavior of the string class better now, and I am more comfortable with signal handling.

What deserves reflection is that when you face a hard problem, you need to confront its most difficult part, because the part you dare not face is often the one you end up touching last, and it is also the most likely source of the problem. If you cannot face it right away, all you can do is wander among the unimportant parts and get nowhere. When you finally find the problem in the hardest part, you realize that the earlier wandering was a waste of time and energy. So to solve a problem, examine the hardest and most likely cause first, rather than avoiding it in search of unlikely secondary causes.

Here are some possible causes of segmentation faults and how to avoid them.

Some segmentation faults are easy to track down, but others are hard. For example, a bad memory write in one place may only show up some time later, when another place reads that memory. This is very hard to locate, so be careful when writing code.

1. Using invalid pointers, including uninitialized and freed pointers (set a pointer to NULL when it is declared and again after it is freed).

2. Memory reads or writes out of bounds. This includes array accesses out of bounds, and memory-writing functions that are given the wrong length or cannot take a length at all. Typical functions are strcpy (use strncpy) and sprintf (use snprintf).

3. For C++ objects, use the class's own interface for memory operations. Do not write to memory through a pointer returned by a C++ object, such as string's data() and c_str().

4. Do not return a reference to, or the address of, a local object. When the function returns, its stack frame is popped and the local object's address becomes invalid; modifying or reading it has undefined consequences.

5. Avoid defining overly large arrays on the stack. Otherwise the process's stack space may run out and a segmentation fault may occur.

6. Operating system limits, such as the maximum memory a process can allocate and the maximum number of file descriptors a process can open. These limits have to be raised through ulimit, setrlimit, or sysctl.

7. In multi-threaded programs, access must be mutually exclusive when several threads operate on the same memory at the same time. Otherwise the contents of that memory are unpredictable.

8. Using non-thread-safe function calls.

9. Using non-reentrant function calls where signals are handled. These functions read or write data in some memory area; when a signal interrupts one of them, the memory write is cut off midway, and the next entry into the function will inevitably cause errors.

10. Passing an address across processes.

11. System calls with special requirements, such as epoll_wait. Normally, once a socket is closed, epoll no longer returns events for it; however, if the socket has been duplicated with dup or dup2, closing it does not remove it from epoll.
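
Item 11 follows from the fact that epoll tracks the underlying open file description rather than the descriptor number, so a registration stays alive as long as any duplicate remains open. A minimal sketch of one way to avoid the problem, assuming Linux and using the hypothetical helper name remove_and_close, is to deregister the descriptor explicitly before closing it:

```cpp
#include <sys/epoll.h>
#include <unistd.h>

// Hypothetical helper: explicitly remove `fd` from the epoll instance
// `epfd`, then close it. Returns the result of epoll_ctl (0 on success,
// -1 on error); the descriptor is closed in either case.
int remove_and_close(int epfd, int fd) {
    epoll_event unused{};  // ignored for EPOLL_CTL_DEL, kept for very old kernels
    int rc = epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &unused);
    close(fd);
    return rc;
}
```

Once the descriptor has been closed while a duplicate is still open, EPOLL_CTL_DEL can no longer be issued with the old descriptor number, which is why the order of the two calls matters.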

There are many others, which will be supplemented later.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
