Original:
http://baidutech.blog.51cto.com/4114344/904419
A core, also known as a core dump file, is a mechanism of Unix/Linux operating systems. For online services, the mere mention of a core dump is alarming: a process dumping core means the service temporarily cannot respond properly and has to be recovered, and the larger the memory space of the dumping process, the longer this takes (for example, when a process uses more than 60 GB of memory, the full core file takes about 15 minutes to write to disk). The resulting loss of traffic is immeasurable.
Everything has two sides. When the OS dumps core it terminates the current process, but it also preserves first-hand data from the scene. The OS acts like a camera whose shutter has just been pressed, and the photo is the core file it writes out. The file contains the process's memory, CPU registers and other state at the moment it was terminated, which developers can use for debugging afterwards.
There are many reasons for a core dump to occur. For example, earlier UNIX versions did not support what modern Linux offers, namely attaching gdb directly to a running process; you first had to send the process a terminating signal and then read the core file with a tool. On Linux, we can use kill to send a signal to a specified process, or use the gcore command to write out a core file on demand (gcore leaves the process running afterwards). On the surface, a core dump means that the process has a bug that a programmer needs to fix. At a deeper level, the process has violated some OS-level protection mechanism, forcing the OS to send it a signal such as SIGSEGV (signal 11). Dereferencing a null pointer or indexing past the end of an array, for instance, violates the OS's memory management by touching memory that does not belong to the current process. The OS uses the core dump as a warning, much as the immune system of a person infected with a virus raises a fever as a warning, and the fever always has an underlying cause. Interestingly, not every out-of-bounds array access produces a core dump: whether it does depends on the size and boundaries of the virtual pages that the OS memory manager has mapped. Even when no core dump occurs, the program has very likely read dirty data, which corrupts its later behavior and makes for a bug that is hard to track down.
Put this way, the core dump may sound overbearing and beyond our control, but that is not the case. There are two ways to control whether and how core files are generated:
1. Modify the /proc/sys/kernel/core_pattern file. This file controls the name given to generated core files. By default it contains a single line, "core". The name can be customized, generally with % followed by a character; here are several of the supported specifiers:
%p  PID of the process that dumped core
%u  UID of the process that dumped core
%s  number of the signal that caused the core dump
%t  time of the core dump, in seconds since 1970-01-01 00:00:00
%e  name of the executable file of the process that dumped core
2. The ulimit -c command, which displays the OS's current limit on core file size. If it is 0, no core file will be generated. To change the limit, use:
ulimit -c n
where n is a number giving the maximum allowed core file size, in kilobytes. To remove the limit altogether, use:
ulimit -c unlimited
Once a core file has been generated, the next step is to examine it, determine where the problem lies, and fix it. To do that, let's first look at the format of the core file and get to know core files better.
First of all, a core file is in ELF format, which can be confirmed by running the readelf -h command on it.
The ELF header information that readelf prints shows that the file's type is a core file. How does readelf know? The answer is in the ELF header data structure, which contains a type field (e_type).
When the value of this field is 4 (ET_CORE), the file is a core file. With that, the whole process is clear.
With that understood, let's look at how to read a core file and trace the bug from it. On Linux, the usual command for reading a core file is:
gdb exec_file core_file
gdb first reads the symbol table from the executable and then reads the core file. Would it work without the executable? The answer is no: the core file contains no symbol table, so it cannot be debugged on its own. You can verify this with the following command:
objdump -x core_file | tail
We see the following two lines of information:
SYMBOL TABLE:
no symbols
This indicates that this ELF file contains no symbol table information.
To show how to examine the information in a core file, let's take a simple example:
#include <stdio.h>
int main() {
    int stack_of[100000000];  /* 400,000,000 bytes on the stack */
    int b = 1;
    int *a;                   /* deliberately left uninitialized */
    *a = b;
}
Compile this program with gcc -g a.c -o a. It dumps core as soon as it is run. Viewing the stack information with gdb a core_file, the core dump appears to come from this line of code:
int stack_of[100000000];
The reason seems obvious: requesting such a large array directly on the stack overflows the stack space and violates the OS limit on stack size, so the process dumps core (whether it does also depends on the stack size the OS is configured with, typically 8 MB). To be precise, though, the code that actually triggers the core dump is not the allocation int stack_of[100000000] but the statement after it, int b=1. Why? A core dump of this kind is caused by an illegal memory access. Declaring the array stack_of does not touch any memory; the first access happens when the variable b is declared and assigned, and that write lands outside the permitted stack region. It is effectively an out-of-bounds access, and that is what triggers the core dump. To explain in more detail, let's use gdb to see where the fault occurs, with the command gdb a core_file.
gdb reports that the program was terminated with a segmentation fault ("Segmentation fault"), and that the faulting code is the statement int b=1. The current stack information shows that the instruction pointer rip points to address 0x400473.
The instruction at that address is a movl that stores the immediate value 1 to the address 0xffffffffe8287bfc(%rbp), where rbp holds the frame pointer. The displacement 0xffffffffe8287bfc is clearly a negative number, and it works out to -400000004. This matches the source: the array int stack_of[100000000] occupies 400000000 bytes, b is an int and occupies 4 bytes, and the stack grows from high addresses toward low addresses, so the stack address of b is 0xffffffffe8287bfc(%rbp), in other words $rbp-400000004.
When we try to access this address in gdb, we find that the memory cannot be accessed, because it lies outside the range the OS allows.
Let's now modify the program as follows:
#include <stdio.h>
#include <stdlib.h>
int main() {
    int *stack_of = malloc(sizeof(int) * 100000000);  /* heap instead of stack */
    int b = 1;
    int *a;   /* still uninitialized */
    *a = b;
}
Compile it with gcc -O3 -g a.c -o a. When run, it dumps core again. Viewing the stack information with gdb shows that the bug is on line 7, the statement *a=b. We then try to print the value of b, only to find that the symbol table has no information about b. Why? The reason is that gcc was given the -O3 option, which optimizes the program, and one side effect is that some local variables are eliminated during optimization, which makes debugging harder. In our code, b is assigned when it is declared and is then used only to assign a value to *a. After optimization the variable is no longer needed: the value 1 is stored to *a directly. At the assembly level, this optimization saves one mov instruction and one register.
At this point our debugging information is somewhat distorted, so we recompile the source without the -O3 option (this explains why large software projects often ship a debug build: it is unoptimized and contains complete symbol information, which makes it easy to debug), rerun the program, and examine the new core file.
This time things are clearer: the value of b is fine, and the problem is a, which points to an illegal address. In other words, the core dump is caused by a never having been given memory to point to. The problem in this example is of course very obvious and can be spotted almost at a glance, but that does not stop it from illustrating some of the points to watch for when examining a core file.
Formation and analysis of core dump files on Linux