Log of kernel crashes

Source: Internet
Author: User
Tags: error code, memory usage, relative, tainted

1. Why Linux kernel panics occur

As its name suggests, a kernel panic means the Linux kernel has reached a state in which it does not know how to continue. When that happens, it prints as much information as it can obtain at that moment.

There are two main types of kernel panic, each explained in more detail below:

1. Hard panic (an "Aieee" message is printed)
2. Soft panic (an "Oops" message is printed)

2. Common Linux kernel panic messages:

(1) Kernel panic - not syncing: Fatal exception in interrupt
(2) Kernel panic - not syncing: Attempted to kill the idle task!
(3) Kernel panic - not syncing: Aiee, killing interrupt handler!
(4) Kernel panic - not syncing: Attempted to kill init!

3. What causes a Linux kernel panic?

Only driver modules loaded into kernel space can directly cause a kernel panic. Under normal conditions you can use lsmod to see which modules the running system has loaded.
In addition, components built into the kernel (such as the memory map) can also cause a panic.

Because hard panic and soft panic are inherently different, we discuss them separately.

4. Hard Panic

A hard panic is generally considered to have occurred in the following situation: the machine is completely locked up and cannot be used, and the Num Lock, Caps Lock, and Scroll Lock LEDs keep blinking. If you are at a terminal, you should see kernel dump information (including an "Aieee" or "Oops" message), similar to the Windows blue screen.

4.1 Causes

For a hard panic, the most likely cause is a driver module's interrupt handler, typically because the driver module dereferences a null pointer inside the interrupt handler. Once this happens, the driver module can no longer process new interrupt requests, and the system eventually crashes.

I have encountered such an example on a multi-core system that included an application processor (AP), a microcontroller (MCU), and a modem processor, where the MCU handled low-power control for the system. For some reason the MCU timed out and sent a timeout interrupt to the AP. When the AP received the interrupt, its interrupt handler read the MCU's status register, found that this was an MCU timeout interrupt, and deliberately dereferenced a null pointer inside the handler, forcing the AP to print its stack information and then restart Linux. This is a typical hard panic. The reason for the MCU timeout is not analyzed in depth here; the example only illustrates how a hard panic is generated.

5. Soft Panic

5.1 Symptoms

A soft panic is less severe than a hard panic and usually results in a segmentation fault. An Oops message is visible, and searching /var/log/messages for 'Oops' will find it. The machine can still be used to some extent (but the system should be restarted after the information has been collected).

5.2 Causes

Any module crash that occurs outside interrupt handling results in a soft panic. In this case the driver itself crashes, but the system does not fail fatally because the interrupt handling routine is not locked up. The causes of a hard panic also apply to a soft panic (for example, dereferencing a null pointer at run time).
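
The log search mentioned in the symptoms can be sketched in Python. This is only an illustrative sketch and not part of the original article: the function name `find_oops_lines` and the sample log lines are hypothetical, and on a real system the text would come from /var/log/messages.

```python
def find_oops_lines(log_text):
    """Return (line_number, line) pairs for every line that mentions 'Oops'."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if "Oops" in line:
            hits.append((number, line.strip()))
    return hits


# Hypothetical sample; real input would be the contents of /var/log/messages.
sample_log = """\
kernel: module loaded
kernel: Oops: 0002 [#1]
kernel: Call Trace:
"""

for number, line in find_oops_lines(sample_log):
    print(number, line)
```

Keeping the line number with each hit makes it easy to go back to the log and read the register dump and call trace that follow the Oops line.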

What is a fatal exception?

A fatal exception is an exception that requires the program that caused it to be closed. In general, an exception can be any unexpected condition (not only a program error). A fatal exception is simply one that cannot be handled properly, so the program cannot continue running.

Software applications interact with the operating system and with other applications through several layers of code. When an exception occurs in one layer, each layer passes the exception on to the next in search of exception-handling code that can deal with it. If no layer has such a handler, the operating system displays a fatal exception error message. The message may also contain some cryptic information about where the fatal exception occurred (such as a hexadecimal location within the program's address range). This additional information is of little value to the user, but it can help support staff or developers debug the program.

When a fatal exception occurs, the operating system has no recourse other than to close the application, and in some cases the operating system itself shuts down. If fatal exception errors recur while using a particular application, report the problem to the software vendor. When the kernel itself is affected, the keyboard no longer responds, and the machine must be hard-restarted with the reset button.

The panic.c source file provides a mechanism for this: a timeout can be specified so that the machine reboots once the panic has hung for that long. This is the panic timeout reboot mentioned earlier. If the magic keys (Magic SysRq) were configured on the machine in advance, you can use them to have the system do more for you before the timeout expires. This will not bring the system back to normal, but it can avoid losing data or export some useful information to help locate the problem later.
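
On a running Linux system the panic timeout is exposed as the `kernel.panic` sysctl, readable from /proc/sys/kernel/panic. The following Python sketch shows how that value could be read and interpreted. The helper names are hypothetical and not from the original article; the interpretation (zero means no automatic reboot, a positive value is a delay in seconds, a negative value means reboot immediately) follows the kernel's sysctl documentation.

```python
def parse_panic_timeout(text):
    """Parse the content of /proc/sys/kernel/panic into an integer (seconds)."""
    return int(text.strip())


def describe_panic_timeout(seconds):
    """Explain what a given kernel.panic value means."""
    if seconds == 0:
        return "no automatic reboot after a panic"
    if seconds > 0:
        return f"reboot {seconds} seconds after a panic"
    return "reboot immediately after a panic"


def read_panic_timeout(path="/proc/sys/kernel/panic"):
    """Read the current timeout; only meaningful on a Linux system."""
    with open(path) as handle:
        return parse_panic_timeout(handle.read())


# Example with a literal value instead of the real file.
print(describe_panic_timeout(parse_panic_timeout("10\n")))
```

The parsing is kept separate from the file access so the logic can be tried on any machine, including one without /proc.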

6. Solving a kernel panic

Anyone who has developed drivers for the Linux kernel knows that a kernel panic is far more harmful to the system than an application crash; it can fairly be called a disaster. When an application crashes, Linux can simply kill the user process. For a kernel panic there is no such remedy: the kernel manages the whole system, so when it hits a problem (an unrecoverable exception, of course), all that remains is to wait for a reboot.

The biggest problem with a kernel panic is that it is hard to locate; for a developer, some kernel panics are a nightmare. The text above mainly covered how to capture kernel panic information and gave some panic examples. Capturing the printed panic information is the first step, and a key one, in solving a panic. Below, I use one kernel panic that I encountered as an example to illustrate the general approach, from the first appearance of the panic to its resolution.

6.1 Capturing kernel panic information

As mentioned earlier, this is the first step and a critical one. To solve a kernel panic you must first know where it was generated, that is, the kernel function call stack at the time of the panic; the kernel call stack records the chain of function calls that led to the panic. I do not post sample output here. Such kernel panic logs are everywhere online, and many articles explain how to determine which line of which source file caused the panic, so interested readers can search for them. This section covers only the general steps and precautions for solving a kernel panic.

The methods for capturing the kernel log were introduced earlier and are not repeated here, but two points deserve emphasis:

(1) Whatever kind of panic it is, first capture enough of the kernel's printed output. If necessary, also collect the output of applications at the time of the kernel panic; on Android this is the logcat output. The Android embedded platform also has a better and more comprehensive log collection method, bugreport. It produces a full range of information covering the kernel, applications, memory, processes, processors, and everything related, which makes it a very good debugging tool. Readers interested in how bugreport works can look up further material.

Note: Two points apply when using bugreport. First, it can only be used while the system is running normally. Second, because of the first point, you need to run bugreport to export all the information immediately after the system restarts following the kernel panic, because that information includes the log of the reason for the last system restart.

(2) Capturing panic log information means reproducing the panic, and some panics occur randomly, so you do not know when one will appear. Treat every opportunity to reproduce the panic as valuable. At the very least, decide before you reproduce it which information you want to capture, information that will help you narrow down the panic further. Otherwise you will be caught unprepared when the panic appears, not knowing what you want. It is a good idea to plan each reproduction and to decide in advance which information you want (this may differ from one attempt to the next).

Note: The following often happens at work. Colleagues in the test department finally find a problem and ask the developers to locate it. The developers, having barely analyzed the problem, say that not enough information was captured to locate it, so the testers spend half a day or even a full day reproducing it. When the problem is finally reproduced, the developers still have not worked out exactly what information they need. Some problems keep their environment intact for only a few minutes or even a few tens of seconds, so this inevitably wastes the testers' work.

6.2 Analyzing a kernel panic

Once enough panic information has been gathered, it is time to analyze the panic. Keep the following points in mind:

(1) Have some understanding of assembly language, and locate the C code that generated the panic

The call chain that generated the panic can be read from the kernel call stack of the current kernel thread, and the first few lines of the panic log already show the code location of the panic. That location, however, is an offset relative to the function in which the panic occurred, so you do not know exactly which line it is. At this point, use the objdump disassembler to disassemble the image file that produced the panic, find the corresponding assembly code from the instruction information in the panic log, and then compare it with the C code, using the assembly context to determine the C line. In practice, a kernel panic is usually caused by a reference to an illegal address, especially a null pointer dereference, which makes the C line responsible easier to locate.
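
The function-relative location in a panic log usually has the form `symbol+0xoffset/0xsize`. As a rough sketch of the arithmetic involved (not taken from the original article), the Python below extracts the symbol and offset and adds the offset to the symbol's start address to get the address to look up in the objdump output. The function name `wifi_remove`, the log line, and the start address are hypothetical values chosen for illustration.

```python
import re

# Matches the "symbol+0xoffset/0xsize" form used in kernel call traces.
LOCATION = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
)


def parse_location(text):
    """Return (symbol, offset, size) from a trace line, or None if absent."""
    match = LOCATION.search(text)
    if match is None:
        return None
    return (
        match.group("symbol"),
        int(match.group("offset"), 16),
        int(match.group("size"), 16),
    )


def faulting_address(symbol_start, offset):
    """Address of the faulting instruction, given the symbol's start address."""
    return symbol_start + offset


# Hypothetical log line and hypothetical start address of the symbol.
symbol, offset, size = parse_location("PC is at wifi_remove+0x1c/0x80")
print(symbol, hex(faulting_address(0xC0100000, offset)))
```

The start address of the symbol would normally be read from the disassembly or the symbol table of the same image that produced the panic; using a different build gives misleading results.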

(2) Analyze the context of the C line that led to the panic and determine where the panic was introduced

The first step should make it fairly easy to find the C line that triggered the panic; the next step is to work from that code to find where the problem was introduced. printk can be used to help here (a panic that occurs with high probability is easier to locate this way). This step takes somewhat more time than the first. For application code the analysis would be almost finished at this point: once the introduction point is found, the code can be modified and regression tested. For the kernel it is much more complex.

Take the panic I encountered earlier. Although it was not easy to reproduce, it could basically be reproduced after a fixed interval. I used a script to load and unload the WiFi module repeatedly, and the panic appeared after about 500 iterations each time. Knowing this made the panic much easier to solve. But each run of about 500 cycles took about 2 hours, and the environment often had problems, so it took me a long time to locate the issue. Each load and unload of the WiFi module decremented the reference count of the devices kset node by one. When the devices kset reference count reached 0, the system reclaimed it, after which any number of different panics could appear. I then found that the reference count of the corresponding device node was incremented and decremented unevenly on each load and unload of the WiFi module, which is what reduced the devices kset count by one each time. Following this further, it looked like a problem in the Linux kernel core code.

(3) It is best not to doubt the Linux core code, or to try to modify it

This result left me unsure whether it really was a core code problem. The Linux core code has been hardened by tens of thousands of experts; can you modify it so casually? Further analysis showed that the panic occurred because the WiFi card we used was a non-standard media card and followed a non-standard flow. In that flow, a single reference to the WiFi device node is taken when the WiFi device is initialized, but when the module is unloaded the references are released as if it were a standard card.
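
The effect of such an imbalance can be illustrated with a toy model in Python. This is a simplified sketch and not kernel code: the class, the initial count of 5, and the numbers of references taken and released per cycle are all hypothetical, chosen only to show how a net loss of one reference per load/unload cycle eventually drives a shared object's count to zero.

```python
class RefCounted:
    """Toy stand-in for a reference-counted kernel object."""

    def __init__(self, count):
        self.count = count

    def get(self):
        self.count += 1

    def put(self):
        if self.count <= 0:
            raise RuntimeError("put() on an object that was already released")
        self.count -= 1


def cycles_until_released(initial, gets_per_load, puts_per_unload, max_cycles=10000):
    """Count load/unload cycles until the count hits zero, or None if it never does."""
    obj = RefCounted(initial)
    for cycle in range(1, max_cycles + 1):
        for _ in range(gets_per_load):
            obj.get()
        for _ in range(puts_per_unload):
            obj.put()
        if obj.count == 0:
            return cycle
    return None


# One reference taken on load, two dropped on unload: net loss of one per cycle.
print(cycles_until_released(initial=5, gets_per_load=1, puts_per_unload=2))
# Balanced get/put: the count never reaches zero.
print(cycles_until_released(initial=5, gets_per_load=1, puts_per_unload=1, max_cycles=100))
```

A failure that shows up after a nearly constant number of cycles, as in the case described above, is consistent with this kind of steady per-cycle drift rather than with random corruption.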

(4) Do not assume that working from the panic information alone will solve the panic

Consider the same panic again. What I described above was really many panics, something I only discovered on later reproductions: loading and unloading 500 times always produced a panic, but not always the same one. Normal thinking says that since it is a panic, you should start from the panic information and keep chasing it down. Done that way, the problem would never be solved: each time you reproduce one panic, other panics keep appearing, and you are left without a clue.

Analyzing these panic logs showed that they had something in common: each panic was preceded by a block of warning output. Analyzing that output is what revealed the source of the problem. The warning content and the analysis are not explained here; I only want to make the following point:

When you are trying to locate a panic, have been unable to do so for some time, and have no better idea, look back at any abnormal log output the kernel produced before the panic. It may be the precursor to the kernel panic or what triggered it.
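
Looking back for warnings before a panic can be partly automated. The sketch below is an illustration only and not the author's method: the function name and the sample log are hypothetical, and it simply assumes that warning lines contain the text "WARNING" and that the panic line contains "Kernel panic".

```python
def warnings_before_panic(log_lines):
    """Return the lines containing 'WARNING' that precede the first panic line.

    Returns an empty list if the log contains no 'Kernel panic' line.
    """
    warnings = []
    for line in log_lines:
        if "Kernel panic" in line:
            return warnings
        if "WARNING" in line:
            warnings.append(line)
    return []


# Hypothetical log excerpt for illustration.
sample = [
    "wifi: module loaded",
    "WARNING: at lib/kobject.c",
    "wifi: module unloaded",
    "Kernel panic - not syncing: Fatal exception in interrupt",
    "WARNING: printed after the panic",
]

for line in warnings_before_panic(sample):
    print(line)
```

Comparing the warnings collected from several different panic logs is one way to spot the common precursor described above.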

(5) Understand the behavior of the Linux kernel as well as you can, and make bold guesses about panics that are hard to crack

Bold guessing here means reasoning rationally from the behavior of the Linux kernel. Some guesses will not be entirely correct, but the process of testing them may bring unexpected gains. I made two wrong guesses about the WiFi module load/unload panic:

The first time, because the panic appeared only after loading and unloading for a long while, and the first panic I found was in the kmem_cache_alloc function, I guessed that a memory leak was exhausting memory. During subsequent reproductions I wrote a script that printed memory usage in a loop. Memory usage stayed stable within a normal range, which proved my first guess wrong.
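
The original script is not shown in the article. A comparable check could be sketched in Python by parsing /proc/meminfo, as below. The function names and the sample values are hypothetical; a real monitoring loop would read /proc/meminfo repeatedly and log the result with a timestamp.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, separator, rest = line.partition(":")
        if not separator:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def used_kb(values):
    """Rough 'used' figure: total memory minus free memory, in kB."""
    return values["MemTotal"] - values["MemFree"]


# Hypothetical sample in the format of /proc/meminfo.
sample = """\
MemTotal:        1024000 kB
MemFree:          256000 kB
Buffers:           10000 kB
"""

print(used_kb(parse_meminfo(sample)), "kB in use")
```

If the figure grows steadily across load/unload cycles, a leak is plausible; if it stays flat, as it did in the case described above, memory exhaustion can be ruled out.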

The second time, I saw (via printk) that the name of the devices kernel object (kobject) was garbled before the panic, so I ventured a guess: memory was being trampled in the Linux kernel, destroying the devices kset object. After doing everything I could to prove this idea, I found that the panic almost always occurred on the 493rd load of the WiFi module, which confused me. Memory trampling that shows up at exactly the 493rd iteration every time is implausible; although this does not fully prove the guess wrong, it was enough to show that my direction was off.

After this back and forth, it took me two weeks to solve this panic. A kernel panic is hard to crack, but as long as you are willing to try and put in the effort, you will learn a great deal even if you do not solve it in the end, including how the Linux kernel behaves. This will greatly help your future study, and it will give you confidence when you meet such problems again.

7. Summary

I had long wanted to summarize some rules for solving kernel panics. I searched for a lot of material online, and it is all much the same; this article also draws on earlier articles. Writing up a summary that others can read and that I can review myself is worthwhile. Kernel panics used to intimidate me; now I can face them with more confidence. This article only offers suggestions to readers who are solving panics. I personally feel that analyzing one specific instance is of limited value, so no specific case is analyzed in detail here. I hope you will find more related articles to read. It is already late at night ...

What is an Oops? Linguistically, "oops" is an interjection. When a small accident happens, or you do something embarrassing, you can say "Oops", which corresponds to the Chinese exclamation "aiyo". "Oops, sorry, I didn't mean to break your cup." That is what oops means.

What is an Oops in Linux kernel development? It is not essentially different from the explanation above, except that the speaker is now Linux. When a more serious problem arises, the Linux kernel apologetically tells us: "Oops, sorry, I messed things up." When a kernel panic occurs, the Linux kernel prints Oops information showing the current register state, the stack contents, and the full call trace, which helps us locate the error.

In this example, 0002 is the Oops error code (a write error that occurred in kernel space), and #1 means that this error has occurred once.

The Oops error code differs depending on the cause of the error. For the examples in this article, refer to the following definition (if the Oops you encounter does not match the list below, it is best to look it up in the kernel code):

* error_code:
* bit 0 == 0 means no page found, 1 means protection fault
* bit 1 == 0 means read, 1 means write
* bit 2 == 0 means kernel, 1 means user-mode
* bit 3 == 0 means data, 1 means instruction
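
The bit definitions above can be turned into a small decoder. The Python sketch below follows only the four bits listed in this article; the function name and the returned labels are hypothetical, and kernels for other architectures or versions may define further bits.

```python
def decode_oops_error_code(code):
    """Decode an Oops error code using the four bit definitions listed above."""
    return {
        "cause": "protection fault" if code & 0x1 else "no page found",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
        "target": "instruction" if code & 0x8 else "data",
    }


# The code is printed in hexadecimal in the Oops line, for example "0002".
decoded = decode_oops_error_code(int("0002", 16))
print(decoded)
```

For the value 0002 only bit 1 is set, so the decoder reports a write to a page that was not found, performed in kernel mode, which matches the description of the example above.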

Sometimes, oops also prints out tainted information. This information is used to indicate what causes the kernel to be tainted (literally "defiled"). The specific definitions are as follows:

1: 'G' if all modules loaded have a GPL or compatible license, 'P' if any proprietary module has been loaded. Modules without a MODULE_LICENSE or with a MODULE_LICENSE that is not recognised by insmod as GPL compatible are assumed to be proprietary.
2: 'F' if any module was force loaded by "insmod -f", ' ' if all modules were loaded normally.
3: 'S' if the oops occurred on an SMP kernel running on hardware that hasn't been certified as safe to run multiprocessor. Currently this occurs only on various Athlons that are not SMP capable.
4: 'R' if a module was force unloaded by "rmmod -f", ' ' if all modules were unloaded normally.
5: 'M' if any processor has reported a Machine Check Exception, ' ' if no Machine Check Exceptions have occurred.
6: 'B' if a page-release function has found a bad page reference or some unexpected page flags.
7: 'U' if a user or user application specifically requested that the Tainted flag be set, ' ' otherwise.
8: 'D' if the kernel has died recently, i.e. there was an OOPS or BUG.
9: 'A' if the ACPI table has been overridden.
10: 'W' if a warning has previously been issued by the kernel. (Though some warnings may set more specific taint flags.)
11: 'C' if a staging driver has been loaded.
12: 'I' if the kernel is working around a severe bug in the platform firmware (BIOS or similar).
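
As a reading aid, the flag letters above can be looked up programmatically. The Python sketch below is a hypothetical helper, not a kernel interface: the dictionary only summarizes the twelve letters listed in this article, and newer kernels define additional letters that it does not cover.

```python
# Short summaries of the taint letters listed above.
TAINT_FLAGS = {
    "G": "all loaded modules have a GPL or compatible license",
    "P": "a proprietary module has been loaded",
    "F": "a module was force loaded",
    "S": "SMP kernel on hardware not certified for multiprocessor use",
    "R": "a module was force unloaded",
    "M": "a processor reported a Machine Check Exception",
    "B": "a bad page reference or unexpected page flags were found",
    "U": "a user or application requested the Tainted flag",
    "D": "the kernel has died recently (OOPS or BUG)",
    "A": "the ACPI table has been overridden",
    "W": "a warning was previously issued by the kernel",
    "C": "a staging driver has been loaded",
    "I": "working around a severe platform firmware bug",
}


def describe_taint(flags):
    """Return (letter, meaning) pairs for each non-blank character in flags."""
    return [
        (letter, TAINT_FLAGS.get(letter, "not in the list above"))
        for letter in flags
        if not letter.isspace()
    ]


# Example flag string; blanks stand for flags that are not set.
for letter, meaning in describe_taint("P        W"):
    print(letter, "-", meaning)
```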

Basically, the tainted information is intended for kernel developers. Users who encounter an Oops while using Linux can send its content to the kernel developers for debugging, and from the tainted information the developers can get a rough idea of the environment the kernel was running in when the panic occurred. If we are only debugging our own driver, this information is of little use.

