A Linux kernel panic is a serious fault that is difficult to locate and troubleshoot. Once a kernel panic occurs, very few logs are left behind, and the most common troubleshooting technique, reproducing the problem, is often impractical. Kernel panics are therefore notoriously painful to debug.
There is no universal solution that covers every kernel panic. This article offers some ideas: first, how to approach troubleshooting a kernel panic; second, how to minimize the chance of one occurring.
What is kernel panic?
As the name implies, a kernel panic means the Linux kernel has reached a state from which it does not know how to proceed. When this happens, the kernel prints as much diagnostic information as it can obtain and then halts; how much actually gets recorded depends on how badly the system was damaged.
There are two main types of kernel panic:
1. Hard panic (outputs an "Aieee" message)
2. Soft panic (outputs an "Oops" message)
What causes a kernel panic?
In general, only code running in kernel space, typically driver modules loaded into the kernel, can directly cause a kernel panic. On a normally running system you can use lsmod to check which modules are currently loaded.
In addition, components built into the kernel itself (such as the memory management code) can also cause a panic.
Because hard panics and soft panics are essentially different, we will discuss them separately.
How to troubleshoot a hard panic
Typical symptoms:
- The machine is completely locked up and unusable.
- The Num Lock, Caps Lock, and Scroll Lock keyboard LEDs blink continuously.
- If a text-mode console was active, the kernel dump information (including the "Aieee" or "Oops" message) may be visible on screen.
- Comparable to a Windows blue screen.
Cause:
For a hard panic, the most likely cause is a driver module's interrupt handler, typically because the handler dereferences a null pointer. Once this happens, the driver can no longer service new interrupt requests, and eventually the system crashes.
Information Collection
Depending on how the panic unfolded, the kernel may have recorded information right up to the moment the system locked. Because a kernel panic is such a severe error, there is no guarantee of how much the system managed to record, so collect everything listed below as completely as possible; only after inspection will you know how much of it turns out to be useful.
- /var/log/messages: with luck, the entire kernel panic stack trace is recorded here.
- Application and database logs: these may show what was happening just before the panic.
- Any other information about what preceded the panic, or about how to reproduce the system's state at that moment.
- A dump of the terminal screen. Once the OS locks up you cannot copy and paste, so fall back to a digital camera or plain old pen and paper.
If the kernel dump information is neither in /var/log/messages nor on the screen, try the following to obtain it (assuming, of course, the machine has not crashed yet):
- Switch from the GUI to a text console: the dump information does not appear in the GUI, not even in a virtual terminal running in graphical mode.
- Make sure the screen does not blank while you wait, using:
- setterm -blank 0
- setterm -powerdown 0
- setvesablank off
- Copy the screen information from the console (see the methods above).
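If the machine panics repeatedly, capturing the console over a serial line avoids the camera-and-paper step entirely. A minimal sketch, assuming a first serial port (ttyS0) and a second machine attached by null-modem cable; adjust the device and baud rate for your hardware:

```
# Kernel boot parameters (appended to the kernel line in the boot
# loader configuration): log to both the serial port and the VGA console.
console=ttyS0,115200 console=tty0

# On the machine at the other end of the cable, record everything the
# dying kernel prints, for example:
#   minicom -C panic.log
# or:
#   screen /dev/ttyS0 115200
```

The dump then survives in panic.log even if the panicking machine never writes it to disk.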
Troubleshooting with a complete stack trace
The stack trace is the most important piece of information for diagnosing a kernel panic. It is best to take it from /var/log/messages; if it exists only on screen, the earliest lines may already have scrolled away, leaving only part of the trace. With a complete stack trace you can often find the root cause of the panic directly. To check whether you have enough of the trace, look for a line containing "EIP": it names the function and module whose call caused the panic, as in the following example:
EIP is at _dlgn_setevmask [streams-dlgndriver] 0xe
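A quick way to check whether the log captured a usable trace is to search for this EIP line. A sketch, demonstrated on a tiny sample file (in practice, point grep at /var/log/messages itself):

```shell
# Build a two-line sample log; real input would be /var/log/messages.
cat > /tmp/sample-messages <<'EOF'
Nov 28 12:17:58 talus kernel: Oops: 0000
Nov 28 12:17:58 talus kernel: EIP is at _dlgn_setevmask [streams-dlgndriver] 0xe
EOF

# The EIP line names the faulting function and the module it lives in;
# if this prints a match, the trace is probably complete enough to use.
grep "EIP is at" /tmp/sample-messages

# Pull the whole dump with some context around the Oops marker.
grep -B 2 -A 30 "Oops:" /tmp/sample-messages
```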
An example of a complete hard panic trace:
Unable to handle kernel NULL pointer dereference at virtual address 0000000c
printing eip:
f89e568a
*pde = 32859001
*pte = 00000000
Oops: 0000
Kernel 2.4.9-31enterprise
CPU:    1
EIP:    0010:[<f89e568a>]    Tainted: PF
EFLAGS: 00010096
EIP is at _dlgn_setevmask [streams-dlgndriver] 0xe
eax: 00000000   ebx: f65f5410   ecx: f5e16710   edx: f65f5410
esi: 00001ea0   edi: f5e23c30   ebp: f65f5410   esp: f1cf7e78
ds: 0018   es: 0018   ss: 0018
Process pwcallmgr (pid: 10334, stackpage=f1cf7000)
Stack: 00000000 c01067fa 00000086 f1cf7ec0 00001ea0 f5e23c30 f65f5410 f89e53ec
       f89fcd60 f5e16710 f65f5410 f65f5410 f8a54420 f1cf7ec0 f8a4d73a 0108139e
       f5e16710 f89fcd60 00000086 f5e16710 f5e16754 f65f5410 0000034a f894e648
Call Trace: [setup_sigcontext+218/288] setup_sigcontext [kernel] 0xda
Call Trace: [<c01067fa>] setup_sigcontext [kernel] 0xda
[<f89e53ec>] dlgnwput [streams-dlgndriver] 0xe8
[<f89fcd60>] sm_handle [streams-dlgndriver] 0x1ea0
[<f8a54420>] intdrv_lock [streams-dlgndriver] 0x0
[<f8a4d73a>] gn_maxpm [streams-dlgndriver] 0x8ba
[<f89fcd60>] sm_handle [streams-dlgndriver] 0x1ea0
[<f894e648>] lis_safe_putnext [streams] 0x168
[<f8a7b098>] __insmod_streams-dvbmdriver_S.bss_L117376 [streams-dvbmdriver] 0xab8
[<f8a78821>] dvbmwput [streams-dvbmdriver] 0x6f5
[<f8a79f98>] dvwinit [streams-dvbmdriver] 0x2c0
[<f894e648>] lis_safe_putnext [streams] 0x168
[<f893e6d8>] lis_strputpmsg [streams] 0x54c
[<f895482e>] __insmod_streams_S.rodata_L35552 [streams] 0x182e
[<f8951227>] sys_putpmsg [streams] 0x6f
[system_call+51/56] system_call [kernel] 0x33
[<c01_19b>] system_call [kernel] 0x33
Nov 28 12:17:58 talus kernel:
Nov 28 12:17:58 talus kernel:
Code: 8b 70 0c 8b 06 83 f8 20 8b 54 24 20 8b 6c 24 24 76 1c 89 5c
How to troubleshoot without a complete stack trace
If only part of the trace is available, it is hard to locate the root cause quickly: there is no obvious line naming the module or function call that triggered the panic, only the kernel's last few operations. In this case, collect as much surrounding information as possible, including application logs, database trace output, and the steps needed to reproduce the fault.
An example of a hard panic trace without the EIP line:
[<c01e42e7>] ip_rcv [kernel] 0x357
[<f8a179d5>] sramintr [streams_dlgndriver] 0x32d
[<f89a3999>] lis_spin_lock_irqsave_fcn [streams] 0x7d
[<f8a82fdc>] inthw_lock [streams_dlgndriver] 0x1c
[<f8a7bad8>] pwswtbl [streams_dlgndriver] 0x0
[<f8a15442>] dlgnintr [streams_dlgndriver] 0x4b
[<f8a7c30a>] gn_maxpm [streams_dlgndriver] 0x7ae
[<c0123bc1>] __run_timers [kernel] 0xd1
[<c0108a6e>] handle_IRQ_event [kernel] 0x5e
[<c0108c74>] do_IRQ [kernel] 0xa4
[<c0105410>] default_idle [kernel] 0x0
[<c0105410>] default_idle [kernel] 0x0
[<c022fab0>] call_do_IRQ [kernel] 0x5
[<c0105410>] default_idle [kernel] 0x0
[<c0105410>] default_idle [kernel] 0x0
[<c010543d>] default_idle [kernel] 0x2d
[<c01054c2>] cpu_idle [kernel] 0x2d
[<c011bb86>] __call_console_drivers [kernel] 0x4b
[<c011bcfb>] call_console_drivers [kernel] 0xeb
Code: 8b 50 0c 85 d2 74 31 f6 42 0a 02 74 04 89 44 24 08 31 f6 0f
<0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
Using the kernel debugger (KDB)
If only a partial trace is available and it is insufficient to identify the root cause, the kernel debugger (KDB) is the next tool to reach for.
KDB is compiled into the kernel. When a panic occurs, it drops the kernel into a shell environment instead of locking up. From that shell we can collect panic-related information that helps locate the root cause of the problem.
Note that KDB requires a stock mainline kernel, for example 2.4.18 rather than a vendor build such as 2.4.18-5, because the KDB patches only apply to the base kernel.
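A sketch of what such a session might look like at the KDB prompt; the commands are from the standard KDB command set, the banner and the memory address are illustrative only:

```
Entering kdb (current=0xc03e4000, pid 0) on processor 0 due to Oops

kdb> bt           # stack backtrace of the failing code path
kdb> rd           # dump CPU registers (EIP, ESP, ...)
kdb> ps           # list active processes at the time of the panic
kdb> md f89e5680  # examine memory near the faulting address
kdb> go           # attempt to resume, once inspection is done
```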
How to troubleshoot a soft panic
Symptoms:
- Less severe than a hard panic
- Usually caused by a segmentation fault in kernel code
- An "Oops" message is produced; search for 'Oops' in /var/log/messages
- The machine remains somewhat usable (but it should be rebooted once the information has been collected)
Cause:
A crash in any kernel module code path outside interrupt handling can lead to a soft panic. In this case the driver itself crashes, but because no interrupt handler is left locked up, the failure is not fatal to the whole system. The root causes that produce hard panics (for example, dereferencing a null pointer at runtime) apply to soft panics as well.
Information collection:
When a soft panic occurs, the kernel generates a dump containing kernel symbol information, which is recorded in /var/log/messages. To start troubleshooting, use the ksymoops utility to convert the raw symbol information into meaningful, human-readable data.
To generate ksymoops output, you must:
- Save the stack trace text found in /var/log/messages to a new file. Make sure the syslog timestamps are removed, otherwise ksymoops will fail.
- Run the ksymoops program on that file (install it if it is not present).
- For detailed usage, see the ksymoops(8) manual page.
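The steps above can be sketched as a short shell session. The timestamp stripping is demonstrated on one sample line; the System.map path is the conventional location and may differ on your system:

```shell
# The syslog prefix "date host kernel: " confuses ksymoops, so strip it.
# Demonstrated on a single sample line; in practice run the same sed over
# the oops text extracted from /var/log/messages.
echo 'Nov 28 12:17:58 talus kernel: Oops: 0000' | sed 's/^.*kernel: //'
# → Oops: 0000

# Then decode the cleaned file against the System.map of the *running*
# kernel (ksymoops must be installed; adjust the map path as needed):
#   ksymoops -m /boot/System.map-$(uname -r) < oops.txt
```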
Below is an example of a soft panic Oops trace:
Code: 8b 70 0c 50 e8 69 f9 f8 ff 83 c4 10 83 f8 08 74 35 66 c7 47
EIP; f89ba71e <[streams-dlgndriver]_dlgn_setidlestate+1e/8c>
Trace; f8951bd6 <[streams]lis_wakeup_close+86/110>
Trace; f8a2705c <[streams-dlgndriver]__module_parm_r4_feature+280/1453>
Trace; f8a27040 <[streams-dlgndriver]__module_parm_r4_feature+264/1453>
Trace; f89b9198 <[streams-dlgndriver]dlgnwput+e8/204>