System crashes (hangs) fall into two categories: hardware problems and software problems.
I. Hardware problems
Consider the following points:
1. Do not overclock the CPU. If the CPU has already been overclocked, first restore it to its original frequency.
An overclocked system may run fine under normal use, yet fail unexpectedly under high load. In particular, some applications on Linux push the hardware to its limits, while the same hardware may run Windows without problems.
2. Confirm that the power supply is sufficient.
Make sure the power supply can still meet the demand when the system is under high load.
3. Use memtest86 to check the memory.
4. Restore the BIOS to its default settings.
For servers, you can also run the vendor's built-in monitoring tools. This is another good troubleshooting method.
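As a quick check for point 1, the clock the kernel currently reports can be compared against the rated frequency of the CPU. The sketch below assumes an x86-style /proc/cpuinfo that contains "cpu MHz" lines; the function name cpu_mhz is only illustrative, and on systems with frequency scaling the reported value can legitimately be lower than the rated one.

```shell
# Print the clock (in MHz) that the kernel reports for each CPU.
# Assumption: /proc/cpuinfo has lines of the form "cpu MHz : 2400.000".
# An alternative file can be passed as the first argument.
cpu_mhz() {
    awk -F': *' '/^cpu MHz/ { print $2 }' "${1:-/proc/cpuinfo}"
}
```

Running cpu_mhz with no argument reads /proc/cpuinfo; a value clearly above the rated frequency suggests the CPU is still overclocked.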
II. Software problems
If hardware problems have essentially been ruled out, we need to collect information about the system, from the software side, at the moment it hangs.
1. If we are lucky, the system is not completely dead (for example, the keyboard may still respond). In that case we can use the magic SysRq key.
The premise is that you must first enable the sysrq function:
# echo "1" > /proc/sys/kernel/sysrq
# setterm -blank 0
Then, when the system hangs, we can use:
Alt + SysRq + T: show the stack information of all processes
Alt + SysRq + M: show memory allocation information
Alt + SysRq + W: show the current register information (on mainline kernels, Alt + SysRq + P dumps the registers and W lists blocked tasks)
For more hotkeys, see /usr/src/linux/Documentation/sysrq.txt on the system.
The setterm -blank command above disables the periodic screen blanking of the text console, which makes it easier to record the information on the screen.
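Before relying on the hotkeys, it helps to confirm that SysRq really is enabled. The following is a minimal sketch; the function name sysrq_status is hypothetical. It interprets the value as 0 (disabled), 1 (everything enabled), or, on newer kernels, a bitmask of allowed functions.

```shell
# Report whether the magic SysRq key is enabled.
# Reads /proc/sys/kernel/sysrq by default; another file can be passed
# as the first argument.
sysrq_status() {
    value=$(cat "${1:-/proc/sys/kernel/sysrq}") || return 1
    case "$value" in
        0) echo "disabled" ;;
        1) echo "enabled (all functions)" ;;
        # Newer kernels treat other values as a bitmask of allowed functions.
        *) echo "enabled (bitmask $value)" ;;
    esac
}
```

If it prints "disabled", repeat the echo command above as root before testing the hotkeys.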
2. To display more kernel debugging information on the screen, you can switch the console from the default 80x25 text mode to a higher resolution. In /boot/grub/menu.lst, add vga=0x305 (1024x768 with 256 colors) at the end of the kernel line, for example:
kernel /boot/vmlinuz-2.4.21-9.30AXsmp ro root=LABEL=/1 vga=0x305
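Editing the kernel line by hand is easy to get wrong, so the change can also be scripted. The sketch below is only an illustration under the assumption of a GRUB legacy menu.lst whose boot entries use lines starting with "kernel"; the function name append_kernel_param is hypothetical. It reads the file on standard input and writes the modified copy to standard output, leaving lines that already carry the parameter unchanged.

```shell
# Append a parameter to every "kernel ..." line of a menu.lst.
# $1: the parameter to add, for example vga=0x305
# Input is read from stdin, the result is written to stdout.
append_kernel_param() {
    awk -v param="$1" '
        /^[ \t]*kernel[ \t]/ {
            # Check whether the parameter is already on this line.
            found = 0
            n = split($0, words, /[ \t]+/)
            for (i = 1; i <= n; i++)
                if (words[i] == param) found = 1
            if (!found) $0 = $0 " " param
        }
        { print }
    '
}
```

For example, append_kernel_param vga=0x305 < /boot/grub/menu.lst > /tmp/menu.lst.new produces a copy that can be reviewed before it replaces the original.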
3. If, unfortunately, the keyboard is dead as well, the only option is to send the system information to another machine over a serial port. The method is as follows:
Modify the /boot/grub/menu.lst file and add the kernel parameters "console=ttyS0 console=tty1" at the end of the kernel line, for example:
kernel /boot/grub/vmlinuz-2.4.21-9.30AXsmp ro root=LABEL=/1 console=ttyS0 console=tty1
Then modify /etc/sysconfig/syslog and add "-c 7" to the klogd options, for example:
KLOGD_OPTIONS="-x -c 7"
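After the next reboot, it is worth confirming that the kernel really was started with the serial console parameter. A minimal sketch, assuming the running kernel exposes its command line in /proc/cmdline; the function name has_kernel_param is hypothetical.

```shell
# Succeed if the kernel command line contains exactly the given parameter.
# $1: parameter to look for, for example console=ttyS0
# $2: command-line file (defaults to /proc/cmdline)
has_kernel_param() {
    # Put each parameter on its own line, then look for an exact match.
    tr -s ' \t' '\n\n' < "${2:-/proc/cmdline}" | grep -Fxq -- "$1"
}
```

For example, has_kernel_param console=ttyS0 && echo "serial console configured" prints the message only when the parameter is present.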
Restart the server and run the following tests:
1) Connect the client and the server with a serial cable, then run the following command on the client:
cat /dev/ttyS0
Run on the server:
echo hi > /dev/ttyS0
If "hi" is printed on the client, the serial link works.
2) Run on the server:
echo w > /proc/sysrq-trigger
Check whether kernel messages appear on the client.
3) Run on the server:
modprobe loop
Check whether kernel messages appear on the client.
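Test 2) works because writing a key letter to /proc/sysrq-trigger has the same effect as pressing the corresponding SysRq hotkey, so the other dumps can be requested the same way from a remote shell. The wrapper below is a small sketch with a hypothetical name, sysrq_send; it only checks that a single key is passed and writes it to the trigger file. It must run as root, and some keys (such as b for reboot) are destructive.

```shell
# Send one SysRq key through /proc/sysrq-trigger.
# $1: a single key such as t, m or w
# $2: trigger file (defaults to /proc/sysrq-trigger)
sysrq_send() {
    key=$1
    trigger=${2:-/proc/sysrq-trigger}
    case "$key" in
        [a-z0-9]) printf '%s' "$key" > "$trigger" ;;
        *)
            echo "sysrq_send: expected a single key such as t, m or w" >&2
            return 1
            ;;
    esac
}
```

For example, sysrq_send t requests the process stack dump and sysrq_send m the memory information; the output goes to the kernel log and therefore to the serial console.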
If all tests pass, run the following command on the client:
cat /dev/ttyS0 | tee /tmp/result
When the crash occurs, the required kernel information can be read on the client (see /tmp/result).
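Because a hang may happen hours or days later, it can be useful to know when each message arrived. The following is one possible variant of the capture command, not part of the original procedure: it prefixes every line with the client's local time before appending it to the result file. The function name capture_console and the timestamp format are assumptions.

```shell
# Copy lines from the serial port to the terminal and to a log file,
# prefixing each line with the local time at which it was received.
# $1: source device or file (defaults to /dev/ttyS0)
# $2: output file (defaults to /tmp/result)
capture_console() {
    src=${1:-/dev/ttyS0}
    out=${2:-/tmp/result}
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done < "$src" | tee -a "$out"
}
```

Run capture_console on the client in place of the plain cat command and leave it running until the server hangs.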
III. Appendix
Generally, Linux crashes for the following reasons:
System hardware problems (SCSI card, motherboard, RAID card, HBA card, NIC, hard disk, etc.)
Peripheral hardware problems (network, etc.)
Software problems (system and application software)
Driver bugs (look for a newer driver)
Kernel bugs (search LKML, or switch to a different kernel and try again)
System settings (restore the defaults, disable the firewall, etc.)