1. Overview
One day, the online distributed file system servers of one of our projects hit multiple Linux kernel panics, seriously affecting the project's ability to provide external services and having a significant impact on the company. Through troubleshooting in two phases, the cause of the kernel panics was essentially identified. The main methods used were searching for information on the Internet, analyzing the kernel error logs, and constructing conditions to reproduce the problem. This document summarizes the whole troubleshooting process.
2. Phase I
When the problem occurred, everyone was anxious. We worked late every day and drew up many plans to reproduce the issue and locate the cause. In the first phase I kept analyzing the cause of the problem for two weeks. Since the work done in that phase was already written up in detailed analysis documents, the following mainly summarizes the measures taken in the first phase and the results achieved.
The first phase went through several steps. Naturally, the first idea was to reproduce the online phenomenon, so I began by examining the online kernel error logs. According to the logs, the kernel panics were mainly caused by the qmgr and master processes (at least that is what the log messages suggested). Combined with the symptoms observed on the servers at the time (high load, inability to provide external services), the first thing I thought of was to write a program that continuously sends mail, since both qmgr and master are mail-related processes. When the program ran I was a little excited: the load climbed and the server's ability to serve external requests slowed down (ssh logins became sluggish), which was close to the online situation. However, the kernel never panicked, no matter how high the sending frequency, and no error messages appeared in the kernel log. This cause was gradually ruled out later.
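The original load-generation program is not included here. The following is only a minimal sketch of that kind of tool, assuming the mail system is Postfix (qmgr and master are Postfix daemons) with an MTA listening on 127.0.0.1:25; the sender and recipient addresses are placeholders.

/* smtp_flood.c - crude SMTP load generator (sketch, not the original tool).
 * Assumes a local MTA (e.g. Postfix) listening on 127.0.0.1:25.
 * Build: gcc -o smtp_flood smtp_flood.c
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

static void send_line(int fd, const char *line)
{
    char buf[512];

    write(fd, line, strlen(line));
    read(fd, buf, sizeof(buf));           /* discard the server's reply */
}

int main(void)
{
    struct sockaddr_in addr = { 0 };

    addr.sin_family = AF_INET;
    addr.sin_port = htons(25);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    for (;;) {                            /* keep the mail queue busy */
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        char banner[512];

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            close(fd);
            continue;
        }
        read(fd, banner, sizeof(banner)); /* 220 greeting */
        send_line(fd, "HELO loadtest\r\n");
        send_line(fd, "MAIL FROM:<test@localhost>\r\n");
        send_line(fd, "RCPT TO:<root@localhost>\r\n");
        send_line(fd, "DATA\r\n");
        send_line(fd, "Subject: load test\r\n\r\nhello\r\n.\r\n");
        send_line(fd, "QUIT\r\n");
        close(fd);
    }
    return 0;
}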
The distributed file system was installed on all of the servers that reported errors, so we suspected it was the cause of the kernel panics. However, the service monitoring showed nothing unusual for the distributed file system at the time, and the data flow was not particularly large. Even so, I set up the distributed file system on several virtual machines and wrote a Java program that continuously wrote files into the cluster through the distributed file system client, while at the same time running the mail-sending program, trying to simulate the online environment as closely as possible. After many runs the online phenomenon still did not appear, and there was no better way to reproduce it.
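The actual reproduction used a Java program going through the distributed file system's own client API, which is not reproduced here. As a language-consistent stand-in, the sketch below simply keeps writing files into a target directory (for example a mount point of the file system under test); the path and file sizes are assumptions.

/* fswrite_loop.c - keep writing files into a target directory (sketch).
 * The original test used a Java client of the distributed file system;
 * this stand-in just hammers a path such as a mount point of that FS.
 * Usage: ./fswrite_loop /path/to/mounted/dfs
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    char path[4096];
    char block[64 * 1024];
    unsigned long i = 0;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <target-dir>\n", argv[0]);
        return 1;
    }
    memset(block, 'x', sizeof(block));

    for (;;) {                               /* endless write pressure */
        FILE *f;
        int j;

        snprintf(path, sizeof(path), "%s/load_%lu.dat", argv[1], i++);
        f = fopen(path, "w");
        if (!f)
            continue;
        for (j = 0; j < 16; j++)             /* ~1 MB per file */
            fwrite(block, 1, sizeof(block), f);
        fclose(f);
    }
    return 0;
}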
Since reproduction failed, the cause could only be analyzed from the kernel error messages. The analysis process itself is straightforward: find the code that reported the error and analyze the related code around it. The detailed analysis was written up in last year's documents.
Based on the code analysis and similar bugs reported on the Internet, the problem appeared to be an overflow in the calculation of CPU scheduling time, which causes the watchdog process to throw a panic and the kernel to hang. Based on that analysis I modified the kernel to construct the time-overflow condition, that is, I used a kernel module to modify the count value of the system time calculation. The modification succeeded, but unfortunately the kernel still did not hang, so reproducing the problem by directly modifying the kernel code failed as well.
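The module used for that experiment is not reproduced here. As a rough illustration of the approach (poking at the scheduler clock from a kernel module), the following read-only sketch only samples cpu_clock() and jiffies when loaded; it targets a 2.6.32-era (RHEL 6) kernel and does not perform the actual count modification.

/* timeprobe.c - read-only sketch of a kernel module that samples the
 * scheduler clock; NOT the module actually used to force the overflow,
 * which additionally rewrote the time count. Targets a 2.6.32-era kernel.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/jiffies.h>
#include <linux/sched.h>   /* cpu_clock() */
#include <linux/smp.h>     /* get_cpu()/put_cpu() */

static int __init timeprobe_init(void)
{
        int cpu = get_cpu();
        u64 now = cpu_clock(cpu);

        put_cpu();
        printk(KERN_INFO "timeprobe: cpu%d clock=%llu ns jiffies=%lu\n",
               cpu, (unsigned long long)now, jiffies);
        return 0;
}

static void __exit timeprobe_exit(void)
{
        printk(KERN_INFO "timeprobe: unloaded\n");
}

module_init(timeprobe_init);
module_exit(timeprobe_exit);
MODULE_LICENSE("GPL");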
Later I also consulted several people outside the company who are familiar with the kernel. They gave their own analyses based on the information we provided, but none of them had a good way to reproduce the problem or pin down the exact cause, and different people reached different conclusions.
So after two to three weeks of continuous tracking in the first phase, there was still no definite result.
3. Phase II
The new year began, and on the first working day I started tracking the problem again. At the beginning I made a simple plan: spend the hours from 5 to 8 every day analyzing and locating the kernel problem, and learn more about the kernel along the way.
This time I approached the problem from another angle. Last year I had analyzed the kernel error logs server by server, which led nowhere, so now I prepared to analyze the logs of all the affected servers together (fortunately the operations team had collected and saved the kernel logs of every affected server, otherwise I would have had nowhere to start) and look for what they had in common. The first thing they had in common was the warning "Delta way too big!..." printed by the trace subsystem. According to the relevant information, this warning by itself does not hang the Linux system, and indeed not all of our online servers became unreachable over ssh. Still, we found a similar bug on the Red Hat customer portal (https://access.redhat.com/knowledge/solutions/70051), which also provides a solution. The bug information and solution are as follows:
Why is the kernel throwing "Delta way too big" out with "WARNING: at kernel/trace/ring_buffer.c:1988 rb_reserve_next_event+0x2ce/0x370" messages?
Issue
The kernel is throwing "Delta way too big" out, together with a kernel oops, on the server.
Environment
• Red Hat Enterprise Linux 6 Service Pack 1
Resolution
The warning "Delta way too big" warning might appear on a system with unstable shed clock right after the system is resumed and tracingwas enabled during the suspend.
Since it's not realy bug, and the unstable sched clock is working fast and reliable otherwise, we suggested to keep using the sched clock in any case and just to make note in the warning itself. or disables tracing by # echo 0>/sys/kernel/debug/tracing/tracing_on
Root Cause
In this case ftrace was involved due to the call to ftrace_raw_event_power_end (debugfs was mounted and ftrace loaded); the problem has to do with calculating a time stamp for a ring buffer entry.
The message comes from the code below and appears to indicate a problem with time stability.
1966 static int
1967 rb_add_time_stamp(struct ring_buffer_per_cpu *cpu_buffer,
1968                   u64 *ts, u64 *delta)
1969 {
1970         struct ring_buffer_event *event;
1971         static int once;
1972         int ret;
1973
1974         if (unlikely(*delta > (1ULL << 59) && !once++)) {
1975                 int local_clock_stable = 1;
1976 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
1977                 local_clock_stable = sched_clock_stable;
1978 #endif
1979                 printk(KERN_WARNING "Delta way too big! %llu"
1980                        " ts=%llu write stamp = %llu\n%s",
1981                        (unsigned long long)*delta,
1982                        (unsigned long long)*ts,
1983                        (unsigned long long)cpu_buffer->write_stamp,
1984                        local_clock_stable ? "" :
1985                        "If you just came from a suspend/resume,\n"
1986                        "please switch to the trace global clock:\n"
1987                        "  echo global > /sys/kernel/debug/tracing/trace_clock\n");
1988                 WARN_ON(1);
This is called from rb_reserve_next_event() here:
2122         /*
2123          * Only the first commit can update the timestamp.
2124          * Yes there is a race here. If an interrupt comes in
2125          * just after the conditional and it traces too, then it
2126          * will also check the deltas. More than one timestamp may
2127          * also be made. But only the entry that did the actual
2128          * commit will be something other than zero.
2129          */
2130         if (likely(cpu_buffer->tail_page == cpu_buffer->commit_page &&
2131                    rb_page_write(cpu_buffer->tail_page) ==
2132                    rb_commit_index(cpu_buffer))) {
2133                 u64 diff;
2134
2135                 diff = ts - cpu_buffer->write_stamp;
2136
2137                 /* make sure this diff is calculated here */
2138                 barrier();
2139
2140                 /* Did the write stamp get updated already? */
2141                 if (unlikely(ts < cpu_buffer->write_stamp))
2142                         goto get_event;
2143
2144                 delta = diff;
2145                 if (unlikely(test_time_stamp(delta))) {
2146
2147                         commit = rb_add_time_stamp(cpu_buffer, &ts, &delta); <-- HERE
This has to do with time stamping for ring buffer entries.
The code and approach above are exactly the same as what I analyzed last year, but I still could not be sure of the cause; after all, I could not reproduce it, so I could only speculate.
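To make the failure mode in the excerpt concrete: the warning fires when *delta exceeds 2^59. The code does check for ts < write_stamp, but as the comment above notes there is a race, and with an unstable sched clock (or a racing update of write_stamp) the current timestamp can still end up effectively behind the recorded write stamp; in unsigned 64-bit arithmetic a "negative" difference wraps around to an enormous value that easily exceeds 2^59. A minimal userspace illustration with made-up values:

/* delta_wrap.c - shows how an unsigned time delta explodes when the
 * clock appears to go backwards (illustration with made-up values).
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t write_stamp = 1000000000ULL;    /* last recorded timestamp (ns) */
    uint64_t ts          =  999999000ULL;    /* "current" time, 1000 ns earlier */
    uint64_t delta       = ts - write_stamp; /* wraps around below zero */

    printf("delta = %llu\n", (unsigned long long)delta);
    printf("delta > 1ULL << 59 ? %s\n",
           delta > (1ULL << 59) ? "yes -> \"Delta way too big!\"" : "no");
    return 0;
}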
Later I also analyzed another kernel error: "BUG: soft lockup - CPU#N stuck for 4278190091s! [qmgr/master:<pid>]". I have lightly redacted the message: the N after CPU# is a specific CPU number that differs from server to server, and the process id in the brackets also differs, but the process is always either qmgr or master. I collected and compared these values across all of the affected servers.
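The reported duration itself supports the time-overflow theory: 4278190091 seconds is roughly 135 years, and in hexadecimal it is 0xFF00000B, just under 2^32, which looks far more like the artifact of a wrapped or miscalculated timestamp than a real stall. A quick sanity check of that arithmetic:

/* lockup_duration.c - sanity-check the implausible soft-lockup duration. */
#include <stdio.h>

int main(void)
{
    unsigned long long secs = 4278190091ULL;        /* value from the log */

    printf("hex            : 0x%llX\n", secs);      /* 0xFF00000B */
    printf("years (approx) : %.1f\n",
           (double)secs / (365.25 * 24 * 3600));    /* ~135.6 years */
    printf("2^32           : %llu\n", 1ULL << 32);  /* for comparison */
    return 0;
}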