Server-side programs crash from time to time. Fortunately, Linux can write a core file that preserves the state of the process at the moment of the crash. Sometimes the current call stack and the variables in its frames are enough to find the cause, but sometimes the call stack alone leaves you with nothing to go on. This article walks through how a combination of several GDB commands revealed the cause of one such crash.
Let's go to the crash site and work toward the cause step by step.
First, as usual, load the core file into GDB with gdb wbxgs_crash core.5797 and look at the crash site.
# gdb wbxgs_crash core.5797
GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4RH)
......
#0 0x00000038e8d70540 in strlen () from /lib64/tls/libc.so.6
(gdb) bt
#0 0x00000038e8d70540 in strlen () from /lib64/tls/libc.so.6
#1 0x000000000057cfc0 in T120_trace::text_formator::advance (this=0x7e800a70, lpsz=0x1 <Address 0x1 out of bounds>)
at ./t120trace.cpp:1464
#2 0x000000000057ceb1 in t120_trace::text_formator::operator<< (this=0x7e800a70, lpsz=0x1 <Address 0x1 out of bounds>)
at ./t120trace.cpp:1411
#3 0x0000000000407927 in ~func_tracer (this=0x7e804bd0) at ../h/t120trace.h:381
#4 0x00000000004432fd in Cgssocketserver::readheader (this=0x8e4130, socketfd=1088,
buf=0x7e806cc0 "GET /detectservice?cmd=selfcheck HTTP/1.1\r\nconnection:close\r\nhost:10.224.122.94\r\n\r\n", bufsize=1024)
at mgr/gssocketserver.cpp:337
#5 0x0000000000443981 in Cgssocketserver::handle (this=0x8e4130, socketfd=1088, ...) at mgr/gssocketserver.cpp:424
#6 0x0000000000442f5e in Cgssocketserver::readthread (PARG=0x9ae9c0) at mgr/gssocketserver.cpp:304
#7 0x00000038e980610a in start_thread () from /lib64/tls/libpthread.so.0
#8 0x00000038e8dc68b3 in clone () from /lib64/tls/libc.so.6
#9 0x0000000000000000 in ?? ()
The call stack shows that the program crashed while writing a log entry. We had seen a similar crash before, and that one was caused by an infinite loop, but a code review found no infinite loop this time. So the current call stack is of no help in finding the cause. How can we dig further? Could an error in another thread have caused the crash on this one? To find the underlying cause, we used some of GDB's thread-related commands to check whether any other thread was in trouble. We started with info threads to see what every thread was doing at the time of the crash.
(gdb) info threads
  21 process 5797 0x00000038e8d7186d in memset () from /lib64/tls/libc.so.6
  20 process 5839 0x00000038e8dc6c8c in epoll_wait () from /lib64/tls/libc.so.6
  19 process 5842 0x00000038e8d8f7d5 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
  18 process 5845 0x00000038e8d8f7d5 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
  17 process 5846 0x00000038e980a66f in sem_wait () from /lib64/tls/libpthread.so.0
  16 process 5847 0x00000038e980a66f in sem_wait () from /lib64/tls/libpthread.so.0
  15 process 5848 0x00000038e980a66f in sem_wait () from /lib64/tls/libpthread.so.0
  14 process 5849 0x00000038e980a66f in sem_wait () from /lib64/tls/libpthread.so.0
  13 process 5850 0x00000038e980a66f in sem_wait () from /lib64/tls/libpthread.so.0
  12 process 5852 0x00000038e8dbf946 in __select_nocancel () from /lib64/tls/libc.so.6
  11 process 5854 0x00000038e980a66f in sem_wait () from /lib64/tls/libpthread.so.0
  10 process 5856 0x00000038e980a66f in sem_wait () from /lib64/tls/libpthread.so.0
  9 process 5857 0x00000038e980a66f in sem_wait () from /lib64/tls/libpthread.so.0
  8 process 5858 0x00000038e980a66f in sem_wait () from /lib64/tls/libpthread.so.0
  7 process 5859 0x00000038e8d8f7d5 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
  6 process 5861 0x00000038e980a66f in sem_wait () from /lib64/tls/libpthread.so.0
  5 process 5862 0x00000038e980a66f in sem_wait () from /lib64/tls/libpthread.so.0
  4 process 5863 0x00000038e8d8f7d5 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
  3 process 5864 0x00000038e8d8f7d5 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
  2 process 5883 0x00000038e8d8f7d5 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
* 1 process 5853 0x00000038e8d70540 in strlen () from /lib64/tls/libc.so.6
It is normal for a thread to be parked in a sleep or wait call, but thread 21 stands out: it stopped inside memset. Whether that is really a problem needs a closer look.
So we switch to thread 21 with the command thread 21 and look at its call stack.
(gdb) thread 21
[Switching to thread 21 (process 5797)] #0 0x00000038e8d7186d in memset () from /lib64/tls/libc.so.6
(gdb) bt
#0 0x00000038e8d7186d in memset () from /lib64/tls/libc.so.6
#1 0x000000000049da0d in Cgspdufactory::streamstringfrom (..., ...) at common/pdu/gspdu.cpp:422
#2 0x00000000004d1f25 in Cgsothsharduserrsppdu::streamfrom (this=0x2aaaec951650, ...) at common/pdu/pdugs.cpp:2707
#3 0x000000000049cb2d in Cgspdufactory::derivepdu (..., ulpdulen=30506) at common/pdu/gspdu.cpp:79
#4 0x000000000049c78e in Cgspdufactory::streampdufrom (PDATAPACKET=0x2aaaeca31d70) at common/pdu/gspdu.cpp:35
#5 0x0000000000449681 in Cgswdmsmanager::on_wdms_message_indication (this=0x8e3680, msg=0x2aaae9894360)
at mgr/gswdmsmanager.cpp:344
......
#18 0x0000000000407733 in main (argc=1, argv=0x7fff9b44ac98) at gsmain.cpp:118
(gdb) f 3
#3 0x000000000049cb2d in Cgspdufactory::derivepdu (..., ulpdulen=30506) at common/pdu/gspdu.cpp:79
common/pdu/gspdu.cpp: No such file or directory.
in common/pdu/gspdu.cpp
Use the command i locals (short for info locals) to print the values of all local variables in this frame.
(gdb) i locals
ppdu = (CBASEPDU *) 0x2aaaec951650
ppduheader = (Cpduheader *) 0x2aaaea1c4190
ulpdutype = 50
So far nothing looks obviously wrong, so we print the PDU header:
(gdb) p *ppduheader
$... = {m_ulheadlen = ..., m_ulversion = 2080000, m_ulpdutype = ..., m_ulsrcsvrtype = webex_connect_gs, m_strsrcsvraddr = {
static npos = 18446744073709551615,
_M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
_M_p = 0x2aaaeca52a68 "10.224.95.109:9900"}, m_strsubject = {static npos = 18446744073709551615,
_M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
_M_p = 0x2aaaec929b28 "Qawin.qazone.GS"}, m_ulsequence = 0}
The m_strsrcsvraddr field shows that this PDU was sent by the server at 10.224.95.109.
At the time, every server in the QA test environment had an IP address starting with 10.224.122, so a PDU from this address should not have been there. We asked QA and learned that 10.224.95.109 is a server in another datacenter, and that it was running an old version. The version under test had removed two PDUs and added four, so when the old server sent one of its PDUs, the new version parsed it as a different, new PDU. The parse went wrong and produced a wrong length. Switching to frame 1 with f 1 lets us look at the local variables there.
(gdb) f 1
#1 0x000000000049da0d in Cgspdufactory::streamstringfrom (..., ...) at common/pdu/gspdu.cpp:422
422 in common/pdu/gspdu.cpp
(gdb) i locals
strtmp = 0x2aaaf1c00010 ""
iRet = 0
ullen = 1179995975
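One extra check that can be tried on a suspicious length like the one printed in frame 1 (1179995975) is to look at its individual bytes. The sketch below is only an illustration: bytes_of and looks_like_text are hypothetical helper names, not part of the server code, and the wire format's byte order is not known from the core file, so the check is written to hold in either order.

```cpp
#include <array>
#include <cstdint>

// Split a 32-bit value into its four bytes, most significant first.
std::array<unsigned char, 4> bytes_of(std::uint32_t value) {
    return {{static_cast<unsigned char>(value >> 24),
             static_cast<unsigned char>(value >> 16),
             static_cast<unsigned char>(value >> 8),
             static_cast<unsigned char>(value)}};
}

// True when every byte is a printable ASCII character (0x20..0x7E).
// The result does not depend on byte order, because all four bytes
// must pass the same test.
bool looks_like_text(std::uint32_t value) {
    for (unsigned char b : bytes_of(value)) {
        if (b < 0x20 || b > 0x7E) return false;
    }
    return true;
}
```

For 1179995975 (0x46554F47) the four bytes are 0x46, 0x55, 0x4F and 0x47, all ASCII capital letters, so looks_like_text returns true. That is consistent with a parser reading string data where it expected a length field, although it does not prove it.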
The parsed length is a huge value, 1179995975, far larger than the PDU itself (ulpdulen=30506 in frame 3). Thread 21 stopped right after allocating a buffer of that size, inside the memset call that clears it. The log shows the same thing: thread 21 stalled at this point and made no further progress.
Two servers crashed at the time. The core file from the other server showed the same call stack as this one. After QA upgraded the version running on 10.224.95.109, the crash no longer occurred.
This example shows that when a server crashes and the current call stack is of no use, examining the call stacks of all threads may still turn up clues that lead to the cause of the crash.
The lesson from this problem: when changing an interface between servers, compatibility with the old version must be considered. Even a PDU that may never be used again should be kept, because in production the new version is deployed to GSB first and to primary afterwards, so two versions run side by side for a while. Removing PDUs or changing their order can leave the entire system unable to function.
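The effect of removing or reordering PDUs can be illustrated with a small sketch. Everything in it is hypothetical: the PDU names are invented, and it assumes that type IDs follow declaration order, which the core file does not confirm. It contrasts a fragile renumbering with one that keeps old IDs stable.

```cpp
#include <cstdint>

// Old release: IDs follow declaration order (hypothetical PDU names).
namespace v1 {
enum PduType : std::uint32_t { LOGIN, STATUS_REPORT, SHARD_USER_RSP, LOGOUT };
}

// New release, fragile: one PDU was deleted and the rest reordered, so the
// ID that v1 sends for SHARD_USER_RSP now means LOGOUT on the receiver.
namespace v2_fragile {
enum PduType : std::uint32_t { LOGIN, HEARTBEAT, LOGOUT, SHARD_USER_RSP };
}

// New release, compatible: explicit values, retired IDs stay reserved, and
// new PDUs only take values that were never used before.
namespace v2_stable {
enum PduType : std::uint32_t {
    LOGIN = 0,
    STATUS_REPORT_RETIRED = 1,  // no longer sent, but the ID is never reused
    SHARD_USER_RSP = 2,
    LOGOUT = 3,
    HEARTBEAT = 4
};
}

// Compile-time checks: the fragile layout reinterprets an old ID,
// the stable layout does not.
static_assert(static_cast<std::uint32_t>(v1::SHARD_USER_RSP) ==
              static_cast<std::uint32_t>(v2_fragile::LOGOUT),
              "fragile layout: old SHARD_USER_RSP is decoded as LOGOUT");
static_assert(static_cast<std::uint32_t>(v1::SHARD_USER_RSP) ==
              static_cast<std::uint32_t>(v2_stable::SHARD_USER_RSP),
              "stable layout: old and new IDs agree");
```

Checks like these static_assert lines can live next to the enum so that an accidental renumbering fails at build time rather than in a mixed-version deployment.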
I hope this article is a useful reference for solving crash problems and for avoiding similar crashes.
"Linux" gdb Debug Core file