IBM developer articles are really good, no question about it.

Debugging multi-process programs with GDB

Tian Qiang (tianq@cn.ibm.com), Software Engineer, IBM China Software Development Center

Introduction: GDB is a common debugging tool on Linux. This article describes several methods for debugging multi-process programs with GDB and compares them.

Release date: July 30, 2007
Level: Intermediate

GDB is a powerful and widely used C/C++ debugger on Linux. How can it be used to debug more complex systems, such as multi-process systems? Consider the following three-process system:
[Figure: Process]
proc2 is a child process of proc1, and proc3 is a child process of proc2. How can GDB be used to debug proc2 or proc3?

In fact, GDB does not directly support debugging multi-process programs. For example, if you use GDB to debug a process that forks a child, GDB keeps debugging the parent, and the child runs without interference. If you set a breakpoint in the child's code beforehand, the child receives a SIGTRAP signal and terminates. So how can the child process be debugged? We can use features of GDB itself, or other auxiliary means, to reach this goal. In addition, on newer kernels GDB has added some support for multi-process debugging.

The following sections describe several methods in detail: the follow-fork-mode method, the attach-to-child method, and the GDB wrapper method.

follow-fork-mode
On Linux kernel 2.5.60 and later, GDB provides the follow-fork-mode option for programs that create child processes with fork/vfork, in order to support multi-process debugging. follow-fork-mode is used as follows:

set follow-fork-mode [parent|child]
- parent: GDB continues to debug the parent process after the fork. The child process runs unaffected.
- child: GDB debugs the child process after the fork. The parent process runs unaffected.
Therefore, if you need to debug the child process, run the following after starting GDB:
(gdb) set follow-fork-mode child
Then set breakpoints in the child process code.

In addition, the detach-on-fork parameter tells GDB whether to detach from one of the processes after a fork or to keep both under its control:

set detach-on-fork [on|off]
- on: GDB detaches from the process that follow-fork-mode does not select, and that process runs independently.
- off: GDB keeps control of both the parent and the child. The process selected by follow-fork-mode is debugged, and the other process is held in a suspended state.
Note that it is best to use GDB 6.6 or later. With GDB 6.4, only follow-fork-mode is available.

follow-fork-mode/detach-on-fork is relatively simple to use, but because of its kernel and GDB version requirements it can only be used on systems that meet them. In addition, debugging with follow-fork-mode must start from the parent process. For systems that fork several times and so have grandchild or great-grandchild processes, such as the three-process system above, this is inconvenient.

Attach to the child process

As is well known, GDB can attach to a running process with the attach <pid> command. We can therefore use this command to attach to the child process and debug it. For example, to debug the process RIM_Oracle_Agent.9i, first obtain its PID:
[root@tivf09 tianq]# ps -ef|grep RIM_Oracle_Agent.9i
nobody 6722 6721 0 05:57 ? 00:00:00 RIM_Oracle_Agent.9i
root 7541 27816 0 06:10 pts/3 00:00:00 grep -i rim_oracle_agent.9i
pstree shows that this is a three-process system: oserv is the parent of RIM_Oracle_prog, and RIM_Oracle_prog is the parent of RIM_Oracle_Agent.9i.
[root@tivf09 root]# pstree -H 6722
[Figure: viewing the processes with pstree]
Start GDB and attach to the process:
[Figure: using GDB to attach to a process]
Now you can debug it. A new problem is that the child process has been running all along, so at attach time there is no telling how far it has got. Is there a solution?

One way is to add a special piece of code at the start of the child process to be debugged, for example at the beginning of main, that makes the child sleep in a loop while some condition holds. After attaching to the process, set a breakpoint after this piece of code, then clear the condition so that execution continues.

The condition used is up to you. For example, you can check the value of a specified environment variable, or check whether a specific file exists. Taking the file as an example, the code can take the following form:
#include <unistd.h>

/* Block while tag_file exists, leaving time to attach GDB. */
void debug_wait(const char *tag_file)
{
    while (access(tag_file, F_OK) == 0)
        sleep(1);   /* file still exists: keep waiting */
}
After attaching to the process, set a breakpoint after this piece of code and then delete the file. Of course, other conditions or forms work too, as long as the condition can be set and detected.

The attach method is convenient and copes with all kinds of complex process systems, such as grandchild/great-grandchild processes or daemon processes. The only requirement is adding a small piece of code.

GDB wrapper

Very often the parent process forks a child, and the child then calls one of the exec family of functions to run new code. In this case we can also use the GDB wrapper method, whose advantage is that no extra code needs to be added.

The basic idea is to have exec run GDB together with the code to be executed as a single unit, by making GDB invoke that code, so that the code is under GDB's control from the very start. We can then debug the child process code naturally.

In the example above, after RIM_Oracle_prog forks a child process, the child goes on to execute the binary file RIM_Oracle_Agent.9i. Rename that file to RIM_Oracle_Agent.9i.binary and create a new shell script named RIM_Oracle_Agent.9i with the following content:
[root@tivf09 bin]# mv RIM_Oracle_Agent.9i RIM_Oracle_Agent.9i.binary
[root@tivf09 bin]# cat RIM_Oracle_Agent.9i
#!/bin/sh
gdb RIM_Oracle_Agent.binary
When the forked child executes the file named RIM_Oracle_Agent.9i, GDB starts first, so the code to be debugged is under GDB's control.

A new problem arises: the child process is under GDB's control, but we still cannot debug it, because there is no way to interact with GDB. GDB has to be started so that it can be driven from some window or terminal. xterm can provide that window. xterm is the terminal emulator of the X Window System. For example, typing the xterm command in GNOME on Linux:
[Figure: xterm]
A terminal window pops up:
[Figure: Terminal]
If you are debugging on a remote Linux server, you can connect to the server from your local machine with VNC (Virtual Network Computing) Viewer and use xterm there. Before that, install VNC Viewer on the local machine, and install and start the VNC server on the server. Most Linux distributions come with the vnc-server package pre-installed, so the vncserver command can be run directly. Note: the first time you run vncserver, you are prompted for a password, which VNC Viewer then uses when connecting from the client. The password can be changed later with the vncpasswd command on the VNC server.
[root@tivf09 root]# vncserver
New 'tivf09:1 (root)' desktop is tivf09:1
Starting applications specified in /root/.vnc/xstartup
Log file is /root/.vnc/tivf09:1.log
[root@tivf09 root]#
[root@tivf09 root]# ps -ef|grep -i vnc
root 19609 1 0 Jun05 ? 00:08:46 Xvnc :1 -desktop tivf09:1 (root) -httpd /usr/share/vnc/classes -auth /root/.Xauthority -geometry 1024x768 -depth 16 -rfbwait 30000 -rfbauth /root/.vnc/passwd -rfbport 5901 -pn
root 19627 1 0 Jun05 ? 00:00:00 vncconfig -iconic
root 12714 10599 0 01:23 pts/0 00:00:00 grep -i vnc
[root@tivf09 root]#
vncserver is a Perl script that starts Xvnc (the X VNC server). X client applications such as xterm, as well as VNC Viewer, all communicate with it. As shown above, the display value to use is tivf09:1. Now you can connect from the local machine with VNC Viewer:
[Figure: VNC Viewer: enter the server]
Enter the password:
[Figure: VNC Viewer: enter the password]
After a successful login, the interface looks the same as the local desktop on the server:
[Figure: VNC Viewer]
Next, modify the RIM_Oracle_Agent.9i script so that it looks like this:
#!/bin/sh
export DISPLAY=tivf09:1.0; xterm -e gdb RIM_Oracle_Agent.binary
If arguments are also passed to your program at exec time, change the script as follows:
#!/bin/sh
export DISPLAY=tivf09:1.0; xterm -e gdb --args RIM_Oracle_Agent.binary "$@"
Add execute permission:
[root@tivf09 bin]# chmod 755 RIM_Oracle_Agent.9i
Now you can debug it. Run the program that starts the child process:
[root@tivf09 root]# wrimtest -l 9i_linux
Resource Type : RIM
Resource Label : 9i_linux
Host Name : tivf09
User Name : mdstatus
Vendor : Oracle
Database : rim
Database Home : /data/oracle9i/920
Server ID : rim
Instance Home :
Instance Name :
Opening Regular Session...
The program pauses. In VNC Viewer we can see a new GDB xterm window open on the server:
[Figure: GDB xterm window]
[root@tivf09 root]# ps -ef|grep gdb
nobody 24312 24311 0 04:30 ? 00:00:00 xterm -e gdb RIM_Oracle_Agent.binary
nobody 24314 24312 0 04:30 pts/2 00:00:00 gdb RIM_Oracle_Agent.binary
root 24326 10599 0 04:30 pts/0 00:00:00 grep gdb
What is running is exactly the program to be debugged. Set breakpoints and start debugging!

Note: the following error is generally a permissions problem. Use the xhost command to change the permissions:
[Figure: xterm error]
[root@tivf09 bin]# export DISPLAY=tivf09:1.0
[root@tivf09 bin]# xhost +
access control disabled, clients can connect from any host
xhost + disables access control, so clients can connect from any machine. For security reasons, you can instead use xhost +<your machine name>.

Summary

The three methods each have their own characteristics and strengths, and suit different situations and environments:
- follow-fork-mode method: easy to use, but restricted by kernel and GDB version; suitable for simple multi-process systems.
- Attach method: flexible and powerful, but requires adding extra code; suitable for all kinds of complex situations, especially daemon processes.
- GDB wrapper method: intended for the fork + exec pattern; no extra code is needed, but xterm/VNC is required.
References
- GDB official reference: http://sourceware.org/gdb/documentation/
- More VNC information: http://www.realvnc.com/
About the author

Tian Qiang is a software engineer in the Tivoli group of the China Software Development Center. He is responsible for the maintenance and customer support of the IBM product TMF (Tivoli Management Framework) and loves Linux.

From: http://hi.baidu.com/thinke365/blog/item/c9469f250b9aeb398644f948.html

Breakpoint 2 at 0x804b6f3: file collect.c, line 1172.
(gdb) n
[New process 28538]
[Switching to process 28538]
1174 if (!child)

GDB has now followed the fork and switched to the child process. For this to happen, GDB must be configured first:

(gdb) set follow-fork-mode
Requires an argument. Valid arguments are child, parent.
(gdb) set follow-fork-mode
child parent
(gdb) set follow-fork-mode child

The child process reaches line 1174, where it goes on to connect to the FTP server:

1174 if (!child)
(gdb) l
1169 fgets(line, Max, hostlistres);
1170 if (feof(hostlistres))
1171 break;
1172 while ((child = fork()) == -1)
1173 sleep(1);
1174 if (!child)
1175 {
1176 // strcpy(machine, line);
1177 scanlines(line);
1178 probe(lzodir, typeoffetch);
(gdb) bt
#0 main (n=1, p=0xbf8072d4) at collect.c:1174
(gdb) n
1177 scanlines(line);

Continue into the function that prepares the FTP connection:

(gdb) br scanline
Breakpoint 3 at 0x804b1e5: file collect.c, line 1062.
(gdb) n

Breakpoint 3, scanline (line=0x804ca80 "ftp://10.0.0.1\n") at collect.c:1062
1062 site.ftp_name[0] = '\0';

View the contents of a struct:

(gdb) p site
$1 = {ftp_user = '\0' <repeats 127 times>, ftp_pass = '\0' <repeats 127 times>, ftp_name = '\0' <repeats 1023 times>,
ftp_port = "\000\000\000\000\000\000\000\000\000"}

View the value of line, which was read from the file:

(gdb) n
1068 if ((scan = strchr(line, LF)) != NULL)
(gdb) p line
$4 = 0x804ca80 "ftp://10.0.0.1\n"

scan = line + 6; // strip the ftp:// prefix

Copy the fields; an anonymous login is used here:

1115 strcpy(site.ftp_user, anonymous);
1116 strcpy(site.ftp_pass, anonypass);

After scanline returns, is this where the FTP connection is made?

probe(lzodir, typeoffetch);

Enter the probe function:

(gdb) n

Breakpoint 4, probe (outputdir=0xbf806e2b "/var/Parker/lzodata", typeoffetch=32 ' ') at collect.c:961
961 FILE *res = NULL;
(gdb) l
956 /* else call dolslr or dolookup */
957 int
958 probe(char *outputdir, char typeoffetch)
959 {
960 int rc; /* return code */
961 FILE *res = NULL;
962 char tempfile[Max];
963 char resfile[Max];
964 char command[Max];
965 char topdir[Max]; /* the FTP top dir */

From this statement on, the program connects to the FTP server:

973 if (connect(machine) != S_OK)
974 {
975 Deb("couldn't connect");
976 return (s_error);
977 }
220-
220-
220
331 Please specify the password.
230 Login successful.
--- Logged in
978 strcpy(topdir, "/");

After the FTP connection succeeds, the following code runs:

(gdb) l
973 if (connect(machine) != S_OK)
974 {
975 Deb("couldn't connect");
976 return (s_error);
977 }
978 strcpy(topdir, "/");
979 if (typeoffetch == 'D')
980 {
981 rc = dodownload(resfile, topdir);
982 }
(gdb) p topdir
$15 = '\0' <repeats 1023 times>

The directory change succeeds:

994 if (rc = dolistlr(resfile)) != S_OK
(gdb) l
989 res = fopen(resfile, "w");
990 rc = dorecursive(topdir, res);
991 fclose(res);
992 }
993 else
994 if (rc = dolistlr(resfile)) != S_OK
995 && (rc = dodownload(resfile, topdir)) != S_OK))
996 {
997 Deb("use recursive look up");
998 res = fopen(resfile, "w");
(gdb) n
250 Directory successfully changed.
200 PORT command successful. Consider using PASV.
150 Here comes the directory listing.
--- Done with fetching the directory
226 Directory send OK.
--- Done with list-LR
1003 if (rc != S_OK)
1024

Does an error occur while this command is executed?

1024 sprintf(command, "%s/%s-s", parker_home, bindir,
(gdb)
1027 system(command);
(gdb)
[New process 28551]
Creat output file error!

Program exited with code 03.
(gdb)
--- done with probe

An error occurred while executing the system command:

(gdb) n
1027 system(command);
(gdb) p command
$18 = "/Var/Parker/bin/lzo_comp/var/Parker/tmp/10.0.0.1.fmt/var/Parker/lzodata/10.0.0.1-s", '\0' <repeats 932 times>

Is the error caused by a missing directory? How exactly is this command string assembled?

Stepping into connect lands in the system library rather than in user code?

973 if (connect(machine) != S_OK)
(gdb)

Breakpoint 4, 0x00cd3ad0 in connect () from /lib/libc.so.6
(gdb) l
968 struct filestatist files;
969 gethostipstr(machine, hostip);
970 sprintf(resfile, "%s/%s.org", workdir, hostip);
971 Deb("probe");
972 alarm(timeout);
973 if (connect(machine) != S_OK)
974 {
975 Deb("couldn't connect");
976 return (s_error);
977 }
(gdb)

Debug the connect code:

Single stepping until exit from function connect,
which has no line number information.
waitcon (m=0x804d3c0 "10.0.0.1") at collect.c:256
256 if (a = getsockname(h, (struct sockaddr *)&own_addr, &own_addr_len)))
(gdb) l
251 if (connect(h, (void *)&SA, sizeof(SA) < 0)
252 {
253 perror("Connect");
254 return (s_error);
255 }
256 if (a = getsockname(h, (struct sockaddr *)&own_addr, &own_addr_len)))
257 {
258 perror("getsockname");
259 printf("error here! Return %i\n", a);
260 return (s_error);

Display the code of a function:

(gdb) l probe
954 /* typeoffetch = r call dorec */
955 /* typeoffetch = l call dolookup */
956 /* else call dolslr or dolookup */
957 int
958 probe(char *outputdir, char typeoffetch)
959 {
960 int rc; /* return code */
961 FILE *res = NULL;
962 char tempfile[Max];
963 char resfile[Max];

Display the code around a given line:

(gdb) l 1 111
106 printf("%s", line);
107
108 #endif /**/
109 if (line[3] == '-')
110 continue;
111 if (!strncmp(line, PWD, strlen(PWD)))
112 {
113 PTR = line;
114 while (PTR++)
115 if (*PTR == '\"')