Debugging Linux applications on zSeries is similar to debugging Linux applications on other architectures. The biggest challenge for experienced Linux developers is understanding the new system architecture. For mainframe developers who are new to Linux, mastering new debugging tools can seem a daunting task. Don't be afraid. This article provides some useful tips to help you get started.
The best way to learn is by practice, but with debugging tools the "practice" does not happen until a problem forces you to use them. With this in mind, a "quick start" guide follows.
User Debug Logging
The first step in debugging a crashed program is to find out what went wrong. The Linux kernel on zSeries has a built-in feature that logs some basic debugging information when a user process crashes. To enable this feature, run the following command as the root user:
    echo 1 > /proc/sys/kernel/userprocess_debug
When a process crashes, additional information is written to the log file (/var/log/messages), including the reason the program terminated, the failing address, and a brief register dump containing the program status word (PSW), the general registers, and the access registers.
    Mar 31 11:34:28 L02 kernel: User process fault: interruption code 0x10
    Mar 31 11:34:28 L02 kernel: failing address: 0
    Mar 31 11:34:28 L02 kernel: CPU: 1
    Mar 31 11:34:28 L02 kernel: process simple (PID: 30122, stackpage = 05889000)
    Mar 31 11:34:28 L02 kernel:
    Mar 31 11:34:28 L02 kernel: User psw: 070dc000 c00ab738
    Mar 31 11:34:28 L02 kernel: task: 05888000 KSP: 05889f08 pt_regs: 05889f68
    Mar 31 11:34:28 L02 kernel: User GPRS:
    Mar 31 11:34:28 L02 kernel: 00000000 004019a0 004019a0 00000000
    Mar 31 11:34:28 L02 kernel: 00000003 c00ab732 004008f8 00400338
    Mar 31 11:34:28 L02 kernel: 40018ffc 0040061c 40018e34 7ffff800
    Mar 31 11:34:28 L02 kernel: 00400434 80400624 8040066e 7ffff800
    Mar 31 11:34:28 L02 kernel: User ACRs:
    Mar 31 11:34:28 L02 kernel: 00000000 00000000 00000000 00000000
    Mar 31 11:34:28 L02 kernel: 00000001 00000000 00000000 00000000
    Mar 31 11:34:28 L02 kernel: 00000000 00000000 00000000 00000000
    Mar 31 11:34:28 L02 kernel: 00000000 00000000 00000000 00000000
    Mar 31 11:34:28 L02 kernel: User code:
    Mar 31 11:34:28 L02 kernel: 44 40 50 00 07 fe a7 4a 00 01 18 54 18 43 18 35 a8 24 00 00
Figure 1
Figure 1 shows that the program (named "simple") terminated with program interruption code 0x10 (the Principles of Operation manual identifies this as a segment-translation exception) and that the failing address was 0. There is no doubt that a null pointer was used somewhere. Now that we know what happened, we need to find out where it happened.
Basic diagnostics
The information provided by the user debug log entries can be used to determine the program crash location. Some available tools can help solve various program termination problems you may encounter. We will gradually introduce those tools in this article.
First, let's examine the user PSW in the log entry. The PSW contains the instruction address, the condition code, and other information about the state of the machine. For now, we only care about the instruction address (bits 33 through 63). For simplicity, let's assume that the user PSW is 070dc000 80400618. Remember, we are looking at an ESA/390 (31-bit addressing) PSW. Bit 32 is not part of the instruction address. It indicates 31-bit addressing mode, but it must be accounted for when examining the PSW value. To obtain the actual instruction pointer, subtract 0x80000000 from the second word of the PSW. The result is the instruction address 0x400618. To locate the code, you need some information from the executable file. First, use readelf to print the program header information.
    Elf file type is EXEC (Executable file)
    Entry point 0x400474
    There are 6 program headers, starting at offset 52
    Program Headers:
      Type     Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
      PHDR     0x000034 0x00400034 0x00400034 0x000c0 0x000c0 R E 0x4
      INTERP   0x0000f4 0x004000f4 0x004000f4 0x0000d 0x0000d R   0x1
          [Requesting program interpreter: /lib/ld.so.1]
      LOAD     0x000000 0x00400000 0x00400000 0x00990 0x00990 R E 0x1000
      LOAD     0x000990 0x00401990 0x00401990 0x000fc 0x00114 RW  0x1000
      DYNAMIC  0x0009ac 0x004019ac 0x004019ac 0x000a0 0x000a0 RW  0x4
      NOTE     0x000104 0x00400104 0x00400104 0x00020 0x00020 R   0x4
    Section to Segment mapping:
      Segment Sections...
       00
       01     .interp
       02     .interp .note.ABI-tag .hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.got .rela.plt .init .plt .text .fini .rodata
       03     .data .eh_frame .dynamic .ctors .dtors .got .bss
       04     .dynamic
       05     .note.ABI-tag
Figure 2
Figure 2 shows the output of readelf -l simple (remember, "simple" is the name of our test program). In the Program Headers section, the first LOAD row provides information about where the program is loaded. In the Flg column, this segment is marked R (read) E (executable). VirtAddr is the address where the program begins loading. MemSiz is the length of the code loaded into this segment. Adding MemSiz to VirtAddr gives the basic address range of this program: 0x400000-0x400990. The instruction address of the crash, 0x400618, is within the load range of the program. Now we know that the problem occurred directly in our code.
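The arithmetic in the two steps above (clearing the addressing-mode bit of the PSW, then checking the address against the LOAD segment) can be sketched in a few lines of Python. This is only an illustration under the article's assumptions: the function names instruction_address and in_segment are hypothetical, and the values are the simplified PSW and the first LOAD row of Figure 2.

```python
# Mask that keeps the 31-bit instruction address of an ESA/390 PSW.
ADDRESS_MASK_31BIT = 0x7FFFFFFF


def instruction_address(psw: str) -> int:
    """Return the instruction address from a PSW printed as two hex words."""
    second_word = int(psw.split()[1], 16)
    # Clearing the addressing-mode bit is the same as subtracting
    # 0x80000000 when that bit is set.
    return second_word & ADDRESS_MASK_31BIT


def in_segment(address: int, virt_addr: int, mem_siz: int) -> bool:
    """True if address lies in the range [virt_addr, virt_addr + mem_siz)."""
    return virt_addr <= address < virt_addr + mem_siz


# Simplified PSW from the text and the first LOAD row of Figure 2.
addr = instruction_address("070dc000 80400618")
print(hex(addr))                              # 0x400618
print(in_segment(addr, 0x00400000, 0x00990))  # True
```

Masking rather than subtracting also gives the right answer if the high-order bit happens to be clear.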
If the executable file contains debugging symbols, you can determine which line of code caused the problem. Run the addr2line program with the executable file and the address as follows:
    addr2line -e simple 0x400618
This returns:
    /home/devuser/simple.c:34
To investigate the problem, examine line 34.
For the original program crash in Figure 1, the PSW is 070dc000 c00ab738. To obtain the instruction address, subtract 0x80000000. The result is 0x400ab738. This address is clearly not within our small program. So what is it? It is code from a shared library. If you run the ldd command (ldd simple) on the executable file, it returns the list of shared objects the program needs in order to run, together with the address at which each library will be available.
    libc.so.6 => /lib/libc.so.6 (0x40021000)
    /lib/ld.so.1 => /lib/ld.so.1 (0x40000000)
The instruction address falls in the region where libc.so.6 is loaded. In our simple test case, only two shared objects are needed. Other applications may need many more, which makes the ldd output more complex. We will use Perl as an example. Enter:
    ldd /usr/bin/perl
You will get:
    libnsl.so.1 => /lib/libnsl.so.1 (0x40021000)
    libdl.so.2 => /lib/libdl.so.2 (0x40039000)
    libm.so.6 => /lib/libm.so.6 (0x4003d000)
    libc.so.6 => /lib/libc.so.6 (0x40064000)
    libcrypt.so.1 => /lib/libcrypt.so.1 (0x4018f000)
    /lib/ld.so.1 => /lib/ld.so.1 (0x40000000)
Everything we need is there, but I find the following form quicker to read for this procedure:
    ldd /usr/bin/perl | awk '{print $4 " " $3}' | sort
    (0x40000000) /lib/ld.so.1
    (0x40021000) /lib/libnsl.so.1
    (0x40039000) /lib/libdl.so.2
    (0x4003d000) /lib/libm.so.6
    (0x40064000) /lib/libc.so.6
    (0x4018f000) /lib/libcrypt.so.1
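Matching an instruction address against the sorted load addresses can also be sketched in Python. The helper names parse_ldd and library_for are hypothetical, and the sample text is the ldd output shown earlier for the "simple" program. Because ldd reports only where each library starts, not how large it is, the result is a best guess: the library with the highest load address that does not exceed the instruction address.

```python
import re

# ldd output for the "simple" test program, as shown in the article.
LDD_OUTPUT = """\
libc.so.6 => /lib/libc.so.6 (0x40021000)
/lib/ld.so.1 => /lib/ld.so.1 (0x40000000)
"""


def parse_ldd(text):
    """Return (load_address, path) pairs sorted by load address."""
    entries = []
    for line in text.splitlines():
        match = re.search(r"=>\s*(\S+)\s+\((0x[0-9a-fA-F]+)\)", line)
        if match:
            entries.append((int(match.group(2), 16), match.group(1)))
    return sorted(entries)


def library_for(address, entries):
    """Pick the entry with the highest load address not above address."""
    candidates = [entry for entry in entries if entry[0] <= address]
    return max(candidates) if candidates else None


base, path = library_for(0x400ab738, parse_ldd(LDD_OUTPUT))
print(hex(base), path)  # 0x40021000 /lib/libc.so.6
```

An address below every load address, such as 0x400618 inside the program itself, yields None.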
Now let's determine where in libc the crash occurred. Given that libc.so.6 is loaded at 0x40021000, subtract that value from the instruction address 0x400ab738. The result is 0x8a738. This is the offset into libc.so.6. Run the nm command to dump the symbols from libc.so.6, and then try to determine which function contains the address. For libc.so.6, nm produces more than 7,000 lines of output. You can run grep on the leading digits of the calculated offset to reduce the amount of data that must be examined. Enter:
    nm /lib/libc.so.6 | sort | grep 0008a
This returns 66 lines. In the middle of the output, we find:
    0008a6fc t memcpy
    0008a754 T _wordcopy_fwd_aligned
The offset falls somewhere within memcpy. In this example, a null pointer was passed to memcpy as the destination address. Where did we call memcpy? Good question. We can identify the calling location by examining the register dump in the log file. Register 14 contains the return address for a function call. According to Figure 1, R14 is 0x8040066e, which yields the address 0x40066e once the high-order bit is masked off. This address falls within the range of our program, so we can run addr2line to determine where it is. Enter:
    addr2line -e simple 0x40066e
This returns:
    /home/devuser/simple.c:36
This is the line after our call to memcpy. One note about addr2line: if the executable file does not contain debugging symbols, you will get ??:0 as the response.
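The libc lookup described above (subtract the load address, then find the nearest preceding symbol in sorted nm output) and the R14 masking can be sketched in Python as well. The helper names parse_nm and symbol_for are hypothetical, and the sample text contains only the two nm lines quoted in this article, so this illustrates the technique rather than replacing nm and grep.

```python
import bisect

# The two relevant lines of nm output quoted in the article.
NM_OUTPUT = """\
0008a6fc t memcpy
0008a754 T _wordcopy_fwd_aligned
"""


def parse_nm(text):
    """Return (address, name) pairs sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:  # undefined symbols carry no address field
            symbols.append((int(parts[0], 16), parts[2]))
    return sorted(symbols)


def symbol_for(offset, symbols):
    """Return the last symbol whose address is at or below offset."""
    addresses = [address for address, _ in symbols]
    index = bisect.bisect_right(addresses, offset) - 1
    return symbols[index] if index >= 0 else None


# Offset into libc.so.6: instruction address minus load address.
offset = 0x400ab738 - 0x40021000
print(hex(offset))  # 0x8a738

start, name = symbol_for(offset, parse_nm(NM_OUTPUT))
print(name, hex(offset - start))  # memcpy 0x3c

# Return address from R14 with the addressing-mode bit masked off.
print(hex(0x8040066e & 0x7FFFFFFF))  # 0x40066e
```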