Linux Server Downtime, Case 2
An email alert reported that a Linux server in an overseas factory was unreachable, and the DPA (Database Performance Analyzer) system flagged a problem as well. Pinging the server showed the network was down, and the local system administrator could not be reached at first. After an email exchange, the administrator confirmed the problem and simply rebooted the server. Afterwards, a check of the logs turned up the following error messages (sensitive information redacted):
Jul 7 10:10:27 localhost kernel: ata2.00: qc timeout (cmd 0xa0)
Jul 7 10:11:00 localhost kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 7 10:11:00 localhost kernel: sr 2:0:0:0: CDB: Test Unit Ready: 00 00 00 00 00 00
Jul 7 10:11:00 localhost kernel: ata2.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jul 7 10:11:00 localhost kernel: res 51/20:03:00:00:00/00:00:00:00:00/a0 Emask 0x5 (timeout)
Jul 7 10:11:00 localhost kernel: ata2.00: status: { DRDY ERR }
Jul 7 10:11:00 localhost kernel: ata2: soft resetting link
Jul 7 10:11:00 localhost kernel: INFO: task hald-addon-stor:3728 blocked for more than 120 seconds.
Jul 7 10:11:00 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 7 10:11:00 localhost kernel: hald-addon-st D ffff8801989b6080 0 3728 3690 0x00000080
Jul 7 10:11:00 localhost kernel: ffff880428dd76a8 0000000000000082 ffff880428dd7668 ffff88042c6300d0
Jul 7 10:11:00 localhost kernel: ffff8804294ac0c0 ffff88040d730700 ffff8804294ac490 ffffffff812ef88e
Jul 7 10:11:00 localhost kernel: ffff880428dd76a8 ffffffff81311442 ffff88042c630000 7fffffffffffffff
Jul 7 10:12:06 localhost kernel: Call Trace:
Jul 7 10:12:06 localhost kernel: [<ffffffff812ef88e>] ? scsi_done+0x0/0x17
Jul 7 10:12:06 localhost kernel: [<ffffffff81311442>] ? __ata_scsi_queuecmd+0x1a6/0x200
Jul 7 10:12:06 localhost kernel: [<ffffffff814527e9>] schedule_timeout+0x36/0xe7
Jul 7 10:12:06 localhost kernel: [<ffffffff812331b6>] ? kobject_put+0x47/0x4c
Jul 7 10:12:06 localhost kernel: [<ffffffff8104367e>] ? need_resched+0x23/0x2d
Jul 7 10:12:06 localhost kernel: [<ffffffff8145265b>] wait_for_common+0xb7/0x12c
Jul 7 10:12:06 localhost kernel: [<ffffffff8104cf2f>] ? default_wake_function+0x0/0x19
Jul 7 10:12:06 localhost kernel: [<ffffffff8121e449>] ? __generic_unplug_device+0x32/0x37
Jul 7 10:12:06 localhost kernel: [<ffffffff81452773>] wait_for_completion+0x1d/0x1f
Jul 7 10:12:06 localhost kernel: [<ffffffff8122540b>] blk_execute_rq+0xcb/0x104
Jul 7 10:12:06 localhost kernel: [<ffffffff8121d88c>] ? freed_request+0x34/0x55
Jul 7 10:12:06 localhost kernel: [<ffffffff812f60af>] scsi_execute+0xde/0x12e
Jul 7 10:12:06 localhost kernel: [<ffffffff812f61cc>] scsi_execute_req+0xcd/0xf2
Jul 7 10:12:06 localhost kernel: [<ffffffff81302a9a>] sr_test_unit_ready+0x5c/0xb5
Jul 7 10:12:06 localhost kernel: [<ffffffff81302d0f>] sr_media_change+0x5d/0x28a
Jul 7 10:12:06 localhost kernel: [<ffffffff810448c9>] ? task_rq_unlock+0x11/0x13
Jul 7 10:12:06 localhost kernel: [<ffffffff8104cf08>] ? try_to_wake_up+0x3aa/0x3d1
Jul 7 10:12:06 localhost kernel: [<ffffffff8131f73f>] media_changed+0x53/0x88
Jul 7 10:12:06 localhost kernel: [<ffffffff8131f7a5>] cdrom_media_changed+0x31/0x37
Jul 7 10:12:06 localhost kernel: [<ffffffff81303072>] sr_block_media_changed+0x19/0x1b
Jul 7 10:12:06 localhost kernel: [<ffffffff8114407d>] check_disk_change+0x29/0x5b
Jul 7 10:12:06 localhost kernel: [<ffffffff81322f9b>] cdrom_open+0x919/0x9b9
Jul 7 10:12:06 localhost kernel: [<ffffffff810447f2>] ? __wake_up_sync_key+0x53/0x5f
Jul 7 10:12:06 localhost kernel: [<ffffffff8142ebec>] ? unix_stream_recvmsg+0x40f/0x4d3
Jul 7 10:12:06 localhost kernel: [<ffffffff813b63b5>] ? skb_release_data+0xaa/0xaf
Jul 7 10:12:06 localhost kernel: [<ffffffff8142ebec>] ? unix_stream_recvmsg+0x40f/0x4d3
Jul 7 10:12:06 localhost kernel: [<ffffffff8104367e>] ? need_resched+0x23/0x2d
The system is Oracle Linux Server release 5.7 running as a VMware virtual machine. The first thing that stands out is "kernel: INFO: task hald-addon-stor:3728 blocked for more than 120 seconds." hald-addon-stor is part of the HAL hardware daemon: it handles media devices and continuously polls them to detect whether media has been inserted or removed.
[root@localhost log]# grep 'Jul 7' messages | grep -E 'more than 120 seconds|hung_task_timeout_secs'
Jul 7 10:11:00 localhost kernel: INFO: task hald-addon-stor:3728 blocked for more than 120 seconds.
Jul 7 10:11:00 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 7 10:12:06 localhost kernel: INFO: task hald-addon-stor:3728 blocked for more than 120 seconds.
Jul 7 10:12:06 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 7 10:17:39 localhost kernel: INFO: task hald-addon-stor:3728 blocked for more than 120 seconds.
Jul 7 10:17:39 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[root@localhost log]#
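Since hald-addon-stor blocks while polling the (virtual) CD-ROM drive for media changes, one commonly suggested mitigation on HAL-based systems is to disable media polling for that device. This is only a sketch, not something we applied; the device path is an assumption and must be verified on the actual server first.

```shell
# Sketch only (NOT applied here): disable HAL media polling on the
# assumed virtual CD-ROM device. Confirm the device name first, e.g.:
#   ls -l /dev/cdrom /dev/sr*
hal-disable-polling --device /dev/sr0

# To re-enable polling later:
#   hal-disable-polling --device /dev/sr0 --enable-polling
```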
Regarding "task blocked for more than 120 seconds", many articles on the Internet tell the same story: adjusting the vm.dirty_ratio and vm.dirty_background_ratio parameters avoids the problem. However, I still have some doubts, for the following reasons:
1: On this server, vm.dirty_background_ratio is 10 and vm.dirty_ratio is 20. I checked: these are the default values per the official Red Hat documentation.
[root@localhost log]# sysctl -a | grep 'vm.dirty'
vm.dirty_background_ratio = 10
vm.dirty_background_bytes = 0
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000
2: This server has been running for years. If these parameters were the problem, the error should occur frequently, but so far it has happened only once.
3: We run multiple servers on this same OS version, all with the default kernel parameter values, and have never seen this problem on any of them.
Most likely some operation pushed the system past a critical threshold. Let's continue analyzing the error information and see what else we can find.
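For reference, the widely circulated workaround looks like the following. Lowering the dirty-page thresholds makes the kernel start writing dirty pages back sooner, so individual flushes are smaller and less likely to block a task past the 120-second hung-task timeout. The values shown (5/10) are the ones most articles suggest, not something we applied.

```shell
# Commonly suggested workaround (NOT applied here): flush dirty pages earlier.
# Runtime change, takes effect immediately:
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

# To persist across reboots, add to /etc/sysctl.conf:
#   vm.dirty_background_ratio = 5
#   vm.dirty_ratio = 10
# then reload with: sysctl -p
```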
sar -r: checking memory usage over this period, the percentage of memory used (%memused) was very high, and the percentage of swap used (%swapused) jumped from single digits to about 40%. The server was restarted during this window, and the sar data for part of the interval is missing; it is unclear why. From the available data, memory was clearly under severe pressure.
CPU usage was also checked with sar -u; it was not high, basically normal. Unfortunately, the I/O and transfer-rate statistics were not captured at the time (that day's data is no longer available), which would have helped interpret error messages such as "ata2.00: status: { DRDY ERR }" and "ata2.00: qc timeout (cmd 0xa0)".
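To make the sar figures concrete, here is a small self-contained sketch of how %memused and %swapused are derived. The meminfo numbers below are hypothetical sample values chosen to mirror the ~40% swap usage observed, not readings from the actual server.

```shell
# Sketch: how sar's %memused and %swapused are computed.
# Sample values (kB) are hypothetical, standing in for /proc/meminfo output.
meminfo() {
cat <<'EOF'
MemTotal:       16331136
MemFree:          489472
SwapTotal:       8388600
SwapFree:        5033160
EOF
}

mem_total=$(meminfo | awk '/^MemTotal/ {print $2}')
mem_free=$(meminfo | awk '/^MemFree/ {print $2}')
swap_total=$(meminfo | awk '/^SwapTotal/ {print $2}')
swap_free=$(meminfo | awk '/^SwapFree/ {print $2}')

# %memused = (MemTotal - MemFree) / MemTotal; %swapused likewise for swap.
awk -v t="$mem_total" -v f="$mem_free"  'BEGIN {printf "%%memused:  %.1f\n", (t-f)/t*100}'
awk -v t="$swap_total" -v f="$swap_free" 'BEGIN {printf "%%swapused: %.1f\n", (t-f)/t*100}'
```

With these sample values the script reports roughly 97% memory used and 40% swap used, the kind of picture sar -r painted here.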
Much of the material on this error points to a failing hard disk or disk read/write errors; some sources attribute it to a kernel bug.
Our kernel version is 2.6.32-200.13.1.el5uek. The thread at https://lists.linuxfoundation.org/pipermail/bugme-new/2009-July/022389.html notes that kernel 2.6.31 is not the only version affected:
"I have seen this messages not only on 2.6.31... also on except other 2.6 kernels."
Still, with the knowledge and experience at hand we could not pin the problem down; we could only make a vague judgment: a disk failure, a read/write error, or a kernel bug. We asked the local system administrator to check the storage hardware for alarms; no feedback yet. We also checked the disk partitions for bad blocks and found none:
[root@localhost ~]# /sbin/badblocks -v /dev/sda
Checking blocks 0 to 104857600
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
[root@localhost ~]# /sbin/badblocks -v /dev/sdb
Checking blocks 0 to 157286400
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
[root@localhost ~]# /sbin/badblocks -v /dev/sdc
Checking blocks 0 to 41943040
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
[root@localhost ~]# /sbin/badblocks -v /dev/sdd
Checking blocks 0 to 104857600
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
[root@localhost ~]# /sbin/badblocks -v /dev/sde
Checking blocks 0 to 188743680
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
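Besides badblocks, it is worth grepping the logs for the ATA error signatures themselves: a pattern recurring over days points at hardware, while a one-off points elsewhere. A minimal self-contained sketch, where a few sample lines stand in for /var/log/messages:

```shell
# Sketch: count ATA/SCSI error signatures in a messages log.
# The here-doc stands in for /var/log/messages so the example is self-contained;
# on a real server, run:
#   grep -cE 'qc timeout|DRDY ERR|soft resetting' /var/log/messages
sample_log() {
cat <<'EOF'
Jul 7 10:10:27 localhost kernel: ata2.00: qc timeout (cmd 0xa0)
Jul 7 10:11:00 localhost kernel: ata2.00: status: { DRDY ERR }
Jul 7 10:11:00 localhost kernel: ata2: soft resetting link
Jul 7 10:11:05 localhost kernel: eth0: link up
EOF
}

# grep -c counts matching lines; -E enables the alternation pattern.
errors=$(sample_log | grep -cE 'qc timeout|DRDY ERR|soft resetting')
echo "ATA error lines: $errors"
```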
In addition, the messages log contains a large number of "BAR 13: can't allocate I/O resource [0x10000-0xffff]" lines from system startup:
Jul 7 10:32:02 egvlnx03 kernel: ACPI: ACPI bus type pnp unregistered
Jul 7 10:32:02 egvlnx03 kernel: system 00:01: ioport range 0x1000-0x103f has been reserved
Jul 7 10:32:02 egvlnx03 kernel: system 00:01: ioport range 0x1040-0x104f has been reserved
Jul 7 10:32:02 egvlnx03 kernel: system 00:01: ioport range 0xcf0-0xcf1 has been reserved
Jul 7 10:32:02 egvlnx03 kernel: system 00:08: iomem range 0xfed00000-0xfed003ff has been reserved
Jul 7 10:32:02 egvlnx03 kernel: system 00:0d: ioport range 0xfce0-0xfcff has been reserved
Jul 7 10:32:02 egvlnx03 kernel: system 00:0d: iomem range 0xf0000000-0xf7ffffff has been reserved
Jul 7 10:32:02 egvlnx03 kernel: system 00:0d: iomem range 0xfe800000-0xfe9fffff has been reserved
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:15.3: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:15.4: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:15.5: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:15.6: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:15.7: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:16.3: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:16.4: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:16.5: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:16.6: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:16.7: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:17.3: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:17.4: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:17.5: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:17.6: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:17.7: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:18.2: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:18.3: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:18.4: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:18.5: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:18.6: BAR 13: can't allocate I/O resource [0x10000-0xffff]
Jul 7 10:32:02 egvlnx03 kernel: pci 0000:00:18.7: BAR 13: can't allocate I/O resource [0x10000-0xffff]
After some searching, an article explains the cause of this message and indicates it can be ignored: the message means the kernel could not allocate a prefetchable window for the bridge, and resource number 13 is PCI_BRIDGE_RESOURCES. When the system has PCI bridge devices on the PCI bus, these messages get logged. Although the message is recorded, the kernel falls back to a non-prefetchable window instead, so the issue can safely be ignored.
We do not want to modify the kernel parameters for now. For the time being, we have only asked the system administrator to add memory and then observe how the system behaves.
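While waiting for the memory upgrade, a simple way to keep evidence for next time is to log memory and swap headroom periodically. A minimal sketch; the log path and cron schedule are assumptions, and a fallback sample is included so the script runs even where /proc/meminfo is absent:

```shell
# Sketch: append a timestamped memory/swap snapshot to a log file.
# Schedule via cron, e.g.:  */5 * * * * /usr/local/bin/memwatch.sh
LOGFILE=${LOGFILE:-/tmp/memwatch.log}

meminfo_src() {
    # Read the real /proc/meminfo when available; otherwise emit
    # hypothetical sample values so the sketch runs anywhere.
    if [ -r /proc/meminfo ]; then
        cat /proc/meminfo
    else
        printf '%s\n' 'MemTotal: 16331136 kB' 'MemFree: 489472 kB' \
                      'SwapTotal: 8388600 kB' 'SwapFree: 5033160 kB'
    fi
}

# Keep only the four headline fields; one compact line per run.
snapshot=$(meminfo_src | awk '/^(MemTotal|MemFree|SwapTotal|SwapFree):/ {printf "%s%s ", $1, $2}')
echo "$(date '+%b %e %H:%M:%S') $snapshot" >> "$LOGFILE"
```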
References:
http://ju.outofmemory.cn/entry/193909
https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/
http://linux.it.net.cn/e/bug/2016/0102/19226.html