Linux server downtime Case 2: linux case

Last Update:2016-07-08 Source: Internet

Author: User

Tags prefetch

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Linux server downtime Case 2: linux case

An email alert showed that a Linux server in an overseas factory could not be connected, and the DPA (Database Performance Analyzer) system also found a problem. ping the server to find that the Network was disconnected and the local system administrator could not be contacted, after email consultation, the system administrator also found that there was a problem and restarted directly. Afterwards, the system administrator checked and found that the following error message (processing of sensitive information) appeared from)

Jul  7 10:10:27 localhost kernel: ata2.00: qc timeout (cmd 0xa0)

Jul  7 10:11:00 localhost kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Jul  7 10:11:00 localhost kernel: sr 2:0:0:0: CDB: Test Unit Ready: 00 00 00 00 00 00

Jul  7 10:11:00 localhost kernel: ata2.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0

Jul  7 10:11:00 localhost kernel:          res 51/20:03:00:00:00/00:00:00:00:00/a0 Emask 0x5 (timeout)

Jul  7 10:11:00 localhost kernel: ata2.00: status: { DRDY ERR }

Jul  7 10:11:00 localhost kernel: ata2: soft resetting link

Jul  7 10:11:00 localhost kernel: INFO: task hald-addon-stor:3728 blocked for more than 120 seconds.

Jul  7 10:11:00 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Jul  7 10:11:00 localhost kernel: hald-addon-st D ffff8801989b6080     0  3728   3690 0x00000080

Jul  7 10:11:00 localhost kernel:  ffff880428dd76a8 0000000000000082 ffff880428dd7668 ffff88042c6300d0

Jul  7 10:11:00 localhost kernel:  ffff8804294ac0c0 ffff88040d730700 ffff8804294ac490 ffffffff812ef88e

Jul  7 10:11:00 localhost kernel:  ffff880428dd76a8 ffffffff81311442 ffff88042c630000 7fffffffffffffff

Jul  7 10:12:06 localhost kernel: Call Trace:

Jul  7 10:12:06 localhost kernel:  [<ffffffff812ef88e>] ? scsi_done+0x0/0x17

Jul  7 10:12:06 localhost kernel:  [<ffffffff81311442>] ? __ata_scsi_queuecmd+0x1a6/0x200

Jul  7 10:12:06 localhost kernel:  [<ffffffff814527e9>] schedule_timeout+0x36/0xe7

Jul  7 10:12:06 localhost kernel:  [<ffffffff812331b6>] ? kobject_put+0x47/0x4c

Jul  7 10:12:06 localhost kernel:  [<ffffffff8104367e>] ? need_resched+0x23/0x2d

Jul  7 10:12:06 localhost kernel:  [<ffffffff8145265b>] wait_for_common+0xb7/0x12c

Jul  7 10:12:06 localhost kernel:  [<ffffffff8104cf2f>] ? default_wake_function+0x0/0x19

Jul  7 10:12:06 localhost kernel:  [<ffffffff8121e449>] ? __generic_unplug_device+0x32/0x37

Jul  7 10:12:06 localhost kernel:  [<ffffffff81452773>] wait_for_completion+0x1d/0x1f

Jul  7 10:12:06 localhost kernel:  [<ffffffff8122540b>] blk_execute_rq+0xcb/0x104

Jul  7 10:12:06 localhost kernel:  [<ffffffff8121d88c>] ? freed_request+0x34/0x55

Jul  7 10:12:06 localhost kernel:  [<ffffffff812f60af>] scsi_execute+0xde/0x12e

Jul  7 10:12:06 localhost kernel:  [<ffffffff812f61cc>] scsi_execute_req+0xcd/0xf2

Jul  7 10:12:06 localhost kernel:  [<ffffffff81302a9a>] sr_test_unit_ready+0x5c/0xb5

Jul  7 10:12:06 localhost kernel:  [<ffffffff81302d0f>] sr_media_change+0x5d/0x28a

Jul  7 10:12:06 localhost kernel:  [<ffffffff810448c9>] ? task_rq_unlock+0x11/0x13

Jul  7 10:12:06 localhost kernel:  [<ffffffff8104cf08>] ? try_to_wake_up+0x3aa/0x3d1

Jul  7 10:12:06 localhost kernel:  [<ffffffff8131f73f>] media_changed+0x53/0x88

Jul  7 10:12:06 localhost kernel:  [<ffffffff8131f7a5>] cdrom_media_changed+0x31/0x37

Jul  7 10:12:06 localhost kernel:  [<ffffffff81303072>] sr_block_media_changed+0x19/0x1b

Jul  7 10:12:06 localhost kernel:  [<ffffffff8114407d>] check_disk_change+0x29/0x5b

Jul  7 10:12:06 localhost kernel:  [<ffffffff81322f9b>] cdrom_open+0x919/0x9b9

Jul  7 10:12:06 localhost kernel:  [<ffffffff810447f2>] ? __wake_up_sync_key+0x53/0x5f

Jul  7 10:12:06 localhost kernel:  [<ffffffff8142ebec>] ? unix_stream_recvmsg+0x40f/0x4d3

Jul  7 10:12:06 localhost kernel:  [<ffffffff813b63b5>] ? skb_release_data+0xaa/0xaf

Jul  7 10:12:06 localhost kernel:  [<ffffffff8142ebec>] ? unix_stream_recvmsg+0x40f/0x4d3

Jul  7 10:12:06 localhost kernel:  [<ffffffff8104367e>] ? need_resched+0x23/0x2d

The system version is Oracle Linux Server release 5.7 and vmwarevm. At the beginning of the check, focus on "kernel: INFO: task hald-addon-stor: 3728 blocked for more than 120 seconds. About hald-addon-store, this is the process of mounting read/write operations on media devices. It is a hardware daemon process, it constantly checks whether devices are connected to or not.

[root@localhost log]# grep 'Jul  7' messages | grep -E 'more than 120 seconds|hung_task_timeout_secs'

Jul  7 10:11:00 localhost kernel: INFO: task hald-addon-stor:3728 blocked for more than 120 seconds.

Jul  7 10:11:00 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Jul  7 10:12:06 localhost kernel: INFO: task hald-addon-stor:3728 blocked for more than 120 seconds.

Jul  7 10:12:06 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Jul  7 10:17:39 localhost kernel: INFO: task hald-addon-stor:3728 blocked for more than 120 seconds.

Jul  7 10:17:39 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[root@localhost log]#

With regard to "task blocked for more than 120 seconds", many documents on the Internet share the same story. Modifying the vm. dirty_ratio parameter and vm. dirty_backgroud_ratio parameter can avoid this problem. However, I still have some doubts, for the following reasons:

1: The vm. dirty_background_ratio parameter of this server is 10, and the vm. dirty_ratio parameter is 20. I checked this. This is the default value in the Redhat official document.

[root@localhost log]# sysctl -a | grep 'vm.dirty'

vm.dirty_background_ratio = 10

vm.dirty_background_bytes = 0

vm.dirty_ratio = 20

vm.dirty_bytes = 0

vm.dirty_writeback_centisecs = 500

vm.dirty_expire_centisecs = 3000

2: This server has been running for many years. If there is a problem with this parameter, this problem should occur frequently, but so far this problem has only occurred once.

3: We still have multiple servers in this operating system version. kernel parameters are also the official default values, and we have never seen such problems.

Of course, this error is caused by some operations that trigger certain critical values. Let's continue to analyze the error information to see if we can find some information.

Sar-r: Check the memory usage during this period of time. The percentage of memory used (% memused) is very high, and the percentage of swap space used (% swapused) the ratio is also direct from single-digit to 40%. The server restarts at and the data between and is unavailable. It is unclear why. From the data, we can see that memory resources are very tight.

In addition, the CPU is also viewed using sar-u, And the CPU usage is not high. Basically normal. At that time, I/O and transfer rate statistics were forgotten (now I can't see the data of yesterday), and "ata2.00: status: {drdy err} "," ata2.00: qc timeout (cmd 0xa0) "; and other error messages

Many materials about this error refer to hard disk faults or hard disk read/write errors. It is also said that the kernel bug

Of course, the kernel version of our system is 2.6.32-200.13.1.el5uek.Https://lists.linuxfoundation.org/pipermail/bugme-new/2009-July/022389.htmlThe link also says that not only the kernel version 2.6.31 is available:

I have seen this messages not only on 2.6.31... also on except other 2.6 kernels.

However, you cannot determine or locate problems based on your knowledge and experience. You can only make a vague judgment, either hard disk failure, read/write error, or kernel error. Ask the system administrator over there to check whether there are any alarms on the storage device, and there is no feedback for the moment. check whether there are bad blocks in the disk partition and find no bad blocks.

[root@localhost ~]# /sbin/badblocks -v /dev/sda

Checking blocks 0 to 104857600

Checking for bad blocks (read-only test): done

Pass completed, 0 bad blocks found.

[root@localhost ~]# /sbin/badblocks -v /dev/sdb

Checking blocks 0 to 157286400

Checking for bad blocks (read-only test): done

Pass completed, 0 bad blocks found.

[root@localhost ~]# /sbin/badblocks -v /dev/sdc

Checking blocks 0 to 41943040

Checking for bad blocks (read-only test): done

Pass completed, 0 bad blocks found.

[root@localhost ~]# /sbin/badblocks -v /dev/sdd

Checking blocks 0 to 104857600

Checking for bad blocks (read-only test): done

Pass completed, 0 bad blocks found.

[root@localhost ~]# /sbin/badblocks -v /dev/sde

Checking blocks 0 to 188743680

Checking for bad blocks (read-only test): done

Pass completed, 0 bad blocks found.

In addition, the message contains a large number of BAR 13: can't allocate I/O resource [0x0000-0xffff] information when the system is started.

Jul  7 10:32:02 egvlnx03 kernel: ACPI: ACPI bus type pnp unregistered

Jul  7 10:32:02 egvlnx03 kernel: system 00:01: ioport range 0x1000-0x103f has been reserved

Jul  7 10:32:02 egvlnx03 kernel: system 00:01: ioport range 0x1040-0x104f has been reserved

Jul  7 10:32:02 egvlnx03 kernel: system 00:01: ioport range 0xcf0-0xcf1 has been reserved

Jul  7 10:32:02 egvlnx03 kernel: system 00:08: iomem range 0xfed00000-0xfed003ff has been reserved

Jul  7 10:32:02 egvlnx03 kernel: system 00:0d: ioport range 0xfce0-0xfcff has been reserved

Jul  7 10:32:02 egvlnx03 kernel: system 00:0d: iomem range 0xf0000000-0xf7ffffff has been reserved

Jul  7 10:32:02 egvlnx03 kernel: system 00:0d: iomem range 0xfe800000-0xfe9fffff has been reserved

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:15.3: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:15.4: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:15.5: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:15.6: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:15.7: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:16.3: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:16.4: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:16.5: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:16.6: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:16.7: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:17.3: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:17.4: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:17.5: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:17.6: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:17.7: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:18.2: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:18.3: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:18.4: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:18.5: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:18.6: BAR 13: can't allocate I/O resource [0x10000-0xffff]

Jul  7 10:32:02 egvlnx03 kernel: pci 0000:00:18.7: BAR 13: can't allocate I/O resource [0x10000-0xffff]

After verification, an article explains the cause of the message and indicates that the problem can be ignored.

This message indicates that the kernel cannot allocate prefetch memory slots.

The number of the prefetch memory slot is 13 (PCI_BRIDGE_RESOURCES ).

If the system has 13 PCI bridge devices on the PCI bus, the above messages are recorded.

Although the message is recorded, the kernel uses a non-prefetch window instead of a prefetch area. Therefore, you can ignore this issue.

We do not want to modify the kernel parameters for the time being. For the time being, we only recommend that the system administrator add memory resources and then observe the specific operating conditions of the system.

References:

Http://ju.outofmemory.cn/entry/193909

Https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/

Http://linux.it.net.cn/e/bug/2016/0102/19226.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More