Database server hard disk failure analysis and suggestions

Source: Internet
Author: User

Product Information:

Product: DL580 G5

Model No.: 2017381-b21

Serial number: CNG941S242

Hardware architecture:

DL580 G5 single-host

System Architecture:

Red Hat Linux Enterprise 5 + Oracle Database

Fault symptom:

A hard disk in BAY5 of a DL580 G5 showed a red light on June 7. On June 8, an HP Gold service engineer replaced it with a new hard disk and background data resynchronization began. About 20 to 30 minutes later, another hard disk in the same array, in BAY2, also showed a red light and the operating system crashed. After the server was restarted, the operating system could not be accessed normally, and the LOGVOL04 file was damaged.

Fault analysis:

1. In this case, the DL580 G5 array is a RAID 5 set of 8 hard disks. In practice, the capacity of 7 disks holds actual data, and the capacity of the remaining disk can be thought of as holding parity (verification) data. RAID 5 therefore tolerates the failure of only one hard disk. RAID 5 does not keep a second copy of the data or of the parity information; instead, each piece of parity information is stored on a different disk from the data it protects. When one disk in a RAID 5 array fails, its contents are reconstructed from the remaining data and the corresponding parity information.

Take four hard disks as an example (figure not available).
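The four-disk case can be sketched in a few lines of Python. This is a minimal illustration rather than the controller's actual implementation: it assumes byte-wise XOR parity, uses made-up block contents, and keeps the parity block in a fixed position, whereas a real RAID 5 array rotates parity across the disks. The helper name `xor_blocks` is hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical four-disk RAID 5 set:
# three data blocks plus one parity block (illustrative contents).
data = [b"ORCL", b"DATA", b"FILE"]
parity = xor_blocks(data)
stripe = data + [parity]

# Simulate losing disk 1, then rebuild its block from the three survivors.
lost = 1
survivors = [blk for i, blk in enumerate(stripe) if i != lost]
rebuilt = xor_blocks(survivors)

assert rebuilt == stripe[lost]
```

Because XOR-ing all blocks of a stripe gives zero, any single missing block equals the XOR of the others. With two blocks missing, that equation has two unknowns and cannot be solved.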

2. In this case, the first disk in the DL580 G5 showed a red light on June 7. The next day, June 8, another hard disk in the array, in BAY2, also encountered read errors, but it had not yet reached the level at which the drive is taken out of service, so there was no red-light alarm. The BAY2 hard disk error messages read from the log are as follows:

07:41:35

Physical Drive State

Drive failed. SCSI Port 1 scsi id 2 Physical drive 0002. failure reason: Aborted command. configured drive flag 01. spare drive flag 00. big drive 00000002. enclosure bay 02. enclosure box 00.(00 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 07 db 06 08 00 00 6c 2f 02 17 1b 68 00 00 00 06)

07:41:35

Logical Drive Status

State change, logical drive 00000000. Previous logical drive state: Logical drive is currently recovering.New logical drive state: Logical drive failed. Old spare status: 00000000 New spare status: No spare assigned(00 05 00 00 00 00 00 00 00 00 00 00 00 00 05 00 00 00 01 00 00 00 00 00 00 00 00 07 db 06 08 00 00 6c 2f 02 17 1b 68 00 00 00 07)

From the above, we can see that two hard disks in the RAID 5 array had problems, so the array data was incomplete.
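When many such events have to be reviewed, the readable part of each entry can be picked apart with a small script. The sketch below is an assumption-laden illustration, not an HP tool: the function name `parse_drive_event` is hypothetical, the patterns match only the wording of the entry quoted above, and the trailing hex payload is deliberately left out because its layout is not documented here.

```python
import re

# Event text copied from the controller log above (hex payload omitted).
EVENT = (
    "Drive failed. SCSI Port 1 scsi id 2 Physical drive 0002. "
    "failure reason: Aborted command. configured drive flag 01. "
    "spare drive flag 00. big drive 00000002. enclosure bay 02. "
    "enclosure box 00."
)

def parse_drive_event(text):
    """Extract the bay number and failure reason from a drive event line."""
    bay = re.search(r"enclosure bay (\d+)", text)
    reason = re.search(r"failure reason: ([^.]+)\.", text)
    return {
        "failed": text.startswith("Drive failed"),
        "bay": int(bay.group(1)) if bay else None,
        "reason": reason.group(1) if reason else None,
    }

event = parse_drive_event(EVENT)
print(event)
```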

3. Therefore, after the first hard disk (BAY5) was replaced and the controller tried to rebuild its contents from the other seven hard disks, the corresponding data on the BAY2 hard disk could not be read correctly, and the rebuild failed with an error (screenshot not available).

4. The BAY2 hard disk reported a complete failure around June 11, and its red light came on.
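The sequence described in points 3 and 4 can be modeled as follows. This is a simplified sketch under stated assumptions: byte-wise XOR parity, one stripe of made-up data, list indexes standing in for bay numbers, and `None` marking a block that cannot be read. The names `xor_blocks` and `rebuild_block` are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_block(stripe, missing):
    """Rebuild the block at index `missing`; None marks an unreadable block."""
    survivors = [blk for i, blk in enumerate(stripe) if i != missing]
    if any(blk is None for blk in survivors):
        raise IOError("stripe unrecoverable: more than one block is unreadable")
    return xor_blocks(survivors)

# Eight-disk stripe: seven data blocks plus one parity block (index 7).
data = [bytes([i] * 4) for i in range(1, 8)]
stripe = data + [xor_blocks(data)]

# Case 1: only the replaced disk (index 5) needs rebuilding.
healthy = list(stripe)
healthy[5] = None
recovered = rebuild_block(healthy, missing=5)

# Case 2: the disk at index 2 also returns a read error during the rebuild.
damaged = list(stripe)
damaged[5] = None
damaged[2] = None
try:
    rebuild_block(damaged, missing=5)
    outcome = "rebuilt"
except IOError:
    outcome = "failed"

print(recovered == stripe[5], outcome)
```

The first case succeeds because seven readable blocks determine the eighth. The second fails because single parity cannot recover two unreadable blocks in the same stripe.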

The above is the fault analysis.

Follow-up suggestions:

1. The analysis above shows that relying on RAID 5 alone for redundancy carries a significant data risk in a relatively important system. Because data and the corresponding parity are written continuously in the background, the read/write load on the hard disks is very high. If more than one hard disk develops bad blocks or fails completely, the entire array is at risk and the application system may crash.

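2. In light of this failure, we recommend using RAID 5 with a hot spare, or an ADG array, in servers that run important application systems. With a hot spare, the rebuild starts automatically as soon as a disk fails, which shortens the period during which the array has no redundancy, although a second failure before the rebuild completes is still fatal. ADG (HP's dual-parity RAID level, comparable to RAID 6) allows two hard disks to be lost at the same time.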

3. Check the related hardware and collect logs on a regular basis to find potential faults and prevent them in advance. The HP Gold service can assist with this work. If necessary, the HP Gold service can provide an inspection every two months (downtime needs to be scheduled in advance).

4. If there is a Windows server in the same network segment, you can install on it the IRS remote monitoring software that HP is currently promoting. When a monitored server fails, the error details are sent automatically over the network to the HP call center, and HP customer service emails the user so that the fault can be repaired promptly. (The software is free of charge, but three ports need to be opened on the host to connect to the Internet.)
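The trade-off behind suggestion 2 can be made concrete for an eight-bay server. The figures below are a sketch based on general RAID arithmetic, not on measurements from this system, and the function name `array_profile` and the level labels are hypothetical. Note that a RAID 5 set with a hot spare survives a second disk failure only if the rebuild onto the spare has already finished.

```python
def array_profile(total_disks, level):
    """Return (data_disks, simultaneous_failures_tolerated) for a layout."""
    if level == "raid5":
        return total_disks - 1, 1
    if level == "raid5+hotspare":
        # One disk idles as a spare; the active set is still single-parity.
        return total_disks - 2, 1
    if level == "adg":
        # Dual parity, comparable to RAID 6.
        return total_disks - 2, 2
    raise ValueError(f"unknown level: {level}")

for level in ("raid5", "raid5+hotspare", "adg"):
    data_disks, tolerated = array_profile(8, level)
    print(f"{level}: {data_disks} data disks, "
          f"tolerates {tolerated} simultaneous failure(s)")
```

Both recommended layouts give up one more disk of capacity than plain RAID 5. Only ADG raises the number of simultaneous failures the array can survive.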

The above is the analysis of the cause of the fault and subsequent suggestions.
