SMART Disk Monitoring Solution

Last Update: 2014-10-15 Source: Internet

Author: User

Use of the smartctl command line.

Command Line instructions:

Currently, all of our servers are equipped with LSI RAID cards. For a SAS disk behind such a card, smartctl needs the following -d option:

 smartctl -d megaraid,$deviceid  /dev/$diskname

For a SATA disk behind the RAID card, use:

 smartctl -d sat+megaraid,$deviceid  /dev/$diskname

The RAID card's command-line tool shows the interface type of each disk:

    megacli -cfgdsply -aall | grep -i 'PD Type'

If the disk is not behind a RAID card, the -d option is not required.
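
As a minimal sketch, the choice of -d argument described above could be wrapped in a small shell function. The function name megaraid_dev_opt and its two arguments are hypothetical, and the interface type is assumed to be passed in as the string SAS or SATA.

```shell
# Hypothetical helper: print the smartctl -d argument for a disk behind an
# LSI RAID card. $1 = interface type (SAS or SATA), $2 = device id.
megaraid_dev_opt() {
    case "$1" in
        SAS)  printf 'megaraid,%s\n' "$2" ;;
        SATA) printf 'sat+megaraid,%s\n' "$2" ;;
        *)    echo "unknown interface type: $1" >&2; return 1 ;;
    esac
}

# Example (device id 4 and /dev/sda are placeholders):
#   smartctl -a -d "$(megaraid_dev_opt SATA 4)" /dev/sda
```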

Command Line Return Value

After smartctl runs, its exit status is available in the shell variable $?. If the disk is completely healthy, the exit status is 0. Otherwise, bits are set according to the type of error.
Each bit is described as follows:

       Bit 0: Command line did not parse.
       Bit 1: Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode (see -n option above).
       Bit 2: Some SMART or other ATA command to the disk failed, or there was a checksum error in a SMART data structure (see -b option above).
       Bit 3: SMART status check returned "DISK FAILING".
       Bit 4: We found prefail Attributes <= threshold.
       Bit 5: SMART status check returned "DISK OK" but we found that some (usage or prefail) Attributes have been <= threshold at some time in the past.
       Bit 6: The device error log contains records of errors.
       Bit 7: The device self-test log contains records of errors.  [ATA only] Failed self-tests outdated by a newer successful extended self-test are ignored.

View bit settings:

      status=$?
      for ((i=0; i<8; i++)); do
        echo "Bit $i: $((status & 2**i && 1))"
      done

Monitor whether bits 3, 4, 5, 6, and 7 are set. If any other bit is set, a reminder is enough.
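
The bit test above can also be written as reusable functions. This is only a sketch: the names smart_set_bits and smart_needs_attention are hypothetical, and the mask 248 simply encodes bits 3 through 7 as listed in the text.

```shell
# Print the numbers of the bits set in a smartctl exit status,
# separated by spaces (prints an empty line for status 0).
smart_set_bits() {
    local status="$1"
    local i=0
    local out=""
    while [ "$i" -lt 8 ]; do
        if [ $(( status & (1 << i) )) -ne 0 ]; then
            out="${out:+$out }$i"
        fi
        i=$(( i + 1 ))
    done
    printf '%s\n' "$out"
}

# Succeed if any of the monitored bits (3, 4, 5, 6, 7) is set.
# 248 = 8 + 16 + 32 + 64 + 128
smart_needs_attention() {
    [ $(( $1 & 248 )) -ne 0 ]
}

# Example (capture $? immediately after smartctl, before any other command):
#   smartctl -a /dev/sda >/dev/null; status=$?
#   smart_needs_attention "$status" && echo "bits set: $(smart_set_bits "$status")"
```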

Attributes displayed by smartctl:

Take a server in the company as an example:

[[email protected] ~]# smartctl -A -P use /dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-279.el6.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   086   086   016    Pre-fail  Always       -       10813449
  2 Throughput_Performance  0x0005   132   132   054    Pre-fail  Offline      -       105
  3 Spin_Up_Time            0x0007   117   117   024    Pre-fail  Always       -       615 (Average 615)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       314
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   112   112   020    Pre-fail  Offline      -       39
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       23637
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       313
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       478
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       478
194 Temperature_Celsius     0x0002   222   222   000    Old_age   Always       -       27 (Min/Max 5/70)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

1. The ATTRIBUTE_NAME list differs between disk manufacturers, but each one is a subset of the S.M.A.R.T. attribute list. The complete SMART attribute list and the meaning of each attribute can be found here:
http://en.wikipedia.org/wiki/S.M.A.R.T.#8

2. WHEN_FAILED
The WHEN_FAILED field is displayed according to the following rules:

  if (VALUE <= THRESH)
          WHEN_FAILED = "FAILING_NOW";
  else if (WORST <= THRESH)
          WHEN_FAILED = "In_the_past" (or past);
  else
          WHEN_FAILED = "-";
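
The same rule can be sketched as a shell function. The names to_dec and when_failed are hypothetical; the leading zeros that smartctl prints (086, 016) are stripped first because shell arithmetic would otherwise read them as octal.

```shell
# Strip leading zeros so that values such as 086 are not read as octal.
to_dec() {
    local zeros n
    zeros="${1%%[!0]*}"
    n=${1#"$zeros"}
    printf '%s\n' "${n:-0}"
}

# Print the WHEN_FAILED value implied by VALUE ($1), WORST ($2), THRESH ($3).
when_failed() {
    local value worst thresh
    value=$(to_dec "$1")
    worst=$(to_dec "$2")
    thresh=$(to_dec "$3")
    if [ "$value" -le "$thresh" ]; then
        echo "FAILING_NOW"
    elif [ "$worst" -le "$thresh" ]; then
        echo "In_the_past"
    else
        echo "-"
    fi
}
```

For example, the Raw_Read_Error_Rate row of the sample output (VALUE 086, WORST 086, THRESH 016) gives "-".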

In other words, when the WHEN_FAILED field of an attribute is "-", the attribute is normal and has never crossed its threshold.

Likewise, when the smartctl exit status has bit 4 or bit 5 set, check which attributes have a WHEN_FAILED value other than "-"; those are the attributes with a problem.
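
A minimal sketch of that check, assuming the attribute table has the ten-column layout shown in the sample output above (WHEN_FAILED in the ninth column). The function name failed_attributes is hypothetical.

```shell
# Read "smartctl -A" output on stdin and print "NAME WHEN_FAILED" for every
# attribute row whose WHEN_FAILED column (field 9) is not "-".
# Rows are recognised by a purely numeric ID in the first field.
failed_attributes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $2, $9 }'
}

# Example (/dev/sdb is a placeholder):
#   smartctl -A /dev/sdb | failed_attributes
```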

Simple smartctl monitoring Solution

Run a smartctl scan on each disk at intervals of no more than half an hour:

smartctl -a  /dev/$devname

Check the smartctl exit status after every run:
If bit 2 is set, run smartctl -x -b warn /dev/$devname to see which commands are not supported, for example:
Warning: device does not support SCT data table command
Warning: device does not support SCT error recovery control command
If bit 4 or bit 5 is set, check the "START OF READ SMART DATA SECTION" part of the smartctl output, that is, the attributes described in the previous section, and record each ATTRIBUTE_NAME whose WHEN_FAILED field is not "-".
If bit 6 is set, record the output of smartctl -l xerror /dev/$devname.
If bit 7 is set, record the output of smartctl -l xselftest /dev/$devname.
If bit 3 is set, the SMART health status check has failed.

Except for bit 5, it is best to raise a real-time alarm for each of the bits above.
If only other bits are set, a real-time alarm is not required.
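
The policy above might be sketched as two small functions. Both names are hypothetical, and the alarm mask is an assumption: it treats bits 3, 4, 6, and 7 as real-time alarms (216 = 8 + 16 + 64 + 128) and only records bit 2 and bit 5; adjust the mask if bit 2 should also alarm.

```shell
# Classify a smartctl exit status: ALARM, RECORD, or OK.
# Assumption: bits 3, 4, 6, 7 trigger a real-time alarm (mask 216).
alarm_level() {
    if [ $(( $1 & 216 )) -ne 0 ]; then
        echo ALARM
    elif [ "$1" -ne 0 ]; then
        echo RECORD
    else
        echo OK
    fi
}

# Print the follow-up commands for the bits set in status $1 on device $2.
followup_commands() {
    local status="$1"
    local dev="$2"
    if [ $(( status & 4 )) -ne 0 ]; then echo "smartctl -x -b warn $dev"; fi
    if [ $(( status & 48 )) -ne 0 ]; then echo "smartctl -A $dev"; fi
    if [ $(( status & 64 )) -ne 0 ]; then echo "smartctl -l xerror $dev"; fi
    if [ $(( status & 128 )) -ne 0 ]; then echo "smartctl -l xselftest $dev"; fi
}

# Example loop (device names are placeholders):
#   for dev in /dev/sda /dev/sdb; do
#       smartctl -a "$dev" >/dev/null; status=$?
#       echo "$dev: $(alarm_level "$status")"
#       followup_commands "$status" "$dev"
#   done
```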

Description of attribute_name:

Because the attributes reported by disks from different manufacturers are inconsistent, and I do not understand the meaning of some of them, alarms are not differentiated by ATTRIBUTE_NAME for the time being.
For example, we are interested in Throughput_Performance. The SMART data of the Hitachi disks in our company includes this attribute, but that of the Seagate disks does not.
A more detailed monitoring solution requires a thorough understanding of each attribute before deciding how to handle it.

SSD disk life monitoring

SSD disk life monitoring mainly monitors the following attributes:

Media_Wearout_Indicator: the wear level, reflecting how much has been written to the SSD.
Bytes: the number of bad blocks produced since the disk left the factory.
Host_Writes_32MiB: the amount of data written, in units of 32 MiB.
Available_Reservd_Space: the remaining reserved space on the SSD.

For the attributes above, raise an alarm once the VALUE field gets close to the THRESH field. The handling described in the previous sections can also be applied.
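
One way to express "close to the threshold" is a fixed margin. This is a sketch only: the function names and the default margin of 10 points are assumptions, not values from the text.

```shell
# Strip leading zeros so that values such as 012 are not read as octal.
to_dec() {
    local zeros n
    zeros="${1%%[!0]*}"
    n=${1#"$zeros"}
    printf '%s\n' "${n:-0}"
}

# Succeed if VALUE ($1) is within MARGIN ($3, default 10) points of
# THRESH ($2), or already at or below it.
near_threshold() {
    local value thresh margin
    value=$(to_dec "$1")
    thresh=$(to_dec "$2")
    margin="${3:-10}"
    [ $(( value - thresh )) -le "$margin" ]
}

# Example: near_threshold 012 010 5 && echo "SSD attribute close to threshold"
```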

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.