Use of the smartctl command line.
Command Line instructions:
All of the servers we currently use are equipped with LSI RAID cards. For a SAS disk behind the RAID card, smartctl needs the -d megaraid option:
smartctl -d megaraid,$deviceid /dev/$diskname
For a SATA disk behind the RAID card, use -d sat+megaraid instead:
smartctl -d sat+megaraid,$deviceid /dev/$diskname
You can use the RAID card tool to view the disk interface type.
megacli -cfgdsply -aall | grep -i 'PD Type'
If the disk is not behind a RAID card, the -d parameter is not required.
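The choice between the two -d forms can be wrapped in a small helper. This is only a sketch: the function name smartctl_dev_args and the "none" keyword for a disk without a RAID card are assumptions made for illustration, not part of smartctl or megacli.

```shell
# Build the smartctl device-type arguments for one disk.
# $1: interface type reported by the RAID tool (SAS or SATA), or "none"
#     when the disk is not behind a RAID card (hypothetical keyword)
# $2: device id on the RAID card (ignored for "none")
smartctl_dev_args() {
    case "$1" in
        SAS|sas)   printf '%s\n' "-d megaraid,$2" ;;
        SATA|sata) printf '%s\n' "-d sat+megaraid,$2" ;;
        none)      printf '\n' ;;
        *)         echo "unknown disk type: $1" >&2; return 1 ;;
    esac
}

# Example with a hypothetical device id 4 on /dev/sda:
#   smartctl $(smartctl_dev_args SATA 4) /dev/sda
```

The command substitution in the example is left unquoted on purpose so that "-d" and its value are passed to smartctl as two separate arguments.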
Command Line Return Value
After smartctl runs, its return value can be read from the shell variable $?. If the disk is completely healthy, the return value is 0. Otherwise, smartctl sets the bits that correspond to the types of error it found.
Each bit is described as follows:
Bit 0: Command line did not parse.
Bit 1: Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode (see -n option above).
Bit 2: Some SMART or other ATA command to the disk failed, or there was a checksum error in a SMART data structure (see -b option above).
Bit 3: SMART status check returned "DISK FAILING".
Bit 4: We found prefail Attributes <= threshold.
Bit 5: SMART status check returned "DISK OK" but we found that some (usage or prefail) Attributes have been <= threshold at some time in the past.
Bit 6: The device error log contains records of errors.
Bit 7: The device self-test log contains records of errors. [ATA only] Failed self-tests outdated by a newer successful extended self-test are ignored.
View bit settings:
status=$?
for ((i=0; i<8; i++)); do
  echo "Bit $i: $((status & 2**i && 1))"
done
Monitoring needs to check whether bit 3, bit 4, bit 5, bit 6, or bit 7 is set. If any other bit is set, a reminder is enough.
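For monitoring scripts it can be handier to print only the bits that are set, with a short description of each. The following is a sketch in POSIX sh. The function name decode_smartctl_status is an assumption, and the descriptions are shortened paraphrases of the bit list above.

```shell
# Print one line for every bit that is set in a smartctl exit status.
# $1: the exit status (0-255), e.g. the value of $? right after smartctl
decode_smartctl_status() {
    status=$1
    bit=0
    while [ "$bit" -lt 8 ]; do
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            case "$bit" in
                0) desc='command line did not parse' ;;
                1) desc='device open failed or device is in a low-power mode' ;;
                2) desc='a SMART or ATA command failed, or a checksum error occurred' ;;
                3) desc='SMART status check returned "DISK FAILING"' ;;
                4) desc='prefail attributes <= threshold' ;;
                5) desc='attributes were <= threshold at some time in the past' ;;
                6) desc='device error log contains errors' ;;
                7) desc='device self-test log contains errors' ;;
            esac
            echo "bit $bit: $desc"
        fi
        bit=$((bit + 1))
    done
}

# Example:
#   smartctl -a /dev/sda >/dev/null; decode_smartctl_status $?
```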
Attributes displayed by smartctl:
Take a server in the company as an example:
[[email protected] ~]# smartctl -A -P use /dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-279.el6.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   086   086   016    Pre-fail  Always       -       10813449
  2 Throughput_Performance  0x0005   132   132   054    Pre-fail  Offline      -       105
  3 Spin_Up_Time            0x0007   117   117   024    Pre-fail  Always       -       615 (Average 615)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       314
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   112   112   020    Pre-fail  Offline      -       39
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       23637
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       313
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       478
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       478
194 Temperature_Celsius     0x0002   222   222   000    Old_age   Always       -       27 (Min/Max 5/70)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
1. The ATTRIBUTE_NAME list may differ between disk manufacturers, but each one is a subset of the S.M.A.R.T. attribute list. The complete S.M.A.R.T. attribute list and the meaning of each attribute can be found here:
http://en.wikipedia.org/wiki/S.M.A.R.T.#8
2. WHEN_FAILED
The WHEN_FAILED field is displayed according to these rules:
if (VALUE <= THRESH)
    WHEN_FAILED = "FAILING_NOW";
else if (WORST <= THRESH)
    WHEN_FAILED = "In_the_past";   /* or "past" */
else
    WHEN_FAILED = "-";
In other words, when the WHEN_FAILED field of an attribute is "-", the attribute is normal and has never been at or below its threshold.
Likewise, when the smartctl return value has bit 4 or bit 5 set, check which ATTRIBUTE_NAME has a WHEN_FAILED field other than "-"; that attribute is the one with a problem.
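Picking out the attributes whose WHEN_FAILED field is not "-" can be done with awk. This sketch assumes the ten-column attribute layout shown in the sample output above (WHEN_FAILED is the ninth column). The function name failed_attributes is hypothetical.

```shell
# Read `smartctl -A` output on stdin and print "ATTRIBUTE_NAME WHEN_FAILED"
# for every attribute row whose WHEN_FAILED column is not "-".
# Attribute rows are recognized by a numeric ID in column 1 and a
# hexadecimal FLAG (0x....) in column 3; all other lines are skipped.
failed_attributes() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && NF >= 10 && $9 != "-" { print $2, $9 }'
}

# Example:
#   smartctl -A /dev/sdb | failed_attributes
```

An empty result means that no attribute has ever reached its threshold.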
Simple smartctl monitoring Solution
Run a smartctl scan on each disk at intervals of half an hour or less:
smartctl -a /dev/$devname
Check the return value of smartctl after every run.
If bit 2 is set, run smartctl -x -b warn /dev/$devname to see which commands are not supported, for example:
Warning: device does not support SCT data table command
Warning: device does not support SCT error recovery control command
If bit 4 or bit 5 is set, check the START OF READ SMART DATA SECTION of the smartctl output (the attributes described in the previous section) and record each ATTRIBUTE_NAME whose WHEN_FAILED field is not "-".
If bit 6 is set, record the output of smartctl -l xerror /dev/$devname.
If bit 7 is set, record the output of smartctl -l xselftest /dev/$devname.
If bit 3 is set, the SMART status check reported that the disk is failing.
Except for bit 5, it is best to raise an alarm in real time for each of the bits above.
If only other bits are set, a real-time alarm is not required.
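The alarm policy can be expressed as a bit mask test on the return value. This is a sketch under one stated assumption: the bits that need a real-time alarm are taken to be bits 2, 3, 4, 6 and 7 (mask 220), which is one reading of the text above. The mask can be overridden with a second argument. The function name classify_smartctl_status and the three result words are hypothetical.

```shell
# Classify a smartctl exit status.
# $1: exit status of smartctl
# $2: optional mask of bits that need a real-time alarm
#     (default 220 = bits 2, 3, 4, 6, 7; this default is an assumption)
# Prints "ok" (status 0), "realtime" (a masked bit is set) or
# "deferred" (only other bits, such as bit 5, are set).
classify_smartctl_status() {
    status=$1
    realtime_mask=${2:-220}
    if [ "$status" -eq 0 ]; then
        echo ok
    elif [ $(( status & realtime_mask )) -ne 0 ]; then
        echo realtime
    else
        echo deferred
    fi
}

# Example:
#   smartctl -a /dev/$devname >/dev/null
#   classify_smartctl_status $?
```

To leave bit 2 out of the real-time set, pass 216 as the second argument.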
Description of attribute_name:
Because the ATTRIBUTE_NAME lists provided by disks from different manufacturers are inconsistent, and I do not understand the meaning of some of the fields, alarms are not distinguished by ATTRIBUTE_NAME for the time being.
For example, we care about Throughput_Performance. The SMART data of the Hitachi disks in our company includes it, but the SMART data of the Seagate disks does not.
A more detailed monitoring solution can only be decided after gaining a deep understanding of the individual attributes.
SSD disk life monitoring
SSD disk life monitoring mainly monitors the following attributes:
Media_Wearout_Indicator: the wear level, which reflects how much has been written to the SSD.
Bytes: the number of bad blocks that have appeared since the disk left the factory.
Host_Writes_32MiB: the number of 32 MiB units written.
Available_Reservd_Space: the remaining reserved space on the SSD.
Each of the attributes above needs an alarm as soon as its VALUE field gets close to its THRESH field. The handling described in the previous sections can also be applied.
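One possible way to test "close to the threshold" is to compare VALUE minus THRESH with a margin. In this sketch the margin (default 10), the function name near_threshold_attributes, and the list of attribute names it checks are all assumptions. The names are the three from the list above that match smartctl's usual spelling.

```shell
# Read `smartctl -A` output on stdin and print "NAME VALUE THRESH" for each
# monitored SSD attribute whose VALUE is within MARGIN points of THRESH.
# $1: margin (hypothetical parameter, default 10)
near_threshold_attributes() {
    margin=${1:-10}
    awk -v margin="$margin" '
        # Only attribute rows (hex FLAG in column 3) with a monitored name.
        $3 ~ /^0x/ &&
        $2 ~ /^(Media_Wearout_Indicator|Host_Writes_32MiB|Available_Reservd_Space)$/ &&
        ($4 - $6) <= margin { print $2, $4 + 0, $6 + 0 }
    '
}

# Example:
#   smartctl -A /dev/$devname | near_threshold_attributes 10
```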