(Zabbix) Hard drive hardware health monitoring and component life monitoring

Source: Internet
Author: User

The year has a dream


First, an overview of S.M.A.R.T.

Hard disk failures generally fall into two types: predictable and unpredictable. Unpredictable failures, such as a sudden chip failure or a mechanical shock, happen without warning and cannot be prevented. Predictable failures, such as wear of the motor bearings or degradation of the platter's magnetic media, show abnormal symptoms days or even weeks in advance. When this happens, the SMART feature raises an alarm at start-up, giving the user time to transfer important data to another storage device.
The earliest hard disk monitoring technology dates from 1992, when IBM introduced it in the IBM 0662 SCSI-2 hard disk drives used in AS/400 computers. It was later named Predictive Failure Analysis (PFA). The drive firmware measures several important health parameters and evaluates them, and the monitoring software then reports one of two results: "the drive is healthy" or "the drive will fail soon".

Shortly afterwards, the microcomputer maker Compaq and the hard drive makers Seagate, Quantum and Conner jointly proposed a similar technology called IntelliSafe. With it, the hard drive measures its own health metrics and passes the values to the operating system and the user's monitoring software; each drive manufacturer is free to decide which metrics to monitor and to set their safety thresholds.
In 1995, Compaq submitted the technology to the Small Form Factor (SFF) Committee for standardization, with support from IBM, Seagate, Quantum, Conner and Western Digital. In June 1996, revision 1.3 was issued and the technology was officially renamed S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology). It became the technical standard for automatically monitoring hard drive integrity and reporting potential problems.


As an industry specification, SMART defines the requirements that hard disk manufacturers should follow. To comply with the SMART standard, a device must:
1) have the parameters and attributes that SMART requires set at manufacturing time;
2) work with SMART normally on the given system platform: through BIOS detection, the system can tell whether the device supports SMART, display the relevant information, and distinguish valid from invalid SMART information;
3) allow users to turn the SMART function on and off freely;
4) provide SMART information during use, so that the working status of the device can be determined and corresponding corrective instructions or warnings can be issued.
When both the hard drive and the operating system support SMART and it is enabled, an English warning is shown on the screen if the drive is in poor condition: "WARNING: IMMEDIATELY BACKUP YOUR DATA AND REPLACE YOUR HARD DISK DRIVE, A FAILURE MAY BE IMMINENT."
The SMART feature continuously collects information from the individual sensors on the hard drive and stores it in the drive's system reserved area. This area is typically located on the first few dozen physical tracks of the disk's physical surface, where the vendor writes the relevant internal management programs. Besides the SMART information table, it holds low-level formatting programs, encryption and decryption programs, self-monitoring programs, automatic repair programs, and so on. The monitoring software used by the user reads SMART information through a command called "SMART Return Status" (command code: B0h); end users are not allowed to modify the information.
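As a rough illustration of how monitoring software might consume this information, the sketch below parses an attribute table in the column layout printed by the smartmontools command `smartctl -A` and lists the attributes whose normalized value has fallen to the threshold. The use of smartctl, the sample values, and the function names are assumptions made for illustration; the article does not say which tool it uses.

```python
from typing import Dict, List

# Sample text in the column layout of `smartctl -A` (the values are made up).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   140    Pre-fail  Always   FAILING_NOW 1820
194 Temperature_Celsius     0x0022   112   098   000    Old_age   Always       -       35
"""

def parse_attributes(text: str) -> List[Dict]:
    """Turn each attribute row into a dict; header and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 9)  # at most 10 columns; the raw value keeps its spaces
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        rows.append({
            "id": int(parts[0]),
            "name": parts[1],
            "value": int(parts[3]),   # normalized current value
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "raw": parts[9].strip(),
        })
    return rows

def failing(rows: List[Dict]) -> List[str]:
    """Names of attributes whose normalized value is at or below a non-zero threshold."""
    return [r["name"] for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]

if __name__ == "__main__":
    print(failing(parse_attributes(SAMPLE)))
```

In a Zabbix setup, a script like this could feed individual items, one per attribute; how the items are wired up is left open here.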

We will use SMART to monitor the health of the hard drive and the remaining life of its components. Here are my monitoring results:


SSD hardware monitoring:

[Screenshot: SSD hardware monitoring (Clipboard.png), http://s3.51cto.com/wyfs02/M02/79/91/wKiom1aUu2mzmaOlAAESbDAjEPU067.png]

Interpretation of the monitored items:

Number of blocks with program errors: self-explanatory.

Power-on count: self-explanatory.

Percentage of hard disk usage time: this is the time the drive has been powered on. The data value directly accumulates the power-on time of the device, so a new drive should be close to 0. The counting unit differs between drives: some count in hours, others in minutes, seconds, or even 30-second units, as defined by the manufacturer. A value approaching the threshold indicates that the drive is nearing its expected design life; it does not mean that the drive will fail or must be scrapped immediately. Refer to the MTBF (mean time between failures) given by the manufacturer to estimate the remaining life or failure probability.
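Because the counting unit is vendor-defined, the raw value has to be normalized before it can be compared with a design life. A minimal sketch, assuming the unit is known for the drive model; the function names and the design-life figure passed in are hypothetical:

```python
# Seconds represented by one raw count, per counting unit.
# Which unit applies to a given drive is defined by its manufacturer.
UNIT_SECONDS = {"hours": 3600, "minutes": 60, "seconds": 1, "30s": 30}

def power_on_hours(raw: int, unit: str) -> float:
    """Convert a raw power-on counter to hours."""
    return raw * UNIT_SECONDS[unit] / 3600

def life_used_percent(raw: int, unit: str, design_life_hours: float) -> float:
    """Share of the assumed design life already consumed, capped at 100%."""
    return min(100.0, 100.0 * power_on_hours(raw, unit) / design_life_hours)

if __name__ == "__main__":
    # One year of continuous operation against an assumed 5-year design life.
    print(life_used_percent(8760, "hours", 43800))
```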

Hard drive temperature: self-explanatory.

Erase life percentage of the drive's components: based on the average erase count of all good blocks. Flash chips tolerate only a limited number of writes, and some data is rewritten far more often than the rest; with the FAT file system, for example, the file allocation table must be updated frequently. If some areas of flash memory are written too often, they wear out faster than other areas, which significantly shortens the life of the whole drive (even if the erase counts of the other areas are far below the limit). Spreading writes evenly across all areas therefore significantly prolongs chip life; this is called wear leveling. In plain terms, this item is the erase/write life of the drive's blocks.
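The effect of wear leveling can be seen in a toy simulation. This is only a sketch under simplifying assumptions (one erase per write, and a "least-worn block first" policy that is not taken from any real controller); the function name is hypothetical.

```python
def max_erase_count(num_blocks: int, writes: int, leveled: bool) -> int:
    """Return the erase count of the most-worn block after `writes` updates."""
    erases = [0] * num_blocks
    for _ in range(writes):
        # Without leveling, every update hits block 0 (think of a frequently
        # rewritten allocation table). With leveling, the least-worn block is used.
        target = erases.index(min(erases)) if leveled else 0
        erases[target] += 1
    return max(erases)

if __name__ == "__main__":
    print("no leveling:", max_erase_count(8, 800, leveled=False))
    print("leveled:    ", max_erase_count(8, 800, leveled=True))
```

With the same number of writes, the most-worn block sees far fewer erases when writes are spread out, which is why the drive as a whole lasts longer.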

Hard drive error detection and correction (ECC) count: ECC (Error Correcting Code) checks for and corrects errors, so that read and write operations can continue without being interrupted by them. The data value records the number of errors corrected by ECC during read and write operations.

Percentage of retired blocks remaining: flash blocks identified as damaged are recorded as retired blocks and are no longer used; a spare block is automatically mapped in place of each retired bad block.

Percentage of spare blocks used: the share of spare blocks already used to replace retired blocks. When this reaches 100%, the drive can no longer be used: no spare blocks are left, so a newly failed block cannot be replaced and data is lost.



SATA hard disk hardware monitoring:

[Screenshot: SATA hard disk hardware monitoring (Clipboard1.png), http://s3.51cto.com/wyfs02/M02/79/8F/wKioL1aUu6rTsj81AAJFpKTxFQY510.png]

Interpretation of the monitored items:

Pending sector count: the data value is the number of "unstable" sectors, that is, sectors waiting to be remapped (also called "pending sectors"). If an unstable sector is later read or written successfully, it leaves the pending list and the data value drops. A sector that fails only on read is not remapped; it is listed as pending, because a later read may succeed. Remapping happens only when a write fails: the next time a write to the sector fails, a remap is performed, the data values of the remapped sector count (05) and the remap event count (C4) increase, and the data value of this parameter decreases.
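The interplay of these three counters can be modeled in a few lines. This is a toy model of the behavior described above, not drive firmware; the class and method names are hypothetical, and the simplification that any failed write triggers a remap is an assumption.

```python
class SectorCounters:
    """Toy model of the pending count, remapped sector count (05) and remap event count (C4)."""

    def __init__(self) -> None:
        self.pending = set()        # sectors waiting to be remapped
        self.remapped_sectors = 0   # attribute 05
        self.remap_events = 0       # attribute C4

    @property
    def pending_count(self) -> int:
        return len(self.pending)

    def read(self, lba: int, ok: bool) -> None:
        # A failed read only marks the sector as pending; a good read clears it.
        if ok:
            self.pending.discard(lba)
        else:
            self.pending.add(lba)

    def write(self, lba: int, ok: bool) -> None:
        # A good write clears the pending state; a failed write forces a remap.
        self.pending.discard(lba)
        if not ok:
            self.remapped_sectors += 1
            self.remap_events += 1

if __name__ == "__main__":
    c = SectorCounters()
    c.read(100, ok=False)
    c.write(100, ok=False)
    print(c.pending_count, c.remapped_sectors, c.remap_events)
```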


Offline uncorrectable sector count: the data value accumulates the total number of uncorrectable errors that occurred while reading or writing sectors. A rising value indicates a problem with the platter surface media or the mechanical subsystem. Some sectors are certainly unreadable, and if a file uses them the operating system returns a disk read error. The next write to such a sector triggers a remap.


Head load count: on older hard disks, when the platters stopped spinning the head arm was parked in a landing zone near the center of the platter, with the heads touching the platter surface. Only when the platters reached a certain speed did the heads float above the surface and move outward to the data area. The heads therefore rubbed against the platter at every start and stop. The landing zone stores no data, but each start-stop cycle still wore the heads twice, so on those drives the number of head landings was an important life-critical parameter.
On modern hard drives the head arm is usually parked on a specially designed ramp outside the platter, away from the platter surface. Only when the platters reach the rated speed does the head arm swing inward (toward the spindle) and move the heads over the platter (load); to unload, the arm swings outward and returns to the ramp. This completely eliminates head-platter contact at start and stop. Western Digital calls it "ramp loading technology". Because the heads never touch the platter during load/unload and are not worn, the importance of this parameter has dropped considerably.
The data value of this parameter is the cumulative number of load/unload operations performed by the heads. In principle it should equal the drive's start-stop count, but on laptop drives and newer energy-saving desktop drives it is much larger. The head arm assembly is designed with a fixed return torque so that, on an unexpected power loss, spring force pulls the heads off the platter and quickly back to the ramp. To keep the heads over the platter, the arm's drive motor (seek motor) must therefore carry current continuously. Unloading the heads to the ramp once the drive has been idle for a few minutes both saves energy and reduces the chance of head-platter contact caused by an external shock. Reloading adds a little seek time, but the benefits outweigh the cost, so on such drives the load/unload count is far larger than the Power Cycle Count (0C) or Start-Stop Count (04). Because this load/unload method involves no head-platter contact, the design ratings are much higher: typically 300,000 to 600,000 cycles for laptop drives, and up to 1,000,000 cycles for newer energy-saving desktop drives.
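To put a load/unload count into perspective, it helps to compare it with the rated number of cycles and with the power cycle count. A small sketch, assuming the rating for the drive model is known; the function name and the sample figures are hypothetical.

```python
def load_cycle_report(load_cycles: int, power_cycles: int, rated_cycles: int) -> dict:
    """Relate the head load/unload count to its rating and to the power cycle count (0C)."""
    return {
        # Share of the rated load/unload cycles already consumed.
        "percent_of_rating": round(100.0 * load_cycles / rated_cycles, 1),
        # A large ratio is expected on drives that unload the heads when idle.
        "loads_per_power_cycle": round(load_cycles / max(power_cycles, 1), 1),
    }

if __name__ == "__main__":
    # Sample figures: an energy-saving desktop drive rated for 1,000,000 cycles.
    print(load_cycle_report(load_cycles=250_000, power_cycles=500, rated_cycles=1_000_000))
```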

Hard drive power-on count: self-explanatory.

Hard drive spindle motor life: self-explanatory.

Percentage of hard disk usage: self-explanatory.

Hard drive temperature: self-explanatory.

Unexpected power-loss count: self-explanatory.

Raw read error rate percentage: the raw read error rate reflects errors that occur when the heads read data from the platter surface. On some drives, a data value greater than 0 indicates a problem with the platter surface or the read/write heads, such as media damage, head contamination, or head resonance.

Seek error rate percentage: the error rate of head seeks. Many factors can raise it, such as problems in the mechanical system of the head assembly, partial faults in the servo circuit, poor surface media, or an excessively high drive temperature.

Percentage of spare sectors remaining: when read, write, or verify errors persist in a sector, the drive firmware adds the sector's physical address to the defect table (G-list), redirects the address to a pre-reserved spare sector, and moves the data there. This is called remapping. After remapping, ordinary disk checks in Windows no longer find the bad sector, because its address now points to a spare sector; the bad sector is effectively masked. The number of spare sectors reserved differs between drives. Once the defect table is full or the spare sectors are exhausted, remapping no longer works, bad sectors become visible, and data is lost directly. This is a key parameter for drive life, and the number of remapped sectors also directly affects performance. On some drives the data value is already large while the current value changes little; such a drive still works but should not stay in service. Because the spare sectors are located at the end of the disk (near the spindle), heavy use of spare sectors increases seek time and noticeably degrades performance.
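Remapping amounts to a lookup table from defective addresses to spare ones. The sketch below models that idea only; real firmware keeps the G-list in the drive's reserved area, and the class name, method names, and addresses here are hypothetical.

```python
class RemapTable:
    """Toy defect table (G-list): defective physical address -> spare sector address."""

    def __init__(self, spare_sectors) -> None:
        self.free_spares = list(spare_sectors)
        self.total_spares = len(self.free_spares)
        self.g_list = {}

    def remap(self, bad_addr: int) -> int:
        """Redirect a defective sector to the next free spare sector."""
        if bad_addr in self.g_list:
            return self.g_list[bad_addr]
        if not self.free_spares:
            raise RuntimeError("spare sectors exhausted; remapping is no longer possible")
        self.g_list[bad_addr] = self.free_spares.pop(0)
        return self.g_list[bad_addr]

    def resolve(self, addr: int) -> int:
        """Address actually accessed: the spare if remapped, otherwise the original."""
        return self.g_list.get(addr, addr)

    def spare_remaining_percent(self) -> float:
        return 100.0 * len(self.free_spares) / self.total_spares

if __name__ == "__main__":
    table = RemapTable(spare_sectors=[9000, 9001])
    table.remap(42)
    print(table.resolve(42), table.spare_remaining_percent())
```

Because `resolve` hides the bad sector from the caller, an ordinary surface check sees nothing wrong, which matches the masking effect described above.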

Spindle start-up retries: the data value counts the spindle motor's restart attempts, that is, the number of times the spindle motor failed to reach the rated speed within the specified time after start-up. An increasing value indicates a problem in the motor drive circuit or the mechanical subsystem; an insufficient power supply in the machine can also cause it.

Spindle spin-up time health: spin-up time is the time the spindle motor takes from start to rated speed. The data value shows this time directly, in milliseconds or seconds, so smaller is better. For a healthy drive it is only a reference value: spin-up time varies from one start to the next, and one slightly slower start does not indicate a problem. A spindle motor normally takes roughly 4 to 15 seconds to reach the rated speed; a much longer spin-up time points to a problem in the motor drive circuit or the bearing mechanism. On some drive models the data value of this parameter is always 0; in that case, judge by the current value and the worst value.
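The rules of thumb above can be turned into a simple check. This is a sketch only: the 15-second limit comes from the range quoted above, while the function name, the unit parameter, and the result labels are assumptions.

```python
def classify_spin_up(raw: float, unit: str = "ms", max_normal_s: float = 15.0) -> str:
    """Classify a spin-up time reading given in milliseconds ("ms") or seconds ("s")."""
    seconds = raw / 1000.0 if unit == "ms" else float(raw)
    if seconds == 0:
        # Some models always report 0; fall back to the current and worst values.
        return "not reported"
    return "slow" if seconds > max_normal_s else "normal"

if __name__ == "__main__":
    for reading, unit in [(0, "ms"), (6500, "ms"), (20, "s")]:
        print(reading, unit, classify_spin_up(reading, unit))
```

A single "slow" result is not conclusive, since spin-up time varies between starts; a trend over several readings is more meaningful.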



I will summarize the concrete implementation later and publish it on the blog.


This article is from the "A Dream" blog; please keep this source link: http://yigemeng.blog.51cto.com/8638584/1734250
