Example of server hardware failure handling

Source: Internet
Author: User

A server hardware failure is any error caused by faulty server hardware. Because a server's structure is complex, you must be careful and methodical when checking it. The following example uses an HP LH6000.

We had an HP LH6000 with 256 MB of RAM and a PIII Xeon 700 processor with 2 MB of cache. After power-on there was no display, but the system log reported a CPU voltage of 0 volts, and all three system LEDs were flashing (three flashing LEDs are another way the server raises an alarm, which I explain later in this article). This error usually means a faulty processor Voltage Regulator Module (VRM), a faulty CPU, or poor contact between the CPU and the CPU board. It can also mean that the CPU board itself is faulty, which makes the situation more complex and calls for careful thought. The CPU board occupies a pivotal position in the server: if it fails, the server reports a fatal error and the system log records a fatal error, but only about 5% of such failures are reported as a CPU voltage error. We first moved the CPU to another CPU slot; after booting, the fault was unchanged. As a preliminary judgment, we could therefore rule out a bad CPU board.

Next, we removed the CPU and carefully cleaned its gold edge contacts and the contacts on the CPU board where the CPU seats. After booting, there was still no display.

A processor Voltage Regulator Module (VRM) fails more often than a processor does, so we took the VRM from another LH6000 and installed it in this server. After booting, there was still no display, the system log still reported a CPU voltage of 0 volts, and the three system LEDs were still flashing. That made the situation clearer. We then took a CPU from the other LH6000 and installed it, and the server booted normally.
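The elimination sequence in this example can be written down as a short sketch. The Python below is only an illustration and not an HP tool: the `SwapTest` record and the `isolate` function are hypothetical names, and the steps are transcribed from the narrative above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SwapTest:
    """One troubleshooting step and whether the fault was still present afterwards."""
    suspect: str          # component or connection being tested
    action: str           # what was done (free-text label, for the record only)
    fault_persists: bool  # True if the server still failed after the step


def isolate(tests: List[SwapTest]) -> Tuple[List[str], Optional[str]]:
    """Walk the steps in order; return (ruled-out suspects, culprit or None)."""
    ruled_out: List[str] = []
    for test in tests:
        if test.fault_persists:
            # The step changed nothing, so this suspect is provisionally cleared.
            ruled_out.append(test.suspect)
        else:
            # The fault disappeared after this step, so this suspect is the culprit.
            return ruled_out, test.suspect
    return ruled_out, None


# Steps from the LH6000 example, in the order they were performed.
lh6000_steps = [
    SwapTest("CPU board", "move CPU to another CPU slot", True),
    SwapTest("CPU contact", "clean CPU and board contacts", True),
    SwapTest("VRM", "install VRM from another LH6000", True),
    SwapTest("CPU", "install CPU from another LH6000", False),
]

ruled_out, culprit = isolate(lh6000_steps)
print("Ruled out:", ", ".join(ruled_out))
print("Faulty component:", culprit)
```

Recording each step with its outcome keeps the reasoning explicit: a suspect is only cleared provisionally, and the list shows what has already been tried if the fault comes back.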

When maintaining a server, the clues can be bewildering, and it is generally not possible to pinpoint the fault right away. This requires confidence and patience from the people involved. The usual approach is to work from the information in the system log; if that does not solve the problem, look for other factors and then check the log again. In short, a server fault must be resolved step by step; there is no shortcut.

Another example:

An HP LH 4 showed no display at boot. The system log contained no information, and the system LEDs were not lit. The initial judgment was a power supply fault. Careful examination showed that the server's power supply was normal, so the most likely cause was a failed power management board. After the power management board was replaced, the server booted and displayed normally. But a new problem then appeared: during the self-test, the hard drive could not be detected with CTRL+M.

The hard drive worked normally in another server, so I cleared this server's CMOS, but the problem remained. I then went online to find the latest BIOS for this server, and upgrading the BIOS did not fix the problem either. I also checked the hard disk cage, the data cable, and the power cable in the server, and the error persisted. At this point one would normally suspect the server's I/O board (input/output board). However, I noticed an old non-HP network card installed on the I/O board. As soon as this NIC was removed, the server worked normally.

A hardware failure is not only a defect in a piece of hardware; it can also be an incompatibility between pieces of hardware, because a server runs normally only when its components work closely together. We recommend buying original components of the same brand as the server, and using components that match the server's performance (the old card in the example above would seriously affect the server's performance even if it worked normally), to avoid faults like this one.

In one case, a user needed to upgrade his HP LH6000 to dual NICs. I advised him to buy the original network card, but when he saw that the HP LH6000 network card used the Intel 82559 chip, he decided against the original card and bought another brand's network card based on the Intel 82559. A few days later he called to say that the new NIC could not use network redundancy or data validation, and that he suspected a server problem. I brought an HP NIC to the user and checked the server environment carefully; it was completely normal. Once the HP NIC was installed, everything worked. This example further illustrates that getting the most performance and functionality from a server requires original accessories of the original brand. Accessories that are not original may not support certain server functions and, in serious cases, can interfere with normal use of the server.

Generally speaking, the alarm system on a high-end server is fairly complete: in addition to the system log, there are indicator LEDs. On the HP LH6000, a steady green light indicates that the server is normal; the green and yellow lights flashing indicate a fault that is not fatal; and all three lights (green, yellow, and red) flashing indicate a fatal failure that has stopped the server. The LEDs can only hint at the general kind of failure, whereas the system log is more complete. During maintenance, examine the information from both alarm systems carefully. Note that the system log is stored in memory with limited capacity (the LH 6000 can store 200 messages). When it is full it must be cleared; otherwise the server raises an alert, usually a non-fatal error on the LEDs, and cannot save any further information.
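The LED patterns and the log limit described above can be summarized in a small lookup. The following Python is a sketch under stated assumptions, not an HP interface: the names `ServerState`, `decode_leds`, and `log_is_full` are hypothetical, each LED is modeled as one of the strings 'off', 'steady', or 'flashing', and the non-fatal pattern is assumed to be a flashing yellow light with the green light on (steady or flashing) and the red light off.

```python
from enum import Enum

LOG_CAPACITY = 200  # messages the LH 6000 system log can hold, per the text


class ServerState(Enum):
    NORMAL = "normal"
    NON_FATAL = "fault, not fatal"
    FATAL = "fatal failure, server stopped"
    UNKNOWN = "pattern not described in the text"


def decode_leds(green: str, yellow: str, red: str) -> ServerState:
    """Map LED states ('off', 'steady', 'flashing') to a server state."""
    if (green, yellow, red) == ("flashing", "flashing", "flashing"):
        return ServerState.FATAL
    # Assumption: green may be steady or flashing in the non-fatal pattern.
    if green != "off" and yellow == "flashing" and red == "off":
        return ServerState.NON_FATAL
    if (green, yellow, red) == ("steady", "off", "off"):
        return ServerState.NORMAL
    return ServerState.UNKNOWN


def log_is_full(message_count: int, capacity: int = LOG_CAPACITY) -> bool:
    """True when the log has no room left and should be cleared."""
    return message_count >= capacity


print(decode_leds("flashing", "flashing", "flashing").value)
print(log_is_full(200))
```

Any pattern the text does not describe, such as all LEDs off in the LH 4 example, maps to `UNKNOWN` here so that it is investigated instead of being treated as normal.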

To reduce the frequency of hardware failures, server administrators must make sure that the server's operating environment is completely normal. Important servers should run in a constant temperature and humidity environment, and the power supply must also meet requirements: use a UPS, and also make sure there is a ground wire, with the outlet wired neutral on the left and live on the right and the neutral-line voltage between 1 and 3 volts. The server must be powered on and off following the normal procedure, and staff must follow operating procedures strictly.
