Understand the Common Faults of Three Major Hardware Devices on the x86 Server Platform

Last Update:2013-12-18 Source: Internet

Author: User

Intel and AMD, the two chip giants, have driven the x86 server platform forward, from the earlier Xeon 5400 to the mainstream Xeon 5600 and Xeon 7500, as well as AMD's powerful 12-core x86 processor, Magny-Cours. Beyond the CPU, the server's other two core components should not be underestimated: memory with ECC, Chipkill, and hot-swap technology, and RAID-protected hard disks that prevent data loss. Together, these components make for a rock-solid x86 server.

However, because x86 servers have much in common with desktops, their early deployment, mid-term maintenance, and later management are also similar. As a result, even though the x86 server has a mature and stable architecture, it can still "go on strike". Faults are especially common in enterprises that run heavily loaded applications. This article shares typical faults of the three major components so that you can keep them from affecting your business platform in the future.

Server core: CPU

Hazard level:★

Fault replay: Anyone who has done server testing may have seen this: an Intel Xeon-based server shows nothing on the display at startup, and the system indicator lights flash wildly. The most obvious suspect is poor contact between the CPU and the motherboard. However, moving the CPU to another socket on the multi-socket motherboard makes no difference, and the server still does not respond.

Solution: In this situation, the CPU voltage is abnormal. The cause turned out to be a failed CPU VRM (Voltage Regulator Module): the DC-to-DC conversion on the motherboard could not take place, so the CPU could not be supplied with a stable operating voltage. At this point, the only option is to replace the CPU.

This kind of fault is fatal: CPU damage makes the entire server unavailable. However, the CPU itself is very reliable and its failure rate is extremely low. In daily maintenance, service interruptions caused by CPU damage are therefore rare, and the hazard level is not high. On a multi-socket server, there is even less need to worry about downtime caused by a damaged CPU.

The other two core components of the server platform are memory and hard disks. Server memory differs from ordinary desktop memory. If you look closely at a server memory module, you will find that, whereas ordinary memory typically has eight chips on one side, server memory usually has nine. This is what we commonly call ECC memory.

Server read performance: memory

Hazard level:★★☆

Fault replay: A server with two 2 GB memory modules installed was handling so many services that it processed data more and more slowly, so it was upgraded by adding two memory modules of the same type. After all the modules were inserted into the motherboard, the system detected only 6 GB of memory. The other 2 GB had mysteriously disappeared, and the new module still could not be detected after being reseated repeatedly.

Solution: The answer can be found on the server vendor's official website: the memory slots on this server are paired as 1-4, 2-5, 3-6, 7-10, 8-11, and 9-12. The new modules had been inserted into slots 2 and 3, which do not form a pair, so only one of them was detected. Once one module was moved to slot 5, the full 8 GB was detected without any problem.
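The pairing rule can be checked before powering the server on. The following is a minimal Python sketch, not a vendor tool: the pair table is the one quoted in the article for this particular server, the function name unpaired_slots is hypothetical, and the assumption that the two original modules sat in slots 1 and 4 is ours, since the article does not say.

```python
# Slot pairs quoted in the article for this particular server model.
SLOT_PAIRS = [(1, 4), (2, 5), (3, 6), (7, 10), (8, 11), (9, 12)]

# Map every slot to its partner so lookups work in both directions.
PARTNER = {}
for a, b in SLOT_PAIRS:
    PARTNER[a] = b
    PARTNER[b] = a


def unpaired_slots(populated):
    """Return the populated slots whose partner slot is empty, sorted."""
    populated = set(populated)
    unknown = populated - PARTNER.keys()
    if unknown:
        raise ValueError(f"unknown slot numbers: {sorted(unknown)}")
    return sorted(s for s in populated if PARTNER[s] not in populated)


# Assumption: the original modules were in slots 1 and 4.
print(unpaired_slots([1, 4, 2, 3]))  # [2, 3]  (the faulty layout)
print(unpaired_slots([1, 4, 2, 5]))  # []      (the corrected layout)
```

An empty result means every installed module has a partner; any slot numbers returned point to modules that need to be moved.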

As this shows, the advantages of server memory lie not only in performance. A great deal of effort also goes into fault tolerance in order to provide a stable environment for the whole platform. The ECC (Error Checking and Correction) technology mentioned above, along with registered memory and Chipkill, is designed to improve memory stability and to make memory modules and slots work together better.

As the server's storage component, the hard disk determines how safe enterprise data is. The server hard disk is the core data warehouse: all software and data are stored there, so server hard disks must be highly reliable and stable.

In addition, a server generally runs 24x7, so its hard disks must also run around the clock, which again places high demands on stability and reliability. Three types of hard disks are used in the server market: SATA, SCSI, and SAS. SATA hard disks are mainly used in low-end servers, while SCSI and SAS hard disks are aimed at high-end servers.

Server storage core: hard disk

Hazard level:★★☆

Fault replay: A server crashes and restarts without warning. When this happens frequently, the data center's IT O&M personnel investigate and find that the hard disk has been in service for too long and has developed physical bad sectors. Backing up the data and replacing the hard disk immediately is therefore the best course of action. However, while the data is being exported from the hard disk, I/O errors keep popping up during the transfer, which makes the transfer very slow and causes a lot of important data to be lost.

Solution: In this case, the fault usually lies in the head or the platter. If the platter is scratched but the damaged area is small, a professional data recovery company can replace the head and recover more than 95% of the data. That is a relatively lucky outcome.

However, as the saying goes, faults are best prevented before they occur. If a fault is discovered in time, it should be resolved before the disk suffers further physical damage, because once the disk is seriously damaged, the data is lost permanently. To avoid this situation, we recommend the following:

When selecting hard disks, choose professional server-grade hard disks: for example, a mean time between failures (MTBF) above 1,600,000 hours, an annual failure rate below 0.55%, and, for shock resistance, the ability to withstand more than 300 G over 2 ms. In addition, use the server's RAID technology. RAID 5, for example, requires at least three hard disks and writes parity information along with the data. When one hard disk fails, its data can be computed from the other two hard disks of a three-disk array, which greatly improves data security.
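The RAID 5 recovery described above rests on XOR parity. The following is a minimal Python sketch of the idea for a single stripe on a three-disk array, assuming two data blocks and one parity block; the function names and sample data are hypothetical, and it is an illustration rather than how any particular RAID controller is implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Recover the lost block of a stripe from all surviving blocks."""
    return xor_blocks(surviving_blocks)


# One stripe: two data blocks plus the parity block written alongside them.
d0 = b"SERVER"
d1 = b"BACKUP"
parity = xor_blocks([d0, d1])

# Suppose the disk holding d1 fails: XOR the two survivors to rebuild it.
recovered = rebuild_missing([d0, parity])
print(recovered == d1)  # True
```

Real RAID 5 arrays rotate the parity block across all member disks from stripe to stripe, but the recovery arithmetic for each stripe is the same XOR shown here. The scheme tolerates the loss of only one disk at a time.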

The faults of the three components above are only a brief introduction. In practice, server faults are not limited to these; similar problems occur in power supplies, management modules, and NICs. We hope that you will build up experience in real-world use, minimize the incidence of failures, and provide a stable and flexible IT application environment.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
