What should I do if the server is down?

Last Update: 2013-11-29 Source: Internet

Author: User

"Going down" (a crash) is a term commonly used by IT people. In most computing contexts it means the machine has hung or died. Crashes are a headache for both IT management and applications.

For ordinary home use, the worst outcome is usually that some data cannot be fully recovered. When a server goes down, however, files and important data may be lost, which is a far greater pity.

The usual symptoms of a crash are: the user interface stops responding or shows a "blue screen", the operating system does not respond, applications do not respond, the mouse or keyboard does not respond, and the hard disk indicator stays lit without flashing. Although crashes have many causes, they always trace back to hardware or software (including the host system, the operating system, and application software).

In this article, I will start from the reasons for crashes, try to analyze their causes systematically and comprehensively, and give the solutions.

1. Server crashes caused by hardware faults

Hardware-related crashes always come down to these major components: CPU, memory, hard disk, power supply, and cooling system. The most common hardware cause is a cooling system fault.

1) Poor heat dissipation

[Figure: multi-fan centralized cooling design]

Poor heat dissipation is the most common cause of server crashes. The CPU, hard disks, and power supply all run hot, so good ventilation is essential. The CPU is like the human brain: it has to process requests from all the hardware and software in the server concurrently. When the number of concurrent requests suddenly rises, the CPU heats up much as a brain "overheats" when thinking hard. Hard disk I/O throughput also approaches its limit, which raises power draw and therefore heat. The higher power draw in turn puts heavy pressure on the power supply, which also heats up. When the workload exceeds what the server can carry, these three heat producers all spike within a short period, and the server may crash.
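As a rough illustration of watching the three heat producers named above, the sketch below compares temperature readings with alert limits. The limit values, the component names, and the `overheating` helper are all hypothetical; real limits come from the hardware vendor, and real readings would come from whatever monitoring interface the server exposes.

```python
# Hypothetical alert limits in degrees Celsius (assumed values, not vendor data).
LIMITS_C = {"cpu": 85.0, "disk": 55.0, "psu": 70.0}


def overheating(readings_c, limits_c=LIMITS_C):
    """Return the sorted names of components at or above their limit."""
    return sorted(
        name
        for name, temp in readings_c.items()
        if name in limits_c and temp >= limits_c[name]
    )


# Sample readings, invented for illustration only.
idle = overheating({"cpu": 48.0, "disk": 36.0, "psu": 41.0})
peak = overheating({"cpu": 91.5, "disk": 57.0, "psu": 66.0})
```

With these sample numbers, `idle` is empty and `peak` lists the CPU and the disk, which is the kind of signal an operator could use to shed load before the machine goes down.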

[Figure: centralized cooling with side airflow across the hard disks]

In addition, in servers that store and serve video or graphics, the graphics card and display hardware are also major heat sources. If the heat dissipation design is poor, a burst of requests can also bring the machine down.

The solution is to choose a CPU with lower heat output when purchasing a server, to design the system so that it can balance load dynamically, and to choose a server barebone system with good heat dissipation.
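The article does not say how the load balancing should work. One simple possibility, shown here only as a sketch, is "least-loaded" dispatch: each new request goes to the server currently handling the fewest requests. The `Server` class, the node names, and the request counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    active_requests: int = 0  # requests currently being handled


def pick_least_loaded(servers):
    """Return the server with the fewest active requests (first one wins ties)."""
    return min(servers, key=lambda s: s.active_requests)


def dispatch(servers, request_count):
    """Assign requests one at a time, always to the least-loaded server."""
    for _ in range(request_count):
        target = pick_least_loaded(servers)
        target.active_requests += 1
    return {s.name: s.active_requests for s in servers}


# Hypothetical pool with an uneven starting load.
pool = [Server("node-a", 5), Server("node-b", 1), Server("node-c", 3)]
result = dispatch(pool, 6)
```

Starting from loads of 5, 1, and 3, six new requests leave every node at 5, so no single machine absorbs the whole burst. A production balancer would also need to decrement the count when requests finish and to skip unhealthy nodes.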

2) Hardware or hardware/software incompatibility

Between hardware components: if the motherboard, CPU, and memory are mismatched in internal or external frequency, the server may still run normally right after assembly, when concurrency is low. Once concurrency rises to a certain level, the instability caused by the mismatch shows up, and the likelihood of a crash increases.

Even with the support of the On-Demand System, you must consider the hardware compatibility.

Compatibility problems can also arise between hardware and software, for example between the hardware and image-processing software. If the two are not compatible, the whole system runs unstably, and the probability of a crash is also high.

Incompatibility between server components is mostly found on DIY servers. Hardware/software incompatibility mainly comes from an imperfect fit between the hardware and the applications. The solution to both problems is to purchase hardware according to the specific system you plan to deploy, and to consider compatibility between new and existing hardware, among the new components themselves, and between software and hardware, so that the resulting system is stable.

3) CPU failure

CPU-related crashes mainly involve the compatibility issues mentioned above, instability caused by overclocking, and instability caused by unscrupulous sellers rewriting the CPU frequency in software to make more profit.

[Figure: "CPU: I'm fine, don't touch me"]

Overclocking and software frequency rewriting are essentially the same thing; what differs is who does it. One is done by server enthusiasts, the other by server component resellers.

Changing the frequency makes the CPU unstable. Crashes from this cause are relatively rare and occur mainly in the DIY market. The fix is simple: a server needs to run stably, so unless you have a specific need and the expertise, do not change the frequency.

4) Memory failure

Memory-related crashes have these main causes: compatibility issues, loose memory modules, insufficient capacity, memory quality problems, and memory resource conflicts.

Loose memory modules are essentially never seen in brand-name servers, because those are thoroughly tested by professional technicians before leaving the factory. Looseness mainly occurs in DIY servers or when an operator upgrades a brand-name server.

[Figure caption: with plenty of memory, the server runs comfortably.]

Insufficient memory capacity arises mainly when the server handles too many concurrent tasks and they consume too much memory. The server can then no longer respond, and it crashes.

Memory quality problems are mainly caused by chips that were already faulty when they left the chip factory, or by cold solder joints introduced when the vendor assembled the module.

Memory resource conflicts occur mainly while the operating system or applications are running: system threads contend for resources, or applications compete for the same memory addresses, and the server crashes.

The solution is for purchasers and operators to take a rigorous approach and check every piece of hardware carefully during assembly, upgrades, and testing. For memory resource conflicts, you can provision redundant memory and free up memory before peak concurrency.
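To decide whether memory needs freeing before a peak, an operator first has to know how much headroom is left. The sketch below assumes a Linux server, where `/proc/meminfo` reports `MemTotal` and `MemAvailable` in kB; the 10% threshold and the sample figures are invented for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Field: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_is_low(meminfo, min_available_fraction=0.10):
    """True when MemAvailable is below the given fraction of MemTotal."""
    return meminfo["MemAvailable"] < meminfo["MemTotal"] * min_available_fraction


# Sample text in the /proc/meminfo layout (made-up numbers).
SAMPLE = """MemTotal:       16384000 kB
MemFree:          512000 kB
MemAvailable:     819200 kB
"""

sample_info = parse_meminfo(SAMPLE)
sample_low = memory_is_low(sample_info)
```

On a real Linux host the same functions could be fed the contents of `/proc/meminfo` instead of `SAMPLE`. In the sample, only about 5% of memory is available, so `sample_low` is true and the operator would know to act before the concurrency peak.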

5) Hard disk faults

Hard disk failures are mainly caused by damaged tracks and sectors after long use and heavy read/write activity. Other causes include aging of the disk's components and excessive disk fragmentation and junk files.

In some well-resourced companies, the disks of production servers are replaced every two or three years: the data is migrated to new disks, and the old disks are reused for testing or office backups, which minimizes the risk of disk failure. Taking your budget and other factors into account, try to replace a disk before it fails, to avoid losing important data.

Disk fragments and junk files are generated every minute the server runs. When they accumulate, too little free space remains, and the server may crash when several programs run at once. The solution is to clean up junk files and defragment the disk regularly.
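A regular cleanup is easier to schedule when something reports that free space is running short. The sketch below uses Python's standard `shutil.disk_usage` to compare free space with a threshold; the 15% threshold and the function names are assumptions, and the check says nothing about fragmentation itself.

```python
import shutil


def free_fraction(total_bytes, free_bytes):
    """Fraction of the disk that is free; 0.0 if the total is not positive."""
    if total_bytes <= 0:
        return 0.0
    return free_bytes / total_bytes


def disk_needs_cleanup(path=".", min_free_fraction=0.15):
    """True when the filesystem holding `path` has less free space than the threshold."""
    usage = shutil.disk_usage(path)
    return free_fraction(usage.total, usage.free) < min_free_fraction
```

Calling `disk_needs_cleanup("/var")` from a scheduled job, for example, would give an early warning so that junk files can be removed before several programs compete for the last of the space.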

6) Power supply faults

Power supply failures are mainly caused by a failed fan or by damaged electronic components or wiring.

Currently, many server manufacturers on the market use HIPRO power supplies in bulk.

Because the cause is a fan, component, or wiring fault, and such faults strike at random, there is little you can do to prevent them beyond keeping dust out. In most cases you can only swap in a spare power supply when the server goes down, to keep the downtime as short as possible.

7) Improper operation

Data center space is generally used as densely as possible. Suppose you need to upgrade the hardware of a server in a cabinet, and several rack servers are stacked on top of it. To avoid interrupting the servers above, two or three operators may have to lift them together while the server to be upgraded is pulled out. This looks simple, but operators without experience in moving machines can shake the servers above enough that their hard drives lose good contact with the bus, and those servers crash.

In addition, a crash caused by a motherboard failure is basically like one caused by a power supply failure, so the troubleshooting method is not repeated here.

2. Software-related crashes

The software issues behind a crash are more complicated and involve the host system, the operating system, and application software.

Host system failure

1) Unreasonable CMOS parameter settings

Unreasonable CMOS parameter settings are the most common cause of host system faults.

Because specific application rules are involved

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
