How to troubleshoot server failures

Source: Internet
Author: User

This article is divided into three parts. The first part covers the basic principles of server troubleshooting; the second part describes some examples of server hardware troubleshooting; the third part describes some examples of server software troubleshooting.

Part 1: Basic principles of server troubleshooting

First, what to do when the server shows no display at boot

1, check the power supply environment: the neutral-to-live and neutral-to-ground voltages

2, check the power LED: is it lit, and is its state normal?

3, when the power switch is pressed, does the indicator light on the keyboard come on? Do all the fans turn?

4, check whether the monitor has been swapped; try another monitor

5, remove any added memory

6, remove any added CPU

7, remove any added third-party I/O cards

8, check that the memory and CPU are seated reliably

9, clear CMOS

10, replace the main parts, such as the system board, memory and CPU
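
The ten checks above run from least invasive to most invasive, so the order matters. The following is a minimal Python sketch of working through them in sequence. The list name `NO_DISPLAY_CHECKS` and the helper `next_step` are hypothetical, and it assumes a technician records by hand which checks are already done.

```python
# Hypothetical checklist mirroring the ten steps above, least invasive first.
NO_DISPLAY_CHECKS = [
    "check power supply environment",
    "check power LED",
    "check keyboard indicator and fans",
    "try another monitor",
    "remove added memory",
    "remove added CPU",
    "remove added third-party I/O cards",
    "check memory and CPU seating",
    "clear CMOS",
    "replace main parts",
]


def next_step(done):
    """Return the first check not yet in `done`, or None if all are done.

    `done` is any collection of check names the technician has completed
    without restoring the display.
    """
    for name in NO_DISPLAY_CHECKS:
        if name not in done:
            return name
    return None
```

Calling `next_step` with an empty collection returns the first check, which keeps a technician from jumping straight to part replacement.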

Second, what are the basic principles of server troubleshooting?

1, try to restore the system default configuration

A: Hardware configuration: remove spare parts from third-party manufacturers and non-standard accessories

B: Resource configuration: clear CMOS to restore the initial resource configuration

C: BIOS, F/W, driver: upgrade to the latest BIOS, firmware and related drivers

D: TPL: check that any added third-party I/O cards are on the hardware compatibility list (TPL) for this model

2, from basic to complex

A: System, from standalone to networked: first run the failed server on its own, and connect it to the network only after it tests normal; then observe how the fault symptoms change and deal with them

B: Hardware, from the minimal system to the real system: start from the minimum hardware that can run and build up to the real configuration

C: Software, from the basic system to the real system: start from the basic operating system and build up to the real software environment
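
The "from basic to complex" principle can be sketched as a loop that starts from a working minimal system and adds one component at a time. This is only an illustration in Python: `find_breaking_component` and the `boots_ok` callback are hypothetical names, and in practice `boots_ok` stands for a person powering the server on and observing it.

```python
def find_breaking_component(minimal, extras, boots_ok):
    """Add extras one at a time to a working minimal system.

    `boots_ok` is a caller-supplied function (hypothetical) that takes the
    list of installed components and returns True if the server runs
    normally. Returns the first extra whose addition breaks the system,
    or None if the full configuration works.
    """
    installed = list(minimal)
    if not boots_ok(installed):
        raise ValueError("minimal system itself does not run")
    for part in extras:
        installed.append(part)
        if not boots_ok(installed):
            return part
    return None
```

The same loop applies to software: treat the basic operating system as the minimal system and each service or application as an extra.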

3, swap and compare

A: Keeping conditions as identical as possible, swap the parts that are simple to exchange and give a clear result

B: Swap the NOS carrier, that is, swap the software environment

C: Swap hardware, that is, swap the hardware environment

D: Swap the whole machine, that is, swap the overall environment
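
The swap principle in item 3 amounts to substituting known-good parts one at a time until the fault disappears. A minimal Python sketch follows. The names `locate_by_swapping` and `works_after_swap` are hypothetical, and the callback stands for a manual swap-and-test performed by the technician.

```python
def locate_by_swapping(suspects, works_after_swap):
    """Swap suspects one at a time with known-good parts.

    `works_after_swap` is a caller-supplied function (hypothetical): it
    swaps in a known-good replacement for the named part, tests the
    server, restores the original part, and returns True if the fault
    disappeared during the test.
    """
    for part in suspects:
        if works_after_swap(part):
            return part
    return None
```

Ordering `suspects` from most likely to least likely keeps the number of swaps low.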

Third, what information to collect for server troubleshooting

Server information:

1, machine model

2, machine serial number (S/N, for example NC00075534)

3, BIOS version

4, whether other devices have been added, such as network cards, SCSI cards, memory or CPUs

5, how the hard disks are configured: whether an array is set up, and at what array level

6, which operating system and version is installed (Windows NT 4, NetWare, SCO, others)

Fault information:

1, any exception information shown on screen during POST

2, the status of the server's own LEDs

3, alarm sounds and beep codes

4, NOS event record files

5, event log files

Determine the type of failure and the symptom:

1, no display at boot

2, failure during the power-on self-test (POST) phase

3, failure during the installation phase, and its symptoms

4, operating system fails to load

5, failure during system operation
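
The three lists above can be gathered into one record so that nothing is forgotten before a case is escalated. The following Python sketch is one possible layout. The class and field names (`FaultReport`, `FailurePhase`, `missing_fields`) are assumptions made for illustration, not part of any vendor tool.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailurePhase(Enum):
    """The five failure types listed above."""
    NO_DISPLAY = "no display at boot"
    POST = "power-on self-test"
    INSTALL = "installation"
    OS_LOAD = "operating system load"
    RUNNING = "system operation"


@dataclass
class FaultReport:
    # Server information
    model: str
    serial_number: str
    bios_version: str
    operating_system: str
    phase: FailurePhase
    added_devices: list = field(default_factory=list)
    array_level: str = ""  # empty when no array is configured
    # Fault information
    screen_messages: list = field(default_factory=list)
    led_status: str = ""
    beep_codes: str = ""

    def missing_fields(self):
        """Names of required text fields that were left blank."""
        required = ("model", "serial_number", "bios_version", "operating_system")
        return [name for name in required if not getattr(self, name).strip()]
```

A report with a blank BIOS version, for example, is flagged by `missing_fields()` before it is sent on.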

Part 2: Several cases of server hardware fault handling

A hardware failure is an error caused by an abnormality in the server hardware. Because server architecture is complex, you must be very careful when checking. The following uses a foolproof 4500 as an example. (This is only an example. If you meet a similar symptom in practice, analyze the specific problem on its own terms and do not apply the example blindly.)

The 4500 was equipped with 256M of memory and a PIII Xeon 500 processor with 2M cache. There was no display after boot, but the system log reported a CPU voltage of 0 volts, and all three system LEDs were flashing (three flashing LEDs are another form of server alarm, which I will explain later in the text). This error is generally caused by a faulty processor Voltage Regulator Module (VRM), a faulty CPU, or poor contact between the CPU and the CPU board. It may also be a fault in the CPU board itself, in which case the situation is more complex and calls for careful thought. The CPU board occupies a pivotal position in the server: if it fails, the server normally reports a fatal error and the system log records a fatal error, and only in about 5% of cases does it report a CPU voltage error instead. We moved the CPU to another CPU slot right away, and after boot the fault was the same. So, as a preliminary judgment, a bad CPU board could be ruled out.

Next, we removed the CPU and carefully cleaned its gold-finger contacts and the part of the CPU board that makes contact with the CPU. After boot there was still no display.

A VRM fails more often than a processor does. So we took a VRM from another 4500 and installed it in this server. After boot there was still no display, the system log still reported a CPU voltage of 0 volts, and the three system LEDs were still flashing. By now the situation was fairly clear. We took a CPU from another 4500 and installed it, and the server booted normally.

Summary:

When maintaining a server, the clues can be bewildering, and in general it is not possible to pinpoint the problem right away. This calls for confidence and patience. The usual process is to work from the information in the system log; if that does not solve the problem, look for other factors and then check the log information again. In short, a server error must be resolved step by step; there is no shortcut.

Another example:

A foolproof 4200 showed no display at boot. The system log contained no information and the system LEDs were not lit. The initial judgment was a power supply fault. After careful examination, the server's power supply turned out to be normal, so the most likely cause was a failed power management board. After the power management board was replaced, the boot display was normal. But a new problem appeared: during self-test, the hard drive could not be detected with CTRL+M.

The hard drive worked normally in another server, so we cleared this server's CMOS right away, but the problem remained. We then found the latest BIOS for this server on the Internet, but upgrading the BIOS did not solve the problem either. Checking the hard disk cage, the data cable and the power cable turned up nothing, and the error persisted. At this point one would normally suspect the server's I/O board (input/output board). But I then noticed a legacy network card on the I/O board that was not an original part from the server's manufacturer. As soon as this NIC was removed, the server was completely normal.

Hardware failure does not only mean faulty hardware; it also covers incompatibility between hardware components, because the normal operation of a server requires close coordination between its components. We recommend buying original components of the same brand, and using components that match the server's performance (the old card in the example above, even if it had worked, would have seriously degraded the server's performance). This makes faults much less likely.

Here is another case. A user needed to upgrade his 3200 to dual network cards. I recommended that he buy the original network card, but when he saw that the 4500 network card used the Intel 82559 chip, he flatly decided not to use the original card and bought another brand's card based on the Intel 82559 instead. A few days later he called me and said that his new NIC could not use network redundancy or data validation, and that he suspected a server problem. A maintenance engineer brought an original Intel 82559 network card to the user and checked carefully that the server environment was completely normal; once that card was installed, everything worked. This example further illustrates that, to get the maximum performance and full function out of a server, you must use original accessories of the original brand. Accessories that are not original may not support certain server functions and, in serious cases, will affect normal use of the server.

To reduce hardware failures, server administrators must make sure that the server's operating environment is completely normal. Important servers must be kept in a constant temperature and humidity environment. The power must also meet requirements: use a UPS, and also make sure there is a ground wire, that the outlet is wired with neutral on the left and live on the right, and that the neutral-to-ground voltage is between 1 and 3 volts. The server must be powered on and off following the normal procedure, and staff must follow the operating procedure strictly.

Generally speaking, server maintenance personnel with plenty of experience can locate a hardware fault quickly. If you cannot solve the problem, contact the service center promptly at 020-32487454.

Part 3: Approaches to and examples of common server software faults

Software failures make up the largest share of server failures, about 70% of problems, and handling them requires even more care. Software failures have many causes. The most common are a server BIOS version that is too old, bugs in server management software or server drivers, application conflicts, and software failures caused by people. The following examples illustrate how to deal with various software failures.

A foolproof 3500 server was configured with dual PIII 500 CPUs with 512K cache and 512M of memory. After power-on, the system log reported a Voltage Regulator Module (VRM) exception. The message was: "Voltage Regulator Module (VRM) over/under-voltage 2.88v/0v." On the surface this looks most like a failure of the server's voltage regulator module or other hardware, so maintenance personnel could very easily take it for a hardware failure. The maintainer immediately tested with hardware from another 3500 and found that the server still reported the VRM error even with the replacement parts. With no other options left, the maintenance engineer brought the latest firmware for the CPU management board (CPU Management Control) and upgraded the board's firmware. The server returned to normal immediately.
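
When many log entries have to be reviewed, the voltage readings in a message like the one quoted above can be pulled out automatically. The Python sketch below assumes the log line has exactly the shape of that single quoted message. The pattern and the function name `parse_vrm_message` are illustrative, and real logs may be formatted differently.

```python
import re

# Pattern assumed from the single VRM message quoted in the text.
VRM_PATTERN = re.compile(
    r"Voltage Regulator Module \(VRM\) over/under-voltage "
    r"(?P<first>\d+(?:\.\d+)?)v/(?P<second>\d+(?:\.\d+)?)v",
    re.IGNORECASE,
)


def parse_vrm_message(line):
    """Return the two voltage readings as floats, or None if no match."""
    match = VRM_PATTERN.search(line)
    if match is None:
        return None
    return float(match.group("first")), float(match.group("second"))
```

A 0 volt reading that persists after known-good hardware has been swapped in, as in this case, points toward firmware rather than the parts themselves.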

The firmware upgrade method is as follows:

1, boot the computer from a floppy disk, then insert the firmware floppy disk and run the relevant Cabrillo file on it

2, the system refreshes the BMC (baseboard management controller) and HSC (hot-swap backplane controller)

3, the system then asks which option to perform (usually 2)

4, the system then asks about the server's power supply configuration (usually 2)

5, if you answer that there are two power supplies, the system asks whether the server has a secondary fan, that is, whether a fan is fitted in the position of the third (redundant) power supply (usually N)

6, the system then asks whether you want to rewrite the BMC kernel area (usually N), and then whether you want to enter an asset tag (usually N)

7, finally, the system asks whether you want to reboot after the refresh (usually Y)

This upgrade method also works for refreshing the system BIOS and so on. Only the command parameters and the firmware and BIOS file names differ; please refer to the server's instructions.

Firmware and BIOS on any server will have bugs of one kind or another, because bugs are unavoidable. So we should not assume that a server's BIOS is perfect, and should keep the server's firmware and BIOS up to date. Be cautious before upgrading, though: upgrading the wrong way can have serious consequences.

Today's popular high-end servers come with powerful management programs that give customers a convenient way to manage them. Servers also come with drivers for a variety of operating systems, so that customers can use them under whichever operating system they choose. Any program will have some bugs that affect users. Server vendors, however, release new versions promptly, and customers only need to update these programs in a timely manner to avoid such failures.

When a server software failure is of this kind, the symptoms vary. In general, bugs in a management program can make the system slow down, push CPU usage up, or prevent certain functions from working normally. Driver bugs can cause crashes, conflicts with certain software, and unstable disk operation. The best way to tell whether a management program is at fault is to disable the management tool in the system first and then see whether the server is still abnormal. Because the management tool starts along with the system, its automatic start must be prevented first. For example, in Windows NT4, first disable the server software's services under Services in Administrative Tools, and then modify the startup items in the registry. If a driver is suspected, start the system in safe mode and see whether it behaves normally. Note, however, that it is normal for the system to be slower in safe mode (especially for disk I/O).

Server maintenance personnel should regularly download the latest management tools and drivers from the server vendor's web site. This prevents a large share of software failures.

By contrast, faults caused by software conflicts are harder to diagnose, and require administrators with fairly rich experience and keen observation.

A user once reported that he could not install SQL Server 2000 on a foolproof server. He had reinstalled NT n times without finding the fault. This was also his only server and was to be a very important database server, so he was very worried. A maintenance engineer went to his company to check. The server sat in a well-equipped, standard machine room. The engineer checked the server, found no hardware failure, and then ruled out a faulty CD-ROM drive. However, the user's burned copy of the SQL Server 2000 CD aroused the engineer's suspicion, so the engineer asked him to produce a genuine SQL Server installation disc. The result was the same. The installation itself showed no errors, but at runtime the program quit automatically without any message. I then found an entry in the system log of Event Viewer in Administrative Tools: Windata.exe caused an invalid data overflow. Windata was a program the user had written himself, and it started with the operating system. I ended this process right away, and SQL then ran normally.

For software failures like this, it is best to check the log for any suspicious processes in the system. Current servers, whether high-end or low-end, support standard programs such as SQL quite reliably, so the key to resolving the failure is to end the suspect process.
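
Checking a log for suspicious processes can be partly automated once the log has been exported to text. The Python sketch below assumes entries shaped like the Windata.exe message in the case above ("<name>.exe caused <description>"). The pattern and the function name `suspicious_processes` are illustrative assumptions, not an Event Viewer API.

```python
import re
from collections import Counter

# Assumed entry shape: "<name>.exe caused <description>".
ENTRY = re.compile(r"(?P<exe>[\w.-]+\.exe)\s+caused\s+(?P<what>.+)", re.IGNORECASE)


def suspicious_processes(log_lines):
    """Count log entries per executable reported as causing an error."""
    counts = Counter()
    for line in log_lines:
        match = ENTRY.search(line)
        if match:
            # Lower-case the name so "Windata.exe" and "windata.exe" count together.
            counts[match.group("exe").lower()] += 1
    return counts
```

Executables that appear repeatedly, especially ones that start with the operating system, are the first candidates to end and retest.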

There are also software failures caused by human factors. These generally come from operator error (including not following the operating procedure), unexpected shutdown (including sudden loss of power), or abnormal termination of applications.

Failures caused by operator error can be avoided by tightening control over how people operate the server. The following describes how to handle failures caused by unexpected shutdown or abnormal termination of programs.

Shutting down the system and its programs properly is very important, especially on a Web server. One user suffered data corruption and even data loss because the system was not shut down properly.
