Network management experience: resolving common server software faults

Source: Internet
Author: User
Tags: safe mode, system log, hosting, web hosting

Software failures make up the largest share of server failures, about 70% of all problems, and resolving them calls for particular care. They have many causes; the most common are an outdated server BIOS, bugs in the server management software or server drivers, application conflicts, and human error. The following examples illustrate how to handle each kind of software failure.

In one case, an HP LH6000R server reported a voltage regulator module (VRM) error in the system log after booting. The message was: "Voltage Regulator Module (VRM) over/under-voltage 2.88v/0v." On the surface this points to a fault in the server's voltage regulator module or other hardware, so maintenance staff can easily mistake it for a hardware failure.

The maintenance engineer immediately tested with parts taken from another LH6000R and found that the server still reported the VRM error even with the replacement parts. With no other options left, the engineer obtained the latest firmware for the CPU management board (CPU Management Control, CMC) and upgraded it, and the server returned to normal immediately.

The firmware upgrade works as follows. Extract the CPU management board (CMC) firmware flashing program, FLASH.EXE, from the server's Navigator CD. Download Lh6kc.bin (the CPU management board firmware) from the web, copy both files to a DOS boot disk, and boot the server from that disk. Then run "FLASH/CMC a:lh6kc.BIN" in DOS and restart the server once the flash completes. The same method also works for flashing the system BIOS and similar components, but the parameters of the FLASH command and the names of the firmware and BIOS files differ, so refer to the server documentation.

Every server's firmware and BIOS contains bugs of one kind or another, because bugs are unavoidable. Do not assume that the server's BIOS is perfect; keep the firmware and BIOS up to date. Be cautious before upgrading, though, because a wrong upgrade procedure can have serious consequences.
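One precaution before flashing is to confirm that the downloaded firmware file is intact. The sketch below is a minimal illustration, not part of the original procedure: it assumes the vendor publishes a SHA-256 checksum for the file, which may not be the case for every server model, and the file used in the demo is a stand-in created on the spot.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def firmware_matches(path, expected_hex):
    """Compare a file's digest with a published checksum (assumed to exist)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Stand-in for a downloaded firmware image; the content is hypothetical.
    with tempfile.NamedTemporaryFile(delete=False) as demo:
        demo.write(b"demo firmware image")
        demo_path = demo.name
    published = hashlib.sha256(b"demo firmware image").hexdigest()
    print(firmware_matches(demo_path, published))
    os.remove(demo_path)
```

If the digests differ, download the file again instead of flashing it.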

Today's popular high-end servers ship with powerful management programs that give customers a convenient way to manage them. They also come with drivers for a variety of operating systems, so that customers can run whichever system they need. However, every program has some bugs that affect its users. Server vendors usually release updated versions promptly, so customers can avoid such failures simply by keeping these programs up to date.

Software failures of this class show different symptoms. In general, bugs in management software can slow the system down, raise CPU usage, or make certain functions unusable. Driver bugs can cause crashes, conflicts with certain software, and unstable disk operation. The best way to tell whether a management program is at fault is to disable the management tool and then see whether the server still behaves abnormally.

Because management tools start along with the system, you first have to keep them from starting. In Windows NT 4, for example, first disable the server software's services in the Services administrative tool, then modify the startup entries in the registry. If a driver is suspected, boot the system in safe mode and see whether it behaves normally. Note, however, that it is normal for the system to run more slowly in safe mode, especially for disk I/O.
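The isolation approach described above amounts to disabling one suspect component at a time and checking whether the fault disappears. The following sketch only illustrates that logic. The component names and the check function are hypothetical; on a real server each check means disabling the service, rebooting, and observing the machine by hand.

```python
from typing import Callable, Iterable, Optional


def find_faulty_component(
    components: Iterable[str],
    is_healthy_without: Callable[[str], bool],
) -> Optional[str]:
    """Return the first component whose removal makes the server healthy.

    `is_healthy_without(name)` stands for: disable `name`, restart,
    and report whether the server now behaves normally.
    """
    for name in components:
        if is_healthy_without(name):
            return name
    return None  # none of the suspects explains the fault


# Simulated check: pretend the fault disappears only when the
# (hypothetical) "vendor-agent" service is disabled.
def simulated_check(disabled: str) -> bool:
    return disabled == "vendor-agent"


suspects = ["snmp-helper", "vendor-agent", "raid-monitor"]
culprit = find_faulty_component(suspects, simulated_check)
print(culprit)  # prints "vendor-agent"
```

Testing one component at a time is slow but leaves no doubt about which one caused the fault. Re-enable each component that turns out to be innocent before moving on to the next.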

Server administrators should regularly download the latest management tools and drivers from the server vendor's website. Doing so prevents a large share of software failures.

By comparison, faults caused by software conflicts are harder to diagnose. They require an administrator with considerable experience and a keen eye.

A friend once told me that one of his servers could not install SQL Server 2000, and that he had already reinstalled NT several times to rule out a system fault. It was his only server and was meant to be a very important database server, so he was quite worried. I went with him to his company to take a look. The server was housed in a well-equipped machine room built to standard. I checked the server, found no hardware fault, and then ruled out the possibility that the optical drive was reading discs poorly.

However, my friend's burned copy of the SQL Server 2000 CD aroused my suspicion, so I had him bring out a genuine SQL Server installation disc. The result was the same. The installation itself reported no errors, but SQL Server quit on its own at run time without any message. I then found a message in the system log in Event Viewer under Administrative Tools: Windata.exe caused an invalid data overflow. Windata was a program my friend had written, and it started along with the operating system. I ended the process immediately, and SQL Server then ran normally.

For software failures like this, the operator should start by checking the logs for suspicious processes. Current servers, whether high-end or low-end, support standard programs such as SQL Server quite reliably, so the key to resolving the fault is to end the suspect process.
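Checking a log for suspicious processes can be partly automated by counting error entries per process. The sketch below assumes a simple exported text log with the layout "date time LEVEL process: message". That layout, the sample lines, and the trusted list are hypothetical and would need adapting to the real log export.

```python
from collections import Counter


def suspicious_sources(log_lines, trusted=frozenset()):
    """Count ERROR entries per process, skipping trusted process names.

    Returns (process, count) pairs, most frequent first.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split(None, 3)  # date, time, level, remainder
        if len(parts) < 4:
            continue  # not in the assumed layout
        level, remainder = parts[2], parts[3]
        if level != "ERROR":
            continue
        source = remainder.split(":", 1)[0].strip().lower()
        if source not in trusted:
            counts[source] += 1
    return counts.most_common()


# Hypothetical sample entries in the assumed layout.
sample_log = [
    "2001-03-02 10:15:01 ERROR Windata.exe: invalid data overflow",
    "2001-03-02 10:15:03 INFO  sqlservr.exe: service starting",
    "2001-03-02 10:16:40 ERROR Windata.exe: invalid data overflow",
    "2001-03-02 10:17:02 ERROR spooler.exe: printer offline",
]

print(suspicious_sources(sample_log, trusted={"spooler.exe"}))
```

A process that appears repeatedly and is not part of the standard software set, as Windata.exe was in the case above, is a good first candidate to stop and retest.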

Another class of software failure is caused by human factors: operator error (including failing to follow the operating procedure), unexpected shutdowns (including sudden loss of power), and applications that are closed abnormally.

Failures caused by operator error can be avoided by tightening operating discipline. The rest of this section describes how to deal with failures caused by unexpected shutdowns or abnormally terminated programs.

Shutting down the system properly is important, especially on a web server. A friend of mine suffered data corruption and even data loss because he did not shut down his system properly. He was using an HP Web Hosting Server Appliance, so I gave him some usage rules.

These rules are very effective for server maintenance. They cover proper shutdown procedures, how to avoid data loss, and how to recover after the system has shut down abnormally. The example here is my friend's HP Web Hosting Server Appliance (it runs Unix, but the ideas apply to other operating systems as well).
