What to do when a server hangs: how to troubleshoot

Source: Internet
Author: User

This article describes how to troubleshoot a server that has stopped responding completely.

Most of us have run into this situation: a server stops responding. We cannot open Task Manager on it, or even reach its network shares. And, of course, it always seems to be a mission-critical server, so the IT administrator responsible for it inevitably panics.

When dealing with a hung server, the difference between a so-called hard hang and a soft hang matters a great deal. Knowing which operations still work on the server and which do not often lets us make at least a basic diagnosis. For example, if we cannot ping the server, cannot toggle the NumLock or Caps Lock key from the keyboard, and the mouse cursor does not respond, we are very likely dealing with a hard hang. These problems are usually related to hardware (and possibly to drivers), but rarely to Windows configuration problems or memory leaks. In a hard hang, the system has stopped at a very low level in the kernel and is no longer scheduling threads. The first step in that case is to contact the hardware vendor to run diagnostics on the system. Unless you have a specific reason to suspect a particular piece of hardware (such as recently installed memory), removing or replacing hardware is not recommended.

Now for soft hangs. In a soft hang the server is largely unresponsive, but the kernel is still working at a very low level: for example, the server answers a ping, or the NumLock key still toggles. In this state you may be unable to log on to the machine locally or through Terminal Services, or you may get a blank desktop, yet network and printer shares remain accessible. These are the symptoms we commonly see with memory depletion or a process deadlock.
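The distinction between the two kinds of hang can be written down as a small decision rule. The following Python sketch is only an illustration: the `Observations` structure and the `classify_hang` function are hypothetical names, and the rule is a simplification that treats any low-level sign of life as a soft hang.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """What an administrator can observe on the unresponsive server (hypothetical structure)."""
    responds_to_ping: bool    # the server answers a ping from another machine
    lock_keys_toggle: bool    # NumLock / Caps Lock still toggle from the keyboard
    mouse_cursor_moves: bool  # the mouse cursor still responds


def classify_hang(obs: Observations) -> str:
    """Return 'hard' when there is no low-level sign of life, otherwise 'soft'.

    Simplified rule of thumb based on the symptoms described above; it is a
    first guess for triage, not a diagnosis.
    """
    any_sign_of_life = (
        obs.responds_to_ping or obs.lock_keys_toggle or obs.mouse_cursor_moves
    )
    return "soft" if any_sign_of_life else "hard"
```

With this rule, a server that still answers a ping but shows a blank desktop is classified as a soft hang, which points toward memory depletion or a deadlock rather than toward the hardware vendor.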

A common hang scenario is caused by memory depletion in the paged or nonpaged pool. When these resources are exhausted, you will see events similar to the following in the System event log:

 
 


Event ID 2019 indicates that nonpaged pool memory is exhausted; Event ID 2020 indicates that paged pool memory is exhausted. If you see either of these events in the log before the hang, the pool depletion and the hang are probably related. Last year our Platforms CPR team published a blog post (http://blogs.msdn.com/b/ntdebugging/archive/2006/12/18/understanding-pool-consumption-and-event-id_3a00_--2020-or-2019.aspx) on how to troubleshoot Event IDs 2019 and 2020, so we will not repeat it here.
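As an illustration of checking a log for these two events, the sketch below filters a list of already-exported events. It assumes the events are available as `(source, event_id)` pairs; how they are exported from the System event log is left out, and `find_pool_depletion` is a hypothetical helper name. The source name SRV is the one this article gives for these events.

```python
# Meaning of the two pool-depletion event IDs described in this article.
POOL_EVENTS = {
    2019: "nonpaged pool exhausted",
    2020: "paged pool exhausted",
}


def find_pool_depletion(events):
    """Return (event_id, meaning) for every SRV event with ID 2019 or 2020.

    `events` is an iterable of (source, event_id) pairs; this input format
    is an assumption made for the example.
    """
    return [
        (event_id, POOL_EVENTS[event_id])
        for source, event_id in events
        if source.upper() == "SRV" and event_id in POOL_EVENTS
    ]
```

An empty result does not rule out memory problems; it only means that neither of these two events was logged.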

A harder root cause to identify is depletion of system page table entries (PTEs). In a previous article about the /3GB switch, we briefly introduced system PTEs. PTEs are used to keep track of pages in memory. Much as a book's index tells you which page a topic is on, a PTE tells the system which physical page in memory holds a given piece of data. The machine boots with a fixed number of PTEs, and the more memory the system has, the more PTEs are needed to point to memory pages. If the system runs out of available PTEs, it can no longer allocate memory, and it hangs or stops responding.

Unfortunately, when system PTEs are exhausted, nothing is written to the System log to indicate the problem. You can, however, use Performance Monitor to watch the Free System Page Table Entries counter. No counter breaks PTE usage down by process, so Performance Monitor alone cannot always identify the source of the depletion. You may be able to correlate a process's handle count (a steadily rising count suggests a handle leak) with the PTE depletion. Unless the root cause is obvious, though, a memory dump or live debugging is required.
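A reading of that counter can be compared against a rule-of-thumb threshold, as in the short sketch below. It assumes the value has already been read from the `\Memory\Free System Page Table Entries` counter by some other means, and it uses the figure of about 15,000 that this article gives for the value at system startup. `pte_status` is a hypothetical helper, and the threshold is a rough guide rather than a hard limit.

```python
# Rough figure from this article for a worryingly low value at startup.
LOW_PTE_THRESHOLD = 15_000


def pte_status(free_system_ptes: int, threshold: int = LOW_PTE_THRESHOLD) -> str:
    """Classify a Free System Page Table Entries reading as 'low' or 'ok'.

    The reading itself must come from Performance Monitor or a similar
    tool; this function only applies the comparison.
    """
    if free_system_ptes < 0:
        raise ValueError("counter value cannot be negative")
    return "low" if free_system_ptes <= threshold else "ok"
```

A "low" result right after startup suggests that the server begins life short of PTEs; it does not say which process or driver is consuming them.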

To sum up, here are a few simple steps to follow after a server hangs completely:

1. Is this a hard hang or a soft hang? If it is a hard hang, the underlying hardware is the likely culprit, so contact the hardware vendor.

2. Check the event logs for any events recorded before the hang. With paged or nonpaged pool depletion, for example, you will see Event ID 2019 or 2020 with the event source SRV.

3. Start Performance Monitor and check the value of Free System Page Table Entries under the Memory object right after startup. If the value at system startup is lower than normal (about 15,000 or less), that is a bad sign. It means the system is short of PTEs from the moment it boots, which leaves fewer resources for normal server operation.

4. Create a Performance Monitor log and let it run for a period of time. At a minimum, add the counters for the Memory, Process, Processor, and System objects. How long to run the log depends on how long the system takes to hang. Set the sampling interval so that you capture at least 100 samples over the life of the log. Any memory leak should then be fairly obvious, especially if the leak is steady.

5. Finally, follow the steps in this article (http://support.microsoft.com/default.aspx?scid=kb;EN-US;244139) to prepare the system to capture a complete memory dump for analysis if needed.
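The arithmetic behind the logging step can be sketched in Python. The first function picks a sampling interval that yields at least 100 samples before the expected hang; the second estimates the average change per sample of a counter using an ordinary least-squares slope, which is one simple way to make a steady leak visible. Both function names are hypothetical, and the counter samples are assumed to have been exported from the log as a plain list of numbers.

```python
import math


def sample_interval_seconds(expected_hours_to_hang: float, min_samples: int = 100) -> int:
    """Largest whole-second interval that still yields `min_samples` samples."""
    if expected_hours_to_hang <= 0 or min_samples <= 0:
        raise ValueError("duration and sample count must be positive")
    total_seconds = expected_hours_to_hang * 3600
    # Never go below one second, even for very short logs.
    return max(1, math.floor(total_seconds / min_samples))


def trend_per_sample(values):
    """Least-squares slope of the samples: average change per sample.

    Applied to a counter of available memory, a clearly negative slope
    means the resource is draining steadily; applied to a usage counter
    such as a handle count, a clearly positive slope suggests a leak.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator
```

For a server that typically hangs after about 24 hours, `sample_interval_seconds(24)` gives an interval of 864 seconds, which produces exactly 100 samples over that period. A shorter interval gives more detail at the cost of a larger log.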
 
