Eight Steps for Troubleshooting AIX Server Faults (1)

Source: Internet
Author: User

Problem 1: The server is bigger, but computing power has dropped.

I needed to migrate an AIX 5.3 LPAR from an old POWER4-based IBM pSeries p670 server to a POWER6-based pSeries p570 server. The old server was short on resources (I was using Workload Manager to manage resources for the main applications on it), so the dynamic processor resources on the new hardware should have given me the computing power I needed. I ran mksysb on the LPAR, used Network Installation Manager to restore it onto the new hardware, and mapped the SAN disks to it.

I booted the LPAR, and everything seemed fine until I started the application. Suddenly, users began to call: they could not access their product at all. When I logged on, I found that the server was completely idle. No process was consuming significant resources. Why were the users having problems?

Problem 2: The mirror cannot be removed from a failing hard disk.

One of my servers had mirrored root disks. One day, the error report showed that bad blocks on one of the disks could not be relocated. I knew this was a precursor to hardware failure, so I began to remove the mirror. However, the server reported that the mirror could not be completely removed, because one of the logical volumes had only one good copy, and that copy was on the failing disk. How could I solve this problem and replace the hardware?

Troubleshooting Process

Keep these two example problems in mind. Now let's look at the process of solving them.

Step 1: Do not tamper

Once you find yourself in trouble, the best thing to do is stop. Like Indiana Jones in "Raiders of the Lost Ark", if you discover that stepping on a floor tile sends a dart flying at you, stay where you are and do not move. More changes will only complicate the problem and may make the situation worse. There is no point in creating several problems to solve when a single one is already disrupting the system.

For the first example, I asked the users to log off immediately, and then I shut down the application. I knew that with performance this poor, users' queries and input could be interrupted, which might corrupt their data, and I did not want any further changes to their environment before I had examined the system. Users do not like to hear that they cannot use the new server yet, but they were glad to know that I was looking for the cause. It also let me carry out the other troubleshooting steps in my own way.

Step 2: Start with basic commands, then increase the complexity

When I was studying kung fu, I heard the story of a second-degree black belt who fought off a thief at a bus station. The students all wanted to know which techniques she had used to take the attacker down. Was it a Jinhu-style move? Or some other advanced form? We even imagined that she was so powerful she had simply overpowered him. It was none of these: she used one of the first techniques a white belt learns in class, an elbow to the chest followed by a blow to the nose.

AIX provides commands for checking every aspect of the server, hardware and software alike. Even the most basic commands give a good foundation for problem analysis. Start with the simplest commands and ideas; only when they do not yield enough information, or something still looks abnormal, move on to more complex and powerful tools.

For the second example, I first looked for hardware problems by checking the errpt output, then ran the unmirrorvg command (a simple but powerful tool for removing a mirror) instead of running rmlvcopy on each logical volume on the disk. When I found that one logical volume could not be removed, I used other basic commands such as lspv, lsvg, and migratepv to collect information. I tried using extendvg and mirrorvg to create another copy of the volume group on another disk. This still left some old partitions behind, so I went further and used syncvg and synclvodm to synchronize the Object Data Manager with the server. Finally, I used migratelp to try to move the individual logical partitions off the disk. Unfortunately, none of these tools worked, but they provided a lot of information.
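
As an illustration of how much a basic command already reveals, the following minimal Python sketch parses text in the layout printed by lsvg -l and reports logical volumes that have only one copy or that contain stale partitions. The sample text, its values, and the function names are hypothetical and are not taken from the server described here; the sketch assumes the usual column order (LV NAME, TYPE, LPs, PPs, PVs, LV STATE, MOUNT POINT).

```python
# Hypothetical sample in the layout of `lsvg -l rootvg` output. The values
# are invented for illustration and do not come from the article's server.
SAMPLE_LSVG_L = """\
rootvg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
hd5                 boot       1       2       2    closed/syncd  N/A
hd6                 paging     8       16      2    open/syncd    N/A
hd4                 jfs2       4       8       2    open/syncd    /
hd2                 jfs2       12      24      2    open/stale    /usr
fslv00              jfs2       10      10      1    open/syncd    /data
"""


def parse_lsvg_l(text):
    """Return one dict per logical volume found in `lsvg -l`-style text."""
    volumes = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the volume group name, the column header, and blank lines:
        # only data rows have a numeric LP count in the third column.
        if len(fields) < 6 or not fields[2].isdigit():
            continue
        volumes.append({
            "name": fields[0],
            "type": fields[1],
            "lps": int(fields[2]),
            "pps": int(fields[3]),
            "pvs": int(fields[4]),
            "state": fields[5],
        })
    return volumes


def single_copy_volumes(volumes):
    """Names of logical volumes whose PP count equals their LP count."""
    return [v["name"] for v in volumes if v["pps"] == v["lps"]]


def stale_volumes(volumes):
    """Names of logical volumes whose state reports stale partitions."""
    return [v["name"] for v in volumes if v["state"].endswith("/stale")]


if __name__ == "__main__":
    lvs = parse_lsvg_l(SAMPLE_LSVG_L)
    print("single copy:", single_copy_volumes(lvs))
    print("stale:", stale_volumes(lvs))
```

A logical volume that appears in either list deserves a closer look before any attempt to remove the mirror, since it may not have a second good copy to fall back on.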

Step 3: Reproduce the problem

A key aspect of any process of hypothesis and experiment based on the scientific method is the ability to repeat the process and produce the same results. If that is not possible, the conclusion is uncertain at best. In the worst case, it can overturn the scientists' theory and damage their reputations, as happened to the scientists who announced room-temperature cold fusion at the end of the 1980s.

Or, as I see it: if at first it fails, try it somewhere else and see whether it causes the same problem.

When managing AIX servers, if something goes wrong and you have the resources to reproduce the problem, perform the same operation on another, similar LPAR and check whether it produces the same result. If modifying the same attribute on another server produces the same result, you can infer that the operation is the root cause of the problem. If it produces a different result, you need to study the subtle differences between the servers and work out what is causing the problem.
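
One way to study the differences between two servers is to capture the same attribute listing on each and compare them. The Python sketch below is a minimal example under stated assumptions: it expects plain "attribute value" text (for instance, the first two columns of lsattr -El output saved from each LPAR), and the sample attributes, values, and function names are hypothetical. Attributes with empty values would need extra handling that is omitted here.

```python
# Hypothetical attribute listings captured from two LPARs. The attribute
# values are invented for illustration only.
OLD_LPAR = """\
maxuproc   1024
ncargs     256
iostat     false
"""

NEW_LPAR = """\
maxuproc   128
ncargs     256
fullcore   false
"""


def parse_attributes(text):
    """Map attribute name to value, using the first two fields of each line."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            attrs[fields[0]] = fields[1]
    return attrs


def diff_attributes(reference, suspect):
    """Return {name: (reference_value, suspect_value)} for every difference.

    An attribute that is missing on one side is reported with None there.
    """
    differences = {}
    for name in sorted(set(reference) | set(suspect)):
        ref_value = reference.get(name)
        sus_value = suspect.get(name)
        if ref_value != sus_value:
            differences[name] = (ref_value, sus_value)
    return differences


if __name__ == "__main__":
    old = parse_attributes(OLD_LPAR)
    new = parse_attributes(NEW_LPAR)
    for name, (ref, sus) in diff_attributes(old, new).items():
        print(f"{name}: working={ref} failing={sus}")
```

Each reported difference is only a candidate cause; it still has to be tested by changing one thing at a time on the failing system.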

For the LPAR in the first example, I found that the problem did not occur when I switched the SAN disks back to the old p670 server and booted it there. Users could access their applications, and the CPU was under its normal load, with utilization at about 80% (10% kernel + 70% user). I could therefore conclude that something unique to the p570 server was causing the problem, rather than something introduced during the migration process.

