Practice: Troubleshooting AIX Servers

Last Update:2013-12-18 Source: Internet

Author: User

In this article, learn a structured way to troubleshoot IBM® AIX® problems. You will become familiar with the relevant tools and knowledge, improving your ability to solve the difficult problems that may arise. The article presents two interesting scenarios I have encountered and the steps used to investigate them. It then pauses to let you guess what went wrong before giving the answer.

Example

First, let me describe two problems I encountered as a system administrator.

Problem 1: A bigger server, but less computing power.

At that time, I needed to migrate an AIX 5.3 LPAR from an old POWER4™-based IBM pSeries® p670 server to a POWER6®-based pSeries p570 server. The old server was short on resources (Workload Manager was being used to manage the resources of the main applications on it), so the dynamic processor resources on the new hardware should have provided the computing power I needed. I took a mksysb of the LPAR, restored it on the new hardware with Network Installation Manager, and remapped the SAN disks.

I booted the LPAR, and everything seemed fine until I started the application. Suddenly, users began calling: they could not use their product at all. When I logged on, I found the server completely idle. No process was consuming significant resources. So why were users having problems?

Problem 2: The mirror cannot be removed from a failing hard disk.

One of my servers had mirrored root disks. One day, the error report showed that bad blocks on one of the disks could not be relocated. I knew this was a precursor to hardware failure, so I began to break the mirror. However, the server reported that the mirror could not be completely removed, because one of the logical volumes had only one good copy, and it was on the failing disk. How could I solve this problem and replace the hardware?

Troubleshooting Process

Remember these two example problems. Now let's take a look at the process of solving them.

Step 1: Do not touch anything

Once you find yourself in trouble, the best thing to do is stop. Like Indiana Jones in "Raiders of the Lost Ark", if you discover that stepping on the floor sends a dart flying at you, freeze where you are and do not move on. More changes will only complicate the problem and may make the situation worse. There is no point in having several problems to solve when a single problem is already disrupting the system.

For the first example, I asked the users to log off immediately, and then I shut down the application. I knew that with performance this poor, users' queries and input could be interrupted, which might corrupt their data, and I did not want their environment to change any further before I had examined the system. Although users did not want to hear that they could not use the new server yet, they were glad to know that I was looking for the cause. This also left me free to carry out the other troubleshooting steps in my own way.

Step 2: Start with basic commands, then increase the complexity

When I was studying Kung Fu, I heard the story of a second-degree black belt who stopped a thief at a bus station. The students all wanted to know which techniques she had used to bring the attacker down. Was it a Jinhu-style move, or some other advanced form? We even imagined that she was powerful enough to throw her opponent to the ground. It was none of these: she used one of the first techniques a white belt learns in class, an elbow to the chest followed by a strike to the nose.

AIX provides commands for checking all aspects of the server, including hardware and software. Even the most basic commands provide a good foundation for problem analysis. When there is not enough information or something is still abnormal, you can start to try more complex and powerful tools. However, you should start with the simplest commands and ideas and then use more powerful tools.

For example, errpt is one of the most basic tools found on AIX. It reports a variety of hardware and software problems. With the -a flag, or the -j option and an error identifier, it produces more detailed output describing the type of problem, the affected components, and how the system responded to the error. If that is not enough information, you can use the diag command to examine the system further. This command runs tests on the individual parts of the hardware and operating system.
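To illustrate how these errpt options combine, the following Python sketch builds errpt command lines without running them. The helper name build_errpt_command and the identifier value are hypothetical. The -a, -j, and -d flags belong to errpt, but the sketch handles only the H (hardware) and S (software) classes for -d, so check the man page on your AIX level before relying on it.

```python
from typing import List, Optional


def build_errpt_command(detailed: bool = False,
                        error_id: Optional[str] = None,
                        error_class: Optional[str] = None) -> List[str]:
    """Return an errpt argument list; nothing is executed here."""
    cmd = ["errpt"]
    if detailed:
        cmd.append("-a")            # detailed report instead of the one-line summary
    if error_id:
        cmd += ["-j", error_id]     # restrict output to one error identifier
    if error_class:
        # This sketch only accepts the hardware and software classes.
        if error_class not in {"H", "S"}:
            raise ValueError("this sketch only supports classes H and S")
        cmd += ["-d", error_class]
    return cmd


# "XXXXXXXX" is a placeholder, not a real error identifier.
summary_hw = build_errpt_command(error_class="H")
detail_one = build_errpt_command(detailed=True, error_id="XXXXXXXX")
print(" ".join(summary_hw))
print(" ".join(detail_one))
```

Starting from the hardware-only summary and then drilling into a single identifier mirrors the "basic first, detailed later" order described above.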

For the second example, I first looked for hardware problems by checking the errpt output, and then ran unmirrorvg, a simple but powerful tool for removing a mirror, instead of running rmlvcopy on each logical volume on the disk. When I found that one logical volume could not be removed, I used other basic commands such as lspv, lsvg, and migratepv to gather information. I tried using extendvg and mirrorvg to create another copy of the volume group on a different disk. This still left some stale partitions, so I went further and used syncvg and synclvodm to reconcile the Object Data Manager with the server. Finally, I used migratelp to try to move every logical partition off the disk. Unfortunately, none of these tools worked, but they provided a lot of information.
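The condition that blocked the unmirror can be expressed as a small check. The Python sketch below assumes a hypothetical inventory that maps each logical volume to its copies and whether each copy is good. On a real system that information would have to be gathered from LVM commands such as lsvg and lspv, and the disk states shown here are made up.

```python
from typing import Dict, List, Tuple

# Hypothetical inventory: logical volume -> list of (disk, copy_is_good).
LvCopies = Dict[str, List[Tuple[str, bool]]]


def blocked_logical_volumes(copies: LvCopies, failing_disk: str) -> List[str]:
    """Return logical volumes with no good copy outside the failing disk."""
    blocked = []
    for lv, placements in copies.items():
        good_elsewhere = any(ok and disk != failing_disk
                             for disk, ok in placements)
        if not good_elsewhere:
            blocked.append(lv)
    return sorted(blocked)


inventory: LvCopies = {
    "hd4": [("hdisk0", True), ("hdisk1", True)],
    "hd2": [("hdisk0", True), ("hdisk1", False)],  # mirror copy is stale
}
print(blocked_logical_volumes(inventory, failing_disk="hdisk0"))
```

Any logical volume this check reports is one whose only good copy lives on the failing disk, which is exactly the situation that stops the mirror from being removed cleanly.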

Step 3: Reproduce the problem

A key aspect of any hypothesis-and-experiment process based on the scientific method is the ability to repeat the procedure and obtain the same results. If that is not possible, the conclusions are at best uncertain. At worst, it can overturn the scientists' theory and damage their reputations, as happened to the researchers who announced room-temperature cold fusion in 1989.

Or, as I like to put it: if at first you don't succeed, try it somewhere else and see whether you can cause the same problem.

When managing AIX servers, if something goes wrong and you have the resources to reproduce the problem, perform the same operation on another, similar LPAR and check whether it produces the same result. If modifying the same attribute on another server yields the same result, you can infer that the operation is the root cause of the problem. If it produces a different result, you need to study the subtle differences between the servers and work out what is causing the problem.

For the LPAR in the first example, I found that the problem did not occur when I switched the SAN disks back to the old p670 server and booted it. Users could access their application, and the CPU was under normal load, with utilization above 80% (10% kernel + 70% user). I could therefore conclude that the problem was caused by something unique to the p570 server, not by something introduced during the migration.

Step 4: Study the problem

In the information age, a few keystrokes and clicks are enough to obtain a large amount of information. Better still, system administrators are usually part of a large community that has recorded many people's years of experience.

First, check the information from manufacturers and vendors. Companies like IBM publish their manuals, Redbooks, technical documents, and even man pages online. Entering a few simple keywords in the search bar of the main site is enough to turn up plenty of suggestions and information that may help.

Other sources of information I recommend include the newsgroups, forums, and sites that other system administrators frequent. People who work with servers often visit technical sites and comment on what they see in their work. Most system administrators are happy to offer guidance in public or to help by e-mail. You can also often find older information about other versions of the operating system and software, which can lead you to more.

The key skill in using these sources is choosing an appropriate set of keywords. When I use a general search engine such as Google to research an AIX problem, I make sure the search string starts with AIX to exclude information about other flavors of UNIX. I then include command output or the error label produced by errpt. I also put double quotation marks ("") around specific phrases to restrict the search to those exact phrases and avoid irrelevant results. This matters especially for phrases built from common words (such as "Logical Volume Manager").
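The keyword rules above are simple enough to capture in a few lines. This Python sketch uses a hypothetical helper, build_search_query, that puts the platform name first, wraps exact phrases in double quotation marks, and appends any loose keywords.

```python
from typing import Iterable


def build_search_query(platform: str, phrases: Iterable[str],
                       keywords: Iterable[str] = ()) -> str:
    """Put the platform first, quote exact phrases, then append loose keywords."""
    parts = [platform]
    parts += ['"{}"'.format(p) for p in phrases]
    parts += list(keywords)
    return " ".join(parts)


query = build_search_query("AIX", ["bad block relocation"], ["failure"])
print(query)  # AIX "bad block relocation" failure
```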

For the disk that failed to relocate bad blocks, a Google search for AIX "bad block relocation" failure produced hundreds of results, but none seemed to match my situation.

Step 5: Undo all changes

Sometimes, the best way to solve a problem is to undo every change that has been made and return to the original state. This step is not always feasible. Sometimes overzealous C-level executives force you to roll back their servers, or time constraints make it necessary. Either way, rolling back is one of the best tactics available.

I put this step in the middle of the list of troubleshooting steps because sometimes it needs to happen early and sometimes later. In my experience, though, it is best to complete the first four steps before considering undoing all the changes. If you back out right at the start of troubleshooting, the problem may never be resolved, and the same problem may appear the next time the same work is attempted. If you roll back too late in the process, uptime may suffer, or the problem may become so complicated that rolling back is no longer possible.

For the first example, time pressure forced me to roll back the server migration. Had the production server stayed out of service any longer, the users and the company would have lost money. It took a week to reschedule the work, which gave me time for more research, but when I attempted the migration again, the problem reappeared. In the second example, the hardware problem could not be rolled back. I could not tell the server, "Go back to the way you were before the bad block relocation error!" I had to keep working to get past the disk fault.

Step 6: Change only one thing at a time

If none of the steps above has worked and you decide to change major components or take more drastic action on the server, remember one of the most important rules: change only one thing at a time.

Making several changes at once leads to one of two situations. First, if the changes solve the problem, you do not know which of them did it. If you do not care what fixed the problem, that may not matter much, but good system administrators want to know, because problems tend to recur in the same place. Second, if the problem persists, you may have introduced more complexity. If you keep going, you will not know which change to back out. Go far enough and the system becomes a tangled mess that leaves you thoroughly confused. (There is an xkcd comic about this situation.)
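The rule can be written as a loop: apply one change, test, and back it out before trying the next. The Python sketch below is only an illustration. The function name, the callback signatures, and the toy "server" (a set of active change names) are all assumptions made for the example.

```python
from typing import Callable, List, Optional


def try_one_at_a_time(changes: List[str],
                      apply: Callable[[str], None],
                      revert: Callable[[str], None],
                      is_fixed: Callable[[], bool]) -> Optional[str]:
    """Apply each change alone; revert it if it does not help."""
    for change in changes:
        apply(change)
        if is_fixed():
            return change      # exactly one change is active, so it is the fix
        revert(change)         # restore the baseline before the next attempt
    return None


# Toy stand-in for a server: the "problem" goes away only when
# the hypothetical change "change_b" is active.
active = set()
fix = try_one_at_a_time(
    ["change_a", "change_b", "change_c"],
    apply=active.add,
    revert=active.discard,
    is_fixed=lambda: "change_b" in active,
)
print(fix, sorted(active))
```

Because every unsuccessful change is reverted before the next one is applied, the change that finally works is known with certainty and the system never drifts far from its baseline.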

If the problem persists after a change, you usually want to back that change out and try something else. In the first example, I applied this rule. When I compared the Hardware Management Console profiles of the two servers, they differed: the old POWER4 hardware used dedicated CPUs, while the new POWER6 hardware used an uncapped shared CPU pool. I wanted to know how this difference affected CPU performance, so I changed the profile on the POWER6 hardware to use dedicated CPUs. Curiously, users reported that the server was "normal" again, and I could see load on the processors. I now knew the problem was definitely related to CPU resources, but I still needed to find out why.
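Comparing two profiles by eye is error-prone, so a small diff helps. In this Python sketch the attribute names and values are illustrative only and are not real HMC field names. The point is simply to list every attribute whose value differs between the two servers.

```python
from typing import Any, Dict, Tuple


def diff_profiles(old: Dict[str, Any],
                  new: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {attribute: (old_value, new_value)} for every attribute that differs."""
    missing = "<unset>"
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k, missing), new.get(k, missing))
            for k in keys
            if old.get(k, missing) != new.get(k, missing)}


# Hypothetical profile data for the two servers in the first example.
p670_profile = {"processor_mode": "dedicated", "memory_gb": 16}
p570_profile = {"processor_mode": "shared", "sharing_mode": "uncapped",
                "memory_gb": 16}

differences = diff_profiles(p670_profile, p570_profile)
for attr, (old_val, new_val) in differences.items():
    print(f"{attr}: {old_val} -> {new_val}")
```

Each reported difference is a candidate for a single, isolated change, which keeps the investigation consistent with the one-change-at-a-time rule.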

Step 7: Turn to IBM Support

If you have tried every reasonable step and need fresh ideas, it is usually time to contact IBM Support. They have advanced troubleshooting tools and experts in the operating system and related products (such as VIO and PowerHA), and they can pull up related cases to confirm and help solve similar problems. If you have never called 800-IBM-SERV before, however, there are a few things you need to know.

First, you should have an IBM contract number. There are several support levels, ranging from dedicated 24x7x365 support at the top to 8 A.M. to 5 P.M. support for non-critical servers. You can buy these support packages directly from IBM or contract with a value-added reseller.

You also need to provide some information so that IBM Support can look up your account: usually a phone number, serial number, contract number, or the physical location of the server. Which details are needed depends largely on whether you are opening a hardware case or a software case.

You must also tell the support staff the severity, or priority, of the problem. Priorities range from 1 to 4. Level 1 usually means a system is down or production is affected, and at this level the call is transferred to a technician immediately. Level 4 means that a longer turnaround is acceptable, and it is generally used for routine administration questions.

After you describe the problem and a support case is opened, you are given a tracking number, usually called a PMR. This number identifies the case to the other support staff you work with. Hardware and software PMRs are separate. If your problem crosses that boundary, you need a new number.

I had to contact IBM for both examples. For the first problem, IBM brought in people ranging from VIO support to the kernel team. For the second, only hardware technicians were involved. I provided information collected with the snap command for analysis.

Step 8: Go to extremes

Sometimes there is no way to solve a problem other than trying unorthodox measures that most people would consider crazy. This usually happens when you are desperate, or when a job or even a life is at stake. In such cases, IBM support staff often say, "If you do this, you will be in an unsupported state. You will have to start over before we can support you." But if your solution works, you may save the day.

For my second example, after I contacted IBM Support, they said the only way forward was to create a mksysb image and restore the server from it. Since we had nothing to lose, my team of administrators and I discussed it and planned to create a three-way mirror of the root disk and then pull the failing disk out of the server. Pulling the disk might leave the server unable to boot. The bigger risk, however, was that pulling the disk could disrupt the larger managed system and crash all the LPARs on it. Did we dare do it?

Answer

Now that I have given the background to the problems, it is your turn to answer them. To summarize:

A server with Workload Manager enabled is migrated to faster hardware, but it does not work properly unless the LPAR profile is set to dedicated CPUs instead of dynamic CPUs. Why?

How do you recover a server from a disk that cannot be unmirrored, or rescue data from a physical partition that cannot be moved off the disk?

Once you have an idea, read on.

Actual Situation

The first problem was caused by Workload Manager, which limited the application to 50% of the CPU. When the System Manager polled the LPAR in its round-robin cycle and asked, "How much CPU do you need?", the server replied, "I am currently using only half of what I have been allocated." The System Manager therefore dynamically cut the CPU entitlement in half. After the cycle repeated several times, the CPU capacity had been halved again and again until it was close to zero. The fix was to raise the Workload Manager limit to a maximum of 100% of the CPU, so that the dynamic CPU entitlement would regulate itself properly.
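The arithmetic behind this collapse is worth seeing. The Python sketch below is a deliberately simplified model of the story, not a description of how the platform actually schedules processors. It assumes a starting allocation of 4.0 processing units (a made-up figure) and trims the allocation each cycle to whatever the capped application used, so after n cycles the allocation is the initial value multiplied by the cap raised to the power n.

```python
def simulate_entitlement(initial: float, wlm_cap: float, cycles: int) -> list:
    """Model each polling cycle shrinking the allocation to what was actually used."""
    history = [initial]
    current = initial
    for _ in range(cycles):
        used = current * wlm_cap   # the application may use only this fraction
        current = used             # the allocation is trimmed to observed usage
        history.append(current)
    return history


capped_half = simulate_entitlement(4.0, 0.5, 5)   # Workload Manager limit of 50%
capped_full = simulate_entitlement(4.0, 1.0, 5)   # Workload Manager limit of 100%
print(capped_half)
print(capped_full)
```

With a 50% limit the allocation falls from 4.0 to 0.125 in five cycles, while with a 100% limit it stays constant, which matches the behavior described above.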

For the second example, backup and restore was the only option. When block relocation fails, no business wants to rely on a workaround. According to IBM Support, this problem is rare, and the only choice is to run mksysb to back up to a good disk and restore the system. After the operating system has been restored, you can safely swap out and replace the disk without endangering the other LPARs on the hardware.

Conclusion

I hope you now have a better sense of how a system administrator can troubleshoot an AIX server: the strategies that are available, the practices to avoid, and where to look for solutions. These steps will not suit every situation, and there are other options, but they should point you in the right direction.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.