Detailed analysis of the topas command for AIX performance monitoring

Source: Internet
Author: User

On AIX, topas is the most comprehensive and convenient dynamic view of operating system performance. Using topas output as an example, this article briefly describes performance monitoring on an AIX system, for the reference of O&M engineers and system administrators. Two other routine checks are worth mentioning: 1. use errpt to view the error messages recorded by the operating system; 2. use df to view disk space usage. Here we mainly analyze performance issues.

Run the topas command:

    # topas

Area 1: CPU usage and working conditions.

Kernel: the percentage of CPU used by the operating system kernel. As the base software, the operating system itself needs a certain amount of CPU and memory to support and serve applications (the same is true for memory, which will not be repeated later). The heavier the system load, the more CPU and memory the kernel occupies. Kernel CPU usage is normally not very high and is generally lower than the CPU usage of applications.

User: the percentage of CPU used by user processes. This is the key CPU value: it reflects the total CPU usage of all the software running on the operating system. As a rule of thumb, if User + Kernel stays above 70% continuously, the system may have a serious CPU performance problem.

Wait: the percentage of CPU time spent waiting, which in general means waiting for I/O to respond. I/O is the main bottleneck of a computer. When an application needs to read or write data stored on disk, the process issues an I/O request and waits for it to complete; the CPU time spent in that wait is counted as Wait. A very high value means the I/O subsystem cannot keep up with the number of I/O requests, and the only remedy is to optimize at the I/O level.
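The rules of thumb for User + Kernel and Wait described above can be sketched as a small check over several consecutive topas readings. This is only an illustration: the `CpuSample` type, the `diagnose_cpu` function, and the 25% Wait limit are assumptions made for this example (the text gives 70% for User + Kernel but no specific number for Wait).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CpuSample:
    """One reading of the CPU area of topas, in percent."""
    kernel: float
    user: float
    wait: float


BUSY_LIMIT = 70.0  # from the text: User + Kernel continuously above 70%
WAIT_LIMIT = 25.0  # hypothetical limit; the text only says "very high"


def diagnose_cpu(samples: List[CpuSample],
                 busy_limit: float = BUSY_LIMIT,
                 wait_limit: float = WAIT_LIMIT) -> str:
    """Classify a series of readings; a condition must hold in every sample."""
    if not samples:
        return "no data"
    if all(s.user + s.kernel > busy_limit for s in samples):
        return "cpu-bound"
    if all(s.wait > wait_limit for s in samples):
        return "io-wait"
    return "ok"
```

Requiring the condition in every sample mirrors the word "continuously" in the text: a single busy reading is not treated as a problem.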
Idle: the percentage of time the CPU is idle. This needs no further explanation.

Main possible causes of CPU problems: the database server is executing an SQL statement or stored procedure (a stored procedure is a packaged set of SQL) that requires a large amount of computation, which usually points to poor software design. Or the application has a defect, such as an endless loop or another logic error made when the program was written. A typical program error of this kind can keep one CPU fully occupied. For example, the 20% usage in the sample output comes from one transaction program that has held all the time slices of one CPU for a long time (the system has five CPUs in total).

Area 2: network usage.

Network lists the NIC interfaces. KBPS is the total throughput in KB (kilobytes) per second, I-Pack is the number of packets received per second, O-Pack is the number of packets sent per second, KB-In is the number of kilobytes received per second, and KB-Out is the number of kilobytes sent per second.

Suspect network congestion when an application reports errors because it cannot send data over the network, or when network response is obviously slow and there is no CPU problem. In that case check the network traffic. Typically the KBPS of one NIC has reached four or even five digits (what counts as high depends on whether the NIC is gigabit or 100 Mb).

Next, find out which NIC it is and what service it carries. Run netstat -in on the command line to see the IP address of the corresponding en* interface, and use the IP address to determine which network the busy NIC serves, for example the production service network. Then use netstat -v en* to check the adapter's detailed working status, including the number of error packets, collisions, CRC check errors, and adapter resets. For the details, see the output of netstat -v en*.
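Whether a four-digit or five-digit KBPS value is alarming depends on the link speed. The following sketch converts a KBPS reading into a percentage of the nominal link capacity. It assumes that 1 KB is 1024 bytes and that the link speed is given in megabits per second (10^6 bits per second); the function name is made up for this example.

```python
def link_utilization_percent(kbps: float, link_mbps: float) -> float:
    """Return the share of nominal link capacity used, in percent.

    kbps      -- throughput in KB per second (assumed 1 KB = 1024 bytes)
    link_mbps -- nominal link speed in Mbit/s, e.g. 100 or 1000
    """
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    bits_per_second = kbps * 1024 * 8
    return 100.0 * bits_per_second / (link_mbps * 1_000_000)


# The same reading of 10000 KBPS on two different adapters:
print(round(link_utilization_percent(10000, 100), 2))   # 100 Mb NIC
print(round(link_utilization_percent(10000, 1000), 2))  # gigabit NIC
```

Under these assumptions, 10000 KBPS is about 82% of a 100 Mb link but only about 8% of a gigabit link, which is why the same number can mean congestion on one adapter and light load on another.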
If a large number of CRC errors or error packets appear, the network cable may be faulty or the connection may be poor. If all of the above looks normal and the network is still slow, the switch may be congested. Possible causes of network problems: a large amount of data being loaded over a 100 Mb network (this has been seen before), long-running FTP transfers, or problems with the network cable or the switch.

Area 3: disk usage.

Busy% is the percentage of time the disk is busy, which roughly reflects how close the current I/O load is to the maximum IOPS (I/O operations per second) the disk can sustain. The other columns are not explained here. In general we mainly look at Busy%. When a disk's Busy% stays above 85%, the disk is quite busy and there may be a problem. Of course, if you know that a job is expected to generate a large amount of I/O, there is nothing to worry about; just wait for it to finish.

Causes of disk problems: a log-writing or log-query process on the application server is reading and writing a large volume of logs, which drives up the disk busy rate, or other programs are reading and writing the disk frequently. hdisk0 and hdisk1 are generally the system disks; these are built-in SCSI disks with relatively low IOPS and can easily run at full capacity.

Area 4: process information.

Name: the name of the process, that is, the name of the binary file that was started when the process was executed.

PID: the process ID, which is unique in the system. It is the key value for tracing information about a process, such as its CPU usage, disk I/O reads and writes, memory usage, and paging space usage.

CPU%: the percentage of CPU used by the process.

PgSp: the amount of paging space occupied by the process.

Owner: the owner of the process, that is, the operating system user who started it.

By default, topas lists the processes with the highest CPU usage first.
If the CPU usage in Area 1 stays high, look here to see which process is consuming the CPU, and check which user owns it. If it is a process you started yourself, kill it; otherwise contact the project team.

Area 5: memory paging and paging space activity.

Paging space is space on disk that the AIX operating system uses to back memory. The theory is not elaborated here; refer to operating system texts for details. Disk is more than 10 times slower than memory, so paging space serves only as temporary storage for memory pages that have not been used for a long time. Heavy paging is trouble: it indicates that memory is insufficient.

In this area, focus on PageIn and PageOut. If both values reach three digits or more and stay there for a long time, the system is in what is technically called memory thrashing: pages are constantly written out to disk and then read back in, memory is used very inefficiently, and system response slows down. The same information can be viewed with vmstat, where the pi and po columns correspond to these two values. If there are only page-outs, or only page-ins, or some pages are swapped for a short time, there is no problem; just keep an eye on it.

Area 6: memory usage.

Real,MB: the total amount of physical memory the operating system has, in MB.

% Comp is the percentage of computing memory and % Noncomp is the percentage of non-computing memory. % Client is also non-computing memory, and Noncomp includes Client memory. Memory used to cache JFS file systems is counted as Noncomp; memory used for JFS2 and NFS is counted as Client. Computing memory is the memory actually used by the process.
For example, memory that a program allocates with malloc, or the stack it uses while sorting, is computing memory: the values of the variables in a process must be kept in memory (this description is not complete and is for reference only). Non-computing memory, by contrast, is the I/O buffering the operating system needs when reading and writing files: when a program opens, reads, and writes files, the data passes through the file buffer. (Raw devices are the exception. The CCCC database uses RAC and stores all its data on raw devices. On that database server, data files are cached in the data buffer of the Oracle SGA, which counts as computing memory, so they do not occupy non-computing memory.)

There are many possible causes of memory problems. The main ones are as follows.

The first is that processes simply use a lot of memory. For example, the CCCC database server uses a lot of memory for its large number of Oracle connections, or an SQL script or stored procedure executed in the database needs a large amount of memory to complete. (This has happened on the special case database: a stored procedure caused the operating system to run out of memory and then out of paging space, and the operating system automatically killed the process with PGSP_KILL. That was the first time I learned that AIX has this function.)

The second major cause is memory leaks. Put simply, a memory leak is memory that was requested and is no longer used but is never released; in program terms, memory obtained with malloc is never passed to free. This leads to a serious problem: as the program runs, less and less physical memory remains available, the program eventually crashes, and the application has to be restarted regularly as a workaround. Because of the operating system's paging mechanism, all the memory pages the program no longer touches eventually end up in paging space, so paging space usage keeps growing. This is how an application problem turns into a system problem.
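One rough way to spot the leak pattern described above is to record a process's PgSp value from topas at regular intervals and check whether it only ever grows. The sketch below is a heuristic, not a diagnosis: the function name and the 50 MB growth threshold are assumptions for illustration, and steady growth can also come from a legitimately growing workload.

```python
from typing import Sequence


def looks_like_leak(pgsp_samples_mb: Sequence[float],
                    min_growth_mb: float = 50.0) -> bool:
    """Return True if PgSp never shrinks between consecutive samples
    and the net growth over the window is at least min_growth_mb.

    pgsp_samples_mb -- PgSp readings for one PID, oldest first, in MB
    min_growth_mb   -- hypothetical threshold chosen for this example
    """
    if len(pgsp_samples_mb) < 2:
        return False
    pairs = zip(pgsp_samples_mb, pgsp_samples_mb[1:])
    never_shrinks = all(later >= earlier for earlier, later in pairs)
    net_growth = pgsp_samples_mb[-1] - pgsp_samples_mb[0]
    return never_shrinks and net_growth >= min_growth_mb
```

A process whose readings go 100, 120, 150, 200 MB would be flagged, while one that rises and falls with its workload would not.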
Area 7: paging space usage.

If paging space usage keeps increasing over a long period, system memory is insufficient and disk space is already being used to back memory. If paging space usage continues to grow, or exceeds 50%, be careful (50% already triggers a major alarm on the monitoring platform) and report it to the system administrator immediately so that the reason for the memory growth can be analyzed. If this value keeps growing, the system will eventually crash.
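The two warning signs for paging space, usage above 50% and continued growth, can be combined into one check. The sketch below also extrapolates linearly to estimate how long it would take to reach 100% at the observed rate. The function name, the returned dictionary keys, and the linear extrapolation are assumptions made for this example; only the 50% threshold comes from the text.

```python
from typing import Dict, Optional, Sequence


def paging_space_status(used_pct_samples: Sequence[float],
                        interval_minutes: float,
                        alarm_pct: float = 50.0) -> Dict[str, object]:
    """Evaluate paging space usage readings taken at a fixed interval.

    used_pct_samples -- paging space usage in percent, oldest first
    interval_minutes -- time between two consecutive readings
    alarm_pct        -- alarm threshold (50% in the text)
    """
    if not used_pct_samples:
        raise ValueError("at least one sample is required")
    latest = used_pct_samples[-1]
    elapsed = interval_minutes * (len(used_pct_samples) - 1)
    rate = (latest - used_pct_samples[0]) / elapsed if elapsed > 0 else 0.0
    # Linear extrapolation: only meaningful while usage is actually rising.
    minutes_to_full: Optional[float] = (
        (100.0 - latest) / rate if rate > 0 else None
    )
    return {
        "alarm": latest > alarm_pct,
        "growing": rate > 0,
        "minutes_to_full": minutes_to_full,
    }
```

For readings of 40, 45, 50, and 55 percent taken ten minutes apart, the check raises the alarm and estimates 90 minutes until paging space is full, assuming the growth stays linear.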
