Understanding Linux Performance

In projects we often need to analyze the efficiency of a running system, answer customer questions about how to improve system efficiency, or, when a problem occurs, analyze the cause and locate the system fault or bottleneck (and, ideally, fix it as well). Operating system tuning is in fact a very complicated subject, and Linux manages resources with mechanisms of its own that differ from other operating systems, which can cause a lot of unnecessary misunderstanding and trouble. I must admit I have not managed to write a good, well-organized article on the topic myself, so I can only offer a document written by someone else for your reference. (The article has been abridged to fit the actual situation, and the problems that are commonly encountered have been marked.)

I. Prerequisites
We could open this article with a list of tuning parameters that may affect Linux performance, but that would be of little value. Performance tuning is a difficult task: it requires a deep understanding of hardware, the operating system, and the application. If performance tuning were simple, the parameters we would list would long since have been baked into hardware microcode or the operating system, and there would be no need to read this article. Server performance is affected by many factors.

Faced with a database server that serves 20,000 users from a single IDE hard disk, we could spend weeks tuning the I/O subsystem to no avail, while a new driver or an application update (such as SQL optimization) might improve the server's performance dramatically. As mentioned earlier, never forget that system performance is affected by many factors. Understanding how the operating system manages system resources helps us determine which subsystem should be adjusted when a problem appears.
II. Linux CPU Scheduling
The basic function of any computer is very simple: computing. To carry out computation there must be a way of managing the computing resources, processors, and computing tasks (also called threads or processes). Many thanks to Ingo Molnar, who brought the O(1) CPU scheduler to the Linux kernel. Unlike the old O(n) scheduler, the new scheduler is dynamic, supports load balancing, and operates in constant time regardless of the number of tasks.
The new scheduler scales well regardless of the number of processes or processors and imposes less system overhead. The new scheduling algorithm uses two priority queues:
· The active run queue
· The expired run queue

An important goal of the scheduler is to allocate CPU time slices to processes effectively according to their priority. Once a process has been assigned a time slice, it is placed in the active run queue; alongside the active run queue there is also an expired run queue. When a task in the active run queue has used up its time slice, it is moved to the expired run queue, and its time slice is recalculated during the move. If no task of a given priority remains in the active run queue, the pointers to the active and expired run queues are swapped, so that the expired priority list becomes the active one. In general, an interactive process (as opposed to a CPU-bound batch process) has a higher priority: it gets a longer time slice and more computing time than a low-priority process, yet the scheduler's adjustments do not completely starve low-priority processes. The advantage of the new scheduler is a significant improvement in the scalability of the Linux kernel, so the new kernel handles enterprise workloads consisting of large numbers of processes and processors much better. The new O(1) scheduler is part of the 2.6 kernel, but it has also been backported to some 2.4 kernels.
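As a brief, hedged illustration of the priority mechanism described above, the standard nice, renice, and ps utilities can be used to observe and influence a process's nice value from the shell (the PID and nice values below are only placeholders for the example):

# nice -n 10 gzip large-backup.tar      (start a CPU-intensive job with a lowered priority)
# renice 15 -p 1234                     (lower the priority of an already running process)
# ps -eo pid,ni,pri,comm                (inspect the nice value and priority of running processes)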

Another important advantage of the new scheduler is its support for NUMA (non-uniform memory access) architectures and SMT (symmetric multithreading) processors, such as Intel's Hyper-Threading technology.
The improved NUMA support ensures that load balancing does not occur across CECs or NUMA nodes unless a node is overloaded.
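As a hedged aside, on systems where the numactl package is installed, the NUMA topology and per-node memory usage can be inspected from the shell (the output will of course vary from machine to machine):

# numactl --hardware        (show the nodes, their CPUs, and their memory sizes)
# numastat                  (show per-node allocation hit/miss counters)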
III. Linux Memory Architecture
Today we choose between 32-bit and 64-bit operating systems. For enterprise users the biggest difference is that 64-bit operating systems support memory addressing beyond 4 GB. From a performance perspective, we need to understand how 32-bit and 64-bit operating systems map physical and virtual memory.

The 64-bit and 32-bit Linux kernels differ significantly in how they address memory.
On a 32-bit architecture such as IA-32, the Linux kernel can directly address only the first gigabyte of physical memory (896 MB once the reserved range is excluded). Memory above this so-called ZONE_NORMAL must be mapped into the space below 1 GB; the mapping is performed by the kernel and is transparent to applications, but allocating memory pages in ZONE_HIGHMEM causes a performance degradation.
On 64-bit architectures such as x86-64 (also known as EM64T or AMD64), on the other hand, ZONE_NORMAL extends to 64 GB or even 128 GB (in fact further, but the value is limited by the amount of memory the operating system itself supports). As we can see, with a 64-bit operating system we remove the performance impact of ZONE_HIGHMEM memory.
In practice, on a 32-bit architecture, the addressing limitation described above means that applications with large memory footprints and high load can crash or slow down badly. A hugemem kernel can mitigate the problem, but adopting the x86_64 architecture is the best solution.
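As a quick, hedged way of seeing this split on a running machine, the architecture and the low/high memory zones can be checked from the shell (the HighTotal/LowTotal fields only appear on 32-bit kernels with high memory):

# uname -m                                        (reports i686 on 32-bit, x86_64 on 64-bit)
# grep -Ei 'hightotal|lowtotal' /proc/meminfo     (low/high memory split on a 32-bit kernel)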
IV. Virtual Memory Management
Because the operating system maps physical memory into virtual memory, the physical memory layout is usually invisible to users and applications. To understand Linux memory tuning, you must understand the Linux virtual memory mechanism. An application does not allocate physical memory; it requests from the Linux kernel a chunk of memory that is mapped into virtual memory. Virtual memory does not necessarily map onto physical memory: if an application makes a large request, part of it may also be mapped onto swap space in the disk subsystem.

In addition, applications usually do not write data directly to the disk subsystem but to caches and buffers; the bdflush daemon then flushes the data in the caches and buffers to disk at regular intervals.
The Linux kernel handles writes to the disk subsystem through its disk cache. Compared with other operating systems, which assign a fixed portion of memory as disk cache, Linux handles memory more efficiently: by default the virtual memory manager uses all available memory as disk cache. This is why we sometimes observe that a Linux system with several gigabytes of memory appears to have only 20 MB free.
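A hedged way of seeing this in practice is the free command; the numbers below are invented, but the point is that memory reported under "buffers" and "cached" is reclaimable and effectively still available to applications:

# free -m
             total       used       free     shared    buffers     cached
Mem:          2048       2028         20          0        180       1600
-/+ buffers/cache:        248       1800
Swap:         2047          5       2042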
At the same time, Linux uses the swap space mechanism efficiently. Virtual memory space is made up of physical memory plus swap space in the disk subsystem. If the virtual memory manager notices that an allocated memory page has not been touched for a long time, it moves that page to swap space. We often see daemons such as getty that start with the system but are rarely used; to free expensive main memory, the system pages this memory out to swap space. This is how Linux uses swap space: swap partition usage exceeding 50% does not necessarily mean physical memory has become a bottleneck; swap space is simply one of the ways the Linux kernel makes better use of system resources.
Put simply, swap usage alone only reflects how Linux is managing memory. To identify a memory bottleneck, swap-in/swap-out is a more meaningful measure: if the swap-in/swap-out rate stays around 200 to 300 pages per second for a long time, the system probably has a memory bottleneck. The following example shows a healthy state:
# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0   5696   6904  28192  50496    0    0    88   117   61   29 11  8 80  1

V. Modular I/O Schedulers
As we know, the Linux 2.6 kernel brings us many new features, including a new I/O scheduling mechanism. The old 2.4 kernel used a single I/O scheduler; the 2.6 kernel provides four selectable I/O schedulers. Because Linux is used in a wide range of scenarios, different applications place very different demands on I/O devices and load; a laptop and a 10,000-user database server, for example, have very different I/O requirements. (An example of checking and switching the scheduler follows this list.)
(1). Anticipatory
The anticipatory I/O scheduler was created on the assumption that a block device has only a single physical seek head (for example, a single SATA disk). As its name suggests, the anticipatory scheduler uses an "anticipatory" algorithm: it writes larger data streams to disk rather than many small random writes, which may introduce some latency for write I/O operations. This scheduler suits common workloads, such as most PCs.
(2). Complete Fair Queuing (CFQ)
The Complete Fair Queuing (CFQ) scheduler is the standard algorithm used by Red Flag DC Server 5. The CFQ scheduler uses QoS-style policies to allocate equal bandwidth to all tasks in the system. It is suitable for multi-user systems with large numbers of competing processes; it tries to prevent processes from being starved and achieves relatively low latency.
(3). Deadline
The deadline scheduler is a round-robin scheduler built on a deadline algorithm. It offers near-real-time behavior for the I/O subsystem, providing low latency while maintaining good disk throughput. When using the deadline algorithm, make sure process resource allocation remains reasonable.
(4). Noop
The noop scheduler is a simplified scheduler that performs only basic merging and sorting. It has little to do with desktop systems and is mainly used in special hardware and software environments that have their own I/O scheduling mechanisms and need little from the kernel, for example some embedded systems. As desktop users we would generally not choose it.
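As the hedged sketch referred to above: on 2.6 kernels the scheduler can usually be viewed and switched per device through sysfs, or set globally at boot with the elevator parameter (the device name sda is just an example):

# cat /sys/block/sda/queue/scheduler                (the active scheduler is shown in brackets)
noop anticipatory deadline [cfq]
# echo deadline > /sys/block/sda/queue/scheduler    (switch this device to the deadline scheduler)

The same choice can be made system-wide by adding, for example, elevator=deadline to the kernel boot line.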

VI. Network Subsystem
The new interrupt mitigation API (NAPI) changes the network subsystem and improves performance on high-traffic networks. In its network stack, the Linux kernel has traditionally emphasized reliability and low latency over low CPU overhead and high throughput. Therefore, in some cases an enterprise-class workload, such as a firewall or file, print, or database serving, may perform worse than a Windows server with the same configuration.
In the traditional approach to processing network packets, when an Ethernet frame arrives at the network interface and its MAC address matches, it is placed in the NIC's buffer. The NIC then moves the packet into a network buffer in the operating system kernel and raises a hard interrupt to the CPU, which processes the packet up the appropriate network stack, perhaps to a TCP port or an application such as Apache.

This is a simple description of packet processing, but it exposes the drawbacks of the approach: every time a matching packet arrives at the network interface, a hard interrupt is sent to the CPU, interrupting whatever the CPU was doing and causing context switches and CPU cache churn. This may not seem to be a problem when only a few packets reach the NIC, but gigabit networks and modern applications generate thousands of packets per second, which can hurt performance noticeably.
Because of this, NAPI introduces a counting mechanism for handling network traffic. The first packet is handled in the traditional way, but for subsequent packets the NIC switches to a polling mechanism: as long as packets remain cached in the NIC's DMA ring, no new interrupts are raised for them until the last packet has been processed or the ring buffer is exhausted. This effectively reduces the impact of excessive CPU interrupts on system performance. At the same time, NAPI improves system scalability by creating soft interrupts that can be executed by multiple processors, which helps large multi-processor enterprise platforms. NAPI requires driver support, and many drivers still do not enable NAPI by default, which leaves a fairly large amount of room for tuning the network subsystem.
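A hedged way to see how heavily network interrupts are hitting each CPU is to watch the per-interface counters in /proc/interrupts (eth0 is only an example; the line appears under whatever name the driver registers):

# grep eth0 /proc/interrupts            (per-CPU interrupt counts for the interface)
# vmstat 1 5                            (the "in" column shows total interrupts per second)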
VII. Understanding Linux Tuning Parameters
Linux is an open-source operating system, so a large number of performance monitoring tools are available; which ones you choose depends on personal preference and on how much detail you need. All performance monitoring tools work on the same underlying metrics, so whichever tool you use, you need to understand the parameters involved. Some important parameters are listed below, and it is worth understanding them well.
(1) Processor parameters (an example of reading these values follows this list)
· CPU utilization
This simple parameter directly describes the utilization of each CPU. On the xSeries architecture, sustained CPU utilization above 80% may indicate a processor bottleneck.
· Runable Processes
This value describes the processes that are ready to run. Over a sustained period it should not exceed 10 times the number of physical CPUs; otherwise there may be a CPU bottleneck.
· Blocked
Describes processes that cannot run because they are waiting for I/O operations to finish. A high blocked value can point to an I/O bottleneck.
· User time
Describes the percentage of CPU time spent in user processes, including nice time. A high user time value means the CPU is being spent on actual work.
· System time
Describes the percentage of CPU time spent on kernel operations, including IRQs and soft interrupts. A very high system time suggests a bottleneck in the network or driver stack; a healthy system normally spends only a small amount of time on kernel operations.
· Idle time
The percentage of idle CPU.
· Nice time
The percentage of CPU time spent on re-niced processes.
· Context switch
The number of context switches (thread switches) taking place in the system.
· Waiting
The total CPU time spent waiting for I/O operations. As with blocked, a system should not spend too much time waiting for I/O; if it does, investigate whether the I/O subsystem has a bottleneck.
· Interrupts
The interrupts value includes both hard and soft interrupts; hard interrupts have the more adverse effect on system performance. A high interrupt count suggests a software bottleneck, either in the kernel or in a driver. Note that the value also includes interrupts raised by the CPU clock (1,000 interrupts per second on a modern xSeries system).
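As a hedged illustration, most of the processor values above can be read with vmstat (shown earlier) or, per CPU, with mpstat from the sysstat package (the interval and count below are arbitrary):

# mpstat -P ALL 1 3       (per-CPU breakdown of %user, %nice, %sys, %iowait, %irq, %soft, %idle)
# vmstat 1 5              (the r and b columns show runnable and blocked processes, cs shows context switches)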

(2) Memory parameters (an example of reading these values follows this list)
· Free memory
Compared with other operating systems, the free memory value on Linux should not be used as an important performance indicator because, as mentioned earlier, the Linux kernel allocates a large amount of otherwise unused memory to the file system cache, so this value is usually quite small.
· Swap usage
This value describes the swap space in use. As noted earlier, swap usage by itself only reflects how Linux is managing memory; swap-in/swap-out is the more meaningful measure for identifying a memory bottleneck, and a rate that stays around 200 to 300 pages per second for a long time suggests one.
· Buffer and Cache
This value describes the cache allocated to the file system and block devices. In Red Flag DC Server 5 you can modify page_cache_tuning under /proc/sys/vm to adjust how much free memory is used as cache.
· Slabs
Describes the memory used by the kernel itself. Note that kernel pages cannot be swapped out to disk.
· Active versus inactive memory
Provides information about how much of system memory is active. Inactive memory is a candidate for being swapped out to disk by the kswapd daemon.
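As a hedged illustration, most of these memory values can be read directly from /proc/meminfo, and the swap-in/swap-out rate from vmstat:

# grep -E 'MemFree|Buffers|Cached|SwapFree|Slab|Active|Inactive' /proc/meminfo
# vmstat 1 5              (the si and so columns show pages swapped in and out per second)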

(3) Network parameters (an example of reading these values follows this list)
· Packets received and sent
This parameter indicates the number of packets received and sent by a given NIC.
· Bytes received and sent
This parameter indicates the number of bytes received and sent by a given NIC.
· Collisions per second
This value gives the number of collisions occurring on the given NIC. Persistent collisions indicate a bottleneck in the network infrastructure rather than a problem on the server itself. Collisions are rare on properly configured networks unless the network is built from hubs.
· Packets dropped
This value indicates the number of packets dropped by the kernel, possibly because of a firewall or a shortage of network buffers.
· Overruns
Overruns indicates the number of times the network interface ran out of buffer space. This value should be read together with packets dropped to determine whether the bottleneck is in the network buffers or in the length of the network queue.
· Errors
This value records the number of frames marked as failed, which may be caused by an incorrect network configuration or by partially damaged cables. In copper-based Gigabit Ethernet environments, damaged cables are a significant performance factor.
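A hedged way of reading these counters is through the standard interface statistics tools (eth0 is only an example name; sar requires the sysstat package):

# ip -s link show eth0                  (per-interface packets, bytes, errors, dropped, overruns)
# sar -n DEV 1 3                        (packet and byte rates per interface)
# sar -n EDEV 1 3                       (error, drop, and collision rates per interface)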

(4) Block device parameters (an example of reading these values follows this list)
· Iowait
The time the CPU spends waiting for I/O operations to complete. A continuously high value usually points to an I/O bottleneck.
· Average queue length
The number of outstanding I/O requests. In general a disk queue length of 2 to 3 is optimal; higher values suggest an I/O bottleneck.
· Average wait
The average time taken to service an I/O request, including both the actual I/O operation and the time spent waiting in the I/O queue.
· Transfers per second
Describes the number of I/O operations per second (reads and writes combined). Combining transfers per second with kilobytes per second lets you estimate the system's average transfer size, which should usually match the stripe size of the disk subsystem for best performance.
· Blocks read/write per second
This value indicates the number of blocks read and written per second. In kernel 2.6 a block is 1,024 bytes; in earlier kernels blocks could be of various sizes, from 512 bytes to 4 KB.
· Kilobytes per second read/write
The volume of data actually read from and written to the block device, measured in kilobytes.
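A hedged illustration: iostat from the sysstat package reports most of these block device values in one place (the interval and count are arbitrary):

# iostat -k 1 3            (tps, kB_read/s, and kB_wrtn/s per device)
# iostat -x 1 3            (extended statistics, including avgqu-sz for queue length and await for average wait)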

VIII. Appendix
This article is excerpted and adapted from the IBM Redbook "Tuning Red Hat Enterprise Linux on IBM eServer xSeries Servers".
