[Repost] Understanding Linux Performance


Source: http://www.linuxfly.org/post/114/


Projects often call for an efficiency analysis of a system already in production, or for answering a customer's request to optimize system performance. More often, when a system runs into trouble, you need to analyze the cause and locate the failure or bottleneck, and ideally fix it at the same time. In reality, operating system tuning is a complex problem, and Linux manages resources differently from other operating systems, which causes plenty of unnecessary misunderstanding and trouble. I cannot claim to write a better article on the subject myself, so I am reposting this document for reference. (The article has been trimmed to fit the situation at hand, and the problems people commonly run into have been pointed out.)
1. The Premise
We could open this article with a list of tuning parameters that might affect the performance of a Linux system, but it would not be worth much. Performance tuning is a difficult task that demands considerable insight into hardware, the operating system, and the application. If tuning were simple, the parameters we would list would already be baked into the hardware microcode or the operating system, and you would not need to read this article. Server performance is affected by many factors.

Faced with a database server serving 20,000 users from a single IDE hard drive, spending weeks tuning the I/O subsystem is futile; a new drive or an update to the application (such as SQL optimization) will usually improve that server's performance far more. As mentioned above, never forget that system performance is affected by many factors. Understanding how the operating system manages system resources helps you judge which subsystem should be tuned when a problem appears.
2. Linux CPU Scheduling
The basic function of any computer is simple: it computes. To compute, there must be a way to manage the computing resources, the processors, and the computing tasks (also called threads or processes). Thanks go to Ingo Molnar, who brought the O(1) CPU scheduler to the Linux kernel. Unlike the old O(n) scheduler, the new scheduler is dynamic, supports load balancing, and operates in constant time regardless of the number of tasks.
The new scheduler scales very well, regardless of the number of processes or processors, and it carries little overhead itself. The new scheduling algorithm uses two priority queues:
• Active run queue
• Expired run queue
An important goal of the scheduler is to allocate CPU time slices to processes efficiently according to their priority. Once allocation is done, each process is placed in a CPU's run queue; in addition to the active run queue, each CPU also has an expired run queue. When a task in the active run queue uses up its time slice, it is moved to the expired run queue, and its time slice is recalculated during the move. If the active run queue no longer holds any task of a given priority, the pointers to the active and expired run queues are swapped, so the expired priority list becomes the active priority list. Interactive processes (as opposed to real-time processes) usually have a higher priority, so they receive longer time slices and more compute time than low-priority processes, but the scheduler's own adjustments keep low-priority processes from starving completely. The advantage of the new scheduler is that it dramatically improves the scalability of the Linux kernel, letting it handle enterprise-class workloads with large numbers of processes and processors far better. The new O(1) scheduler is included in the 2.6 kernel and has also been backported to the 2.4 kernel.
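To see process priorities and nice values in practice, the standard tools can be used; a minimal sketch, assuming ps, renice, and chrt are available (they ship with the usual procps and util-linux packages) and using 1234 as a purely hypothetical PID:
# ps -eo pid,ni,pri,comm | head
# renice +10 -p 1234
# chrt -p 1234
The first command lists nice values and priorities, renice lowers the priority of process 1234, and chrt shows its scheduling policy and real-time priority.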

Another important advantage of the new scheduler is its support for NUMA (non-uniform memory access) and SMP (symmetric multiprocessing) systems, including processors with Hyper-Threading technology.
The improved NUMA support ensures that load balancing does not occur across CECs or NUMA nodes unless a node exceeds its load threshold.
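On NUMA hardware, the node layout and cross-node memory traffic can be inspected as follows; a minimal sketch, assuming the numactl package is installed (it is not present on every distribution):
# numactl --hardware
# numastat
numactl --hardware lists the NUMA nodes with their CPUs and memory sizes, and numastat reports per-node counters such as numa_hit and numa_miss.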
3. The Memory Architecture of Linux
Today we face the choice between 32-bit and 64-bit operating systems. For enterprise users the biggest difference is that a 64-bit operating system can address more than 4GB of memory. From a performance standpoint, we need to understand how 32-bit and 64-bit operating systems map physical and virtual memory.

The 64-bit and 32-bit Linux kernels differ significantly in how they address memory.
On a 32-bit architecture such as IA-32, the Linux kernel can directly address only the first gigabyte of physical memory (896MB once the reserved region is subtracted). Memory above that limit, the so-called ZONE_HIGHMEM, must be mapped into the ZONE_NORMAL space below 1GB before the kernel can access it. This mapping is transparent to applications, but allocating memory pages in ZONE_HIGHMEM degrades performance.
On the other hand, with 64-bit architectures such as x86-64 (also called EM64T or AMD64), ZONE_NORMAL extends all the way to 64GB or 128GB (in practice it can be more, but the value is limited by the amount of memory the operating system itself supports). As we can see, a 64-bit operating system eliminates the performance impact of ZONE_HIGHMEM on part of the memory.
In practice, on 32-bit architectures the memory addressing problem described above can cause crashes or severe slowdowns for large-memory, high-load applications. Using a hugemem kernel mitigates the problem, but moving to an x86_64 architecture is the best solution.
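To see how a running 32-bit system splits memory between the high and low zones, /proc/meminfo can be checked directly; a minimal sketch (on a 64-bit kernel these counters are absent or zero):
# egrep 'HighTotal|HighFree|LowTotal|LowFree' /proc/meminfo
HighTotal and HighFree describe ZONE_HIGHMEM, while LowTotal and LowFree describe the directly addressable low memory.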
4. Virtual Memory Management
Because the operating system maps memory into virtual memory, the physical memory layout is usually invisible to users and applications. To understand memory tuning on Linux, we must understand its virtual memory mechanism. An application does not allocate physical memory; it asks the Linux kernel for a portion of address space mapped into virtual memory. Virtual memory does not necessarily map to physical memory: if the application requests a large amount of memory, some of it may be mapped to the swap space on the disk subsystem.

Also note that applications do not normally write data directly to the disk subsystem; they write to the cache and buffers, and the bdflush daemon periodically flushes the data in the cache and buffers out to disk.
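How aggressively dirty data is flushed can be adjusted through kernel tunables; a minimal sketch, assuming a 2.6 kernel (on 2.4 kernels the equivalent knobs live in /proc/sys/vm/bdflush):
# sysctl vm.dirty_background_ratio vm.dirty_ratio
vm.dirty_background_ratio controls when background flushing starts, and vm.dirty_ratio caps the share of memory that may hold dirty pages before writing processes are forced to flush.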
The way the Linux kernel handles writes to the disk subsystem is closely tied to how it manages the disk cache. Whereas other operating systems allocate only a fixed portion of memory as disk cache, Linux handles memory more efficiently: by default the virtual memory manager uses all available memory as disk cache. That is why you will sometimes see a Linux system with several gigabytes of RAM reporting only 20MB of free memory.
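A quick way to see this in practice is the free command; a minimal sketch using procps (the actual numbers depend entirely on the machine):
# free -m
The buffers and cached columns show how much memory the kernel is using as disk cache, and the "-/+ buffers/cache" row shows how much memory would be available if that cache were reclaimed.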
Linux also uses its swap space very efficiently. Virtual memory is made up of physical memory plus the swap space on the disk subsystem. If the virtual memory manager notices that an allocated memory page has not been touched for a long time, it moves that page out to swap space. Daemons such as getty start with the system but are rarely used afterwards; to free up expensive main memory, the system pages their memory out to swap. This is how Linux uses swap space, so a swap partition that is more than 50% used does not mean physical memory has hit a bottleneck; swap space is simply one of the ways the Linux kernel makes better use of system resources.
Put simply: swap usage on its own only reflects how effectively Linux is managing memory. The swap in/out rate is a far more meaningful indicator of a memory bottleneck; if swap in/out stays at 200 to 300 pages per second, the system probably has a memory bottleneck. The following example shows a healthy state:
# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0   5696   6904  28192  50496    0    0    88   117   61   29 11  8 80  1
5. Modular I/O Schedulers
As we know, the Linux 2.6 kernel brings many new features, among them a new I/O scheduling mechanism. The old 2.4 kernel used a single I/O scheduler; the 2.6 kernel offers four selectable I/O schedulers, because Linux is used for such a wide range of workloads that the demands placed on I/O devices differ enormously; a laptop and a 10,000-user database server clearly have very different I/O requirements. The four schedulers are described below, followed by an example of checking and switching the active scheduler.
(1) Anticipatory
The anticipatory I/O scheduler was designed on the assumption that a block device has only one physical seek head (a single SATA drive, for example). True to its name, the anticipatory scheduler uses an anticipatory algorithm that writes relatively large streams of data to the disk rather than many small random writes, which can add some latency to write I/O operations. This scheduler suits many common workloads, such as most personal computers.
(2) Complete Fair Queuing (CFQ)
The Complete Fair Queuing (CFQ) scheduler is the standard algorithm used by Red Flag DC Server 5. The CFQ scheduler uses a QoS policy to assign the same bandwidth to every task in the system. It is suitable for multi-user systems with a large number of competing processes; it tries to keep processes from starving and achieves relatively low latency.
(3) Deadline
The deadline scheduler is a polling scheduler that uses a deadline algorithm to provide near-real-time behavior for the I/O subsystem. It delivers very low latency while maintaining good disk throughput. If you use the deadline algorithm, make sure process resource allocation does not become a problem.
(4) NOOP
The noop scheduler is a simplified scheduler that performs only basic merging and sorting. It has little relevance for desktop systems and is mainly used in special software and hardware environments that have their own scheduling mechanisms and need very little from the kernel, such as some embedded systems. Desktop users will generally not choose it.
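As referenced above, the active scheduler can be inspected and changed per block device through sysfs; a minimal sketch, with sda as an example device (per-device switching needs a sufficiently recent 2.6 kernel; on older kernels the scheduler is selected system-wide with the elevator= boot parameter):
# cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
# echo deadline > /sys/block/sda/queue/scheduler
The names listed depend on what was compiled into the kernel; the scheduler shown in brackets is the one currently active.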
6. The Network Subsystem
The new API for network interrupt mitigation (NAPI) changed the network subsystem and improved performance on high-traffic networks. When handling the network stack, the Linux kernel cares more about reliability and low latency than about low CPU overhead and high throughput. So in some scenarios, Linux used as a firewall or for enterprise file, print, or database serving may perform worse than an identically configured Windows server.
In the traditional way of handling network packets, an Ethernet packet arrives at the NIC, and if the MAC address matches it is placed in the NIC's buffer. The NIC then moves the packet into a network buffer in the operating system kernel and sends a hard interrupt to the CPU, which processes the packet and hands it to the appropriate part of the network stack, perhaps a TCP port or an Apache application.

That is a simple view of network packet handling, but it exposes the drawbacks of the approach. Every time a matching packet arrives, the network interface sends a hard interrupt to the CPU, interrupting whatever the CPU was doing and forcing a context switch and a flush of the CPU cache. You might think this is not a problem while only a few packets reach the NIC, but gigabit networks and modern applications generate thousands of packets per second, which can hurt performance badly.
Because of this, NAPI introduces a counting mechanism for network traffic. The first packet raises an interrupt and is handled in the traditional way, but for subsequent packets the NIC switches to a poll mode: as long as packets remain in the NIC's DMA ring buffer, no new interrupts are raised until the last packet has been processed or the buffer is emptied. This effectively reduces the impact of excessive interrupts on system performance. At the same time, NAPI improves system scalability by creating soft interrupts that can be serviced by multiple processors. NAPI is a big help on enterprise-class multiprocessor platforms, but it requires a NAPI-enabled driver. Many drivers still do not enable NAPI by default, which leaves considerable room for tuning the network subsystem.
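Whether interrupt mitigation is actually paying off can be checked with the standard counters; a minimal sketch:
# cat /proc/interrupts
# vmstat 2
/proc/interrupts shows per-CPU interrupt counts for each IRQ line, including the NIC, and the in column of vmstat shows total interrupts per second; watching how quickly these grow under heavy traffic indicates whether the driver is coalescing several packets per interrupt.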
7. Understanding Linux Tuning Parameters
Because Linux is an open-source operating system, a large number of performance monitoring tools are available for it. Which you choose depends on personal preference and on how much detail you need. All performance monitoring tools work by the same rules, so whichever tool you use, you need to understand the parameters below. The most important parameters are listed here, and understanding them properly is well worth the effort.
(1) Processor parameters
· CPU utilization
This very simple parameter shows the utilization of each CPU. On the xSeries architecture, sustained CPU utilization above 80% may indicate a processor bottleneck.
· Runnable processes
This value shows the processes that are ready to run. Over any sustained interval it should not exceed 10 times the number of physical CPUs; otherwise the CPU may be a bottleneck.
· Blocked
Processes that cannot run because they are waiting for an I/O operation to finish. A high blocked value may indicate an I/O bottleneck.
· User time
The percentage of CPU time spent on user processes, including nice time. A high user time value means the system is spending its cycles on actual work.
· System time
The percentage of CPU time spent on kernel operations, including IRQs and soft interrupts. A consistently high system time may indicate a bottleneck in the network or driver stack. A healthy system normally spends only a little time on kernel operations.
· Idle time
The percentage of time the CPU is idle.
· Nice time
The percentage of CPU time spent on re-niced processes (processes whose priority has been changed with nice).
· Context switch
The number of context switches between threads in the system.
· Waiting
The total time the CPU spends waiting for I/O operations. Like blocked, a system should not spend too much time waiting for I/O; otherwise you should investigate whether the I/O subsystem has a bottleneck.
· Interrupts
The interrupts value includes hard interrupts and soft interrupts; hard interrupts have the more adverse effect on system performance. A high interrupts value suggests a software bottleneck, either in the kernel or in a driver. Remember that the value includes the interrupts generated by the CPU clock (a modern xSeries system fields about 1000 clock interrupts per second).
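Most of these processor parameters can be read with vmstat and mpstat; a minimal sketch, assuming the sysstat package (which provides mpstat) is installed:
# vmstat 2
# mpstat -P ALL 2
In the vmstat output, r is runnable processes, b is blocked processes, in is interrupts, cs is context switches, and us/sy/id/wa are the user, system, idle and I/O-wait percentages; mpstat reports the same CPU breakdown per processor.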
(2) Memory parameters
· Free memory
Compared with other operating systems, Linux's free memory value should not be treated as an important performance indicator, because, as mentioned earlier, the kernel allocates most otherwise unused memory to the file system cache, so this value is usually small.
· Swap usage
This value shows how much swap space is in use. As noted earlier, swap usage by itself only reflects how effectively Linux is managing memory; the swap in/out rate is a more meaningful indicator, and a sustained swap in/out of 200 to 300 pages per second usually points to a memory bottleneck.
· Buffer and Cache
This value shows the cache allocated to the file system and to block devices. In the Red Flag DC Server 5 release you can adjust how much free memory is used as cache by modifying page_cache_tuning under /proc/sys/vm.
· Slabs
The memory used by the kernel itself; note that kernel pages cannot be swapped out to disk.
· Active versus inactive memory
Provides information about active versus inactive system memory; inactive memory is what the kswapd daemon can swap out to disk.
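These memory parameters can be read straight from /proc/meminfo, with slabtop giving a per-cache view of kernel slab usage; a minimal sketch, assuming a 2.6-based system with procps installed:
# egrep 'Slab|Active|Inactive' /proc/meminfo
# slabtop
slabtop shows the individual kernel slab caches sorted by size, which helps explain where the Slabs figure comes from.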
(3) Network parameters
· Packets received and sent
This parameter represents the number of packets received and sent by a specified network card.
· Bytes received and sent
This parameter represents the number of bytes of packets received and sent by a specified network card.
· Collisions per second
This value shows the number of collisions occurring on the specified network interface. A value that is persistently non-zero points to a bottleneck in the network infrastructure rather than on the server itself. Collisions are rare on a properly configured network unless the network is built on hubs.
· Packets dropped
This value shows the number of packets dropped by the kernel, possibly because of firewall rules or a shortage of network buffers.
· Overruns
Overruns represents the number of times the network interface ran out of buffer space. This parameter should be read together with packets dropped to determine whether the bottleneck lies in the network buffers or in the length of the network queue.
· Errors
This value records the number of frames marked as faulty. It is often caused by a network misconfiguration or by damaged cabling; damaged cables are a significant performance factor in copper gigabit Ethernet environments.
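The per-interface counters behind these parameters are visible with the usual tools; a minimal sketch, with eth0 as an example interface name:
# netstat -i
# ifconfig eth0
# cat /proc/net/dev
netstat -i summarizes packets, errors, drops and overruns per interface, ifconfig shows the same counters plus collisions for one interface, and /proc/net/dev holds the raw byte and packet counters the other tools read.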
(4) Block device parameters
· iowait
The time the CPU spends waiting for I/O operations to complete. A consistently high value usually points to an I/O bottleneck.
· Average Queue Length
The number of outstanding I/O requests. A disk queue length of 2 to 3 is usually optimal; higher values suggest the system may have an I/O bottleneck.
· Average wait
The average time taken to service an I/O request. It includes both the actual I/O operation and the time spent waiting in the I/O queue.
· Transfers per second
The number of I/O operations (reads and writes) performed per second. Combined with kilobytes per second, it lets you estimate the system's average transfer size, which should normally match the stripe size of the disk subsystem for best performance.
· Blocks Read/write per second
The number of blocks read and written per second. In the 2.6 kernel a block is 1024 bytes; in earlier kernel versions blocks could range from 512 bytes to 4KB.
· Kilobytes per second Read/write
The amount of actual data read from and written to the block device, in kilobytes.
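Most of these block device parameters map directly onto the extended iostat output; a minimal sketch, assuming the sysstat package is installed (exact column names vary slightly between versions):
# iostat -dx 2
avgqu-sz corresponds to the average queue length, await to the average wait, r/s plus w/s to transfers per second, and the per-second read and write columns to the block and kilobyte rates.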
8. Appendix
This document is excerpted and adapted from the IBM Redbook Tuning Red Hat Enterprise Linux on IBM eServer xSeries Servers.
