Linux Performance Optimization (1): CPU Optimization

Source: Internet
Author: User
Tags: mutex, context switches, CPU usage



What is performance optimization? Personally, I see performance optimization as the work of improving the capability of an application or a system. So how do you go about tuning application performance? A great deal of background is involved, including the Linux kernel, the CPU architecture, how the kernel allocates and manages resources, how processes are created, and so on. There is too much of it to cover here, so this article will not go into all of it. Over the next few articles we will look at how to find the root cause of an application problem, a skill every systems engineer needs. Without further ado, let's get to the subject.

Common terminology

Latency: the time an operation spends waiting for its result to be returned. In some contexts it refers to the entire operation time, in which case it is equivalent to response time.

IOPS: the number of input/output operations per second, a measure of data transfer. For disk reads and writes, IOPS is the number of read and write operations per second.

Response time: the time for an operation to complete, including the time spent waiting, the time being serviced, and the time to return the result.

Utilization: for a resource servicing requests, utilization describes how busy the resource is over a given time interval. For capacity-based resources (such as storage), utilization refers to the amount of capacity consumed.

Saturation: the degree to which a resource has queued work that it cannot service.

Throughput: the rate at which work is completed. For data transfer in particular it means the transfer rate (bytes/sec or bits/sec). In some contexts throughput refers to the operation rate instead.

Linux kernel Features

CPU scheduling classes: a variety of advanced CPU scheduling algorithms, plus support for non-uniform memory access (NUMA) architectures;

I/O scheduling classes: I/O scheduling algorithms, including deadline, anticipatory, and the Completely Fair Queuing (CFQ) scheduler;

TCP network congestion control: pluggable TCP congestion-control algorithms, selectable on demand;


What is the difference between a process, a thread, and a task?

A process is typically defined as an executing instance of a program: the environment in which a user-level program runs. It includes a memory address space, file descriptors, thread stacks, and registers.
A thread is a separately runnable unit of execution inside a process; that is, threads live within a process.
A task is a unit of work that carries out an activity; it can be either a process or a thread.

Reference link: http://blog.chinaunix.net/uid-25100840-id-271078.html

What is context switching?

Before the CPU can execute a program, the related resources must also be in place: the program's memory, its register contents, and so on. Everything other than the CPU itself makes up the program's execution environment, which is what we call the program's context. When the program finishes, or the CPU time allocated to it runs out, it is switched out and waits for its next turn on the CPU. The last step before being switched out is saving the program context, because the context must be restored the next time the CPU runs the program.
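As a small illustration (assuming a Linux /proc filesystem), the kernel keeps per-process counters of how often each process has been switched out, and you can read them directly:

```shell
# Voluntary switches: the process gave up the CPU itself (e.g. blocked on I/O).
# Nonvoluntary switches: the scheduler preempted it when its time slice ran out.
grep ctxt_switches /proc/self/status
```

A process with a high nonvoluntary count is contending for CPU; a high voluntary count usually means it spends its time waiting on I/O or locks.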

What is the difference between I/O-intensive and CPU-intensive workloads?

An I/O-intensive workload is one where I/O (disk/memory) work dominates CPU work: for most of its run time the CPU is waiting on I/O reads and writes, so CPU load stays low. A CPU-intensive workload is the opposite: individual I/O operations finish quickly, but the CPU has a great deal of computation to get through, so CPU load runs high, often at 100%. In general, a CPU-intensive program is one with high CPU utilization that spends most of its time on calculation, logic, and other CPU work.
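A toy sketch of the contrast from the shell (an illustration, not a benchmark):

```shell
# CPU-bound in miniature: the loop burns CPU doing arithmetic and prints
# its final counter; its elapsed time is almost entirely user CPU time.
sh -c 'i=0; while [ "$i" -lt 100000 ]; do i=$((i+1)); done; echo "$i"'

# I/O-style in miniature: sleep spends its whole second blocked in the
# kernel, consuming essentially no CPU at all.
sleep 1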

Application Performance Technologies

1. Selecting the I/O size
The overhead of performing I/O includes initializing buffers, making system calls, context switching, allocating kernel metadata, checking process permissions and limits, mapping addresses to devices, running kernel and driver code to carry out the I/O, and finally freeing the metadata and buffers. Increasing the I/O size is a common strategy for applications to increase throughput.
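A quick way to see why I/O size matters (the /tmp file names here are arbitrary, for illustration only):

```shell
# Copy the same 1 MiB twice: once as 1024 writes of 1 KiB, once as a
# single 1 MiB write. The second version makes far fewer system calls
# for the same amount of data, which is exactly the per-I/O overhead
# the text describes.
dd if=/dev/zero of=/tmp/io_small.bin bs=1K count=1024 2>/dev/null
dd if=/dev/zero of=/tmp/io_large.bin bs=1M count=1    2>/dev/null
ls -l /tmp/io_small.bin /tmp/io_large.bin
```

Both files end up identical in size; only the number of write(2) calls differs.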
2. Caching
The operating system uses caching to improve the read performance and memory allocation performance of the file system, and applications use caching for similar reasons. The results of frequently performed operations are saved in the local cache for later use, rather than always performing expensive operations.
3. Buffers
To improve write performance, data is merged and placed in a buffer before being fed to the next level. This increases the write delay because, after the first write to the buffer, it waits for subsequent writes before it is sent.
4. Concurrency and parallelism
Parallelism: the ability to load and execute multiple runnable programs at the same time (for example, answering the phone while eating). To take advantage of multiprocessor systems, an application needs to run on multiple CPUs at once; this is parallelism, and applications achieve it through multiple processes or multiple threads.
Concurrency: the ability to handle more than one task, not necessarily at the same moment (for example, eating after finishing the phone call); the tasks contend for shared resources.
Synchronization primitives: these regulate access to shared memory, and introduce waiting time (latency) when access is not allowed. Three kinds are common:
Mutex lock: only the lock holder may proceed; other threads block, give up the CPU, and wait;
Spin lock: only the lock holder may proceed, while other threads needing the lock spin in a loop on the CPU, checking whether it has been released. This gives low-latency acquisition, since a waiting thread never leaves the CPU and can run the moment the lock becomes available, but the spinning itself wastes CPU resources;
Read/write lock: guarantees data integrity by allowing either multiple concurrent readers or a single writer with no readers;
Adaptive spin lock: a hybrid of mutex and spin lock that aims for low-latency acquisition without wasting CPU resources.
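Mutual exclusion can even be demonstrated from the shell with flock(1) from util-linux (assumed to be installed; the /tmp paths are arbitrary):

```shell
# Three background workers append to the same file; flock -x turns the
# append into a critical section, so the writes are serialized instead
# of interleaving. Each subshell holds file descriptor 9 on the lock file.
rm -f /tmp/flock_demo.out
for i in 1 2 3; do
  (
    flock -x 9                        # block until the exclusive lock is held
    echo "worker $i in critical section" >> /tmp/flock_demo.out
  ) 9>/tmp/flock_demo.lock &
done
wait
cat /tmp/flock_demo.out
```

The output always contains exactly one intact line per worker; without the lock, concurrent writers could interleave partial output.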
5. CPU binding: pinning a process or thread to one or more CPUs, which keeps its caches warm.

About CPU Performance analysis

System load is calculated by summing the number of running threads and the number of threads waiting to run, reported as 1-, 5-, and 15-minute averages. The load average cannot, on its own, be read as CPU headroom or saturation: on Linux it also counts tasks blocked in uninterruptible I/O, so neither CPU load nor disk load can be inferred from this value alone.
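On Linux the load averages come straight from procfs, so a minimal sketch needs no tools beyond cat:

```shell
# /proc/loadavg holds the three averages (1/5/15 min), then
# running/total scheduling entities, then the most recently used PID.
cat /proc/loadavg

# Interpret the averages relative to the number of CPUs: a 1-minute
# load persistently above this count suggests saturation.
nproc
```

For example, a load of 8 on a 12-CPU machine still has headroom, while the same load on a 4-CPU machine means runnable work is queuing.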

vmstat, the virtual memory statistics command. Its last few columns print system-wide CPU usage, and the first column shows the number of runnable processes. As shown below:

[root@zbredis-30104 ~]# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd     free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 14834208 158384 936512     0    0     0     0    1    3  0  0 100 0  0


r: run queue length, i.e. the number of runnable threads;

b: the number of blocked processes;

swpd: the amount of swap (virtual memory) in use. If it is consistently above 0, the machine is short of physical memory; if the cause is not a program memory leak, you should add memory or move memory-hungry tasks to other machines;

si: the amount of memory swapped in from disk per second. A value greater than 0 indicates that physical memory is insufficient or that something is leaking memory; find the process consuming the memory and deal with it. My machine has plenty of memory and everything is fine;

so: the amount of memory swapped out to disk per second; if this value is greater than 0, same as above;

bi: blocks received per second from block devices (all disks and other block devices on the system; the default block size is 1024 bytes). This machine has no I/O activity, so the column stays at 0, but on a machine copying large amounts of data (2-3 TB) I have seen it reach 140000/s, a transfer rate of almost 140 MB per second;

bo: blocks sent to block devices per second; for example, writing a file makes bo greater than 0. bi and bo should generally be close to 0; otherwise I/O is too frequent and needs tuning;

in: CPU interrupts per second, including timer interrupts;

cs: the number of context switches per second. Calling a system function (a system call), switching threads, and switching processes all produce context switches, so smaller is better. If the value is very large, consider reducing the number of threads or processes. For web servers such as Apache and Nginx, we typically performance-test with thousands or even tens of thousands of concurrent connections, stepping the worker process or thread count down from a peak and measuring under load until cs drops to a relatively small value; that process/thread count is then a suitable setting. The same goes for system calls: each call into a system function crosses into kernel space and causes a context switch, which is expensive, so avoid calling system functions frequently where possible. Too many context switches means the CPU wastes most of its time switching contexts rather than doing real work, leaving the CPU underutilized, which is undesirable;

st: CPU time stolen by other tenants in a virtualized environment;
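The raw counters behind vmstat's "in" and "cs" columns can be read directly from /proc/stat (a Linux-only sketch):

```shell
# intr and ctxt are cumulative totals since boot: total interrupts
# serviced and total context switches performed. vmstat reports their
# per-second deltas.
grep -E '^(intr|ctxt)' /proc/stat | awk '{print $1, $2}'
```

Sampling this twice a second apart and subtracting gives the same per-second rates vmstat prints.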

mpstat, the multiprocessor statistics tool, reports statistics for each CPU.

[root@zbredis-30104 ~]# mpstat -P ALL 1
Linux 2.6.32-573.el6.x86_64 (zbredis-30104)  09/14/2017  _x86_64_  (12 CPU)

03:14:03 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest  %idle
03:14:04 PM  all   0.00   0.00   0.08    0.00   0.00   0.00   0.00   0.00  99.92
03:14:04 PM    0   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00 100.00
03:14:04 PM    1   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00 100.00
03:14:04 PM    2   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00 100.00
03:14:04 PM    3   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00 100.00
03:14:04 PM    4   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00 100.00
03:14:04 PM    5   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00 100.00
03:14:04 PM    6   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00 100.00
03:14:04 PM    7   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00 100.00
03:14:04 PM    8   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00 100.00
03:14:04 PM    9   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00 100.00
03:14:04 PM   10   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00 100.00
03:14:04 PM   11   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00 100.00


%irq: CPU time spent servicing hardware interrupts;

%soft: CPU time spent servicing software interrupts (softirqs);
%steal: time spent servicing other tenants (involuntary wait under a hypervisor);
%guest: time spent running guest virtual machines;

The important columns to watch are %usr, %sys, and %idle. They show each CPU's utilization and the split between user time and kernel (system) time. These values let you spot CPUs running at 100% utilization (%usr + %sys) while others are not fully loaded, which may be caused by a single-threaded application's load or by device interrupt mapping.
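mpstat's per-CPU figures are derived from the cpuN lines of /proc/stat, which you can inspect directly (Linux only):

```shell
# One line per logical CPU: cumulative jiffies spent in user, nice,
# system, idle, iowait, irq, softirq, steal, guest (and guest_nice on
# newer kernels). mpstat converts the deltas into percentages.
grep '^cpu[0-9]' /proc/stat
```

Counting these lines is also a quick way to confirm how many logical CPUs the kernel sees.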


sar, the System Activity Reporter. It is used to observe current activity and, with archiving configured, to report historical statistics. Essentially every resource's usage can be looked up with it. The main options are described below:

Note that sar's options are case-sensitive; commonly used ones include:

-A: report everything (roughly equivalent to combining most of the options below);
-b: I/O and transfer rate statistics;
-B: paging statistics;
-d: block device (disk) activity report;
-r: memory utilization statistics;
-R: memory allocation/release activity;
-S: swap space utilization;
-q: run-queue length and system load averages;
-u: CPU utilization;
-v: inode, file and other kernel table status;
-w: task creation and system (context) switching activity;
-y: TTY (terminal) device activity;
-n: network statistics;
-x {pid | SELF | ALL}: statistics for the given process ID; the SELF keyword reports on the sar process itself and ALL on all system processes;

Common parameter combinations:

To view the CPU:

Overall CPU statistics: sar -u 3 2, meaning a sampling interval of 3 seconds, sampled 2 times;
Per-CPU statistics: sar -P ALL 1 1, meaning a sampling interval of 1 second, sampled once;

1. If %iowait is high, the disks have an I/O bottleneck;
2. If %idle is high but the system responds slowly, the CPU may be waiting for memory to be allocated; consider increasing memory capacity;
3. If %idle stays persistently very low, the system's CPU processing capacity is insufficient, and CPU is the resource that most needs attention;

To view memory:

View memory usage: sar -r 1 2

kbcommit: an estimate of how much memory (RAM + swap) the current workload needs for the system to be guaranteed not to run out;
%commit: kbcommit as a percentage of total memory (RAM + swap);
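These figures are computed from /proc/meminfo, which you can read directly (Linux only):

```shell
# Committed_AS is the total memory the kernel has promised to current
# workloads; sar's %commit compares it against MemTotal + SwapTotal.
grep -E '^(MemTotal|SwapTotal|Committed_AS)' /proc/meminfo
```

On a healthy system Committed_AS can exceed MemTotal thanks to overcommit; it is sustained growth toward the RAM+swap total that signals trouble.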

pidstat: primarily used to monitor the system-resource consumption of all or selected processes: CPU, memory, device I/O, task switches, threads, and so on.

CPU usage statistics:
Run pidstat -u (equivalent to running plain pidstat).
Memory usage statistics:
pidstat -r -p PID 1

minflt/s: minor page faults per second — faults resolved by mapping a virtual address to a physical page already in memory, without touching disk;
majflt/s: major page faults per second — faults where the page backing the virtual address must be brought in from swap or disk; these generally appear when memory is tight;
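The cumulative counters pidstat rates these from are exposed per process in /proc; a minimal sketch reading our own process's counts (fields 10 and 12 of /proc/&lt;pid&gt;/stat are minflt and majflt):

```shell
# Print this process's cumulative minor and major fault counts.
awk '{print "minflt:", $10, "majflt:", $12}' /proc/self/stat
```

pidstat -r samples these counters at an interval and reports the per-second differences.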
I/O statistics:
pidstat -d 1 2

CPU-related optimizations

1. Compiler optimizations
2. Scheduling priority and scheduling class (set nice value)
For example, Nice-n command
Renice change the priority of the already running process;
The Chrt command displays and modifies the priority and scheduling policies directly;
3. Process bindings (a process can be tied to one or more CPUs)
For example, taskset-pc 0-3 10790

4. Exclusive CPU sets (dedicating whole CPUs to a workload)
5. BIOS tuning
For example, enabling turbo frequencies
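A hedged sketch of items 2 and 3 together, assuming nice and taskset (util-linux) are available and CPU 0 is usable: start a shell at niceness 10 pinned to CPU 0, then read both settings back from /proc.

```shell
# The nice value is field 19 of /proc/<pid>/stat, and the allowed-CPU
# list appears as Cpus_allowed_list in /proc/<pid>/status.
nice -n 10 taskset -c 0 sh -c '
  awk "{print \"nice value:\", \$19}" /proc/self/stat
  grep Cpus_allowed_list /proc/self/status
'
```

This prints a nice value of 10 and an allowed-CPU list of 0, confirming both settings were inherited by the child shell.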

