Linux CPU Performance Optimization


Preface

What is performance optimization? In my view, performance optimization means improving the capability of an application or system. How do we optimize application performance? The topic covers a lot of ground, including the Linux kernel, CPU architecture, kernel resource allocation and management, and the process creation path. Because of space constraints, I will not cover all of that here. In the articles that follow, we will look at how to find the root cause of application faults, a core skill for every systems engineer. Let's get straight to the topic.

Terms

Latency: the time spent waiting for an operation to complete and return a result. In some contexts it refers to the entire operation time and is then equivalent to response time.

IOPS: the number of input/output operations per second; a measure of data transfer. For disk reads and writes, IOPS is the number of read/write operations per second.

Response time: the time for an operation to complete, including time spent waiting, time spent being serviced, and time to return the result.

Usage: for resources requested by a service, usage describes how busy the resource is within a given time window. For capacity-type resources (such as storage), usage refers to the capacity consumed.

Saturation: the degree to which a resource has queued work that it cannot service.

Throughput: the rate of work performed. In data transfer, it refers to the transfer rate (bytes/second or bits/second). In some contexts, throughput refers to the operation rate (operations per second).

Linux Kernel functions

CPU scheduling: advanced CPU scheduling algorithms and non-uniform memory access (NUMA) support;

I/O scheduling: I/O scheduling algorithms, including deadline, anticipatory, and the Completely Fair Queuing (CFQ) scheduler;

TCP congestion control: pluggable TCP congestion-control algorithms that can be selected as needed;

FAQs

What are the differences between processes, threads, and tasks?

A process is usually defined as an instance of a program in execution: the environment for running a user-level program, including the memory address space, file descriptors, thread stacks, and registers.
A thread is an independent path of execution inside a process; threads live within processes.
A task is a unit of scheduled work; in the Linux kernel, both processes and threads are represented as tasks.

Reference link: http://blog.chinaunix.net/uid-25100840-id-271078.html

What is context switching?

When a program runs to perform some function, getting the CPU is not enough: the related resources must also be in place, such as register contents, memory, and other state. Everything other than the CPU itself constitutes the program's execution environment, which is what we call the program's context. When the program finishes, or uses up the CPU time allocated to it, it is switched out and waits for its next turn on the CPU. The last step of switching out is saving the program's context, because the context must be restored the next time the CPU runs this program.
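As a rough illustration (not from the original article), the kernel keeps per-process context-switch counters, and you can read them for the current shell from /proc. A voluntary switch means the process gave up the CPU (for example, it blocked on I/O); a nonvoluntary switch means the scheduler preempted it:

```shell
#!/bin/sh
# Read the running context-switch totals for the current shell ($$) from
# /proc. Assumes a Linux /proc filesystem.
voluntary=$(awk '/^voluntary_ctxt_switches/ {print $2}' /proc/$$/status)
involuntary=$(awk '/^nonvoluntary_ctxt_switches/ {print $2}' /proc/$$/status)
echo "voluntary=$voluntary involuntary=$involuntary"
```

A CPU-bound process tends to accumulate nonvoluntary switches; an I/O-bound one accumulates voluntary switches.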

What is the difference between I/O-intensive and CPU-intensive workloads?

I/O-intensive means the CPU is much faster than the disk/memory it depends on: most of the time the CPU is waiting for I/O (disk/memory) reads and writes to complete, and CPU load stays low. CPU-intensive means the disk/memory are fast relative to the CPU: I/O completes quickly while the CPU still has a great deal of work to process, so CPU load runs at or near 100%. In general, CPU usage is very high and most of the time goes to computation, logical decisions, and other CPU work.

Application Performance Technology

1. Select I/O size
The overhead of executing I/O includes initializing buffers, making the system call, context switching, allocating kernel metadata, checking process permissions and limits, mapping addresses to devices, executing kernel and driver code to perform the I/O, and finally freeing the metadata and buffers. Because this per-operation overhead is paid regardless of size, increasing the I/O size is a common strategy for applications to increase throughput.
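A quick sketch of the idea (my example, not from the article): copy the same amount of data with small and large block sizes. The total data is identical; only the number of system calls, and therefore the per-call overhead, differs:

```shell
#!/bin/sh
# Write 32 MiB twice: once as 8192 x 4 KiB writes (many small I/Os),
# once as 32 x 1 MiB writes (few large I/Os). Timing the two runs on a
# real disk would show the large-block run finishing faster.
OUT=$(mktemp)
dd if=/dev/zero of="$OUT" bs=4k count=8192 2>/dev/null   # many small I/Os
small_ok=$?
dd if=/dev/zero of="$OUT" bs=1M count=32 2>/dev/null     # few large I/Os
large_ok=$?
size=$(wc -c < "$OUT")
echo "size=$size small_ok=$small_ok large_ok=$large_ok"
rm -f "$OUT"
```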
2. Cache
The operating system uses caches to improve file-system read performance and memory-allocation performance. Applications use caches for a similar reason: store the results of frequently executed, costly operations in a local cache for future use, rather than always performing the operation again.
3. Buffering
To improve write performance, data is coalesced in a buffer before being sent to the next layer. This increases write latency: after the first write lands in the buffer, it must wait for subsequent writes before the batch is sent.
4. Concurrency and Parallelism
Parallelism: the ability to load and execute multiple runnable programs at the same time. To take advantage of a multiprocessor system, an application must run on multiple CPUs simultaneously; this is called parallelism, and applications achieve it with multiple processes or threads.
Concurrency: the ability to handle multiple tasks, not necessarily at the same instant; for example, setting one task aside to answer an incoming call, then resuming it;
Synchronization primitives: synchronization primitives police access to shared memory; when access is not allowed, waiting (latency) occurs. Three common types:
Mutex lock: only the lock holder may proceed; other threads block and give up the CPU until the lock is free;
Spin lock: only the lock holder may proceed; other threads spin in a tight loop on the CPU, repeatedly checking whether the lock has been released. Spinning gives low-latency access because the waiting thread never leaves the CPU and can run the moment the lock becomes available, but the spinning itself wastes CPU resources.
Read/write lock: ensures data integrity by allowing either multiple readers or a single writer with no readers.
Adaptive spin lock: a hybrid of mutex and spin lock (it spins briefly, then blocks) that offers low-latency access without wasting CPU resources.
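Mutual exclusion can be sketched even from the shell. The example below (mine, under the assumption that util-linux flock is installed) serializes a read-modify-write on a shared counter file; without the lock, concurrent updates could be lost:

```shell
#!/bin/sh
# Two background workers each add 10 to a shared counter; flock makes the
# read-modify-write atomic, so the final value is exactly 20.
COUNTER=$(mktemp)
LOCK=$(mktemp)
echo 0 > "$COUNTER"

worker() {
    i=0
    while [ $i -lt 10 ]; do
        (
            flock 9                       # block until the lock is free
            n=$(cat "$COUNTER")
            echo $((n + 1)) > "$COUNTER"
        ) 9>"$LOCK"                       # lock released when fd 9 closes
        i=$((i + 1))
    done
}

worker & worker & wait
final=$(cat "$COUNTER")
echo "final=$final"
rm -f "$COUNTER" "$LOCK"
```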
5. CPU binding (pin a process or thread to one or more CPUs to improve cache warmth and memory locality)

CPU Performance Analysis

Uptime:
The system load average is computed by summing the number of running threads and the number of threads queued waiting to run, averaged over 1/5/15 minutes. The load average by itself indicates neither pure CPU headroom nor saturation: on Linux it also counts threads blocked in uninterruptible I/O, so you cannot tell from this value alone whether the load is CPU or disk.
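The three numbers uptime prints come straight from /proc/loadavg; uptime merely formats them:

```shell
#!/bin/sh
# /proc/loadavg fields: 1/5/15-minute load averages,
# runnable/total scheduling entities, and the most recent PID.
set -- $(cat /proc/loadavg)
nfields=$#
echo "load1=$1 load5=$2 load15=$3 runnable_over_total=$4"
```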

Vmstat:
Virtual memory statistics command. The last few columns print global CPU usage; the first column shows the number of runnable processes. For example:

[root@zbredis-30104 ~]# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd     free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 14834208 158384 936512    0    0     0     0    1    3  0  0 100 0  0

Tip:

r: run-queue length, i.e., the number of runnable (running plus waiting-to-run) threads;

b: the number of processes blocked in uninterruptible sleep (usually waiting for I/O);

swpd: the amount of swap in use. If it stays above 0, the machine is short of physical memory; if the cause is not a program memory leak, you should add memory or move memory-hungry tasks to other machines;

si: the amount of memory swapped in from disk per second. A value persistently above 0 indicates a physical-memory shortage or a memory leak; find the memory-hogging process and deal with it. My machine has plenty of memory and everything is normal.

so: the amount of memory swapped out to disk per second. If the value is greater than 0, same as above;

bi: blocks received from block devices per second (block devices here means all disks and other block devices on the system; the default block size is 1024 bytes). My machine has no I/O at the moment, so this stays 0, but on a machine copying and processing a large amount of data (2-3 TB) I have seen it reach 140000/s, a disk write rate of roughly 140 MB per second;

bo: blocks sent to block devices per second. Reading a file, for example, makes bo greater than 0. bi and bo are normally close to 0; if not, I/O is too frequent and needs tuning;

in: CPU interrupts per second, including timer interrupts;

cs: context switches per second. Calling a system function, for example, requires a context switch, as do thread and process switches. The smaller this value the better; if it is too large, consider lowering the number of threads or processes. On web servers such as apache and nginx, concurrency in performance tests typically reaches thousands or even tens of thousands; reduce the worker process or thread count step by step until cs drops to a reasonably small value, and that process/thread count is a suitable setting. The same goes for system calls: every call into a system function enters kernel space, which causes a context switch. This is expensive, so avoid calling system functions frequently. Too many context switches mean the CPU spends most of its time switching contexts rather than doing real work, so effective CPU utilization drops.
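The system-wide counter behind vmstat's cs column is the ctxt line in /proc/stat; sampling it twice gives the per-second rate (a minimal sketch, mine rather than the article's):

```shell
#!/bin/sh
# vmstat's "cs" is the per-second delta of the kernel's cumulative
# context-switch counter, the "ctxt" line in /proc/stat.
c1=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
cs=$((c2 - c1))
echo "context switches in the last second: $cs"
```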

st: CPU time stolen by other tenants in a virtualized environment;

Mpstat:
The multiprocessor statistics tool; reports statistics for each CPU.

[root@zbredis-30104 ~]# mpstat -P ALL 1
Linux 2.6.32-573.el6.x86_64 (zbredis-30104)     09/14/2017      _x86_64_        (12 CPU)

03:14:03 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
03:14:04 PM  all    0.00    0.00    0.08    0.00    0.00    0.00    0.00    0.00   99.92
03:14:04 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:14:04 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:14:04 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:14:04 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:14:04 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:14:04 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:14:04 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:14:04 PM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:14:04 PM    8    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:14:04 PM    9    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:14:04 PM   10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:14:04 PM   11    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Tip:

irq: CPU time spent servicing hardware interrupts;

soft: CPU time spent servicing software interrupts (softirqs);
steal: time spent servicing other tenants (stolen by the hypervisor);
guest: time spent running guest virtual machines;

The important columns are %usr, %sys, and %idle; they show each CPU's usage and the ratio of user time to kernel time. Use them to spot CPUs running at 100% utilization (%usr + %sys) while others are not saturated, which may indicate a single-threaded application's load or a device interrupt mapped to one CPU.

Sar:

System activity reporter. It observes current activity and can be configured to archive and report historical statistics; essentially all resource usage information can be viewed with it. The main options are:

-A: the sum of all reports, equivalent to "-bBdqrRSuvwWy -I SUM -I XALL -n ALL -u ALL -P ALL";
-b: I/O and transfer-rate statistics (buffer usage on legacy sar versions);
-B: paging statistics;
-d: block-device (disk) usage report;
-r: memory and swap-space utilization;
-g: serial-port I/O;
-a: file read/write activity;
-c: system-call activity;
-n: network statistics;
-q: queue length and load averages;
-R: process activity;
-y: terminal (TTY) device activity;
-W: swapping statistics;
-x {pid | SELF | ALL}: statistics for the given process ID; SELF reports on the sar process itself, ALL on all system processes;
Note: some of these options (-g, -a, -c, -R) come from legacy System V sar and are not present in Linux sysstat.

Common parameter combinations:

View CPU:

Overall CPU statistics: sar -u 3 2 (sample every 3 seconds, 2 samples);
Per-CPU statistics: sar -P ALL 1 1 (sample every 1 second, 1 sample);

1. If %iowait is high, the disks have an I/O bottleneck;
2. If %idle is high but the system responds slowly, the CPU may be waiting for memory allocation; in this case, increase memory capacity;
3. If %idle stays persistently low (for example, below 10), the system's CPU capacity is low, and the CPU is the most important resource to address;
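sar and mpstat derive these percentages from /proc/stat. As a minimal sketch of the same calculation (mine, not from the article), sample the aggregate cpu line twice, one second apart, and compute %idle from the deltas:

```shell
#!/bin/sh
# The "cpu" line of /proc/stat holds cumulative jiffies:
# user nice system idle iowait irq softirq steal ...
# %idle over an interval = 100 * delta(idle) / delta(total).
read_cpu() { awk '/^cpu / {print $5, $2+$3+$4+$5+$6+$7+$8+$9}' /proc/stat; }
set -- $(read_cpu); idle1=$1 total1=$2
sleep 1
set -- $(read_cpu); idle2=$1 total2=$2
pct_idle=$(( 100 * (idle2 - idle1) / (total2 - total1) ))
echo "%idle over 1s: $pct_idle"
```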

View memory:

View memory usage: sar -r 1 2

kbcommit: the amount of memory the current workload requires, i.e., the memory (RAM + swap) needed to guarantee there is no overflow;
%commit: kbcommit as a percentage of total memory (RAM + swap);
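kbcommit corresponds to the kernel's Committed_AS counter; you can read it, and the overcommit ceiling, directly from /proc/meminfo (my illustration, not part of the original article):

```shell
#!/bin/sh
# Committed_AS: memory committed (allocated) by all current workloads;
# CommitLimit: the overcommit ceiling (swap plus a fraction of RAM).
committed=$(awk '/^Committed_AS/ {print $2}' /proc/meminfo)
limit=$(awk '/^CommitLimit/ {print $2}' /proc/meminfo)
echo "Committed_AS=${committed}kB CommitLimit=${limit}kB"
```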

Pidstat: used to monitor system resources occupied by all or specified processes, such as CPU, memory, device IO, task switching, and threads.

CPU usage statistics
Run pidstat -u (equivalent to running pidstat with no options)
Memory usage statistics
pidstat -r -p PID 1

minflt/s: minor page faults per second. A minor fault occurs when the virtual address can be mapped to a physical page without reading from disk;
majflt/s: major page faults per second. A major fault occurs when the page must be read from disk (for example, it resides in swap); these are generally a sign of insufficient memory;
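Both counters are kept per process; for illustration (mine, not the article's), you can read the running totals for the current shell from /proc, which is where pidstat gets its data before converting them to per-second rates:

```shell
#!/bin/sh
# Fields 10 and 12 of /proc/<pid>/stat are the cumulative minflt and
# majflt counts. (Caveat: a comm name containing spaces would shift the
# fields; for a shell named "sh" or "bash" this is safe.)
minflt=$(awk '{print $10}' /proc/$$/stat)
majflt=$(awk '{print $12}' /proc/$$/stat)
echo "minflt=$minflt majflt=$majflt"
```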
I/O statistics
pidstat -d 1 2

CPU Optimization

1. Compiler Optimization
2. Scheduling priority and scheduling class (setting the nice value)
For example: nice -n 19 command
renice changes the priority of a running process;
chrt displays and directly modifies the scheduling priority and policy;
3. Process binding (a process can be bound to one or more CPUs)
For example: taskset -pc 0-3 10790 (bind process 10790 to CPUs 0-3)
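A hedged sketch of inspecting affinity (the PID 10790 above is just an example; this assumes util-linux taskset may or may not be installed, so it falls back to /proc, which always has the mask):

```shell
#!/bin/sh
# Show which CPUs the current shell may run on. Cpus_allowed_list in
# /proc/<pid>/status is the kernel's view of the affinity mask; taskset
# -pc prints the same information when available.
allowed=$(awk '/^Cpus_allowed_list/ {print $2}' /proc/$$/status)
echo "Cpus_allowed_list: $allowed"
if command -v taskset >/dev/null 2>&1; then
    taskset -pc $$
fi
```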

4. Exclusive CPU
5. BIOS Optimization

Enable turbo boost

Source: https://www.cnblogs.com/yangxiaoyi/p/7532920.html

Reference: https://www.cnblogs.com/liyongsan/p/6922824.html
