1 Background: CPU hotspot analysis and positioning
CPU resources are still very expensive. To get a feel for just how expensive, look at current prices for CPU resources:
So for us programmers, it is essential that our programs use CPU resources reasonably and efficiently, solving the real problems we face with limited CPU resources. That is why we optimize programs as much as possible.
This introduction stays at the micro level; it does not cover macro-level topics such as data-center capacity monitoring, management and scheduling (OpenStack, Kubernetes, etc.), or migration (manual or automatic, cold or hot).
This article organizes its table of contents by working backward from the problem: what to do about it, why to do it that way, and how to prevent it as much as possible later.
- CPU hotspot occupancy analysis and positioning
- What is a CPU
- Process scheduling in the Linux kernel
- Performance metrics that reflect CPU usage
- How JVM threads map to Linux processes
- Java concurrency programming
2 CPU hotspot occupancy analysis and positioning
Of course, you can also use various power tools to assist with positioning, such as JProfiler.
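Before reaching for an external profiler, a first cut can come from inside the JVM itself. Below is a minimal sketch, assuming a JVM where thread CPU-time measurement is supported, that ranks live threads by accumulated CPU time via the standard `ThreadMXBean` API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Minimal sketch: list live threads with their accumulated CPU time,
// a lightweight alternative to an external profiler for spotting hot threads.
public class HotThreads {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            System.err.println("thread CPU time not supported on this JVM");
            return;
        }
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            long cpuNanos = mx.getThreadCpuTime(id);   // -1 if unavailable
            if (info != null && cpuNanos > 0) {
                System.out.printf("%-30s cpu=%d ms%n",
                        info.getThreadName(), cpuNanos / 1_000_000);
            }
        }
    }
}
```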
3 What is a CPU
CPU: the component that interprets computer instructions and processes the data in computer software.
The performance of a computer is largely determined by the performance of its CPU, and CPU performance is mainly reflected in how fast it runs programs. The metrics that affect running speed include the CPU's clock frequency, cache capacity, instruction set, and logical architecture.
Clock frequency: the main frequency, also called the clock frequency, is measured in megahertz (MHz) or gigahertz (GHz) and indicates the speed at which the CPU operates and processes data. Typically, the higher the clock frequency, the faster the CPU processes data.
CPU clock frequency = external frequency × multiplier. There is a relationship between clock frequency and actual computing speed, but it is not a simple linear one, so the clock frequency does not directly determine the CPU's computing power; it only represents the oscillation rate of the CPU's digital pulse signal. Actual computing speed also depends on the CPU's pipeline, buses, and other performance characteristics.
External frequency: the CPU's reference frequency, measured in MHz. The CPU's external frequency determines the running speed of the entire motherboard. Generally speaking, on desktop machines "overclocking" means raising the CPU's external frequency (the multiplier being locked), which should be easy to understand. For server CPUs, however, overclocking is absolutely not allowed: as just said, the external frequency determines the motherboard's running speed and the two run synchronously, so overclocking a server CPU by changing the external frequency would produce asynchronous operation (many desktop motherboards do support asynchronous operation) and make the whole server system unstable.
Bus frequency: the front-side bus (FSB) is the bus that connects the CPU to the north-bridge chip. The front-side bus frequency, i.e. the bus frequency, directly affects the speed of data exchange between the CPU and memory. It can be computed with the formula data bandwidth = (bus frequency × data bus width) / 8; the maximum data-transfer bandwidth depends on the width of the bus and its transmission frequency.
The difference between the external frequency and the front-side bus frequency: the front-side bus frequency refers to the speed of data transmission, while the external frequency is the speed at which the CPU runs synchronously with the motherboard. A 100 MHz external frequency means the digital pulse signal oscillates 100 million times per second; a 100 MHz front-side bus means the CPU can accept data at 100 MHz × 64 bit ÷ 8 bit/byte = 800 MB/s.
Cache size is another important CPU metric, and the structure and size of the cache have a large impact on CPU speed. The CPU cache runs at a very high frequency, generally at the same frequency as the processor, which makes it far more efficient than system memory or the hard disk. In actual work the CPU often needs to read the same blocks of data repeatedly, so a larger cache greatly raises the internal hit rate for data reads, sparing the CPU a trip to memory or disk and improving system performance. Due to CPU die area and cost considerations, however, caches are very small.
The L1 cache (level 1 cache) is the CPU's first-level cache, divided into a data cache and an instruction cache. The capacity and structure of the built-in L1 cache have a large impact on CPU performance, but because it is built from static RAM with a fairly complex structure, the L1 capacity cannot be made too large without making the CPU die too large. Server CPUs usually have 32-256 KB of L1 cache.
The L2 cache (level 2 cache) is the CPU's second-level cache and can be either on-chip or external. An on-chip L2 cache runs at the same speed as the core clock, while an external L2 cache runs at only half of it. L2 capacity also affects CPU performance, and the rule of thumb is the bigger the better: home CPUs used to top out at 512 KB, laptop CPUs can reach 2 MB, and the CPUs used in servers and workstations go higher still, reaching 8 MB or more.
The L3 cache (level 3 cache) comes in two kinds; the earlier kind was external. It reduces memory latency and improves the processor's performance on large-data-volume computations, both of which are useful for games. In the server domain, a larger L3 cache still brings a significant performance boost: a configuration with a larger L3 cache uses physical memory more efficiently, so the comparatively slow disk I/O subsystem can serve more data requests. Processors with larger L3 caches also provide more efficient file-system caching and shorter message and processor queue lengths.
What is the relationship among the CPU counts we often talk about?
| Term | Description | How to view |
| --- | --- | --- |
| Number of physical CPUs (sockets) | The CPUs actually plugged into the motherboard; the CPU hardware you can see. | `cat /proc/cpuinfo \| grep "physical id" \| sort \| uniq \| wc -l` |
| Number of CPU cores (cores) | The number of processing units on one physical CPU. In other words, a physical CPU may have multiple cores; the "dual core" and "quad core" of everyday speech refer to these CPU cores. | `cat /proc/cpuinfo \| grep "cpu cores" \| uniq` |
| Hyper-threading (threads) | One CPU core is one physical thread. Intel's hyper-threading technology can simulate two threads on one physical core, making a single core work like two in order to squeeze the most out of CPU performance. | |
| Logical CPUs | Can be simply understood as schedulable processing units. Normally the total number of logical CPUs equals the total number of cores, but with hyper-threading technology one core acts like two, in which case the logical CPU count is twice the core count. | `cat /proc/cpuinfo \| grep "processor" \| wc -l` |
| vCPUs | To the host, a KVM virtual machine is a process, and the VM's vCPUs are threads spawned by that process; different vCPUs are simply different threads running on different physical CPUs. In KVM you can specify the number of sockets, cores, and threads, e.g. `-smp 5,sockets=5,cores=1,threads=1` yields 5×1×1 = 5 vCPUs. The guest sees CPU cores according to the vCPUs, while each vCPU is scheduled onto the physical CPU cores by Linux as an ordinary QEMU thread (lightweight process). A user can bind a VM's vCPUs to specific physical CPUs so that they are scheduled only on those CPUs, isolating the vCPUs and improving VM performance; if no binding is done, the VM's vCPUs can be scheduled on all physical CPUs. (libvirt's cputune offers fine-grained vCPU binding down to each individual vCPU, plus vCPU capability settings such as quota, period, and shares that can be used to implement CPU QoS.) | |
Running ps on the host shows that a KVM virtual machine corresponds to a single host process, and the different vCPUs are in fact different threads spawned by that process.
vCPU affinity in virtualized scenarios:
Practice:
1. How do you view the CPU clock frequency, bus frequency, external frequency, and cache sizes on Linux? (A sketch covering items 1 and 2 follows this list.)
2. How do you view the number of CPUs and cores on Linux, whether hyper-threading is enabled, and the number of hyper-threads?
3. How do you set vCPU affinity under KVM virtualization?
4. How do you set CPU affinity for a Docker container?
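For practice items 1 and 2, here is a minimal Java sketch, assuming a Linux host with the standard x86 `/proc/cpuinfo` field names, that derives the same numbers as the shell one-liners in the table above:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Minimal sketch: parse /proc/cpuinfo for sockets, cores, logical CPUs,
// clock frequency, and cache size, mirroring the table's shell commands.
public class CpuInfo {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/proc/cpuinfo"));

        long sockets = lines.stream()
                .filter(l -> l.startsWith("physical id")).distinct().count();
        long logical = lines.stream()
                .filter(l -> l.startsWith("processor")).count();
        long coresPerSocket = lines.stream()
                .filter(l -> l.startsWith("cpu cores"))
                .map(l -> Long.parseLong(l.split(":")[1].trim()))
                .findFirst().orElse(0L);

        // Clock frequency and cache size are exposed per logical CPU;
        // print the first occurrence of each.
        lines.stream()
                .filter(l -> l.startsWith("cpu MHz") || l.startsWith("cache size"))
                .limit(2)
                .forEach(System.out::println);

        // Hyper-threading is on when logical CPUs exceed sockets * cores/socket.
        System.out.printf("sockets=%d cores/socket=%d logical=%d ht=%b%n",
                sockets, coresPerSocket, logical,
                logical > sockets * coresPerSocket);
    }
}
```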
4 Performance metrics that reflect CPU usage
Linux computes CPU occupancy from the contents of /proc/stat. Under Linux/Unix, CPU utilization is divided into user time, system time, and idle time: respectively, the time the CPU spends executing in user mode, the time spent executing in the system kernel, and the time the idle system process executes.
| Raw metric | Description | Note |
| --- | --- | --- |
| User time | The time the CPU spends executing user processes, including nice time. Generally, the higher the user-space CPU share, the better. | |
| System time | The time the CPU spends running in the kernel, including IRQ and softirq time. High system CPU utilization indicates a bottleneck somewhere in the system; usually, the lower this value, the better. | |
| Waiting time (iowait) | The time the CPU spends waiting for I/O operations to complete. The system should not spend a large amount of time waiting for I/O; otherwise I/O is a bottleneck. | What exactly is this? |
| Idle time | The time the system is idle, waiting for a process to run. | What exactly is this? |
| Nice time | The time the CPU spends running processes whose priority has been adjusted with nice. | What exactly is this? |
| Hard interrupt time (irq time) | The time the system spends handling hard interrupts. | Why is this a metric? |
| Soft interrupt time (softirq time) | The time the system spends handling soft interrupts. | Why is this a metric? |
The above are Linux's raw (atomic) CPU metrics. The occupancy rates we actually talk about are statistical metrics derived from them, interpreted and calculated as follows:
| Statistical metric | Description | Formula |
| --- | --- | --- |
| CPU time | In the kernel, Hz is the fixed number of clock interrupts the system clock produces per second. Hz can be configured before the kernel is compiled, so the current system's clock-interrupt frequency can be checked with: `cat /boot/config-$(uname -r) \| grep CONFIG_HZ`. A tick is the interval between two consecutive clock interrupts, i.e. (1/Hz) seconds. jiffies counts the number of ticks since the system started; the variable is incremented by one on every system clock interrupt. | user + system + nice + idle + iowait + irq + softirq + steal |
| User-mode CPU usage | The user-mode CPU usage covers the CPU running regular user processes, niced processes, and real-time processes. A Linux process can execute in user mode or in system (kernel) mode: when a process runs kernel code we say it is in kernel mode, and when it executes the user's own code it is in user mode. In user mode, the process executes its own application code and needs no kernel resources for computation, memory management, or setting variables. | (user time + nice time) / CPU time × 100% |
| Kernel-mode CPU usage | The percentage of CPU time spent in system mode, including CPU consumed by kernel processes (kprocs) and other processes that need access to kernel resources. The system uses CPU for system calls, I/O management (interrupts and drivers), memory management (paging and swapping), and process management (context switches and process startup). If a process needs kernel resources, it must make a system call and switch into system mode for the resource to become available. | (system time + irq time + softirq time) / CPU time × 100% |
| Idle rate | If no thread is runnable (the run queue is empty), the system dispatches a thread called wait, which can be regarded as an idle kproc. If a ps report shows a high total time for this thread, there were long periods when no other thread was ready to run or waiting to execute on the CPU; the system was spending most of its time idle or waiting for new tasks. | idle time / CPU time × 100% |
Several terms came up above: user mode, kernel mode, interrupts. They involve some Linux kernel knowledge, mainly around process scheduling.
Practice:
- 1. View the raw CPU usage metrics in Linux
- 2. View the statistical CPU usage metrics in Linux (a sketch follows this list)
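As a sketch of both practice items, the following assumes the standard first-line layout of /proc/stat (`cpu user nice system idle iowait irq softirq steal ...`) and applies the formulas from the table above over a one-second sampling interval:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch: sample /proc/stat twice and compute user/system/idle
// percentages using the statistical formulas from the table above.
public class CpuUsage {
    // Returns the first 8 counters after "cpu": user..steal.
    private static long[] sample() throws Exception {
        String[] f = Files.readAllLines(Paths.get("/proc/stat"))
                .get(0).trim().split("\\s+");
        long[] v = new long[8];
        for (int i = 0; i < 8; i++) v[i] = Long.parseLong(f[i + 1]);
        return v;
    }

    public static void main(String[] args) throws Exception {
        long[] a = sample();
        Thread.sleep(1000);                     // 1 s sampling interval
        long[] b = sample();
        long[] d = new long[8];
        long total = 0;                         // CPU time = sum of all fields
        for (int i = 0; i < 8; i++) { d[i] = b[i] - a[i]; total += d[i]; }
        System.out.printf("user%%=%.1f sys%%=%.1f idle%%=%.1f%n",
                100.0 * (d[0] + d[1]) / total,          // (user + nice) / cpu
                100.0 * (d[2] + d[5] + d[6]) / total,   // (system + irq + softirq) / cpu
                100.0 * d[3] / total);                  // idle / cpu
    }
}
```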
5 Process scheduling in the Linux kernel
To explain what came above: the CPU does the actual computation, but the CPU is hardware, and upper-layer applications do not interact with it directly. They go through the operating system; that is, applications actually interact with the operating system via instructions, and inside the operating system the kernel interacts with the CPU through device drivers to complete the computational tasks.
For Linux, the system can be divided into three layers: hardware, kernel, and user space.
- User mode + kernel mode
In general, a process running on the CPU has two modes of operation: user mode and kernel mode (that is, the same process works in both the user state and the kernel state; in kernel mode it is still the same process, unless a process switch occurs). The Linux kernel is essentially built from interrupt and exception handlers. Under normal circumstances the processor executes the user program in user mode; when an interrupt or exception occurs, it switches to privileged mode to execute kernel code, and after the interrupt or exception has been handled it returns to user mode and continues the user program. For example, suppose user process A makes a system call to get the current clock tick count. When the system-call instruction executes in process A, the current register state of the user process (IP, CS, and so on) is saved, execution jumps into kernel space (the kernel code region) to run the system-call function and obtain the tick count, and when that finishes an iret instruction returns to process A (restoring the saved information into the appropriate registers), after which A's instructions continue executing from the address in CS:EIP.
- Linux scheduling policy:
The Linux kernel basically uses a "preemptive priority" approach: while a process runs in user mode, the kernel can, under certain conditions (for example, its time slice runs out or it starts waiting for I/O), take the CPU away whether or not the process yields voluntarily, and dispatch another process to run. Once a process has switched into kernel mode, however, it is not subject to this preemption until it returns to user mode, where it becomes schedulable again. The scheduling strategy in Linux basically inherits UNIX's priority-based scheduling: the kernel computes a priority for every process in the system, which reflects the process's eligibility for CPU use, i.e. high-priority processes run first. The kernel picks the highest-priority process from the ready queue and allocates it a CPU time slice to run. As the current process runs, its priority decreases over time, which creates a "negative feedback" effect: after a while, processes that originally had lower priority are relatively "raised" and get a chance to run. When every process's priority has dropped to 0, all priorities are recalculated.
- Linux scheduling algorithm:
When Linux performs a process schedule, it first examines all processes in the ready queue and selects the one with the highest priority that is resident in memory. If there are real-time processes in the queue, they run first. If the process that most needs to run is not the current one, the current process is suspended and all of the machine state involved, including the program counter and CPU registers, is saved; the saved state of the selected process is then restored so it can run.
- CPU context switch
The intermediate data that a process keeps in the processor's registers while running is called the process's context, so a process switch is essentially a switch between the context of the outgoing process and that of the incoming one. While a process is not occupying the processor, its context is stored in the process's private stack. When the process gets the processor, its context is restored into the processor registers and its resume point is loaded into the processor's program counter (PC); from then on the process runs on the processor, i.e. it occupies the processor. Process switching can be accomplished with interrupt technology.
This is the CPU context switch just described, which is known to be a key factor affecting kernel-mode CPU usage.
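To observe context switches for a running process, one option is to read /proc/&lt;pid&gt;/status, which on Linux exposes voluntary and involuntary context-switch counters. A minimal sketch for the current JVM process:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Minimal sketch: print the context-switch counters of this process.
// High nonvoluntary counts hint at CPU contention (preemption);
// high voluntary counts hint at blocking, e.g. I/O waits.
public class ContextSwitches {
    public static void main(String[] args) throws Exception {
        List<String> status = Files.readAllLines(Paths.get("/proc/self/status"));
        status.stream()
              .filter(l -> l.contains("ctxt_switches"))
              .forEach(System.out::println);
        // Prints e.g.:
        //   voluntary_ctxt_switches:    42
        //   nonvoluntary_ctxt_switches: 3
    }
}
```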
6 How JVM threads map to Linux processes
That covers Linux process scheduling. Since our opening chapter is about analyzing a Java application's hot threads, what exactly is the relationship between processes and Java threads?
Threads in Java are managed by the JVM, and how they correspond to operating-system threads is determined by the JVM implementation. On Linux 2.6, HotSpot uses the NPTL mechanism, and JVM threads have a one-to-one relationship with kernel lightweight processes. Thread scheduling is handed over entirely to the operating-system kernel, although the JVM retains some policies that influence its internal thread scheduling; for example, under Linux, starting a java.lang.Thread invokes a fork/clone to produce the native thread.
On both the Windows and Linux platforms, Java threads are implemented on top of kernel threads. Threads implemented this way are supported directly by the operating-system kernel: thread switching is done by the kernel, which uses its scheduler (Thread Scheduler) to schedule threads and map their tasks onto the individual processors. Each kernel thread can be seen as a clone of the kernel. Programs generally do not use kernel threads directly; instead they use a higher-level interface, the lightweight process (LWP), which is what we commonly call a thread.
Explanation: KLT stands for kernel-level thread, the "kernel clone". Each KLT corresponds to one lightweight process (LWP, also known as a thread) in process P; switching between user mode and kernel mode happens in between, and the Thread Scheduler dispatches the KLTs onto the processor (CPU).
Because kernel threads are used, the program inevitably switches back and forth between user mode and kernel mode on the operating system; and since this is a one-to-one threading model, the number of LWPs that can be supported is limited.
Each thread shares its process's resources (memory address space, file I/O, etc.) and yet can be scheduled independently (the thread is the basic unit of CPU scheduling).
From the above we can tell that Java threads are in effect user-side lightweight processes mapped one-to-one onto kernel threads (and the thread is the basic unit of CPU scheduling).
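A small way to see this one-to-one mapping in practice (a sketch assuming Java 9+ for ProcessHandle): run the program below, then in another terminal compare the LWP ids shown by `top -H -p <pid>` with the hexadecimal nid values printed by `jstack <pid>`.

```java
// Minimal sketch: print the JVM's PID and start a CPU-burning named
// thread so the JVM-thread-to-LWP mapping can be observed from outside.
public class ThreadMapping {
    public static void main(String[] args) throws InterruptedException {
        System.out.println("JVM pid: " + ProcessHandle.current().pid());
        Thread t = new Thread(() -> {
            while (true) { /* burn CPU so the thread stands out in top -H */ }
        }, "hot-thread");
        t.setDaemon(true);
        t.start();
        Thread.sleep(60_000);   // keep the process alive for observation
    }
}
```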
7 Java concurrency programming
When the number of Java threads exceeds the number of CPU hardware threads, the operating system uses its time-slice mechanism and thread-scheduling algorithm to switch threads frequently. The number of active threads is ideally equal to the number of CPUs (cores): too few active threads leave the CPU underutilized, while too many cause excessive thread-context-switch overhead. Note the word active: threads blocked on I/O, sleeping threads, and the like do not consume CPU.
| Type | Description |
| --- | --- |
| I/O-intensive (IO-bound) | IO-bound means the system's CPU performance is much better than that of its disk/memory: most of the time the CPU is waiting on I/O (disk/memory) reads and writes, so CPU load is not high. (CPU-bound is the opposite: disk/memory performance is much better than CPU performance, so the CPU runs at 100% load most of the time; I/O completes quickly while the CPU still has plenty of operations to process, so CPU load is very high.) Most of what we develop today are web applications, which involve a lot of network transfer as well as interaction with databases and caches, all of which is I/O. Once I/O occurs, the thread waits; when the I/O finishes and the data is ready, the thread resumes. For IO-intensive applications we can therefore configure more threads in the thread pool, so that while some threads wait for I/O, others can do useful work, raising concurrent throughput. Can the pool size be set arbitrarily? Of course not: remember that thread context switches come at a cost. A commonly cited formula for IO-intensive applications is: threads = CPU cores / (1 - blocking factor), where the blocking factor is typically in the 0.8~0.9 range (you can take 0.8 or 0.9). |
| Compute-intensive (CPU-bound) | In a multiprogramming system, a program that spends most of its time computing, making logic judgments, and performing other CPU actions is called CPU-bound. In the multi-core era we want every CPU core to take part in the computation so the CPU's performance is fully used; running a single-threaded application on a well-configured server would be a huge waste. Compute-intensive applications depend entirely on the CPU cores doing the work, so to let them play to their full advantage while avoiding excessive thread context switches, the ideal is: threads = CPU cores + 1. It can also be set to the core count; this depends on the JDK version and the CPU configuration (whether the server CPU has hyper-threading). For JDK 1.8, which adds parallel computation, the ideal thread count for compute-intensive work = the CPU's hardware thread count. |
Getting the CPU core count in Java: `Runtime.getRuntime().availableProcessors()`
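A minimal sketch applying both sizing formulas above; the 0.9 blocking factor is an assumed value from the quoted 0.8~0.9 range, and fixed-size pools are just one reasonable choice:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch: size thread pools per the CPU-bound and IO-bound
// formulas discussed above.
public class PoolSizing {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        // Compute-intensive: threads = CPU cores + 1
        ExecutorService cpuBound = Executors.newFixedThreadPool(cores + 1);

        // IO-intensive: threads = CPU cores / (1 - blocking factor)
        double blockingFactor = 0.9;            // assumed; tune per workload
        int ioThreads = (int) (cores / (1 - blockingFactor));
        ExecutorService ioBound = Executors.newFixedThreadPool(ioThreads);

        System.out.printf("cores=%d, cpuBound=%d, ioBound=%d%n",
                cores, cores + 1, ioThreads);
        cpuBound.shutdown();
        ioBound.shutdown();
    }
}
```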
- wait, sleep, notify: blocking, sleeping, waking.
- Java multithreading frameworks: ExecutorService and so on.
- Inter-thread communication: semaphores, memory queues, etc.
- Thread safety: various locks, various concurrent collections, etc.
Java application performance analysis and positioning in the Linux environment: the CPU usage chapter.