Symptoms:
During a load test, we found that at a request load of only 80 tps, CPU usage was very high (on a 24-core machine, utilization on every core soared above 80%), while none of the checkpoints we had set reported any error.
1. The output of the top command was as follows:
2. Understand the logic implemented on the back end. Roughly: after the server receives a request, it requests data from another KV server, takes the returned data, applies personalization based on the user's machine code, and finally returns the result to the client, writing some debug logs along the way.
A check showed that the KV server was normal, which means the problem lay in the service server itself. To narrow it down, use the vmstat command to see where the anomaly is.
3. It is immediately visible that the bi, bo, in, and cs values are all very high. bi and bo relate to disk I/O (blocks read from and written to block devices per second), while in and cs are the system-wide interrupts and context switches per second. Tackle them one at a time, starting with I/O.
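To make the vmstat check repeatable, the four columns can be pulled out of the output and compared against limits. The following is a minimal sketch, not the tooling used in this case: the sample output and the threshold values in LIMITS are hypothetical, chosen only for illustration.

```python
def parse_vmstat(text):
    """Parse vmstat output into a list of dicts keyed by column name."""
    rows = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    # The column-name row is the one that contains both "bi" and "cs".
    header_idx = next(i for i, cols in enumerate(rows) if "bi" in cols and "cs" in cols)
    header = rows[header_idx]
    samples = []
    for cols in rows[header_idx + 1:]:
        # Keep only numeric data rows with the same width as the header.
        if len(cols) == len(header) and all(c.isdigit() for c in cols):
            samples.append(dict(zip(header, map(int, cols))))
    return samples


def flag_high(sample, limits):
    """Return the column names whose value exceeds the given limit."""
    return [name for name, limit in limits.items() if sample[name] > limit]


# Hypothetical vmstat output, not the real capture from this case.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  3      0 812345  10240 402400    0    0  5200 48000 91000 400000 35 45 5 15 0
"""

# Hypothetical thresholds; sensible values depend on the hardware and workload.
LIMITS = {"bi": 1000, "bo": 1000, "in": 50000, "cs": 100000}

samples = parse_vmstat(SAMPLE)
high = flag_high(samples[0], LIMITS)
print(high)
```

On a real machine the text would come from running vmstat with an interval, for example captured with subprocess, instead of the SAMPLE string.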
4. Use the iostat -x command to look at disk reads and writes. Sure enough, the disk was close to being completely blocked.
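A saturated disk usually shows up in the %util column of iostat -x. As a rough sketch, the helper below finds that column by name (column layout differs between sysstat versions) and lists devices at or above a threshold. The sample output and the 90% threshold are assumptions for illustration, not data from this case.

```python
def busy_devices(iostat_text, threshold=90.0):
    """Return {device: %util} for devices whose %util is at or above threshold."""
    header = None
    busy = {}
    for line in iostat_text.splitlines():
        cols = line.split()
        if not cols:
            header = None  # a blank line ends the current device table
            continue
        if cols[0].rstrip(":") == "Device" and "%util" in cols:
            header = cols  # remember the column layout of this table
            continue
        if header is not None and len(cols) == len(header):
            util = float(cols[header.index("%util")])
            if util >= threshold:
                busy[cols[0]] = util
    return busy


# Hypothetical iostat -x output in the older sysstat layout.
SAMPLE = """\
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  1520.00    2.00  890.00    16.00 19280.00    21.63   142.10  158.30   1.12  99.80
sdb               0.00     0.00    0.00    0.50     0.00     4.00     8.00     0.00    0.20   0.20   0.01
"""

saturated = busy_devices(SAMPLE)
print(saturated)
```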
5. A review of the request flow showed that only log writing could cause such frequent disk reads and writes. We turned logging off without hesitation and re-ran the load test.
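The service's own logging code is not shown in this write-up, so the following only illustrates the general idea in Python: when the logger level is raised above DEBUG, debug calls return before anything is written. The logger names and the message text are hypothetical.

```python
import io
import logging


def make_logger(name, stream, debug_enabled):
    """Build a logger writing to `stream`; debug output is optional."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.DEBUG if debug_enabled else logging.WARNING)
    logger.addHandler(logging.StreamHandler(stream))
    return logger


def handle_requests(logger, n):
    """Simulate n requests, each emitting one debug line."""
    for i in range(n):
        logger.debug("request %d: fetched value from KV server", i)


# In-memory streams stand in for the log file on disk.
noisy_out, quiet_out = io.StringIO(), io.StringIO()
handle_requests(make_logger("demo.noisy", noisy_out, True), 1000)
handle_requests(make_logger("demo.quiet", quiet_out, False), 1000)

noisy_lines = len(noisy_out.getvalue().splitlines())
quiet_lines = len(quiet_out.getvalue().splitlines())
print(noisy_lines, quiet_lines)
```

With debug enabled every request produces a write; with the level at WARNING the same calls produce none, which is the effect that turning logging off had on the disk here.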
6. bi and bo dropped back to normal, indicating that the disk problem was solved. But the number of context switches had actually reached 400,000 per second, which is alarming.
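The system-wide context-switch rate can also be cross-checked without vmstat: on Linux, /proc/stat contains a ctxt line holding the total number of context switches since boot, so the rate is the difference between two samples divided by the interval. A minimal sketch follows; the two snapshot strings are hypothetical values chosen to reproduce the 400,000 per second figure.

```python
import time


def read_ctxt(proc_stat_text):
    """Extract the cumulative context-switch counter from /proc/stat text."""
    for line in proc_stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")


def ctxt_per_second(before_text, after_text, interval_s):
    """Context switches per second between two /proc/stat snapshots."""
    return (read_ctxt(after_text) - read_ctxt(before_text)) / interval_s


def measure_live(interval_s=1.0, path="/proc/stat"):
    """Linux only: sample the real counter twice and return the rate."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval_s)
    with open(path) as f:
        after = f.read()
    return ctxt_per_second(before, after, interval_s)


# Hypothetical snapshots taken 5 seconds apart.
BEFORE = "cpu  100 0 50 1000 0 0 0 0 0 0\nctxt 1000000\nprocesses 4200\n"
AFTER = "cpu  180 0 90 1100 0 0 0 0 0 0\nctxt 3000000\nprocesses 4200\n"

rate = ctxt_per_second(BEFORE, AFTER, 5)
print(rate)
```

On the machine under test, measure_live() would return the real rate.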
7. Knowing only that the context-switch count is very large, how can we find out which processes are being switched between?
A script found on the internet counts the top 20 process switches over a given interval and prints them.
#! /usr/bin/env stap
#
#

global csw_count
global idle_count

# Count every context switch, keyed by (previous task, next task).
probe scheduler.cpu_off {
  csw_count[task_prev, task_next]++
  idle_count += idle
}

# Format one switch as "name(pid)->name(pid)".
function fmt_task(task_prev, task_next)
{
  return sprintf("%s(%d)->%s(%d)",
    task_execname(task_prev),
    task_pid(task_prev),
    task_execname(task_next),
    task_pid(task_next))
}

# Print the 20 most frequent switches, then reset the counters.
function print_cswtop() {
  printf("%45s%10s\n", "Context switch", "COUNT")
  foreach ([task_prev, task_next] in csw_count- limit 20) {
    printf("%45s%10d\n", fmt_task(task_prev, task_next), csw_count[task_prev, task_next])
  }
  printf("%45s%10d\n", "idle", idle_count)
  delete csw_count
  delete idle_count
}

# Report every $1 seconds (first command-line argument).
probe timer.s($1) {
  print_cswtop()
  printf("--------------------------------------------------------------\n")
}
Save it as cswmon.stp, then run it with the command stap cswmon.stp 5.
8. This revealed that the discover process was repeatedly switching back and forth with system processes, which consumes a lot of resources.
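Where SystemTap is not available, a coarser per-process view exists in /proc/<pid>/status on Linux, which reports voluntary_ctxt_switches (the task blocked or yielded) and nonvoluntary_ctxt_switches (the scheduler preempted it). The sketch below parses those two fields; the status text and its numbers are hypothetical, not measurements of the discover process.

```python
def parse_ctxt_switches(status_text):
    """Extract the two context-switch counters from /proc/<pid>/status text."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches"):
            counts[key] = int(value.strip())
    return counts


# Hypothetical excerpt of /proc/<pid>/status for illustration only.
STATUS = (
    "Name:\tdiscover\n"
    "Pid:\t12345\n"
    "Threads:\t64\n"
    "voluntary_ctxt_switches:\t1800000\n"
    "nonvoluntary_ctxt_switches:\t200000\n"
)

counts = parse_ctxt_switches(STATUS)
total = sum(counts.values())
voluntary_share = counts["voluntary_ctxt_switches"] / total
print(counts, round(voluntary_share, 2))
```

A high voluntary share suggests threads that block often (on locks or I/O), while a high involuntary share suggests more runnable threads than cores.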
9. An online search turned up some ways to reduce process switching:
The developers then made a change: the number of threads was doubled and kept within a single process.
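The actual code change is not shown here. As an illustration of keeping a fixed, explicit number of threads inside one process, the sketch below uses a bounded thread pool in Python; the WORKERS value and the handle function are hypothetical stand-ins for the real service.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

WORKERS = 8  # hypothetical pool size; the real thread count is not given


def handle(request_id):
    """Stand-in for request handling; returns a result and the worker's name."""
    return request_id * 2, threading.current_thread().name


# All requests are served by at most WORKERS threads within this one process.
with ThreadPoolExecutor(max_workers=WORKERS, thread_name_prefix="worker") as pool:
    results = list(pool.map(handle, range(100)))

values = [value for value, _ in results]
threads_used = {name for _, name in results}
print(len(values), len(threads_used) <= WORKERS)
```

Fixing the pool size makes the thread count a tunable parameter, so its effect on the context-switch rate can be measured run by run.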
We re-ran the load test and found that the number of context switches had dropped to about 250,000 per second.
Throughput now reached about 260 requests per second, far higher than the previous 80, which met the requirement for going live.
However, because the numbers of page faults and context switches are still high, further optimization is needed.
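For reference, the before-and-after figures quoted in this case work out as follows. This is plain arithmetic on the approximate numbers given in the text, nothing more.

```python
# Figures quoted in the case (the "after" values are approximate).
tps_before, tps_after = 80, 260
cs_before, cs_after = 400_000, 250_000

speedup = tps_after / tps_before                    # throughput ratio
cs_reduction = (cs_before - cs_after) / cs_before   # fractional drop in switches

print(f"throughput: {speedup:.2f}x, context switches: -{cs_reduction:.1%}")
```

Throughput rose to 3.25 times its earlier value while the context-switch rate fell by only 37.5%, which is consistent with the conclusion that switching is still worth optimizing.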
Performance load-testing case study: high disk I/O and excessive thread switching