The Concept of Context Switching
Multi-threading comes up constantly in high-performance programming. At first, we assume that several threads running in parallel must be faster than a single thread, just as several people working together finish sooner than one person. In reality, threads compete with one another for IO devices or for locks, and the result can be slower than single-threaded execution. A concept that comes up often in this discussion is the context switch.
For a precise definition of context switching, see http://www.linfo.org/context_switch.html. Here is a brief introduction. A multitasking system usually has to run several jobs at once, and the number of jobs is often larger than the number of CPUs in the machine. Yet a single CPU can execute only one task at any moment, so how does the user get the impression that these tasks run simultaneously? Operating system designers use time slicing: the CPU serves each task for a certain amount of time, then saves the state of the current task, loads the state of the next task, and continues with that one. This saving and reloading of task state is called a context switch. Time slicing makes it possible to run multiple tasks on one CPU, but it also incurs the direct cost of saving and restoring that state.
(Note: more precisely, a context switch imposes both direct and indirect costs on program performance. The direct costs are that CPU registers must be saved and loaded, the scheduler code must run, TLB entries must be reloaded, and the CPU pipeline must be flushed. The indirect cost is the sharing of data between the caches of multiple cores, and its impact on the program depends on the size of the data the thread works on.)
On Linux, you can use vmstat to observe the number of context switches. Running the command gives:
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff   cache   si   so    bi    bo    in    cs us sy id wa
 1  0      0 4593944 453560 1118192    0    0     ?     ?   238     ?  6  1  ?  1
 0  0      0 4593212 453568 1118816    0    0     0     ?   958  1108  4  1 94  2
 0  0      0 4593360 453568 1118456    0    0     0     0   895  1044  3  1  ?  0
 1  0      0 4593408 453568 1118456    0    0     0     0   929  1073  4  1  ?  0
 0  0      0 4593496 453568 1118456    0    0     0     0  1133  1363  6  1  ?  0
 0  0      0 4593568 453568 1118476    0    0     0     0   992  1190  4  1  ?  0
vmstat 1 prints statistics once per second, and the cs column is the number of context switches per second. On a mostly idle system, this is typically about 1500 per second or less.
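The same per-second figure can be approximated from inside a Java program. The sketch below assumes a Linux system where /proc/stat contains a cumulative "ctxt" line (the system-wide context switch count since boot); the class name CtxtRate and its methods are made up for this illustration. It samples the counter twice, one second apart, and prints the difference.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

final class CtxtRate {
    // Extracts the cumulative context switch count from the lines of /proc/stat.
    static long parseCtxt(List<String> lines) {
        for (String line : lines) {
            if (line.startsWith("ctxt ")) {
                return Long.parseLong(line.substring(5).trim());
            }
        }
        throw new IllegalArgumentException("no ctxt line found");
    }

    // Reads the counter from the given file (normally /proc/stat).
    static long readCtxt(Path procStat) throws IOException {
        return parseCtxt(Files.readAllLines(procStat));
    }

    public static void main(String[] args) throws Exception {
        Path procStat = Paths.get("/proc/stat");   // assumption: Linux procfs
        if (!Files.isReadable(procStat)) {
            System.out.println("/proc/stat is not available on this system");
            return;
        }
        long before = readCtxt(procStat);
        Thread.sleep(1000);                         // one-second sampling window
        long after = readCtxt(procStat);
        System.out.println("context switches in the last second: " + (after - before));
    }
}
```

Because the counter is system-wide, the number includes every process on the machine, just like the cs column of vmstat.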
For the preemptive operating systems we commonly use, context switches happen for several reasons:
1. The current task's time slice runs out, and the scheduler dispatches the next task as usual.
2. The current task blocks on IO; the scheduler suspends it and moves on to the next task.
3. Several tasks contend for a lock; the current task fails to acquire it, is suspended by the scheduler, and the next task runs.
4. User code suspends the current task and gives up its CPU time.
5. A hardware interrupt occurs.

Some time ago I came across someone using futex wait and wake to measure the direct cost of a context switch (link), and someone else using blocking IO to measure the cost of a context switch (link). So how can a Java program test and observe the cost of context switches?
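The causes listed above fall roughly into two groups that Linux counts separately per task: voluntary switches (the task blocks or yields, as in causes 2 to 4) and nonvoluntary switches (the task is preempted, as in cause 1). The following is a minimal sketch that parses those two counters from a status file. It assumes Linux procfs and the field names voluntary_ctxt_switches and nonvoluntary_ctxt_switches; the path /proc/thread-self/status is an assumption that requires a reasonably recent kernel, and the class name CtxtSwitchKinds is hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

final class CtxtSwitchKinds {
    // Returns {voluntary, nonvoluntary} parsed from the lines of a Linux
    // /proc/<pid>/status-style file. A counter that is absent is reported as -1.
    static long[] parse(List<String> statusLines) {
        long voluntary = -1;
        long nonvoluntary = -1;
        for (String line : statusLines) {
            if (line.startsWith("voluntary_ctxt_switches:")) {
                voluntary = Long.parseLong(line.substring(line.indexOf(':') + 1).trim());
            } else if (line.startsWith("nonvoluntary_ctxt_switches:")) {
                nonvoluntary = Long.parseLong(line.substring(line.indexOf(':') + 1).trim());
            }
        }
        return new long[] {voluntary, nonvoluntary};
    }

    public static void main(String[] args) throws Exception {
        // Assumption: /proc/thread-self refers to the calling thread.
        Path status = Paths.get("/proc/thread-self/status");
        if (!Files.isReadable(status)) {
            System.out.println(status + " is not available on this system");
            return;
        }
        long[] counts = parse(Files.readAllLines(status));
        System.out.println("voluntary:    " + counts[0]);
        System.out.println("nonvoluntary: " + counts[1]);
    }
}
```

A thread that spends its time parking and unparking would be expected to accumulate mostly voluntary switches, while a CPU-bound thread on a busy machine would accumulate mostly nonvoluntary ones.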
Example analysis
I ran a small experiment with very simple code and two worker threads. At the start, the first thread parks (suspends) itself. The second thread wakes the first and then parks itself; the first thread, once awake, wakes the second and parks itself again. The two keep going back and forth, waking each other and parking themselves. The code is as follows:
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

public final class ContextSwitchTest {
    static final int RUNS = 3;
    static final int ITERATES = 1000000;
    // Holds the thread that took the most recent turn.
    static AtomicReference<Thread> turn = new AtomicReference<Thread>();

    static final class WorkerThread extends Thread {
        volatile Thread other;
        volatile int nparks;

        public void run() {
            final AtomicReference<Thread> t = turn;
            final Thread other = this.other;
            if (turn == null || other == null)
                throw new NullPointerException();
            int p = 0;
            for (int i = 0; i < ITERATES; ++i) {
                // Take the turn only if the other thread had it last;
                // otherwise park until the other thread wakes us.
                while (!t.compareAndSet(other, this)) {
                    LockSupport.park();
                    ++p;
                }
                LockSupport.unpark(other);
            }
            LockSupport.unpark(other);
            nparks = p;
            System.out.println("parks: " + p);
        }
    }

    static void test() throws Exception {
        WorkerThread a = new WorkerThread();
        WorkerThread b = new WorkerThread();
        a.other = b;
        b.other = a;
        turn.set(a);
        long startTime = System.nanoTime();
        a.start();
        b.start();
        a.join();
        b.join();
        long endTime = System.nanoTime();
        int parkNum = a.nparks + b.nparks;
        System.out.println("Average time: " + ((endTime - startTime) / parkNum)
                + "ns");
    }

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < RUNS; i++) {
            test();
        }
    }
}
After compiling, I ran several rounds of the test on my own laptop (Intel(R) Core(TM) i5 CPU M 460 @ 2.53GHz, 2 cores, 3M L3 cache). The results were as follows:
$ java -cp . ContextSwitchTest
parks: 953495
parks: 953485
parks: 936305
parks: 936302
parks: 965563
parks: 965560
Average time: 13261ns
A plain for loop of a million iterations, executed linearly, finishes very quickly, in well under 1 second, whereas this program takes tens of seconds to run. Each context switch costs more than 10 µs, which can have a significant impact on program throughput.
Meanwhile, we can run vmstat 1 to check whether the context switch rate has gone up:
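To make the single-threaded comparison concrete, here is a minimal sketch of a baseline: one thread performs the same number of compareAndSet calls on an AtomicReference, but with no partner thread, so nothing ever parks. The class name SingleThreadBaseline and the alternating-token scheme are assumptions made for this illustration and are not part of the original experiment. No timing figure is quoted because it depends on the machine and on JIT warm-up.

```java
import java.util.concurrent.atomic.AtomicReference;

final class SingleThreadBaseline {
    static final int ITERATES = 1000000;

    // Flips the reference between two tokens ITERATES times on one thread.
    // Returns the number of successful compareAndSet calls.
    static int runLoop() {
        Object a = new Object();
        Object b = new Object();
        AtomicReference<Object> turn = new AtomicReference<Object>(a);
        int swaps = 0;
        for (int i = 0; i < ITERATES; ++i) {
            Object expected = (i % 2 == 0) ? a : b;
            Object next = (i % 2 == 0) ? b : a;
            if (turn.compareAndSet(expected, next)) {
                ++swaps;
            }
        }
        return swaps;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int swaps = runLoop();
        long elapsedNs = System.nanoTime() - start;
        System.out.println("swaps: " + swaps);
        System.out.println("elapsed: " + (elapsedNs / 1000000) + " ms");
    }
}
```

Comparing the elapsed time of this loop with the two-thread program gives a rough idea of how much of the run time goes to parking and waking rather than to the atomic operations themselves.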
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff   cache   si   so    bi    bo     in     cs us sy id wa
 1  0      0 4424988 457964 1154912    0    0     ?     ?    252      ?  6  1  ?  1
 0  0      0 4420452 457964 1159900    0    0     0     0   1586   2069  6  1  ?  0
 1  0      0 4407676 457964 1171552    0    0     0     0   1436   1883  8  3  ?  0
 1  0      0 4402916 457964 1172032    0    0     0    84  22982  45792  9  4  ?  2
 1  0      0 4416024 457964 1158912    0    0     0     0  95382 198544 17  ?  ?  0
 1  1      0 4416096 457964 1158968    0    0     0     ?  79973 159934  ?  ?  ?  0
 1  0      0 4420384 457964 1154776    0    0     0     0  96265 196076 10  ?  ?  ?
 1  0      0 4403012 457972 1171096    0    0     0     ? 104321 213537  ?  ?  ?  ?
Next, use strace to see which system call behind Unsafe.park() in the program above is causing the context switches:
$ strace -f java -cp . ContextSwitchTest
[pid  5969] futex(0x9571a9c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9571a98, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid  5968] <... futex resumed> ) = 0
[pid  5969] futex(0x9571ad4, FUTEX_WAIT_PRIVATE, 949, NULL <unfinished ...>
[pid  5968] futex(0x9564368, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  5968] futex(0x9571ad4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9571ad0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
[pid  5969] <... futex resumed> ) = 0
[pid  5968] <... futex resumed> ) = 1
[pid  5969] futex(0x9571628, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
Sure enough, it's futex.
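For readers less familiar with LockSupport, the following minimal sketch isolates the park/unpark handshake that, according to the strace output above, ends up in futex calls on this system. The class name ParkDemo and the ready flag are made up for this illustration. The loop around park() guards against spurious wakeups, and an unpark() that arrives before the park() is not lost because it leaves a permit behind.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

final class ParkDemo {
    // Starts a thread that parks until a flag is set, then wakes it.
    // Returns true if the parked thread finished within the timeout.
    static boolean parkAndWake() throws InterruptedException {
        final AtomicBoolean ready = new AtomicBoolean(false);
        Thread sleeper = new Thread(new Runnable() {
            public void run() {
                // park() may return spuriously, so re-check the condition.
                while (!ready.get()) {
                    LockSupport.park();
                }
            }
        });
        sleeper.start();
        ready.set(true);              // publish the condition first
        LockSupport.unpark(sleeper);  // then wake the thread (or leave a permit)
        sleeper.join(5000);
        return !sleeper.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("sleeper woke up: " + parkAndWake());
    }
}
```

Running this under strace -f should show whether park and unpark map to futex wait and wake calls on a given JVM and kernel; that mapping is an implementation detail rather than something the Java API promises.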
Then use perf to see how context switching affects the cache:
$ perf stat -e cache-misses java -cp . ContextSwitchTest
parks: 999999
parks: 1000000
parks: 998930
parks: 998926
parks: 998034
parks: 998204

 Performance counter stats for 'java -cp . ContextSwitchTest':

         2,550,605 cache-misses

      90.221827008 seconds time elapsed
That is more than 2.55 million cache misses in about 1.5 minutes (90 seconds), or roughly 28,000 per second.
This post is getting long, so I will stop here. The next few posts will continue with some more interesting topics:
(1) Memory barriers from a Java perspective
(2) CPU affinity from a Java perspective
And more. Stay tuned.
PS. I did in fact run one more experiment, to test the effect of CPU affinity on context switches:
$ taskset -c 0 java -cp . ContextSwitchTest
parks: 992713
parks: 1000000
parks: 978428
parks: 1000000
parks: 989897
parks: 1000000
Average time: 2214ns
This command pins the process to CPU 0. As a result, the cost of a context switch drops by nearly an order of magnitude, from 13261 ns to 2214 ns. What is the reason?
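Java has no standard API for setting CPU affinity, so one way to repeat the pinned run from a Java harness is to launch the test through taskset. The sketch below only builds the command line shown above and, when asked, starts it with ProcessBuilder. The class name PinnedLauncher and the "run" argument are hypothetical, and the sketch assumes taskset and java are on the PATH and ContextSwitchTest.class is in the current directory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PinnedLauncher {
    // Builds the argument list for: taskset -c <cpu> <command...>
    static List<String> pinnedCommand(int cpu, String... command) {
        List<String> full = new ArrayList<String>();
        full.add("taskset");
        full.add("-c");
        full.add(Integer.toString(cpu));
        full.addAll(Arrays.asList(command));
        return full;
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = pinnedCommand(0, "java", "-cp", ".", "ContextSwitchTest");
        System.out.println(String.join(" ", cmd));
        // Only launch when explicitly requested, since taskset is Linux-specific.
        if (args.length > 0 && args[0].equals("run")) {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            System.out.println("exit code: " + p.waitFor());
        }
    }
}
```

Changing the CPU number, or comparing against a run without taskset, makes it easy to repeat the comparison on other machines.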