This article focuses on performance issues in multithreaded applications. We will first define performance and scalability and take a closer look at Amdahl's law. Then we will examine different techniques for reducing lock contention and show how to implement them in code.
1. Performance
We all know that multithreading can improve the performance of a program on a machine with a multi-core CPU or multiple CPUs. Each core can work on its own task, so splitting one large task into a series of smaller tasks that run independently of each other can improve overall throughput. For example, consider a program that resizes all images in a folder on the hard disk. With a single thread we can only iterate over the image files and process them one after another; even if the CPU has several cores, only one of them is used. With multiple threads we can instead have a producer thread scan the file system and add each image to a queue, while several worker threads take images from the queue and resize them. If we have as many worker threads as CPU cores, we keep every core busy until all images have been processed.
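A minimal sketch of this producer/worker idea, assuming a hypothetical resize() method and a folder named "images": the file-scanning loop acts as the producer, and a fixed thread pool sized to the number of cores provides the worker threads (the pool's internal work queue plays the role of the queue described above).

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

public class ImageResizer {
    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService workers = Executors.newFixedThreadPool(cores);

        // Producer: scan the folder and submit one resize task per image file.
        try (Stream<Path> files = Files.list(Paths.get("images"))) {
            files.filter(path -> path.toString().endsWith(".jpg"))
                 .forEach(path -> workers.submit(() -> resize(path)));
        }

        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void resize(Path image) {
        // Placeholder for the actual resizing logic.
    }
}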
Multithreading can also improve the overall performance of a program that spends much of its time waiting for IO. Suppose we write a program that crawls all HTML pages of a website and stores them on the local disk. The program starts from one page, parses all links on that page that point to the same site, then crawls those links, and so on. Since every request to the remote site has to wait a while before the page data arrives, we can hand this work to multiple threads: one or more threads parse the HTML pages that have already been received and put the links they find into a queue, while all the other threads request those pages. Unlike the previous example, here we can still gain performance even if we use more threads than there are CPU cores.
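A minimal sketch of this structure under a few assumptions (a hypothetical parseLinks() helper, and with deduplication of visited URLs and termination handling omitted for brevity): discovered links go into a blocking queue, and a pool of downloader threads, deliberately larger than the core count because they mostly wait on the network, takes URLs from that queue.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class Crawler {
    private final BlockingQueue<String> urls = new LinkedBlockingQueue<>();
    private final HttpClient client = HttpClient.newHttpClient();

    public void start(String seedUrl, int downloaderThreads) {
        urls.add(seedUrl);
        ExecutorService pool = Executors.newFixedThreadPool(downloaderThreads);
        for (int i = 0; i < downloaderThreads; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String url = urls.take(); // blocks until a URL is available
                        HttpResponse<String> response = client.send(
                                HttpRequest.newBuilder(URI.create(url)).build(),
                                HttpResponse.BodyHandlers.ofString());
                        // Store response.body() to disk here, then enqueue newly found links.
                        for (String link : parseLinks(response.body())) {
                            urls.offer(link);
                        }
                    }
                } catch (Exception e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    private List<String> parseLinks(String html) {
        return List.of(); // placeholder: extract same-site links from the HTML
    }
}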
The above two examples show that high performance means getting as much work as possible done in a given window of time. That is the classic meaning of the term performance. Beyond that, threads can also be used to improve the responsiveness of a program. Imagine a GUI application with an input field and, below it, a button labeled "Process". When the user presses the button, the application has to re-render the button (it appears pressed, and returns to its original state when the left mouse button is released) and start processing the user's input. If processing the input takes a long time, a single-threaded program cannot keep responding to other user actions, such as mouse clicks or mouse-pointer movements delivered by the operating system; responding to these events requires a separate thread.
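As a minimal sketch of this idea, for example with Swing (the processInput() method standing in for the time-consuming work is hypothetical): the button's listener hands the work to a background thread, so the event-dispatch thread stays free to repaint the button and react to further input.

import java.awt.BorderLayout;
import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JTextField;
import javax.swing.SwingUtilities;

public class ResponsiveUi {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JTextField input = new JTextField(20);
            JButton process = new JButton("Process");

            process.addActionListener(event -> {
                String text = input.getText();
                // Run the long task off the event-dispatch thread.
                new Thread(() -> processInput(text)).start();
            });

            JFrame frame = new JFrame("Demo");
            frame.add(input, BorderLayout.CENTER);
            frame.add(process, BorderLayout.SOUTH);
            frame.pack();
            frame.setVisible(true);
        });
    }

    private static void processInput(String text) {
        // Placeholder for the time-consuming processing.
    }
}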
Scalability means that a program can achieve higher performance when additional computational resources are added. Imagine we need to resize a huge number of images; since the number of CPU cores on our machine is limited, increasing the number of threads does not always improve performance accordingly. On the contrary, because the scheduler has to create and shut down more threads, which itself consumes CPU time, performance may even degrade.
1.1 Amdahl's Law
The previous paragraph mentioned that in some cases adding computational resources can improve the overall performance of a program. To figure out how much performance we can gain by adding resources, it is important to examine which parts of the program run serially (or synchronously) and which parts run in parallel. If we denote the fraction of code that has to be executed synchronously by B (for example, the fraction of lines of code that must run synchronously), and the total number of CPU cores by N, then according to Amdahl's law the upper limit of the speedup we can obtain is:

    speedup <= 1 / (B + (1 - B) / N)

If N tends to infinity, the term (1 - B) / N converges to 0, so we can ignore it, and the maximum speedup converges to 1 / B, where B is the fraction of code that must run synchronously. If B equals 0.5, meaning that half of the program cannot run in parallel, then 1 / 0.5 = 2: no matter how many CPU cores we add, we get at most a twofold speedup. Suppose we modify the program so that only 0.25 of the code must be synchronized; now 1 / 0.25 = 4, so on hardware with a large number of cores the program will run roughly four times as fast as on a single core.
On the other hand, Amdahl's law also lets us work backwards from a target speedup to the fraction of synchronized code we may have. If we want the program to run 100 times faster, then 1 / 100 = 0.01, which means that no more than 1% of the code may be executed synchronously.
Summing up, Amdahl's law tells us that the maximum speedup we can obtain by adding CPUs depends on how small the synchronized portion of the code is. Although in practice it is not always easy to determine this fraction, let alone for large enterprise systems, Amdahl's law gives us an important insight: we have to think very carefully about the code that must be executed synchronously and try to keep it as small as possible.
1.2 Effect on performance
So far we have shown that adding more threads can improve the performance and responsiveness of a program. On the other hand, these benefits do not come for free: using threads also has its own impact on performance.
The first cost comes from the creation of a thread. During creation the JVM has to request resources from the underlying operating system and initialize the data structures in the scheduler that determine the order in which threads execute.
If the number of threads equals the number of CPU cores, each thread runs on its own core and may not be interrupted very often. In practice, however, the operating system needs the CPU for its own work while your program is running, so even in this case threads get interrupted and have to wait until the operating system lets them run again. The situation gets worse when the number of threads exceeds the number of cores: the scheduler then interrupts some threads to let others run, and on every such switch the current state of the running thread has to be saved so that it can be restored the next time the thread runs. On top of that, the scheduler updates its own internal data structures, which consumes further CPU cycles. All of this means that every context switch between threads costs CPU resources and therefore introduces overhead compared with a single-threaded solution.
Another cost of multithreaded programs comes from synchronizing access to shared data. We can use the synchronized keyword for synchronization, or the volatile keyword to share data among threads. If more than one thread wants to access a shared data structure, contention arises, and the JVM has to decide which thread goes first and which comes next. If the thread that is chosen to proceed is not the one currently running, a thread switch occurs: the current thread has to wait until it manages to acquire the lock. The JVM can decide for itself how to implement this waiting. If it expects the lock to become available soon, it can use an aggressive strategy such as spinning, i.e. repeatedly trying to acquire the lock until it succeeds; this can be more efficient than a context switch. Moving a waiting thread back into the run queue also costs additional overhead.
Therefore we should do our best to avoid the context switches caused by lock contention. The following sections describe ways to reduce this contention.
1.3 Lock contention
As mentioned in the previous section, when two or more threads compete for a lock, additional overhead is incurred, because the contention forces the scheduler either to let a thread spin-wait for the lock or to suspend it, which causes two context switches. Lock contention can be mitigated by the following techniques, each of which is discussed in the sections below:
1. Reduce the scope of the lock;
2. Reduce the frequency with which the lock has to be acquired;
3. Use hardware-supported optimistic operations instead of synchronization;
4. Use synchronized as little as possible;
5. Reduce the use of object caching.
1.3.1 Reduce lock scope
This first technique applies whenever a lock is held longer than necessary. Often one or more statements can be moved out of the synchronized block to shorten the time the current thread holds the lock. The less code runs inside the synchronized block, the sooner the current thread releases the lock, and the sooner other threads can acquire it. This is in line with Amdahl's law, because it reduces the amount of code that is executed synchronously.
To better understand this, look at the following source code:
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class ReduceLockDuration implements Runnable {
    private static final int NUMBER_OF_THREADS = 5;
    private static final Map<String, Integer> map = new HashMap<String, Integer>();

    public void run() {
        for (int i = 0; i < 10000; i++) {
            synchronized (map) {
                UUID randomUUID = UUID.randomUUID();
                Integer value = Integer.valueOf(42);
                String key = randomUUID.toString();
                map.put(key, value);
            }
            Thread.yield();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[NUMBER_OF_THREADS];
        for (int i = 0; i < NUMBER_OF_THREADS; i++) threads[i] = new Thread(new ReduceLockDuration());
        long startMillis = System.currentTimeMillis();
        for (int i = 0; i < NUMBER_OF_THREADS; i++) threads[i].start();
        for (int i = 0; i < NUMBER_OF_THREADS; i++) threads[i].join();
        System.out.println((System.currentTimeMillis() - startMillis) + "ms");
    }
}
In the above example, five threads compete for access to the shared map instance. To make sure that only one thread accesses the map at a time, the code that adds the key/value pair to the map is placed inside a block synchronized on the map. Looking at the code more closely, we can see that the statements computing the key and the value do not need to be synchronized: key and value belong only to the thread that is currently executing the code and are of no use to any other thread. So we can move these statements out of the synchronized block, as follows:
public void run() {
    for (int i = 0; i < 10000; i++) {
        UUID randomUUID = UUID.randomUUID();
        Integer value = Integer.valueOf(42);
        String key = randomUUID.toString();
        synchronized (map) {
            map.put(key, value);
        }
        Thread.yield();
    }
}
The effect of reducing the amount of synchronized code can be measured. On my machine the execution time of the whole program dropped from 420 ms to 370 ms. Just moving three lines out of the synchronized block reduced the running time by about 11%. The Thread.yield() call is there to provoke context switches: it tells the JVM that the current thread is willing to give up the CPU so that other waiting threads can run. This also produces more lock contention; without it, one thread would keep a core for longer and there would be fewer context switches.
1.3.2 Lock splitting
Another way to reduce lock contention is to split one lock-protected section of code into several smaller blocks, each protected by its own lock. This pays off if the program uses one lock to protect several different objects. Suppose we want to collect some statistics in a program and implement a simple counter class that holds a number of different metrics, each represented by a primitive counter variable of type long. Because our program is multithreaded and these variables are accessed from different threads, access to them has to be synchronized. The easiest way to achieve this is to add the synchronized keyword to every method that accesses them:
public static class CounterOneLock implements Counter {
    private long customerCount = 0;
    private long shippingCount = 0;

    public synchronized void incrementCustomer() {
        customerCount++;
    }

    public synchronized void incrementShipping() {
        shippingCount++;
    }

    public synchronized long getCustomerCount() {
        return customerCount;
    }

    public synchronized long getShippingCount() {
        return shippingCount;
    }
}
With this approach, every modification of one of these variables locks the whole counter instance and therefore blocks access to all the other counters as well. If a thread wants to call the increment method for a different variable, it can only wait until the previous thread has released the lock. In this case, protecting each variable with its own lock improves throughput:
public static class CounterSeparateLock implements Counter {
    private static final Object customerLock = new Object();
    private static final Object shippingLock = new Object();
    private long customerCount = 0;
    private long shippingCount = 0;

    public void incrementCustomer() {
        synchronized (customerLock) {
            customerCount++;
        }
    }

    public void incrementShipping() {
        synchronized (shippingLock) {
            shippingCount++;
        }
    }

    public long getCustomerCount() {
        synchronized (customerLock) {
            return customerCount;
        }
    }

    public long getShippingCount() {
        synchronized (shippingLock) {
            return shippingCount;
        }
    }
}
This implementation introduces a separate lock object for each metric, so a thread that wants to increase the customer count only has to wait for another thread that is also increasing the customer count, not for a thread that is increasing the shipping count.
With the following class we can easily measure the performance improvement that lock splitting brings:
import java.util.concurrent.ThreadLocalRandom;

public class LockSplitting implements Runnable {
    private static final int NUMBER_OF_THREADS = 5;
    private Counter counter;

    public interface Counter {
        void incrementCustomer();
        void incrementShipping();
        long getCustomerCount();
        long getShippingCount();
    }

    public static class CounterOneLock implements Counter { ... }
    public static class CounterSeparateLock implements Counter { ... }

    public LockSplitting(Counter counter) {
        this.counter = counter;
    }

    public void run() {
        for (int i = 0; i < 100000; i++) {
            if (ThreadLocalRandom.current().nextBoolean()) {
                counter.incrementCustomer();
            } else {
                counter.incrementShipping();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[NUMBER_OF_THREADS];
        Counter counter = new CounterOneLock();
        for (int i = 0; i < NUMBER_OF_THREADS; i++) threads[i] = new Thread(new LockSplitting(counter));
        long startMillis = System.currentTimeMillis();
        for (int i = 0; i < NUMBER_OF_THREADS; i++) threads[i].start();
        for (int i = 0; i < NUMBER_OF_THREADS; i++) threads[i].join();
        System.out.println((System.currentTimeMillis() - startMillis) + "ms");
    }
}
On my machine the single-lock implementation takes 56 ms on average, the implementation with two separate locks 38 ms. The elapsed time dropped by about 32%.
Another improvement is to go one step further and protect read and write operations with different locks. The original counter class offers methods for reading and writing the metrics, but the read operations do not actually need to be protected against each other: multiple threads can safely read the current values in parallel, as long as writes are synchronized. The java.util.concurrent package provides implementations of the ReadWriteLock interface that make this easy.
The ReentrantReadWriteLock implementation maintains two locks, one protecting read operations and one protecting write operations. Both locks offer operations to acquire and release them. The write lock can only be acquired when no thread holds the read lock; conversely, the read lock can be held by multiple threads at the same time as long as the write lock is not taken. To demonstrate this approach, the following counter class uses a ReadWriteLock:
public static class CounterReadWriteLock implements Counter {
    private final ReentrantReadWriteLock customerLock = new ReentrantReadWriteLock();
    private final Lock customerWriteLock = customerLock.writeLock();
    private final Lock customerReadLock = customerLock.readLock();
    private final ReentrantReadWriteLock shippingLock = new ReentrantReadWriteLock();
    private final Lock shippingWriteLock = shippingLock.writeLock();
    private final Lock shippingReadLock = shippingLock.readLock();
    private long customerCount = 0;
    private long shippingCount = 0;

    public void incrementCustomer() {
        customerWriteLock.lock();
        customerCount++;
        customerWriteLock.unlock();
    }

    public void incrementShipping() {
        shippingWriteLock.lock();
        shippingCount++;
        shippingWriteLock.unlock();
    }

    public long getCustomerCount() {
        customerReadLock.lock();
        long count = customerCount;
        customerReadLock.unlock();
        return count;
    }

    public long getShippingCount() {
        shippingReadLock.lock();
        long count = shippingCount;
        shippingReadLock.unlock();
        return count;
    }
}
All read operations are protected by the read lock and all write operations by the write lock. If the program performs far more reads than writes, this implementation can bring an even bigger performance gain than the one from the previous section, because the read operations can run concurrently.
1.3.3 Lock striping
The previous example showed how to split a single lock into two separate locks, so that each thread only acquires the lock for the object it is about to modify. On the other hand, this approach increases the complexity of the program and can cause deadlocks if it is not implemented correctly.
Lock striping is similar to lock splitting, but where lock splitting uses different locks to protect different code fragments or objects, lock striping uses different locks to protect different ranges of values. The ConcurrentHashMap class from the JDK's java.util.concurrent package uses this idea to improve the performance of programs that rely heavily on hash maps. Instead of wrapping a HashMap in synchronized code, the original ConcurrentHashMap implementation uses 16 different locks internally, each of which guards synchronized access to one sixteenth of the hash buckets. Threads that insert keys into different segments therefore acquire different locks. The downside is that some operations now need to acquire several locks instead of one: copying the whole map, for example, requires acquiring all 16 locks.
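To illustrate the technique itself (this is not how ConcurrentHashMap is actually implemented, just a minimal sketch of lock striping with hypothetical names): the key's hash code selects one of a fixed number of stripes, and only that stripe's lock is taken, so threads working on keys in different stripes do not block each other.

import java.util.HashMap;
import java.util.Map;

public class StripedMap {
    private static final int STRIPES = 16;
    private final Object[] locks = new Object[STRIPES];
    private final Map<String, Integer>[] segments;

    @SuppressWarnings("unchecked")
    public StripedMap() {
        segments = new Map[STRIPES];
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new Object();
            segments[i] = new HashMap<String, Integer>();
        }
    }

    private int stripeFor(String key) {
        // Spread the key's hash code over the available stripes.
        return Math.floorMod(key.hashCode(), STRIPES);
    }

    public void put(String key, Integer value) {
        int stripe = stripeFor(key);
        synchronized (locks[stripe]) { // only this stripe is locked
            segments[stripe].put(key, value);
        }
    }

    public Integer get(String key) {
        int stripe = stripeFor(key);
        synchronized (locks[stripe]) {
            return segments[stripe].get(key);
        }
    }

    public int size() {
        // A whole-map operation has to visit every stripe; here the locks are taken
        // one at a time, so the result is only approximate under concurrent updates.
        int total = 0;
        for (int i = 0; i < STRIPES; i++) {
            synchronized (locks[i]) {
                total += segments[i].size();
            }
        }
        return total;
    }
}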
1.3.4 Atomic operations
Another way to reduce lock contention is to use atomic operations; the underlying principle is discussed in more detail in other articles. The java.util.concurrent package provides classes that wrap atomic operations around the most common primitive data types. Atomic operations are implemented on top of the processor's compare-and-swap (CAS) feature, which performs an update only if the current value is still equal to the expected old value supplied with the operation.
This principle can be used to increment a variable in an optimistic way: if our thread knows the current value, it attempts the increment with a CAS operation. If another thread has modified the variable in the meantime, the supplied "current" value no longer matches the real one, so the thread reads the current value again and retries, repeating this until it succeeds. Although this looping can waste some CPU cycles, the advantage is that no form of locking is needed.
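A minimal sketch of such a retry loop, written against AtomicLong's compareAndSet() method (conceptually this is what incrementAndGet() does for us):

import java.util.concurrent.atomic.AtomicLong;

public class CasIncrement {
    private final AtomicLong counter = new AtomicLong();

    public long increment() {
        while (true) {
            long current = counter.get();  // read the current value
            long next = current + 1;       // compute the new value
            // Succeeds only if no other thread changed the value in the meantime.
            if (counter.compareAndSet(current, next)) {
                return next;
            }
            // Another thread won the race: loop and try again with the fresh value.
        }
    }
}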
The following implementation of the counter class takes advantage of atomic operations; as you can see, it contains no synchronized code at all:
public static class CounterAtomic implements Counter {
    private AtomicLong customerCount = new AtomicLong();
    private AtomicLong shippingCount = new AtomicLong();

    public void incrementCustomer() {
        customerCount.incrementAndGet();
    }

    public void incrementShipping() {
        shippingCount.incrementAndGet();
    }

    public long getCustomerCount() {
        return customerCount.get();
    }

    public long getShippingCount() {
        return shippingCount.get();
    }
}
Compared with the CounterSeparateLock class, the average elapsed time drops from 39 ms to 16 ms, a reduction of about 58%.
1.3.5 Avoid hotspots
A typical list implementation maintains an internal variable holding the number of elements in the list and updates it every time an element is added or removed. In a single-threaded application this is a sensible optimization: size() simply returns the previously computed value. Without this internal counter, every call to size() would have to iterate over the list to count its elements.
This kind of optimization, common in many data structures, becomes a problem in a multithreaded environment. Suppose we share such a list among multiple threads that concurrently add and remove elements while also querying its size. The internal count variable is now a shared resource, and all access to it must be synchronized: the counter has become a hotspot of the whole list implementation.
The following code fragment shows the problem:
public static class CarRepositoryWithCounter implements CarRepository {
    private Map<String, Car> cars = new HashMap<String, Car>();
    private Map<String, Car> trucks = new HashMap<String, Car>();
    private Object carCountSync = new Object();
    private int carCount = 0;

    public void addCar(Car car) {
        if (car.getLicencePlate().startsWith("C")) {
            synchronized (cars) {
                Car foundCar = cars.get(car.getLicencePlate());
                if (foundCar == null) {
                    cars.put(car.getLicencePlate(), car);
                    synchronized (carCountSync) { carCount++; }
                }
            }
        } else {
            synchronized (trucks) {
                Car foundCar = trucks.get(car.getLicencePlate());
                if (foundCar == null) {
                    trucks.put(car.getLicencePlate(), car);
                    synchronized (carCountSync) { carCount++; }
                }
            }
        }
    }

    public int getCarCount() {
        synchronized (carCountSync) { return carCount; }
    }
}
This CarRepository implementation holds two maps, one for car elements and one for truck elements, and provides a method that returns the total number of cars and trucks currently in the repository. As an optimization, it increments an internal count variable each time a car is added, and this increment is protected by a synchronized block on carCountSync; the method returning the count is protected in the same way.
To avoid this extra synchronization overhead, look at another implementation of CarRepository: it no longer uses an internal count variable, but computes the number of vehicles on demand when the count is requested. As follows:
public static class CarRepositoryWithoutCounter implements CarRepository {
    private Map<String, Car> cars = new HashMap<String, Car>();
    private Map<String, Car> trucks = new HashMap<String, Car>();

    public void addCar(Car car) {
        if (car.getLicencePlate().startsWith("C")) {
            synchronized (cars) {
                Car foundCar = cars.get(car.getLicencePlate());
                if (foundCar == null) {
                    cars.put(car.getLicencePlate(), car);
                }
            }
        } else {
            synchronized (trucks) {
                Car foundCar = trucks.get(car.getLicencePlate());
                if (foundCar == null) {
                    trucks.put(car.getLicencePlate(), car);
                }
            }
        }
    }

    public int getCarCount() {
        synchronized (cars) {
            synchronized (trucks) {
                return cars.size() + trucks.size();
            }
        }
    }
}
Now only the getCarCount() method has to synchronize on both maps; the synchronization overhead that the previous implementation paid on every insertion of a new element is gone.
1.3.6 Avoid object cache reuse
In the first versions of the Java VM, creating objects with the new keyword was still relatively expensive, so many developers got used to object-reuse patterns. To avoid creating the same kind of object over and over, they maintained a buffer pool: once an instance was no longer needed it was put into the pool, and the next time some thread needed an instance it could take one directly from the pool.
At first glance this seems reasonable, but the pattern is problematic in multithreaded applications. The pool of objects is shared among all threads, so the operations that take objects out of the pool and put them back must be synchronized, and this synchronization overhead can be greater than the cost of creating the objects in the first place. Of course, creating many objects increases the load on the garbage collector, but even taking that into account, avoiding the synchronized code usually gives better performance than using an object cache pool.
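As a minimal sketch of the pattern being described (the class and method names are hypothetical), note how every borrow and return has to pass through the pool's single lock, which is shared by all threads:

import java.util.ArrayDeque;
import java.util.Deque;

public class BufferPool {
    private final Deque<StringBuilder> pool = new ArrayDeque<StringBuilder>();

    // Every caller, from every thread, has to acquire this lock first.
    public synchronized StringBuilder borrow() {
        StringBuilder cached = pool.poll();
        return cached != null ? cached : new StringBuilder();
    }

    public synchronized void giveBack(StringBuilder buffer) {
        buffer.setLength(0); // reset before reuse
        pool.push(buffer);
    }
}

On a modern JVM, simply creating a new StringBuilder in each thread avoids this shared lock entirely and is usually cheaper.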
The optimization techniques described in this article show once again that every potential optimization has to be carefully evaluated in the context of the real application. Premature optimizations may look plausible on the surface, but they can easily turn into performance bottlenecks themselves.