[Java Performance] Performance of Threads and Synchronization: Thread Pools, ThreadPoolExecutor, and ForkJoinPool

Source: Internet
Author: User
Tags: HTTP status code 500, Stream API

Thread Pool and ThreadPoolExecutors

Although the Thread class can be used directly for threading, programs generally use thread pools instead; on a Java EE application server in particular, several thread pools typically process requests from clients. Java's thread-pool support comes from ThreadPoolExecutor, and some application servers implement their pools with it.

The most important parameter for tuning thread-pool performance is the size of the pool.

Almost every thread pool works the same way:

  • A task is put into a queue (which may be bounded or unbounded).
  • A thread takes a task from the queue and executes it.
  • When the thread finishes a task, it goes back to the queue for another; if the queue is empty, the thread waits.

A thread pool usually has a minimum and a maximum number of threads:

  • Minimum number of threads: the number of threads kept in the pool even when the task queue is empty. Creating a thread is a resource-consuming operation that should be avoided where possible, and keeping a minimum around means that when a new task arrives there is usually a thread ready to process it immediately.
  • Maximum number of threads: the most threads the pool may own when there are many tasks to process. This cap keeps the pool from creating too many threads; threads consume CPU and other resources, and too many of them reduces performance.
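These two limits map directly onto ThreadPoolExecutor's constructor arguments, corePoolSize and maximumPoolSize. A minimal sketch, with the class name and the sizes chosen purely for illustration:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSizingExample {
    public static ThreadPoolExecutor newPool() {
        // core (minimum) of 2 threads, maximum of 4; idle threads above
        // the core are reclaimed after 30 seconds; up to 10 tasks may queue
        return new ThreadPoolExecutor(
                2, 4,
                30, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(10));
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = newPool();
        System.out.println("core=" + pool.getCorePoolSize()
                + " max=" + pool.getMaximumPoolSize());
        pool.shutdown();
    }
}
```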

In ThreadPoolExecutor and its related types, the minimum number of threads is called the core pool size. Other Java application servers may call this number MinThreads, but the concept is the same.

However, when the pool resizes itself, ThreadPoolExecutor can behave very differently from other thread-pool implementations.

The simplest case: when a new task needs to be executed and all threads are busy, ThreadPoolExecutor and most other implementations create a new thread to run it (until the maximum number of threads is reached).

Setting the maximum number of threads

The optimal maximum number of threads depends on two factors:

  • The characteristics of the task
  • The computer hardware

To keep the discussion concrete, assume the JVM has four available CPUs, and that the goal is simply to squeeze as much out of them as possible, maximizing CPU utilization.

The maximum number of threads should therefore be at least 4: with four CPUs available, up to four tasks can execute in parallel. Garbage collection has some impact during this, but GC rarely needs an entire CPU. The exception is a concurrent collector such as CMS or G1, whose background collection does require sufficient CPU resources.

Is it worth setting a larger number of threads? That depends on the characteristics of the task.

Suppose the task is compute-intensive: it performs no I/O (no database or file reads), involves no synchronization, and each task is completely independent of the others. For example, a batch program reading from a mock data source was tested with different numbers of threads in the pool, producing the following results:

Number of threads Execution time (sec) Baseline percentage
1 255.6 100%
2 134.8 52.7%
4 77.0 30.1%
8 81.7 31.9%
16 85.6 33.5%

Some conclusions from the table:

  • Performance is best at 4 threads; adding more does not help, because CPU utilization is already at its peak, and extra threads only add competition for the CPU between threads, reducing performance.
  • Even at peak CPU usage the baseline percentage never reaches the ideal 25%, because the application threads do not have the CPUs entirely to themselves while the program runs; background threads such as GC threads and various system threads need CPU time as well.

When the same calculation is triggered from a servlet, the numbers look like this (the load generator sends 20 simultaneous requests):

Number of threads Operations per second (OPS) Baseline percentage
4 77.43 100%
8 75.93 98.8%
16 71.65 92.5%
32 69.34 89.5%
64 60.44 78.1%

Conclusions from this table:

  • Performance is again best with 4 threads: the task is compute-intensive and there are only four CPUs, so 4 threads is the sweet spot.
  • As the thread count grows, performance degrades, because the threads compete for CPU and force frequent context switches, which only waste CPU resources.
  • The degradation is gradual, again because the task is compute-intensive. If the bottleneck were an external resource, such as a database or file I/O, rather than CPU, the effect of changing the thread count would be more pronounced.

Now consider the client's perspective: what effect does the number of concurrent clients have on server response time? In the same environment, as concurrent clients are added, response time changes as follows:

Concurrent client threads Average response time (sec) Baseline percentage
1 0.05 100%
2 0.05 100%
4 0.05 100%
6 0.076 152%
8 0.104 208%
16 0.212 424%
32 0.437 874%
64 0.909 1818%

The task is compute-intensive, and with 1, 2, or 4 concurrent clients the average response time stays optimal. Beyond four clients, however, performance drops sharply as more clients are added.

As the number of clients grows, it is tempting to improve matters by enlarging the server's thread pool. For CPU-bound tasks this only makes things worse: the system's bottleneck is the CPU, and more pool threads simply intensify the competition for it.

So when facing a performance problem, the first step is always to identify the system's bottleneck, so that tuning can be targeted. "Tuning" that intensifies competition for the bottleneck resource only drives performance further down; reducing competition for the bottleneck, by contrast, usually improves it.

In the scenario above with ThreadPoolExecutor, there are always tasks pending in the queue (each client request corresponds to a task), every available thread is busy, and the CPU is saturated. What happens if threads are now added to the pool? The new threads pick up some of the pending tasks, competition for the CPU among threads becomes even fiercer, and performance drops.

Setting the minimum number of threads

After setting the maximum number of threads, the minimum must be set as well. For most scenarios it can simply equal the maximum.

The motivation for a minimum smaller than the maximum is to save resources: every thread created consumes some resources, notably its stack. But if the maximum was chosen to fit the hardware and the task characteristics, the system is going to use that many threads anyway, so the pool might as well create them from the start. That said, the impact of a minimum smaller than the maximum is also very small; the difference is rarely noticeable.

In a batch program it barely matters whether the minimum equals the maximum, since the threads get created eventually and the total running time is about the same. The impact on server programs is also small, but pool threads should generally be created during the warm-up phase, which is why setting the minimum equal to the maximum is recommended.
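That warm-up creation can be done explicitly with ThreadPoolExecutor's prestartAllCoreThreads(), which starts all core threads immediately instead of lazily. A small sketch (class name and sizes are illustrative):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class WarmUpExample {
    // build a fixed-size pool (minimum == maximum) of the given size
    public static ThreadPoolExecutor fixedPool(int size) {
        return new ThreadPoolExecutor(size, size,
                0, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = fixedPool(4);
        // start all core threads now, during warm-up, rather than on demand;
        // the method returns how many threads it started
        int started = pool.prestartAllCoreThreads();
        System.out.println("Prestarted " + started + " threads");
        pool.shutdown();
    }
}
```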

Some scenarios do call for a different minimum. For example, if a system must handle up to 2,000 simultaneous tasks but averages only 20, the minimum should be 20 rather than the maximum of 2,000. Otherwise the resources held by idle threads are considerable, especially when ThreadLocal variables are in use.

Thread Pool Task Sizes

A thread pool keeps a list or queue of tasks waiting to execute, and in some situations tasks arrive faster than they can be executed. If each task represents a client request, clients end up waiting a long time, which is clearly unacceptable, especially for servers providing web services.

Thread pools therefore get a chance to limit the number of tasks in the list/queue. As with the maximum thread count, there is no universally optimal limit; it depends on the specific task type and on sustained performance testing.

For ThreadPoolExecutor, once the task count reaches this limit, attempts to add new tasks fail: the pool invokes its RejectedExecutionHandler's rejectedExecution method to reject the task. On an application server this typically surfaces as an HTTP 500 status code. Of course, this information should reach the client in a friendlier form, for example an explanation of why the request was rejected.
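The rejection path is easy to demonstrate by plugging in a custom RejectedExecutionHandler. A minimal sketch (the class name and sizes are illustrative): one worker thread plus a one-slot queue means a third concurrent submission has nowhere to go and must be rejected.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class RejectionExample {
    public static int submitThree() throws InterruptedException {
        AtomicInteger rejected = new AtomicInteger();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(1),
                // custom handler: count rejections instead of throwing
                // RejectedExecutionException (the default AbortPolicy)
                (task, executor) -> rejected.incrementAndGet());
        for (int i = 0; i < 3; i++) {
            // task 1 occupies the single thread, task 2 fills the queue,
            // task 3 is handed to the rejection handler
            pool.execute(() -> {
                try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return rejected.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Rejected tasks: " + submitThree());
    }
}
```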

Customizing ThreadPoolExecutor

The thread pool creates a new thread when all of the following hold:

  • A task needs to be executed.
  • Every thread currently in the pool is busy.
  • The pool has not yet reached its maximum number of threads.

How the pool creates that new thread depends on the type of the task queue:

  • The task queue is a SynchronousQueue. Its defining property is that it cannot hold any tasks: when a task is submitted, a pool using a SynchronousQueue immediately creates a thread for it (and if the maximum number of threads has already been reached, the task is rejected). This queue suits workloads with few tasks, because there is no container in which unexecuted tasks can wait.

  • The task queue is an unbounded queue, such as a LinkedBlockingQueue. In this case no submitted task is ever rejected, but the pool ignores the maximum number of threads: it never grows past the minimum, so the maximum effectively becomes the minimum. For this reason, with such a queue the maximum is usually set equal to the minimum, which amounts to a fixed-size thread pool.

  • The task queue is a bounded queue, such as an ArrayBlockingQueue. Here the decision of when to create a new thread is more subtle. Say the minimum is 4, the maximum is 8, and the queue holds at most 10 tasks. As tasks arrive they queue up until the queue is full (10 tasks), while the pool still runs only its 4 core threads. Only when yet another task is submitted and cannot be queued does the pool create a new thread, which runs that newly submitted task.
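The first two behaviors are easy to observe: with an unbounded LinkedBlockingQueue the pool never grows past its core size no matter how many tasks back up, while a SynchronousQueue pushes the pool toward its maximum. A sketch (class name and sizes are illustrative):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueChoiceExample {
    static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
    }

    public static void main(String[] args) throws InterruptedException {
        // unbounded queue: the maximum of 8 is effectively ignored,
        // the pool stays at its core size of 2
        ThreadPoolExecutor unbounded = new ThreadPoolExecutor(
                2, 8, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        for (int i = 0; i < 20; i++)
            unbounded.execute(() -> sleepQuietly(50));
        Thread.sleep(100);
        System.out.println("Unbounded queue, pool size: " + unbounded.getPoolSize());
        unbounded.shutdownNow();

        // SynchronousQueue: every task needs a free thread right away,
        // so the pool grows toward its maximum of 8
        ThreadPoolExecutor direct = new ThreadPoolExecutor(
                2, 8, 60, TimeUnit.SECONDS, new SynchronousQueue<>());
        for (int i = 0; i < 8; i++)
            direct.execute(() -> sleepQuietly(50));
        System.out.println("SynchronousQueue, pool size: " + direct.getPoolSize());
        direct.shutdownNow();
    }
}
```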

When customizing ThreadPoolExecutor, follow the KISS principle (Keep It Simple, Stupid): for example, set the maximum number of threads equal to the minimum, and choose a bounded or unbounded queue as the workload requires.

Summary
1. A thread pool is a useful application of object pooling: it saves the cost of creating threads and also bounds the number of threads in the system.
2. The pool's thread count must be set carefully; otherwise adding threads can only degrade performance.
3. When customizing ThreadPoolExecutor, following the KISS principle generally yields the best performance.

ForkJoinPool

Java 7 introduced a new thread pool: ForkJoinPool.

Like ThreadPoolExecutor, it implements the Executor and ExecutorService interfaces. It uses an unbounded queue to hold pending tasks, and the number of threads is passed to its constructor; if no count is given, the number of CPUs available on the machine is used as the default.

ForkJoinPool is designed for divide-and-conquer algorithms, with quick sort as a typical example. The key point is that ForkJoinPool can process a very large number of tasks with relatively few threads. To sort 10 million elements, for instance, the task is split into two 5-million-element sorting tasks plus a task that merges the two sorted halves; the 5-million-element tasks split the same way, and a threshold determines when splitting stops, e.g. below 10 elements, the remaining run is sorted with insertion sort.

In the end there are roughly two million tasks. The crux of the problem is that a task cannot complete until all of its subtasks have completed.

This is why divide-and-conquer is a problem for ThreadPoolExecutor: a ThreadPoolExecutor thread cannot put a new task on the queue, suspend its current task, and come back to it later. A ForkJoinPool thread can: it creates subtasks, suspends the current task, and meanwhile picks a subtask from the queue to execute.

For example, to count the elements smaller than 0.5 in an array of doubles, a ForkJoinPool implementation looks like this:

import java.util.Random;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinTest {
    private static double[] d;

    private static class ForkJoinTask extends RecursiveTask<Integer> {
        private final int first;
        private final int last;

        public ForkJoinTask(int first, int last) {
            this.first = first;
            this.last = last;
        }

        protected Integer compute() {
            int subCount;
            if (last - first < 10) {
                // range is small enough: count directly
                subCount = 0;
                for (int i = first; i <= last; i++) {
                    if (d[i] < 0.5)
                        subCount++;
                }
            } else {
                // split the range in half, fork both halves, then join
                int mid = (first + last) >>> 1;
                ForkJoinTask left = new ForkJoinTask(first, mid);
                left.fork();
                ForkJoinTask right = new ForkJoinTask(mid + 1, last);
                right.fork();
                subCount = left.join();
                subCount += right.join();
            }
            return subCount;
        }
    }

    // helper not shown in the original: fills an array with random doubles
    private static double[] createArrayOfRandomDoubles() {
        double[] a = new double[10_000_000];
        Random r = new Random();
        for (int i = 0; i < a.length; i++)
            a[i] = r.nextDouble();
        return a;
    }

    public static void main(String[] args) {
        d = createArrayOfRandomDoubles();
        int n = new ForkJoinPool().invoke(new ForkJoinTask(0, d.length - 1));
        System.out.println("Found " + n + " values");
    }
}

The keys above are the fork() and join() methods. Each ForkJoinPool thread uses an internal queue to manage its pending tasks and subtasks and to preserve their execution order.

So what is the performance difference between ThreadPoolExecutor and ForkJoinPool?

First, ForkJoinPool completes huge numbers of parent-child tasks with a limited number of threads: four threads can finish more than two million tasks. That is impossible with ThreadPoolExecutor, because its threads cannot set a parent task aside to run its subtasks first; finishing two million interdependent parent-child tasks would require two million threads, which is clearly infeasible.

Of course, the example above can also avoid divide-and-conquer entirely. Because the tasks are independent, the array can be split into a few regions handled by a ThreadPoolExecutor, which creates no flood of subtasks. The code is as follows:

import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadPoolTest {
    private static double[] d;

    private static class ThreadPoolExecutorTask implements Callable<Integer> {
        private final int first;
        private final int last;

        public ThreadPoolExecutorTask(int first, int last) {
            this.first = first;
            this.last = last;
        }

        public Integer call() {
            // count the elements below 0.5 in this task's region
            int subCount = 0;
            for (int i = first; i <= last; i++) {
                if (d[i] < 0.5) {
                    subCount++;
                }
            }
            return subCount;
        }
    }

    // helper not shown in the original: fills an array with random doubles
    private static double[] createArrayOfRandomDoubles() {
        double[] a = new double[10_000_000];
        Random r = new Random();
        for (int i = 0; i < a.length; i++)
            a[i] = r.nextDouble();
        return a;
    }

    public static void main(String[] args) throws Exception {
        d = createArrayOfRandomDoubles();
        ThreadPoolExecutor tpe = new ThreadPoolExecutor(4, 4,
                Long.MAX_VALUE, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
        Future<Integer>[] f = new Future[4];
        int size = d.length / 4;
        for (int i = 0; i < 3; i++) {
            f[i] = tpe.submit(new ThreadPoolExecutorTask(i * size, (i + 1) * size - 1));
        }
        f[3] = tpe.submit(new ThreadPoolExecutorTask(3 * size, d.length - 1));
        int n = 0;
        for (int i = 0; i < 4; i++) {
            n += f[i].get();
        }
        System.out.println("Found " + n + " values");
        tpe.shutdown();
    }
}

The times for ForkJoinPool and ThreadPoolExecutor to solve this problem:

Number of threads ForkJoinPool ThreadPoolExecutor
1 3.2 s 0.31 s
4 1.9 s 0.15 s

GC was monitored during these runs: total GC time was 1.2 s with ForkJoinPool, while ThreadPoolExecutor triggered no GC at all. The difference is that ForkJoinPool creates a huge number of subtasks that become garbage once executed, whereas ThreadPoolExecutor creates no subtasks and therefore no extra GC work.

Another feature of ForkJoinPool is work stealing. Each thread in the pool maintains its own queue of tasks to execute; once a thread's own queue is empty, it takes unexecuted tasks from other threads' queues and runs them.

The following code can be used to test ForkJoinPool's work-stealing feature:

for (int i = first; i <= last; i++) {
    if (d[i] < 0.5) {
        subCount++;
    }
    for (int j = 0; j < d.length - i; j++) {
        for (int k = 0; k < 100; k++) {
            dummy = j * k + i; // dummy is volatile, so multiple writes occur
            d[i] = dummy;
        }
    }
}

Because the bound of the inner loop (over j) depends on the outer index i, the execution time of this code varies with i: the work is greatest when i = 0 and smallest when i = last. In other words, the per-task workload is uneven, with small values of i carrying heavy work and large values light work. This is a typical unbalanced-load scenario.

ThreadPoolExecutor is a poor fit here, because its threads cannot redistribute the uneven work between tasks: a thread that finishes its light task simply goes idle, waiting while the thread with the heaviest task is still running.

ForkJoinPool behaves differently. Even with uneven workloads, while one thread grinds through a heavy task, idle threads steal and help finish the remaining subtasks, raising thread utilization and overall performance.

The execution times of the two pools with an unbalanced task workload:

Number of threads ForkJoinPool ThreadPoolExecutor
1 54.5 s 53.3 s
4 16.6 s 24.2 s

Note that with a single thread the two execution times barely differ, since the total computation is the same; ForkJoinPool is slightly slower because it creates many tasks and adds GC work.

With four threads the gap is significant: ForkJoinPool performs nearly 50% better than ThreadPoolExecutor, showing that work stealing keeps resources utilized when task workloads are unbalanced.

So when task workloads are balanced, ThreadPoolExecutor is generally the better choice; otherwise, choose ForkJoinPool.

One more factor affects ForkJoinPool performance: the threshold at which task splitting stops. In the quick-sort example earlier, subtask creation stops when fewer than 10 elements remain. The table below shows ForkJoinPool performance at different thresholds:

Threshold ForkJoinPool
20 17.8 s
10 16.6 s
5 15.6 s
1 16.8 s

Different thresholds clearly affect performance. When using ForkJoinPool, test this threshold and pick the most suitable value; it contributes to overall performance.

Automatic Parallelization

Java 8 introduced the notion of automatic parallelization: certain Java code is executed in parallel automatically, provided that a ForkJoinPool is used.

Java 8 added a common thread pool to ForkJoinPool, used for tasks that are not explicitly submitted to any pool. It is a static member of the ForkJoinPool class, and its default size is based on the number of processors of the machine it runs on.

Automatic parallelization appears in new methods added to the Arrays class, for example the parallel sort of an array and parallel traversal of an array's elements. It is also used by Java 8's new Stream API.
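One of those Arrays methods is parallelSort, which does its sorting on the ForkJoinPool common pool. A minimal sketch (class name is illustrative):

```java
import java.util.Arrays;
import java.util.Random;

public class ParallelSortExample {
    static boolean isSorted(double[] a) {
        for (int i = 1; i < a.length; i++)
            if (a[i - 1] > a[i]) return false;
        return true;
    }

    public static void main(String[] args) {
        double[] d = new Random(42).doubles(1_000_000).toArray();
        // parallelSort splits the array into pieces, sorts them on the
        // ForkJoinPool common pool, and merges the results
        Arrays.parallelSort(d);
        System.out.println("sorted: " + isSorted(d));
    }
}
```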

For example, the following code traverses the elements of a list and performs the required calculation on each:

arrayList.parallelStream().forEach(a -> {
    String symbol = StockPriceUtils.makeSymbol(a);
    StockPriceHistory sph = new StockPriceHistoryImpl(symbol, startDate, endDate, entityManager);
});

The calculation over the list's elements runs in parallel: forEach creates a task for each element's computation, and those tasks are processed by the ForkJoinPool common pool described above. The same parallel logic could be written with ThreadPoolExecutor, but ForkJoinPool wins on readability and on the amount of code.

The common pool's thread count can be left at its default, based on the number of processors at runtime. To adjust it, set the system property -Djava.util.concurrent.ForkJoinPool.common.parallelism=N.
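The value actually in effect can be inspected at runtime. Note that on recent JDKs the default parallelism is typically one less than the processor count, which lines up with the forEach behavior discussed shortly (the submitting thread also works). A sketch (class name is illustrative):

```java
import java.util.concurrent.ForkJoinPool;

public class CommonPoolExample {
    public static void main(String[] args) {
        // the default can be overridden with
        // -Djava.util.concurrent.ForkJoinPool.common.parallelism=N
        System.out.println("Common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
        System.out.println("Available processors: "
                + Runtime.getRuntime().availableProcessors());
    }
}
```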

The following data compares ThreadPoolExecutor and the ForkJoinPool common pool on the simple calculation above:

Number of threads ThreadPoolExecutor (sec) ForkJoinPool common pool (sec)
1 255.6 135.4
2 134.8 110.2
4 77.0 96.5
8 81.7 84.0
16 85.6 84.6

Note that with 1, 2, and 4 threads the difference is noticeable, and the common pool with 1 thread performs almost identically to ThreadPoolExecutor with 2 threads.

The reason is a trick in forEach: it uses the thread calling forEach as a worker thread alongside the pool's own threads. Even with the common pool's size set to 1, there are therefore two working threads, so when forEach is used, a common pool of size 1 is equivalent to a ThreadPoolExecutor with 2 threads.

Consequently, when the common pool actually needs four working threads, set its size to 3; at runtime four working threads will be available.

Summary
1. Use ForkJoinPool for recursive divide-and-conquer algorithms.
2. Set the threshold at which tasks stop being split carefully; it affects performance.
3. Some Java 8 features use the ForkJoinPool common pool, and in some cases its default thread count needs adjusting.

