Level: Advanced

Dai Xiaojun (daixiaoj@cn.ibm.com), Software Engineer, IBM China Software Development Center
Gan Zhi (ganzhi@cn.ibm.com), Senior Software Engineer, IBM China Software Development Center
Qi Yao (qiyaoj@cn.ibm.com), Software Engineer, IBM China Software Development Center
Luo Zhida (luozd@cn.ibm.com), Software Engineer, IBM China Software Development Center

October 10, 2008
Now that CPUs have entered the multi-core era, software performance tuning is no longer a simple task. A program with no parallelism may well run slower on new hardware than it did before: as core counts grow, it is sensible for chip manufacturers to lower the clock frequency of each core to achieve the best performance-per-watt. Compared with C/C++, writing multi-threaded applications in Java is much simpler, but making those multi-threaded programs perform well is not easy. Developers should not be surprised when a parallel program turns out no faster than its serial counterpart in testing; after all, the parallel-programming guidelines widely accepted before the multi-core era were often too simplistic and arbitrary.
In this article, we introduce general steps for improving the performance of Java multi-threaded applications. By following a few simple rules presented here, you can build high-performance, scalable applications.
Why doesn't performance grow?
The performance potential of multiple cores is easy to observe with simple tests. If we write a multi-threaded program in which each thread does nothing but increment its own local variable, we can easily see performance scale almost linearly with the number of cores. (The References section links to such an example.) Contrary to such tests, however, we seldom see this kind of perfect scalability in real software. Two factors stand in the way: first, there are theoretical limits; second, implementation problems arise during software development. Consider the three performance curves in Figure 1:

Figure 1. Performance curves
As engineers striving for perfection, we would like program performance to grow linearly with the number of threads, as the blue line in Figure 1 shows. What we least want to see is the green curve: no matter how many new CPUs are added, performance does not increase at all (curves that actually decline as CPUs are added also occur in real projects). The red line shows why the familiar 90-10 rule does not carry over to scalability: if 10% of a task cannot be parallelized, the scaling curve looks like the red line, and even with the remaining 90% of the code perfectly parallel we achieve only about a 5x speedup over the range shown. For any task with a non-parallelizable portion, the real-world performance curve falls roughly in the gray area of Figure 1. This article does not try to challenge that theoretical limit; explaining how a Java programmer can get as close to the limit as possible is hard enough.

What causes poor scalability?
There are many causes of poor scalability, and the most significant is the abuse of locks. This is understandable; we were all taught: "Want thread safety? Add a lock." Think of Python's global interpreter lock, or Java's Collections.synchronizedXXX() family of methods. Using a lock to protect a critical section is convenient and makes correctness easy to ensure, but a lock also means that only one thread can be inside the critical section at a time while all the others wait! If you observe idle CPUs while the software runs slowly, it is wise to examine your use of locks. The Java Lock Monitor (JLM) in the open-source Performance Inspector toolkit is a good tool for doing so in Java programs.

Tuning a multi-threaded application
Next, we present an example program and demonstrate how to achieve better scalability on a multi-core platform. The example is a hypothetical log server: it receives log records from multiple sources and saves them to the file system. For simplicity, the example contains no networking code; the main() function simply starts multiple threads that send log messages to the log server. For impatient readers, here is the optimization result first:

Figure 2. Log server optimization results
In Figure 2, the blue curve is the old lock-based log server (LogServerBad), and the green curve is the log server after our performance optimization (LogServerGood). As you can see, the performance of LogServerGood grows nearly linearly as the number of threads increases. If you do not mind using a third-party library, the LockFreeQueue from Project Kunming can provide even better scalability:

Figure 3. Using a lock-free data structure
The third curve in Figure 3 shows the performance after replacing the standard library's ConcurrentLinkedQueue with LockFreeQueue. With few threads there is little difference between the two curves, but once the thread count grows beyond a certain point the lock-free data structure has a clear advantage. The rest of this article describes the tools and techniques used in this example, which can help you build highly scalable Java applications.

Using JLM to analyze an application
JLM provides lock hold-time and contention statistics for Java applications and the JVM, including:
- Counts of contended locks
- Number of successful lock acquisitions
- Number of recursive lock acquisitions
- Number of times a requesting thread blocked waiting for a lock
- Cumulative lock hold time

On platforms that support three-tier spin locking, the following is also available:

- Number of lock requests satisfied in the spin loop
- Number of lock requests satisfied in the outer (thread-yield) loop

The rtdriver tool collects more detailed information through these commands:

- jlmlitestart: collect counters only
- jlmstart: collect both counters and hold times
- jlmstop: stop data collection
- jlmdump: print the collected data and continue collecting

Garbage collection (GC) time is removed from reported lock hold times: for every lock held during a GC cycle, the GC time is subtracted from its hold time.
Using AtomicInteger for counters
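To make this concrete, here is a minimal thread-safe hit counter built on java.util.concurrent.atomic. This is a sketch; the class and method names other than the JDK's are our own.

```java
import java.util.concurrent.atomic.AtomicLong;

// A lock-free counter: many threads may call increment()
// concurrently without any synchronized block.
class HitCounter {
    private final AtomicLong count = new AtomicLong(0);

    void increment() {
        count.incrementAndGet(); // atomic read-modify-write
    }

    long get() {
        return count.get();
    }

    public static void main(String[] args) throws InterruptedException {
        final HitCounter counter = new HitCounter();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < 100000; j++) {
                        counter.increment();
                    }
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // No increments are lost, unlike with a plain long or volatile long.
        System.out.println(counter.get()); // 400000
    }
}
```

A plain `long count++` or even `volatile long count++` would lose updates here, because the increment is a read-modify-write of three steps; incrementAndGet() performs it atomically.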
Generally, when we implement a counter or random-number generator shared by multiple threads, we use a lock to protect the shared variable. The drawback is that fierce lock contention hurts throughput, because contended synchronization is very expensive. A volatile variable can store a shared value at lower cost than synchronization, but it only guarantees that writes become immediately visible to other threads; it cannot guarantee the atomicity of a read-modify-write sequence, so volatile alone cannot implement a correct counter or random-number generator. Starting with JDK 5, the java.util.concurrent.atomic package introduces atomic variables, including AtomicInteger, AtomicLong, AtomicBoolean, and the arrays AtomicIntegerArray and AtomicLongArray. Atomic variables provide atomic equivalents of ++, --, +=, -= and similar operations, and with these classes you can implement more efficient counters and random-number generators.

A lightweight thread pool: Executor
Most concurrent applications are organized around tasks. Usually we might create a separate thread for each task, but this brings two problems. First, large numbers of threads (>100) consume system resources and increase thread-scheduling overhead, degrading performance. Second, for short-lived tasks, frequently creating and destroying threads is unwise, because the cost of creating and destroying a thread can outweigh the performance benefit of multithreading. A more reasonable approach is the thread pool. java.util.concurrent provides a flexible thread-pool implementation: the Executor framework. It supports asynchronous task execution under many different execution policies and offers a standard way to decouple task submission from task execution, with tasks described uniformly by Runnable. Executor implementations also provide lifecycle support and hooks for statistics gathering, application management, and monitoring. Because a thread pool reuses existing threads instead of creating new ones, it reduces thread creation and destruction overhead when processing many tasks; and because worker threads usually already exist when a task arrives, the task is not delayed waiting for thread creation, improving responsiveness. By sizing the pool appropriately, you get enough threads to keep the processors busy while preventing too many threads from competing for resources and wasting them on thread management. Executor provides several useful preset thread pools, created by calling the static factory methods of the Executors class:
- newFixedThreadPool: a thread pool with a fixed maximum number of threads.
- newCachedThreadPool: a thread pool with no upper bound on the number of threads.
- newSingleThreadExecutor: a single-threaded pool that guarantees tasks execute in the order imposed by the task queue (FIFO, LIFO, priority).
- newScheduledThreadPool: a bounded pool that supports delayed and periodic task execution.
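A short sketch of putting one of these factory methods to work; the pool size and the squaring task are purely illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SquareTasks {
    public static void main(String[] args) throws Exception {
        // A fixed pool reuses 4 worker threads instead of
        // creating one thread per task.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Integer>> results = new ArrayList<Future<Integer>>();
        for (int i = 1; i <= 10; i++) {
            final int n = i;
            results.add(pool.submit(new Callable<Integer>() {
                public Integer call() {
                    return n * n; // the "task": square a number
                }
            }));
        }
        int sum = 0;
        for (Future<Integer> f : results) {
            sum += f.get(); // blocks until that task completes
        }
        pool.shutdown(); // accept no new tasks; workers exit when idle
        System.out.println(sum); // 1^2 + 2^2 + ... + 10^2 = 385
    }
}
```

Submitting a Callable rather than a Runnable lets each task return a value through its Future, decoupling task submission from task execution exactly as described above.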
Using concurrent data structures
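As a concrete example of what these classes offer, ConcurrentHashMap's atomic putIfAbsent() replaces the racy check-then-put idiom without any external lock. The word-counting scenario below is our own illustration, not from the original example code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

class WordCount {
    public static void main(String[] args) {
        ConcurrentMap<String, AtomicInteger> counts =
                new ConcurrentHashMap<String, AtomicInteger>();
        String[] words = {"a", "b", "a", "c", "a", "b"};
        for (String w : words) {
            // putIfAbsent is atomic: even if two threads race to insert
            // the same new key, exactly one AtomicInteger wins and both
            // threads end up incrementing that single winner.
            AtomicInteger fresh = new AtomicInteger(0);
            AtomicInteger prev = counts.putIfAbsent(w, fresh);
            AtomicInteger counter = (prev == null) ? fresh : prev;
            counter.incrementAndGet();
        }
        System.out.println(counts.get("a")); // 3
    }
}
```

With a plain HashMap, the containsKey/put pair would need a lock around it; here the map itself guarantees atomicity of the insert.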
The Collections framework has brought great convenience to Java programmers, but in the multi-core era it has become somewhat unsuitable. Data shared between threads is always stored in data structures: maps, stacks, queues, lists, sets. By default, the data structures in the Collections framework are not thread-safe; they cannot be safely accessed by multiple threads at the same time. The JDK's Collections.synchronizedXXX() methods provide thread-safe wrappers for these classes, but the synchronized keyword they rely on amounts to a single global lock on the entire data structure. java.util.concurrent provides more efficient collections, such as ConcurrentHashMap, ConcurrentLinkedQueue, ConcurrentSkipListMap/Set, and CopyOnWriteArrayList/Set. These data structures are designed for concurrent access by multiple threads, using fine-grained locks and new lock-free algorithms. Besides higher performance under multi-threaded load, they offer atomic operations such as put-if-absent that suit concurrent applications.

Other considerations
Do not put too much pressure on the memory system
If a thread allocates memory during execution, that alone is not a problem in Java: modern JVMs are highly optimized and typically keep a thread-local allocation buffer for each thread, so as long as that buffer is not exhausted, allocation does not need to touch the global heap. Once the JVM must allocate from the global heap, scalability usually suffers badly. The added GC pressure reduces scalability further; although parallel GC exists, its scalability is usually not ideal. If a program's main loop must allocate temporary objects on every iteration, consider the ThreadLocal and SoftReference techniques to reduce allocation.

Using ThreadLocal
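A minimal sketch of the technique: a per-thread id generator in which each thread lazily receives its own private value and never needs to synchronize on reads. The class and names are our own illustration:

```java
import java.util.concurrent.atomic.AtomicInteger;

class ThreadId {
    private static final AtomicInteger NEXT = new AtomicInteger(0);

    // Each thread gets its own copy of the variable; initialValue()
    // runs at most once per thread, so reads need no synchronization.
    private static final ThreadLocal<Integer> ID = new ThreadLocal<Integer>() {
        @Override protected Integer initialValue() {
            return NEXT.getAndIncrement();
        }
    };

    static int get() {
        return ID.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable r = new Runnable() {
            public void run() {
                // Repeated calls from the same thread see the same id.
                System.out.println(get() == get());
            }
        };
        Thread t1 = new Thread(r);
        Thread t2 = new Thread(r);
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}
```

The same pattern works for reusing a per-thread scratch buffer (for example a StringBuilder) instead of allocating a fresh one on every loop iteration.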
The ThreadLocal class holds per-thread private state, which is convenient for many applications and generally good for scalability: it gives each thread its own copy of a variable, so threads need not synchronize with one another. Note that before JDK 1.6, ThreadLocal had a very inefficient implementation; if you need ThreadLocal on JDK 1.5 or older, carefully evaluate its impact on performance. Similarly, the current JDK 6 implementation of ReentrantReadWriteLock is also fairly inefficient; if you hope to improve scalability through non-mutually-exclusive read locks, profile first to confirm it actually helps.

Lock granularity matters
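A minimal sketch of lock striping, the idea behind ConcurrentHashMap's segmented design. The counter-array use case and all names here are our own illustration:

```java
// Lock striping: split one global lock into N locks, each guarding
// a slice of the shared state, so threads touching different slices
// do not contend with each other.
class StripedCounters {
    private static final int STRIPES = 16;
    private final Object[] locks = new Object[STRIPES];
    private final long[] counts = new long[1024];

    StripedCounters() {
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new Object();
        }
    }

    void increment(int slot) {
        // Only threads whose slots map to the same stripe contend.
        synchronized (locks[slot % STRIPES]) {
            counts[slot]++;
        }
    }

    long get(int slot) {
        synchronized (locks[slot % STRIPES]) {
            return counts[slot];
        }
    }

    public static void main(String[] args) {
        StripedCounters c = new StripedCounters();
        c.increment(5);
        c.increment(5);
        c.increment(21); // same stripe as slot 5 (21 % 16), different slot
        System.out.println(c.get(5) + " " + c.get(21)); // 2 1
    }
}
```

With a single global lock, every increment would serialize; with 16 stripes, up to 16 threads can update disjoint slots in parallel.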
A coarse-grained global lock guarantees thread safety at the expense of application performance, so thinking carefully about lock granularity is important when building highly scalable Java applications. When the CPU and thread counts are small, a global lock sees little contention, and acquiring it is cheap (the JVM optimizes this case). As CPU and thread counts grow, competition for the global lock becomes increasingly fierce: apart from the one CPU that holds the lock, every CPU trying to acquire it can only sit idle, so overall CPU utilization drops and the system cannot be fully used. When we encounter a hotly contended global lock, we can try splitting it into multiple fine-grained locks, each protecting part of the shared resource; reducing the granularity of each lock reduces contention. For example, java.util.concurrent.ConcurrentHashMap improves on HashMap for multi-threaded applications: its default constructor uses 16 locks to protect the whole hash map, and users can configure even thousands of locks through constructor parameters, effectively dividing the map into that many segments, each protected by its own lock.

Conclusion
Choose an appropriate profiling tool and examine the hotspots it reports; use data structures suited to multi-threaded access, thread pools, and fine-grained locks to shrink those hotspots; then repeat the process to keep improving scalability. Building highly scalable Java applications on multi-core hardware is not easy, and reducing conflicts and synchronization between threads is the key to improving scalability. The common tools and techniques introduced in this article can help, but much depends on the specific application.
Download

Javascale.zip (10 KB, via HTTP): the Java program examples used in this article.
References

- The Kunming open-source project: download the source code for all of the examples.
- The author's blog has a simple program for measuring multi-core computing performance; although the example is written in C++, its conclusions apply equally to Java programs.
- The Python documentation has more information about the global interpreter lock and the reasons for its existence.
- Download the open-source Performance Inspector tool to observe lock usage.
- See Brian Goetz's Java theory and practice column on atomic variables for more about AtomicInteger.