Performance bottleneck analysis through the Java threading stack

Last Update:2018-07-12 Source: Internet

Author: User

Tags cpu usage


Improving performance means doing more with fewer resources. To use concurrency to improve system performance, we need to make more efficient use of the existing processor resources, which means keeping the CPU as busy as possible (busy doing useful work, of course, not burning cycles on useless computation). If the program is limited by the available CPU power, overall performance can be improved by adding processors or by clustering. In general, improving performance only requires addressing the resource that is currently the constraint, which may be:

CPU: If CPU utilization is already close to 100% and the business logic in the code cannot be simplified any further, the system has reached its performance limit and can only be improved by adding processors.
Other resources, such as the number of connections: The code can be modified to use the CPU as fully as possible, which can yield a large performance gain.

If your system shows any of the following characteristics, it has a performance bottleneck:

CPU utilization does not approach 100% as the load on the system is progressively increased.
The system runs slowly all the time. The application is frequently found to be slow, and the overall response time cannot be improved effectively by changing environmental factors (load, number of connections, etc.).
System performance gradually degrades over time. Under a stable load, the longer the system runs, the slower it becomes. This may be because some threshold has been exceeded and the system hits frequent errors that lead to deadlock or a crash.
System performance degrades as the load increases.

A good program should be able to make full use of the CPU. If a program running on a single-CPU machine cannot push CPU utilization close to 100% no matter how much load is applied, the program has a problem. The performance bottleneck analysis process for a system is as follows:

First, analyze the performance bottlenecks of a single process and bring single-process performance to its optimum.
Then perform an overall performance bottleneck analysis, because optimal single-process performance does not necessarily mean optimal performance for the whole system. In multi-threaded situations, lock contention can also degrade performance.

High performance means different things in different applications:

In some cases, high performance means the speed the user experiences, such as the responsiveness of interface operations.
In some cases, high performance means high throughput. For SMS or MMS, for example, the system cares more about throughput and is not sensitive to the processing time of each individual message.
In some cases, it is a combination of the two.

The ultimate goal of performance tuning is to bring the system's CPU utilization close to 100%. If the CPU is not fully utilized, there are several possibilities:

Insufficient load is being applied.
There are bottlenecks in the system.

1 Common performance bottlenecks

1.1 Resource contention caused by improper synchronization

1.1.1 Two unrelated functions share one lock, or different shared variables share the same lock, needlessly creating resource contention

The following is a common mistake: every method of a class is declared synchronized, so two unrelated methods that do not use the same shared variable end up sharing the this lock, which creates artificial resource contention. Declaring every method synchronized violates the principle that each lock should protect specific data. Methods with no shared resources in common then use the same lock, causing unnecessary waits. Because Java provides the this lock by default, many people like to put synchronized directly on methods. In many cases this is inappropriate, and without careful thought it easily makes the lock granularity too coarse.

Not even all the code within a single method necessarily needs lock protection. Declaring the whole method synchronized can needlessly widen the scope of the lock. Locking at the method level is a crude habit.

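Instead, such a class should guard each independent piece of shared state with its own lock and synchronize only where that state is accessed.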
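A minimal sketch of that change, assuming a hypothetical class with two unrelated counters (the class, field, and method names are illustrative and do not come from the original article). Each piece of shared state gets its own private lock object, so threads calling one method never wait for threads calling the other.

```java
// Hypothetical example class; all names are illustrative assumptions.
class RequestStats {
    // One dedicated lock per independent piece of shared state.
    private final Object requestLock = new Object();
    private final Object errorLock = new Object();

    private long requestCount; // guarded by requestLock
    private long errorCount;   // guarded by errorLock

    // The mistaken form would be "synchronized void recordRequest()",
    // which locks on "this" and contends with recordError() for no reason.
    void recordRequest() {
        synchronized (requestLock) {
            requestCount++;
        }
    }

    void recordError() {
        synchronized (errorLock) {
            errorCount++;
        }
    }

    long requests() {
        synchronized (requestLock) {
            return requestCount;
        }
    }

    long errors() {
        synchronized (errorLock) {
            return errorCount;
        }
    }
}
```

With this layout, a thread inside recordRequest() blocks only other callers that touch requestCount; callers of recordError() proceed independently.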

1.1.2 The lock scope is too large: after access to the shared resource is complete, the subsequent code is not moved outside the synchronized block

This causes the current thread to hold the lock for too long while other threads that need the lock must wait, resulting in a significant performance impact.

When time-consuming operations stay inside the synchronized block, one thread occupies the lock for a long time, and during all that time other threads can only wait. How much there is to gain from fixing this depends on the situation:

On a single CPU, moving time-consuming operations out of the synchronized block improves performance in some cases and not in others:

- The time-consuming code in the synchronized block is CPU-intensive (pure computation, etc.) and contains no low-CPU code such as disk or network I/O. Because the CPU is already 100% utilized while executing this code, shrinking the synchronized block brings no performance improvement. It does not degrade performance either.
- The time-consuming code in the synchronized block is low-CPU code such as disk or network I/O. The CPU is idle while the current thread executes code that does not consume it. If other threads can keep the CPU busy during that time, overall performance improves, so in this scenario moving the time-consuming code out of the synchronized block will improve overall performance.
On multiple CPUs, moving time-consuming operations out of the synchronized block can always improve performance:
- The time-consuming code in the synchronized block is CPU-intensive (pure computation, etc.) and contains no low-CPU code such as disk or network I/O. Because there are multiple CPUs, other CPUs may be idle, so shrinking the synchronized block lets other threads start executing this code immediately, which improves performance.
- The time-consuming code in the synchronized block is low-CPU code such as disk or network I/O. While the current thread executes code that does not consume the CPU, some CPU is always idle. If other threads can keep it busy during that time, overall performance improves, so in this scenario moving the time-consuming code out of the synchronized block will certainly improve overall performance.

In any case, narrowing the synchronization range has no negative impact on the system and in most cases improves performance, so always narrow it. Only the access to the shared resource should remain inside the synchronized block, with the time-consuming operations moved outside it.
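As a sketch of narrowing the synchronization range, assume a hypothetical cache whose slow load step (a stand-in for disk or network I/O) touches no shared state. Only the map accesses stay inside synchronized blocks. The names ReportCache and slowLoad are illustrative assumptions, not code from the original article.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class; all names are illustrative assumptions.
class ReportCache {
    private final Object lock = new Object();
    private final Map<String, String> cache = new HashMap<>(); // guarded by lock

    // Stand-in for a time-consuming operation such as disk or network I/O.
    // It uses no shared state, so it does not need the lock.
    private String slowLoad(String key) {
        return "report:" + key;
    }

    String get(String key) {
        // Short critical section: only the shared map is touched.
        synchronized (lock) {
            String cached = cache.get(key);
            if (cached != null) {
                return cached;
            }
        }

        // The slow part runs without holding the lock,
        // so other threads can use the cache in the meantime.
        String loaded = slowLoad(key);

        // Second short critical section to publish the result.
        synchronized (lock) {
            // Another thread may have stored the key meanwhile; keep the first value.
            String existing = cache.putIfAbsent(key, loaded);
            return existing != null ? existing : loaded;
        }
    }
}
```

The trade-off in this sketch is that two threads may occasionally load the same key at the same time; that duplicated work is usually cheaper than making every caller wait behind one slow load.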

1.1.3 Other issues

Sleep abuse: using sleep in polling loops in particular gives the user a noticeable sense of delay; such code can be changed to use wait and notify
String + abuse: each + produces a temporary object and copies the data
Inappropriate threading model
Inefficient SQL statements or inappropriate database design
Poor performance due to improper GC parameter settings
Insufficient number of threads
Frequent GC due to memory leaks
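For the sleep-polling problem mentioned above, a minimal sketch of the wait/notify alternative follows. The MessageBox class and its method names are hypothetical. A consumer blocked in take() wakes up as soon as a producer calls put(), instead of discovering the message only after its next sleep interval expires.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical example class; all names are illustrative assumptions.
class MessageBox {
    private final Object lock = new Object();
    private final Queue<String> messages = new ArrayDeque<>(); // guarded by lock

    void put(String message) {
        synchronized (lock) {
            messages.add(message);
            lock.notifyAll(); // wake waiting consumers right away
        }
    }

    String take() throws InterruptedException {
        synchronized (lock) {
            // Loop rather than "if": wait() may return spuriously,
            // and another consumer may have taken the message first.
            while (messages.isEmpty()) {
                lock.wait(); // releases the lock while waiting
            }
            return messages.remove();
        }
    }
}
```

In production code, java.util.concurrent.BlockingQueue offers the same behavior ready-made; the sketch only shows the mechanism.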

2.2 Means and tools for performance bottleneck analysis

The root causes of the performance bottlenecks described above can be found through thread stack analysis.

2.2.1 How to simulate and discover performance bottlenecks

Several characteristics of performance bottlenecks:

At any given time there is only one current performance bottleneck, and the next one becomes visible only after the current one is resolved. Until the current bottleneck is addressed, the next one will not appear. For example, if the second stage of a flow is the bottleneck, then once it is resolved the first stage may become the bottleneck; repeat this process to find all the performance bottlenecks.

Performance bottlenecks are dynamic: something that is not a bottleneck under low load can become one under high load. Profiling tools such as JProfiler carry their own overhead, which can prevent the system from reaching the load at which the bottleneck appears, so thread stack analysis is the truly effective approach in this scenario.

Given these characteristics, a performance simulation must apply more load than the system currently experiences; otherwise the performance bottleneck will not appear.

2.2.2 How to identify performance bottlenecks through the thread stack

With the thread stack, it is easy to identify performance bottlenecks that appear under high load in multi-threaded situations. Once a system has a performance bottleneck, the most important task is to identify it and then modify the system accordingly. In a typical multi-threaded system, first classify (group) the threads by function, and analyze the threads that execute the same functional code as one group. When analyzing the stacks, use each group as the unit for statistical analysis. If a thread pool serves different functional code, analyze the threads of that pool as one group.
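One rough way to get such statistics from inside a running JVM is sketched below: take a snapshot with Thread.getAllStackTraces() and count threads by the method at the top of each stack. The StackGrouper class is a hypothetical helper, and grouping by top frame is only one simple criterion; in practice the same counting is often done by hand or by script on a dump produced with the JDK's jstack tool.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; the class and method names are illustrative assumptions.
class StackGrouper {
    // Counts threads by the class and method at the top of their stack.
    static Map<String, Integer> countByTopFrame(Map<Thread, StackTraceElement[]> stacks) {
        Map<String, Integer> counts = new TreeMap<>();
        for (StackTraceElement[] stack : stacks.values()) {
            String top = stack.length == 0
                    ? "<no frames>"
                    : stack[0].getClassName() + "." + stack[0].getMethodName();
            counts.merge(top, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Snapshot of all live threads in this JVM.
        Map<String, Integer> counts = countByTopFrame(Thread.getAllStackTraces());
        counts.forEach((frame, n) -> System.out.println(n + "  " + frame));
    }
}
```

If most threads of a group report the same top frame, such as a socket read or Object.wait, that frame is a good first place to look.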

In general, once a system has a performance bottleneck, stack analysis shows one of three typical patterns:

The stacks of the vast majority of threads are in the same calling context, and only a very small number of idle threads are left. The possible causes are as follows:
- The number of threads is too low
- The lock granularity is too coarse, resulting in lock contention
- Resource contention
- A large number of time-consuming operations inside the lock scope
- Slow handling of remote communication
The vast majority of threads are in a wait state and only a few threads are working, yet overall performance does not improve. A possible reason is that the system has a critical path and that critical path has reached its limit.
The total number of threads is very small (some thread pool implementations create threads on demand, possibly creating threads in programs

An example


"Thread-243" prio=1 tid=0xa58f2048 nid=0x7ac2 runnable [0xaeedb000..0xaeedc480]
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at oracle.net.ns.Packet.receive(Unknown Source)
    ...
    at oracle.jdbc.driver.LongRawAccessor.getBytes()
    at oracle.jdbc.driver.OracleResultSetImpl.getBytes()
    - locked <0x9350b0d8> (a oracle.jdbc.driver.OracleResultSetImpl)
    at oracle.jdbc.driver.OracleResultSet.getBytes(O)
    ...
    at org.hibernate.loader.hql.QueryLoader.list()
    at org.hibernate.hql.ast.QueryTranslatorImpl.list()
    ...
    at com.wes.NodeTimerOut.execute(NodeTimerOut.java:175)
    at com.wes.timer.TimerTaskImpl.executeAll(TimerTaskImpl.java:707)
    at com.wes.timer.TimerTaskImpl.execute(TimerTaskImpl.java:627)
    - locked <0x80df8ce8> (a com.wes.timer.TimerTaskImpl)
    at com.wes.threadpool.RunnableWrapper.run(RunnableWrapper.java:209)
    at com.wes.threadpool.PooledExecutorEx$Worker.run()
    at java.lang.Thread.run(Thread.java:595)

"Thread-248" prio=1 tid=0xa58f2048 nid=0x7ac2 runnable [0xaeedb000..0xaeedc480]
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at oracle.net.ns.Packet.receive(Unknown Source)
    ...
    at oracle.jdbc.driver.LongRawAccessor.getBytes()
    at oracle.jdbc.driver.OracleResultSetImpl.getBytes()
    - locked <0x9350b0d8> (a oracle.jdbc.driver.OracleResultSetImpl)
    at oracle.jdbc.driver.OracleResultSet.getBytes(O)
    ...
    at org.hibernate.loader.hql.QueryLoader.list()
    at org.hibernate.hql.ast.QueryTranslatorImpl.list()
    ...
    at com.wes.NodeTimerOut.execute(NodeTimerOut.java:175)
    at com.wes.timer.TimerTaskImpl.executeAll(TimerTaskImpl.java:707)
    at com.wes.timer.TimerTaskImpl.execute(TimerTaskImpl.java:627)
    - locked <0x80df8ce8> (a com.wes.timer.TimerTaskImpl)
    at com.wes.threadpool.RunnableWrapper.run(RunnableWrapper.java:209)
    at com.wes.threadpool.PooledExecutorEx$Worker.run()
    at java.lang.Thread.run(Thread.java:595)

...

"Thread-238" prio=1 tid=0xa4a84a58 nid=0x7abd in Object.wait() [0xaec56000..0xaec57700]
    at java.lang.Object.wait(Native Method)
    at com.wes.collection.SimpleLinkedList.poll(SimpleLinkedList.java:104)
    - locked <0x6ae67be0> (a com.wes.collection.SimpleLinkedList)
    at com.wes.XADataSourceImpl.getConnection_internal(XADataSourceImpl.java:1642)
    ...
    at org.hibernate.impl.SessionImpl.list()
    at org.hibernate.impl.SessionImpl.find()
    at com.wes.DBSessionMediatorImpl.find()
    at com.wes.ResourceDBInteractorImpl.getCallbackObj()
    at com.wes.NodeTimerOut.execute(NodeTimerOut.java:152)
    at com.wes.timer.TimerTaskImpl.executeAll()
    at com.wes.timer.TimerTaskImpl.execute(TimerTaskImpl.java:627)
    - locked <0x80e08c00> (a com.facilities.timer.TimerTaskImpl)
    at com.wes.threadpool.RunnableWrapper.run(RunnableWrapper.java:209)
    at com.wes.threadpool.PooledExecutorEx$Worker.run()
    at java.lang.Thread.run(Thread.java:595)

"Thread-233" prio=1 tid=0xa4a84a58 nid=0x7abd in Object.wait() [0xaec56000..0xaec57700]
    at java.lang.Object.wait(Native Method)
    at com.wes.collection.SimpleLinkedList.poll(SimpleLinkedList.java:104)
    - locked <0x6ae67be0> (a com.wes.collection.SimpleLinkedList)
    at com.wes.XADataSourceImpl.getConnection_internal(XADataSourceImpl.java:1642)
    ...
    at org.hibernate.impl.SessionImpl.list()
    at org.hibernate.impl.SessionImpl.find()
    at com.wes.DBSessionMediatorImpl.find()
    at com.wes.ResourceDBInteractorImpl.getCallbackObj()
    at com.wes.NodeTimerOut.execute(NodeTimerOut.java:152)
    at com.wes.timer.TimerTaskImpl.executeAll()
    at com.wes.timer.TimerTaskImpl.execute(TimerTaskImpl.java:627)
    - locked <0x80e08c00> (a com.facilities.timer.TimerTaskImpl)
    at com.wes.threadpool.RunnableWrapper.run(RunnableWrapper.java:209)
    at com.wes.threadpool.PooledExecutorEx$Worker.run()
    at java.lang.Thread.run(Thread.java:595)

...

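The stacks show 51 socket accesses, 50 of which are JDBC database accesses. The other threads are blocked in the java.lang.Object.wait() method; in the sample above, those waits sit inside XADataSourceImpl.getConnection_internal, that is, the threads are waiting to obtain a database connection.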

2.2.3 Other ways to improve performance

Reduce the granularity of locks. For example, ConcurrentHashMap in its segment-based implementation (JDK 7 and earlier) uses an array of 16 locks by default. This has one side effect: locking the entire container becomes laborious and requires adding a global lock.
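The idea of replacing one lock with an array of locks can be sketched as follows. This is not ConcurrentHashMap's actual implementation; StripedCounter is a hypothetical class, and the stripe count of 16 is chosen only to echo the figure mentioned above.

```java
// Hypothetical example class; all names are illustrative assumptions.
class StripedCounter {
    private static final int STRIPES = 16; // assumption: echoes the 16 locks mentioned above

    private final Object[] locks = new Object[STRIPES];
    private final long[] counts = new long[STRIPES]; // counts[i] is guarded by locks[i]

    StripedCounter() {
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new Object();
        }
    }

    // Maps a key to one of the stripes; the mask keeps the hash non-negative.
    private int stripeOf(Object key) {
        return (key.hashCode() & 0x7fffffff) % STRIPES;
    }

    // Threads whose keys fall into different stripes do not contend.
    void increment(Object key) {
        int stripe = stripeOf(key);
        synchronized (locks[stripe]) {
            counts[stripe]++;
        }
    }

    // Whole-container operations are the laborious part: this sum visits
    // every stripe lock in turn and is not an atomic snapshot. A consistent
    // snapshot would require holding all 16 locks at once or a global lock.
    long total() {
        long sum = 0;
        for (int i = 0; i < STRIPES; i++) {
            synchronized (locks[i]) {
                sum += counts[i];
            }
        }
        return sum;
    }
}
```

Per-key operations get cheaper under contention, while operations that span the whole structure get more expensive, which is the side effect the text describes.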

2.2.4 Performance tuning end condition

Performance tuning always has a termination condition. Tuning can stop once the system meets the following two conditions:

The algorithms are sufficiently optimized
There is no CPU under-utilization caused by improper use of threads or resources


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
