http://www.ibm.com/developerworks/cn/aix/library/es-Javaperf/es-Javaperf2.html
Maximizing Java performance on AIX, Part 2: Speed requirements
This five-part series presents techniques commonly used to tune Java™ applications for optimal performance on AIX®, together with a discussion of when each technique applies. With these tips, you should be able to quickly optimize your Java environment to suit the needs of your application.
Amit Mathur ([email protected]), Senior Technical Consultant and Solution Implementation Manager, IBM
December 17, 2007
Introduction
This is the second article in a five-part series on Java performance tuning on AIX. If you have not already done so, you are strongly encouraged to read Part 1 of the series before proceeding.
This article looks at how to maximize execution speed and throughput. For programs that involve a user interface, we also look at how to keep the responsiveness of the system at an acceptable level.
The first section of this article covers general techniques that apply to most situations and provides a quick reference to the tools that are useful for detecting and investigating CPU bottlenecks. The second section describes various application characteristics and how to optimize for them; this discussion draws on your knowledge of your application to determine which techniques fit best. The third section describes the individual techniques. The next article in the series is introduced at the end of this article.
CPU bottlenecks
This article is about making your application run faster, more responsive, or both.
By comparing the actual performance numbers against the expected ones, you can usually determine whether the application is running too slowly. Alternatively, the application's user interface may freeze periodically, or network connections to the application may time out because the application is too busy to respond. Tools such as topas or tprof will show whether CPU utilization is approaching 100%. You need to be able to distinguish abnormal activity from a system that is simply undersized; if what you really need is a faster CPU or a bigger machine, there is not much room for tuning.
As a first step, use topas or a similar tool to determine whether Java is the largest CPU consumer. If Java appears well down the list of CPU users, CPU-specific optimizations are unlikely to help much. A brief overview of topas was provided in Part 1.
Ideally, the application's CPU utilization should reach or exceed 90%. If you have reached this point and are still dissatisfied with the throughput, the machine you are using may simply be too small. If you use DLPAR, try adding one or two more CPUs and measuring the difference.
vmstat
vmstat can be used to provide a variety of statistics about the system. For CPU-specific work, try the following command:
vmstat -t 1 3
This command takes 3 samples, 1 second apart, with a timestamp (-t). You can of course change the parameters as you like. The output of this command looks similar to the following:
kthr     memory             page              faults        cpu     time
----- ----------- ------------------------ ------------ ----------- --------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa hr mi se
 0  0 45483   221   0   0   0   0    1   0 224  326 362 24  7 69  0 15:10:22
 0  0 45483   220   0   0   0   0    0   0 159   83  53  1  1 98  0 15:10:23
 2  0 45483   220   0   0   0   0    0   0 145  115  46  0  9 90  1 15:10:24
Some of the things to watch in this output include:
- The r (run queue) and b (blocked) columns start to rise, especially above 10. This is usually a sign that too many processes are competing for the CPU (the sketch after this list shows a simple way to watch for this).
- If cs (context switches) rises very high compared to the number of processes, you may need to use vmtune to tune the system. This topic is beyond the scope of this series.
- In the cpu section, us (user time) indicates the time spent in programs. Assuming Java is at the top of the tprof list, you will want to tune the Java application itself.
- In the cpu section, if sy (system time) is higher than expected and there is still id (idle) time left, this may indicate lock contention. Check the kernel portion of the tprof output for lock-related calls. You may want to try running multiple instances of the JVM. You can also look for deadlocks in a javacore file.
- In the cpu section, if wa (I/O wait) is very high, this may indicate a disk bottleneck, and you should use iostat and other tools to examine disk usage.
- Nonzero values in the pi and po (page in/out) columns indicate that the system is paging, and you may need more memory. It may be that the stack size you have set is too high for some of the JVM instances, or that the heap you have allocated is larger than the physical memory on the system. Of course, other applications may also be using memory, or file pages may be taking up too much of it.
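As a rough sketch of how you might keep an eye on these columns over time, the following command samples vmstat once per second and prints only the samples whose run queue (the first column, r) exceeds 10. It assumes the standard AIX column layout shown above; adjust the field index and the threshold to suit your own output.
vmstat -t 1 | awk 'NR > 3 && $1 > 10 { print "high run queue: " $0 }'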
iostat
iostat provides the same CPU information as vmstat, and in addition gives you information such as disk I/O statistics.
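For example, the following invocation (purely illustrative) takes three samples at five-second intervals, reporting both CPU and per-disk statistics:
iostat 5 3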
ps
ps is a very flexible tool for determining which programs are running on the system and which resources they are using. It displays statistics and status information about processes on the system, such as the process or thread ID, I/O activity, and CPU and memory usage.
ps -ef | grep java
This command lets you determine the process IDs of all active Java processes. Many of the other commands require you to determine the process ID first, and the -ef flags help you tell multiple Java processes apart by displaying their command-line arguments.
ps -p pid -m -o THREAD
Using the PID (process ID) of the Java process of interest, you can check how many threads it has created. This is especially useful when you are monitoring a large application: you can pipe the output above through wc -l to get the number of threads created by the JVM, and run this in a loop to detect threads that start or terminate when they should not, as in the sketch below.
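A minimal sketch of such a loop follows. It assumes the process ID of the JVM has already been placed in the PID shell variable; note also that the ps output includes a header line and a line for the process itself, so the figure is an approximation of the thread count rather than an exact value.
# Print an approximate thread count for the JVM in $PID every 10 seconds.
while true
do
    echo "`date`: `ps -p $PID -m -o THREAD | wc -l` lines of thread output"
    sleep 10
done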
ps au[x]
This is useful for getting %CPU and %memory figures, sorted so that the heaviest users appear first, which makes it handy for quickly locating bottlenecks on the system.
ps v[g]
Displays virtual memory usage. Note that the preferred method for monitoring the native and Java heaps is svmon, which is described in detail in Part 3 of this series.
ps eww pid
Using the PID (process ID), you can get the environment settings of the process. For example, this shows the full path of the java executable being run, information that may not appear in a normal ps listing. Note that to get a complete listing of the environment, we recommend generating a javadump file instead (see the IBM Developer Kit Diagnosis documentation for more information).
sar
sar -u -P ALL x y
This command can be used to check how evenly CPU usage is spread across multiple CPUs. An unbalanced distribution may indicate that the application is not multithreaded and that you may need to run multiple instances of it. The following example takes two samples, five seconds apart, on a two-processor system running at roughly 80% utilization.
# sar -u -P ALL 5 2

AIX aix4prt 0 5 000544144c00    02/09/01

15:29:32 cpu    %usr    %sys    %wio   %idle
15:29:37   0     ...
           1     ...
           -     ...     ...       0      20
15:29:42   0     ...
           1     ...
           -     ...     ...       0      22
Average    0     ...
           1     ...
           -     ...     ...       0      21
You might also see all CPUs close to 100% utilization, or only a single CPU at 100% utilization (for example, while the JVM is performing a compaction).
tprof
tprof is one of the legacy AIX tools; it provides a detailed breakdown of CPU usage by AIX process ID and name. The tool was completely rewritten for AIX 5.2, and the following example uses the AIX 5.1 syntax. Refer to "AIX 5.2 Performance Tools Update: Part 3" for the new syntax.
The simplest way to invoke this command is to use:
# tprof -kse -x "sleep 10"
After 10 seconds, a new file named __prof.all is generated; it contains information about which commands are using CPU on the system. Search for FREQ; the information should look similar to the following example:
Process      FREQ  Total Kernel   User Shared  Other
=======       ===  ===== ======   ==== ======  =====
oracle        244  10635   3515   6897    223      0
java          247   3970    617      0   2062   1291
wait          ...   1515   1515      0      0      0
...
=======       ===  ===== ======   ==== ======  =====
Total        1060  19577   7947   7252   3087   1291
This example shows that more than half of the CPU time is associated with the Oracle application, and that Java is using approximately 3970/19577, or about one-fifth, of the CPU. wait usually represents idle time, but it can also include the I/O-wait portion of CPU usage.
To determine if there is a lot of lock contention, you should also check the KERNEL section:
              Total Ticks For All Processes (KERNEL) = 7787

Subroutine                 Ticks    %   Source                                         Address  Bytes
=============              =====  ====  ========                                       =======  =====
.unlock_enable_mem          2286  11.7  low.s                                             930c    1f4
.waitproc_find_run_queue    1372   7.0  ../../../../../src/bos/kernel/proc/dispatch.c    2a6ec    2b0
.e_block_thread              893   4.6  ../../../../../src/bos/kernel/proc/sleep2.c
In the Shared Objects section, look for names from libjvm.a, especially gc_* or anything close to a GC-related term (mark, sweep, compact). If a large amount of such content shows up, the JVM process probably needs GC tuning; a quick scan such as the one below can help spot it.
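One quick, informal way to scan the report for such symbols is a filter along these lines, where __prof.all is the file generated in the example above and the patterns are merely representative of typical GC-related names:
egrep -i "gc_|mark|sweep|compact" __prof.all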
You should also look for any significant subroutines that consume a large percentage of CPU ticks. For example, one tprof output showed a fairly high value for clProgramCounter2Method:
Subroutine                 Ticks    %   Source                                         Address  Bytes
=============              =====  ====  ========                                       =======  =====
.clProgramCounter2Method    3551  14.8  /userlvl/ca131/src/jvm/sov/cl/clloadercache.c
After examining a number of such reports, we found that removing a Throwable.printStackTrace() call led to a significant performance improvement. The connection to this particular method was derived from analyzing the tprof output.
Java-specific Tips
In almost all cases (see the tips for the exceptions), the JIT compiler must be left enabled, because the difference it makes is the difference between interpreting bytecode and executing native code. The JIT can provide improvements of up to 25 times, so it is a critical performance component for Java.
Garbage collection is another vital component, so it must be checked and tuned as needed. Note that although enabling GC tracing (with -verbosegc) has a slight negative impact, the advantage of being able to monitor and analyze the heap clearly outweighs that cost. Another consideration is that a well-conditioned heap minimizes the amount of -verbosegc output that is printed, so by tuning the heap you also minimize the overhead of the extra tracing.
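As an illustration (MyApp is a placeholder for your application's main class), GC tracing can be enabled on the command line and captured to a file for later analysis; on the IBM JVM the -verbosegc output is written to stderr:
java -verbosegc MyApp 2> gc_trace.log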
Characteristic-based optimization techniques
Let's look at the different characteristics a typical application can have. Find the behaviors that match your application, whether by design or by observation, and apply the appropriate techniques.
Long-running applications
IBM Java is designed to give its best behavior for long-running applications such as server code. If, for some reason, your test case runs for less than about five minutes, you may find that the work IBM Java does to prepare for a long run hurts startup time. If fast startup matters more to your application than long-run performance, look at Tip CPU001: Quick application startup and Tip CPU004: Remove GC completely. If Tip CPU004 does not apply to you, consider Tip CPU012: Avoid heap resizing instead. In extreme cases, if your test case is so short that even JIT initialization is too expensive, you might test with the JIT turned off. Note that we do not list disabling the JIT as a separate performance technique because, as mentioned in the previous article, it is probably the single worst thing you can do to application performance.
If your application can tolerate a slight startup delay, look at Tip CPU003: Compile everything on first call and Tip CPU008: Use a small heap. For long-running applications with distinct "initialize" and "run" phases, Tip CPU003 is especially convenient.
Level of interactivity
Depending on whether your code is computation-intensive, the responsiveness of the JVM can range from absolutely critical to irrelevant. If the JVM you are tuning runs a GUI, long GC pauses will be unacceptable. On the other hand, long pauses may be acceptable if you run multiple JVM instances that share the load, or if you are doing batch processing.
For applications that cannot tolerate long pauses, see Tip CPU002: Use concurrent GC, Tip CPU004: Remove GC completely, Tip CPU007: Disable explicit System.gc() calls, Tip CPU008: Use a small heap, Tip CPU009: Eliminate mark stack overflow, and Tip CPU012: Avoid heap resizing. In most cases Tip CPU004 applies only to applications that run for a short time. Note that Tip CPU008 must be considered together with the memory characteristics of the application, because it can have the opposite effect if applied incorrectly.
For applications that can tolerate longer pauses, consider Tip CPU003: Compile everything on first call. Note that long pauses are rarely a good thing, so even if the application can tolerate them you should still investigate and correct the problem; there is nothing to gain from a misconfigured JVM instance.
CPU consumption
If you are running an application whose thread count exceeds the number of installed CPUs, it is normal to see overall CPU utilization stay at 90% or higher, and any kind of background processing can hurt your application's throughput. On the other hand, if your application is a server whose threads sleep most of the time and wake up only to service incoming requests, you may be able to use background processing to reduce the impact of long GC pauses.
For applications that are CPU-intensive and need to minimize background processing, consider Tip CPU007: Disable explicit System.gc() calls, Tip CPU008: Use a small heap, and Tip CPU009: Eliminate mark stack overflow. As mentioned earlier, Tip CPU008 must be considered together with the application's memory characteristics.
For applications that are not CPU-intensive, it is highly recommended that you consider Tip CPU002: Use concurrent GC. It reduces the overall pause time when a GC cycle does arrive.
Well-defined locality of reference
If the application has some methods that are executed very often while others are executed rarely, Tip CPU003: Compile everything on first call can be a good way to improve performance.
Degree of parallelism
If an application runs multiple threads to do its work, it will benefit from a system with more CPUs. With dynamic partitioning, adding CPUs helps immediately, because Java threads can be dispatched to the newly added CPUs right away. Tip CPU005: Use a large number of threads, Tip CPU006: Reduce lock contention, and Tip CPU011: Systems with more than 24 CPUs discuss other optimizations you can try.
However, if the application has only a single thread of execution, it is limited by the processing power of a single CPU. In this case, you may want to try Tip CPU002: Use concurrent GC and Tip CPU010: Single-CPU systems. Tip CPU010 is particularly helpful if you are trying to run multiple JVM instances on one system (for example, in a clustered environment).
The tips
In what follows, Java command-line arguments (specified before the class or JAR file name) are referred to as "switches." For example, the command line "java -mx2g Hello" contains a single switch, "-mx2g".
Tip CPU001: Quick application startup
You can use the non-standard switch -Xquickstart to shorten your application's startup time. This switch lowers the initial JIT optimization level and re-optimizes only those methods that later become hot. For applications whose execution is not concentrated in a small number of methods, the result is a much faster startup.
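For example (MyApp is simply a placeholder for your own main class):
java -Xquickstart MyApp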
Note: Because of the multi-level optimization approach, this switch can have a negative impact on long-running applications.
Tip CPU002: Use concurrent GC
You can specify the concurrent mark garbage collection policy to reduce the pause times introduced by GC cycles. This is selected with the -Xgcpolicy:optavgpause switch.
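For example, again with MyApp standing in for your application:
java -Xgcpolicy:optavgpause MyApp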
Note: In some cases, CPU-intensive applications may show a throughput degradation when concurrent mark is enabled.
Tip CPU003: Compile everything (or selected methods) on first call
You can set the environment variable IBM_MIXED_MODE_THRESHOLD to 0 to turn off the Mixed Mode Interpreter (MMI). The result is that every method is JIT-compiled the first time it is called. Add the following line to your environment settings, or simply run it before starting Java:
export IBM_MIXED_MODE_THRESHOLD=0
You can also experiment with non-zero values to determine whether a particular MMI threshold gives better performance than zero. On AIX, Java 1.3.1 uses 600 as the threshold, while Java 1.4 uses a value greater than 1000 (note that these values may change). The IBM Developer Kit Diagnosis documentation provides more information in the "Selecting the MMI threshold" section of the "JIT diagnostics" chapter.
If you only want to affect specific classes, you can use JITC_COMPILEOPT=FORCE(0){classname}{methodname} instead. For example:
export JITC_COMPILEOPT=FORCE(0){com/myapp/*}{*}
This example compiles all methods of all classes in the com.myapp.* packages the first time they are loaded.
export JITC_COMPILEOPT=FORCE(0){*}{uniquename}
This example compiles, on first load, every method named uniquename, regardless of its class.
export JITC_COMPILEOPT=FORCE(0){com/myapp/special}{specialmethod}
This example compiles only this particular method the first time it is loaded. In addition to * (which matches zero or more characters), you can also use the "?" wildcard to match a single character.
You can specify multiple classes and/or methods using the following syntax:
export JITC_COMPILEOPT=FORCE(0){CLASS1}{METHOD1}{CLASS2}{METHOD2}
Make sure to document clearly that this is an optimization rather than a fix!
Note: This setting may increase the application's startup time.
Tip CPU004: Remove GC completely
You can set the initial and maximum heap sizes to a value so large that no allocation failure ever occurs while the application runs. Enable verbosegc for these runs to confirm that the strategy is actually working!
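A sketch of such a run follows; the 2 GB figure is purely illustrative and must be replaced with a size large enough, based on your own verbosegc data, that the heap never fills during the run (MyApp is a placeholder):
java -ms2048m -mx2048m -verbosegc MyApp 2> gc_check.log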
Note: When a GC does occur, its cycle can be quite long, so this technique should be used only in rare cases.
Tip CPU005: Use a large number of threads
To scale up to a larger number of threads, use the -Xss switch to specify a stack size smaller than the default (the default varies with the Java version). This lets you scale to more threads while reducing the native memory footprint of your application.
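For example, the following (illustrative) setting gives each thread a 128 KB stack; the right value depends entirely on how deep your application's call chains go:
java -Xss128k MyApp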
Note: If the stack size is too small, you may encounter stack overflow exceptions.
Tip CPU006: Reduce lock contention
If the application architecture allows it, try running multiple Java instances to reduce lock contention. Application servers that allow such configurations make this easier; for example, WebSphere lets you run multiple nodes on the same physical machine.
Note: This only masks the problem; you should still review the code that causes the excessive lock contention. You can use tprof or a Java profiler to locate the areas that need review.
Tip CPU007: Disable explicit System.gc() calls
With the non-standard switch -Xdisableexplicitgc, you no longer need to delete System.gc() calls from the code: the calls are simply ignored, which returns GC management to the JVM.
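For example, with MyApp again standing in for your application:
java -Xdisableexplicitgc MyApp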
Note: This is a bad idea if some feature depends on a System.gc() call actually running (for example, a button on the application's screen), because that feature will stop working. There may be other legitimate reasons for System.gc() calls to exist in the code.
Tip CPU008: Use a small heap
Use a heap size that never lets compaction times become intolerable. If, for some reason, the application ends up doing a great deal of compaction, compacting a heap of up to 1 GB takes far less time than compacting a larger heap.
Note: This optimization becomes counterproductive if the smaller heap triggers more compactions. Use this technique only if your application creates a large number of temporary objects.
Tip CPU009: Eliminate mark stack overflow (MSO)
If you observe the "Flag stack Overflow (Overflow)" Message in the VERBOSEGC log, you can reduce the number of objects that remain active in the heap so that these messages disappear. Newer versions of Java have much better MSO processing capabilities. This technique is included because MSO can seriously damage the performance of an application and must be treated as a flaw rather than an optimization.
Tip CPU010: Single-CPU systems
You can use the bindprocessor command to bind a Java process to a specific processor (a sketch follows below). Consider this technique to keep multiple JVM instances from competing with each other for CPU scheduling. If the system is not a single-processor machine, you may also want to set -Xgcthreads0.
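As a sketch, the following binds a Java process to logical processor 0; it assumes the process ID has already been located (for example with ps -ef | grep java) and placed in the PID variable:
bindprocessor $PID 0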
If you are running the application in a single-CPU LPAR that will not be reconfigured dynamically to add more CPUs, you can also export NO_LPAR_RECONFIGURATION=1 to get better performance in some cases.
Note: By forcing Java into a single-CPU configuration, you give up some of its best performance characteristics. NO_LPAR_RECONFIGURATION also disables Java's dynamic adaptation to DLPAR reconfiguration and should be used with caution.
Tip CPU011: Systems with more than 24 CPUs
For 24-way to 32-way systems, experiment with -Xgcpolicy:subpool, because this GC policy is tuned to deliver better performance on larger configurations.
Tip CPU012: Avoid heap resizing
You can maintain a fixed-size heap to avoid spending time resizing the heap whenever the percentage of free space drops below (or rises above) a threshold, as in the sketch below.
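A minimal sketch: setting the initial and maximum heap sizes to the same value (512 MB here is purely illustrative, and MyApp is a placeholder) keeps the heap at a fixed size for the life of the JVM:
java -ms512m -mx512m MyApp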
Note: Even if heap usage drops to 10% of its peak, the application's memory footprint remains at the specified heap size.
Summary
This article described how to use AIX tools for Java performance monitoring and provided a list of common adjustments for optimizing your application's CPU usage. The next article in this series discusses memory tuning for Java applications on AIX.
Resources
- Other parts of this article series:
- Part 1
- Part 3
- Part 4
- Part 5