This is the fifth article in the series "Become a Java GC Expert". In the first article, an in-depth look at Java garbage collection, we learned how the different GC algorithms proceed, how GC works, and the concepts of the young generation and the old generation. You should also be aware of the 5 GC types in JDK 7 and the impact of each type on your application.
The second article, on how to monitor Java garbage collection, explained how the JVM actually performs garbage collection, how to monitor GC, and which tools make the process more efficient.
The third article, on how to tune Java garbage collection, showed some best practices based on real-world cases. It explained how to minimize the number of objects that end up in the old area so that full GC runs less often, and how to set the GC type and memory sizes.
The fourth article, on the MaxClients parameter of Apache and its effect on Tomcat during full GC, explained the importance of MaxClients and its significant impact on overall system performance during garbage collection.
This fifth article explains the principles of Java program performance tuning, especially the knowledge the process requires and how to determine whether your program needs tuning at all. It also describes problems you may encounter during tuning. At the end of the article, I'll give some advice on making better decisions when tuning Java programs.
Overview
Not every program needs to be tuned. If a program performs as expected, you do not have to spend extra effort improving its performance. However, a program rarely meets its performance requirements as soon as debugging is finished, and that is when tuning becomes necessary. Regardless of the programming language, tuning an application requires broad technical knowledge and a great deal of concentration. In addition, you should not tune two programs in the same way, because each program works in its own way and uses resources differently. This is why tuning requires more basic knowledge than writing programs does. For example, you need to be familiar with virtual machines, operating systems, and computer architecture. Only when you apply this knowledge to a program can you tune it successfully.
Sometimes tuning a Java program requires only changing JVM parameters, such as GC parameters. At other times you need to modify the program code. Either way, you first need to monitor the Java program while it runs. This article therefore answers the following questions:
- How do I monitor Java programs?
- What parameters should be set for the JVM?
- How do I determine if I need to modify my code?
Knowledge required for tuning Java programs
Java programs run in a Java virtual machine, so to tune them you need to understand how the JVM works. My blog post, Understanding JVM Internals, gives a deeper explanation of the JVM.
In this article, knowledge of how the JVM operates mainly means knowledge of GC and HotSpot. These two areas alone may not be enough to tune every Java program, but in most cases they are the factors that affect Java program performance.
Note that, from the operating system's point of view, the JVM is just another application process. To create a good operating environment for the JVM, you also need to understand how the operating system allocates resources to processes. In other words, to tune a Java program you should understand how the operating system and the hardware work, in addition to the JVM.
Knowledge of the Java language itself is also required. It is equally important to understand locks and concurrency, class loading, and object creation.
When you start tuning a Java program, you should bring all of this knowledge together.
The process of optimizing Java program performance
Figure 1 is a flowchart of Java program performance tuning, taken from Java Performance by Charlie Hunt and Binu John.
Figure 1: The process of Java program performance tuning
JVM distribution model
The JVM distribution model decides whether a Java program runs on one JVM or on several. You can choose according to effectiveness, responsiveness, and maintainability. When running JVMs on multiple servers, you can also choose between running several JVMs on each server and running one JVM per server. For example, on each server you can run one JVM with an 8 GB heap or four JVMs with a 2 GB heap each. Decide this number based on the number of processor cores and the nature of the program. When responsiveness has priority, a 2 GB heap is better than an 8 GB heap, because a full GC completes in less time. On the other hand, an 8 GB heap reduces the frequency of full GC, and if your program uses an internal cache, the larger heap can also improve responsiveness by raising the cache hit ratio. To sum up, choosing the right model means weighing the characteristics of the application and picking the model whose weaknesses matter least.
JVM Architecture
Choosing a JVM really means deciding whether to use a 32-bit or a 64-bit JVM. Under the same conditions, you should prefer the 32-bit JVM, because it performs better than the 64-bit JVM. However, the maximum heap a 32-bit JVM can address is 4 GB (and on either a 32-bit or a 64-bit operating system, the size that can actually be allocated is only 2 to 3 GB). If you need a larger heap, a 64-bit JVM is the appropriate choice.
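Before deciding between the two, it can help to confirm which kind of JVM a program is currently running on. The following is a minimal sketch, not part of the original article. It assumes a HotSpot-based JVM, because the `sun.arch.data.model` system property is vendor-specific and may be missing elsewhere; the class and method names are made up for illustration.

```java
final class JvmArchitecture {
    /** Returns "32" or "64" on HotSpot-based JVMs, or "unknown" if the property is absent. */
    static String dataModel() {
        // Assumption: "sun.arch.data.model" is a HotSpot-specific property,
        // so other JVMs may not define it.
        String model = System.getProperty("sun.arch.data.model");
        return model != null ? model : "unknown";
    }

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("JVM data model: " + dataModel());
        System.out.println("os.arch:        " + System.getProperty("os.arch"));
        System.out.println("Max heap (MB):  " + maxHeapMegabytes());
    }
}
```

Running it with and without an explicit -Xmx value shows how much heap the chosen JVM actually accepts on a given machine.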
Table 1: Performance comparisons (data source)
Benchmark | Time (seconds) | Factor
C++ Opt | 23 | 1.0x
C++ Dbg | 197 | 8.6x
Java 64-bit | 134 | 5.8x
Java 32-bit | 290 | 12.6x
Java 32-bit GC* | 106 | 4.6x
Java 32-bit SPEC GC* | 89 | 3.7x
Scala | 82 | 3.6x
Scala low-level* | 67 | 2.9x
Scala low-level GC* | 58 | 2.5x
Go 6g | 161 | 7.0x
Go Pro* | 126 | 5.5x
The next step is to run the program to test its performance. This process includes GC tuning, changing operating system settings, and modifying code. For these tasks, you can use system monitoring tools or profiling tools.
Note: Tuning for responsiveness and tuning for throughput may call for different methods. If stop-the-world pauses occur frequently (for example, when a full GC temporarily suspends program execution), responsiveness drops even if throughput is high. Do not forget that tuning always involves tradeoffs, and not only between responsiveness and throughput. For example, you may have to use more CPU resources to reduce memory usage, or accept a decline in either responsiveness or throughput. The opposite can also happen, so actual tuning should follow the priority of each indicator.
The process in Figure 1 above can be used for almost all Java programs, including Swing applications. However, it is not a good fit for the server-side programs that our company, NHN, uses to provide web services. The process in Figure 2 below is a modification of Figure 1 that is simpler and better suited to NHN.
Figure 2: Tuning process for Java programs at NHN
Here, Select JVM means using a 32-bit JVM whenever possible, unless you need a 64-bit JVM to hold a cache of several gigabytes.
The rest of this article follows the process in Figure 2 and explains each step.
JVM parameters
I'll focus on how to set appropriate JVM parameters for a web server program. Although it does not suit every case, the best GC algorithm for web server programs is usually Concurrent Mark Sweep (CMS), because low latency is very important. Of course, long stop-the-world pauses may still occur with CMS, depending on how the new generation (New area) is allocated, but this can often be resolved by adjusting the size of the new generation or its proportion of the whole heap.
Specifying the size of the new generation is as important as specifying the total heap size. It is best to use -XX:NewRatio to set the size of the new generation relative to the whole heap, or to use -XX:NewSize to set the required new generation size directly. This configuration is necessary because most objects do not survive for long. In a web program, apart from cached data, most objects are created only while an HttpRequest is processed and its HttpResponse is produced. This rarely takes more than 1 second, which means these objects live no longer than 1 second. If the new generation is not large enough, objects are moved to the old generation to make room for new ones. Garbage collection in the old generation is much more expensive than in the new generation, so the new generation must be large enough.
However, when the new generation exceeds a certain size, responsiveness drops, because new generation garbage collection basically copies data from one survivor area to the other (from space and to space). In addition, stop-the-world occurs during garbage collection in the new generation as well as in the old generation. If the new generation becomes larger, the survivor areas also become larger, so more data is copied each time. For this reason, the new generation should be given an appropriate size, taking into account the default NewRatio value of the HotSpot JVM on each operating system.
Table 2: Default NewRatio values by operating system and option
OS and option | Default -XX:NewRatio
Sparc -server | 2
Sparc -client | 8
x86 -server | 8
x86 -client | 12
If NewRatio is set, the new generation occupies 1/(NewRatio + 1) of the whole heap. The table above shows that the default NewRatio of Sparc -server is very small, because Sparc systems were used for high-end applications more often than x86 systems when the defaults were chosen, and the value was set with them in mind. Now that x86 performance has improved greatly, x86 systems are commonly used as servers, so setting NewRatio to 2 or 3, as on Sparc -server, is a better choice.
Alternatively, you can specify NewSize and MaxNewSize instead of NewRatio. The new generation is then created with the size given by NewSize and can grow up to MaxNewSize. Eden (the area where newly created objects are stored) and the survivor areas grow in the same proportion. Just as you set -Xms (translator's note: the original says -Xs, presumably a typo) and -Xmx to the same value, setting NewSize and MaxNewSize to the same value is also a good choice.
If you specify both NewRatio and NewSize, the larger of the two is used. When the heap is created, the initial new generation size can therefore be calculated with the following expression:
min(MaxNewSize, max(NewSize, heap / (NewRatio + 1)))
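To make the expression concrete, here is a small sketch that evaluates it for a few settings. It only mirrors the formula given in the article; it is not HotSpot's actual sizing code, which applies further alignment and limits. The class name, method name, and the sample values in megabytes are assumptions chosen for illustration.

```java
final class NewGenSizing {
    /**
     * Evaluates min(MaxNewSize, max(NewSize, heap / (NewRatio + 1))).
     * All sizes must use the same unit (megabytes in the demo below).
     */
    static long initialNewSize(long heap, long newRatio, long newSize, long maxNewSize) {
        long fromRatio = heap / (newRatio + 1); // share of the heap implied by NewRatio
        return Math.min(maxNewSize, Math.max(newSize, fromRatio));
    }

    public static void main(String[] args) {
        // Hypothetical settings: 1024 MB heap, NewRatio=2, NewSize=256 MB, MaxNewSize=512 MB.
        System.out.println(initialNewSize(1024, 2, 256, 512) + " MB"); // ratio wins: 1024 / 3 = 341
        // With NewRatio=8 the ratio gives only 113 MB, so NewSize (256 MB) wins.
        System.out.println(initialNewSize(1024, 8, 256, 512) + " MB");
        // With a 4096 MB heap and NewRatio=1 the ratio gives 2048 MB, capped by MaxNewSize.
        System.out.println(initialNewSize(4096, 1, 256, 512) + " MB");
    }
}
```

The three calls show each term of the expression taking effect in turn: the ratio, the NewSize floor, and the MaxNewSize cap.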
In any case, it is impossible to find the right heap size and new generation size in a single attempt. Based on my experience running web servers at NHN, I recommend starting with the following JVM parameters. After monitoring the program's performance under these parameters, you will be able to choose a more appropriate GC algorithm or configuration.
Table 3: Recommended JVM parameters
Type | Parameters
Operating mode | -server
Total heap memory size | Set the same value for -Xms and -Xmx.
New generation size | -XX:NewRatio: a value from 2 to 4. -XX:NewSize=? -XX:MaxNewSize=?. NewSize can also be used instead of NewRatio.
Permanent generation size | -XX:PermSize=256m -XX:MaxPermSize=256m. Set a value that causes no problems at run time; this parameter does not affect performance.
GC log | -Xloggc:$CATALINA_BASE/logs/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps. Writing GC logs has no notable effect on Java program performance, so it is recommended to log as much as possible.
GC algorithm | -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75. These settings are generally recommended, but depending on the characteristics of the program, others may work better.
Heap dump file when OOM occurs | -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$CATALINA_BASE/logs
Action after OOM occurs | -XX:OnOutOfMemoryError=$CATALINA_HOME/bin/stop.sh or -XX:OnOutOfMemoryError=$CATALINA_HOME/bin/restart.sh. After the heap dump file is written, take an appropriate action according to your management policy.
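After changing startup options, it is worth confirming that the running JVM actually received them. The sketch below reads the JVM's input arguments through the standard `java.lang.management` API and checks a few of the options from Table 3. The class and helper names are hypothetical, and the comparison is a plain string comparison, so it treats `-Xms1g` and `-Xmx1024m` as different values.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class JvmOptionCheck {
    /** Returns the text after the prefix for the first matching argument, or null if absent. */
    static String optionValue(List<String> jvmArgs, String prefix) {
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                return arg.substring(prefix.length());
            }
        }
        return null;
    }

    /** True when -Xms and -Xmx are both present and written with the same value. */
    static boolean heapIsFixedSize(List<String> jvmArgs) {
        String xms = optionValue(jvmArgs, "-Xms");
        String xmx = optionValue(jvmArgs, "-Xmx");
        return xms != null && xms.equals(xmx); // string comparison only, no unit conversion
    }

    public static void main(String[] args) {
        // Options the JVM was started with (not the program's own arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Heap fixed (-Xms == -Xmx): " + heapIsFixedSize(jvmArgs));
        System.out.println("NewRatio:   " + optionValue(jvmArgs, "-XX:NewRatio="));
        System.out.println("GC log:     " + optionValue(jvmArgs, "-Xloggc:"));
        System.out.println("Heap dumps: "
                + (optionValue(jvmArgs, "-XX:+HeapDumpOnOutOfMemoryError") != null));
    }
}
```

A value of null in the output means the option was not passed on the command line, so the JVM is using its default.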
Measuring the performance of the program
In order to get the performance of the program, you need the following information:
- System throughput (TPS, OPS): Needed for an overall, conceptual understanding of the program's performance.
- Requests per second (RPS): Strictly speaking, RPS is not the same as responsiveness, but you can interpret it as such. This indicator shows how long it takes for a user to get the result of a request.
- Standard deviation of RPS: If possible, RPS should be kept even. Once a deviation appears, you should check the GC or the network system.
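The third indicator can be computed from RPS samples taken at regular intervals during a load test. The following is a minimal sketch that calculates the mean and the population standard deviation. The class name and the sample values are made up for illustration and are not measurements from the article.

```java
final class RpsStats {
    /** Arithmetic mean of the samples. */
    static double mean(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        double sum = 0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    /** Population standard deviation: square root of the mean squared distance from the mean. */
    static double standardDeviation(double[] samples) {
        double mean = mean(samples);
        double squaredDiffs = 0;
        for (double s : samples) {
            squaredDiffs += (s - mean) * (s - mean);
        }
        return Math.sqrt(squaredDiffs / samples.length);
    }

    public static void main(String[] args) {
        // Hypothetical RPS readings, one per sampling interval.
        double[] rps = {980, 1010, 995, 1005, 620, 1000};
        System.out.printf("mean RPS: %.1f%n", mean(rps));
        System.out.printf("std dev:  %.1f%n", standardDeviation(rps));
    }
}
```

A single low reading such as the 620 in the sample data raises the standard deviation sharply, which is the kind of deviation that should prompt a look at the GC log for that interval.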
For more accurate results, wait until the program has fully warmed up before taking measurements, because the bytecode is compiled into native machine code by the HotSpot JIT only after the program has been running for a while. In practice, apply load to the target functionality for at least 10 minutes with a tool such as nGrinder before you measure.
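The effect of warm-up can be seen on a small scale with a timing helper that runs a task several times before it starts the clock. This is only an illustrative sketch with hypothetical names; it does not replace a load test with a tool such as nGrinder, and the iteration counts are arbitrary.

```java
final class WarmupTimer {
    /** Written by the demo task so the JIT cannot discard its work. */
    static volatile long sink;

    /** Runs the task warmupRuns times untimed, then returns the average time per timed run in nanoseconds. */
    static double averageNanos(Runnable task, int warmupRuns, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // untimed: gives the JIT a chance to compile the hot code
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / (double) measuredRuns;
    }

    public static void main(String[] args) {
        Runnable task = () -> {
            long sum = 0;
            for (int i = 0; i < 100_000; i++) {
                sum += i % 7;
            }
            sink = sum;
        };
        System.out.printf("without warm-up: %.0f ns/run%n", averageNanos(task, 0, 100));
        System.out.printf("after warm-up:   %.0f ns/run%n", averageNanos(task, 10_000, 100));
    }
}
```

The exact numbers depend on the machine and JVM, so compare the two lines with each other rather than against any fixed value.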
Real tuning
If the results of the nGrinder test meet expectations, you do not need to tune the program. If they do not, you should tune it to solve the problem. The following cases illustrate the approach.
Stop-the-world time is too long
Stop-the-world pauses may be too long because the GC parameters are unreasonable or because the code is not implemented correctly. You can locate the problem with a profiling tool or a heap dump, for example by checking the types and numbers of objects in the heap. If you find many unnecessary objects, it is best to improve the code. If you find no particular problem in the way objects are created, it is best simply to adjust the GC parameters.
To adjust GC parameters properly, you need a GC log that covers a long enough period, and you must know what causes the long stop-the-world pauses. To learn more about choosing the right GC parameters, you can read a blog post by my colleague: How to Monitor Java Garbage Collection.
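Alongside the GC log, a running program can report its own collection counts and accumulated collection time through the standard `GarbageCollectorMXBean` interface. The sketch below is a rough in-process check with hypothetical class and method names. It is not a substitute for the GC log, since the reported time is an approximate accumulated figure per collector rather than a list of individual pauses.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcSnapshot {
    /** Total number of collections across all collectors; undefined counts (-1) are ignored. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds; undefined times (-1) are ignored. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms");
        }
        System.out.println("total: " + totalCollections() + " collections, "
                + totalCollectionMillis() + " ms");
    }
}
```

Sampling these totals before and after a load test gives a quick estimate of how much time went to garbage collection during the run.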
Low CPU usage
When the system is blocked, both throughput and CPU usage drop. This may be caused by the network system or by concurrency problems. To solve it, you can analyze thread dumps or use a profiling tool. Read this article to learn more about thread dump analysis: How to Analyze Java Thread Dumps.
Commercial profiling tools can analyze thread locks precisely, but most of the time the CPU analysis in jvisualvm provides enough information.
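As a quick first check before reading a full thread dump, a program can count its own threads by state. Many BLOCKED or WAITING threads combined with low CPU usage suggest lock contention or slow external systems. The sketch below uses the standard `ThreadMXBean` API; the class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumMap;
import java.util.Map;

final class ThreadStateCount {
    /** Counts live threads in this JVM by their current state. */
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip locked monitor and synchronizer details to keep the dump cheap.
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        for (Map.Entry<Thread.State, Integer> entry : countByState().entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

If the BLOCKED count stays high across several samples, a full thread dump will show which lock the threads are waiting for.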
High CPU Usage
If throughput is low but CPU usage is high, inefficient code is the likely cause. In this case, you should use a profiling tool to locate the performance bottlenecks in the code. Available tools include jvisualvm, Eclipse TPTP, and JProbe.
Tuning method
It is recommended that you use the following methods to tune your program.
First, check whether performance tuning is necessary. Measuring performance is not simple, and there is no guarantee of a satisfactory result every time. Therefore, if the program already meets its performance requirements, you do not need to invest further in tuning.
Second, the problem lies in only one place, and all you have to do is fix it. The 80/20 rule (Pareto principle) also applies to performance tuning. This does not mean that the poor performance of a module must come from a single problem, but rather that tuning should focus on the problem with the greatest impact. After dealing with the most important problem, solve the rest. It is recommended to fix only one problem at a time.
Third, take the balloon effect into account: a gain in one place comes with a loss in another. You can improve responsiveness by using a cache, but as the cache grows, a full GC takes longer. In general, throughput and responsiveness may deteriorate if you want low memory usage. Therefore, you need to know what matters most to your program and what is secondary.
By now, you should have learned how to tune a Java program for performance. To introduce the concrete process of measuring performance, I had to omit some details, but I think what is covered here is enough for most Java web server programs.
Finally, good luck tuning!
Become a Java GC Expert (5): Java Performance Tuning Principles