Garbage collection optimization for high throughput and low latency Java applications


Original article: LinkedIn Engineering blog; translation: ImportNew.com (hejiani)
Link: http://www.importnew.com/11336.html

High-performance applications form the backbone of the modern web. LinkedIn runs many internal high-throughput services that serve thousands of user requests per second. To keep the user experience responsive, it is important to serve these requests with low latency.

For example, a feature that members use heavily is the feed: a constantly updating list of professional activities and content. The feed is ubiquitous across LinkedIn, appearing on company pages, school pages, and, most importantly, the home page. The underlying feed data platform indexes updates to the various entities in our Economic Graph (members, companies, groups, and so on), and it must serve those updates with high throughput and low latency.

Figure 1: The LinkedIn feed

As these high-throughput, low-latency Java applications move toward production, developers must ensure consistent performance at every stage of the application development cycle. Determining optimized garbage collection (GC) settings is critical to achieving these metrics.

This article walks through a series of steps to clarify requirements and optimize GC; its target audience is developers interested in a systematic approach to optimizing GC for high-throughput, low-latency applications. The approach grew out of LinkedIn's work building the next-generation feed data platform. It covers, among other things: the CPU and memory overhead of the Concurrent Mark Sweep (CMS) and G1 garbage collectors; avoiding endless GC cycles caused by long-lived objects; optimizing GC worker-thread task assignment for better performance; and the OS settings needed to make GC pause times predictable.

When is the right time to optimize GC?

GC behavior varies with code-level optimizations and workloads. It is therefore important to tune GC on a near-final code base in which performance optimizations have already been implemented. But it is also necessary to do a preliminary analysis on a basic end-to-end prototype that uses stub code and simulated workloads representative of production. This captures the true latency and throughput boundaries of the architecture, which then inform the decision to scale vertically or horizontally.

In the prototype phase of the next-generation feed data platform, virtually all end-to-end functionality was implemented, and the query payload served by the current production infrastructure was simulated. This gave us a variety of workloads with which to measure GC characteristics over sufficiently long runs.

Steps to optimize GC

The following are the overall steps for optimizing GC for high-throughput, low-latency requirements, along with specific details from the feed data platform prototype. As you will see, ParNew/CMS gave the best performance, though we also experimented with the G1 garbage collector.

1. Understanding the basics of GC

Understanding how GC works is important because of the large number of parameters that need to be tuned. Oracle's HotSpot JVM Memory Management whitepaper is a great starting point for learning about the HotSpot JVM's GC algorithms. To learn about the G1 collector, see this paper.

2. Carefully consider GC requirements

To reduce GC's overhead on application performance, several GC characteristics can be optimized. These characteristics, such as throughput and latency, should be observed over a long period, so that the data spans multiple GC cycles in which the number of objects the application processes varies.

    • Stop-the-world collectors suspend application threads while they reclaim garbage. The duration and frequency of these pauses should not adversely affect the application's ability to meet its SLAs.
    • Concurrent GC algorithms compete with application threads for CPU cycles. This overhead should not affect application throughput.
    • Non-compacting GC algorithms cause heap fragmentation, which leads to long stop-the-world full GC pauses.
    • Garbage collection work requires memory. Some GC algorithms have a higher memory footprint than others. If your application needs a large heap, make sure the GC's memory overhead is not excessive.
    • A clear understanding of GC logs and commonly used JVM parameters is necessary even for simple GC tuning, since GC behavior changes as code complexity grows or workload characteristics change.

We started experimenting with HotSpot Java 7u51 on Linux, a 32GB heap, a 6GB young generation, and -XX:CMSInitiatingOccupancyFraction=70 (the old-generation occupancy at which its collection is triggered). The large heap was sized to hold a cache of long-lived objects. Once this cache is populated, the proportion of objects promoted to the old generation drops significantly.

With the initial GC configuration, an 80ms young-generation GC pause occurred every three seconds, and the application's 99.9th percentile latency exceeded 100ms. Such GC behavior is likely fine for many applications with less stringent latency SLAs. However, our goal was to drive the 99.9th percentile latency as low as possible, and for that GC optimization was essential.

3. Understanding GC Metrics

Measure before you optimize. Learn the details of GC logs (using these options: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime) to get a general grasp of the application's GC characteristics.

LinkedIn's internal monitoring and reporting systems, inGraphs and Naarad, generate useful visualizations of metrics such as GC pause-time percentiles, maximum pause duration, and GC frequency over long periods. Besides Naarad, there are open-source tools such as gclogviewer that create visualizations from GC logs.

At this stage, it is necessary to determine whether the GC frequency and pause duration affect the ability of the application to meet the latency requirements.
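Besides log-based tooling, GC activity can also be sampled in-process through the standard java.lang.management API, which is handy for feeding counts and cumulative pause times into a monitoring system. A minimal sketch (the class name is ours, not from the article):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    // Print cumulative collection counts and times for each collector
    // registered in this JVM (e.g. young- and old-generation collectors).
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Polling these counters periodically and subtracting successive samples gives GC frequency and average pause cost without parsing logs.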

4. Reduce GC Frequency

In generational GC algorithms, collection frequency can be reduced by (1) lowering the object allocation/promotion rate and (2) increasing the size of the generation's space.

In the HotSpot JVM, young-generation GC pause time depends on the number of objects that survive a collection, not on the size of the young generation itself. The effect of increasing the young generation's size on application performance therefore needs careful evaluation:

    • If more data survives and is copied into the survivor spaces, or if more data is promoted to the old generation per collection, increasing the young generation's size may lengthen young GC pauses.
    • On the other hand, if the number of surviving objects per collection does not increase significantly, pause times may not grow. In that case, reducing GC frequency can reduce the application's overall latency and/or increase its throughput.
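The link between allocation rate and collection frequency can be observed directly. The sketch below (class name and sizes are ours, chosen for illustration) allocates a burst of short-lived buffers and reports how many GC cycles ran during the burst; since nearly all the buffers die in Eden, a higher allocation rate shows up mainly as more frequent young collections.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class AllocationPressure {
    static volatile byte[] sink;  // prevents the allocations from being optimized away

    // Sum of collection counts across all collectors in this JVM.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcCount();
        // Allocate roughly 1 GB of short-lived buffers. Most die young,
        // so this mainly drives up young-GC frequency, not pause length.
        for (int i = 0; i < 1_000_000; i++) {
            sink = new byte[1024];
        }
        System.out.println("GC cycles during allocation burst: " + (totalGcCount() - before));
    }
}
```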

For applications that mostly create short-lived objects, only the parameters above need attention. For applications that create long-lived objects, note that promoted objects may not be collected by old-generation GC cycles for a long time. If the old-generation GC trigger threshold (the occupancy percentage of the old generation) is low, the application can fall into endless back-to-back GC cycles. Setting a high trigger threshold avoids this problem.

Because our application maintains a large cache of long-lived objects on the heap, we set the old-generation GC trigger threshold with -XX:CMSInitiatingOccupancyFraction=92 -XX:+UseCMSInitiatingOccupancyOnly. We also tried increasing the young-generation size to reduce young-collection frequency, but backed it out because it increased application latency.
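The occupancy fraction that triggers a CMS cycle is measured against the old generation's current usage, which can also be watched in-process. A minimal sketch (class and method names are ours, and the pool-name matching is a heuristic that assumes a generational collector):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.util.ArrayList;
import java.util.List;

public class OldGenWatcher {
    // Current usage of the old-generation pool, located by name convention
    // ("Tenured Gen", "CMS Old Gen", "PS Old Gen", "G1 Old Gen"); returns -1
    // if no such pool is found (e.g. under a non-generational collector).
    static long oldGenUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Old") || pool.getName().contains("Tenured")) {
                return pool.getUsage().getUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A long-lived cache: everything it references will eventually be promoted
        // and counts toward the occupancy that triggers old-generation collection.
        List<byte[]> cache = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            cache.add(new byte[1024]);
        }
        System.out.println("Old gen used: " + oldGenUsedBytes()
                + " bytes (cache entries: " + cache.size() + ")");
    }
}
```

Comparing this usage against the pool's maximum gives the same fraction that -XX:CMSInitiatingOccupancyFraction is compared against.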

5. Shorten GC Pause time

Reducing the young generation's size can shorten young GC pauses, because less data is copied to the survivor spaces or promoted per collection. However, as mentioned earlier, we have to weigh the effect of a smaller young generation, and the resulting increase in GC frequency, on overall application throughput and latency. Young GC pause times also depend on the tenuring threshold (promotion threshold) and the space size (see step 6).

With CMS, try to minimize heap fragmentation and the long stop-the-world full GC pauses associated with old-generation collection. This is done by controlling the object promotion rate and by reducing the -XX:CMSInitiatingOccupancyFraction value so that old-generation GC triggers at a lower occupancy. For details on all the options and their trade-offs, check out "Java Garbage Collection for Web Services" and "Java Garbage Collection Essentials".

We observed that most of the young generation's Eden space was reclaimed each cycle and hardly any objects died in the survivor spaces, so we reduced the tenuring threshold from 8 to 2 (option: -XX:MaxTenuringThreshold=2) to shorten the time young collections spend copying data.

We also noticed that young-collection pause times grew as old-generation occupancy increased, meaning that pressure from the old generation was making object promotion take longer. To address this, we increased the total heap to 40GB and reduced the -XX:CMSInitiatingOccupancyFraction value to 80 to start old-generation collections earlier. Although the -XX:CMSInitiatingOccupancyFraction value was lower, the larger heap avoided continual old-generation GCs. At this stage, we were getting 70ms young-collection pauses and a 99.9th percentile latency of 80ms.

6. Optimize task assignment for GC worker threads

To shorten young-generation pauses further, we decided to look into the options that control how tasks are bound to GC threads.

The -XX:ParGCCardsPerStrideChunk option controls the task granularity of GC worker threads and helps achieve the best performance without a patched JDK that optimizes card-table scan time during young collections. Interestingly, young GC times lengthen as the old-generation space grows. Setting this option to 32768 reduced the average young-collection pause to 50ms and the application's 99.9th percentile latency to 60ms.

There are other options that map tasks to GC threads: if the OS allows, -XX:+BindGCTaskThreadsToCPUs binds GC threads to individual CPU cores, and -XX:+UseGCTaskAffinity assigns tasks to GC worker threads using affinity parameters. However, our application saw no benefit from these options; in fact, some investigation suggests they do not work on Linux systems [1][2].

7. Understanding the CPU and memory overhead of GC

Concurrent GC typically increases CPU usage. We observed that while a well-tuned CMS setup ran fine, the increased CPU usage from the concurrent GC work of the G1 collector, run with default settings, significantly reduced the application's throughput and increased its latency. Compared with CMS, G1 may also impose a higher memory overhead on the application. For low-throughput, non-CPU-bound applications, GC's high CPU usage may not be a concern.

Figure 2: Percentage CPU usage of ParNew/CMS vs. G1: the nodes with noticeably higher relative CPU usage ran G1 with the option -XX:G1RSetUpdatingPauseTimePercent=20

Figure 3: Requests served per second by ParNew/CMS vs. G1: the nodes with lower throughput ran G1 with the option -XX:G1RSetUpdatingPauseTimePercent=20

8. Optimize system memory and I/O management for GC

Typically, problematic GC pauses show one of two patterns: (1) low user time, high system time, and high real (wall-clock) time, or (2) low user time, low system time, and high real time. Either pattern points to a problem in the underlying process/OS settings. Case (1) may indicate that Linux is stealing pages from the JVM; case (2) may indicate that GC threads stalled in the kernel waiting on I/O while Linux flushed the disk cache. Refer to this slide deck for the relevant settings.
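Stalls from GC and from the OS look the same to the application: a thread simply stops running for a while. A jHiccup-style probe, sketched below (class and method names are ours), sleeps in short intervals and records the largest extra delay it observes, capturing stop-the-world pauses and OS stalls alike.

```java
public class PauseDetector {
    // Repeatedly sleep ~1 ms and record the largest extra delay observed.
    // Gaps well above 1 ms indicate stop-the-world pauses or OS stalls
    // (page stealing, disk-cache writeback), regardless of their cause.
    static long maxHiccupMillis(long sampleMillis) throws InterruptedException {
        long maxGap = 0;
        long last = System.nanoTime();
        long end = last + sampleMillis * 1_000_000L;
        while (System.nanoTime() < end) {
            Thread.sleep(1);
            long now = System.nanoTime();
            long gap = (now - last) / 1_000_000L - 1;  // extra delay beyond the 1 ms sleep
            maxGap = Math.max(maxGap, gap);
            last = now;
        }
        return maxGap;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Max observed hiccup: " + maxHiccupMillis(2000) + " ms");
    }
}
```

Cross-referencing spikes from such a probe against the GC log separates JVM-caused pauses from OS-caused ones.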

To avoid these runtime performance penalties, use the JVM option -XX:+AlwaysPreTouch to touch and zero pages at application startup, and set vm.swappiness to zero so the OS does not swap pages unless absolutely necessary.

You could use mlock to pin the JVM's pages in memory so the OS never swaps them out. However, if the system exhausts all memory and swap space, the OS reclaims memory by killing processes, and the Linux kernel typically picks one with a high memory footprint but a short running time (the OOM-killer's process-selection behavior). For us, that process would very likely be our application. A service that degrades gracefully is preferable; a sudden service failure signals poor operability. So instead of mlock, we relied on vm.swappiness to avoid the potential swap penalty.

GC optimization for the LinkedIn feed data platform

For this platform's prototype system, we tried two GC algorithm combinations from the HotSpot JVM:

    • ParNew for young-generation collection with CMS for old-generation collection.
    • G1 for both the young and old generations. G1 is intended to deliver stable, predictable pause times below 0.5 seconds for heap sizes of 6GB or more. But despite adjusting many parameters, our G1 experiments did not achieve GC performance or pause times as predictable as ParNew/CMS. We also investigated a memory-leak bug associated with G1 [3], but could not establish the root cause.

With ParNew/CMS, the application saw a 40-60ms young-generation pause every three seconds and a CMS cycle once an hour. The JVM options were as follows:

// JVM sizing options
-server -Xms40g -Xmx40g -XX:MaxDirectMemorySize=4096m -XX:PermSize=256m -XX:MaxPermSize=256m
// Young generation options
-XX:NewSize=6g -XX:MaxNewSize=6g -XX:+UseParNewGC -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=8 -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=32768
// Old generation options
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly
// Other options
-XX:+AlwaysPreTouch -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:-OmitStackTraceInFastThrow

With these options, at a throughput of thousands of read requests per second, the application's 99.9th percentile latency dropped to 60ms.

Reference:

[1] -XX:+BindGCTaskThreadsToCPUs does not appear to work on Linux, because the distribute_processes method in hotspot/src/os/linux/vm/os_linux.cpp is not implemented in JDK7 or JDK8.
[2] -XX:+UseGCTaskAffinity does not appear to work on any platform in JDK7 and JDK8, because a task's affinity attribute is always set to sentinel_worker = (uint) -1. See the source in hotspot/src/share/vm/gc_implementation/parallelScavenge/{gcTaskManager.cpp,gcTaskThread.cpp}.
[3] There are memory-leak bugs in G1 that may not be fixed in Java 7u51; this particular bug was fixed only in Java 8.

