Garbage collection optimization for high throughput and low latency Java applications


High-performance applications form the backbone of modern networks. LinkedIn runs many internal high-throughput services that serve thousands of user requests per second. To keep the user experience optimal, it is important to serve these requests with low latency.

For example, one heavily used feature is the feed: a constantly updating list of professional activities and content. The feed is ubiquitous across LinkedIn, appearing on company pages, school pages, and, most importantly, the home page. The underlying feed data platform indexes updates to the various entities in our Economic Graph (members, companies, groups, and so on), and it must serve relevant updates with high throughput and low latency.

Figure 1: The LinkedIn feed

As these high-throughput, low-latency Java applications evolve into production systems, developers must ensure consistent performance at every stage of the application development cycle. Determining optimized garbage collection (GC) settings is critical to achieving these metrics.

This article walks through a series of steps to clarify requirements and optimize GC. Its target audience is developers interested in a systematic approach to optimizing GC for high-throughput, low-latency applications. The approach comes from LinkedIn's work building the next generation of its feed data platform. It covers, among other things: the CPU and memory overhead of the Concurrent Mark Sweep (CMS) and G1 garbage collectors; avoiding endless GC cycles caused by long-lived objects; optimizing GC worker-thread task assignment for performance gains; and the OS settings needed for predictable GC pause times.

When is the right time to optimize GC?

GC behavior varies with code-level optimizations and workloads. It is therefore important to tune GC on a near-final code base in which performance optimizations have already been implemented. But it is also necessary to do a preliminary analysis on a basic end-to-end prototype that uses stub code and simulates workloads representative of production. This captures the true latency and throughput boundaries of the architecture, which then inform the decision to scale vertically or horizontally.

During the prototype phase of the next-generation feed data platform, virtually all end-to-end functionality was implemented, and the query payload served by the current production infrastructure was replayed. This gave us a variety of workloads that were representative enough, and ran long enough, to measure GC characteristics.

Steps to optimize GC

The following are the overall steps for optimizing GC to meet high-throughput, low-latency requirements, with specific details from the feed data platform prototype. As you will see, ParNew/CMS gave the best performance, though we also experimented with the G1 garbage collector.

1. Understanding the basics of GC

Understanding how GC works is important because a large number of parameters need to be tuned. Oracle's HotSpot JVM Memory Management whitepaper is a great starting point for learning about the HotSpot JVM's GC algorithms. To learn about the G1 collector, see this paper.

2. Carefully consider GC requirements

To reduce GC's overhead on application performance, first identify the GC characteristics to optimize. These characteristics, such as throughput and latency, should be observed over a long period, across multiple GC cycles in which the number of objects processed by the application varies.

    • Stop-the-world collectors suspend application threads while they collect garbage. The duration and frequency of these pauses should not adversely affect the application's ability to meet its SLAs.
    • Concurrent GC algorithms compete with application threads for CPU cycles. This overhead should not affect application throughput.
    • Non-compacting GC algorithms can fragment the heap, resulting in long full-GC stop-the-world pauses.
    • Garbage collection itself consumes memory, and some GC algorithms have a higher memory footprint than others. If your application needs a large heap, make sure the GC's memory overhead is not too large.
    • A clear understanding of GC logs and common JVM parameters is necessary even for simple tuning of GC behavior. GC behavior changes as code complexity grows or workload characteristics change.

We began the experiments on Linux with HotSpot Java 7u51, a 32GB heap, a 6GB young generation, and -XX:CMSInitiatingOccupancyFraction=70 (the old-generation occupancy at which old-generation GC triggers). The large heap was chosen to hold a cache of long-lived objects. Once this cache is populated, the proportion of objects promoted to the old generation drops significantly.
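The starting point above can be sketched as a single launch command. This is a hedged reconstruction: only the heap size, young-generation size, and occupancy fraction are stated in the text; the collector flags and the application jar name are assumptions.

```shell
# Initial experiment (sketch): Java 7u51 on Linux, 32GB heap, 6GB young
# generation, old-generation CMS collection triggered at 70% occupancy.
java -server \
  -Xms32g -Xmx32g \
  -XX:NewSize=6g -XX:MaxNewSize=6g \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -jar feed-service.jar   # hypothetical application jar
```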

With this initial GC configuration, an 80ms young-generation GC pause occurred every three seconds, and the application's 99.9th percentile latency was 100ms. Such GC behavior is likely adequate for many applications with less stringent latency SLAs. Our goal, however, was to minimize the 99.9th percentile latency, and for that GC optimization was essential.

3. Understanding GC Metrics

Measure before optimizing. Turning on detailed GC logging with these options: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime gives a general grasp of the application's GC characteristics.
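A minimal launch line capturing these logs might look like the following; the -Xloggc destination and jar name are additions for illustration:

```shell
java -Xloggc:/var/log/app/gc.log \
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
  -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime \
  -jar feed-service.jar   # hypothetical application jar
```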

LinkedIn's internal monitoring and reporting systems, inGraphs and Naarad, generate a variety of useful metric visualizations, such as GC pause-time percentiles, the maximum duration of a pause, and GC frequency over long periods. Beyond Naarad, many open-source tools such as gclogviewer can create visualizations from GC logs.

At this stage, determine whether GC frequency and pause duration are affecting the application's ability to meet its latency requirements.

4. Reduce GC Frequency

In generational GC algorithms, collection frequency can be reduced by (1) reducing the object allocation/promotion rate and (2) increasing the size of the generation's space.

In the HotSpot JVM, young-generation GC pause time depends on the number of objects that survive a collection, not on the size of the young generation itself. The effect of increasing the young generation's size on application performance therefore requires careful evaluation:

    • If more data survives and is copied to the survivor spaces, or if more data is promoted to the old generation per collection, increasing the young generation's size may lead to longer young-generation pauses.
    • On the other hand, if the number of surviving objects per collection does not increase significantly, pause times may not lengthen. In that case, reducing GC frequency can lower the application's overall latency and/or raise its throughput.

For most applications dominated by short-lived objects, only the parameters above need to be controlled. For applications that also create long-lived objects, beware that promoted objects may not be reclaimed by old-generation GC cycles for a long time. If the old-generation GC trigger threshold (the old-generation occupancy percentage) is low, the application can fall into ceaseless GC cycles. Setting a high trigger threshold avoids this problem.

Because our application maintains a large in-heap cache of long-lived objects, we set the old-generation GC trigger threshold with -XX:CMSInitiatingOccupancyFraction=92 -XX:+UseCMSInitiatingOccupancyOnly. We also tried increasing the young generation's size to reduce young-generation collection frequency, but abandoned the idea because it increased application latency.
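As a config fragment, with a rough headroom check based on the sizes used at this stage (the arithmetic is our sketch, assuming the survivor spaces are counted inside the young generation):

```shell
# Old-generation CMS trigger, honored exactly (no heuristic adjustment):
-XX:CMSInitiatingOccupancyFraction=92 -XX:+UseCMSInitiatingOccupancyOnly
#
# Rough headroom check with the sizes used at this stage:
#   old generation ~= 32GB heap - 6GB young generation = 26GB
#   CMS triggers at 92% of 26GB ~= 23.9GB occupied
# The long-lived cache plus steady-state promotions must stay below that
# line, or the application falls into back-to-back CMS cycles.
```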

5. Shorten GC Pause time

Reducing the young generation's size can shorten young-generation GC pause times, because less data is copied to the survivor spaces or promoted. But, as noted above, we must weigh the effect of a smaller young generation, and the resulting increase in GC frequency, on overall application throughput and latency. Young-generation GC pause times also depend on the tenuring threshold (promotion threshold) and the survivor space size (see step 6).

With CMS, try to minimize heap fragmentation and the long full-GC stop-the-world pauses that come with it during old-generation collection. Do this by controlling the object promotion rate and by reducing -XX:CMSInitiatingOccupancyFraction so that old-generation GC triggers at a lower threshold. For details on all these options and their trade-offs, see the references on tuning Java garbage collection for web services and on Java garbage collection essentials.

We observed that most objects in the young generation die in the Eden space, and almost none die in the survivor spaces, so we reduced the tenuring threshold from 8 to 2 (with -XX:MaxTenuringThreshold=2) to shorten the time young-generation collections spend copying data.
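As a config fragment (the histogram interpretation is our reading of the standard tenuring-distribution log output):

```shell
# Promote objects after surviving two young-generation collections instead
# of copying them between survivor spaces up to 8 times:
-XX:MaxTenuringThreshold=2
#
# With -XX:+PrintTenuringDistribution (step 3) the GC log shows a per-age
# histogram of survivor occupancy; ages with near-zero bytes represent copy
# work that can be skipped by lowering the threshold.
```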

We also noticed that young-generation pause times lengthened as old-generation occupancy grew, which means pressure from the old generation was making object promotion take more time. To address this, we increased the total heap size to 40GB and reduced -XX:CMSInitiatingOccupancyFraction to 80 so that old-generation collection starts earlier. Although the fraction went down, the larger heap still avoids continuous old-generation GC. At this stage we got 70ms young-generation pauses and a 99.9th percentile latency of 80ms.

6. Optimize task assignment for GC worker threads

To further shorten young-generation pause times, we decided to look into options that optimize task binding for GC worker threads.

The -XX:ParGCCardsPerStrideChunk option controls the task granularity of GC worker threads and helps get the best performance out of a patch that optimizes the card-table scan time spent in young-generation collection. Interestingly, young-generation GC times lengthen as the old-generation space grows. Setting this option to 32768 reduced young-generation pauses to an average of 50ms, and the 99.9th percentile latency to 60ms.
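ParGCCardsPerStrideChunk is a diagnostic option, so it must be unlocked first; the pairing below matches the final configuration at the end of this article:

```shell
# Coarser stride chunks mean fewer, larger card-table scan tasks per GC
# worker thread (the JDK default stride is much smaller):
-XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=32768
```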

There are other options that map tasks to GC threads: where the OS allows it, -XX:+BindGCTaskThreadsToCPUs binds GC threads to individual CPU cores, and -XX:+UseGCTaskAffinity assigns tasks to GC worker threads using an affinity parameter. However, our application saw no benefit from these options, and some investigation showed they do not work on Linux at all [1][2].

7. Understanding the CPU and memory overhead of GC

Concurrent GC typically increases CPU usage. While the CMS defaults behaved well, we observed that the increased CPU usage from concurrent GC work with the G1 collector significantly reduced the application's throughput and latency. Compared with CMS, G1 may also impose a larger memory overhead on the application. For low-throughput, non-compute-intensive applications, high GC CPU usage may not be a concern.

Figure 2: CPU usage percentage for ParNew/CMS and G1; the nodes with noticeably higher CPU usage ran G1 with option -XX:G1RSetUpdatingPauseTimePercent=20

Figure 3: Requests served per second by ParNew/CMS and G1; the nodes with lower throughput ran G1 with option -XX:G1RSetUpdatingPauseTimePercent=20

8. Optimize system memory and I/O management for GC

Generally speaking, GC pauses that show (1) low user time, high system time, and high real (wall-clock) time, or (2) low user time, low system time, and high real time, indicate a problem with the underlying process or OS settings. Case (1) may mean Linux is stealing pages from the JVM; case (2) may mean Linux drafted the GC threads to flush the disk cache, leaving them stuck in the kernel waiting for I/O.
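In HotSpot GC logs, these three components appear at the end of each pause entry; the shape is shown below with invented numbers for illustration:

```shell
# Each pause line in the GC log ends with a timing breakdown such as:
#   [Times: user=0.30 sys=0.01, real=0.05 secs]
# Case (1): user low, sys high, real high -> suspect page stealing.
# Case (2): user low, sys low, real high  -> suspect threads blocked on I/O.
```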

To avoid these run-time penalties, start the application with the JVM option -XX:+AlwaysPreTouch so pages are touched and zeroed at startup, and set vm.swappiness to zero so the OS does not swap pages unless absolutely necessary.
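The swappiness setting is applied with sysctl; the persistence step below is standard Linux practice, not something the original setup describes:

```shell
# Apply immediately (requires root):
sysctl -w vm.swappiness=0
# Persist across reboots:
echo 'vm.swappiness = 0' >> /etc/sysctl.conf
```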

You could use mlock to pin the JVM's pages in memory so the OS never swaps them out. However, if the system exhausts all memory and swap space, the OS reclaims memory by killing processes, and the Linux kernel typically picks a process with a high resident memory footprint but a short run time (this is its workflow for killing processes in an OOM situation). For us, that process would very likely be our application. A service that degrades gracefully is preferable, and sudden failure bodes ill for operability, so we chose vm.swappiness over mlock to avoid possible swap penalties.

GC optimization of the LinkedIn dynamic information data platform

For this platform prototype, we optimized garbage collection using two HotSpot JVM configurations:

    • ParNew for young-generation collection and CMS for old-generation collection.
    • G1 for both the young and old generations. G1 aims to deliver stable, predictable pause times below 0.5 seconds for heaps of 6GB or larger. But while we experimented with G1, we could not obtain the predictable GC performance or pause times we saw with ParNew/CMS, despite adjusting various parameters. We looked into a bug [3] associated with a memory leak when using G1, but could not establish the root cause.

With ParNew/CMS, the application saw a 40-60ms young-generation pause every three seconds and a CMS cycle every hour. The JVM options were as follows:


//JVM sizing options

-server -Xms40g -Xmx40g -XX:MaxDirectMemorySize=4096m -XX:PermSize=256m -XX:MaxPermSize=256m

//Young generation options

-XX:NewSize=6g -XX:MaxNewSize=6g -XX:+UseParNewGC -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=8 -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=32768

//Old generation options

-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly

//Other options

-XX:+AlwaysPreTouch -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:-OmitStackTraceInFastThrow

With these options, the application's 99.9th percentile latency fell to 60ms while serving thousands of read requests per second.

Reference:

[1] -XX:+BindGCTaskThreadsToCPUs appears to have no effect on Linux, because the distribute_processes method in hotspot/src/os/linux/vm/os_linux.cpp is not implemented in JDK7 or JDK8.
[2] The -XX:+UseGCTaskAffinity option appears to be a no-op on all platforms in JDK7 and JDK8, because a task's affinity attribute is always set to sentinel_worker = (uint) -1. See hotspot/src/share/vm/gc_implementation/parallelScavenge/{gcTaskManager.cpp, gcTaskThread.cpp}.
[3] There are some memory-leak bugs in G1 that may not have been fixed in Java 7u51; this one appears to have been fixed only in Java 8.
