GC Analysis (3): Performance Tuning Experience Summary

Source: Internet
Author: User
Tags: server, memory

Performance Tuning Experience Summary

The problem arises:

In the daily (staging) environment, one server handles about 368 requests per second on average, with a peak of 913 per second. The externally visible symptom is that every two or three hours client requests time out for around a second, and as traffic grows the frequency of these timeouts increases noticeably.

Direct analysis of the phenomenon:

GC monitoring shows that promotion failed and concurrent mode failure appear frequently in the GC logs. The direct cause of promotion failed is that during a young GC (YGC) objects cannot be promoted to the old generation, either because of fragmentation or because of insufficient free space. In this server's case -XX:+UseCMSCompactAtFullCollection is configured online, so CMS compacts the old generation when it performs a full collection, which makes fragmentation an unlikely cause. The guess is therefore that promotion failed occurs because the old generation runs out of memory rather than because of fragmentation: either too many objects are being promoted too frequently, or objects already in the old generation are not being reclaimed, so promotions fail.

Forming a hypothesis from the observed behavior of the affected server:

Before the server long-polling feature was published, the server handled the following request types:
1) SDK queries and publishes
2) Client short polling
3) Client synchronous configuration gets
By lifetime, the in-memory objects on the server side can be divided into the following 2 classes:
1) Temporary variables created while serving a request
2) Data structures that track the number of data subscribers and the number of subscribed configurations
The former far outnumber the latter, so under the -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=10 configuration only a few objects are ever promoted to the old generation (the survivor-space sizing this implies is worked out in the sketch after this list). The monitoring data for the server before long polling was published bears this out in the minor GC and CMS GC statistics:
-  the number of minor GCs roughly tracks QPS
-  CMS GC runs on average less than once per day
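
As a rough sanity check of that configuration, the sketch below works out the generation sizes implied by -XX:NewSize=2g and -XX:SurvivorRatio=10 (an Eden-to-survivor ratio of 10:1). The figures are derived purely from the flags quoted above, not from the original measurements.

```java
/**
 * Back-of-the-envelope sizing for a 2 GB young generation with
 * -XX:SurvivorRatio=10, i.e. Eden : S0 : S1 = 10 : 1 : 1.
 * The numbers are derived from the flags, not measured values.
 */
public class YoungGenSizing {
    public static void main(String[] args) {
        long newSizeMb = 2048;                  // -XX:NewSize=2g
        int survivorRatio = 10;                 // -XX:SurvivorRatio=10

        // Young generation = Eden + S0 + S1 = (ratio + 2) equal parts.
        long survivorMb = newSizeMb / (survivorRatio + 2);  // ~170 MB per survivor space
        long edenMb = survivorMb * survivorRatio;           // ~1700 MB of Eden

        System.out.printf("Eden  ~= %d MB%n", edenMb);
        System.out.printf("S0/S1 ~= %d MB each%n", survivorMb);
        // Only ~170 MB is available to hold objects that survive a minor GC,
        // so anything that outlives a few collections is quickly tenured.
    }
}
```

With plenty of Eden and only short-lived objects in flight, this layout explains why promotions to the old generation were rare before long polling was introduced.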


After the server long-polling feature was published, the server handles the following request types:
1) SDK queries and publishes
2) Client short polling
3) Client long polling
4) Client synchronous configuration gets
By lifetime, the in-memory objects on the server side can be divided into the following classes:
1) Temporary variables created while serving a request [live for the RT of one request]
2) Variables created while serving a long-polling request [held for the 30 s long-polling hold time]
3) Data structures that track the number of data subscribers and the number of subscribed configurations [live for the life of the process]
4) Data structures that track the number of long-polling clients and the number of subscribed configurations [live for the life of the process]
Tentative conclusion: long polling is the culprit. As the number of long-polling requests grows, with the JVM memory configuration and total QPS unchanged, the amount of space each minor GC can reclaim keeps shrinking. Each minor GC copies more objects into the survivor space, and with a 2 GB young generation and SurvivorRatio=10, the memory held by long polling stays live for at least 30 s, so large numbers of objects are promoted to the old generation. The result is promotion failed and concurrent mode failure.
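
To make the tenuring argument concrete, here is a minimal, hypothetical sketch (not the server's actual code) of how a long-polling handler might hold per-request state for the 30 s hold time; any object reachable from the pending map stays live through many minor GCs and is therefore a candidate for promotion.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of a long-polling holder. Per-request state is kept
 * for up to 30 s before the request is answered, so with a small survivor
 * space these objects survive many minor GCs and end up tenured.
 */
public class LongPollHolder {
    /** Per-request state held while the client waits for a change. */
    static final class PendingPoll {
        final String clientId;
        final byte[] subscribedKeys;   // stands in for the real per-request payload
        PendingPoll(String clientId, int payloadBytes) {
            this.clientId = clientId;
            this.subscribedKeys = new byte[payloadBytes];
        }
    }

    private final Map<String, PendingPoll> pending = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    /** Register a long poll; its state stays reachable for up to 30 s. */
    public void register(String clientId) {
        pending.put(clientId, new PendingPoll(clientId, 4 * 1024));
        // Answer (and release) the request once the 30 s hold time expires.
        timer.schedule(() -> pending.remove(clientId), 30, TimeUnit.SECONDS);
    }
}
```

Objects reachable from pending live far longer than a handful of minor GC cycles, so with only about 170 MB per survivor space (see the sizing sketch above) they are tenured quickly, which matches the promotion pressure described in the conclusion.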

Validating the hypothesis with a real-world simulation:

1. Test environment preparation:

Hardware information:

Physical machine 1: 16 cores at 2.27 GHz, 24 GB of memory; Linux iSearch006030.sqa.cm4 2.6.18-164.el5 #1 SMP x86_64 GNU/Linux.

Software information:

Server: one server

Main startup parameters: -Xms4g -Xmx4g -XX:NewSize=2g -XX:MaxNewSize=2g -XX:PermSize=128m -XX:MaxPermSize=256m -XX:SurvivorRatio=10 -XX:+UseConcMarkSweepGC -XX:+UseCMSCompactAtFullCollection -XX:+CMSClassUnloadingEnabled -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/admin/***/logs -verbose:gc -Xloggc:/home/nami.zft/****/logs/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Djava.awt.headless=true -Dsun.net.client.defaultConnectTimeout=10000 -Dsun.net.client.defaultReadTimeout=30000 -Dsun.security.ssl.allowUnsafeRenegotiation=true

2. Data preparation and analysis:

All data was pulled down from the server side of the daily environment: 419 groups and 32,787 dataIds, about 151 MB in total. The data was organized and published to the test server so that its cache was consistent with the daily environment.

The ratio of long polling to short polling is approximately 1:1.5; long polling plus publish requests account for 90% of all operations, and GET requests account for 10%. To speed up reproduction of the phenomenon, the load was set to roughly 1,500 live long-polling requests per second, 1,000 short-polling requests per second, 200 configuration changes per second, and 200 configuration gets per second.
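
The load mix can be approximated with a simple driver. The sketch below is purely illustrative and assumes the rates stated above; the doXxx() methods are placeholders, not the real client API used in the test.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Illustrative load-mix driver for the rates used in the test:
 * 1500 long polls/s, 1000 short polls/s, 200 publishes/s, 200 gets/s.
 * The doXxx() methods are placeholders for the real client calls.
 */
public class LoadMixDriver {
    public static void main(String[] args) {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(4);
        scheduleAtRate(pool, 1500, LoadMixDriver::doLongPoll);
        scheduleAtRate(pool, 1000, LoadMixDriver::doShortPoll);
        scheduleAtRate(pool, 200,  LoadMixDriver::doPublish);
        scheduleAtRate(pool, 200,  LoadMixDriver::doGet);
    }

    /** Fire perSecond calls per second, evenly spaced. */
    static void scheduleAtRate(ScheduledExecutorService pool, int perSecond, Runnable op) {
        long periodMicros = 1_000_000L / perSecond;
        pool.scheduleAtFixedRate(op, 0, periodMicros, TimeUnit.MICROSECONDS);
    }

    static void doLongPoll()  { /* placeholder: issue a 30 s long-polling request */ }
    static void doShortPoll() { /* placeholder: issue a short-polling request */ }
    static void doPublish()   { /* placeholder: change a configuration */ }
    static void doGet()       { /* placeholder: read a configuration */ }
}
```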

3. Test run and results:

Figure 1: QPS of user requests over time at 10 concurrent users

Figure 2: JVM old-generation memory utilization

Figure 3: GC times

Performance analysis: Under the above load, the daily-environment phenomenon reproduced after 8 minutes. The server's other performance indicators (CPU, load, disk usage, server memory, disk I/O) did not reach any bottleneck, so no charts are given for them. At 10 concurrent users, successful TPS began to fall at 15:48 and became unstable, as shown in Figure 1. Figure 2 shows that old-generation usage started climbing linearly 2 minutes after the load began, and after 11 minutes old-generation utilization reached 100% and stayed there, proving that some objects in the old generation were never released. Accordingly, Figure 3 shows full GC times getting longer and longer.

Root cause: A heap dump revealed objects that survive exceptionally long and occupy the old generation, preventing it from being freed. 66.3% of the memory was consumed by the org.apache.catalina.session.StandardManager object, i.e., by session objects in the Tomcat container; checking Tomcat's implementation showed that sessions survive for 30 minutes (see http://ddupnow.iteye.com/blog/621619). This is what caused the YGC promotion failures.

Solution: shorten the session timeout.
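
How the timeout is shortened depends on the application; the snippet below is only a sketch of the common options (not the exact change made on this server): setting <session-timeout> in web.xml, or calling HttpSession.setMaxInactiveInterval on sessions that already exist.

```java
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

/**
 * Sketch of shortening the session lifetime. This can also be done
 * container-wide in web.xml (value in minutes):
 *   <session-config><session-timeout>1</session-timeout></session-config>
 * or by not creating HTTP sessions at all for stateless APIs.
 */
public class ShortSessionServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        // Only adjust the timeout if a session already exists; do not
        // create one just to configure it.
        HttpSession session = req.getSession(false);
        if (session != null) {
            session.setMaxInactiveInterval(60); // seconds, instead of the default 30 minutes
        }
    }
}
```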

Result validation:

With the session timeout shortened and the test environment kept consistent with the previous step, the same performance scenario was replayed. Old-generation usage stayed around 11%, and only one full GC occurred within two hours, as the results show. For further validation, concurrency was increased to 200 and the performance test ran for up to 2 days: TPS remained stable at around 1k, the full GC count stayed at 1, and old-generation usage eventually stabilized.
