[Repost] Online GC Fault Troubleshooting Log

Source: Internet
Author: User

After three or four hours of troubleshooting I finally solved a GC problem, and I am recording the resolution process here in the hope that it helps others. This article assumes the reader has basic knowledge of GC and JVM tuning and knows how to use the JVM tuning tools; for that background, see another article of mine under the same category:

http://my.oschina.net/feichexia/blog/196575

Background notes

The problem system is deployed on UNIX and had been running for more than two weeks before the problem occurred.

I used the CountingBloomFilter from the Hadoop source code and changed it into a thread-safe implementation (AdjustedCountingBloomFilter), mainly by replacing long[] with AtomicLong[] and using AtomicLong's CAS as an optimistic lock. As a result the system holds 5 large AtomicLong arrays (each array about 50 MB), and each array contains a huge number of AtomicLong objects (all the AtomicLong objects together occupy roughly 1.2 GB of memory). Moreover, these AtomicLong arrays live for at least a day.
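To make the change concrete, here is a minimal sketch (not the actual AdjustedCountingBloomFilter code; the class and method names are made up) of a bucket-counter array that swaps long[] for AtomicLong[] and uses a CAS loop as an optimistic lock:

import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only, in the spirit of the thread-safe CountingBloomFilter
// change described above; the names are invented for this example.
public class CasCounterArray {
    private final AtomicLong[] buckets;

    public CasCounterArray(int size) {
        buckets = new AtomicLong[size];
        for (int i = 0; i < size; i++) {
            buckets[i] = new AtomicLong(0);   // one small AtomicLong object per bucket
        }
    }

    // Increment a bucket with an optimistic CAS loop instead of a synchronized block.
    public void increment(int index) {
        AtomicLong bucket = buckets[index];
        long current;
        do {
            current = bucket.get();
        } while (!bucket.compareAndSet(current, current + 1));
    }

    public long get(int index) {
        return buckets[index].get();
    }
}

Note that this design allocates one small AtomicLong object per bucket, so a handful of arrays with tens of millions of buckets means tens of millions of long-lived small objects; that is exactly the allocation pattern that causes trouble later in this article.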

The server went online before the mobile client; the client was originally planned to go live this Thursday (this article was written on a Monday), so I intended to keep observing the system over the next few days, and it was still running with debug-level logging enabled.

Some of the JVM parameters are summarized below (they are configured in setenv.sh under the bin directory of the Tomcat installation the project is deployed in, and can be seen with ps -ef | grep xxx | grep -v grep):

-XX:PermSize=256M -XX:MaxPermSize=256M -Xms6000M -Xmx6000M -Xmn1500M -Xss256K
-XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC
-XX:+CMSParallelRemarkEnabled -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled
-XX:CMSInitiatingOccupancyFraction=70 -XX:CMSFullGCsBeforeCompaction=5
-XX:+UseCMSCompactAtFullCollection -XX:+CMSScavengeBeforeRemark -XX:+HeapDumpOnOutOfMemoryError
-Xloggc:/usr/local/webserver/point/logs/gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime

You can see that the permanent generation is set to 256 MB and the heap to 6000 MB (-Xms and -Xmx are set equal to avoid heap resizing, which reduces the number of GCs to some extent but increases the average time per GC), and the young generation is set to 1500 MB.

-XX:+UseConcMarkSweepGC makes the old generation use the CMS (Concurrent Mark-Sweep) collector, and -XX:+UseParNewGC makes the young generation use the ParNew parallel collector. -XX:ParallelGCThreads specifies the number of parallel collector worker threads; the usual recommendation is to set it equal to the number of CPU cores when there are 8 or fewer cores, and to 3 + (5 * cpu_count) / 8 when there are more than 8 (for example, 3 + (5 * 16) / 8 = 13 threads on a 16-core machine). The other parameters are omitted here.

Problem discovery and resolution process

In the morning a tester told me that the online system had suddenly hung and was reporting access timeout exceptions.

My first reaction was that the system had run out of memory or that the process had been killed by the operating system. I used ps -ef | grep xxx | grep -v grep to confirm the process was still there, then checked Tomcat's catalina.out log and the system GC log; no out-of-memory error was found.

Next I used jstat -gcutil <pid> 1000 to look at the occupancy of each generation and the GC activity, and saw something rather scary: the Eden area was more than 77% occupied and S0 was at 100%, while the old and perm generations still had plenty of free space.
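As a side note, the per-pool occupancy that jstat -gcutil reports can also be read from inside the JVM through the standard management API. The following is a minimal sketch (not part of the original troubleshooting; pool names vary by collector):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class PoolUsage {
    public static void main(String[] args) {
        // Print the occupancy of each memory pool (Eden, Survivor, Old/Tenured, Perm),
        // roughly the same information jstat -gcutil shows as percentages.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            double percent = usage.getMax() > 0 ? 100.0 * usage.getUsed() / usage.getMax() : 0.0;
            System.out.printf("%-25s %6.2f%% (%d / %d bytes)%n",
                    pool.getName(), percent, usage.getUsed(), usage.getMax());
        }
    }
}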

I suspected the young generation was short of space, but without conclusive evidence I used jstack to take a thread dump, and there I did see a problem (not a thread deadlock; it looks more like a "livelock"):

From the first section of the dump you can see that a "Low Memory Detector" JVM internal thread (a daemon thread started by the JVM to monitor and report low memory) had been holding the lock 0x00....00, while the C2 CompilerThread1, C2 CompilerThread0, Signal Dispatcher, and Surrogate Locker threads below it were all waiting for that lock, which caused the whole JVM process to hang.

After searching around online, I found that most suggestions were to adjust the heap memory, so I adjusted the overall heap size, the young generation size (-Xmn), and the proportion of the survivor spaces in the young generation (-XX:SurvivorRatio). Because of the large AtomicLong array objects, I also set -XX:PretenureSizeThreshold=10000000, so that any object larger than roughly 10 MB (the value is in bytes) is allocated directly in the old generation; note that this parameter only takes effect with the Serial and ParNew collectors. In addition, since I wanted the large number of long-lived small AtomicLong objects to move into the old generation and avoid the old-generation AtomicLong array objects holding references to young-generation AtomicLong objects, I lowered -XX:MaxTenuringThreshold (default 15) to 8: an object can now survive at most 8 young collections, after which it is promoted to the old generation even if the young generation still has room. The modified or added JVM GC parameters are as follows:

-Xms9000M -Xmx9000M -Xmn1500M -XX:SurvivorRatio=6 -XX:MaxTenuringThreshold=8 -XX:PretenureSizeThreshold=10000000

After rebooting, I used jstat -gcutil <pid> 1000 again and saw something even more alarming: Eden occupancy kept growing rapidly, survivor occupancy remained very high, a young GC occurred roughly every two minutes, and every young GC promoted a significant amount of memory into the old generation; at that rate a full GC would happen every three or four hours, which is clearly unreasonable.

The second column is S1, occupied up to 87.45%; the third column is the Eden occupancy, which can be seen growing very quickly.

So I used jmap -histo:live <pid> (note that this command triggers a full GC, so use it with caution in an online environment with heavy concurrent traffic) to look at the live objects, and found some int[] and char[] arrays whose memory footprint kept growing, together taking up a good few hundred MB; it dropped after a young GC, quickly grew again, dropped after the next young GC, and so on in a cycle.

At this point I suspected that the large numbers of int[] and char[] objects were basically filling the survivor space, so that when Eden filled up, newly created int[] and char[] objects could not fit into survivor and were promoted directly to the old generation. That turned out to be partly right: it explains why S1 was so high, but it does not explain why the Eden occupancy described above kept climbing.

So I went on to look at the interface call logs, and got a shock: the log was scrolling extremely fast (99% of it debug-level logging). It turned out that, without notifying our service team, operations and testing had released an Android version on one channel the day before (no wonder the problem showed up today), and the system already had more than 6,400 users; they really dropped us in it. This explains the continuously growing int[] and char[] memory: the large number of interface calls triggered a large amount of debug log output, and writing logs is a heavyweight operation for an online system in terms of both CPU usage and memory consumption. So for a high-concurrency online system, remember to raise the log level to INFO or even ERROR.

Then I changed the log level in log4j.properties to INFO, ran jmap -histo:live <pid> again, and found that the int[] and char[] objects were significantly reduced, their memory footprint dropping from several hundred MB to a few MB.
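Besides lowering the level in log4j.properties, it also helps to guard expensive debug statements in code, so that the message string (and the char[] garbage behind it) is never built when DEBUG is disabled. A minimal Log4j 1.x sketch, with an illustrative class name that is not from the original system:

import org.apache.log4j.Logger;

// Illustrative class; not from the original system.
public class OrderService {
    private static final Logger LOG = Logger.getLogger(OrderService.class);

    public void handle(String requestPayload) {
        // Without the guard, the concatenation below runs on every call and creates
        // short-lived String/char[] garbage even when the DEBUG level is disabled.
        if (LOG.isDebugEnabled()) {
            LOG.debug("handling request: " + requestPayload);
        }
        // ... business logic ...
    }
}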

Then I used jstat -gcutil <pid> 1000 to check the GC situation again, as follows:

Clearly the survivor occupancy was no longer that high, and, most importantly, old-generation memory no longer grew after each young GC. Eden still seemed to grow quickly, but that was because there were now more users than before. At this point the problem appeared to be largely fixed, pending further observation.

Summary

Overall, this system violated an assumption of generational GC: the JVM heap contained a huge number of long-lived small objects (the AtomicLong objects). That was bound to plant a landmine in the system.

The basic assumption behind generational GC is:

Most of the objects in the JVM heap are small, short-lived objects.

This is why the young generation of the HotSpot JVM uses a copying algorithm: when most objects are already dead by collection time, only the few survivors need to be copied, so each young collection stays cheap.
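As a toy illustration of the two allocation patterns involved here (not code from the article), the retained AtomicLong array below pins a large number of small objects as long-lived objects, while the loop allocations die immediately, which is the behavior the copying young-generation collector is optimized for:

import java.util.concurrent.atomic.AtomicLong;

public class AllocationPatterns {
    // Long-lived small objects: each element is a separate AtomicLong that survives
    // many young GCs and eventually has to be managed by the old generation.
    static final AtomicLong[] LONG_LIVED = new AtomicLong[1000000];

    public static void main(String[] args) {
        for (int i = 0; i < LONG_LIVED.length; i++) {
            LONG_LIVED[i] = new AtomicLong();
        }
        // Short-lived small objects: garbage right after each iteration, which is the
        // pattern the generational hypothesis (and the copying collector) expects.
        long sum = 0;
        for (int i = 0; i < 10000000; i++) {
            sum += new StringBuilder("req-").append(i).toString().length();
        }
        System.out.println(sum);
    }
}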

Finally, here are some very good GC reference articles (the first two are based on the book "In-Depth Understanding of the Java Virtual Machine"; the links below cover most of the material I consulted today; pick and choose as you like):

JVM Memory Management: In-depth Java Memory Areas and OOM http://www.iteye.com/topic/802573

JVM Memory Management: In-depth Garbage Collectors and Memory Allocation Strategy http://www.iteye.com/topic/802638

Oracle GC Tuning http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html

Java 6 JVM Options Complete Reference (Chinese edition) http://kenwublog.com/docs/java6-jvm-options-chinese-edition.htm

Java HotSpot VM Options http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

CMS GC Practice Summary http://www.iteye.com/topic/473874

JVM Memory Allocation and Reclamation http://blog.csdn.net/eric_sunah/article/details/7893310

Step-by-Step JVM Optimization series http://blog.csdn.net/zhoutao198712/article/category/1194642

Java Thread Dump Analysis http://www.linuxidc.com/Linux/2009-01/18171.htm http://jameswxx.iteye.com/blog/1041173

Diagnosing JVM Faults with Java Dumps http://www.ibm.com/developerworks/cn/websphere/library/techarticles/0903_suipf_javadump/

Detecting Low Memory in Java https://techblug.wordpress.com/2011/07/16/detecting-low-memory-in-java/

Detecting Low Memory in Java, Part 2 http://techblug.wordpress.com/2011/07/21/detecting-low-memory-in-java-part-2/

http://blog.sina.com.cn/s/blog_56d8ea9001014de3.html

http://stackoverflow.com/questions/2101518/difference-between-xxuseparallelgc-and-xxuseparnewgc

http://stackoverflow.com/questions/220388/java-concurrent-and-parallel-gc

http://j2eedebug.blogspot.com/2008/12/what-to-look-for-in-java-thread-dumps.html

https://devcenter.heroku.com/articles/java-memory-issues

http://blog.csdn.net/sun7545526/article/category/1193563

http://java.dzone.com/articles/how-tame-java-gc-pauses

Comment (2nd floor, posted 2015-01-07 16:49, replying to a question from "Wood wood three" about the last picture, where the 6th column shows YGC=4 and the 8th column shows FGC=9, i.e. more full GCs than young GCs): In fact the problem had not really been solved at that point and appeared again later. The root cause was the AtomicLong[] used in my rewritten CountingBloomFilter: a huge number of small AtomicLong objects was produced in an instant, the young generation could not hold them, they were promoted directly to the old generation, and the promotion failed. I later changed it to AtomicLongArray, which solved the problem.
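For reference, here is a minimal sketch of the AtomicLongArray-based fix mentioned in this comment (illustrative names, assuming the same bucket-counter shape as the earlier sketch). AtomicLongArray keeps all counters in a single backing long[] inside one object, so millions of counters no longer mean millions of separate AtomicLong objects:

import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative sketch only; not the actual AdjustedCountingBloomFilter code.
public class ArrayCounter {
    private final AtomicLongArray buckets;

    public ArrayCounter(int size) {
        // One object backed by a single long[], instead of 'size' AtomicLong objects.
        buckets = new AtomicLongArray(size);
    }

    public void increment(int index) {
        buckets.incrementAndGet(index);   // still lock-free, CAS-based internally
    }

    public long get(int index) {
        return buckets.get(index);
    }
}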
