Troubleshooting Full GC Triggered by a Java ExecutorService Thread Pool


Today I dealt with an online problem caused by an ExecutorService thread pool; this post records the troubleshooting process and some debugging notes.


Scenario Description

In the distributed task processing module, a Java daemon process receives task descriptions (jars of Java code) from a queue, spawns sub-processes (each a separate JVM) that class-load and execute the user-defined Java code, collects their logs, and handles the results. There can be hundreds of such sub-processes. The startup parameters are as follows:

java -server -Xss256k -Xmx4096m -Xms4096m -XX:NewRatio=4 -XX:PermSize=96m -XX:MaxPermSize=96m -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./stderr -XX:+UseConcMarkSweepGC
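
For context only, here is a minimal sketch of how a parent daemon might launch one such separate-JVM sub-process; the entry class com.example.TaskRunner and the jar path are hypothetical and not from the article:

import java.io.File;
import java.io.IOException;

public class SubProcessLauncher {
    // Launches one task in a separate JVM; the entry class and jar path are hypothetical.
    public static Process launch(String taskJarPath) throws IOException {
        ProcessBuilder pb = new ProcessBuilder(
                "java", "-Xmx512m",
                "-cp", taskJarPath,
                "com.example.TaskRunner");   // hypothetical main class inside the task jar
        pb.redirectErrorStream(true);        // merge stderr into stdout so the parent can collect logs
        pb.directory(new File("."));         // working directory for the child JVM
        return pb.start();                   // the parent reads the child's stdout for logs and results
    }
}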


Symptom

A large number of tasks run in the early morning. Normally, memory usage stays around 1 GB to 1.5 GB: the code is written with Young GC friendliness in mind and the processing logic mostly allocates short-lived objects, so Young GC runs a modest number of times, reclaims a large amount of garbage each time, and Full GC is rare.

A routine gc.log excerpt:

2014-04-02T00:15:13.336+0800: 31549.990: [GC 31549.990: [ParNew: 673793K->2029K(755008K), 0.0051360 secs] 731427K->60257K(4110464K), 0.0053540 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
2014-04-02T02:02:17.091+0800: 37973.745: [GC 37973.745: [ParNew: 673197K->3098K(755008K), 0.0056950 secs] 731425K->61807K(4110464K), 0.0059840 secs] [Times: user=0.04 sys=0.00, real=0.00 secs]
2014-04-02T03:25:39.567+0800: 42976.221: [GC 42976.221: [ParNew: 674266K->2130K(755008K), 0.0067750 secs] 732975K->60842K(4110464K), 0.0070530 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2014-04-02T04:12:15.891+0800: 45772.545: [GC 45772.545: [ParNew: 673298K->4497K(755008K), 0.0103740 secs] 732010K->64271K(4110464K), 0.0106280 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
2014-04-02T04:40:27.685+0800: 47464.339: [GC 47464.340: [ParNew: 675611K->1931K(755008K), 0.0076410 secs] 735386K->61723K(4110464K), 0.0078830 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2014-04-02T05:02:00.719+0800: 48757.373: [GC 48757.373: [ParNew: 673095K->2770K(755008K), 0.0069400 secs] 732887K->62579K(4110464K), 0.0071810 secs] [Times: user=0.04 sys=0.00, real=0.00 secs]
2014-04-02T05:18:45.616+0800: 49762.270: [GC 49762.270: [ParNew: 673911K->4611K(755008K), 0.0084600 secs] 733720K->64424K(4110464K), 0.0087260 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2014-04-02T05:31:57.960+0800: 50554.614: [GC 50554.614: [ParNew: 675779K->7261K(755008K), 0.0111740 secs] 735592K->67237K(4110464K), 0.0114760 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]
2014-04-02T05:39:29.031+0800: 51005.685: [GC 51005.686: [ParNew: 678429K->8082K(755008K), 0.0091520 secs] 738405K->68060K(4110464K), 0.0094180 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2014-04-02T05:45:38.114+0800: 51374.768: [GC 51374.768: [ParNew: 679250K->8359K(755008K), 0.0071780 secs] 739228K->69105K(4110464K), 0.0074630 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]

A routine jstat excerpt (shortly after startup):

[xxxxxxxxxxxxxxxxxxxx]$ sudo /usr/java/default/bin/jstat -gcutil 9974 1s
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
  0.00   0.60  78.51   2.52  40.91    139    0.718     0    0.000    0.718
  0.00   0.60  78.51   2.52  40.91    139    0.718     0    0.000    0.718
  0.00   0.60  78.51   2.52  40.91    139    0.718     0    0.000    0.718
  0.00   0.60  78.51   2.52  40.91    139    0.718     0    0.000    0.718
  0.00   0.60  78.51   2.52  40.91    139    0.718     0    0.000    0.718

In the early morning, an alarm fired for abnormal JVM memory: heap usage was almost full and the Full GC frequency had increased sharply. Because the sub-processes exchange data with the parent process in memory, some tasks failed directly with OutOfMemoryError: Java heap space.

gc.log after the memory surge:

2014-03-31T08:28:49.731+0800: 210168.258: [CMS-concurrent-sweep-start]
2014-03-31T08:28:49.766+0800: 210168.294: [CMS-concurrent-sweep: 0.035/0.035 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
2014-03-31T08:28:49.766+0800: 210168.294: [CMS-concurrent-reset-start]
2014-03-31T08:28:49.776+0800: 210168.304: [CMS-concurrent-reset: 0.010/0.010 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2014-03-31T08:28:49.851+0800: 210168.379: [GC 210168.379: [ParNew: 755007K->755007K(755008K), 0.0000420 secs]210168.379: [CMS: 3355177K->3355176K(3355456K), 0.2814320 secs] 4110185K->4110177K(4110464K), [CMS Perm : 40876K->40876K(98304K)], 0.2817730 secs] [Times: user=0.29 sys=0.00, real=0.28 secs]
2014-03-31T08:28:50.134+0800: 210168.662: [GC [1 CMS-initial-mark: 3355176K(3355456K)] 4110182K(4110464K), 0.0070090 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
2014-03-31T08:28:50.827+0800: 210169.355: [GC [1 CMS-initial-mark: 3355176K(3355456K)] 4110182K(4110464K), 0.0070100 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2014-03-31T08:28:50.834+0800: 210169.362: [CMS-concurrent-mark-start]
2014-03-31T08:28:50.884+0800: 210169.412: [CMS-concurrent-mark: 0.049/0.050 secs] [Times: user=0.25 sys=0.00, real=0.05 secs]
2014-03-31T08:28:50.884+0800: 210169.412: [CMS-concurrent-preclean-start]
2014-03-31T08:28:51.007+0800: 210169.534: [CMS-concurrent-preclean: 0.123/0.123 secs] [Times: user=0.12 sys=0.00, real=0.12 secs]
2014-03-31T08:28:51.007+0800: 210169.534: [CMS-concurrent-abortable-preclean-start]
2014-03-31T08:28:51.007+0800: 210169.534: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2014-03-31T08:28:51.007+0800: 210169.535: [GC[YG occupancy: 755007 K (755008 K)]210169.535: [Rescan (parallel) , 0.0059270 secs]210169.541: [weak refs processing, 0.0000110 secs] [1 CMS-remark: 3355177K(3355456K)] 4110185K(4110464K), 0.0060740 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
2014-03-31T08:28:51.014+0800: 210169.541: [CMS-concurrent-sweep-start]
2014-03-31T08:28:51.049+0800: 210169.577: [CMS-concurrent-sweep: 0.035/0.035 secs] [Times: user=0.04 sys=0.00, real=0.04 secs]
2014-03-31T08:28:51.049+0800: 210169.577: [CMS-concurrent-reset-start]
2014-03-31T08:28:51.059+0800: 210169.587: [CMS-concurrent-reset: 0.010/0.010 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]


It turned out each Full GC took nearly 2 seconds, and only a small amount of memory could be reclaimed each time, yet the GC kept trying. jstat showed Full GC running frequently, and jstat -gccause reported Allocation Failure as the cause of the collections. jmap -heap showed the old generation was already 98% used, but the JVM had neither thrown a fatal OOM nor crashed. Had it crashed, the Hypervisor would have restarted the process, which would then recover and rebuild the lost data. Because a small portion of memory could still be freed on every collection, however, the JVM never crashed (how to change this JVM behavior is discussed later).


(A tip: if jstat or jmap hangs for a long time and then exits with an exception, try a few more times, adding the -F option where supported.)


Checking memory with /usr/java/default/bin/jmap -histo 9974 showed a huge number of live char[] and String instances. That alone could not locate the problem, because a histogram carries no stack trace or owning-class information, so I decided to use /usr/java/default/bin/jmap -dump:format=b,file=jvm.dump.hprof 9974 to dump an image of the JVM's entire heap (the JVM also dumps an hprof file automatically on OOM, named java_pidxxx.hprof). The size of this dump is roughly the same as the current eden + from/to survivor + old + perm usage:

[xxxxxxxxxxxxxxxxxxxxxxxx]$ du -sh jvm.dump.hprof
4.0G    jvm.dump.hprof

At first the file seemed too large to move: the machine is a remote online server, and cross-data-center transfer is slow. Then I noticed the dumped hprof contains a large amount of text, so I compressed it with tar; the compressed size was only 300+ MB. The company's own heap analysis tooling was not convenient to use, so it took about 20 minutes to pull the hprof to my local machine and analyze it with MAT (the Eclipse MAT plugin also works). Note that if the hprof is large, you need to raise the -Xmx in the MAT or Eclipse startup parameters so it is larger than the hprof before analysis. As follows:




MAT analysis showed that most of the retained heap was held by the internal BlockingQueue of an ExecutorService thread pool, whose queued objects could not be collected. (MAT screenshot: http://www.2cto.com/uploadfile/Collfiles/20140403/20140403091851216.jpg)


If you are not familiar with ExecutorService, you can Google it or refer to my earlier blog post on TaskQueue. A BlockingQueue is an asynchronous intermediate buffer that decouples producers from consumers. In this scenario the thread pool size and concurrency were already under control, and normally the BlockingQueue does not grow very large as long as the consumer side keeps up. The consumer here is a program doing socket I/O, so could the network reads/writes be failing or blocking?
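
To make the risk concrete, here is a minimal, self-contained sketch (not the production code; the payload size, task count, and timing are made up) of how a pool backed by an unbounded LinkedBlockingQueue piles up strongly referenced tasks when the consumers are slower than the producer:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class QueueGrowthDemo {
    public static void main(String[] args) {
        // Executors.newFixedThreadPool backs the pool with an unbounded LinkedBlockingQueue.
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (int i = 0; i < 100_000; i++) {
            final byte[] payload = new byte[10 * 1024];   // each pending task keeps a strong reference
            pool.execute(() -> consumeSlowly(payload));   // the "consumer" is far slower than this submit loop
        }

        // Nearly all tasks are still queued here: the pending Runnables (and their payloads)
        // stay reachable, so the heap fills up and Full GC cannot reclaim them.
        System.out.println("pending tasks: " + ((ThreadPoolExecutor) pool).getQueue().size());
        pool.shutdownNow();
    }

    private static void consumeSlowly(byte[] payload) {
        try {
            Thread.sleep(100);   // stands in for a slow socket read/write on the consumer side
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}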

The logs later confirmed a large number of socket exceptions: the network had been down for an extended period at the time. The socket read/write timeout was configured to a fairly large value, on the order of seconds, so the rate at which the consumer drained the BlockingQueue fell far below the producer's rate, the queue kept growing, and because the objects in the queue were strongly referenced, Full GC was triggered continuously.
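
To illustrate where the consumer time goes: a blocking socket read stalls the worker thread until data arrives or the configured SO_TIMEOUT expires, so with a large timeout and a broken network each task can take seconds. A minimal sketch, with a hypothetical host, port, and timeout value:

import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class TimedSocketRead {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("example.com", 9000), 2_000); // connect timeout: 2s
            socket.setSoTimeout(2_000);            // read() now blocks at most 2s before giving up
            InputStream in = socket.getInputStream();
            byte[] buf = new byte[4096];
            int n = in.read(buf);                  // throws SocketTimeoutException if no data arrives in time
            System.out.println("read " + n + " bytes");
        } catch (SocketTimeoutException e) {
            // A slow or dead peer no longer stalls the consumer thread for long;
            // the task fails fast and the queue keeps draining.
            System.err.println("read timed out: " + e.getMessage());
        }
    }
}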


Many colleagues then wondered: why did the JVM not crash after such frequent Full GC? The reason is that the GC overhead limit parameters were not set on the JVM:

-XX:+UseGCOverheadLimit. With this option enabled, when the JVM spends an excessive proportion of its time in GC while recovering very little memory, it throws an OutOfMemoryError ("GC overhead limit exceeded") instead of looping in Full GC indefinitely.

The code was modified later to construct the pool manually, along the lines of new ThreadPoolExecutor(xxx, new …BlockingDeque()), passing in an explicitly defined BlockingQueue instance; with the ExecutorService built this way, a custom policy can decide how to handle a queue that grows too long. The default BlockingQueue can hold up to Integer.MAX_VALUE elements, and when a bounded queue does fill up, the default Abort policy simply rejects the task, losing data.
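
As a rough illustration of that change, here is a minimal sketch, not the module's actual code; the pool sizes, the choice of LinkedBlockingQueue, its capacity of 10,000, and the CallerRunsPolicy are all assumptions made for the example:

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPoolFactory {
    // Builds a thread pool whose work queue cannot grow without limit.
    public static ThreadPoolExecutor newBoundedPool() {
        return new ThreadPoolExecutor(
                8,                                   // core pool size (assumed value)
                8,                                   // maximum pool size (assumed value)
                60L, TimeUnit.SECONDS,               // idle thread keep-alive
                new LinkedBlockingQueue<>(10_000),   // bounded queue: at most 10,000 pending tasks
                // When the queue is full, run the task in the submitting thread instead of
                // silently losing it (the default AbortPolicy would reject the task).
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}

CallerRunsPolicy is only one option; a custom RejectedExecutionHandler could instead log and drop the task, block the submitter, or persist the task elsewhere. The point is that the queue length and the overflow behavior become explicit choices rather than Integer.MAX_VALUE and silent growth.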


