Hive optimization: notes on a GC overhead limit exceeded error caused by multi-insert

Tags: GC overhead limit exceeded

When you need to compute several kinds of statistics from a single Hive table and store each result in its own statistics table, the natural choice is Hive's multi-insert statement, because a multi-insert avoids scanning the same source data multiple times. This post records a GC overhead limit exceeded problem I ran into while using multi-insert.
Problem description

I had a requirement: starting from a domain-related table, compute statistics along several dimensions and write each result into its corresponding target table. My SQL has the shape of the multi-insert sketched below.
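The original statement is not reproduced here, so the following is only a minimal sketch of the pattern, with hypothetical table and column names (src_domain_log, stat_by_day, stat_by_domain, stat_by_region, log_date, domain, uid, region): one scan of the source table feeds several aggregations, each written to its own statistics table.

-- Hypothetical names; only the multi-insert shape matters here.
FROM src_domain_log
INSERT OVERWRITE TABLE stat_by_day
  SELECT log_date, COUNT(*)
  GROUP BY log_date
INSERT OVERWRITE TABLE stat_by_domain
  SELECT domain, COUNT(DISTINCT uid)
  GROUP BY domain
INSERT OVERWRITE TABLE stat_by_region
  SELECT region, COUNT(*)
  GROUP BY region;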
The statement above generates six jobs. You can prefix the HQL with EXPLAIN to see how Hive resolves it into stages.
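For reference, a sketch of how the plan is inspected, reusing the hypothetical tables from the sketch above; the stage numbers in your own plan output will of course differ from the ones quoted here.

-- EXPLAIN prints the stage plan (root stages and their dependencies)
-- without actually running the job.
EXPLAIN
FROM src_domain_log
INSERT OVERWRITE TABLE stat_by_day
  SELECT log_date, COUNT(*)
  GROUP BY log_date
INSERT OVERWRITE TABLE stat_by_domain
  SELECT domain, COUNT(DISTINCT uid)
  GROUP BY domain;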
In the plan output, Stage-6 is a root stage: it is the first job to run, and that is exactly where the problem appears: GC overhead limit exceeded!
The JobHistory page of the failed job shows that the failure happens in the map phase.

...
map = 99%, reduce = 33%, Cumulative CPU 9676.12 sec
map = 100%, reduce = 100%, Cumulative CPU 9686.12 sec

So the failure is in the map phase. Let's look at the error stack first:

2015-12-01 18:21:02,424 INFO [communication thread] org.apache.hadoop.mapred.Task: Communication exception: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
    at sun.nio.cs.StreamDecoder.<init>(StreamDecoder.java:250)
    at sun.nio.cs.StreamDecoder.<init>(StreamDecoder.java:230)
    at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:69)
    at java.io.InputStreamReader.<init>(InputStreamReader.java:74)
    at java.io.FileReader.<init>(FileReader.java:72)
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:381)
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:162)
    at org.apache.hadoop.mapred.Task.updateResourceCounters(Task.java:839)
    at org.apache.hadoop.mapred.Task.updateCounters(Task.java:978)
    at org.apache.hadoop.mapred.Task.access$500(Task.java:77)
    at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:727)
    at java.lang.Thread.run(Thread.java:745)
So the map phase fails with OutOfMemoryError: GC overhead limit exceeded.

Problem analysis

Everyone knows the usual answer to an OOM: add memory! But I am not made of money, and adding memory does not always fix an OOM, so let's look for the underlying cause first. How do you approach an OOM? Start by being clear about the possible causes: 1. the program genuinely does not have enough memory; 2. the program has a memory leak or is inefficient. Anyone determined to become a seasoned programmer should start with the second. OK, let's analyze:

Hive runtime environment:

OS: Ubuntu 12.04, 8 cores, 32 GB memory
Hadoop 2.2.0, Hive 0.12
Data: 100 GB+ of plain text
The queue used is capped at roughly 40% of the cluster's total resources

The Hive job above launches about 380 map tasks and about 120 reduce tasks, which is not an unreasonable number, yet it really does OOM. The code being executed is Hive itself, not something we wrote, so a memory leak in the code is unlikely. That leaves the Hive SQL as the suspect, and the first thing that comes to mind is the efficiency of multi-insert. Test: run one INSERT at a time, i.e. delete all but one of the INSERT clauses.
The single-insert test statement has the shape sketched below.
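A sketch of one such single-insert test, again with the hypothetical names from the earlier sketch; each aggregation is run as a separate statement.

-- One aggregation at a time, each executed as its own job,
-- to check whether a single insert fits in the map heap.
INSERT OVERWRITE TABLE stat_by_day
SELECT log_date, COUNT(*)
FROM src_domain_log
GROUP BY log_date;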
Every single-insert version runs to completion. In other words, multi-insert is simply more memory-hungry and OOMs; the SQL itself is not buggy. The most likely cause is that we give the MapReduce tasks too little memory. So let's check how much memory is configured. Execute the following commands in the Hive CLI:

hive> set mapreduce.map.java.opts;
mapreduce.map.java.opts=-Xmx1500m

hive> set mapreduce.reduce.java.opts;
mapreduce.reduce.java.opts=-Xmx2048m

hive> set mapreduce.map.memory.mb;
mapreduce.map.memory.mb=2048

hive> set mapreduce.reduce.memory.mb;
mapreduce.reduce.memory.mb=3072
Our problem is an OOM in the map phase, so the map JVM heap is the setting that is too small (mapreduce.map.java.opts=-Xmx1500m, about 1.5 GB). It can be set higher, but it must not exceed the container limit mapreduce.map.memory.mb (2 GB), otherwise YARN will kill the map container; see the sketch below.
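For illustration only (the exact values below are just an example, not what I finally used), raising the map memory consistently means enlarging the container limit and keeping the JVM heap below it:

-- Hypothetical values: enlarge the YARN container for each map task...
set mapreduce.map.memory.mb=3072;
-- ...and raise the map JVM heap, keeping -Xmx below the container size.
set mapreduce.map.java.opts=-Xmx2450m;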

Summary:

For an OOM we need to check two things:

Whether the program has a memory leak
Whether the memory is genuinely set too small

Rule out a problem in the program first. In the case above, the multi-insert simply needed more memory than the map tasks had, so the GC could not keep up. You might ask why the error is GC overhead limit exceeded rather than Java heap space.

Explanation of GC overhead limit exceeded:

1. The exception:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

2. What it means:
An error type introduced in JDK 6. It is thrown when the GC spends most of its time reclaiming only a tiny amount of space, which usually means the heap is too small. The root cause is still insufficient memory.

3. Solutions:
1. Check whether the code allocates unusually large amounts of memory or loops forever.
2. Add the JVM startup parameter -XX:-UseGCOverheadLimit to disable this check (the JVM will then typically fail later with a plain Java heap space error instead).
So for this case, my optimization is as follows:

set mapreduce.map.java.opts=-Xmx1800m -XX:-UseGCOverheadLimit;
This article is not about JVM tuning, so I won't go deeper into that flag here. As for multi-insert itself: when all of the inserts target the same table, the multi-insert can also be replaced with a UNION ALL, as sketched below.
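A minimal sketch of that rewrite, with hypothetical names: instead of several INSERT clauses writing into the same table, the per-dimension results are combined with UNION ALL (wrapped in a subquery, as older Hive versions such as 0.12 require) and written once.

-- Hypothetical target table stat_all(dim, dim_value, cnt).
INSERT OVERWRITE TABLE stat_all
SELECT dim, dim_value, cnt
FROM (
  SELECT 'by_day' AS dim, log_date AS dim_value, COUNT(*) AS cnt
  FROM src_domain_log
  GROUP BY log_date
  UNION ALL
  SELECT 'by_domain' AS dim, domain AS dim_value, COUNT(*) AS cnt
  FROM src_domain_log
  GROUP BY domain
) u;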
