Hadoop 1.0 TaskTracker troubleshooting: a memory leak caused by the distributed cache


On Friday a colleague told me that the on-call engineer had phoned him in the early hours: some jobs were stuck and had been running for a long time without completing. Because it was the middle of the night, he did not have time to go through the logs in detail and simply restarted the problematic TaskTrackers. On one node, however, the TaskTracker process remained stopped, and he asked me to find out what the problem was. The troubleshooting steps are below:

1. Log in to the node whose TaskTracker (TT) process had stopped

(1) Check the disk space:

The disks were not running low on space.
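A disk-space check of this kind is usually just df; a minimal, generic example (mount points and thresholds will differ per cluster):

$ df -h        # confirm that no data or log partition is close to 100% use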

(2) Check load and memory usage with top:

Memory usage and system load were both normal, and there were no zombie processes.
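For a non-interactive snapshot of the same information, something along these lines can be captured (generic commands, not necessarily the exact ones used here):

$ top -b -n 1 | head -n 15    # load average, zombie count, busiest processes
$ free -m                     # physical memory and swap usage in MB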

2. Check the process logs

(1) The log4j log:

2014-11-28 06:47:43,813 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201406171104_5451817_m_001292_0 which needs 1 slots
2014-11-28 06:47:43,813 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 3 and trying to launch attempt_201406171104_5451817_m_001292_0 which needs 1 slots
2014-11-28 06:47:43,814 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201406171104_5451797_m_-1554592361
2014-11-28 06:47:43,814 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201406171104_5451797_m_-1554592361 spawned.
2014-11-28 06:47:43,821 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /data2/hadoop/local/ttprivate/taskTracker/optimus/jobcache/job_201406171104_5451797/attempt_201406171104_5451797_m_000521_0/taskjvm.sh
2014-11-28 06:47:43,992 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.39.4.178:50060, dest: 10.39.5.241:47487, bytes: 18, op: MAPRED_SHUFFLE, cliID: attempt_201406171104_5450184_m_001640_0, duration: 235398
2014-11-28 06:47:44,317 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.39.4.178:50060, dest: 10.39.0.140:60631, bytes: 18, op: MAPRED_SHUFFLE, cliID: attempt_201406171104_5450184_m_001640_0, duration: 163982
2014-11-28 06:47:44,580 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201406171104_5451797_m_-1554592361 given task: attempt_201406171104_5451797_m_000521_0
2014-11-28 06:47:44,895 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.39.4.178:50060, dest: 10.39.4.164:50407, bytes: 18, op: MAPRED_SHUFFLE, cliID: attempt_201406171104_5450184_m_001640_0, duration: 168406
2014-11-28 06:47:45,057 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:

No exceptions appear in the log; it simply ends with the TaskTracker's SHUTDOWN_MSG banner.
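To make sure nothing was missed above the shutdown banner, the tail of the TaskTracker log can also be scanned for errors; the log path below follows the default Hadoop 1.x naming and is illustrative only:

$ grep -nE "ERROR|FATAL|Exception" \
    $HADOOP_HOME/logs/hadoop-$USER-tasktracker-$(hostname).log | tail -n 20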

(2) Since the process had already stopped, its stack and memory mappings could no longer be inspected. However, GC logging had been enabled via the JVM parameters when we configured the Hadoop daemons, so the GC log was still available.
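For reference, GC logs of this form are typically produced by JVM flags along the following lines in hadoop-env.sh (an illustrative sketch; the exact options used on this cluster are not shown here):

# hadoop-env.sh (Hadoop 1.x) - standard HotSpot GC-logging flags
export HADOOP_TASKTRACKER_OPTS="-Xmx4g -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps \
  -Xloggc:${HADOOP_LOG_DIR}/tasktracker-gc.log ${HADOOP_TASKTRACKER_OPTS}"

The GC log leading up to the shutdown: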

2014-11-28T06:47:08.025+0800: 26433049.397: [Full GC [PSYoungGen: 1340224K->1286605K(1368576K)] [PSOldGen: 2796223K->2796223K(2796224K)] 4136447K->4082829K(4164800K) [PSPermGen: 21554K->21553K(21568K)] GC time would exceed GCTimeLimit of 98%
, 5.5652750 secs] [Times: user=0.00 sys=5.56, real=5.57 secs]
2014-11-28T06:47:18.126+0800: 26433059.499: [Full GC [PSYoungGen: 1340224K->1340224K(1368576K)] [PSOldGen: 2796223K->2796223K(2796224K)] 4136447K->4136447K(4164800K) [PSPermGen: 21554K->21554K(21568K)], 6.6131270 secs] [Times: user=0.00 sys=6.61, real=6.61 secs]
2014-11-28T06:47:24.740+0800: 26433066.112: [Full GC [PSYoungGen: 1340224K->1286378K(1368576K)] [PSOldGen: 2796223K->2796223K(2796224K)] 4136447K->4082602K(4164800K) [PSPermGen: 21554K->21553K(21568K)], 5.5440730 secs] [Times: user=0.00 sys=5.55, real=5.55 secs]
2014-11-28T06:47:30.994+0800: 26433072.367: [Full GC [PSYoungGen: 1340224K->1340224K(1368576K)] [PSOldGen: 2796223K->2796223K(2796224K)] 4136447K->4136447K(4164800K) [PSPermGen: 21554K->21554K(21568K)], 6.7307300 secs] [Times: user=0.00 sys=6.73, real=6.73 secs]
2014-11-28T06:47:37.725+0800: 26433079.098: [Full GC [PSYoungGen: 1340224K->1287718K(1368576K)] [PSOldGen: 2796223K->2796223K(2796224K)] 4136447K->4083942K(4164800K) [PSPermGen: 21554K->21553K(21568K)], 5.7407480 secs] [Times: user=0.00 sys=5.75, real=5.74 secs]
Heap
 PSYoungGen      total 1368576K, used 1330631K [0x00007f66b3ab0000, 0x00007f6709000000, 0x00007f6709000000)
  eden space 1340224K, 99% used [0x00007f66b3ab0000,0x00007f6704e21d20,0x00007f6705780000)
  from space 28352K, 0% used [0x00007f6705780000,0x00007f6705780000,0x00007f6707330000)
  to   space 29504K, 0% used [0x00007f6707330000,0x00007f6707330000,0x00007f6709000000)
 PSOldGen        total 2796224K, used 2796223K [0x00007f6609000000, 0x00007f66b3ab0000, 0x00007f66b3ab0000)
  object space 2796224K, 99% used [0x00007f6609000000,0x00007f66b3aaffe8,0x00007f66b3ab0000)
 PSPermGen       total 21760K, used 21574K [0x00007f6603c00000, 0x00007f6605140000, 0x00007f6609000000)
  object space 21760K, 99% used [0x00007f6603c00000,0x00007f6605111a98,0x00007f6605140000)

Before the process stopped, the old generation was 99% full and the JVM was running back-to-back full GCs; the time spent in full GC exceeded the GC overhead limit (GCTimeLimit, 98% by default).

From this it is clear what happened: the TaskTracker's old generation was full and the process was stuck in continuous full GC. Full GC is a stop-the-world operation, so while it runs the TaskTracker cannot serve any requests, and this is the root cause of the jobs getting stuck. Having established that the problem was in the TaskTracker's heap, the next question is what filled it up. We configure the TaskTracker with a 4 GB heap, which should be more than enough, so a memory leak seemed likely. To confirm the leak I needed memory information from a live process, so I logged in to another node whose TaskTracker was still running and executed jstat -gcutil ${pid}:

$ jstat -gcutil 27617
  S0     S1     E      O      P     YGC      YGCT    FGC       FGCT        GCT
  0.00   0.00 100.00 100.00  99.85  621679 20840.395 47500 251938.770 272779.165
$ jstat -gcutil 27617
  S0     S1     E      O      P     YGC      YGCT    FGC       FGCT        GCT
  0.00   0.00  78.03 100.00  99.76  621679 20840.395 47500 251945.371 272785.766
$ jstat -gcutil 27617
  S0     S1     E      O      P     YGC      YGCT    FGC       FGCT        GCT
  0.00   0.00  78.41 100.00  99.76  621679 20840.395 47500 251945.371 272785.766
$ jstat -gcutil 27617
  S0     S1     E      O      P     YGC      YGCT    FGC       FGCT        GCT
  0.00   0.00  78.77 100.00  99.76  621679 20840.395 47500 251945.371 272785.766

Surprisingly, this seemingly healthy node has the same problem: the old generation (the O column) is stuck at 100%, and the full GC count (FGC) and accumulated full GC time (FGCT) keep growing. It, too, is leaking memory and constantly running full GCs.
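In jstat -gcutil output, E, O and P are the utilization percentages of the eden, old and permanent generations, while YGC/YGCT and FGC/FGCT are the young and full GC counts and accumulated times. To watch the trend instead of taking single samples, jstat can also be given a sampling interval:

$ jstat -gcutil 27617 5000    # print a new sample every 5 seconds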

Then run:

jmap -histo:live 27617


The jmap histogram shows more than a million instances each of TrackerDistributedCacheManager$CacheStatus and TaskDistributedCacheManager$CacheFile, and about 860,000 TaskDistributedCacheManager instances. All of these objects belong to the TaskTracker's distributed cache, so the distributed cache looked like the likely culprit.
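To dig deeper, or to analyze the objects offline in a tool such as Eclipse MAT or jhat, a full heap dump can also be taken from the live process (the output path here is arbitrary):

$ jmap -dump:live,format=b,file=/tmp/tasktracker-27617.hprof 27617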

Searching the Hadoop JIRA for CacheStatus indeed turns up the issue:

Https://issues.apache.org/jira/browse/MAPREDUCE-3343?jql=text%20~%20%22cachestatus%22

According to the JIRA, when a job is initialized on the TaskTracker, an entry with the job ID as the key and its TaskDistributedCacheManager as the value is put into the jobArchives map. However, after the job completes, the TaskTracker does not promptly remove that job's distributed-cache information from jobArchives, so the entries accumulate. The fix is simple: clear the job's distributed-cache state once the job completes, and the JIRA contains the corresponding patch.
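Until a patched build is rolled out, the practical mitigation is the one the on-call colleague had already used: restart the leaking TaskTrackers so the accumulated cache objects are released. With the standard Hadoop 1.x scripts that is, for example:

$ $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
$ $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker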
