Hadoop version 1.0.3
Problem Description:
As the number of daily MapReduce jobs increased, users were often blocked when submitting jobs, which meant the JobTracker was congested. Once this started happening frequently, we increased the number of RPC handler threads on the JobTracker side and periodically analyzed the JobTracker's thread stacks: if all of the RPC handler threads were blocked, we dumped the stack traces and raised an alarm in time.
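As a rough illustration of such a check, here is a minimal watchdog sketch, not our actual alarm tool: it assumes jstack is on the PATH and the JobTracker PID is passed as the first argument, then counts how many "IPC Server handler" threads are in the BLOCKED state (the class name and exit-code convention below are our own for this sketch).

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Hypothetical watchdog sketch: count BLOCKED "IPC Server handler" threads
// in a jstack dump of the JobTracker process. Not the actual alarm tool.
public class HandlerWatchdog {
    public static void main(String[] args) throws Exception {
        String pid = args[0];                       // JobTracker PID, supplied by the caller
        Process p = new ProcessBuilder("jstack", pid).start();
        BufferedReader in = new BufferedReader(new InputStreamReader(p.getInputStream()));

        int handlers = 0, blocked = 0;
        boolean inHandler = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("\"")) {            // a new thread header line
                inHandler = line.contains("IPC Server handler");
                if (inHandler) handlers++;
            } else if (inHandler && line.contains("java.lang.Thread.State:")) {
                if (line.contains("BLOCKED")) blocked++;
            }
        }
        p.waitFor();

        System.out.println(blocked + "/" + handlers + " IPC handler threads BLOCKED");
        if (handlers > 0 && blocked == handlers) {
            // all handlers stuck: persist the dump and raise an alarm
            System.exit(1);
        }
    }
}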
Causes and solutions:
After analyzing several of these JobTracker stack dumps, we found that every time the JobTracker congested, the threads were ultimately all contending for the lock on a dataQueue object reached through an RPC call.
For example, compute nodes and clients make RPC calls to the JobTracker such as:
(1) org.apache.hadoop.mapred.JobTracker.getReduceTaskReports
(2) org.apache.hadoop.mapred.JobTracker.getMapTaskReports
(3) org.apache.hadoop.mapred.JobTracker.heartbeat
(4) org.apache.hadoop.mapred.JobTracker.getJobStatus
...
All of these methods need to acquire the JobTracker object lock (they are either synchronized methods or synchronize on this inside the method body; see the JobTracker source for details). A simplified sketch of this pattern is shown below.
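As a minimal illustration of the pattern, not the actual JobTracker source, consider a service where every RPC entry point synchronizes on the same object; a single slow call then stalls every other caller:

// Simplified sketch of the locking pattern, not the real JobTracker code.
// Every RPC entry point synchronizes on the same object, so one slow call
// (for example one that ends up writing to HDFS) blocks all the others.
public class TrackerLike {
    public synchronized TaskReport[] getMapTaskReports(String jobId) {
        return lookupReports(jobId, "map");
    }

    public synchronized TaskReport[] getReduceTaskReports(String jobId) {
        return lookupReports(jobId, "reduce");
    }

    public synchronized HeartbeatResponse heartbeat(String trackerName) {
        // If this thread goes on to do a slow operation (such as flushing a
        // job history file to HDFS) while holding the lock, every other
        // synchronized method above must wait for it to finish.
        return new HeartbeatResponse();
    }

    private TaskReport[] lookupReports(String jobId, String type) {
        return new TaskReport[0];
    }

    static class TaskReport {}
    static class HeartbeatResponse {}
}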
The JobTracker thread stacks showed that all of these threads were fighting over the JobTracker object lock:
<0x000000050ce06ae8> (a org.apache.hadoop.mapred.JobTracker)
The thread holding the JobTracker object lock was the IPC Server handler 14 thread, and it was itself blocked in a call to DFSClient.writeChunk, which needs to write data into a dataQueue.
The dataQueue object lock, <0x0000000753bd9a48> (a java.util.LinkedList), was held by a DataStreamer thread, which was writing a job.history file to HDFS.
Almost every time congestion occurred, the DataStreamer thread was writing a history file, and writing to HDFS is a relatively slow operation. So, could the bottleneck be here?
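To make it easier to see why writeChunk ends up blocked, here is a rough sketch of the producer/consumer structure around the dataQueue, simplified from memory rather than copied from the DFSClient source (the class and method names below are illustrative only):

import java.util.LinkedList;

// Rough sketch of the contention seen in the stack dumps; simplified,
// not the actual DFSClient source. A writer thread appends packets to a
// shared dataQueue while a streamer thread drains it to the DataNode
// pipeline, and both synchronize on the queue object.
public class DataQueueSketch {
    private final LinkedList<byte[]> dataQueue = new LinkedList<byte[]>();

    // Called from the write path (e.g. writeChunk): must take the dataQueue
    // monitor before it can enqueue a packet. In the observed stacks this is
    // where the IPC handler thread blocked while still holding the
    // JobTracker object lock.
    public void queuePacket(byte[] packet) {
        synchronized (dataQueue) {
            dataQueue.addLast(packet);
            dataQueue.notifyAll();
        }
    }

    // Streamer thread: in the observed stacks it held the dataQueue monitor
    // while pushing data to HDFS. With a 3 KB block size, a new block and a
    // new pipeline are needed every 3 KB, so this slow section is entered
    // over and over and the writer above keeps waiting for the lock.
    public void streamLoop() throws InterruptedException {
        while (true) {
            synchronized (dataQueue) {
                while (dataQueue.isEmpty()) {
                    dataQueue.wait();
                }
                byte[] packet = dataQueue.removeFirst();
                sendToPipeline(packet);   // slow network / HDFS I/O while holding the lock
            }
        }
    }

    private void sendToPipeline(byte[] packet) {
        // DataNode pipeline I/O omitted in this sketch
    }
}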
We picked a job.history file at random and inspected it through the HDFS web interface on port 50070:
To our surprise, its block size was only 3 KB.
Meanwhile, some job.history files are quite large (at the MB level; for example, when a user's job uploads many dependency JAR packages, the resulting history file becomes very large).
Suppose a job.history file is 3 MB and the block size is 3 KB: the client has to ask the NameNode for roughly 1K blocks and set up roughly 1K pipelines to the DataNodes. No wonder DFSClient.writeChunk keeps waiting on the dataQueue object lock, since each pipeline only transfers 3 KB of data at a time.
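To put numbers on it: 3 MB / 3 KB = 3,145,728 bytes / 3,072 bytes = 1,024 blocks, so writing a single 3 MB history file means roughly 1,024 block allocation requests to the NameNode and 1,024 DataNode pipeline setups and teardowns, where the default 3 MB block size would have needed just one.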
Reading the JobTracker code, we found that the block size of job.history files is controlled by the parameter mapred.jobtracker.job.history.block.size, whose default value is 3 MB. Some predecessor had changed it to 3 KB for reasons unknown. After setting it back to 3 MB, the JobTracker no longer congested. In addition, we split the JobHistoryServer out of the JobTracker and ran it on a separate machine to further relieve the pressure on the JobTracker.
<property>
  <name>mapred.jobtracker.job.history.block.size</name>
  <value>3145728</value>
  <description>The block size of the job history file. Since the job recovery
  uses job history, it is important to dump job history to disk as soon as
  possible. Note that this is an expert level parameter. The default value
  is set to 3 MB.</description>
</property>
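For context, the sketch below shows, in simplified form rather than as the actual JobHistory source, how such a configuration value can end up as the per-file block size of the history file: the Hadoop FileSystem API lets the creator of a file override the filesystem's default block size (the class and method names here are illustrative).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Simplified sketch of how a per-file block size can be applied when a
// job history file is created; not the actual JobHistory code.
public class HistoryBlockSizeSketch {
    public static FSDataOutputStream openHistoryFile(Configuration conf, Path historyFile)
            throws java.io.IOException {
        FileSystem fs = historyFile.getFileSystem(conf);

        // Read the block size for history files from the JobTracker config;
        // 3 MB (3145728) is the documented default.
        long blockSize = conf.getLong("mapred.jobtracker.job.history.block.size",
                                      3 * 1024 * 1024);

        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        short replication = fs.getDefaultReplication();

        // FileSystem.create lets the caller override the default block size,
        // so a misconfigured 3 KB value here yields files with 3 KB blocks.
        return fs.create(historyFile, true, bufferSize, replication, blockSize);
    }
}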
(PS: Those familiar with Hadoop 1.x know that the JobTracker combines resource management and job monitoring in a single process, which limits its job throughput. Hadoop 2.x (YARN) splits the JobTracker's responsibilities into the ResourceManager and per-job ApplicationMasters, which fundamentally solves the JobTracker's low job throughput problem.)