The Advertising Product Technology department had a streaming job that kept getting stuck in the reduce phase, running for several hours without finishing. After their initial troubleshooting failed to find the cause, they emailed me for help. I took a look: the streaming job is implemented in Python, and according to their description it had run without problems before March 17. The likely causes were:
1. A problem with a cluster node
2. Incorrect job configuration parameters, causing problems in the reduce phase
3. A data problem
I then set out to troubleshoot these possibilities one by one.
First, I checked the node where the stuck reduce task was running. The NodeManager log was normal, the dmesg log was normal, and the load and memory usage were normal, so I ruled out a cluster problem.
Second, I found the Java process of the stuck reduce task and checked its memory and CPU usage with jstat -gcutil $pid and top -p $pid. Everything looked normal: there was plenty of free heap memory and full GCs were rare, so it was not a memory problem; CPU usage was only 0.3%, which is perfectly normal, so it did not look like a CPU problem either.
Third, I looked at the log of the reduce task. It had already finished the copy and merge phases and was stuck while processing data, that is, while executing the Python script. Since the job had run successfully before, the script itself should be fine, so it was most likely a data problem. Because the job always got stuck on the reduce task whose ID ends with "r_000119", I looked up the MapReduce environment variables and had them add a piece of logic to the Python reduce script: if the task ID environment variable (mapreduce_task_id) contains the string "r_000119", print the incoming data to standard error for troubleshooting. Since the problem only appeared in the "r_000119" task, the other reduce tasks printed nothing, avoiding a large amount of wasted cluster log space. It did turn out to be a data problem: logically, a single key should never have more than 1,000 records, but the printed output showed that some keys had more than 10,000, and the Python code assumed a key's data would never exceed 1,000 and had no fault tolerance for larger cases, which caused the task to stay stuck for a long time.
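Their actual reduce script is not shown here, so the following is only a minimal sketch of the debugging hook described above, assuming a typical streaming reducer that reads tab-separated key/value lines from standard input: it checks the task ID in the environment and, only for the problematic "r_000119" partition, writes the keys whose record count exceeds the expected 1,000 to standard error.

```python
#!/usr/bin/env python
# Minimal sketch of the debugging hook added to the streaming reducer.
# Only the mapreduce_task_id check and the stderr printing come from the
# post; the tab-separated "key<TAB>value" input format is an assumption.
import os
import sys

# In streaming, the "." in parameter names is replaced with "_",
# so mapreduce.task.id is exposed as the mapreduce_task_id variable.
task_id = os.environ.get("mapreduce_task_id", "")
debug = "r_000119" in task_id  # only the stuck partition prints diagnostics

def flush(key, count):
    # The original logic assumed a key never has more than 1000 records;
    # print the keys that break that assumption to standard error.
    if debug and count > 1000:
        sys.stderr.write("suspicious key %s has %d records\n" % (key, count))
    # ... the normal per-key processing and output would go here ...

current_key = None
count = 0
for line in sys.stdin:
    key = line.split("\t", 1)[0]
    if key != current_key:
        if current_key is not None:
            flush(current_key, count)
        current_key, count = key, 0
    count += 1

if current_key is not None:
    flush(current_key, count)
```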
Summary:
The troubleshooting relied on the MapReduce environment variable mapreduce_task_id. There are other commonly used environment variables as well, which I list here for future reference:
| Name | Type | Description |
| --- | --- | --- |
| mapreduce.job.id | String | The job id |
| mapreduce.job.jar | String | job.jar location in job directory |
| mapreduce.job.local.dir | String | The job specific shared scratch space |
| mapreduce.task.id | String | The task id |
| mapreduce.task.attempt.id | String | The task attempt id |
| mapreduce.task.ismap | boolean | Is this a map task |
| mapreduce.task.partition | int | The id of the task within the job |
| mapreduce.map.input.file | String | The filename that the map is reading from |
| mapreduce.map.input.start | long | The offset of the start of the map input split |
| mapreduce.map.input.length | long | The number of bytes in the map input split |
| mapreduce.task.output.dir | String | The task's temporary output directory |
Note that in streaming, the "." in these parameter names is replaced with "_", e.g. mapreduce.task.id becomes mapreduce_task_id.
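As a quick illustration (not from the original post), here is one way a Python streaming script might read these values from the environment, using a small hypothetical helper that applies the dot-to-underscore convention:

```python
import os

# Hypothetical helper: look up a MapReduce parameter from the environment,
# applying the streaming convention of replacing "." with "_".
def mr_env(name, default=""):
    return os.environ.get(name.replace(".", "_"), default)

job_id    = mr_env("mapreduce.job.id")
task_id   = mr_env("mapreduce.task.id")
is_map    = mr_env("mapreduce.task.ismap") == "true"
partition = mr_env("mapreduce.task.partition")
```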
That was one round of troubleshooting a streaming job whose reduce was stuck because of a data problem.