Troubleshooting a streaming job whose reduce task hung because of a data problem

Source: Internet
Author: User

The Advertising Product Technology department had a job that always got stuck on one reduce task, which would run for several hours without finishing. Their initial troubleshooting did not find the cause, so they emailed me for help. I saw that the streaming job was implemented in Python, and from their description the job had run without problems before March 17. I considered the following possible causes:

1. A problem with a cluster node

2. Incorrect job configuration parameters, causing the reduce task to misbehave

3. A data problem

I then checked these possibilities one by one.

First, I looked at the node where the stuck reduce task was running. The NodeManager log was normal, the dmesg log was normal, and the load and memory usage were normal, which ruled out a cluster problem.

Second, I found the Java process running the stuck reduce task and checked its memory and CPU usage with jstat -gcutil $pid and top -p $pid. Heap memory was sufficient and a full GC (FGC) occurred only once in a long while, so it was not a memory problem. CPU usage was only 0.3%, so it did not look like a CPU problem either.

Third, I looked at the log of the reduce task. The task had finished the copy and merge phases and was stuck in the data-processing phase, that is, while executing the Python script. Since the script had run successfully before, the script itself was probably fine, which pointed to a data problem.

The hang always occurred on the reduce task whose ID ended in "r_000119". I looked up the MapReduce environment variables and asked the job owner to add a check to the Python reduce script: if the environment variable mapreduce_task_id (the task ID) contains the string "r_000119", print the input data to standard error. Because the problem appeared only in the "r_000119" task, the other reduce tasks printed nothing, which kept the debug output from taking up a large amount of cluster storage.

The output confirmed a data problem. By design, a single key should never have more than 1,000 records, but the printed data showed that some keys had more than 10,000. The Python code assumed that a task's data would never exceed 1,000 records per key and had no fault tolerance for larger inputs, so the task stayed stuck for a long time.
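The debugging check described above can be sketched roughly as follows. This is a minimal illustration and not the team's actual script: the function names, the tab-separated "key, value" input format, the DEBUG/WARN line prefixes, and the per-key counting that stands in for the real reduce logic are all assumptions. Only the mapreduce_task_id variable, the "r_000119" suffix, and the 1,000-record expectation come from the description.

```python
import os
import sys

# Suffix of the reduce task that kept hanging.
SUSPECT_TASK = "r_000119"
# By design, one key should not carry more records than this.
EXPECTED_MAX_PER_KEY = 1000


def is_suspect_task(environ=os.environ):
    """Return True only inside the reduce task we want to inspect."""
    # Streaming exports mapreduce.task.id as mapreduce_task_id.
    return SUSPECT_TASK in environ.get("mapreduce_task_id", "")


def reduce_stream(lines, debug=False, err=sys.stderr):
    """Yield (key, record_count) pairs from key-sorted input lines.

    Assumes "key<TAB>value" lines grouped by key. When debug is on,
    every input record is echoed to err before it is processed.
    """
    current_key, count = None, 0
    for line in lines:
        line = line.rstrip("\n")
        if debug:
            err.write("DEBUG " + line + "\n")
        key = line.split("\t", 1)[0]
        if key != current_key:
            if current_key is not None:
                yield current_key, count
            current_key, count = key, 0
        count += 1
    if current_key is not None:
        yield current_key, count


def main(stdin=sys.stdin, stdout=sys.stdout, err=sys.stderr):
    """Run the reducer; dump records only in the suspect task."""
    debug = is_suspect_task()
    for key, count in reduce_stream(stdin, debug, err):
        if count > EXPECTED_MAX_PER_KEY:
            # Flag keys that break the design assumption.
            err.write("WARN key %s has %d records\n" % (key, count))
        stdout.write("%s\t%d\n" % (key, count))
```

To use a script like this as a reducer, call main() at the end of the file. Anything written to standard error ends up in the task's stderr log, so only the suspect task produces the extra output.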

Summary:

The troubleshooting relied on the MapReduce environment variable mapreduce_task_id. There are other commonly used environment variables as well, which I list here for future reference:

Name                        Type     Description
mapreduce.job.id            String   The job ID
mapreduce.job.jar           String   job.jar location in job directory
mapreduce.job.local.dir     String   The job-specific shared scratch space
mapreduce.task.id           String   The task ID
mapreduce.task.attempt.id   String   The task attempt ID
mapreduce.task.ismap        Boolean  Is this a map task
mapreduce.task.partition    Int      The ID of the task within the job
mapreduce.map.input.file    String   The filename that the map is reading from
mapreduce.map.input.start   Long     The offset of the start of the map input split
mapreduce.map.input.length  Long     The number of bytes in the map input split
mapreduce.task.output.dir   String   The task's temporary output directory
In streaming, the "." in these parameter names is replaced with "_", so mapreduce.task.id becomes mapreduce_task_id.
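The name mapping can be wrapped in a small helper so that a streaming script can look up any of the parameters above by its dotted name. This is only a sketch: streaming_env_name and get_job_conf are hypothetical names introduced here, and the helper applies just the "." to "_" replacement described above.

```python
import os


def streaming_env_name(property_name):
    """Map a Hadoop property name to the variable name streaming exports."""
    return property_name.replace(".", "_")


def get_job_conf(property_name, default=None, environ=os.environ):
    """Look up a job configuration value from the task's environment."""
    return environ.get(streaming_env_name(property_name), default)
```

For example, get_job_conf("mapreduce.task.partition") reads the variable mapreduce_task_partition and returns None when the script runs outside a streaming task.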
