The Advertising Product Technology department had a streaming job that kept getting stuck in the reduce phase, running for several hours without finishing. After their initial troubleshooting failed to find the cause, they emailed me for help. I took a look: the streaming job is implemented in Python, and according to their description it had run without problems before March 17. The likely causes were:
1. A problem with a cluster node
2. Incorrect job configuration parameters, causing problems in the reduce phase
3. A data problem
I then set out to troubleshoot these possibilities one by one.
First, I checked the node where the stuck reduce task was running. The NodeManager log was normal, the dmesg log was normal, and the load and memory usage were normal, so I ruled out a cluster problem.
Second, I found the Java process of the stuck reduce task and checked its memory and CPU usage with jstat -gcutil $pid and top -p $pid. Everything looked normal: there was plenty of free heap memory and full GCs were rare, so it was not a memory problem; CPU usage was only 0.3%, which is perfectly normal, so it did not look like a CPU problem either.
Third, I looked at the log of the reduce task. It had already finished the copy and merge phases and was stuck while processing data, that is, while executing the Python script. Since the job had run successfully before, the script itself should be fine, so it was most likely a data problem. Because the job always got stuck on the reduce task whose ID ends with "r_000119", I looked up the MapReduce environment variables and had them add a piece of logic to the Python reduce script: if the task ID environment variable (mapreduce_task_id) contains the string "r_000119", print the incoming data to standard error for troubleshooting. Since the problem only appeared in the "r_000119" task, the other reduce tasks printed nothing, avoiding a large amount of wasted cluster log space. It did turn out to be a data problem: logically, a single key should never have more than 1,000 records, but the printed output showed that some keys had more than 10,000, and the Python code assumed a key's data would never exceed 1,000 and had no fault tolerance for larger cases, which caused the task to stay stuck for a long time.
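Their actual reduce script is not shown here, so the following is only a minimal sketch of the debugging hook described above, assuming a typical streaming reducer that reads tab-separated key/value lines from standard input: it checks the task ID in the environment and, only for the problematic "r_000119" partition, writes the keys whose record count exceeds the expected 1,000 to standard error.

```python
#!/usr/bin/env python
# Minimal sketch of the debugging hook added to the streaming reducer.
# Only the mapreduce_task_id check and the stderr printing come from the
# post; the tab-separated "key<TAB>value" input format is an assumption.
import os
import sys

# In streaming, the "." in parameter names is replaced with "_",
# so mapreduce.task.id is exposed as the mapreduce_task_id variable.
task_id = os.environ.get("mapreduce_task_id", "")
debug = "r_000119" in task_id  # only the stuck partition prints diagnostics

def flush(key, count):
    # The original logic assumed a key never has more than 1000 records;
    # print the keys that break that assumption to standard error.
    if debug and count > 1000:
        sys.stderr.write("suspicious key %s has %d records\n" % (key, count))
    # ... the normal per-key processing and output would go here ...

current_key = None
count = 0
for line in sys.stdin:
    key = line.split("\t", 1)[0]
    if key != current_key:
        if current_key is not None:
            flush(current_key, count)
        current_key, count = key, 0
    count += 1

if current_key is not None:
    flush(current_key, count)
```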
Summary:
The troubleshooting relied on the MapReduce environment variable mapreduce_task_id. There are other commonly used environment variables as well, which I list here for future reference:
| Name | Type | Description |
| --- | --- | --- |
| mapreduce.job.id | String | The job id |
| mapreduce.job.jar | String | job.jar location in job directory |
| mapreduce.job.local.dir | String | The job specific shared scratch space |
| mapreduce.task.id | String | The task id |
| mapreduce.task.attempt.id | String | The task attempt id |
| mapreduce.task.ismap | boolean | Is this a map task |
| mapreduce.task.partition | int | The id of the task within the job |
| mapreduce.map.input.file | String | The filename that the map is reading from |
| mapreduce.map.input.start | long | The offset of the start of the map input split |
| mapreduce.map.input.length | long | The number of bytes in the map input split |
| mapreduce.task.output.dir | String | The task's temporary output directory |
Note that in streaming, the "." in these parameter names is replaced with "_", e.g. mapreduce.task.id becomes mapreduce_task_id.
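As a quick illustration (not from the original post), here is one way a Python streaming script might read these values from the environment, using a small hypothetical helper that applies the dot-to-underscore convention:

```python
import os

# Hypothetical helper: look up a MapReduce parameter from the environment,
# applying the streaming convention of replacing "." with "_".
def mr_env(name, default=""):
    return os.environ.get(name.replace(".", "_"), default)

job_id    = mr_env("mapreduce.job.id")
task_id   = mr_env("mapreduce.task.id")
is_map    = mr_env("mapreduce.task.ismap") == "true"
partition = mr_env("mapreduce.task.partition")
```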
That was one round of troubleshooting a streaming job whose reduce was stuck because of a data problem.