Objective
In a Hadoop cluster, node failure is a very common occurrence.
A key feature of Hadoop is that the failure of a single node does not bring down an entire distributed job.
This article analyzes how the Hadoop platform achieves this.
Hardware failure
Hardware failures fall into two categories: JobTracker node failure and TaskTracker node failure.
1. JobTracker node failure
This is the most serious failure in a Hadoop cluster.
When it happens, the only remedy is to select a new JobTracker node; during the switchover all jobs must be stopped, and tasks that had already completed must be executed all over again.
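Part of the reason this is so disruptive is that in Hadoop 1.x every TaskTracker is pointed at a single, fixed JobTracker address in the cluster configuration. A minimal mapred-site.xml sketch illustrating this (the host name is a placeholder, and mapred.jobtracker.restart.recover is an optional recovery setting, not a full fix for the single point of failure):

```xml
<configuration>
  <!-- Every TaskTracker points at this one JobTracker address;
       replacing the node means updating this across the cluster. -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>
  <!-- Optional: attempt to recover jobs when the JobTracker restarts. -->
  <property>
    <name>mapred.jobtracker.restart.recover</name>
    <value>true</value>
  </property>
</configuration>
```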
2. TaskTracker node failure
This is the most common failure in a Hadoop cluster, and Hadoop has a mature error-handling mechanism for it.
The heartbeat mechanism between the JobTracker and the TaskTrackers requires each TaskTracker to report its progress to the JobTracker within one minute.
If the JobTracker receives no report within that window, it removes the TaskTracker from the set of trackers waiting to be scheduled;
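The expiry check described above can be sketched as a simplified model (this is not Hadoop's actual Java code; the one-minute timeout is taken from the text, and all names here are illustrative):

```python
import time

HEARTBEAT_TIMEOUT = 60.0  # seconds; the one-minute window described in the text

class HeartbeatMonitor:
    """Simplified model: track the last heartbeat time of each TaskTracker."""

    def __init__(self, now=time.time):
        self.now = now            # injectable clock, so the logic is testable
        self.last_heartbeat = {}  # tracker name -> time of last progress report

    def report(self, tracker):
        """Called whenever a TaskTracker reports progress."""
        self.last_heartbeat[tracker] = self.now()

    def schedulable_trackers(self):
        """Trackers still eligible for scheduling: heard from within the timeout."""
        cutoff = self.now() - HEARTBEAT_TIMEOUT
        return {t for t, ts in self.last_heartbeat.items() if ts >= cutoff}
```

For example, a tracker last heard from 70 seconds ago would be excluded from the schedulable set, while one that reported 30 seconds ago would remain.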
If the JobTracker receives a report that a task has failed, it moves that task to the end of the queue to wait for rescheduling. However, a task that fails four times in a row is removed from the task waiting queue and is no longer retried.
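The retry policy in the same paragraph, requeue a failed task at the back of the queue and give up after four consecutive failures, can be sketched like this (hypothetical names, not Hadoop's API; the limit of four comes from the text):

```python
from collections import deque

MAX_ATTEMPTS = 4  # give up after four consecutive failures, per the text

class TaskQueue:
    """Simplified model of requeueing failed tasks with a retry limit."""

    def __init__(self, tasks):
        self.waiting = deque(tasks)
        self.failures = {t: 0 for t in tasks}
        self.abandoned = []  # tasks that hit the retry limit

    def next_task(self):
        """Hand out the task at the front of the waiting queue, if any."""
        return self.waiting.popleft() if self.waiting else None

    def report_failure(self, task):
        """A task attempt failed: requeue it, or drop it after MAX_ATTEMPTS."""
        self.failures[task] += 1
        if self.failures[task] >= MAX_ATTEMPTS:
            self.abandoned.append(task)  # moved out of the waiting queue
        else:
            self.waiting.append(task)    # back of the queue, wait again
```

In this model a task that fails three times is still rescheduled each time; the fourth failure moves it out of the queue for good.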
Summary
Fault handling and maintenance are usually managed by dedicated operations staff, so this article does not dig deeper into that part.
One question remains: when one of several map tasks on a map node fails, why must all the other map tasks on that node be re-executed, while on a reduce node only the failed task itself is re-executed?
I have asked this question on CSDN and hope for an answer soon.