Hadoop Error Handling Mechanism
1. Hardware faults
A hardware fault is a failure of either the jobtracker or a tasktracker.
The jobtracker is a single point of failure. If it fails, Hadoop cannot yet recover from it, so the jobtracker should run on the most reliable hardware available.
The jobtracker uses the heartbeat signal (on a one-minute cycle) to check whether a tasktracker is faulty or overloaded.
The jobtracker removes a faulty tasktracker from its list of task nodes.
If the faulty node was running a map task that had not completed, the jobtracker asks another node to re-execute that map task.
If the faulty node had not completed a reduce task, the jobtracker asks another node to take over the unfinished reduce task.
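The detection and rescheduling steps above can be sketched as a small simulation. This is a toy model rather than Hadoop's actual implementation: the class name `JobTrackerSketch`, its methods, and the 60-second `expiry_seconds` value are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    kind: str              # "map" or "reduce"
    done: bool = False


@dataclass
class JobTrackerSketch:
    """Toy model of heartbeat-based failure detection (hypothetical, not Hadoop code)."""
    expiry_seconds: float = 60.0                         # assumed expiry window
    last_heartbeat: dict = field(default_factory=dict)   # tracker -> time last seen
    assignments: dict = field(default_factory=dict)      # tracker -> list of Task
    pending: list = field(default_factory=list)          # tasks waiting to be reassigned

    def heartbeat(self, tracker: str, now: float) -> None:
        """Record that a tasktracker reported in at time `now`."""
        self.last_heartbeat[tracker] = now
        self.assignments.setdefault(tracker, [])

    def assign(self, tracker: str, task: Task) -> None:
        """Hand a task to a tasktracker."""
        self.assignments.setdefault(tracker, []).append(task)

    def expire_trackers(self, now: float) -> list:
        """Drop trackers with an overdue heartbeat and requeue their unfinished tasks."""
        failed = [t for t, seen in self.last_heartbeat.items()
                  if now - seen > self.expiry_seconds]
        for tracker in failed:
            del self.last_heartbeat[tracker]              # remove from the node list
            for task in self.assignments.pop(tracker, []):
                if not task.done:
                    self.pending.append(task)             # another node will run it
        return failed
```

Calling `expire_trackers` periodically leaves unfinished map and reduce tasks from a silent tasktracker in `pending`, from where a scheduler could hand them to healthy nodes.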
2. Task failures
Task failures are caused by code defects or process crashes.
When a task fails, its JVM reports the error to the parent tasktracker process and then exits. The error message is also written to the log.
The tasktracker notices that the process has exited, or that the task has not reported any progress for a long time, and marks the task as failed.
After marking the task as failed, the tasktracker decrements its running-task counter by 1, which frees a slot for a new task, and reports the failure to the jobtracker in its next heartbeat signal.
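A minimal sketch of this tasktracker-side bookkeeping follows. The class `TaskTrackerSketch`, its fields, and the 600-second `progress_timeout` are hypothetical names and values used only to illustrate the two failure conditions (abnormal process exit, or no progress update for too long).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunningTask:
    task_id: str
    last_progress: float               # time of the most recent progress update
    exit_code: Optional[int] = None    # None while the child process is alive


class TaskTrackerSketch:
    """Toy model of how a tasktracker might mark tasks as failed (illustrative only)."""

    def __init__(self, progress_timeout: float = 600.0):
        self.progress_timeout = progress_timeout   # assumed value, in seconds
        self.running = {}                          # task_id -> RunningTask
        self.failed_reports = []                   # failures for the next heartbeat

    def launch(self, task_id: str, now: float) -> None:
        self.running[task_id] = RunningTask(task_id, last_progress=now)

    @property
    def running_count(self) -> int:
        return len(self.running)

    def check(self, now: float) -> None:
        """Mark tasks failed if their process crashed or stopped reporting progress."""
        for task in list(self.running.values()):
            crashed = task.exit_code is not None and task.exit_code != 0
            silent = now - task.last_progress > self.progress_timeout
            if crashed or silent:
                del self.running[task.task_id]           # running counter drops by one
                self.failed_reports.append(task.task_id)

    def next_heartbeat(self) -> dict:
        """Build the heartbeat payload and clear the failures it reports."""
        payload = {"running": self.running_count,
                   "failed": list(self.failed_reports)}
        self.failed_reports.clear()
        return payload
```

Each failure is reported exactly once: `next_heartbeat` clears the list after building the payload, so a later heartbeat does not repeat it.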
When the jobtracker learns that the task has failed, it puts the task back into the scheduling queue so that it can be reassigned and executed again.
If a task fails four times (the limit is configurable), it is not executed again, and the whole job is declared failed.
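The requeue-and-give-up policy can be illustrated with a short sketch. The function `run_job` and the `attempt` callback are hypothetical; the sketch assumes the job is abandoned as soon as one task's failure count reaches `max_attempts` (4 here), and it runs tasks sequentially rather than across nodes.

```python
from collections import deque


def run_job(task_ids, attempt, max_attempts=4):
    """Run every task, retrying failures until a task exhausts `max_attempts`.

    `attempt(task_id, attempt_number)` is a caller-supplied function that
    returns True on success and False on failure.
    Returns (job_succeeded, failure_counts).
    """
    task_ids = list(task_ids)
    queue = deque(task_ids)                     # scheduling queue
    failures = {tid: 0 for tid in task_ids}     # failed attempts per task

    while queue:
        tid = queue.popleft()
        if attempt(tid, failures[tid] + 1):
            continue                            # task finished successfully
        failures[tid] += 1
        if failures[tid] >= max_attempts:
            return False, failures              # give up: the whole job fails
        queue.append(tid)                       # requeue the task for another attempt
    return True, failures
```

In Hadoop 1.x the corresponding limits are set separately for map and reduce tasks through the `mapred.map.max.attempts` and `mapred.reduce.max.attempts` properties.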