Source Address: http://storm.apache.org/documentation/Fault-tolerance.html
This post summarizes how Storm is designed as a fault-tolerant system.
What happens when a worker dies?
When a worker dies, the supervisor will restart it. If the worker continuously fails on startup and cannot send heartbeats to Nimbus, Nimbus will reassign the worker to another machine.
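This restart-and-timeout behavior is driven by heartbeat settings in storm.yaml. A sketch of the relevant keys (the key names come from Storm's default configuration; the values shown are illustrative, not recommendations):

```yaml
# How often a task heartbeats its liveness (seconds).
task.heartbeat.frequency.secs: 3

# If a worker misses heartbeats for this long, its supervisor
# kills and restarts it.
supervisor.worker.timeout.secs: 30

# If Nimbus sees no heartbeat from a task for this long, it
# reassigns the task's work to another machine.
nimbus.task.timeout.secs: 30
```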
What happens when a node dies?
All tasks assigned to that node will time out, and Nimbus will reassign those tasks to other machines.
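The timeout-and-reassign behavior can be illustrated with a small sketch. This is plain Python for illustration only, not Storm's actual implementation; the function name `reassign_tasks` and its data shapes are made up:

```python
TASK_TIMEOUT_SECS = 30  # analogous to Storm's nimbus.task.timeout.secs

def reassign_tasks(heartbeats, assignments, live_nodes, now):
    """Move tasks whose last heartbeat is stale onto a live node.

    heartbeats:  {task_id: last_heartbeat_timestamp}
    assignments: {task_id: node}
    live_nodes:  nodes currently able to accept work
    """
    for task, node in list(assignments.items()):
        if now - heartbeats.get(task, 0) > TASK_TIMEOUT_SECS:
            # Task timed out (its node likely died): pick another node.
            candidates = [n for n in live_nodes if n != node]
            if candidates:
                assignments[task] = candidates[0]
                heartbeats[task] = now  # reset the clock after reassignment
    return assignments
```

In real Storm the heartbeats flow through ZooKeeper rather than an in-memory dict, but the decision rule is the same: no heartbeat within the timeout means the task is rescheduled elsewhere.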
What happens when the Nimbus or Supervisor daemons die?
Both the Nimbus and Supervisor daemons are designed to be fail-fast (any unexpected situation causes the process to crash itself) and stateless (all state is kept in ZooKeeper or on disk), as described in Setting up a Storm cluster. The Nimbus and Supervisor daemons must therefore be run under supervision, using a tool such as daemontools or monit. So if the Nimbus or Supervisor daemons die, they restart as if nothing had happened.
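A supervision setup along these lines could look like the following monit sketch. The paths and the matched JVM class patterns are assumptions about a particular installation, not a canonical Storm configuration:

```
# Restart the fail-fast daemons whenever they exit.
check process storm-nimbus matching "org.apache.storm.daemon.nimbus"
  start program = "/opt/storm/bin/storm nimbus"
  stop program  = "/usr/bin/pkill -f org.apache.storm.daemon.nimbus"

check process storm-supervisor matching "org.apache.storm.daemon.supervisor"
  start program = "/opt/storm/bin/storm supervisor"
  stop program  = "/usr/bin/pkill -f org.apache.storm.daemon.supervisor"
```

Because the daemons are stateless, a blind restart like this is safe: the restarted process reads its state back from ZooKeeper and local disk.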
Most notably, no worker processes are affected by the death of Nimbus or the Supervisors. This contrasts with Hadoop, where all running jobs are lost if the JobTracker dies.
Is Nimbus a single point of failure?
If the Nimbus node dies, the workers will continue to run. In addition, the Supervisors will continue to restart workers when they die. However, without Nimbus, workers will not be reassigned to other machines when necessary, such as when a worker's machine goes down.
So the answer is that Nimbus is "sort of" a single point of failure. In practice, it is not a big deal when the Nimbus daemon dies, because nothing catastrophic happens. There are plans to make Nimbus highly available in the future.
How does Storm guarantee data processing?
Storm provides mechanisms to guarantee data processing even if nodes die or messages are lost. See Guaranteeing Message Processing for details.
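The core of that guarantee is a spout that keeps each emitted tuple until it is acked, and replays it on failure or timeout. A minimal at-least-once sketch, in plain Python for illustration only (Storm's real API is the `ack`/`fail` callbacks on spouts plus anchored emits in bolts; the class `ReplayingSpout` here is hypothetical):

```python
class ReplayingSpout:
    """Toy model of at-least-once delivery: emitted messages stay in a
    pending map until acked; failed messages are queued for replay."""

    def __init__(self, messages):
        self.queue = list(messages)  # messages waiting to be emitted
        self.pending = {}            # msg_id -> message, awaiting ack
        self.next_id = 0

    def next_tuple(self):
        """Emit the next message, remembering it until it is acked."""
        if not self.queue:
            return None
        msg = self.queue.pop(0)
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = msg
        return msg_id, msg

    def ack(self, msg_id):
        # Fully processed downstream: safe to forget.
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # Processing failed or timed out: put the message back for replay.
        msg = self.pending.pop(msg_id, None)
        if msg is not None:
            self.queue.append(msg)
```

Note this gives at-least-once semantics: a message whose ack was lost in transit may be replayed and processed twice, which is exactly the trade-off the Storm documentation describes.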
Storm documentation: fault tolerance