This article explains the design details of Storm's fault tolerance: what happens when a worker, node, Nimbus, or Supervisor fails, and whether Nimbus is a single point of failure.
What happens when a worker dies?
When a worker dies, the Supervisor will restart it. If it continuously fails on startup and is unable to heartbeat to Nimbus, Nimbus will reassign the worker to another machine.
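The heartbeat and timeout intervals that drive this behavior are configurable in storm.yaml. A hedged sketch follows; the key names and default values below match Storm's defaults.yaml in common releases, but verify them against your version:

```yaml
# How long a worker may go without heartbeating before its
# Supervisor restarts it (default: 30 seconds).
supervisor.worker.timeout.secs: 30

# How often each task writes its heartbeat (default: 3 seconds).
task.heartbeat.frequency.secs: 3

# How long Nimbus waits without a task heartbeat before it
# reassigns that task to another machine (default: 30 seconds).
nimbus.task.timeout.secs: 30
```

Lowering these values makes failover faster at the cost of more false restarts under load or GC pauses.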
What happens when a node dies?
The tasks assigned to that machine will time out, and Nimbus will reassign those tasks to other machines.
What happens when the Nimbus or Supervisor daemon dies?
The Nimbus and Supervisor daemons are designed to be fail-fast (the process self-destructs whenever any unexpected situation is encountered) and stateless (all state is kept in ZooKeeper or on disk). As described in Setting up a Storm cluster, the Nimbus and Supervisor daemons must be run under supervision using a tool like daemontools or Monit. So if the Nimbus or Supervisor daemon dies, it restarts as if nothing happened.
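As a concrete illustration, a minimal daemontools "run" script for Nimbus might look like the following. The paths are illustrative, not prescriptive; adjust them to your installation:

```sh
#!/bin/sh
# daemontools 'run' script, placed at e.g. /service/nimbus/run.
# The supervise process re-executes this script whenever it exits,
# which is exactly the restart behavior a fail-fast daemon needs.
exec 2>&1
exec /opt/storm/bin/storm nimbus
```

The same pattern applies to the Supervisor daemon (`storm supervisor`); note that the Storm daemons run in the foreground, which is what process supervisors expect.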
Most notably, no worker processes are affected by the death of Nimbus or the Supervisors. This is in contrast to Hadoop, where if the JobTracker dies, all running jobs are lost.
Is Nimbus a single point of failure?
If you lose the Nimbus node, the workers will continue to function, and the Supervisors will continue to restart workers if they die. However, without Nimbus, workers will not be reassigned to other machines when necessary (for example, if you lose a worker machine).
So the answer is that Nimbus is "sort of" a single point of failure. In practice, it is not a big deal, since nothing catastrophic happens when the Nimbus daemon dies. There are plans to make Nimbus highly available in the future.
How does Storm guarantee data processing?
Storm provides mechanisms to guarantee that data is processed even if nodes die or messages are lost. See Guaranteeing message processing for details.
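The mechanism behind that guarantee is worth sketching. Storm tracks each tuple tree with a dedicated "acker" task that XORs together the 64-bit IDs of every tuple emitted into and acked out of the tree; the running XOR reaches zero exactly when every emitted tuple has been acked, at which point the spout tuple is known to be fully processed. A minimal Python sketch of this idea (the class and method names are hypothetical, not Storm's API):

```python
import random

class Acker:
    """Tracks tuple trees the way Storm's acker task does: by XOR-ing
    64-bit tuple IDs. The running value hits zero exactly when every
    tuple emitted into the tree has also been acked."""

    def __init__(self):
        self.pending = {}  # spout tuple id -> running XOR for its tree

    def track(self, root_id, value):
        """XOR a new value into the tree; return True when complete."""
        self.pending[root_id] = self.pending.get(root_id, 0) ^ value
        if self.pending[root_id] == 0:
            del self.pending[root_id]
            return True   # tree fully processed: the spout's ack() fires
        return False

# Simulate one spout tuple whose bolt emits two anchored child tuples.
acker = Acker()
spout_id, child_a, child_b = (random.getrandbits(64) for _ in range(3))

acker.track(spout_id, spout_id)  # spout emits the root tuple
# Bolt acks the root and anchors two children in a single message:
done = acker.track(spout_id, spout_id ^ child_a ^ child_b)
assert not done
done = acker.track(spout_id, child_a)  # child A acked, no children
assert not done
done = acker.track(spout_id, child_b)  # child B acked
print("fully processed:", done)
```

When the spout tuple is not fully acked within the topology's message timeout, Storm calls the spout's fail method instead, and the spout may replay the tuple, which is how processing survives lost messages and dead nodes.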
Reference links:
Storm official documentation: Fault Tolerance
Blog: Storm fault-tolerance analysis
Blog: Fault tolerance in Storm