Fault tolerance in Storm

Last Update:2015-03-30 Source: Internet

Author: User


This article explains the design details of Storm's fault tolerance: what happens when a worker, a node, Nimbus, or a supervisor fails, and whether Nimbus is a single point of failure.

What happens when a worker dies?

When a worker dies, the supervisor restarts it. If the worker repeatedly fails on startup and is unable to heartbeat to Nimbus, Nimbus reassigns it to another machine.
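
The two levels of recovery described above can be sketched as a small simulation. This is only an illustration, not Storm's actual code: the function name `decide_recovery`, the two timeout values, and the idea of deciding purely from the age of the last heartbeat are assumptions made for the example.

```python
def decide_recovery(last_heartbeat, now, local_timeout=30, nimbus_timeout=120):
    """Pick a recovery action for one worker from the age of its last heartbeat.

    Hypothetical thresholds: the supervisor restarts a worker that has been
    silent for local_timeout seconds; Nimbus reassigns it to another machine
    once it has stayed silent for nimbus_timeout seconds despite the restarts.
    """
    silence = now - last_heartbeat
    if silence < local_timeout:
        return "healthy"
    if silence < nimbus_timeout:
        return "supervisor-restart"
    return "nimbus-reassign"


# Time (in seconds) of the last heartbeat recorded for each worker.
heartbeats = {"worker-1": 995, "worker-2": 950, "worker-3": 800}
actions = {name: decide_recovery(t, now=1000) for name, t in heartbeats.items()}
print(actions)
```

In this sketch a briefly silent worker is handled locally, and only a worker that never comes back is escalated to Nimbus, which mirrors the order of events in the paragraph above.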

What happens when a node dies?

The tasks assigned to that machine time out, and Nimbus reassigns them to other machines.
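
As a rough illustration of that reassignment, the sketch below moves the tasks of a dead node onto the surviving nodes. The round-robin placement and the name `reassign_tasks` are assumptions for the example; Storm's real scheduler takes more into account.

```python
from itertools import cycle


def reassign_tasks(assignment, dead_node):
    """Return a new assignment with dead_node's tasks spread over the survivors.

    assignment maps a node name to the list of task ids running on it.
    Tasks are handed out round-robin, which is a simplifying assumption.
    """
    survivors = sorted(node for node in assignment if node != dead_node)
    if not survivors:
        raise RuntimeError("no surviving nodes to take over the tasks")
    new_assignment = {node: list(assignment[node]) for node in survivors}
    orphaned = assignment.get(dead_node, [])
    for task, node in zip(orphaned, cycle(survivors)):
        new_assignment[node].append(task)
    return new_assignment


assignment = {"node-a": [1, 2], "node-b": [3, 4], "node-c": [5, 6, 7]}
after_failure = reassign_tasks(assignment, "node-c")
print(after_failure)
```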

What happens when the Nimbus or supervisor daemons die?

The Nimbus and supervisor daemons are designed to be fail-fast (the process self-destructs whenever it encounters an unexpected situation) and stateless (all state is kept in ZooKeeper or on disk). As described in Setting up a Storm cluster, the Nimbus and supervisor daemons must be run under supervision using a tool such as daemontools or monit. So if the Nimbus or supervisor daemons die, they restart as if nothing had happened.
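
The combination of fail-fast, stateless, and supervised can be shown with a toy model. In this sketch a plain dictionary stands in for ZooKeeper or the disk, `supervise` stands in for a tool such as daemontools or monit, and the injected faults are assumed to be transient. All names are hypothetical.

```python
class UnexpectedSituation(Exception):
    """Stands in for any condition the daemon does not know how to handle."""


def run_daemon(store, jobs, pending_faults):
    """Process jobs in order, persisting progress to store after each one.

    The daemon holds no state of its own: it always resumes from
    store["next"]. It fails fast by raising as soon as something goes wrong.
    """
    while store["next"] < len(jobs):
        index = store["next"]
        if index in pending_faults:
            pending_faults.discard(index)  # transient fault: it fires only once
            raise UnexpectedSituation(f"fault while handling {jobs[index]}")
        store["done"].append(jobs[index])
        store["next"] = index + 1


def supervise(store, jobs, pending_faults):
    """Restart the daemon after every crash and return the restart count."""
    restarts = 0
    while True:
        try:
            run_daemon(store, jobs, pending_faults)
            return restarts
        except UnexpectedSituation:
            restarts += 1  # the restarted daemon picks up from the store


store = {"next": 0, "done": []}  # external state, survives daemon crashes
restarts = supervise(store, ["a", "b", "c", "d"], pending_faults={1, 3})
print(restarts, store["done"])
```

Because progress lives only in the external store, each restart continues from the last completed job and no job is processed twice.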

Most notably, no worker processes are affected by the death of Nimbus or the supervisors. This is in contrast to Hadoop, where all running jobs are lost if the JobTracker dies.

Is Nimbus a single point of failure?

If you lose the Nimbus node, the workers still continue to function. Additionally, supervisors continue to restart workers if they die. However, without Nimbus, workers won't be reassigned to other machines when necessary (for example, if you lose a worker machine).

So the answer is that Nimbus is "sort of" a single point of failure (SPOF). In practice, it is not a big deal, since nothing catastrophic happens when the Nimbus daemon dies. There are plans to make Nimbus highly available in the future.
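
The difference between losing Nimbus and losing nothing can be made concrete with a toy cluster model. The data layout and the `recover` function are assumptions for illustration only: local restarts always succeed on a live machine, and workers on a dead machine can be moved only while Nimbus is up.

```python
def recover(machines, nimbus_up):
    """Apply one round of recovery and return the names of workers still down.

    machines maps a machine name to
    {"alive": bool, "workers": {worker_name: is_running}}.
    """
    live = sorted(name for name, info in machines.items() if info["alive"])
    # Supervisors on live machines restart their local workers,
    # whether or not Nimbus is reachable.
    for name in live:
        for worker in machines[name]["workers"]:
            machines[name]["workers"][worker] = True
    # Only Nimbus can move workers off a dead machine.
    for info in machines.values():
        if info["alive"] or not nimbus_up or not live:
            continue
        target = machines[live[0]]["workers"]
        for worker in list(info["workers"]):
            del info["workers"][worker]
            target[worker] = True
    return sorted(
        worker
        for info in machines.values()
        for worker, running in info["workers"].items()
        if not running or not info["alive"]
    )


def make_cluster():
    """Build a fresh two-machine cluster in which m2 has died."""
    return {
        "m1": {"alive": True, "workers": {"w1": True, "w2": False}},
        "m2": {"alive": False, "workers": {"w3": False}},
    }


down_without_nimbus = recover(make_cluster(), nimbus_up=False)
down_with_nimbus = recover(make_cluster(), nimbus_up=True)
print(down_without_nimbus, down_with_nimbus)
```

In the run without Nimbus, the crashed worker on the live machine is restarted but the worker on the dead machine stays stranded, which is the "sort of" in the answer above.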

How does Storm guarantee data processing?

Storm provides mechanisms to guarantee that data is processed even if nodes die or messages are lost. For details, see Guaranteeing message processing.
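
For a flavor of how such a guarantee can be tracked cheaply, the sketch below follows the XOR-based bookkeeping described in Storm's Guaranteeing message processing documentation: every tuple id is XORed into a single value once when the tuple is emitted and once when it is acked, so the value returns to zero only when every emitted tuple has been acked. The class and method names are hypothetical, and timeouts and replay are left out.

```python
import random


class AckTracker:
    """Tracks one tuple tree with a single integer (simplified sketch)."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self.ack_val = 0

    def emit(self):
        """Register a new tuple in the tree and return its random 64-bit id."""
        tuple_id = self._rng.getrandbits(64)
        self.ack_val ^= tuple_id
        return tuple_id

    def ack(self, tuple_id):
        """Mark a tuple as processed by XORing its id in a second time."""
        self.ack_val ^= tuple_id

    def fully_processed(self):
        """True once every emitted tuple has been acked (value is back to 0)."""
        return self.ack_val == 0


tracker = AckTracker(seed=0)
root = tracker.emit()
child_a = tracker.emit()
child_b = tracker.emit()
tracker.ack(root)
tracker.ack(child_a)
pending = not tracker.fully_processed()  # child_b has not been acked yet
tracker.ack(child_b)
print(pending, tracker.fully_processed())
```

The appeal of this design is that the tracker needs constant memory per tuple tree, no matter how many tuples the tree contains.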

Reference Links:

Storm official documentation: Fault tolerance

Blog: Storm fault tolerance analysis

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
