How Storm's concepts are explained and how it works

Source: Internet
Author: User
Tags: emit

The structure of Storm

Storm vs. traditional relational databases
A traditional relational database stores data first and computes later; Storm computes first and stores later, or sometimes does not store at all.
Traditional relational databases are difficult to use for real-time computation; at best they support scheduled tasks that analyze windows of data after the fact.
Relational databases emphasize transactions and concurrency control; Storm is comparatively simple in that respect.
Storm, like Hadoop and Spark, is a popular big data solution.

Languages closely related to Storm: the core code is written in Clojure, the utilities are developed in Python, and topologies are typically developed in Java.

There are two kinds of nodes in a Storm cluster: the control node (the Nimbus node) and the working nodes (the Supervisor nodes). All topology submissions must be done from a Storm client node (which requires a configured storm.yaml file); the Nimbus node then assigns the work to Supervisor nodes for processing. Nimbus first shards the submitted topology into tasks and publishes the task assignments and related Supervisor information to the ZooKeeper cluster; each Supervisor claims its own tasks from ZooKeeper and notifies its worker processes to perform them.

Compared with MapReduce, which is also a computational framework: a MapReduce cluster runs jobs, while a Storm cluster runs topologies. A job ends when its run completes, whereas a topology keeps running until it is killed manually.

Storm does not handle saving computation results; that is the responsibility of the application code. If the data volume is small, you can simply keep results in memory; you can also update a database on every result, or use NoSQL storage. This part is left entirely to the user.

Once the data is stored, any further handling is also up to you; the Storm UI provides only topology monitoring and statistics.

The overall topology processing flow is illustrated by a flowchart (image not included in this copy).


ZooKeeper cluster. Storm uses ZooKeeper to coordinate the entire cluster, but note that Storm does not use ZooKeeper to deliver messages, so the load on ZooKeeper is very low. A single-node ZooKeeper is sufficient in most cases, but if you are deploying a larger Storm cluster, you need a larger ZooKeeper ensemble. On how to deploy ZooKeeper, see http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html

There are some things to be aware of when deploying ZooKeeper:
1. Good monitoring of ZooKeeper is very important. ZooKeeper is a fail-fast system: it exits whenever something goes wrong, so in a real deployment it must be run under supervision and monitored. For more details see http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_supervision
2. In a real deployment, configure a cron job to clean up ZooKeeper's data and transaction logs. ZooKeeper will not clean these up itself, so without such a job you will soon find the disk full. For more details see http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_maintenance

Component. In Storm, spouts and bolts are both components, so Storm defines a common interface called IComponent.
The class hierarchy (shown in a diagram not included in this copy) breaks down as follows: the green part is the most commonly used, relatively simple part; the red part is related to transactions.


Spout. A spout is the source of messages for a stream. A spout component can be implemented by inheriting the BaseRichSpout class or another spout class, or by implementing the IRichSpout interface:
public interface ISpout extends Serializable {
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
    void close();
    void nextTuple();
    void ack(Object msgId);
    void fail(Object msgId);
}
open() - the initialization method.
close() - called when the spout is about to be shut down. However, it is not guaranteed to be called, because a worker process on a Supervisor node can be killed with kill -9. Only when Storm runs in local mode and a stop command is sent is close() guaranteed to execute.
ack(Object msgId) - the callback invoked when a tuple has been processed successfully. A typical implementation removes the message from the message queue to prevent it from being replayed.
fail(Object msgId) - the callback invoked when processing a tuple fails. A typical implementation puts the message back into the message queue so it can be replayed later.
nextTuple() - the most important method in a spout. Emitting a tuple into the topology is done through this method: when it is called, Storm is requesting that the spout emit a tuple to the output collector. The method should be non-blocking, so if the spout has no tuples to emit it should simply return. nextTuple(), ack(), and fail() are all called on the same thread of the spout task, so when there is nothing to emit, nextTuple() should sleep for a very short time (a few milliseconds, say) to avoid wasting CPU.
After inheriting BaseRichSpout, you do not have to implement the close(), activate(), deactivate(), ack(), fail(), and getComponentConfiguration() methods; you implement only the most basic core parts.
Usually (except for shell and transactional spouts), to implement a spout you can implement the IRichSpout interface directly; if you do not want to write redundant code, you can simply extend BaseRichSpout.
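To make the spout contract described above concrete without pulling in a Storm dependency, here is a toy sketch of the nextTuple/ack/fail lifecycle. The SentenceSpout class, its in-memory queue, and the integer message ids are all illustrative stand-ins, not Storm API:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for Storm's spout contract: nextTuple() emits at most one
// tuple per call and must not block; ack()/fail() are callbacks keyed by
// a message id. Pending tuples are kept until acked, and re-queued on fail.
public class SentenceSpout {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Map<Integer, String> pending = new HashMap<>();
    private int nextId = 0;

    public SentenceSpout(String... sentences) {
        for (String s : sentences) queue.add(s);
    }

    // Emulates nextTuple(): returns the emitted sentence, or null when
    // there is nothing to emit (a real spout would just return).
    public String nextTuple() {
        String s = queue.poll();
        if (s == null) return null;   // non-blocking: nothing to emit
        pending.put(nextId++, s);     // keep the tuple until it is acked
        return s;
    }

    // ack(): tuple fully processed, drop it so it is never replayed.
    public void ack(int msgId) {
        pending.remove(msgId);
    }

    // fail(): put the tuple back on the queue to be replayed later.
    public void fail(int msgId) {
        String s = pending.remove(msgId);
        if (s != null) queue.add(s);
    }

    public int pendingCount() { return pending.size(); }
}
```

Note how fail() re-queues the tuple, which is exactly the "put the message back in the message queue and replay it later" behavior described for fail(Object msgId) above.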

Bolt. A bolt receives tuples from a spout or from upstream bolts and processes them. A bolt component can be implemented by extending the BaseRichBolt class, implementing the IRichBolt interface, etc.
prepare() - similar to open() in a spout; called when the task is initialized in a worker on the cluster. It provides the environment for the bolt's execution.
declareOutputFields() - declares the fields contained in the tuples the current bolt emits; similar to its counterpart in the spout.
cleanup() - like ISpout's close(); called before the bolt is shut down. It is likewise not guaranteed to execute.
execute() - the most critical method in a bolt; the processing of a tuple goes into this method. Emission is done through the emit() method of the OutputCollector passed in via prepare(). execute() accepts a tuple, processes it, and uses the collector's ack() (indicating success) or fail() (indicating failure) to feed back the processing result.
Storm also provides the IBasicBolt interface; bolts implementing it do not need to provide feedback in code, as Storm automatically acks on their behalf. If you do want to signal a failure, you can throw a FailedException.
Usually, to implement a bolt you can implement the IRichBolt interface or extend BaseRichBolt; if you do not want to handle result feedback yourself, implement the IBasicBolt interface or extend BaseBasicBolt, which is effectively equivalent to calling collector.ack(inputTuple) automatically.
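The execute-then-ack-or-fail pattern above can be sketched without a Storm dependency as follows. The SplitSentenceBolt class and its minimal Collector are illustrative stand-ins, not Storm API; the boolean return models the ack/fail feedback that BaseBasicBolt would otherwise automate:

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a bolt's execute(): receive one input tuple, emit
// derived tuples through a collector, and report success or failure
// back upstream (the part IBasicBolt/BaseBasicBolt automates).
public class SplitSentenceBolt {
    // Minimal collector: just gathers emitted words.
    public static class Collector {
        public final List<String> emitted = new ArrayList<>();
        public void emit(String word) { emitted.add(word); }
    }

    // Split the sentence and emit one tuple per word.
    // Returns true to signal an ack, false to signal a fail.
    public boolean execute(String sentence, Collector collector) {
        if (sentence == null || sentence.isEmpty()) {
            return false;                       // would translate to fail()
        }
        for (String word : sentence.split("\\s+")) {
            collector.emit(word);
        }
        return true;                            // would translate to ack()
    }
}
```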

Topology running process. (1) After a topology is submitted with Storm, the code is first stored in the inbox directory of the Nimbus node; a stormconf.ser file containing the current run configuration is then generated in the stormdist directory of the Nimbus node, which also holds the serialized topology code.
(2) When setting up the spouts and bolts of a topology, you can also set the executor count and the task count for each spout and bolt. By default, the total number of tasks in a topology equals the total number of executors. The system then distributes the execution of these tasks as evenly as possible across the configured number of workers. Which Supervisor node a worker runs on is determined by Storm itself.
(3) After the tasks are assigned, the Nimbus node submits the task information to the ZooKeeper cluster, including the workerbeats node, which stores the heartbeat information of all worker processes of the current topology.
(4) The Supervisor nodes continually poll the ZooKeeper cluster. The assignments node of ZooKeeper stores all topology task assignments, code storage directories, and the relationships between tasks. Each Supervisor picks up its own tasks by polling this node and then starts its worker processes.
(5) Once a topology is running, the spouts keep emitting streams and the bolts keep processing the incoming streams; the stream is unbounded.
This final step runs uninterrupted unless you manually terminate the topology.
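The "as evenly as possible" distribution in step (2) can be sketched as a simple round-robin assignment. The TaskAssigner class below is only an illustration of the even-spread idea, not Storm's actual scheduler:

```java
import java.util.ArrayList;
import java.util.List;

// Round-robin sketch of spreading T tasks over W workers "as evenly
// as possible": each worker ends up with floor(T/W) or ceil(T/W) tasks.
public class TaskAssigner {
    public static List<List<Integer>> assign(int taskCount, int workerCount) {
        List<List<Integer>> workers = new ArrayList<>();
        for (int w = 0; w < workerCount; w++) {
            workers.add(new ArrayList<>());
        }
        for (int task = 0; task < taskCount; task++) {
            workers.get(task % workerCount).add(task);  // deal out like cards
        }
        return workers;
    }
}
```

For example, 7 tasks over 3 workers yields worker loads of 3, 2, and 2 tasks, never differing by more than one.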

Topology run modes. It is important to understand Storm's operation modes before you begin to create a project. Storm has two modes of operation.
Local-mode submission, for example:
LocalCluster cluster = new LocalCluster();
cluster.submitTopology(topologyName, conf, builder.createTopology());
Thread.sleep(2000);
cluster.shutdown();
Distributed submission, for example:
StormSubmitter.submitTopology(topologyName, conf, builder.createTopology());

Note that after the Storm code is written, it needs to be packaged into a jar to run on Nimbus, and when packaging, the Storm dependency jars must not be included. Otherwise, if the Storm jar is bundled in, a duplicate-configuration-file error at run time will prevent the topology from running, because the local storm.yaml configuration file is loaded before the topology runs.
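With Maven, this packaging rule is commonly expressed by marking the Storm dependency as provided, so it stays on the compile classpath but is excluded from the packaged jar. The groupId/artifactId below are Apache Storm's coordinates; the version shown is only an example and should match your cluster:

```xml
<!-- The Storm jar must not be bundled into the topology jar; -->
<!-- "provided" keeps it available at compile time but out of the package. -->
<dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>1.2.4</version> <!-- example version; use your cluster's version -->
    <scope>provided</scope>
</dependency>
```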

Run the following command: storm jar StormTopology.jar MainClass [args]

Storm daemon commands:
nimbus: storm nimbus starts the Nimbus daemon.
supervisor: storm supervisor starts the Supervisor daemon.
ui: storm ui starts the Storm UI daemon, providing a web-based user interface for monitoring Storm clusters.
drpc: storm drpc starts the DRPC daemon.

Storm management commands. jar: storm jar topology_jar topology_class [arguments...]
The jar command is used to submit a topology to the cluster. It runs the main() method of the specified topology_class, uploads topology_jar to Nimbus, and Nimbus distributes it across the cluster. Once submitted, Storm activates the topology and begins processing. The main() method in topology_class is responsible for calling StormSubmitter.submitTopology() and providing a topology name that is unique within the cluster. If a topology with that name already exists in the cluster, the jar command will fail. A common practice is to specify the topology name via a command-line argument, so the topology is named at submission time.

kill: storm kill topology_name [-w wait_time]
To kill a topology, use the kill command. It destroys the topology in a safe manner: it first deactivates the topology and then, during the wait period, lets the topology finish processing the data currently in flight. When executing the kill command, you can specify the wait time after deactivation with -w [wait seconds]. The same can be done from the Storm UI.

deactivate: storm deactivate topology_name
When a topology is deactivated, all tuples already distributed are still processed, but the spouts' nextTuple() method is no longer called. The same can be done from the Storm UI.

activate: storm activate topology_name
Restarts a deactivated topology. The same can be done from the Storm UI.

rebalance: storm rebalance topology_name [-w wait_time] [-n worker_count] [-e component_name=executor_count]...
Rebalance enables you to redistribute cluster tasks. This is a very powerful command; for example, you can use it after adding nodes to a running cluster. The rebalance command deactivates the topology, reassigns the workers after the specified timeout elapses, and then restarts the topology.
Example: storm rebalance wordcount-topology -w 15 -n 5 -e sentence-spout=4 -e split-bolt=8

There are other administrative commands as well, such as remoteconfvalue, repl, and classpath.

New Storm project considerations. To develop a Storm project, you need the Storm jar on your classpath. The recommended approach is to use Maven; without Maven, you can manually add all the jars in the Storm release to your classpath.

The storm-starter project uses Leiningen as its build and dependency-management tool. You can download the lein script (https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein) to install Leiningen, add it to your PATH, and make it executable. To pull all of Storm's dependency packages, simply run lein deps at the root of the project.

