Storm is a free, open-source, distributed, fault-tolerant real-time computation system that Twitter's developers contributed to the community. Storm makes it easy to run continuous stream computations, filling the real-time gap that Hadoop's batch processing cannot cover.
Storm is often used for real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL.
Characteristics
1. Storm is a distributed, fault-tolerant real-time computation system.
2. A Storm cluster consists mainly of a master node and a set of worker nodes, coordinated by a ZooKeeper cluster.
3. The master node runs a daemon called Nimbus, which receives tasks submitted by users, assigns them to the worker nodes, and monitors for failures. Each worker node runs a daemon called Supervisor, which receives work assignments and starts or stops worker processes as required.
Storm architecture
- Topology: a real-time application running in Storm.
- Nimbus: responsible for resource allocation and task scheduling.
- Supervisor: responsible for accepting tasks assigned by Nimbus and for starting and stopping the worker processes under its management.
- Worker: a process that runs the processing logic of specific components.
- Spout: the component that produces the source data stream in a topology.
- Bolt: a component that receives data in a topology and processes it.
- Task: each spout/bolt thread running inside a worker is called a task.
- Tuple: the basic unit of message delivery.
- Stream grouping: how messages are grouped, i.e., how a stream is partitioned among the tasks of a consuming bolt (see the wiring sketch below).
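To make these terms concrete, here is a minimal wiring sketch using Storm's Java API. Package names follow Storm 2.x, and `QueueSpout` and `UppercaseBolt` are hypothetical components (sketched later in this article), so treat this as an illustration rather than a ready-made topology:

```java
import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;

// A sketch mapping the glossary above onto code (Storm 2.x API assumed).
public class SketchTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout: the source of the stream; parallelism hint 2 = two executor threads.
        builder.setSpout("events", new QueueSpout(), 2);

        // Bolt: subscribes to the spout's stream.
        builder.setBolt("uppercase", new UppercaseBolt(), 4)
               .setNumTasks(8)              // Task: 8 task instances spread over 4 threads
               .shuffleGrouping("events");  // Stream grouping: route tuples randomly

        Config conf = new Config();
        conf.setNumWorkers(2); // Worker: run the topology in two JVM processes

        // builder.createTopology() yields the Topology that gets submitted to the cluster.
    }
}
```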
Understanding the architecture
Storm's design idea
Storm also abstracts the data it processes as a stream: a continuous, unbounded sequence of tuples. Note that when Storm models an event flow, each event in the stream is abstracted as a tuple (how tuples are used in Storm is explained later). Storm assumes every stream has a source that produces the primitive tuples, and it abstracts that source as a Spout. A spout might connect to the Twitter API and continuously emit tweets, or it might continuously read elements from a queue and emit them as tuples.

With the source (spout) and the stream in place, how are the tuples in the stream processed? Following the same idea, Storm abstracts the intermediate processing stages of a stream as Bolts. A bolt can consume any number of input streams, as long as those streams are directed at it, and it can also emit new streams to other bolts. So, once a specific spout (think of it as a tap) is opened and the tuples flowing out of it are directed to a specific bolt, the bolt processes the incoming stream and then directs its output to other bolts or to a destination.

We can think of a spout as a tap, with each tap supplying different water; to get a particular kind of water, we turn on the corresponding tap and run a pipe from it to a water processor (a bolt), which in turn pipes its output to other processors or into a container. To improve throughput, it is natural to connect multiple taps to the same water source and to use multiple water processors in parallel. That is exactly how Storm is designed, as we will see.
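As a sketch of the "tap" side, here is a minimal spout. It assumes the Storm 2.x API, and the in-memory queue is a hypothetical stand-in for a real external source (a message queue, an API, and so on):

```java
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// A minimal spout: the "tap" that turns queue elements into a stream of tuples.
public class QueueSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    // Hypothetical in-memory queue standing in for a real source.
    private LinkedBlockingQueue<String> queue;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
        this.queue = new LinkedBlockingQueue<>();
    }

    @Override
    public void nextTuple() {
        // Storm calls nextTuple() in a loop; emit one tuple per available element.
        String event = queue.poll();
        if (event != null) {
            collector.emit(new Values(event));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Every tuple emitted by this spout has a single named field, "event".
        declarer.declare(new Fields("event"));
    }
}
```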
With the above in mind, the picture is easy to understand: a directed acyclic graph, which Storm abstracts as a topology (a topology is indeed acyclic). The topology is the highest-level abstraction in Storm and can be submitted to a Storm cluster for execution. A topology is a graph of stream transformations in which each node is a spout or a bolt, and each edge indicates which streams a bolt subscribes to. When a spout or bolt emits a tuple to a stream, it sends the tuple to every bolt that subscribes to that stream; in other words, we do not have to lay the pipes by hand, because once the subscriptions are declared in advance, Storm delivers each stream to the appropriate bolts.
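And here is the matching "water processor": a bolt that consumes the spout's stream and emits a transformed stream. Again a sketch against the Storm 2.x API; the field name `event` simply matches the hypothetical spout above:

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A minimal bolt: consumes one input stream and emits a new stream downstream.
public class UppercaseBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Read the field declared by the upstream spout, transform it, and re-emit.
        String event = input.getStringByField("event");
        collector.emit(new Values(event.toUpperCase()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Downstream bolts can subscribe to this stream by its "event" field.
        declarer.declare(new Fields("event"));
    }
}
```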
Storm consists mainly of two kinds of components: Nimbus and Supervisor. Both are fail-fast and stateless; task status and heartbeat information are stored in ZooKeeper, and the submitted code resources are kept on the local machine's disk.
Nimbus is responsible for distributing code within the cluster, assigning work to machines, and monitoring their status. There is only one Nimbus in the cluster.
The Supervisor listens for the work assigned to it and starts or shuts down worker processes as needed. One Supervisor is deployed on each machine that runs Storm, and the number of worker slots it provides is set according to the machine's configuration.
ZooKeeper is the external resource that Storm depends on. Nimbus, the Supervisors, and even the actual workers all maintain heartbeats in ZooKeeper. Nimbus also performs scheduling and task assignment based on the heartbeats and task health recorded in ZooKeeper.
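To see where Nimbus and ZooKeeper fit in practice, here is a hedged submission sketch: `StormSubmitter` uploads the topology to Nimbus, which then distributes the code to Supervisors and schedules tasks through ZooKeeper. The component classes reuse the hypothetical sketches above:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new QueueSpout());
        builder.setBolt("uppercase", new UppercaseBolt()).shuffleGrouping("events");

        Config conf = new Config();
        conf.setNumWorkers(2);

        // StormSubmitter uploads the jar and topology definition to Nimbus;
        // Nimbus distributes the code and assigns tasks to Supervisors,
        // coordinating through ZooKeeper as described above.
        StormSubmitter.submitTopology("sketch-topology", conf, builder.createTopology());
    }
}
```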
The program that is submitted to Storm to run is called a topology. The smallest message unit processed by a topology is a tuple, a named list of values of arbitrary types.
A topology consists of spouts and bolts. Spouts are the nodes that emit tuples; bolts are free to subscribe to the tuples emitted by any spout or bolt. Spouts and bolts are collectively referred to as components.
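For local experimentation, the same kind of topology can also be run in-process with `LocalCluster`, which simulates a Storm cluster inside a single JVM, without a separate Nimbus, Supervisor, or ZooKeeper. A sketch, again reusing the hypothetical components from above:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class LocalRunSketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new QueueSpout());
        builder.setBolt("uppercase", new UppercaseBolt()).shuffleGrouping("events");

        // LocalCluster simulates a cluster in one JVM (Storm 2.x: AutoCloseable).
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("local-sketch", new Config(), builder.createTopology());
            Thread.sleep(10_000); // let the topology run briefly before shutdown
        }
    }
}
```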