Hadoop is generally used for offline analysis and computation. Storm, by contrast, targets real-time stream computing and is widely used for real-time log processing, real-time statistics, real-time risk control, and similar scenarios. It can also process data in real time and store the results in a distributed database such as HBase for later queries.
For real-time computation over large volumes of data, Storm provides a scalable, low-latency, reliable, and fault-tolerant distributed computing platform.
1. Introduction to the core objects
Tuple: The basic unit of data in a stream. A tuple can contain multiple fields, and each field represents an attribute.
Topology: A graph of compute nodes. Each node encapsulates processing logic, and the connections between nodes indicate the direction of data flow.
Spout: The source of a stream; it produces tuples.
Bolt: Processes input streams and can generate multiple output streams. A single bolt can do simple data transformations and calculations; complex stream processing typically requires more than one bolt.
Nimbus: The master node, responsible for distributing code across the cluster, assigning work to machines, and monitoring status.
Supervisor: A worker node (one machine) that listens for the work assigned to it and starts and stops worker processes as needed.
Worker: A process that executes part of a topology and spawns tasks.
Task: Each spout and bolt runs as tasks in Storm; each task corresponds to one thread.
[Figure: composition of a Storm topology]
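To make the relationship between tuples, spouts, bolts, and a topology more concrete, here is a minimal sketch in plain Python. It does not use the Storm API: the function names (`sentence_spout`, `split_bolt`, `count_bolt`, `run_topology`) and the word-count scenario are assumptions chosen for illustration, and tuples are modeled as dictionaries with named fields.

```python
from collections import Counter

# Toy stand-ins for Storm concepts; none of these names are Storm APIs.


def sentence_spout():
    """Spout: the source of the stream; emits one tuple per sentence."""
    for sentence in ["storm is fast", "storm is reliable"]:
        yield {"sentence": sentence}  # a tuple with a single named field


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}


def count_bolt(stream):
    """Bolt: keeps a running count per word and emits (word, count) tuples."""
    counts = Counter()
    for tup in stream:
        counts[tup["word"]] += 1
        yield {"word": tup["word"], "count": counts[tup["word"]]}


def run_topology():
    """Topology: wires spout -> split bolt -> count bolt and drains the stream."""
    final = {}
    for tup in count_bolt(split_bolt(sentence_spout())):
        final[tup["word"]] = tup["count"]  # keep the latest count per word
    return final


if __name__ == "__main__":
    print(run_topology())
```

The chain of generators plays the role of the directed graph: each stage only sees the tuples emitted by the stage before it, which is why a more complex computation is split across several bolts rather than packed into one.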
2. Overall architecture
The client submits the topology to Nimbus.
Nimbus creates a local directory for the topology, calculates the tasks according to the topology configuration, and assigns them. It then creates an assignments node in ZooKeeper that stores the mapping between tasks and the workers on the supervisor machine nodes,
and creates a taskbeats node in ZooKeeper to monitor task heartbeats. Finally, it starts the topology.
Each supervisor retrieves its assigned tasks from ZooKeeper and starts multiple workers. Each worker spawns tasks, one thread per task, and initializes the connections between tasks based on the topology information. Communication between tasks is handled through ZeroMQ. At that point the entire topology is running.
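The submission flow above can be sketched as a small simulation. This is a toy model in plain Python, not Storm code: the `assignments` dictionary stands in for the assignments node in ZooKeeper, the round-robin placement is an assumption made for simplicity (it is not claimed to be Storm's actual scheduler), and the component names and `host:port` slot labels are hypothetical.

```python
import threading
from collections import defaultdict


def assign_tasks(parallelism, worker_slots):
    """Nimbus side: number the tasks of each component and spread them
    round-robin over the available worker slots (assumed policy)."""
    assignments = {}
    task_id = 0
    for component, count in parallelism.items():
        for _ in range(count):
            assignments[task_id] = {
                "component": component,
                "worker": worker_slots[task_id % len(worker_slots)],
            }
            task_id += 1
    return assignments


def start_worker(worker, assignments, results, lock):
    """Supervisor side: start one thread per task assigned to this worker,
    wait for all of them, and return how many threads were started."""

    def run_task(task_id, component):
        # A real task would process tuples; here it only records that it ran.
        with lock:
            results[worker].append((task_id, component))

    threads = [
        threading.Thread(target=run_task, args=(tid, info["component"]))
        for tid, info in assignments.items()
        if info["worker"] == worker
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(threads)


def demo():
    # Hypothetical topology configuration: tasks per component.
    parallelism = {"spout": 2, "split": 3, "count": 1}
    # Hypothetical worker slots on two supervisor machines.
    slots = ["supervisor1:6700", "supervisor1:6701", "supervisor2:6700"]
    assignments = assign_tasks(parallelism, slots)
    results = defaultdict(list)
    lock = threading.Lock()
    started = {w: start_worker(w, assignments, results, lock) for w in slots}
    return assignments, started


if __name__ == "__main__":
    assignments, started = demo()
    for task_id, info in assignments.items():
        print(task_id, info)
    print(started)
```

Keeping the assignment in one shared structure mirrors the division of labor described above: Nimbus only writes the task-to-worker mapping, and each supervisor reads the entries that concern it and starts the corresponding workers on its own.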