I. Storm Overview

Storm is a distributed, reliable, fault-tolerant stream processing system. It delegates work to different components, each of which handles a simple, specific task. In a Storm cluster, a spout component reads the input stream and passes the data it reads to bolt components. A bolt processes the tuples it receives and may pass them on to the next bolt. You can picture a Storm cluster as a set of chains of bolts: data travels along these chains, and each bolt is a node on the chain that processes the data.

On the surface, Storm and Hadoop clusters look very similar, but Hadoop runs MapReduce jobs while Storm runs topologies, and the two are very different. The key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (unless you kill it manually). In other words, Storm is built for real-time data analysis, while Hadoop is built for offline (batch) analysis.

Suppose, for example, that you are watching a political debate program. The speakers keep mentioning people's names and hot topics, and it would be interesting to record how often each one is repeated. In a Storm environment, we can treat the debate transcript as the input stream: the spout reads the data and sends each sentence to a bolt1 component; bolt1 splits each sentence into words and sends the words to a bolt2 component; bolt2 counts the words and stores the counts in a database. The debaters keep talking, and Storm keeps refreshing the results in the database in real time; whenever you want to see the results, you just query the database. Now imagine spreading these spouts and bolts evenly across an entire cluster and scaling out almost without limit. That is the power of Storm!
Figure 1.1: A simple topology
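To make the picture above concrete, here is a minimal sketch of that word-count topology using Storm's Java API. The SentenceSpout, SplitBolt, and CountBolt classes are hypothetical placeholders you would implement yourself (for example, by extending BaseRichSpout and BaseRichBolt); the imports assume a recent Apache Storm release (older releases use the backtype.storm packages instead), and LocalCluster runs everything in-process for testing.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

// Minimal sketch of the debate word-count topology described above.
// SentenceSpout, SplitBolt, and CountBolt are hypothetical user-defined classes.
public class DebateWordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // The spout emits one sentence of the debate transcript per tuple.
        builder.setSpout("sentence-spout", new SentenceSpout(), 1);

        // bolt1: splits each sentence into words; shuffleGrouping spreads
        // sentences randomly across the bolt's tasks.
        builder.setBolt("split-bolt", new SplitBolt(), 2)
               .shuffleGrouping("sentence-spout");

        // bolt2: counts words; fieldsGrouping routes the same word to the
        // same task so its running count stays consistent.
        builder.setBolt("count-bolt", new CountBolt(), 2)
               .fieldsGrouping("split-bolt", new Fields("word"));

        // Run the whole topology in-process for a short while, then stop.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("debate-word-count", new Config(), builder.createTopology());
        Thread.sleep(30000);
        cluster.shutdown();
    }
}
```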
The following are typical Storm use cases: 1. Stream processing. 2. Continuous real-time computation. 3. Distributed RPC (remote procedure calls).
II. Storm Components
A Storm cluster has two types of nodes: the master node and the worker nodes.
- Master node: runs the Nimbus daemon, which distributes code around the cluster, assigns tasks, and monitors the running status (mainly watching for node failures); see the submission sketch after Figure 1.2.
- Worker node: runs the Supervisor daemon, which executes a subset of a topology.
Figure 1.2: Components in the Storm cluster
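To show the division of labor between the two node types, here is a minimal sketch of how a client hands a topology to the master node: the packaged jar is uploaded to Nimbus, which distributes the code to the Supervisors and assigns the spout and bolt tasks to their worker processes. As before, SentenceSpout, SplitBolt, and CountBolt are hypothetical placeholders, and the imports assume a recent Apache Storm release.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

// Sketch of submitting the debate word-count topology to a running cluster,
// rather than running it in-process with LocalCluster as in the earlier example.
public class SubmitDebateWordCount {
    public static void main(String[] args) throws Exception {
        // Same wiring as the earlier sketch; the spout and bolt classes are hypothetical.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentence-spout", new SentenceSpout());
        builder.setBolt("split-bolt", new SplitBolt())
               .shuffleGrouping("sentence-spout");
        builder.setBolt("count-bolt", new CountBolt())
               .fieldsGrouping("split-bolt", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(4); // request four worker processes across the cluster

        // Uploads the topology to Nimbus, which schedules its tasks onto Supervisor nodes.
        StormSubmitter.submitTopology("debate-word-count", conf, builder.createTopology());
    }
}
```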
The state of a Storm cluster is stored in ZooKeeper or on local disk, so the Storm daemons themselves are stateless: the failure or restart of any node does not affect the cluster as a whole. Under the hood, Storm uses ZeroMQ for message transport, which gives it some remarkable properties:
- A socket library that acts as a concurrency framework
- Faster than TCP, suitable for cluster environments and supercomputing
- Carries messages over inproc, IPC, TCP, and multicast transports
- Asynchronous I/O
- N-to-N connections via fanout, pub-sub, pipeline, and request-reply patterns
- Push/pull mode
III. Storm Features
- Simple programming model: you mainly implement spouts and bolts.
- Supports multiple programming languages: JVM-based languages are supported natively; any other language only needs a thin adapter that speaks Storm's multi-lang protocol.
- High fault tolerance: if a worker process dies or a node goes down, Storm restarts the work elsewhere.
- Scalable: nodes can be added to or removed from the cluster at any time.
- High reliability: every message is guaranteed to be processed at least once; in other words, messages are not lost in Storm (the bolt sketch after this list shows how acking supports this).
- Fast: speed is a core design goal; with ZeroMQ underneath, message latency stays low enough that you generally do not have to worry about performance.
- Transaction support: transactional topologies make exactly-once message processing semantics possible.
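As a taste of how simple the programming model is, and of how the at-least-once guarantee is wired up, here is a sketch of the counting bolt (the hypothetical CountBolt used in the earlier topology sketches). It extends BaseRichBolt, anchors its output to the input tuple, and acks each tuple; a tuple that is never acked is eventually replayed by the spout. Again, the imports assume a recent Apache Storm release.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Sketch of the counting bolt: keeps a running count per word and acks each tuple.
public class CountBolt extends BaseRichBolt {
    private OutputCollector collector;
    private Map<String, Integer> counts;

    @Override
    public void prepare(Map topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.counts = new HashMap<>();
    }

    @Override
    public void execute(Tuple input) {
        String word = input.getStringByField("word");
        int count = counts.merge(word, 1, Integer::sum);
        // In the debate scenario you would also write the updated count to a database here.

        // Emit the updated count anchored to the input tuple so upstream failures
        // can be detected, then ack to mark this tuple as fully processed.
        collector.emit(input, new Values(word, count));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
```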
Now that we have a preliminary understanding of Storm, in the next section we will write a simple demo and run it, so you can get a feel for Storm in practice.