Introduction to Spark Streaming and Storm
Spark Streaming sits in the Spark ecosystem's technology stack and integrates seamlessly with Spark Core and Spark SQL; Storm, by comparison, is a simpler, standalone framework.
(1) Overview
Spark Streaming is an extension of Spark's core API that processes real-time data streams with high throughput and built-in fault tolerance. Data can be ingested from many sources, including Kafka, Flume, Twitter, ZeroMQ, and TCP sockets, and processed with high-level functions such as map, reduce, join, and window to express complex algorithms. The results can then be written to file systems and databases, and Spark's other sub-frameworks can also be applied to the streaming data.
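As a minimal sketch of this API, the classic streaming word count looks like the following in Scala (the local master URL and the socket source on localhost:9999 are illustrative choices, not part of the original text):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batch interval

    // Ingest lines from a TCP socket (illustrative source)
    val lines = ssc.socketTextStream("localhost", 9999)

    // High-level DStream operations: flatMap, map, reduceByKey
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()         // write each batch's result to stdout
    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // block until the job is stopped
  }
}
```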
Internally, Spark Streaming collects the real-time stream, splits it into batches at a fixed interval, and processes each batch to produce a batch of results. Each batch of data corresponds to one RDD instance in the Spark core, so a DStream can be regarded as a sequence of RDDs.
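Because each batch is backed by an ordinary RDD, standard RDD operations can be applied batch by batch. A brief sketch, continuing the counts DStream from the example above:

```scala
// Each micro-batch surfaces as a plain RDD that regular Spark APIs can process
counts.foreachRDD { (rdd, time) =>
  println(s"Batch at $time produced ${rdd.count()} distinct word counts")
}
```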
Execution process (Receiver mode):
To improve the degree of parallelism: the receiver (running in an executor) splits the received data into blocks every 200 ms by default; adjust this via the spark.streaming.blockInterval setting (a smaller interval produces more blocks, and hence more tasks, per batch);
Enable multiple worker processes to receive data in parallel;
To increase the degree of parallelism in Direct mode, you only need to increase the number of Kafka partitions; in this mode the consumer offsets are managed by Spark and stored in the checkpoint directory (see the sketch below).
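A sketch of these tuning knobs, assuming the spark-streaming-kafka-0-10 integration and reusing the conf and ssc values from the word-count example; the broker address, consumer group, and topic name are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Receiver mode: a smaller block interval yields more blocks, and hence more
// tasks, per batch (set on the SparkConf before the StreamingContext is built)
conf.set("spark.streaming.blockInterval", "100ms")

// Direct mode: Spark creates one partition per Kafka partition, so adding
// Kafka partitions is what raises parallelism; offsets are tracked by Spark
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",              // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group"                      // placeholder group
)
val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))
```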
Storm adopts a master/slave architecture:
Nimbus: runs on the cluster's master node and is responsible for task assignment and distribution.
Supervisor: runs on the cluster's slave nodes and is responsible for executing specific tasks.
ZooKeeper: decouples the master and slave nodes and stores the metadata of cluster resources. Because Storm keeps all of its metadata in ZooKeeper, Storm itself is stateless; Nimbus is needed only when a topology application is submitted.
Worker: the process that runs the logic of a specific component; workers transfer data between one another through Netty.
Task: an instance of a spout or bolt running inside a worker. Several tasks of the same spout/bolt may share a single thread, which is called an executor.
A graph wired together from spouts and bolts is called a topology. When an upstream spout or bolt emits data to downstream bolts, the tuples travel on the default stream unless another stream is specified.
Storm ships with several built-in stream-grouping (distribution) policies; the most common are shuffle grouping and fields grouping, as sketched below.
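A minimal topology-wiring sketch of these two groupings; SentenceSpout, SplitBolt, and CountBolt are hypothetical placeholder classes, and the org.apache.storm package names assume Storm 1.x or later:

```scala
import org.apache.storm.topology.TopologyBuilder
import org.apache.storm.tuple.Fields

val builder = new TopologyBuilder()
builder.setSpout("sentences", new SentenceSpout, 2)
// Shuffle grouping: tuples are distributed randomly and evenly across tasks
builder.setBolt("split", new SplitBolt, 4)
  .shuffleGrouping("sentences")
// Fields grouping: tuples with the same "word" value always reach the same task
builder.setBolt("count", new CountBolt, 4)
  .fieldsGrouping("split", new Fields("word"))
```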
Ack mechanism in Storm: put simply, Storm uses the acker component to count acknowledgements, tracking whether every tuple in a tuple tree has been confirmed; each tuple tree corresponds to one msgId.
To improve the degree of parallelism:
Increase the number of workers; increase the number of executors; set the number of tasks. By default, one task runs in one thread (executor), as sketched below.
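A standalone sketch of these three knobs, reusing the hypothetical SplitBolt from the wiring example above:

```scala
import org.apache.storm.Config

val stormConf = new Config()
stormConf.setNumWorkers(4) // worker processes allocated to this topology

// The parallelism hint sets the number of executors (threads); setNumTasks
// sets the total number of task instances spread across those executors
builder.setBolt("split", new SplitBolt, 4) // 4 executors
  .setNumTasks(8)                          // 8 tasks, i.e. 2 per executor
  .shuffleGrouping("sentences")
```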
Storm provides reliable message-processing guarantees:
Fully processing a tuple requires the Spout, the Bolts, and the Acker (the component Storm uses to record whether a tuple tree has been fully processed) to coordinate. The Spout emits a tuple downstream and reports the corresponding information to the Acker; whenever a tuple in the tuple tree is successfully processed, the Acker is notified. Once the whole tuple tree has been processed, the Acker returns a success message to the Spout; if any tuple fails or times out, the Acker sends a failure message instead. The Spout then decides whether to re-emit the message based on the Acker's response and the message-guarantee mechanism chosen by the user.
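A sketch of the bolt side of this protocol, assuming the Storm 2.x API; SplitBolt is the same hypothetical placeholder used above:

```scala
import java.util.{Map => JMap}
import org.apache.storm.task.{OutputCollector, TopologyContext}
import org.apache.storm.topology.OutputFieldsDeclarer
import org.apache.storm.topology.base.BaseRichBolt
import org.apache.storm.tuple.{Fields, Tuple, Values}

class SplitBolt extends BaseRichBolt {
  private var collector: OutputCollector = _

  override def prepare(conf: JMap[String, AnyRef], ctx: TopologyContext,
                       collector: OutputCollector): Unit =
    this.collector = collector

  override def execute(input: Tuple): Unit =
    try {
      for (word <- input.getString(0).split(" ")) {
        // Anchored emit: the new tuple is added to the input's tuple tree
        collector.emit(input, new Values(word))
      }
      collector.ack(input) // notify the acker that this tuple succeeded
    } catch {
      case _: Exception => collector.fail(input) // lets the spout replay it
    }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}
```

On the spout side, emitting with a message id (collector.emit(new Values(sentence), msgId)) is what registers the tuple tree with the acker; the spout's ack(msgId) and fail(msgId) callbacks then implement the user's chosen re-transmission policy.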