Introduction to Spark Streaming and Storm


Spark Streaming is part of the Spark ecosystem technology stack and integrates seamlessly with Spark Core and Spark SQL; Storm, by comparison, is a relatively simple standalone stream-processing framework.

(1) Overview

  • Spark Streaming

Spark Streaming is an extension of Spark's core API that processes real-time data streams with high throughput and built-in fault tolerance. Data can be ingested from many sources, including Kafka, Flume, Twitter, ZeroMQ, and raw TCP sockets, and processed with high-level functions such as map, reduce, join, and window to express complex algorithms. The results can then be written out to file systems and databases, and the other Spark sub-frameworks can also be applied to the streaming data.

Internally, Spark Streaming collects the real-time input stream, splits it into batches at a fixed interval, and processes each batch in turn, producing a batch of results. Each batch of data corresponds to an RDD instance in the Spark engine, so a DStream can be viewed as a sequence of RDDs.
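To make the micro-batch model concrete, here is a minimal word-count sketch in Java. The socket source on localhost:9999 and the 5-second batch interval are placeholder choices for illustration, not values from the original article.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class StreamingWordCount {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount");
            // Batch interval of 5 seconds: the input stream is cut into
            // 5-second batches, and each batch is processed as one RDD.
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

            // Placeholder source (e.g. feed it with `nc -lk 9999`).
            JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
            JavaDStream<String> words =
                lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
            JavaPairDStream<String, Integer> counts =
                words.mapToPair(w -> new Tuple2<>(w, 1)).reduceByKey(Integer::sum);
            counts.print();

            ssc.start();
            ssc.awaitTermination();
        }
    }

Each print() call here shows the word counts for one 5-second batch, that is, for one underlying RDD of the DStream.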

Execution process (worker mode):

Improving the degree of parallelism: the executor splits the received data into blocks every 200 ms by default; this block interval is controlled by spark.streaming.blockInterval and can be lowered to produce more partitions per batch.

Alternatively, enable multiple workers (receivers) to receive data in parallel and union the resulting streams; both knobs are sketched below.
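In this sketch, the 100 ms block interval, the three receivers, and the socket source are illustrative placeholders, not tuning recommendations:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class ReceiverParallelism {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf()
                // local[4]: three cores for the receivers, one for processing.
                .setMaster("local[4]")
                .setAppName("ReceiverParallelism")
                // Cut blocks every 100 ms instead of the 200 ms default,
                // yielding more partitions (and thus more tasks) per batch.
                .set("spark.streaming.blockInterval", "100ms");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(2));

            // Run several receivers in parallel and union their streams.
            List<JavaDStream<String>> streams = new ArrayList<>();
            for (int i = 0; i < 3; i++) {
                streams.add(ssc.socketTextStream("localhost", 9999));  // placeholder source
            }
            JavaDStream<String> merged = streams.get(0)
                .union(streams.get(1))
                .union(streams.get(2));
            merged.count().print();

            ssc.start();
            ssc.awaitTermination();
        }
    }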

To increase the degree of parallelism in Direct mode, you only need to increase the number of Kafka partitions, since each Kafka partition maps to one partition of the resulting RDDs. In Direct mode, the consumer offsets are managed by Spark itself and stored in the checkpoint directory.
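A minimal Direct-mode sketch using the spark-streaming-kafka-0-10 integration; the broker address, topic name, group id, and checkpoint path are placeholders:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class DirectKafkaExample {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("DirectKafka");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));
            // Offsets are managed by Spark itself; checkpointing persists them
            // (and the DStream lineage) so the job can recover after a failure.
            ssc.checkpoint("/tmp/spark-checkpoint");  // placeholder path

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "localhost:9092");  // placeholder broker
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "demo-group");
            kafkaParams.put("auto.offset.reset", "latest");
            kafkaParams.put("enable.auto.commit", false);

            // Each Kafka partition maps to one RDD partition, so adding topic
            // partitions directly increases processing parallelism.
            JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                    ssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(
                        Arrays.asList("demo-topic"), kafkaParams));

            stream.map(ConsumerRecord::value).print();

            ssc.start();
            ssc.awaitTermination();
        }
    }

Because the offsets live in the checkpoint, restarting the application from the same checkpoint directory resumes from the last processed offsets.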

  • Storm

Storm adopts a master/slave architecture:

Nimbus: runs on the cluster's master node and is responsible for assigning and distributing tasks.

Supervisor: runs on the cluster's slave nodes and executes the tasks assigned to them.

Zookeeper: decouples the master and slave nodes and stores the cluster's resource metadata. Because Storm keeps all of its metadata in ZooKeeper, Storm itself is stateless; Nimbus is needed only when a topology application is submitted.

Worker: a process that runs the logic of specific components (spouts and bolts). Workers transfer data to each other through Netty.

Task: each spout/bolt instance running in a worker is called a task. Multiple tasks of the same spout/bolt may share one physical thread, which is called an executor.

A graph composed of spouts and bolts is called a topology. When an upstream spout or bolt emits data to downstream bolts, it uses the default stream unless another stream is specified.

Storm offers several stream distribution (grouping) policies; the most commonly used are shuffle grouping and fields grouping.
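For illustration, wiring a small topology with both groupings might look like the sketch below; SentenceSpout, SplitBolt, and CountBolt are hypothetical user-defined components, not library classes:

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class WordCountTopology {
        public static void main(String[] args) throws Exception {
            // SentenceSpout, SplitBolt, CountBolt: hypothetical user-defined components.
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sentences", new SentenceSpout());
            // Shuffle grouping: tuples are distributed randomly and evenly
            // across the tasks of the "split" bolt.
            builder.setBolt("split", new SplitBolt()).shuffleGrouping("sentences");
            // Fields grouping: tuples with the same "word" value always go
            // to the same task of the "count" bolt.
            builder.setBolt("count", new CountBolt())
                   .fieldsGrouping("split", new Fields("word"));

            try (LocalCluster cluster = new LocalCluster()) {
                cluster.submitTopology("word-count", new Config(), builder.createTopology());
                Thread.sleep(30_000);  // let the local topology run briefly
            }
        }
    }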

The ack mechanism in Storm: put simply, Storm uses the Acker component to count acknowledgements and decide whether every tuple in a tuple tree has been confirmed; each tuple tree corresponds to one msgId.

Improving the degree of parallelism:

Increase the number of workers, increase the number of executors, or set the number of tasks; by default, one task runs in one thread (executor).
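All three knobs are set when the topology is built, as in this sketch (again using the hypothetical components from the previous sketch):

    import org.apache.storm.Config;
    import org.apache.storm.topology.TopologyBuilder;

    public class ParallelismSettings {
        public static void main(String[] args) {
            // 4 worker processes (JVMs) for this topology across the cluster.
            Config conf = new Config();
            conf.setNumWorkers(4);

            // SentenceSpout and SplitBolt are hypothetical user-defined components.
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sentences", new SentenceSpout(), 2);  // 2 executors (threads)
            builder.setBolt("split", new SplitBolt(), 4)            // 4 executors...
                   .setNumTasks(8)                                  // ...running 8 tasks: 2 per thread
                   .shuffleGrouping("sentences");
        }
    }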

Storm provides reliable message-processing guarantees:

Fully processing a tuple requires coordination among the Spout, the Bolts, and the Acker (the component Storm uses to record whether a tuple tree has been completely processed). The Spout emits a tuple downstream and registers it with the Acker. Each time a tuple in the tuple tree is processed successfully, the Acker is notified; once the entire tree has been processed, the Acker reports success back to the Spout. If any tuple fails or times out, the Acker sends a failure message to the Spout. Based on the Acker's response and the message-guarantee level the user has chosen, the Spout decides whether to re-emit the message.
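A minimal sketch of the bolt side of this contract, assuming the Storm 2.x API: each emitted tuple is anchored to the input tuple, and the input is explicitly acked or failed so that the Acker can track the tuple tree.

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    // A reliable bolt: every emitted tuple is anchored to the input tuple,
    // and the input is explicitly acked or failed so the Acker can track
    // the state of the tuple tree.
    public class ReliableSplitBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map<String, Object> topoConf, TopologyContext context,
                            OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            try {
                for (String word : input.getString(0).split(" ")) {
                    // Anchoring: pass `input` so failures downstream propagate
                    // back through the Acker to the spout.
                    collector.emit(input, new Values(word));
                }
                collector.ack(input);   // this bolt finished its part of the tree
            } catch (Exception e) {
                collector.fail(input);  // ask the Acker to report failure to the spout
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

On the spout side, Storm then invokes the spout's ack(msgId) or fail(msgId) callback, and the spout decides whether to re-emit the message.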
