Introduction to Spark Streaming and Storm


Spark Streaming is part of the Spark ecosystem technology stack and integrates seamlessly with Spark Core and Spark SQL; Storm, by comparison, is a relatively simple standalone stream-processing framework.

(1) Overview

  • Spark Streaming

Spark Streaming is an extension of Spark's core API that processes real-time data streams with high throughput and built-in fault tolerance. Data can be ingested from many sources, including Kafka, Flume, Twitter, ZeroMQ, and raw TCP sockets, and processed with high-level functions such as map, reduce, join, and window to express complex algorithms. The results can then be written out to file systems and databases, and the other Spark sub-frameworks can also be applied to the streaming data.

Internally, Spark Streaming collects the real-time input stream, splits it into batches at a fixed interval, and processes each batch in turn, producing a batch of results. Each batch of data corresponds to an RDD instance in the Spark engine, so a DStream can be viewed as a sequence of RDDs.
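To make the micro-batch model concrete, here is a minimal word-count sketch in Java. The socket source on localhost:9999 and the 5-second batch interval are placeholder choices for illustration, not values from the original article.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class StreamingWordCount {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount");
            // Batch interval of 5 seconds: the input stream is cut into
            // 5-second batches, and each batch is processed as one RDD.
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

            // Placeholder source (e.g. feed it with `nc -lk 9999`).
            JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
            JavaDStream<String> words =
                lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
            JavaPairDStream<String, Integer> counts =
                words.mapToPair(w -> new Tuple2<>(w, 1)).reduceByKey(Integer::sum);
            counts.print();

            ssc.start();
            ssc.awaitTermination();
        }
    }

Each print() call here shows the word counts for one 5-second batch, that is, for one underlying RDD of the DStream.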

Execution process (worker mode):

Improving the degree of parallelism: the executor splits the received data into blocks every 200 ms by default; this block interval is controlled by spark.streaming.blockInterval and can be lowered to produce more partitions per batch.

Alternatively, enable multiple workers (receivers) to receive data in parallel and union the resulting streams; both knobs are sketched below.
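In this sketch, the 100 ms block interval, the three receivers, and the socket source are illustrative placeholders, not tuning recommendations:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class ReceiverParallelism {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf()
                // local[4]: three cores for the receivers, one for processing.
                .setMaster("local[4]")
                .setAppName("ReceiverParallelism")
                // Cut blocks every 100 ms instead of the 200 ms default,
                // yielding more partitions (and thus more tasks) per batch.
                .set("spark.streaming.blockInterval", "100ms");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(2));

            // Run several receivers in parallel and union their streams.
            List<JavaDStream<String>> streams = new ArrayList<>();
            for (int i = 0; i < 3; i++) {
                streams.add(ssc.socketTextStream("localhost", 9999));  // placeholder source
            }
            JavaDStream<String> merged = streams.get(0)
                .union(streams.get(1))
                .union(streams.get(2));
            merged.count().print();

            ssc.start();
            ssc.awaitTermination();
        }
    }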

To increase the degree of parallelism in Direct mode, you only need to increase the number of Kafka partitions, since each Kafka partition maps to one partition of the resulting RDDs. In Direct mode, the consumer offsets are managed by Spark itself and stored in the checkpoint directory.
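A minimal Direct-mode sketch using the spark-streaming-kafka-0-10 integration; the broker address, topic name, group id, and checkpoint path are placeholders:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class DirectKafkaExample {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("DirectKafka");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));
            // Offsets are managed by Spark itself; checkpointing persists them
            // (and the DStream lineage) so the job can recover after a failure.
            ssc.checkpoint("/tmp/spark-checkpoint");  // placeholder path

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "localhost:9092");  // placeholder broker
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "demo-group");
            kafkaParams.put("auto.offset.reset", "latest");
            kafkaParams.put("enable.auto.commit", false);

            // Each Kafka partition maps to one RDD partition, so adding topic
            // partitions directly increases processing parallelism.
            JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                    ssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(
                        Arrays.asList("demo-topic"), kafkaParams));

            stream.map(ConsumerRecord::value).print();

            ssc.start();
            ssc.awaitTermination();
        }
    }

Because the offsets live in the checkpoint, restarting the application from the same checkpoint directory resumes from the last processed offsets.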

  • Storm

Storm adopts a master/slave architecture:

Nimbus: runs on the cluster's master node and is responsible for assigning and distributing tasks.

Supervisor: runs on the cluster's slave nodes and executes the tasks assigned to them.

Zookeeper: decouples the master and slave nodes and stores the cluster's resource metadata. Because Storm keeps all of its metadata in ZooKeeper, Storm itself is stateless; Nimbus is needed only when a topology application is submitted.

Worker: a process that runs the logic of specific components (spouts and bolts). Workers transfer data to each other through Netty.

Task: each spout/bolt instance running in a worker is called a task. Multiple tasks of the same spout/bolt may share one physical thread, which is called an executor.

A graph composed of spouts and bolts is called a topology. When an upstream spout or bolt emits data to downstream bolts, it uses the default stream unless another stream is specified.

Storm offers several stream distribution (grouping) policies; the most commonly used are shuffle grouping and fields grouping.
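For illustration, wiring a small topology with both groupings might look like the sketch below; SentenceSpout, SplitBolt, and CountBolt are hypothetical user-defined components, not library classes:

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class WordCountTopology {
        public static void main(String[] args) throws Exception {
            // SentenceSpout, SplitBolt, CountBolt: hypothetical user-defined components.
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sentences", new SentenceSpout());
            // Shuffle grouping: tuples are distributed randomly and evenly
            // across the tasks of the "split" bolt.
            builder.setBolt("split", new SplitBolt()).shuffleGrouping("sentences");
            // Fields grouping: tuples with the same "word" value always go
            // to the same task of the "count" bolt.
            builder.setBolt("count", new CountBolt())
                   .fieldsGrouping("split", new Fields("word"));

            try (LocalCluster cluster = new LocalCluster()) {
                cluster.submitTopology("word-count", new Config(), builder.createTopology());
                Thread.sleep(30_000);  // let the local topology run briefly
            }
        }
    }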

The ack mechanism in Storm: put simply, Storm uses the Acker component to count acknowledgements and decide whether every tuple in a tuple tree has been confirmed; each tuple tree corresponds to one msgId.

Improving the degree of parallelism:

Increase the number of workers, increase the number of executors, or set the number of tasks; by default, one task runs in one thread (executor).
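All three knobs are set when the topology is built, as in this sketch (again using the hypothetical components from the previous sketch):

    import org.apache.storm.Config;
    import org.apache.storm.topology.TopologyBuilder;

    public class ParallelismSettings {
        public static void main(String[] args) {
            // 4 worker processes (JVMs) for this topology across the cluster.
            Config conf = new Config();
            conf.setNumWorkers(4);

            // SentenceSpout and SplitBolt are hypothetical user-defined components.
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sentences", new SentenceSpout(), 2);  // 2 executors (threads)
            builder.setBolt("split", new SplitBolt(), 4)            // 4 executors...
                   .setNumTasks(8)                                  // ...running 8 tasks: 2 per thread
                   .shuffleGrouping("sentences");
        }
    }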

Storm provides reliable message-processing guarantees:

Fully processing a tuple requires coordination among the Spout, the Bolts, and the Acker (the component Storm uses to record whether a tuple tree has been completely processed). The Spout emits a tuple downstream and registers it with the Acker. Each time a tuple in the tuple tree is processed successfully, the Acker is notified; once the entire tree has been processed, the Acker reports success back to the Spout. If any tuple fails or times out, the Acker sends a failure message to the Spout. Based on the Acker's response and the message-guarantee level the user has chosen, the Spout decides whether to re-emit the message.
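A minimal sketch of the bolt side of this contract, assuming the Storm 2.x API: each emitted tuple is anchored to the input tuple, and the input is explicitly acked or failed so that the Acker can track the tuple tree.

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    // A reliable bolt: every emitted tuple is anchored to the input tuple,
    // and the input is explicitly acked or failed so the Acker can track
    // the state of the tuple tree.
    public class ReliableSplitBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map<String, Object> topoConf, TopologyContext context,
                            OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            try {
                for (String word : input.getString(0).split(" ")) {
                    // Anchoring: pass `input` so failures downstream propagate
                    // back through the Acker to the spout.
                    collector.emit(input, new Values(word));
                }
                collector.ack(input);   // this bolt finished its part of the tree
            } catch (Exception e) {
                collector.fail(input);  // ask the Acker to report failure to the spout
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

On the spout side, Storm then invokes the spout's ack(msgId) or fail(msgId) callback, and the spout decides whether to re-emit the message.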
