Introduction and Principle of Spark Streaming

Source: Internet
Author: User
Keywords: spark, spark streaming, spark streaming introduction
Introduction:
Spark Streaming is a framework for stream processing.
Spark Streaming is an extension of the Spark core API that enables high-throughput, fault-tolerant processing of real-time streaming data.
It supports obtaining data from multiple sources:

Spark Streaming receives real-time input data from sources such as Kafka, Flume, and HDFS. After processing, the results are stored in systems such as HDFS and databases.
Dashboards: graphical monitoring interfaces; Spark Streaming output can also be fed to front-end monitoring pages.
*The most used combination is Kafka + Spark Streaming.
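As a rough illustration of the Kafka + Spark Streaming combination, the sketch below creates a direct Kafka stream with the spark-streaming-kafka-0-10 integration. It assumes a StreamingContext `ssc` has already been created (see the next section); the broker address, topic name, and group id are placeholder values.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Placeholder consumer settings; adjust for a real cluster.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-streaming-demo",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// Subscribe to one topic; each batch of the resulting DStream holds the
// records pulled from Kafka during that batch interval.
val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Array("events"), kafkaParams)
)
kafkaStream.map(_.value).print()
```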
The relationship between Spark Streaming and Spark Core:

Spark processes data in batches (offline data). Spark Streaming does not process records one at a time the way Storm does; instead, the incoming data stream is divided into batches by time, and the processing logic is then the same as in Spark.
Spark Streaming splits the received real-time stream at a fixed time interval, hands each batch to the Spark engine, and produces results batch by batch.
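A minimal sketch of this micro-batching setup, assuming Spark Streaming is on the classpath and using a socket source and a 10-second batch interval purely for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MicroBatchSketch").setMaster("local[2]")

    // The batch interval (10 seconds here) is the time slice used to cut
    // the incoming stream into batches handed to the Spark engine.
    val ssc = new StreamingContext(conf, Seconds(10))

    // Receive lines of text over a socket; any supported source works the same way.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()           // print a sample of each batch

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()  // run until stopped
  }
}
```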

DStream: Spark Streaming provides a high-level abstraction called a DStream (discretized stream), which represents a continuous stream of data.

If external data keeps flowing in and is sliced into one-minute intervals, the data inside each one-minute slice is continuous (a continuous data stream), while the one-minute slices are independent of each other (a discretized stream).

DStream is a unique data type of Spark Streaming.

A DStream can be seen as a set of RDDs, i.e., a sequence of RDDs.

Spark's RDDs can be understood as a spatial dimension; the RDDs of a DStream add a time dimension on top of that spatial dimension.

For example, if the data stream is divided into four slices, the processing logic inside each slice is the same; only the time dimension differs.

The difference between Spark and Spark Streaming:

Spark -> RDD: transformation and action operators + the RDD DAG

Spark Streaming -> DStream: transformation and output operators (data cannot simply stop in the middle of the pipeline; every input must lead to an output) + DStreamGraph

Any operation on a DStream is translated into operations on the underlying RDDs (through its operators):
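For example, a minimal word-count sketch, reusing the StreamingContext `ssc` from the earlier example; each transformation on the DStream is applied, batch by batch, to the RDD of that batch:

```scala
// flatMap/map/reduceByKey on the DStream become flatMap/map/reduceByKey
// on one RDD per batch interval.
val lines  = ssc.socketTextStream("localhost", 9999)
val words  = lines.flatMap(_.split(" "))
val pairs  = words.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)
counts.print()
```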

Summary: persist the continuous data, discretize it, and then process it in batches.

Persistence: The received data is temporarily stored.

Why persist:

For fault tolerance: if something goes wrong in the data stream, the data has not yet been computed and would have to be traced back to the source; the temporarily stored data allows it to be recovered instead.

Discretization: slice the data by time to form processing units.

Slice processing: batch processing.

Transformation operators:

Operators such as reduce and count do not by themselves trigger DStream computation; an output operator is required.

Output operators (output execution operators):

• print: prints a sample of each batch of data on the driver

• saveAsObjectFiles, saveAsTextFiles, saveAsHadoopFiles: write each batch of data to the Hadoop file system, in directories named with the start timestamp of the batch

• foreachRDD: allows users to run arbitrary operations on the RDD underlying each batch of DStream data
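A hedged sketch of these output operators, reusing the `counts` DStream from the word-count example above (paths and connection logic are placeholders):

```scala
// print: show the first few elements of each batch on the driver.
counts.print()

// saveAsTextFiles: each batch is written to a directory whose name is
// built from this prefix/suffix plus the batch's start timestamp.
counts.saveAsTextFiles("hdfs:///tmp/wordcounts", "txt")

// foreachRDD: arbitrary per-batch logic on the underlying RDD,
// e.g. writing the batch to an external store.
counts.foreachRDD { (rdd, time) =>
  rdd.foreachPartition { partition =>
    // In a real job, open one connection per partition here.
    partition.foreach { case (word, count) => println(s"$time $word=$count") }
  }
}
```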

DStream Graph:

An abstraction over a series of transformation operations.

The dependencies formed by transformations between DStreams are all stored in the DStreamGraph; the DStreamGraph is what later generates the RDD Graph.

Spark Streaming is a framework, and writing Spark Streaming code is really writing against that framework.

Framework: first perform a unified analysis of the entire data computation process, all the way to the output.

In traditional Spark development the RDDs hold immutable, fixed data, whereas in Spark Streaming the data keeps arriving and changing.

Hence the need for DStream and DStreamGraph.

When the framework runs tasks, it actually executes Spark jobs, and Spark tasks only understand RDDs.

So a DStream can be viewed as a template for RDDs, and the DStreamGraph as a template for the RDD DAG.

Writing code therefore means writing the DStream and DStreamGraph templates.

Spark Streaming architecture:

Master: records the dependency (lineage) relationships between DStreams and is responsible for task scheduling to generate new RDDs
Worker: receives data from the network, stores it, and executes RDD computations

Client: responsible for feeding data into Spark Streaming


Scheduling: Triggered according to time.

Master: maintains the DStreamGraph (not at the node level, but at the task level).

Worker: executes according to that graph.

There is an important role on the Worker: the receiver, which receives the external data stream; the data enters Spark Streaming through the receiver (the receiver ultimately packages the stream into a format Spark Streaming can handle).

receiver: receivers are specific to each kind of data source, and Spark Streaming provides different receivers distributed across different nodes; each receiver is a dedicated process, and each node receives part of the input. A receiver does not compute immediately after accepting data; it first stores it in an internal buffer, because Spark Streaming slices the stream continuously by time and has to wait. Once the timer fires, the buffer converts the accumulated data into a data block (the buffer's role is to cut the stream at the user-defined interval), the block is placed in a queue, and the Block Manager takes blocks from the queue and converts them into blocks that Spark can process.

Why is it a process?

container -> Executor, so the receiver runs as a process.
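A small configuration sketch of the buffering behaviour described above; `spark.streaming.blockInterval` (200 ms by default) controls how often the receiver's buffer is turned into a block, and the source and values here are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ReceiverBlockSketch")
  // Every 200 ms the receiver's buffer is cut into a block and handed
  // to the BlockManager; smaller values mean more blocks (and tasks) per batch.
  .set("spark.streaming.blockInterval", "200ms")

val ssc = new StreamingContext(conf, Seconds(10))

// socketTextStream starts a receiver on one executor; the receiver
// buffers records until the block interval expires.
val stream = ssc.socketTextStream("localhost", 9999)
```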

Spark Streaming job submission:

• Network Input Tracker: tracks the data received by each network receiver and maps it to the corresponding input DStream

• Job Scheduler: periodically visits the DStreamGraph, generates Spark jobs, and hands them to the Job Manager for execution

• Job Manager: takes jobs from the job queue and executes the Spark tasks

Spark Streaming window operation:

• Spark provides a set of window operations that use a sliding-window technique to perform statistical analysis over incrementally updated large-scale data

• Window operation: regular processing of the data that falls within a certain time period

• Any window-based operation needs to specify two parameters (see the sketch below):

– Window length: how far back in time the data to be computed extends

– Slide interval: how often the computation is updated
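A hedged sketch, reusing the `pairs` DStream from the word-count example; the window length and slide interval are illustrative values and must both be multiples of the batch interval:

```scala
import org.apache.spark.streaming.Seconds

// Every 20 seconds (slide interval), compute word counts over the
// last 60 seconds of data (window length).
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // reduce function applied within the window
  Seconds(60),                // window length
  Seconds(20)                 // slide interval
)
windowedCounts.print()
```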

Spark Streaming fault tolerance:

• A real-time stream processing system must run 24/7 and be able to recover from all kinds of system errors. Spark Streaming was designed from the start to support error recovery for both driver and worker nodes (Spark Streaming has only two kinds of nodes: driver -> AM, worker -> NM)

• Worker fault tolerance: Spark and its RDDs guarantee the fault tolerance of worker nodes. Spark Streaming is built on Spark, so its worker nodes use the same fault-tolerance mechanism

• Driver fault tolerance: relies on the WAL (write-ahead log) for persistence (HBase also uses a WAL)
– The following configuration is required to enable the WAL:
– 1: Set a checkpoint directory on the StreamingContext; the directory must be on a Hadoop-supported file system and is used to save the WAL and the streaming checkpoints
– 2: Set spark.streaming.receiver.writeAheadLog.enable to true
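A minimal sketch of these two settings; the checkpoint path is a placeholder and must point to a Hadoop-compatible file system:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WalSketch")
  // Step 2: enable the receiver write-ahead log.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))

// Step 1: checkpoint directory on a Hadoop-supported file system,
// used to store the WAL and the streaming checkpoints (placeholder path).
ssc.checkpoint("hdfs:///user/spark/streaming-checkpoints")
```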

Why the WAL was introduced: it guarantees that data received from any reliable data source will not be lost in the event of a failure.

For example, if the data source does not support transactions, relying on the source to resend data is unreliable, so the WAL helps avoid data loss.