Introduction to Spark Streaming

Spark Streaming is an extension of Spark's core API that enables high-throughput, fault-tolerant processing of real-time data streams.

Spark Streaming can ingest data from many sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets. Once data is received, it can be processed with high-level functions such as map, reduce, join, and window to express complex algorithms, and the results can be pushed out to file systems, databases, and live dashboards.
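As a rough illustration of the map/reduce style of per-batch processing described above, the following sketch implements a word count over one batch in plain Python. This is a conceptual analogy, not the Spark API: the steps mirror the flatMap, map, and reduceByKey pattern commonly used on DStreams.

```python
def word_count(batch_lines):
    """Count words in one batch of lines, mirroring the
    flatMap -> map -> reduceByKey pattern of a streaming word count."""
    # flatMap: split each line into individual words
    words = [w for line in batch_lines for w in line.split()]
    # map: pair each word with an initial count of 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: sum the counts per word
    counts = {}
    for w, n in pairs:
        counts[w] = counts.get(w, 0) + n
    return counts

batch = ["spark streaming", "spark core"]
print(word_count(batch))  # {'spark': 2, 'streaming': 1, 'core': 1}
```

In an actual Spark Streaming program, the same transformations would be applied to every batch of the stream automatically as new data arrives.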

Because Spark Streaming runs on the unified Spark platform, other Spark sub-frameworks, such as those for machine learning and graph computation, can also be applied to streaming data.
[Figure: schematic of the data flow processed by Spark Streaming]

Like the other Spark sub-frameworks, Spark Streaming is built on the Spark core. Internally, it receives live input data streams and divides them into batches according to a fixed time interval (for example, 1 second); the Spark engine then processes these batches in turn, producing a corresponding stream of batch results.
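The batching step can be made concrete with a small simulation. The sketch below (plain Python, not Spark internals) groups timestamped events into consecutive batches of a fixed interval, which is the essence of the micro-batch model:

```python
def split_into_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive batches of
    batch_interval seconds, analogous to Spark Streaming's batching."""
    batches = {}
    for ts, value in events:
        batch_index = int(ts // batch_interval)  # which interval the event falls into
        batches.setdefault(batch_index, []).append(value)
    # return batches in time order
    return [batches[i] for i in sorted(batches)]

events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
print(split_into_batches(events, 1.0))  # [['a', 'b'], ['c'], ['d']]
```

With a 1-second interval, events at 0.2s and 0.9s land in the first batch, the event at 1.1s in the second, and the event at 2.5s in the third, each batch then being processed as a unit.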
[Figure: Spark Streaming processing model]

Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.

DStreams can be created from input streams received from sources such as Kafka, Flume, and Kinesis, or derived by applying high-level operations to other DStreams.

Internally, a DStream is represented as a sequence of RDDs: each batch of data corresponds to one RDD instance in the Spark core, so the DStream for a data stream can be viewed as an ordered series of RDDs. In other words, after the streaming data is divided into batches, the batches pass through a first-in, first-out queue; the Spark engine takes batches off the queue in order, encapsulates each batch as an RDD, and processes it.
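The queue-driven engine loop can be sketched as follows. This is a simplified model in plain Python, where each batch stands in for an RDD and the transform function stands in for the Spark job applied to it:

```python
from collections import deque

def process_stream(batches, transform):
    """Model the engine loop: batches enter a FIFO queue and are
    taken out and processed one at a time, in arrival order."""
    queue = deque(batches)                # first-in, first-out queue of batches
    results = []
    while queue:
        batch = queue.popleft()           # take the oldest batch off the queue
        results.append(transform(batch))  # process the whole batch as one unit
    return results

print(process_stream([[1, 2], [3], [4, 5]], sum))  # [3, 3, 9]
```

The key property the model captures is that batches are processed strictly in order, each as a single unit, which is why the output is itself a sequence of per-batch results.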

The following describes some common Spark Streaming terms.

Discretized stream (DStream)
Spark Streaming's abstraction of a continuous real-time data stream: each data stream being processed corresponds to one DStream instance.
Time slice or batch interval (BatchInterval)
The time unit used to split the streaming data, typically 500 milliseconds or 1 second.
Batch data (BatchData)
The stream data contained in one time slice, represented as an RDD.
Window
A span of time; the system supports computations over the data falling within a window.
Window length
The duration of streaming data covered by a window; must be a multiple of the batch interval.
Sliding interval
The time elapsed from one window to the next; must also be a multiple of the batch interval.
Input DStream
A special DStream that represents the stream of raw input data received from a data source.
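The window-related terms above can be illustrated with a short sketch. Here both the window length and the sliding interval are expressed as counts of batch intervals, matching the multiple-of-batch-interval requirement; the code sums each full window of batch results (a simplified model, not the Spark windowing API):

```python
def windowed_sums(batch_results, window_len, slide):
    """Sum each window of `window_len` consecutive batches,
    advancing `slide` batches between windows. Both parameters are
    counts of batch intervals, so windows always align with batches."""
    out = []
    # emit a result each time a full window of batches is available
    for end in range(window_len, len(batch_results) + 1, slide):
        window = batch_results[end - window_len:end]
        out.append(sum(sum(b) for b in window))
    return out

batches = [[1], [2], [3], [4], [5]]
print(windowed_sums(batches, window_len=3, slide=2))  # [6, 12]
```

With a window length of 3 batches and a sliding interval of 2, the first window covers batches 1-3 (sum 6) and the next covers batches 3-5 (sum 12); batch 3 is shared because the windows overlap.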