Spark Streaming is an extension of Spark's core API that enables high-throughput, fault-tolerant processing of real-time data streams.
Spark Streaming can ingest data from many sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets. Once the data is ingested, you can apply high-level functions such as map, reduce, join, and window to express complex algorithms, and finally push the results to file systems, databases, or live dashboards.
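As a minimal sketch of such a pipeline, assuming a text server is listening on localhost:9999 (the host, port, and application name below are illustrative), a classic streaming word count in Spark's Scala API looks like this:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object NetworkWordCount {
      def main(args: Array[String]): Unit = {
        // Two local threads: one to receive data, one to process it.
        val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
        // Split the incoming stream into 1-second batches.
        val ssc = new StreamingContext(conf, Seconds(1))

        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.print() // the results could instead be written to a file system or database

        ssc.start()
        ssc.awaitTermination()
      }
    }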
Because Spark Streaming runs on the unified Spark engine, the other Spark sub-frameworks, such as MLlib for machine learning and GraphX for graph computation, can also be applied to streaming data.
[Figure: Schematic of a data stream processed by Spark Streaming]
Like the other Spark sub-frameworks, Spark Streaming is built on the Spark core. Internally, Spark Streaming receives a live input data stream and splits it into batches according to a fixed time interval (for example, 1 second); the Spark engine then processes each batch in turn and finally produces batches of processed results.
[Figure: Spark Streaming principle diagram]
Spark Streaming provides a high-level abstraction called a Discretized Stream, or DStream, which represents a continuous stream of data.
DStreams can be created from input data streams obtained from sources such as Kafka, Flume, and Kinesis, or derived from other DStreams through high-level operations.
Internally, a DStream is composed of a series of RDDs: each batch of data corresponds to one RDD instance in the Spark core, so the DStream for a data stream can be viewed as an ordered sequence of RDDs. In other words, once the streaming data has been divided into batches, the batches pass through a first-in, first-out queue; the Spark engine takes the batches off the queue in order, encapsulates each batch as an RDD, and processes it.
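This batch-as-RDD correspondence is visible in the API: foreachRDD hands you the RDD behind each batch, together with its batch time. A small sketch, reusing a socket source as in the earlier example (host and port are again illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchAsRDD")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each 1-second batch arrives as exactly one RDD, tagged with its batch time.
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()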
The following describes some common Spark Streaming terms.
Discretized Stream (DStream)
Spark Streaming's abstraction of a continuous real-time data stream: each real-time data stream being processed corresponds to one DStream instance inside Spark Streaming.
Time slice or batch processing interval (BatchInterval)
The unit of time by which the streaming data is split into batches, typically 500 milliseconds or 1 second.
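The interval is fixed when the StreamingContext is created. For instance, a 500-millisecond interval (the master URL and application name below are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Milliseconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("ShortBatches")
    // Every incoming record is assigned to a 500 ms batch.
    val ssc = new StreamingContext(conf, Milliseconds(500))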
Batch data (BatchData)
The stream data contained in a time slice is represented as an RDD.
Window
A period of time; the system supports computations over all the stream data that falls within a window.
Window Length
The duration of streaming data covered by a window; it must be a multiple of the batch processing interval.
Sliding Interval
The length of time that elapses from one window to the next; it must also be a multiple of the batch processing interval (see the sketch after this definition).
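A minimal windowed word count illustrating both terms, assuming a 10-second batch interval, a 30-second window length, and a 10-second sliding interval (source host and port are again illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
    val ssc = new StreamingContext(conf, Seconds(10)) // batch interval: 10 s
    val lines = ssc.socketTextStream("localhost", 9999)

    // Window length 30 s, sliding interval 10 s: both are multiples of the batch interval.
    val windowedCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()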
Input DStream
An Input DStream is a special DStream that connects Spark Streaming to an external data source and represents the raw input data received from it.
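For example, the core API can create input DStreams from a TCP socket or from new files appearing in a monitored directory (the port and the directory path below are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("InputDStreams")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Receiver-based input DStream: text lines read from a TCP socket.
    val socketStream = ssc.socketTextStream("localhost", 9999)

    // File-based input DStream: new files dropped into the monitored directory.
    val fileStream = ssc.textFileStream("/tmp/streaming-input")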