Spark Streaming Technical Point Rollup


Spark Streaming supports scalable, high-throughput, fault-tolerant stream processing of live data streams.

Architecture diagram
Features are as follows:
• Scales linearly to hundreds of nodes;
• Achieves sub-second processing latency;
• Integrates seamlessly with Spark batch processing and interactive processing;
• Provides a simple API for implementing complex algorithms;
• Supports many streaming data sources, including Kafka, Flume, Kinesis, Twitter, ZeroMQ, and more.
001. Principle
After receiving a live input data stream, Spark Streaming divides the data into batches, which are then processed by the Spark engine to generate the final stream of results, also produced in batches.

002. API
DStream:
DStream (discretized stream) is the high-level abstraction Spark Streaming provides for a continuous stream of data.
Composition: a DStream can be regarded as a sequence of RDDs.
Core idea: treat the computation as a series of small, stateless, deterministic batch jobs over short time intervals; the input data received in each interval is reliably stored in the cluster as an input dataset.

Features: a high-level functional programming API, strong consistency, and efficient failure recovery.
Application Templates:
Template 1
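A minimal sketch of the basic application template (assuming Scala and the standard Spark Streaming API; the app name and master are illustrative): create a StreamingContext from a SparkConf with a batch interval, define the DStream logic, then start and await termination.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Template 1 (sketch): build a StreamingContext directly from a SparkConf.
val conf = new SparkConf().setAppName("StreamingApp").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))   // 1-second batch interval

// ... define input DStreams and transformations here ...

ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // block until the job is stopped or fails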

Template 2
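Another common template (a sketch, assuming the application already holds a SparkContext) builds the StreamingContext on top of that existing context:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Template 2 (sketch): reuse an existing SparkContext.
val sc = new SparkContext(new SparkConf().setAppName("StreamingApp"))
val ssc = new StreamingContext(sc, Seconds(1))

// ... define input DStreams and transformations here ...

ssc.start()
ssc.awaitTermination()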

WordCount Example
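A WordCount sketch in the spirit of the official example (assuming a socket source on localhost:9999, which is an illustrative choice):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

// Count words arriving on a TCP socket in each 1-second batch.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()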

Input DStream:
An Input DStream is a DStream representing the stream of raw data received from a streaming source. Sources are divided into basic input sources (file systems, sockets, Akka actors, custom data sources) and advanced input sources (Kafka, Flume, and so on).
Receiver:
Each Input DStream (except the file stream) corresponds to a single Receiver object, which receives data from the data source and stores it in Spark memory for processing. Multiple Input DStreams can be created in one application to receive several data streams in parallel.
Each Receiver is a long-running task on a worker/Executor, so it occupies one core of the application. If the number of cores assigned to the Spark Streaming application is less than or equal to the number of Input DStreams (that is, the number of Receivers), the application can receive data but has no cores left to process it (file streams excepted, since they require no Receiver).
Spark Streaming encapsulates a variety of data sources; refer to the official documentation as needed.
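For reference, a minimal sketch of creating one basic input source of each kind (assuming the ssc StreamingContext from the templates above; host, port, and path are illustrative):

// Socket source: uses a Receiver, so it occupies one core of the application.
val socketStream = ssc.socketTextStream("localhost", 9999)

// File source: monitors a directory for new files; no Receiver is needed.
val fileStream = ssc.textFileStream("hdfs://namenode:8020/user/streaming/input")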
Transformation operation
Common transformation

updateStateByKey(func)
updateStateByKey maintains per-key state across batches: the data in each batch is reduced by key and then merged into the accumulated state.
WordCount version of updateStateByKey
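A minimal sketch of the stateful WordCount (assuming wordCounts is the (word, count) pair DStream from the WordCount example above; the checkpoint path is illustrative). The update function merges each batch's counts into the running total:

ssc.checkpoint("hdfs://namenode:8020/user/streaming/checkpoint")  // required for stateful operations

// Merge this batch's counts for a key into the previously stored running count.
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  Some(newValues.sum + runningCount.getOrElse(0))
}

val runningCounts = wordCounts.updateStateByKey[Int](updateFunction _)
runningCounts.print()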

transform(func)
Creates a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream.
Official Document code example
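In the spirit of the official example, a sketch that uses transform to apply arbitrary RDD operations to each batch, here removing words found in a hypothetical blacklist RDD (the path and names are illustrative):

// Hypothetical blacklist of words to drop, loaded once as an RDD of (word, true).
val blacklistRDD = ssc.sparkContext.textFile("hdfs://namenode:8020/user/streaming/blacklist")
  .map(word => (word, true))

// transform exposes each batch as an RDD, so arbitrary RDD operations apply.
val cleanedDStream = wordCounts.transform { rdd =>
  rdd.subtractByKey(blacklistRDD)   // drop all blacklisted words from this batch
}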

Window operations
Window operations: transformations computed over a sliding window of data (personally, I find this similar to Storm's tick mechanism, but more powerful).
Parameters: the window length and the sliding interval must both be multiples of the source DStream's batch interval.
Example: with a window length of 3 and a sliding interval of 2, the upper line is the original DStream and the lower line is the windowed DStream.

Common window operation

Official Document code example
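A sketch of the commonly cited windowed WordCount (assuming the wordCounts pair DStream and the 1-second batch interval from above, so both durations are multiples of it):

// Count words over the last 30 seconds of data, recomputed every 10 seconds.
val windowedWordCounts = wordCounts.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // associative reduce function
  Seconds(30),                 // window length
  Seconds(10))                 // sliding interval
windowedWordCounts.print()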

join(otherStream, [numTasks])
Joins two data streams.
Official Document code Example 1
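A sketch of a stream-to-stream join (stream1 and stream2 are assumed key-value DStreams with matching key types):

// Each batch of stream1 is joined with the corresponding batch of stream2.
val joinedStream = stream1.join(stream2)

// Joins also work over windows of the two streams.
val windowedJoin = stream1.window(Seconds(20)).join(stream2.window(Seconds(60)))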

Official Document code Example 2
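A sketch of joining a stream with a static dataset through transform (dataset is an assumed, pre-loaded pair RDD):

// Join each batch RDD against a static (or periodically refreshed) dataset RDD.
val joinedWithDataset = stream1.transform { rdd => rdd.join(dataset) }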

Output operation
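A sketch of the most general output operation, foreachRDD, which runs a user function on every batch RDD (the println below is a placeholder for writing to an external system):

wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One unit of work per partition (cheaper than per record); placeholder logic.
    partition.foreach { case (word, count) =>
      println(s"$word -> $count")   // replace with a write to an external store
    }
  }
}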

Caching and Persistence:
Each RDD in a DStream can be kept in memory by calling persist().
Window operations are automatically persisted in memory, with no need to call persist() explicitly.
For data streams received over the network (such as Kafka, Flume, sockets, ZeroMQ, RocketMQ, and so on), the default persistence level stores serialized data on two nodes for fault tolerance.
Checkpoint:
Purpose: enable Spark failure recovery by checkpointing to fault-tolerant storage systems such as HDFS or S3.
Classification:
Metadata checkpoint: saves the streaming computation information needed to recover the Driver node, including the application configuration, the DStream operations defined by the application, and batches that are queued but not yet finished.
Data checkpoint: saves the generated RDDs. Because stateful transformations combine data from multiple batches, the resulting RDDs depend on the RDDs of earlier batches (a dependency chain); to shorten this chain and thus reduce recovery time, intermediate RDDs are periodically saved to reliable storage (such as HDFS).
When to use:
Stateful transformations: updateStateByKey() and window operations.
Applications that require Driver recovery.
003. How to use
Stateful transformation

Applications requiring Driver recovery (WordCount example): if the checkpoint directory exists, the StreamingContext is recreated from the checkpoint data; otherwise (e.g., on the first run) a new StreamingContext is created.
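A sketch of that pattern (assuming the socket WordCount from above; StreamingContext.getOrCreate rebuilds the context from the checkpoint directory if it exists, otherwise it calls the supplied factory function; the directory is illustrative):

val checkpointDir = "hdfs://namenode:8020/user/streaming/checkpoint"

// Factory used only when no checkpoint exists yet (e.g. the first run).
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverableWordCount")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  val lines = ssc.socketTextStream("localhost", 9999)
  lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
  ssc
}

// Recreate from checkpoint data if present, otherwise build a fresh context.
val context = StreamingContext.getOrCreate(checkpointDir, createContext _)
context.start()
context.awaitTermination()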

Checkpoint time interval
Method:
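For example, a sketch that checkpoints a stateful DStream every 10 seconds (10x the 1-second batch interval; runningCounts is the stateful DStream from the updateStateByKey example above):

// Checkpoint the stateful DStream every 10 seconds instead of every batch.
runningCounts.checkpoint(Seconds(10))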

Principle: Generally set to 5-10 times the sliding time interval.
Analysis: checkpointing adds storage overhead and increases batch processing time. When the batch interval is small (for example, 1 second), checkpointing every batch may reduce operation throughput; conversely, an overly large checkpoint interval causes the lineage and the number of tasks to grow.
004. Performance Tuning
Reduce batch processing time:
Data receive parallelism
Add more DStreams: when network data (e.g., from Kafka, Flume, or sockets) is received, it is deserialized and stored in Spark. Since each DStream has only one Receiver object, if receiving becomes a bottleneck, consider creating additional DStreams, as sketched below.
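A sketch of this approach (assuming the ssc context from above; the stream count and the socket source are illustrative stand-ins for whatever source is actually used):

// Create several receivers (one per DStream) and union them into a single DStream.
val numStreams = 3
val rawStreams = (1 to numStreams).map(_ => ssc.socketTextStream("localhost", 9999))
val unifiedStream = ssc.union(rawStreams)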

Set the "spark.streaming.blockInterval" parameter: The received data is stored in the spark memory before it is merged into a block, and the number of blocks determines the number of tasks, for example, when the batch time interval is 2 seconds and the block When the interval is 200 milliseconds, the number of tasks is approximately 10, and if the number of tasks is too low, the CPU resources are wasted, and the recommended minimum block interval is 50 milliseconds.
Explicitly repartition the input DStream: repartition the incoming data before heavier processing, as in the sketch below.
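For example (the partition count is illustrative):

// Spread received blocks across more partitions before the expensive stages.
val repartitioned = unifiedStream.repartition(16)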

Data processing parallelism: the parallelism of operations such as reduceByKey and reduceByKeyAndWindow can be controlled by setting the "spark.default.parallelism" parameter or by passing the degree of parallelism explicitly as a method argument.
Data serialization: More efficient Kryo serialization can be configured.
Set reasonable batch time intervals
Principle: data must be processed at least as fast as it arrives; in other words, the batch processing time should be less than or equal to the batch interval.
Method:
First set the batch interval to 5-10 seconds to slow the rate of data input;
then check the "Total delay" value in the log4j logs and adjust the batch interval until the total delay stays below the batch interval.
Memory tuning
Persistence level: enable compression via the "spark.rdd.compress" parameter.
GC policy: enable the CMS garbage collector on both the Driver and the Executors.
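A sketch that collects the tuning parameters mentioned in this section in one SparkConf (the values are illustrative, not recommendations):

val conf = new SparkConf()
  .setAppName("TunedStreamingApp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo serialization
  .set("spark.rdd.compress", "true")                                     // compress persisted RDDs
  .set("spark.streaming.blockInterval", "200ms")                         // block interval for receivers
  .set("spark.default.parallelism", "16")                                // default shuffle parallelism
  // CMS garbage collector on the executors (use --driver-java-options for the driver side).
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")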
