Spark Streaming Architecture

Spark Streaming architecture: discretized stream processing
Instead of processing one record at a time as traditional stream processors do, Spark Streaming discretizes the incoming data into micro-batches that can be processed at sub-second granularity. Spark Streaming's Receivers accept data in parallel and buffer it in the memory of the Spark worker nodes. The latency-optimized Spark engine then runs each batch as a set of short tasks (tens of milliseconds) and outputs the results to other systems. Notably, this differs from the traditional continuous-operator model, in which computation is statically assigned to a node: Spark tasks can be assigned dynamically to worker nodes based on the data's location and the available resources. This enables the two properties we describe next: load balancing and fast failure recovery.
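To make the micro-batch model concrete, here is a minimal sketch of a Spark Streaming program; the socket source, host, and port are assumptions for illustration. The stream is discretized into one-second batches, and each batch is processed as a short word-count job:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicroBatchWordCount")
// Discretize the incoming stream into batches of one second
val ssc = new StreamingContext(conf, Seconds(1))

// A receiver collects data in parallel and buffers it on the workers
val lines = ssc.socketTextStream("localhost", 9999)

// Each batch is executed as a short Spark job
val wordCounts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()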

In addition, each batch of data is treated as a Resilient Distributed Dataset (RDD), Spark's basic abstraction for a fault-tolerant dataset. This is why streaming data can be processed with any Spark code or library.
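For instance, here is a brief sketch (assuming a DStream of text lines named lines) showing that every micro-batch arrives as an ordinary RDD, to which regular Spark operations apply unchanged:

// Each micro-batch is handed to this function as a regular RDD,
// so any Spark operation or library can be applied to it
lines.foreachRDD { rdd =>
  val topWords = rdd.flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .top(10)(Ordering.by(_._2)) // an ordinary RDD action
  topWords.foreach(println)
}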



Advantages of discretized stream data processing
Let's look at how this architecture lets Spark Streaming achieve the goals we set out above.

Dynamic load balancing

Dividing the data into small batches allows the Spark scheduler to allocate resources at fine granularity. For example, consider an input data stream that must be partitioned by a key. In traditional systems that statically assign tasks to nodes, if one partition is more computationally intensive than the others, the node processing it becomes a bottleneck and slows down the whole pipeline. In Spark Streaming, tasks are distributed to nodes dynamically: nodes handling the longer-running tasks receive fewer of them, while the other nodes pick up a larger number of shorter tasks.



Fast failure recovery mechanism

When a node fails, traditional systems restart the failed continuous operator on another node and replay part of the data stream to recompute the lost state. Note that during this process only a single node handles the recomputation, and the pipeline cannot resume until the new node has been restored to the pre-failure state. In Spark, the computation is split into many small tasks that can run anywhere without affecting the correctness of the combined result. Failed tasks can therefore be relaunched in parallel on the other nodes of the cluster, so the recomputation is spread evenly across many nodes and recovery is much faster than in the traditional approach.
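This parallel recomputation is automatic, driven by RDD lineage. What users typically configure is checkpointing, which persists stream metadata and truncates long lineage chains so that the amount of recomputation after a failure stays bounded. A minimal sketch follows; the checkpoint directory path is a hypothetical example:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://namenode:8020/checkpoints/app" // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverableApp")
  val ssc = new StreamingContext(conf, Seconds(1))
  // Persist metadata and cut long RDD lineage chains, bounding
  // the work needed to recover after a failure
  ssc.checkpoint(checkpointDir)
  ssc
}

// On driver restart, rebuild the context from checkpoint data if present
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)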




Integration of batch processing, stream processing and interactive analysis

The discretized stream (DStream) is the key programming abstraction in Spark Streaming. Internally, a DStream is represented by a time-ordered sequence of RDDs, each containing the data that arrived within one batch interval. This common representation allows batch and stream processing to interoperate seamlessly, so users can apply regular Spark operations to each batch of streaming data, for example joining a DStream with a pre-created dataset:

// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")

// Join each batch in stream with the dataset
kafkaDStream.transform { batchRDD =>
  batchRDD.join(dataset).filter(...)
}
Because each batch of streaming data is stored in the memory of the Spark nodes, it can also be queried interactively on demand. For example, you can inspect the state of all streams through the Spark SQL JDBC server, as shown in the next section. This combination of batch, stream, and interactive workloads is easy to implement in Spark because all three share a common abstraction; it is hard in systems that lack one.

Advanced analytics: machine learning and SQL queries

Because Spark's components interoperate, users have access to a rich set of libraries, such as MLlib (machine learning), SQL, DataFrames, and GraphX. Let's explore a few use cases:

Streaming + SQL and DataFrames
The sequence of RDDs maintained inside a DStream can be converted into DataFrames (the Spark SQL programming interface) and then queried with SQL statements. For example, using Spark SQL's JDBC server, external programs can query the state of the stream via SQL:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sparkContext)
import hiveContext.implicits._ // enables rdd.toDF(...)
...
wordCountsDStream.foreachRDD { rdd =>
  // Convert each batch RDD to a DataFrame and register it as a SQL table
  val wordCountsDataFrame = rdd.toDF("word", "count")
  wordCountsDataFrame.registerTempTable("word_counts")
}
...
// Start the JDBC server
HiveThriftServer2.startWithContext(hiveContext)
Through the JDBC server, you can then use the beeline client that ships with Spark, or tools such as Tableau, to interactively query the continuously updated "word_counts" table:

1: jdbc:hive2://localhost:10000> show tables;
+--------------+--------------+
|  tableName   | isTemporary  |
+--------------+--------------+
| word_counts  | true         |
+--------------+--------------+
1 row selected (0.102 seconds)
1: jdbc:hive2://localhost:10000> select * from word_counts;
+-----------+--------+
|   word    | count  |
+-----------+--------+
| 2015      | 264    |
| PDT       | 264    |
| 21:45:41  | 27     |

Streaming + MLlib
Machine learning models can be trained offline with MLlib and then applied to streaming data. For example, the following code trains a KMeans clustering model on static data and then uses the model to classify events from a Kafka data stream.

// Learn model offline
val model = KMeans.train(dataset, ...)

// Apply model online on stream
val kafkaStream = KafkaUtils.createStream(...)
// featurize is a user-supplied function that turns an event into a feature vector
kafkaStream.map { event => model.predict(featurize(event)) }
We demonstrated this "offline learning, online prediction" pattern in a Databricks demo at Spark Summit 2014. Since then, we have also added streaming machine learning algorithms to MLlib that train continuously on labeled data streams (one is sketched below). Other Spark extension libraries can likewise be called easily from Spark Streaming.
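As an illustration of those streaming algorithms, here is a minimal sketch of MLlib's StreamingKMeans, which updates its cluster centers as each batch arrives; the trainingData and testData DStreams are assumptions for illustration:

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// trainingData and testData are assumed to be DStream[Vector] inputs
def clusterStream(trainingData: DStream[Vector], testData: DStream[Vector]): Unit = {
  val model = new StreamingKMeans()
    .setK(10)                 // number of clusters
    .setDecayFactor(1.0)      // how much weight historical batches keep
    .setRandomCenters(3, 0.0) // random initial centers for 3-dimensional data

  model.trainOn(trainingData)       // update the cluster centers on every batch
  model.predictOn(testData).print() // emit a cluster id for each test point
}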