Spark Streaming Architecture

Spark Streaming architecture: discretized stream processing
Instead of processing one record at a time as traditional stream processors do, Spark Streaming discretizes the incoming data into micro-batches that can be processed at sub-second granularity. Spark Streaming's Receivers accept data in parallel and buffer it in the memory of the Spark worker nodes. The latency-optimized Spark engine then runs each batch as a set of short tasks (tens of milliseconds) and outputs the results to other systems. Notably, this differs from the traditional continuous-operator model, in which computation is statically assigned to a node: Spark tasks can be assigned dynamically to worker nodes based on the data's location and the available resources. This enables the two properties we describe next: load balancing and fast failure recovery.
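To make the micro-batch model concrete, here is a minimal sketch of a Spark Streaming program; the socket source, host, and port are assumptions for illustration. The stream is discretized into one-second batches, and each batch is processed as a short word-count job:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicroBatchWordCount")
// Discretize the incoming stream into batches of one second
val ssc = new StreamingContext(conf, Seconds(1))

// A receiver collects data in parallel and buffers it on the workers
val lines = ssc.socketTextStream("localhost", 9999)

// Each batch is executed as a short Spark job
val wordCounts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()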

In addition, each batch of data is treated as a Resilient Distributed Dataset (RDD), Spark's basic abstraction for a fault-tolerant dataset. This is why streaming data can be processed with any Spark code or library.
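For instance, here is a brief sketch (assuming a DStream of text lines named lines) showing that every micro-batch arrives as an ordinary RDD, to which regular Spark operations apply unchanged:

// Each micro-batch is handed to this function as a regular RDD,
// so any Spark operation or library can be applied to it
lines.foreachRDD { rdd =>
  val topWords = rdd.flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .top(10)(Ordering.by(_._2)) // an ordinary RDD action
  topWords.foreach(println)
}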



Advantages of discretized stream data processing
Let's look at how this architecture lets Spark Streaming achieve the goals we set out above.

Dynamic load balancing

Dividing the data into small batches allows the Spark scheduler to allocate resources at fine granularity. For example, consider an input data stream that must be partitioned by a key. In traditional systems that statically assign tasks to nodes, if one partition is more computationally intensive than the others, the node processing it becomes a bottleneck and slows down the whole pipeline. In Spark Streaming, tasks are distributed to nodes dynamically: nodes handling the longer-running tasks receive fewer of them, while the other nodes pick up a larger number of shorter tasks.



Fast failure recovery mechanism

When a node fails, traditional systems restart the failed continuous operator on another node and replay part of the data stream to recompute the lost state. Note that during this process only a single node handles the recomputation, and the pipeline cannot resume until the new node has been restored to the pre-failure state. In Spark, the computation is split into many small tasks that can run anywhere without affecting the correctness of the combined result. Failed tasks can therefore be relaunched in parallel on the other nodes of the cluster, so the recomputation is spread evenly across many nodes and recovery is much faster than in the traditional approach.
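This parallel recomputation is automatic, driven by RDD lineage. What users typically configure is checkpointing, which persists stream metadata and truncates long lineage chains so that the amount of recomputation after a failure stays bounded. A minimal sketch follows; the checkpoint directory path is a hypothetical example:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://namenode:8020/checkpoints/app" // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverableApp")
  val ssc = new StreamingContext(conf, Seconds(1))
  // Persist metadata and cut long RDD lineage chains, bounding
  // the work needed to recover after a failure
  ssc.checkpoint(checkpointDir)
  ssc
}

// On driver restart, rebuild the context from checkpoint data if present
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)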




Integration of batch processing, stream processing and interactive analysis

The discretized stream (DStream) is the key programming abstraction in Spark Streaming. Internally, a DStream is represented by a time-ordered sequence of RDDs, each containing the data that arrived within one batch interval. This common representation allows batch and stream processing to interoperate seamlessly, so users can apply regular Spark operations to each batch of streaming data, for example joining a DStream with a pre-created dataset:

// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")

// Join each batch in stream with the dataset
kafkaDStream.transform { batchRDD =>
  batchRDD.join(dataset).filter(...)
}
Because each batch of streaming data is stored in the memory of the Spark nodes, it can also be queried interactively on demand. For example, you can inspect the state of all streams through the Spark SQL JDBC server, as shown in the next section. This combination of batch, stream, and interactive workloads is easy to implement in Spark because all three share a common abstraction; it is hard in systems that lack one.

Advanced analytics: machine learning and SQL queries

Because Spark's components interoperate, users have access to a rich set of libraries, such as MLlib (machine learning), SQL, DataFrames, and GraphX. Let's explore a few use cases:

Streaming + SQL and DataFrames
The sequence of RDDs maintained inside a DStream can be converted into DataFrames (the Spark SQL programming interface) and then queried with SQL statements. For example, using Spark SQL's JDBC server, external programs can query the state of the stream via SQL:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sparkContext)
import hiveContext.implicits._ // enables rdd.toDF(...)
...
wordCountsDStream.foreachRDD { rdd =>
  // Convert each batch RDD to a DataFrame and register it as a SQL table
  val wordCountsDataFrame = rdd.toDF("word", "count")
  wordCountsDataFrame.registerTempTable("word_counts")
}
...
// Start the JDBC server
HiveThriftServer2.startWithContext(hiveContext)
Through the JDBC server, you can then use the beeline client that ships with Spark, or tools such as Tableau, to interactively query the continuously updated "word_counts" table:

1: jdbc:hive2://localhost:10000> show tables;
+--------------+--------------+
|  tableName   | isTemporary  |
+--------------+--------------+
| word_counts  | true         |
+--------------+--------------+
1 row selected (0.102 seconds)
1: jdbc:hive2://localhost:10000> select * from word_counts;
+-----------+--------+
|   word    | count  |
+-----------+--------+
| 2015      | 264    |
| PDT       | 264    |
| 21:45:41  | 27     |

Streaming + MLlib
Machine learning models can be trained offline with MLlib and then applied to streaming data. For example, the following code trains a KMeans clustering model on static data and then uses the model to classify events from a Kafka data stream.

// Learn model offline
val model = KMeans.train(dataset, ...)

// Apply model online on stream
val kafkaStream = KafkaUtils.createStream(...)
// featurize is a user-supplied function that turns an event into a feature vector
kafkaStream.map { event => model.predict(featurize(event)) }
We demonstrated this "offline learning, online prediction" pattern in a Databricks demo at Spark Summit 2014. Since then, we have also added streaming machine learning algorithms to MLlib that train continuously on labeled data streams (one is sketched below). Other Spark extension libraries can likewise be called easily from Spark Streaming.
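As an illustration of those streaming algorithms, here is a minimal sketch of MLlib's StreamingKMeans, which updates its cluster centers as each batch arrives; the trainingData and testData DStreams are assumptions for illustration:

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// trainingData and testData are assumed to be DStream[Vector] inputs
def clusterStream(trainingData: DStream[Vector], testData: DStream[Vector]): Unit = {
  val model = new StreamingKMeans()
    .setK(10)                 // number of clusters
    .setDecayFactor(1.0)      // how much weight historical batches keep
    .setRandomCenters(3, 0.0) // random initial centers for 3-dimensional data

  model.trainOn(trainingData)       // update the cluster centers on every batch
  model.predictOn(testData).print() // emit a cluster id for each test point
}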