In the past two years, streaming computing has gradually become popular again. Streaming systems fall into two main categories: continuous-operator based and micro-batch based. I have recently been using Spark Streaming, which follows the micro-batch model, so this post introduces it alongside the paper "Discretized Streams: Fault-Tolerant Streaming Computation at Scale", published in 2013. Although the paper dates from 2013, the core logic of the system has not changed much, so it is still very helpful for understanding the design and working model of Spark Streaming. Note: Spark introduced Structured Streaming in 2016 and has since been evolving toward a continuous model, so micro-batch Spark Streaming may gradually fade out.
0. Abstract
The main problem with distributed streaming systems at the time (2013) is that failure recovery is very expensive: it requires hot replication, or recovery takes a long time, and stragglers are not handled. A straggler is a member or component of a distributed system that falls behind the others, for example a task node whose running time is significantly longer than that of the other nodes. In contrast, DStream (Spark Streaming's streaming model, short for discretized streams) recovers from failures faster and also handles stragglers. Other advantages include a rich set of operators, high throughput, linear scalability up to 100 nodes, sub-second latency, and sub-second failure recovery. Finally, DStreams can also be combined with batch processing and interactive queries.
1. Overview
There are two main types of distributed computing: batch processing and streaming computing. The main advantages of streaming computing are its timeliness and low latency. The two main problems in designing a large-scale streaming system are failure handling and straggler handling. Given the real-time nature of streaming systems, recovering quickly after a failure is extremely important.
Unfortunately, existing streaming systems are not designed well enough on these two points. For example, Storm and TimeStream (Flink had not yet been widely adopted at the time) are based on the continuous-operator model, in which continuously running, stateful nodes receive and process data. Failure recovery in this model comes in two flavors: replication, where each operator node has a replica node; and upstream backup, where after a node fails the upstream replays its data to a new replacement node. In a large cluster, neither approach is ideal: replication consumes double the resources, and upstream replay takes a considerable amount of time. Moreover, neither handles stragglers: in the first approach a straggler slows down the replication process, while the second treats a straggler as a failed node and recovers it, which is relatively expensive.
Spark Streaming's model is discretized streams (D-Streams). In this model there are no long-running operators; instead, the data of each time interval is processed by a series of stateless, deterministic batch computations. For example, a MapReduce-style job can count each second of data, and the counts from multiple batches can be accumulated on top of one another. In short, in the DStream model, once the input is given the output state is determined; below we explain in detail why DStream's failure-recovery model is superior to the two previous approaches.
There are two main difficulties in implementing DStreams: low latency and fast failure recovery (including straggler handling). Traditional batch systems such as Hadoop generally run slowly, mainly because intermediate results have to be persisted (note: this also means better fault tolerance). DStream instead performs its batch processing on Resilient Distributed Datasets (RDDs) (note: RDDs can keep data in memory and recompute quickly through the dependencies between RDDs). This processing is generally at the sub-second level, which is sufficient for most scenarios.
Fast failure recovery relies on a new mechanism enabled by DStream's determinism: parallel recovery. When a node fails, the RDD data of the failed node is quickly rebuilt by the other nodes in the cluster. This recovery mode is faster than both replication and upstream replay. As for stragglers, because we know how long a batch task normally takes, we can identify a straggler by estimating the expected running time of the task.
Spark Streaming is the implementation of the DStream model, built on the Spark engine. The system can process 60 million records per second on a cluster of 100 nodes while keeping latency at the sub-second level, and failure recovery is also at the sub-second level. Of course, these evaluation numbers are from 2013, that is, five years ago. The paper goes on to list some comparative figures that I will not repeat here; in short, the conclusion is that Spark Streaming's throughput and linear scalability are better than those of other contemporary streaming systems.
Finally, it is worth mentioning that because Spark Streaming uses the same RDD model as batch processing, users can combine Spark Streaming with batch jobs and interactive queries, or combine historical RDD data with Spark Streaming (note: one scenario here is to train a model offline and then apply it to real-time data through Spark Streaming).
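For example, here is a minimal sketch of that combination, assuming a hypothetical historical RDD of per-URL statistics computed offline; the file path, host, and port are placeholders. The DStream transform operation exposes each micro-batch as an ordinary RDD, so it can be joined against batch data:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A minimal sketch of mixing streaming data with offline (batch) data.
// `historical` is a hypothetical offline-computed RDD of (url, score) pairs.
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamBatchJoin")
val ssc = new StreamingContext(conf, Seconds(1))

val historical = ssc.sparkContext
  .textFile("hdfs:///path/to/history")                                   // placeholder path
  .map { line => val parts = line.split(","); (parts(0), parts(1).toDouble) }

// Live events, keyed by URL; the socket source stands in for the real input.
val events = ssc.socketTextStream("localhost", 9999).map(url => (url, 1))

// transform exposes each batch as an ordinary RDD, so it can be joined with batch data.
val enriched = events.transform(batchRDD => batchRDD.join(historical))

enriched.print()
ssc.start()
ssc.awaitTermination()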
2. Background
Many distributed streaming systems use the continuous-operator model. In this model there are multiple continuously running operators; each operator receives its own data and updates its own state. Although this approach reduces latency, the operators themselves are stateful, which makes failure recovery particularly troublesome. As mentioned earlier, recovery happens either through replication or through upstream backup and replay, and the drawbacks of both are obvious: wasted resources and long recovery time, respectively.
In addition to the cost problem, replication also has a data-consistency problem: how to ensure that the two replica nodes receive consistent data. This requires introducing distributed synchronization protocols, such as Flux or Borealis's DPC.
In upstream backup mode, after an operator node fails, the upstream resends the data it had previously sent to the failed node, starting from a checkpoint, to a new replacement node, which leads to a long recovery time. The paper does not discuss how the operator's state is preserved here; in fact the operator state also has to be saved, and its checkpoint must be consistent with the upstream checkpoint.
Finally, neither the replication mode nor the upstream backup mode handles stragglers well.
3. DStream
As mentioned above, DStream replaces long-running operators with a series of small batch jobs to achieve fast failure recovery.
3.1 Computation Model
DStream breaks the input stream into multiple batches at regular time intervals. The data of each time interval is stored as an RDD, then processed in parallel by a series of operators such as map, reduce, and group, and the results are finally output as new RDDs or pushed out of the system (to stdout, a file system, the network, etc.).
The paper gives an example of computing a website's page views (PV); the pseudocode is as follows:
pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)
The execution process is briefly described as follows (a sketch using the actual Spark Streaming API is given after the list):
1. Spark Streaming continuously receives page-view events from the HTTP source as the stream pageViews.
2. pageViews is split into a series of RDDs at 1-second intervals (each interval's RDD may in turn be split into many partitions).
3. map, reduce, and other operators are applied to the per-interval RDDs from step 2.
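For comparison, below is a minimal sketch of what this example might look like in the actual Spark Streaming (DStream) API. The paper's readStream and runningReduce are pseudocode; here a socket text source and updateStateByKey stand in for them, and the host, port, and checkpoint path are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A minimal sketch of the paper's page-view example using the real DStream API.
// The socket source, host/port, and checkpoint path are placeholders.
object PageViewCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("PageViewCount")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval
    ssc.checkpoint("/tmp/pageview-checkpoint")       // required for updateStateByKey

    // Each incoming line is assumed to be a URL; this plays the role of readStream(...).
    val pageViews = ssc.socketTextStream("localhost", 9999)
    val ones = pageViews.map(url => (url, 1))

    // Running count per URL, standing in for runningReduce in the pseudocode.
    val counts = ones.updateStateByKey[Int] { (newOnes: Seq[Int], total: Option[Int]) =>
      Some(newOnes.sum + total.getOrElse(0))
    }

    counts.print() // emit the current counts for each batch
    ssc.start()
    ssc.awaitTermination()
  }
}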
Failure recovery relies on the dependency (lineage) relationships between DStreams and RDDs. The dependencies are tracked at partition granularity: each RDD may be divided into multiple partitions distributed across different machines in the cluster, so when the RDD data on one machine is lost, the lost partitions can be recomputed in parallel on multiple machines from the RDDs they depend on. Furthermore, since there is no ordering dependency between different time intervals, the RDDs of different intervals can also be recovered in parallel. This is the key to DStream's fast failure recovery.
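To make the lineage idea concrete, here is a small sketch in plain Spark (names and data are made up for illustration): toDebugString prints an RDD's dependency chain, which is exactly the information recovery uses to recompute lost partitions.

import org.apache.spark.{SparkConf, SparkContext}

// A small sketch of RDD lineage, the dependency information that recovery recomputes from.
// The input data here is made up; it stands in for one interval's worth of page views.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("LineageSketch"))

val pageViews = sc.parallelize(Seq("a.html", "b.html", "a.html"))
val ones = pageViews.map(url => (url, 1))
val counts = ones.reduceByKey(_ + _)

// Prints the chain of dependencies; if a partition of `counts` is lost,
// Spark re-runs only the transformations needed to rebuild that partition.
println(counts.toDebugString)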
3.2 Consistency Semantics
In continuous-operator stream processing systems, operators under different loads may lag behind one another, so a snapshot of the whole system's state at a given point in time can be inconsistent. Borealis addresses this by synchronizing the different nodes; Storm simply ignores the problem.
For DStream, because time is naturally discretized and the RDDs for each interval are fault-tolerant, immutable, and computed deterministically, DStream provides exactly-once semantics.
I feel there is an implicit premise here that the paper does not spell out: the upstream data source must be reliable.
4. System Architecture
The overall system architecture has not changed much, so there is little point in going over the implementation details again. Spark Streaming mainly consists of three parts:
master: responsible for recording the DStream dependency graph (lineage graph) and for task scheduling; today we would call it the driver.
worker: responsible for receiving data, storing data, and executing tasks; today we would call it the executor.
client library: used to send data into the system.
Spark Streaming's tasks are stateless and can run on any node. Compared with the fixed topology of traditional streaming systems (note: I am not sure whether this is still the case today), this makes scaling out easier.
All of Spark Streaming's state is stored in RDDs, and an RDD's partitions can be stored on any node and computed by multiple nodes. Task scheduling takes data locality into account; for example, a task that processes partition A will be assigned to the node where partition A is stored.
The remaining implementation details are not discussed here.
5. Fault and Straggler Recovery
Parallel recovery has already been discussed above, so I won't go into detail again here; this section adds how stragglers are handled.
Detecting a straggler is simple: since most tasks finish quickly, a task that runs noticeably longer than its peers can be treated as a straggler. A straggler can then be dealt with by migration, that is, by re-running its task on another machine.
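In today's Spark, the closest built-in mechanism I am aware of is speculative execution, which re-launches tasks that run much longer than their peers on other nodes. A minimal sketch of enabling it is below; the threshold values are illustrative, not tuned recommendations.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A minimal sketch of enabling Spark's speculative execution for straggler mitigation.
// The values below are illustrative, not recommendations.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("StragglerMitigationSketch")
  .set("spark.speculation", "true")           // turn speculative execution on
  .set("spark.speculation.quantile", "0.75")  // fraction of tasks that must finish before checking
  .set("spark.speculation.multiplier", "1.5") // how much slower than the median counts as a straggler

val ssc = new StreamingContext(conf, Seconds(1))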
Although this paper has been around for a long time, it is still very helpful for understanding the original intent and design ideas behind Spark Streaming. The rest of the paper is somewhat dated or less relevant, so I won't cover it here. I welcome corrections to any mistakes in this article.