Spark Streaming and Storm are two popular real-time stream computing frameworks that are widely used in real-time processing scenarios. Spark Streaming is an extension built on Spark and appeared later than Storm. This chapter compares the two from the following angles, which can serve as a reference when choosing between them.
A. Data processing model
Spark Streaming is a real-time stream computing framework built on Spark. It uses a time-based batch window to turn the input stream into RDDs, generates a job for each RDD, and queues those jobs for execution on the Spark engine; the bottom layer relies on Spark's resource scheduling and task execution framework. In other words, Spark Streaming processes data in small batches and ships computation to where the data is (moving computation, not data). Storm, on the other hand, streams data through the compute nodes (moving data, not computation), and batch processing over time windows has to be implemented by the user, as described in the earlier Storm chapters.
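The contrast between the two models can be sketched in a few lines of plain Python. This is an illustration only; none of these names are Spark or Storm APIs.

```python
# Illustrative sketch: micro-batch (Spark Streaming style) vs.
# record-at-a-time (Storm style) processing. Names are made up.
from typing import Callable, Iterable, List

def micro_batch(events: Iterable[int], batch_size: int,
                job: Callable[[List[int]], int]) -> List[int]:
    """Group events into time/size windows and run one job per batch (per RDD)."""
    results, batch = [], []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:      # the batch window closes
            results.append(job(batch))    # one job queued per batch
            batch = []
    if batch:
        results.append(job(batch))
    return results

def per_record(events: Iterable[int],
               op: Callable[[int], int]) -> List[int]:
    """Every record flows through the operator individually as it arrives."""
    return [op(e) for e in events]

print(micro_batch(range(10), 4, sum))          # [6, 22, 17]
print(per_record(range(5), lambda x: x * 2))   # [0, 2, 4, 6, 8]
```

Note that in the micro-batch model the job only runs when a window closes, whereas in the per-record model each event is handled immediately; this difference drives the latency comparison below.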
B. Ecosystem
Spark Streaming is built on Spark, so it can be combined with other Spark components, such as interactive (ad hoc) queries and machine learning with MLlib. Storm, by contrast, is purely a stream computing framework and integrates far less with the existing Hadoop ecosystem.
C. Latency and throughput
Spark Streaming processes data in batches and relies on Spark's scheduling and execution framework, so its latency is higher than Storm's: the minimum latency is typically around 2 s, while Storm can reach latencies within 100 ms. On the other hand, precisely because Spark Streaming processes data in batches, its overall throughput is relatively high.
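A back-of-the-envelope model shows where the micro-batch latency floor comes from. The numbers below are assumed for illustration, not benchmarks.

```python
# Illustrative latency model for micro-batching (assumed numbers).
batch_interval = 1.0    # seconds between micro-batches
processing_time = 0.5   # time to run the job for one batch

# A record arriving just after a window closes must wait a full
# interval before it is even scheduled, so its worst-case latency is:
worst_case = batch_interval + processing_time   # 1.5 s
# A record arriving just before the cut-off only pays processing time:
best_case = processing_time                     # 0.5 s

print(f"micro-batch latency: {best_case:.1f}s .. {worst_case:.1f}s")
# A per-record framework has no batching wait, only per-tuple
# processing, which is how Storm stays in the sub-100 ms range.
```

So the batch interval effectively sets a lower bound on latency, which is why Spark Streaming cannot match Storm's per-record latency no matter how fast each batch job runs.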
D. Fault tolerance
Spark Streaming achieves fault tolerance through lineage plus two in-memory copies of the received data: lineage records the chain of operations that produced each RDD, so if a node fails at runtime, the lost results can be recomputed on other nodes from the backed-up input data.
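The idea of lineage-based recovery can be sketched as follows. This is a toy model, not Spark's API: a "partition" remembers its replicated source data and the transformations applied to it, so a lost result can be rebuilt by replaying the lineage.

```python
# Toy sketch of lineage-based recovery (illustrative, not Spark's API).
class Partition:
    def __init__(self, source):
        self.source = list(source)   # replicated input (the in-memory copy)
        self.lineage = []            # recorded transformations
        self.result = list(source)

    def map(self, fn):
        self.lineage.append(fn)
        self.result = [fn(x) for x in self.result]
        return self

    def recompute(self):
        """Rebuild the result on another node by replaying the lineage
        against the surviving replica of the source data."""
        data = list(self.source)
        for fn in self.lineage:
            data = [fn(x) for x in data]
        return data

p = Partition([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
assert p.result == [20, 30, 40]
p.result = None                  # simulate losing the node holding the result
print(p.recompute())             # [20, 30, 40] -- recovered via lineage
```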
Storm instead uses acker components to track every tuple as it flows through the topology, which is considerably more expensive than Spark Streaming's approach.
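Storm's ack tracking is based on a clever XOR trick: the acker keeps one running value per spout tuple and XORs in the id of every tuple that is emitted and every tuple that is acked; when the value returns to zero, the whole tuple tree has been fully processed. The sketch below is simplified and the names are illustrative.

```python
# Simplified sketch of Storm's XOR-based ack tracking (illustrative).
class Acker:
    def __init__(self):
        self.pending = {}  # spout tuple id -> running XOR of child tuple ids

    def emit(self, root, tuple_id):
        """A bolt anchors a new child tuple to the spout tuple `root`."""
        self.pending[root] = self.pending.get(root, 0) ^ tuple_id

    def ack(self, root, tuple_id):
        """A bolt acks a tuple; XORing the same id twice cancels it out."""
        self.pending[root] ^= tuple_id
        if self.pending[root] == 0:
            del self.pending[root]
            return True          # tree complete: spout tuple fully processed
        return False

acker = Acker()
t1, t2 = 0x1A2B, 0x3C4D         # in Storm these are random 64-bit ids
acker.emit("spout-1", t1)        # bolt emits two child tuples
acker.emit("spout-1", t2)
print(acker.ack("spout-1", t1))  # False: t2 still outstanding
print(acker.ack("spout-1", t2))  # True: XOR back to zero, tree complete
```

The per-tuple bookkeeping this implies (one emit and one ack message per tuple) is the overhead the paragraph above refers to.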
E. Transactional guarantees
Spark Streaming guarantees that data is processed exactly once, with the guarantee applying at the level of a batch.
Storm's tracking mechanism ensures that each record is processed at least once; if state must be updated exactly once, the user has to implement that themselves.
So for stateful computations that need stronger transactional guarantees, Spark Streaming is the better choice.
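One common way a user can turn at-least-once delivery into effectively-once state updates is to deduplicate on a record id before applying each update. The sketch below is illustrative, not a Storm API.

```python
# Sketch: deduplicating on a record id so that replayed (at-least-once)
# records do not update state twice. Illustrative, not a Storm API.
class CounterState:
    def __init__(self):
        self.seen = set()    # ids of records already applied
        self.total = 0

    def update(self, record_id, value):
        if record_id in self.seen:
            return False     # replayed record: skip, keep state exactly-once
        self.seen.add(record_id)
        self.total += value
        return True

state = CounterState()
state.update("r1", 5)
state.update("r2", 3)
state.update("r1", 5)        # at-least-once replay of r1 is ignored
print(state.total)           # 8
```

In practice the `seen` set has to be bounded (e.g. by expiring old ids), but the principle is the same: the framework guarantees delivery, and the user's state logic supplies idempotence.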