A brief comparison of two high-performance parallel computing engines: Storm and Spark

Source: Internet
Author: User

Spark is based on the idea that when the data set is large, it is more efficient to move the computation to the data than to move the data to the computation. Each node stores (or caches) its portion of the data set, and tasks are submitted to the nodes that hold the data. In other words, the process is passed to the data. This is very similar to Hadoop MapReduce, except that Spark makes aggressive use of memory to avoid I/O, so iterative algorithms (in which the output of one step is the input of the next) perform better. Shark is simply a Spark-based query engine that supports ad hoc analytical queries.

Storm's architecture is the opposite of Spark's. Storm is a distributed stream computing engine. Each node implements a basic computation, and data items flow in and out of a network of interconnected nodes. Unlike Spark, this passes the data to the process.

Both frameworks are used for parallel computation over large amounts of data. Storm is better at dynamically processing large numbers of small data items as they are generated (for example, computing aggregate functions or analytics in real time over a Twitter data stream). Spark works on existing data sets (such as Hadoop data) that have been imported into the Spark cluster; it can scan them quickly thanks to in-memory management, and it minimizes the global I/O operations of iterative algorithms.

However, Spark's streaming module (Spark Streaming) is similar to Storm in that both are stream computing engines, although they are not identical. Spark Streaming first gathers data into batches and then distributes them as blocks (treated as immutable data sets), whereas Storm processes and distributes each data item as soon as it is received. It is not clear which approach has the advantage in data throughput, but Storm's computation latency is lower.

In summary, Spark and Storm are opposite in design, while Spark Streaming is similar to Storm. Spark Streaming provides sliding windows over the data, whereas Storm users need to maintain such windows themselves.
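The difference between record-at-a-time processing and batch-then-process can be illustrated with a minimal, framework-free Python sketch. It uses neither the Storm nor the Spark API; the function names `per_record` and `micro_batch` and the uppercase transformation are assumptions made purely for illustration, and events are assumed to arrive in time order.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def per_record(events: Iterable[Event]) -> Iterator[Tuple[float, str]]:
    """Storm-like model: handle each event as soon as it arrives."""
    for arrival, payload in events:
        yield arrival, payload.upper()


def micro_batch(events: Iterable[Event],
                interval: float) -> Iterator[Tuple[float, List[str]]]:
    """Spark Streaming-like model: group events into fixed time
    intervals and process each non-empty group as one batch."""
    batch: List[str] = []
    batch_end = interval
    for arrival, payload in events:
        # Close every batch whose interval ended before this event arrived.
        while arrival >= batch_end:
            if batch:
                yield batch_end, [p.upper() for p in batch]
            batch = []
            batch_end += interval
        batch.append(payload)
    if batch:
        yield batch_end, [p.upper() for p in batch]


events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.5, "d")]
print(list(per_record(events)))
print(list(micro_batch(events, interval=1.0)))
```

In the first model each result is available at the event's own arrival time; in the second, results only become available when the enclosing interval closes.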

The two can also be compared from several other angles:

Spark Streaming and Storm are both popular real-time stream computing frameworks that are widely used in real-time computing scenarios. Spark Streaming is an extension built on Spark and appeared later than Storm. This section compares the two from the following angles, as a reference for choosing between them.

A. Data processing model

Spark Streaming is a real-time stream computing framework built on Spark. It uses a time-based batch window to turn incoming data into RDDs, the input of a Spark computation, then generates a job for each RDD and queues it for execution in the Spark computing framework. Underneath, it relies on Spark's resource scheduling and task computing framework. Spark Streaming thus processes data in batches and generates computing tasks from the data: it moves the computation rather than the data. Storm, by contrast, streams data into the compute nodes: it moves the data rather than the computation. Batch processing over time windows has to be implemented by the user, as described in the earlier chapters of the Storm series.
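As a rough illustration of what implementing a time window by hand involves, the following Python sketch keeps a sliding window of recent events and counts the keys inside it. The class `SlidingWindowCounter` is hypothetical and is not part of Storm; it assumes timestamps arrive in non-decreasing order.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class SlidingWindowCounter:
    """Counts keys seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        self.events: Deque[Tuple[float, str]] = deque()

    def add(self, timestamp: float, key: str) -> None:
        """Record an event, then drop events that fell out of the window."""
        self.events.append((timestamp, key))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Events are in time order, so expired ones are always at the left.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()

    def counts(self) -> Dict[str, int]:
        """Return the per-key counts for events still inside the window."""
        result: Dict[str, int] = {}
        for _, key in self.events:
            result[key] = result.get(key, 0) + 1
        return result


counter = SlidingWindowCounter(window_seconds=10)
for ts, key in [(0, "a"), (5, "b"), (9, "a"), (12, "a")]:
    counter.add(ts, key)
print(counter.counts())
```

The eviction logic, ordering assumptions, and state storage are all the user's responsibility here, which is the burden the text attributes to Storm.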

B. Ecosystem

Spark Streaming is built on Spark and can be combined with other Spark components, for example for interactive ad hoc queries or machine learning with MLlib. In contrast, Storm is purely a stream computing framework and is less integrated with the existing Hadoop ecosystem.

C. Latency and throughput

Because Spark Streaming processes data in batches and relies on Spark's scheduling and computing framework, its latency is higher than Storm's: the minimum latency is typically around 2 s, while Storm can stay within 100 ms. Because Spark Streaming processes data in batches, however, its overall throughput is relatively high.
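One way to see why batching adds latency: a record cannot be processed until the batch it belongs to closes. The sketch below computes that waiting time under a simplified model (fixed batch interval, batches aligned to time zero, scheduling and processing time ignored). The function name `micro_batch_wait` is an assumption for illustration, not a Spark API, and the numbers are not measurements.

```python
import math


def micro_batch_wait(arrival: float, interval: float) -> float:
    """Seconds a record waits until the batch containing it closes.

    A record arriving exactly on a boundary is assigned to the next batch.
    """
    batch_end = (math.floor(arrival / interval) + 1) * interval
    return batch_end - arrival


arrivals = [0.0, 0.5, 1.0, 1.5]
waits = [micro_batch_wait(t, interval=2.0) for t in arrivals]
print(waits)                    # per-record wait inside a 2 s batch
print(sum(waits) / len(waits))  # average wait for these arrivals
```

Under this model the wait is bounded by the batch interval, so shrinking the interval lowers latency at the cost of scheduling more, smaller batches.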

D. Fault tolerance

Spark Streaming achieves fault tolerance by keeping two replicas of the data in memory and by using lineage, which records the operations that produced each RDD. If a node fails at run time, the lost data can be recomputed on other nodes from the replicated data.

Storm uses ack components to track the flow of data, which is much more expensive than the Spark Streaming approach.
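The lineage idea mentioned for Spark Streaming can be sketched in a few lines of plain Python: each derived data set remembers its parent and the function that produced it, so a lost result can be rebuilt from the surviving source data. The class `LineageDataset` and its methods are hypothetical and greatly simplified; this is not Spark's RDD API.

```python
from typing import Callable, Iterable, List, Optional


class LineageDataset:
    """A data set that records how it was derived from its parent."""

    def __init__(self,
                 source: Optional[Iterable[int]] = None,
                 parent: Optional["LineageDataset"] = None,
                 fn: Optional[Callable[[int], int]] = None) -> None:
        self.source = source  # replicated input data (root only)
        self.parent = parent  # lineage: the data set this one came from
        self.fn = fn          # lineage: the operation applied to the parent
        self.cache: Optional[List[int]] = None

    def map(self, fn: Callable[[int], int]) -> "LineageDataset":
        """Derive a new data set; nothing is computed yet."""
        return LineageDataset(parent=self, fn=fn)

    def compute(self) -> List[int]:
        """Return cached results, recomputing via lineage if missing."""
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source or [])
            else:
                self.cache = [self.fn(x) for x in self.parent.compute()]
        return self.cache

    def lose_cache(self) -> None:
        """Simulate a node failure that wipes the in-memory result."""
        self.cache = None


base = LineageDataset(source=[1, 2, 3])
result = base.map(lambda x: x * 2).map(lambda x: x + 1)
print(result.compute())
result.lose_cache()      # simulated failure
print(result.compute())  # rebuilt from the recorded operations
```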

E. Transactional guarantees

Spark Streaming guarantees that data is processed exactly once, with the guarantee applied at the level of a batch.

Storm's tracking mechanism ensures that each record is processed at least once; if state must be updated exactly once, the user has to implement that.

So for stateful computations with strict transactional requirements, Spark Streaming is the better choice.
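One common way for a user to get exactly-once state updates on top of at-least-once delivery is to make the update idempotent, for example by remembering which message IDs have already been applied. The sketch below shows that idea in plain Python; the class `IdempotentCounter` and the message IDs are assumptions for illustration and do not correspond to any Storm API.

```python
from typing import Set


class IdempotentCounter:
    """Running total that ignores redelivered messages."""

    def __init__(self) -> None:
        self.total = 0
        self.seen_ids: Set[str] = set()

    def apply(self, message_id: str, amount: int) -> bool:
        """Apply the update once per message ID.

        Returns True if the state changed, False for a duplicate.
        """
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.total += amount
        return True


counter = IdempotentCounter()
# "m1" is delivered twice, as can happen under at-least-once delivery.
for message_id, amount in [("m1", 5), ("m2", 3), ("m1", 5)]:
    counter.apply(message_id, amount)
print(counter.total)
```

A real system would also need to bound the set of remembered IDs and persist it together with the state, which this sketch leaves out.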
