Spark is based on the idea that, when data is large, it is more efficient to move the computation to the data than to move the data to the computation. Each node stores (or caches) its slice of the data set, and tasks are submitted to the nodes holding that data; that is, the process is shipped to the data. This is very similar to Hadoop MapReduce, except that Spark makes aggressive use of memory to avoid disk I/O, so iterative algorithms (where the output of one step becomes the input of the next) perform much better.
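As a minimal sketch of that pattern in Scala (the HDFS path, step size, and iteration count here are illustrative assumptions, not from the original article), the program below caches a dataset once and then runs a simple iterative gradient-descent loop over it; every iteration reuses the in-memory partitions instead of rereading from disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("IterativeSketch").setMaster("local[*]"))

    // Load once and cache: partitions stay in memory on the nodes that
    // hold them, so every iteration below reads from RAM instead of disk.
    val xs = sc.textFile("hdfs:///data/values.txt") // hypothetical path
      .map(_.toDouble)
      .cache()
    val n = xs.count()

    // Each step's output (w) is the next step's input; only this small
    // scalar moves between driver and cluster, never the dataset itself.
    var w = 0.0
    for (_ <- 1 to 20) {
      // residual sum, proportional to the negative gradient of sum((x - w)^2)
      val grad = xs.map(x => x - w).sum()
      w += 0.5 * grad / n
    }
    println(s"converged estimate: $w")
    sc.stop()
  }
}
```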
Shark is simply a Spark-based query engine that supports ad-hoc analytical queries.
Storm's architecture is the exact opposite of Spark's. Storm is a distributed stream-computing engine: each node implements a basic computation step, and data items flow in and out of a network of interconnected nodes. Unlike Spark, this amounts to passing the data to the process.
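To make that wiring concrete, here is a hedged sketch of a tiny Storm topology in Scala (assuming Storm 1.x+ org.apache.storm package names; the bolt logic and parallelism values are illustrative): the spout and bolt are processing steps fixed on cluster nodes, and data tuples flow between them one at a time.

```scala
import org.apache.storm.{Config, LocalCluster}
import org.apache.storm.testing.TestWordSpout
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.tuple.{Fields, Tuple, Values}

// One computation step: it stays put on the cluster while tuples flow through it.
class UppercaseBolt extends BaseBasicBolt {
  override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit =
    collector.emit(new Values(tuple.getString(0).toUpperCase))

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}

object TopologySketch {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    // The spout is the data source; the bolt is a downstream processing node.
    builder.setSpout("words", new TestWordSpout, 2)
    // shuffleGrouping: each emitted tuple is routed to one bolt instance.
    builder.setBolt("upper", new UppercaseBolt, 4).shuffleGrouping("words")

    val cluster = new LocalCluster
    cluster.submitTopology("sketch", new Config, builder.createTopology())
  }
}
```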
Both frameworks are used for parallel computation over large amounts of data.
Storm is better at dynamically processing a large number of small, freshly generated data items (for example, computing aggregation functions or analytics over a Twitter data stream in real time).
Spark, by contrast, works on existing, complete collections of data (such as Hadoop data) that have already been imported into Spark, relying on in-memory operation rather than disk I/O.
However, Spark's streaming module (Spark Streaming) is similar to Storm, in that both are stream-computing engines, although they are not exactly alike.
Spark Streaming collects incoming data into small batches (treated as immutable data) and then distributes those batches for processing, whereas Storm processes and distributes each data item in real time as soon as it arrives.
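A minimal Spark Streaming word count in Scala illustrates the micro-batch model (the socket source, port, and 2-second interval are illustrative assumptions): everything received within an interval is sealed into one immutable batch and processed as a small job.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MicroBatchSketch").setMaster("local[2]")
    // Every 2 seconds, the data received so far is sealed into one batch.
    val ssc = new StreamingContext(conf, Seconds(2))

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // runs once per micro-batch, not once per record

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```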
It is unclear which approach achieves higher data throughput, but Storm's computation latency is lower.
In summary, Spark and Storm are designed in opposite ways, while Spark Streaming is similar to Storm; the difference is that Spark Streaming provides built-in sliding windows over the data (see the sketch below), whereas with Storm the application must maintain the window itself.
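As a hedged illustration of those built-in windows (the source, window length, and slide interval are assumptions), the sketch below uses Spark Streaming's reduceByKeyAndWindow to count words over the last 60 seconds, recomputed every 10 seconds; the framework, not the application, tracks which batches fall inside the window.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("WindowSketch").setMaster("local[2]"),
      Seconds(10)) // 10-second micro-batches

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))

    // Counts over the last 60 seconds, recomputed every 10 seconds;
    // Spark Streaming itself maintains the sliding window of batches.
    val windowed = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
    windowed.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```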