| Feature | Storm (Trident) | Spark Streaming | Description |
| --- | --- | --- | --- |
| Parallel framework | DAG-based task-parallel continuous computation engine | Spark-based data-parallel general-purpose batch processing engine | |
| Data processing mode | One event (message) at a time; Trident: micro-batch, handling multiple events at once | Micro-batch, handling multiple events at once | A minimal Spark Streaming sketch follows the table |
| Latency | Sub-second; Trident: a few seconds | A few seconds | See the note on exactly-once state and latency below the table |
| Fault-tolerance semantics | At least once; Trident: exactly once | Exactly once | |
| Origin | BackType and Twitter | UC Berkeley | |
| Implementation language | Clojure | Scala | |
| API support | Java, Python, Ruby, etc. | Scala, Java, Python | |
| Platform integration | N/A (relies on ZooKeeper) | Part of Spark, so real-time and historical data processing can be unified and share code | |
| Production use, support | Storm has been around for several years and has run in production at Twitter, as well as at many other companies | Spark Streaming is a newer project; its production deployment (that I am aware of) has been at Sharethrough since 2013 | |
| Hadoop distribution support | Storm is the streaming solution in the Hortonworks Hadoop data platform | Spark Streaming is in both MapR's distribution and Cloudera's Enterprise data platform; Databricks also provides support | |
| Cluster integration, deployment | Depends on ZooKeeper; standalone or Mesos | Standalone, YARN, or Mesos | |
| Google Trends | (comparison chart not reproduced) | | |
| Bug burn-down chart | https://issues.apache.org/jira/browse/STORM/ | https://issues.apache.org/jira/browse/SPARK/ | Judging by the issue trackers, Spark problems appear to be resolved much more promptly than Storm's |

A note on exactly-once state and latency (adapted from the comment thread on Xinh Huynh's post, linked below). A reader asked why updating state in a persistent store such as Redis would be slower in Storm than in Spark Streaming, given the claim that Trident "relies on transactions to update state, which is slower and often has to be implemented by the user." The answer: to get exactly-once semantics with Trident, you must store a per-state transaction ID alongside each piece of state. In a word count, for example, each key-value pair looks like (key: word, value: count, txid). Before updating a count you read the old transaction ID to make sure the state is up to date, and that extra read adds latency; if the state lives in memory this may be acceptable, but if it has to go to disk the update becomes noticeably slower. Spark Streaming needs no per-state transaction ID; its fault tolerance instead relies on periodic checkpointing of RDDs every few seconds. With Storm Trident, persistence happens as each batch is processed, which by default is far more frequent than Spark's checkpoints, and in tuning either system there is a trade-off between persistence frequency and recovery time after a failure. For the details of Trident transactional processing, see http://storm.apache.org/documentation/Trident-state
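To make the transaction-ID pattern in that note concrete, here is a small, self-contained Scala sketch. This is not Trident's actual State API: the in-memory map stands in for an external store such as Redis, and the names (`StateStore`, `applyBatch`) are invented for illustration only.

```scala
import scala.collection.mutable

object TxidStateSketch {

  // Each key holds its running count together with the ID of the batch
  // (transaction) that last updated it: word -> (count, txid).
  final case class StateStore(
      backing: mutable.Map[String, (Long, Long)] = mutable.Map.empty) {

    def applyBatch(txid: Long, counts: Map[String, Long]): Unit =
      counts.foreach { case (word, delta) =>
        // Reading the stored txid before every write is the extra round trip
        // that adds latency, especially if the store has to go to disk.
        val (current, lastTxid) = backing.getOrElse(word, (0L, -1L))
        if (lastTxid != txid) {
          backing(word) = (current + delta, txid)
        }
        // else: this batch was already applied (e.g. replayed after a
        // failure), so skipping it keeps the update exactly-once.
      }
  }

  def main(args: Array[String]): Unit = {
    val store = StateStore()
    store.applyBatch(txid = 1, counts = Map("spark" -> 2L, "storm" -> 1L))
    store.applyBatch(txid = 1, counts = Map("spark" -> 2L, "storm" -> 1L)) // replay is ignored
    store.applyBatch(txid = 2, counts = Map("spark" -> 1L))
    println(store.backing) // spark -> (3,2), storm -> (1,1)
  }
}
```

The skip-if-already-applied check is what makes replayed batches idempotent, and the read-before-write is where the extra latency discussed in the note comes from.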
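For contrast, here is a minimal Spark Streaming (DStream API) word count in the micro-batch model from the table: each batch interval becomes one small job, and stateful operators recover from periodic checkpoints rather than user-managed transaction IDs. The local master, the socket source on port 9999, and the /tmp checkpoint directory are placeholders for local experimentation, not recommendations.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MicroBatchWordCount")

    // Every 2 seconds of input becomes one small batch job: the source of
    // the "a few seconds" latency in the comparison table.
    val ssc = new StreamingContext(conf, Seconds(2))

    // Periodic checkpointing backs the running state (and is required by
    // updateStateByKey); no per-key transaction ID is stored.
    ssc.checkpoint("/tmp/streaming-checkpoint")

    // Text source, e.g. started locally with `nc -lk 9999`.
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(word => (word, 1))

    // Running count per word across batches.
    val counts = words.updateStateByKey[Int] { (newValues: Seq[Int], running: Option[Int]) =>
      Some(running.getOrElse(0) + newValues.sum)
    }

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```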
The debate between Spark Streaming and Storm has a long history. For further reading, see:
http://xinhstechblog.blogspot.com/2014/06/storm-vs-spark-streaming-side-by-side.html
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
http://www.zdatainc.com/2014/09/apache-storm-apache-spark/