| Feature | Storm (Trident) | Spark Streaming | Description |
| --- | --- | --- | --- |
| Parallel framework | DAG-based task-parallel continuous computation engine | Spark-based data-parallel general-purpose batch processing engine | |
| Data processing mode | One event (message) at a time; Trident: micro-batch, handling multiple events at once | Micro-batch, handling multiple events at once | A minimal Spark Streaming sketch follows the table |
| Latency | Sub-second; Trident: a few seconds | A few seconds | See the note on exactly-once state and latency below the table |
| Fault-tolerance semantics | At least once; Trident: exactly once | Exactly once | |
| Origin | BackType and Twitter | UC Berkeley | |
| Implementation language | Clojure | Scala | |
| API support | Java, Python, Ruby, etc. | Scala, Java, Python | |
| Platform integration | N/A (relies on ZooKeeper) | Part of Spark, so real-time and historical data processing can be unified and share code | |
| Production use, support | Storm has been around for several years and has run in production at Twitter, as well as at many other companies | Spark Streaming is a newer project; its production deployment (that I am aware of) has been at Sharethrough since 2013 | |
| Hadoop distribution support | Storm is the streaming solution in the Hortonworks Hadoop data platform | Spark Streaming is in both MapR's distribution and Cloudera's Enterprise data platform; Databricks also provides support | |
| Cluster integration, deployment | Depends on ZooKeeper; standalone or Mesos | Standalone, YARN, or Mesos | |
| Google Trends | (comparison chart not reproduced) | | |
| Bug burn-down chart | https://issues.apache.org/jira/browse/STORM/ | https://issues.apache.org/jira/browse/SPARK/ | Judging by the issue trackers, Spark problems appear to be resolved much more promptly than Storm's |

A note on exactly-once state and latency (adapted from the comment thread on Xinh Huynh's post, linked below). A reader asked why updating state in a persistent store such as Redis would be slower in Storm than in Spark Streaming, given the claim that Trident "relies on transactions to update state, which is slower and often has to be implemented by the user." The answer: to get exactly-once semantics with Trident, you must store a per-state transaction ID alongside each piece of state. In a word count, for example, each key-value pair looks like (key: word, value: count, txid). Before updating a count you read the old transaction ID to make sure the state is up to date, and that extra read adds latency; if the state lives in memory this may be acceptable, but if it has to go to disk the update becomes noticeably slower. Spark Streaming needs no per-state transaction ID; its fault tolerance instead relies on periodic checkpointing of RDDs every few seconds. With Storm Trident, persistence happens as each batch is processed, which by default is far more frequent than Spark's checkpoints, and in tuning either system there is a trade-off between persistence frequency and recovery time after a failure. For the details of Trident transactional processing, see http://storm.apache.org/documentation/Trident-state
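To make the transaction-ID pattern in that note concrete, here is a small, self-contained Scala sketch. This is not Trident's actual State API: the in-memory map stands in for an external store such as Redis, and the names (`StateStore`, `applyBatch`) are invented for illustration only.

```scala
import scala.collection.mutable

object TxidStateSketch {

  // Each key holds its running count together with the ID of the batch
  // (transaction) that last updated it: word -> (count, txid).
  final case class StateStore(
      backing: mutable.Map[String, (Long, Long)] = mutable.Map.empty) {

    def applyBatch(txid: Long, counts: Map[String, Long]): Unit =
      counts.foreach { case (word, delta) =>
        // Reading the stored txid before every write is the extra round trip
        // that adds latency, especially if the store has to go to disk.
        val (current, lastTxid) = backing.getOrElse(word, (0L, -1L))
        if (lastTxid != txid) {
          backing(word) = (current + delta, txid)
        }
        // else: this batch was already applied (e.g. replayed after a
        // failure), so skipping it keeps the update exactly-once.
      }
  }

  def main(args: Array[String]): Unit = {
    val store = StateStore()
    store.applyBatch(txid = 1, counts = Map("spark" -> 2L, "storm" -> 1L))
    store.applyBatch(txid = 1, counts = Map("spark" -> 2L, "storm" -> 1L)) // replay is ignored
    store.applyBatch(txid = 2, counts = Map("spark" -> 1L))
    println(store.backing) // spark -> (3,2), storm -> (1,1)
  }
}
```

The skip-if-already-applied check is what makes replayed batches idempotent, and the read-before-write is where the extra latency discussed in the note comes from.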
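For contrast, here is a minimal Spark Streaming (DStream API) word count in the micro-batch model from the table: each batch interval becomes one small job, and stateful operators recover from periodic checkpoints rather than user-managed transaction IDs. The local master, the socket source on port 9999, and the /tmp checkpoint directory are placeholders for local experimentation, not recommendations.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MicroBatchWordCount")

    // Every 2 seconds of input becomes one small batch job: the source of
    // the "a few seconds" latency in the comparison table.
    val ssc = new StreamingContext(conf, Seconds(2))

    // Periodic checkpointing backs the running state (and is required by
    // updateStateByKey); no per-key transaction ID is stored.
    ssc.checkpoint("/tmp/streaming-checkpoint")

    // Text source, e.g. started locally with `nc -lk 9999`.
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(word => (word, 1))

    // Running count per word across batches.
    val counts = words.updateStateByKey[Int] { (newValues: Seq[Int], running: Option[Int]) =>
      Some(running.getOrElse(0) + newValues.sum)
    }

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```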
The debate between Spark Streaming and Storm has a long history. For further reading, see:
http://xinhstechblog.blogspot.com/2014/06/storm-vs-spark-streaming-side-by-side.html
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
http://www.zdatainc.com/2014/09/apache-storm-apache-spark/