Spark Streaming vs. Storm

Source: Internet
Author: User
Tags documentation redis zookeeper


Feature comparison: Storm (Trident) vs. Spark Streaming

Parallel framework
Storm: task-parallel continuous computation engine based on a DAG
Spark Streaming: data-parallel, general-purpose batch processing engine built on Spark

Data processing mode
Storm: one at a time; each event (message) is processed as it arrives
Storm Trident: micro-batch; multiple events are handled at once
Spark Streaming: micro-batch; multiple events are handled at once
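
The difference between the two processing modes can be sketched in a few lines of plain Python. This is only an illustration, not Storm or Spark code: the function names are hypothetical, and the batches are cut by event count for simplicity, whereas a real micro-batch engine typically cuts them by time interval.

```python
from typing import Callable, Iterable, Iterator, List


def process_one_at_a_time(events: Iterable[str],
                          handle: Callable[[str], None]) -> int:
    """Call the handler once per event, as soon as the event arrives."""
    calls = 0
    for event in events:
        handle(event)
        calls += 1
    return calls


def micro_batches(events: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group the stream into lists of at most batch_size events.

    Assumption: batches are cut by count to keep the sketch short.
    """
    batch: List[str] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def process_micro_batch(events: Iterable[str], batch_size: int,
                        handle_batch: Callable[[List[str]], None]) -> int:
    """Call the handler once per batch rather than once per event."""
    calls = 0
    for batch in micro_batches(events, batch_size):
        handle_batch(batch)
        calls += 1
    return calls
```

With five events and a batch size of two, the one-at-a-time path invokes its handler five times, while the micro-batch path invokes its handler three times. Fewer handler calls usually means less per-event overhead, at the cost of each event waiting for its batch to fill.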

Latency
Storm: less than one second
Storm Trident: a few seconds
Spark Streaming: a few seconds

Josh, December 8 at 4:23 PM
Thanks for the article!
Could you explain this point in a bit more detail? "But, it relies on transactions to update state, which is slower and often has to be implemented by the user."
If I want to write my output to a persistent store, e.g. Redis, why would it be slower in Storm than in Spark Streaming?

Replies

Xinh Huynh, December at 5:24 PM
Hi Josh, check out the slide about Storm/Trident here: http://spark-summit.org/wp-conte ... Spark-streaming.pdf
If you want exactly-once semantics with Trident, you have to store a per-state transaction ID for each state. For example, in word count, for each word you would store both the count and a transaction ID; each key-value pair would look like (key: word, value: count, txid). Before updating the count, you would read the old transaction ID to make sure it is up to date, and this read causes extra latency. If you are using Redis in memory, that might be okay, but if the read had to go to disk, it would add noticeable latency to the update. In Spark, by contrast, you don't have to store a per-state transaction ID.
For the details of Trident transactional processing, see http://storm.apache.org/documentation/Trident-state

Josh, December at 9:18 AM
Hi Xinh, thanks for the explanation. I see. Isn't this similar to Spark checkpointing, where it saves state to HDFS every ~ seconds? Or is your point that with Storm it would (by default) persist the state much more frequently than Spark?

Xinh Huynh, December at 11:43 AM
Hi Josh, yes, the fault tolerance in Spark involves periodic (~ second) checkpointing of RDDs. Yes, my point was that with Storm Trident the persistence occurs when each batch is processed, and by default that occurs a lot more often than once every ~ seconds. And in tuning any of these parameters, there is a tradeoff between the frequency of persistence and the recovery time in case of failure.
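
The per-key transaction ID check described in the replies above can be sketched as follows. This is a minimal illustration in plain Python, not Trident's actual API: the class `TransactionalWordCount` and its methods are hypothetical, a dict stands in for a store such as Redis, and the sketch assumes that batches arrive in transaction ID order, that IDs are non-negative, and that a replayed batch contains exactly the same words as the original.

```python
from typing import Dict, List, Tuple


class TransactionalWordCount:
    """In-memory stand-in for a key-value store such as Redis (hypothetical)."""

    def __init__(self) -> None:
        # word -> (count, txid of the last batch applied to this word)
        self.state: Dict[str, Tuple[int, int]] = {}

    def apply_batch(self, txid: int, words: List[str]) -> None:
        """Apply one batch so that replaying the same txid has no effect."""
        # Aggregate inside the batch first, so each key is read and
        # written once per batch.
        increments: Dict[str, int] = {}
        for word in words:
            increments[word] = increments.get(word, 0) + 1

        for word, inc in increments.items():
            # This read of the stored txid is the extra round trip that
            # adds latency when the store is slow.
            count, last_txid = self.state.get(word, (0, -1))
            if last_txid == txid:
                # The batch was already applied to this key: skip the replay.
                continue
            self.state[word] = (count + inc, txid)

    def count(self, word: str) -> int:
        return self.state.get(word, (0, -1))[0]
```

Applying batch 1 twice (a replay after a failure) leaves the counts unchanged the second time, which is what gives exactly-once results on top of at-least-once delivery. The price is one read per key per batch before every write.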


Fault tolerance
Storm: at least once
Storm Trident: exactly once
Spark Streaming: exactly once

Origin
Storm: BackType and Twitter
Spark Streaming: UC Berkeley

Implementation language
Storm: Clojure
Spark Streaming: Scala

API support
Storm: Java, Python, Ruby, etc.
Spark Streaming: Scala, Java, Python

Platform integration
Storm: N/A (based on ZooKeeper)
Spark Streaming: Spark (so you can unify, or share, the processing of real-time and historical data)

Production use and support
Storm: has been around for several years and has run in production at Twitter, as well as at many other companies
Spark Streaming: a newer project; the production deployment that I am aware of has been at Sharethrough since 2013

Hadoop distribution support
Storm: the streaming solution in the Hortonworks Hadoop data platform
Spark Streaming: included in both MapR's distribution and Cloudera's Enterprise data platform; also supported by Databricks

Cluster integration and deployment
Storm: depends on ZooKeeper; standalone, Mesos
Spark Streaming: standalone, YARN, Mesos

Google trend



Bug Burn Chart

https://issues.apache.org/jira/browse/STORM/

https://issues.apache.org/jira/browse/SPARK/
Judging by the two issue trackers, Spark issues are visibly resolved much more promptly than Storm issues.











The debate between Spark Streaming and Storm has a long history.
References:
http://xinhstechblog.blogspot.com/2014/06/storm-vs-spark-streaming-side-by-side.html
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming

http://www.zdatainc.com/2014/09/apache-storm-apache-spark/


