Spark, Storm, and Hadoop

Last Update:2015-06-28 Source: Internet

Author: User

Tags hadoop mapreduce


1. What is Storm, and what does it do well?
Storm is an open-source distributed real-time computation system that processes large data streams simply and reliably. It has many application scenarios, such as real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL.
Storm scales horizontally, is highly fault tolerant, and guarantees that every message will be processed, and processed quickly (in a small cluster, each node can handle upwards of a million messages per second).
Storm is easy to deploy and operate, and, more importantly, you can develop applications for it in any programming language.
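
The guarantee that every message will be processed can be pictured as an acknowledge-or-replay loop. The following is a minimal single-process sketch of that idea, not Storm's actual mechanism or API: the names process_with_replay, make_flaky_handler, and max_attempts are hypothetical, and a real cluster tracks acknowledgements across many machines.

```python
from collections import deque


def process_with_replay(messages, handler, max_attempts=3):
    """Feed each message to handler and re-queue it when handler raises.

    Returns (processed, dropped). A message can reach the handler more than
    once, which is the at-least-once behaviour this sketch illustrates.
    """
    pending = deque((msg, 1) for msg in messages)
    processed, dropped = [], []
    while pending:
        msg, attempt = pending.popleft()
        try:
            processed.append(handler(msg))  # success plays the role of an "ack"
        except Exception:
            if attempt < max_attempts:
                pending.append((msg, attempt + 1))  # failure: replay later
            else:
                dropped.append(msg)  # give up after max_attempts tries
    return processed, dropped


def make_flaky_handler():
    """Hypothetical handler that fails the first time it sees an odd number."""
    seen = set()

    def handler(n):
        if n % 2 == 1 and n not in seen:
            seen.add(n)
            raise RuntimeError("transient failure")
        return n * 10

    return handler


if __name__ == "__main__":
    done, lost = process_with_replay([1, 2, 3, 4], make_flaky_handler())
    print(sorted(done), lost)
```

With the flaky handler above, every message ends up processed and none is dropped, although the odd-numbered ones are attempted twice. Because a replayed message can be seen more than once, handlers in this style should be safe to run repeatedly.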

2. Does Storm have an advantage over Spark or Hadoop?
Storm, Spark, and Hadoop each have their own strengths and their own best-fit scenarios.
The right framework therefore depends on the application scenario.

Storm is best suited to stream computing. It is written in Java and Clojure and computes entirely in memory, which is why its authors position it as a distributed real-time computation system. Storm is to real-time computing what Hadoop is to batch processing.
Storm's application scenarios:
1) Stream data processing
Storm can process incoming messages as they arrive and write the results to a store.
2) Distributed RPC
Because Storm's processing components are distributed and its processing latency is extremely low, it can serve as a general-purpose distributed RPC framework.
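
The stream-processing scenario above (read incoming messages, process them, write results to a store) can be sketched as a chain of small steps. This is an illustrative, single-process sketch using Python generators, not Storm code: the function names are hypothetical, and the Counter stands in for whatever result store a real deployment would use. The source and processing steps correspond only loosely to what Storm calls spouts and bolts.

```python
from collections import Counter


def sentence_source(lines):
    """Stand-in for a message source; yields one message at a time."""
    for line in lines:
        yield line


def split_words(sentences):
    """Processing step: turn each sentence into lowercase words."""
    for sentence in sentences:
        for word in sentence.lower().split():
            yield word


def count_into_store(words, store):
    """Final step: update the running count for each word as it arrives."""
    for word in words:
        store[word] += 1
    return store


if __name__ == "__main__":
    store = Counter()  # hypothetical result store
    stream = sentence_source(["Storm processes streams", "Spark processes batches"])
    count_into_store(split_words(stream), store)
    print(store.most_common(1))
```

Each message flows through the whole chain before the next one is read, so the store always reflects everything received so far rather than the output of a finished batch.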

Spark is an open-source cluster computing system based on in-memory computing, designed to make data analysis faster. It was developed in Scala by a small team led by Matei Zaharia at the AMPLab of the University of California, Berkeley. Spark is a general-purpose parallel computing framework similar to Hadoop MapReduce: it implements distributed computing based on the MapReduce model and shares the advantages of Hadoop MapReduce. Unlike MapReduce, however, Spark can keep a job's intermediate output and results in memory, so it does not need to read and write HDFS between steps. This makes Spark better suited to MapReduce-style algorithms that need to iterate, such as those used in data mining and machine learning.
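
The benefit of keeping data in memory for iterative algorithms can be illustrated without Spark at all. The sketch below is a plain-Python illustration under stated assumptions: CountingSource is a hypothetical stand-in for a data set on disk or HDFS, cached is a hypothetical wrapper that keeps the data in memory after the first read, and estimate_mean is a simple gradient-descent loop chosen only because it needs the full data set on every iteration.

```python
class CountingSource:
    """Hypothetical data source that counts how often it is read in full."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def read_all(self):
        self.reads += 1  # stands in for one full pass over disk or HDFS
        return list(self._values)


def estimate_mean(load, iterations=10, step=0.5):
    """Gradient descent on squared error; every iteration needs all the data."""
    guess = 0.0
    for _ in range(iterations):
        data = load()
        gradient = sum(guess - x for x in data) / len(data)
        guess -= step * gradient
    return guess


def cached(load):
    """Wrap a loader so the data is read once and then kept in memory."""
    memo = []

    def load_once():
        if not memo:
            memo.append(load())
        return memo[0]

    return load_once


if __name__ == "__main__":
    uncached_source = CountingSource([2.0, 4.0, 6.0])
    cached_source = CountingSource([2.0, 4.0, 6.0])
    print(estimate_mean(uncached_source.read_all), uncached_source.reads)
    print(estimate_mean(cached(cached_source.read_all)), cached_source.reads)
```

Both runs compute the same estimate, but with the default ten iterations the uncached run reads the source ten times and the cached run reads it once. The gap grows with the number of iterations, which is the pattern the paragraph above describes.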
Spark's application scenarios:
1) Applications that operate on a specific data set multiple times
Spark is an in-memory iterative computing framework, so it suits applications that perform many operations on the same data set. The more often the operations are repeated and the more data is read, the greater the benefit; when the data set is small but the computation is intensive, the benefit is smaller.
2) Applications with coarse-grained state updates
Because of the nature of RDDs, Spark is not suited to applications that make asynchronous, fine-grained updates to state, such as the storage layer of a web service or an incremental web crawler and indexer. In other words, it is a poor fit for application models built on incremental modification.
In general, Spark applies to a broader and more general range of scenarios.
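
The distinction between coarse-grained and fine-grained updates can be made concrete with a toy model. The sketch below is not Spark's RDD implementation or API; Dataset and its methods are hypothetical names. It assumes only the general idea that a data set is immutable, that each new data set is derived by applying one function to every element, and that the derivation steps are remembered so a result can be recomputed from the source.

```python
class Dataset:
    """Toy immutable data set that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = tuple(source) if source is not None else None
        self._parent = parent
        self._transform = transform

    def map(self, fn):
        # Coarse-grained: the same function is applied to every element.
        return Dataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return Dataset(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        """Recompute the result from the original source by replaying lineage."""
        if self._parent is None:
            return list(self._source)
        return self._transform(self._parent.collect())

    def lineage_depth(self):
        """Number of transformations between this data set and its source."""
        return 0 if self._parent is None else 1 + self._parent.lineage_depth()


if __name__ == "__main__":
    base = Dataset(source=[1, 2, 3, 4])
    even_squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
    print(even_squares.collect(), base.collect(), even_squares.lineage_depth())
```

There is deliberately no method for changing a single element in place. In this model an application that needs to update one record at a time would have to derive a whole new data set for each change, which is why incremental workloads fit it poorly.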

Hadoop implements the MapReduce model, which processes large volumes of offline data by splitting the data into slices and computing on the slices in parallel. The data Hadoop processes must already be stored in HDFS or in a database such as HBase, so Hadoop improves efficiency by moving the computation to the machines that hold the data.
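
The MapReduce model described above can be sketched in a few lines. This is a single-process illustration of the map, shuffle, and reduce phases using word counting as the example; it does not use Hadoop's API, the function names are hypothetical, and each inner list stands in for one data slice that a real cluster would process on the machine that stores it.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one data slice."""
    for line in split:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values collected for each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(splits):
    """Run map over every slice, then shuffle and reduce the combined output."""
    pairs = (pair for split in splits for pair in map_phase(split))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    splits = [["hadoop stores data in hdfs"], ["hadoop moves computation to data"]]
    print(word_count(splits))
```

Each slice is mapped independently of the others, which is what allows the map phase to run in parallel next to the data; only the shuffle step needs to bring values for the same key together.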
Hadoop's application scenarios:
1) Offline analysis and processing of massive data
2) Large-scale web information search
3) Data-intensive parallel computing


This article is an English translation of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
