Spark, Storm, and Hadoop

Source: Internet
Author: User
Tags: hadoop, mapreduce

1. What is Storm, and what can it do?
Storm is an open-source distributed real-time computing system that can process large volumes of streaming data simply and reliably. Storm has many application scenarios, such as real-time analytics, online machine learning, continuous computing, distributed RPC, and ETL.
Storm scales horizontally and is highly fault tolerant. It guarantees that every message will be processed, and processed quickly (in a small cluster, each node can process millions of messages per second).
Storm is easy to deploy and operate, and, more importantly, you can develop your application in any programming language.

2. Does Storm have an advantage over Spark or Hadoop?
Storm, Spark, and Hadoop each have their own strengths and their own best-fit scenarios.
You should therefore choose the framework that matches your application scenario.

Storm is best suited to stream computing. It is written in Java and Clojure, and its strength is that computation happens entirely in memory, which is why its authors position it as a distributed real-time computing system. Storm's significance for real-time computing is comparable to Hadoop's significance for batch processing.
Storm's application scenarios:
1) Stream data processing
Storm can be used to process incoming messages continuously and write the results to a store after processing.
2) Distributed RPC
Because Storm's processing components are distributed and processing latency is extremely low, Storm can be used as a general-purpose distributed RPC framework.
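The stream data processing scenario can be illustrated with a minimal sketch in plain Python. It does not use Storm's API; the function names (`sentence_source`, `split_words`, `write_counts`) and the in-memory `store` dictionary are hypothetical stand-ins for a message source, a processing step, and a result store. The point is only that each message is handled as it arrives, rather than after the whole data set has been collected.

```python
from typing import Dict, Iterable, Iterator


def sentence_source(sentences: Iterable[str]) -> Iterator[str]:
    """Hypothetical message source: yields messages one at a time, like a stream."""
    for sentence in sentences:
        yield sentence


def split_words(messages: Iterable[str]) -> Iterator[str]:
    """Processing step: turns each incoming message into individual words."""
    for message in messages:
        for word in message.lower().split():
            yield word


def write_counts(words: Iterable[str], store: Dict[str, int]) -> None:
    """Final step: updates the result store as each word arrives."""
    for word in words:
        store[word] = store.get(word, 0) + 1


# A plain dictionary stands in for the external store (an assumption of this sketch).
store: Dict[str, int] = {}
incoming = ["storm processes streams", "spark processes batches"]
write_counts(split_words(sentence_source(incoming)), store)
print(store)
```

Because the steps are chained generators, no step waits for the full input; in a real deployment the steps would run as distributed components on different machines.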

Spark is an open-source cluster computing system based on in-memory computing, designed to make data analysis faster. It was developed in Scala by a small team led by Matei Zaharia at the AMP Lab at the University of California, Berkeley. Spark is a general-purpose parallel computing framework similar to Hadoop MapReduce: it implements distributed computing based on the MapReduce model and shares the advantages of Hadoop MapReduce. Unlike MapReduce, however, a job's intermediate output and results can be kept in memory, which eliminates the need to read and write HDFS between steps. Spark is therefore better suited to MapReduce-style algorithms that need to iterate, such as those used in data mining and machine learning.
Spark's application scenarios:
1) Applications that operate on a specific data set multiple times
Spark is a memory-based iterative computing framework for applications that need to operate on a particular data set many times. The more often the operation is repeated and the more data has to be read, the greater the benefit; when the data set is small but the computation is intensive, the benefit is relatively small.
2) Applications with coarse-grained state updates
Because of the characteristics of the RDD (Resilient Distributed Dataset), Spark is not suitable for applications that update state asynchronously and at a fine granularity, such as the storage layer of a web service or an incremental web crawler and indexer. In other words, it is a poor fit for application models based on incremental modification.
In general, Spark's range of application is broad and general-purpose.
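The benefit described in scenario 1 can be sketched in plain Python, without Spark itself. The `CountingStorage` class is a hypothetical stand-in for an expensive read from storage such as HDFS, and the iterative algorithm (gradient descent toward the mean of the data) is chosen only because it scans the same data set on every iteration. The sketch compares re-reading the data each time with reading it once and keeping it in memory.

```python
from typing import Iterable, List


class CountingStorage:
    """Hypothetical storage that counts how often the data set is read."""

    def __init__(self, values: Iterable[float]) -> None:
        self._values = list(values)
        self.reads = 0

    def read(self) -> List[float]:
        self.reads += 1
        return list(self._values)


def gradient_step(estimate: float, data: List[float], rate: float = 0.1) -> float:
    """One gradient-descent step that moves the estimate toward the mean of data."""
    gradient = sum(2 * (estimate - x) for x in data) / len(data)
    return estimate - rate * gradient


def run_reloading(storage: CountingStorage, iterations: int) -> float:
    """Reads the data set from storage again on every iteration."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = gradient_step(estimate, storage.read())
    return estimate


def run_cached(storage: CountingStorage, iterations: int) -> float:
    """Reads the data set once and reuses the in-memory copy."""
    data = storage.read()
    estimate = 0.0
    for _ in range(iterations):
        estimate = gradient_step(estimate, data)
    return estimate


slow = CountingStorage([1.0, 2.0, 3.0, 4.0])
fast = CountingStorage([1.0, 2.0, 3.0, 4.0])
reloaded = run_reloading(slow, 50)
cached = run_cached(fast, 50)
print(slow.reads, fast.reads)  # prints: 50 1
```

Both runs reach the same estimate, but the cached run touches storage once instead of fifty times, which is why the saving grows with the number of iterations and the size of the data.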

Hadoop implements the MapReduce model, which splits data into slices and computes on them in parallel in order to process large amounts of offline data. The data that Hadoop processes must already be stored in HDFS or in a database such as HBase, so Hadoop improves efficiency by moving the computation to the machines that hold the data.
Hadoop's application scenarios:
1) Offline analysis and processing of massive data
2) Large-scale web information search
3) Data-intensive parallel computing
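The MapReduce model that Hadoop implements can be sketched as a word count in plain Python. This is not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the two sample slices are assumptions made for illustration, and everything runs in a single process rather than on the machines that hold the data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(data_slice: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one slice of the data."""
    return [(word, 1) for word in data_slice.lower().split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for one slice of a large offline data set.
slices = ["hadoop stores data in hdfs", "hadoop moves computation to data"]
mapped = [pair for data_slice in slices for pair in map_phase(data_slice)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts["hadoop"], word_counts["data"])  # prints: 2 2
```

Since `map_phase` looks at only one slice at a time, the slices can be processed independently, which is what allows the work to be spread across many machines.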
