Big Data Architect: Hadoop or Storm, which one to choose?

Source: Internet
Author: User
Keywords: stream computing, disk, DFS

First of all: Hadoop is disk-level computing; while computing, the data sits on disk, so it must read and write the disk. Storm is memory-level computing; data is imported into memory directly over the network. Reading and writing memory is orders of magnitude faster than reading and writing disk. According to the Harvard CS61 courseware, disk access latency is about 75,000 times the latency of memory access. So Storm is faster.

Comments:

1. Latency refers to the time from when a piece of data is produced to when its result is available; "fast" mainly refers to this.

2. Throughput refers to the amount of data the system processes per unit of time.
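The difference between the two definitions can be made concrete with back-of-envelope arithmetic (a toy model, not a benchmark; the arrival gap and per-record cost are invented numbers):

```python
# Toy model: the same 100 records, processed as one batch vs. one at a time,
# to show how latency and throughput are different questions.

RECORDS = list(range(100))
ARRIVAL_GAP = 0.01       # a record is produced every 10 ms (assumed)
WORK_PER_RECORD = 0.001  # each record takes 1 ms of compute (assumed)

# Batch: wait until all records have arrived, then process them together.
batch_collect = ARRIVAL_GAP * len(RECORDS)           # 1.0 s spent collecting
batch_compute = WORK_PER_RECORD * len(RECORDS)       # 0.1 s spent computing
batch_latency = batch_collect + batch_compute        # even record #0 waits this long

# Stream: each record is processed the moment it arrives.
stream_latency = WORK_PER_RECORD                     # ~1 ms per record

print(f"batch latency for any record:  {batch_latency:.3f}s")
print(f"stream latency for any record: {stream_latency:.3f}s")
```

Both paths do the same total work; what changes is how long any single record waits before its result exists.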

Storm passes data directly over the network and computes in memory, so its latency is necessarily much lower than Hadoop's, which moves data through HDFS. When the computation model suits streaming, Storm's stream processing also eliminates the time spent collecting data into batches, and because Storm runs as a resident service, it eliminates job-scheduling delay as well. So in terms of latency, Storm is faster than Hadoop.

In terms of principle:

Hadoop MapReduce is based on HDFS: it needs to split the input data, produce intermediate data files, and sort, compress, and replicate data, so it is less efficient.

Storm is based on ZeroMQ, a high-performance message-passing library, and does not persist data.

Why is Storm faster than Hadoop? Here is a scenario.

Take a typical scenario: thousands of log producers generate log files, and some ETL operations need to store the results into a database.

Suppose Hadoop is used. The data first needs to be written into HDFS, cut at a granularity of one file per minute (this granularity is already extremely fine; any finer and HDFS ends up with a heap of small files). By the time Hadoop starts computing, one minute has already passed; scheduling the job takes roughly another minute. Then the job runs; assuming plenty of machines, the computation finishes in a few seconds, and writing to the database also takes a little time. So from the moment the data is generated to the moment it can be used, at least two minutes have passed.

With stream computing, by contrast, a program continuously monitors the log as it is produced; each line generated is sent through a transport system to the stream computing system, which processes it directly and, once done, writes the result straight to the database. With sufficient resources, each record can go from generation to database write at millisecond latency.
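The streaming ETL path just described can be sketched in a few lines: a generator stands in for the log tailer, a function for the ETL step, and a list for the database. All names here are illustrative, not any real Storm API.

```python
def tail_log(lines):
    """Stand-in for a process that watches the log and emits each new line."""
    for line in lines:
        yield line

def etl(line):
    """Toy ETL step: parse 'user,action' records and normalise the action."""
    user, action = line.split(",")
    return {"user": user.strip(), "action": action.strip().lower()}

database = []  # stand-in for the real sink (MySQL, HBase, ...)

raw_lines = ["alice, CLICK", "bob, VIEW", "alice, BUY"]
for line in tail_log(raw_lines):
    record = etl(line)       # processed the moment the line "arrives"
    database.append(record)  # written out immediately, no batch wait

print(database[0])
```

The point of the shape, not the code: there is no collect-then-schedule-then-run phase anywhere in the loop, so per-record latency is just the cost of `etl` plus the write.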

Now consider another scenario:

Take a wordcount over a large file and run it on Storm as a stream, waiting until all the data has been processed before letting Storm output the result. Comparing its speed with Hadoop's at that point is in fact no longer a comparison of latency, but a comparison of throughput.
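The wordcount comparison can be sketched directly: the "batch" version sees the whole input at once and produces results only at the end, while the "stream" version updates counts line by line and could emit a partial result at any point. Both arrive at the same answer; the difference is when results become available.

```python
from collections import Counter

text = ["the quick brown fox", "the lazy dog", "the fox"]

# Batch style: one pass over the complete input, results only at the end.
batch_counts = Counter(word for line in text for word in line.split())

# Stream style: counts are live after every line.
stream_counts = Counter()
for line in text:  # each line "arrives" one at a time
    stream_counts.update(line.split())
    # at this point stream_counts already holds a usable partial result

print(batch_counts == stream_counts)  # same final answer either way
```

If you only ever look at the final answer, as in the wordcount-a-whole-file case, the early availability of partial results buys nothing, and what matters is how many words per second each system can chew through.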

--------------------------------------------------------------------------------

The main point: Hadoop uses disk as the medium for exchanging intermediate data, while Storm's data flows through memory the whole time.

The two also target different domains: one is batch processing, based on task scheduling; the other is real-time processing, based on streams.

To use water as an analogy: Hadoop is like bottled water, moved a bucket at a time; Storm is like a water pipe. You lay out the topology once in advance, then open the faucet and the water flows out continuously.

--------------------------------------------------------------------------------

Storm's author, Nathan Marz, put it this way: Storm makes it easy to write and scale complex real-time computations on a cluster of machines; Storm is to real-time processing what Hadoop is to batch processing. Storm guarantees that every message will be processed, and it is fast: a small cluster can handle millions of messages per second. Better still, you can develop in any programming language.

The main features of Storm are as follows:

1. Simple programming model. Just as MapReduce reduces the complexity of parallel batch processing, Storm reduces the complexity of real-time processing.

2. Multiple programming languages. You can build on top of Storm in many languages; Clojure, Java, Ruby, and Python are supported by default. To add support for another language, you only need to implement a simple Storm communication protocol.

3. Fault tolerance. Storm manages the failure of worker processes and nodes.

4. Horizontal scalability. Computations run in parallel across multiple threads, processes, and servers.

5. Reliable message processing. Storm guarantees that each message is fully processed at least once. When a task fails, it takes care of retrying the message from the message source.

6. Fast. The system is designed so that messages are processed quickly, using ØMQ as the underlying message queue.

7. Local mode. Storm has a "local mode" that fully simulates a Storm cluster in-process, which lets you develop and unit test quickly.
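Point 5, at-least-once delivery, can be sketched as an ack/retry loop: the source keeps a message pending until the consumer acknowledges it, and re-delivers on failure. This is a toy model of the semantics, not Storm's actual acker mechanism.

```python
import collections

def deliver_at_least_once(messages, handler, max_retries=3):
    """Replay each message until handler succeeds (or retries run out)."""
    delivered = []
    pending = collections.deque(messages)
    retries = collections.Counter()
    while pending:
        msg = pending.popleft()
        try:
            handler(msg)
            delivered.append(msg)       # handler succeeded -> ack
        except Exception:
            retries[msg] += 1
            if retries[msg] <= max_retries:
                pending.append(msg)     # nack -> replay from the source
    return delivered

# A handler that fails the first time it sees "b": "b" still gets processed,
# but the handler observes it twice -- hence "at least once", not "exactly once".
seen = collections.Counter()
def flaky(msg):
    seen[msg] += 1
    if msg == "b" and seen[msg] == 1:
        raise RuntimeError("transient failure")

result = deliver_at_least_once(["a", "b", "c"], flaky)
print(result)  # ['a', 'c', 'b'] -- "b" succeeds on its second delivery
```

This is also why downstream consumers of an at-least-once system are usually written to be idempotent: duplicates are possible by design.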

--------------------------------------------------------------------------------

With the same resource consumption, Storm's latency is generally lower than MapReduce's, but its throughput is also lower. Storm is a typical stream computing system, and MapReduce is a typical batch system. Below, the processing flows of stream computing and batch systems are compared.

The data processing flow can be roughly divided into three phases:

1. Data collection and preparation

2. Data computation (including intermediate storage during the computation); the design aspects raised in the original question mainly refer to this phase.

3. Presentation of results (feedback)

1. Data collection phase. Typical strategies today: data is generally produced by page instrumentation and by parsing DB logs. Stream computing collects data into a message queue (such as Kafka, MetaQ, or TimeTunnel), while batch systems typically collect data into a distributed file system (such as HDFS), though they can of course also use message queues. For the moment, call the message queue or file system the "preprocessing storage". The two differ little in latency and throughput at this stage; the big difference comes in moving from this preprocessing storage into the computation phase. Stream computing generally reads data from the message queue into the stream computing system (Storm) in real time, while batch systems generally accumulate a large batch before importing it into the computing system (Hadoop). This is where the latency difference arises.
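The two "preprocessing storage" paths can be sketched with standard-library stand-ins: a `queue.Queue` plays the role of Kafka/MetaQ, and a temporary file plays the role of HDFS. The queue consumer can see each record as soon as it lands; the file reader only sees records once the whole file has been written and handed over.

```python
import os
import queue
import tempfile

records = ["r1", "r2", "r3"]

# Stream path: producer puts each record on a queue; the consumer can read
# per item, as soon as each put() happens.
q = queue.Queue()
for r in records:
    q.put(r)               # visible to the consumer immediately
stream_seen = []
while not q.empty():
    stream_seen.append(q.get())

# Batch path: producer appends to a file; the consumer reads only after the
# file is closed and handed over as a complete unit.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    for r in records:
        f.write(r + "\n")  # nothing downstream sees this yet
with open(path) as f:
    batch_seen = f.read().split()
os.remove(path)

print(stream_seen == batch_seen)  # same data either way
```

Same records, same order; what differs is the earliest moment the computation phase could have started on each one.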

2. Data computation phase. The stream computing system's (Storm's) low latency mainly comes from a few factors (addressing the original question):

A: Storm processes are long-running, so data can be processed in real time.

MapReduce, by contrast, accumulates data into a batch before the job management system starts the task: the JobTracker assigns compute tasks, and TaskTrackers start the corresponding processes.

B: Storm transmits data between computing units directly over the network (via ZeroMQ).

The results of MapReduce map tasks are written to HDFS, and the reduce tasks then pull them over the network. That means comparatively more disk reading and writing, which is comparatively slow.

C: For complex computations:

Storm's model directly supports DAGs (directed acyclic graphs).

MapReduce has to express the same computation as several chained MR jobs, some of whose map steps do no meaningful work.
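The DAG point can be sketched with a tiny toy topology runner: nodes are processing steps, edges say who feeds whom, and a Kahn-style topological pass runs each step once all its inputs are ready. Storm topologies are DAGs of this shape expressed once; a multi-stage MapReduce flow has to encode the same shape as several chained jobs with an HDFS write between each pair. This runner is illustrative only, not any real Storm API.

```python
def run_dag(nodes, edges, source_value):
    """nodes: name -> fn(list_of_inputs); edges: (upstream, downstream) pairs."""
    inputs = {name: [] for name in nodes}
    indegree = {name: 0 for name in nodes}
    for up, down in edges:
        indegree[down] += 1
    ready = [n for n in nodes if indegree[n] == 0]  # sources first
    results = {}
    while ready:
        node = ready.pop()
        # Sources get the external value; everything else gets upstream outputs.
        value = nodes[node](inputs[node] or [source_value])
        results[node] = value
        for up, down in edges:
            if up == node:
                inputs[down].append(value)
                indegree[down] -= 1
                if indegree[down] == 0:
                    ready.append(down)
    return results

# A small diamond: split -> (double, square) -> merge.
nodes = {
    "split":  lambda xs: xs[0],
    "double": lambda xs: xs[0] * 2,
    "square": lambda xs: xs[0] ** 2,
    "merge":  lambda xs: sum(xs),
}
edges = [("split", "double"), ("split", "square"),
         ("double", "merge"), ("square", "merge")]

results = run_dag(nodes, edges, 3)
print(results["merge"])  # 3*2 + 3**2 = 15
```

Expressing this diamond in classic MapReduce would take at least two chained jobs, with the fan-out and fan-in forced through intermediate files.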

3. Presentation of results.

Stream computing generally feeds results directly into the final result store (a display page, database, or search-engine index) as they are produced. MapReduce generally imports results into the result store only after the whole computation completes.

In fact there is no essential difference between stream computing and batch systems. Trident, built on Storm, also has a batch concept; MapReduce can shrink the dataset of each run (for example, to a few minutes of data); and Facebook's Puma is a stream computing system built on Hadoop.
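The micro-batching idea that blurs the two models, as in Trident, can be sketched in a few lines: cut the stream into small fixed-size batches and process each batch as a unit. This is a toy sketch of the concept, not the Trident API.

```python
def micro_batches(stream, batch_size):
    """Group an incoming stream into fixed-size batches (last one may be short)."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:            # flush the final partial batch
        yield batch

batches = list(micro_batches(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Tuning `batch_size` slides the system along the spectrum: a batch of 1 behaves like pure streaming (lowest latency), a huge batch behaves like MapReduce (highest throughput per unit of overhead).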

Original link: http://doc.okbase.net/sphl520/archive/98505.html
