Big Data Architect: Which one should hadoop and storm choose?

Last Update:2014-09-10 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

First of all, we understand that hadoop is disk-level computing. When computing is performed, data needs to be read and written to the disk. Storm is memory-level computing, and data is directly imported to the memory through the network. The read/write memory is N orders of magnitude faster than the read/write disk speed. According to the Harvard cs61 courseware, the disk access delay is about 75000 times the memory access delay. Storm is faster.

Note:
1. latency: The time from data generation to calculation results. "fast" should mainly refer to this.
2. throughput refers to the amount of data processed by the system per unit time.

Storm's network direct transmission and memory computing have a much lower latency than hadoop's transmission through HDFS. When the computing model is suitable for streaming, Storm's stream processing, it saves time for batch processing to collect data, because storm is a service-type job and also saves the Job Scheduling latency. In terms of latency, storm is faster than hadoop.

From the principle perspective:

Hadoop M/R is based on HDFS and requires splitting input data, generating intermediate data files, sorting, data compression, and multi-Copy Replication, which is less efficient.
Storm is based on zeromq, a high-performance messaging library that does not persist data.

Why is storm faster than hadoop? The following is an Application Scenario:
In a typical scenario, thousands of log producers generate log files and some ETL operations need to be performed to store them in a database.

Assuming that hadoop is used, you need to first store the HDFS and split the granularity of a file every minute (this granularity is already extremely fine. If it is smaller, there will be a bunch of small files on HDFS ), when hadoop started computing, one minute had passed, and it took another minute to start scheduling the task, and then the job was running. If there were so many machines, it would take a few minutes, then it took a very short time to write the database hypothesis. In this way, it took at least two minutes from data generation to the end.
When streamcompute generates data, a program continuously monitors the generation of logs, generates a row and sends it to the streamcompute system through a transmission system, which then processes the data directly, after processing, data is directly written to the database. Each data entry can be written to the database within milliseconds when resources are sufficient.

Also, let's talk about another scenario:
If the wordcount of a large file is put on storm for stream processing, and storm outputs the result after all existing data processing is complete, you can compare it with hadoop at this time, in fact, the comparison is not latency, but throughput.

Certificate --------------------------------------------------------------------------------------------------------------------------------------------------

The main aspect: hadoop uses disks as the medium for intermediate exchange, while storm data is always transferred in memory.
The two fields are not exactly the same. One is batch processing, which is based on task scheduling; the other is real-time processing and stream-based.
Taking Water as an example, hadoop can be regarded as pure water and can be moved in a bucket. Storm uses a water pipe to pre-connect topology and then open the faucet, and the water will flow continuously.

Certificate ---------------------------------------------------------------------------------------------------------------------------------------------------

Storm's main engineer, Nathan marz, said: storm can easily write and expand complex real-time computing in a computer cluster. Storm is for real-time processing, just like hadoop is for batch processing. Storm ensures that every message is processed, and it is very fast-millions of messages can be processed per second in a small cluster. Even better, you can use any programming language for development.
Storm has the following features:
1. Simple programming model. Similar to mapreduce, it reduces the complexity of parallel batch processing and storm reduces the complexity of real-time processing.
2. You can use various programming languages. You can use various programming languages on top of storm. Clojure, Java, Ruby, and Python are supported by default. To support other languages, you only need to implement a simple storm communication protocol.
3. Fault tolerance. Storm manages work processes and node faults.
4. horizontal scaling. Computing is performed in parallel between multiple threads, processes, and servers.
5. reliable message processing. Storm ensures that each message can be processed completely at least once. When a task fails, it is responsible for retrying messages from the message source.
6. Fast. The system design ensures that messages can be processed quickly and uses MQ as its underlying message queue.
7. Local Mode. Storm has a "Local Mode" that can fully simulate a storm cluster during processing. This allows you to quickly perform development and unit testing.

Certificate ---------------------------------------------------------------------------------------------------------------------------------------------------------------

When resources are the same, storm generally has a lower latency than mapreduce. However, the throughput is lower than that of mapreduce. Storm is a typical stream computing system, while mapreduce is a typical batch processing system. The following is the flow of the stream computing and batch processing system.

This data processing process can be divided into three phases:
1. Data collection and preparation
2. For data computing (involving intermediate storage in computing), the "decisions made by those aspects" in the subject should mainly refer to the processing method at this stage.
3. Data Result Presentation (feedback)

1) in the data collection phase, the current typical processing strategy: the data generation system generally comes from page hitting and parsing dB logs. streamcompute will queue messages (such as kafaka, metaq, timetunle. The batch processing system generally collects data into distributed file systems (such as HDFS), and of course uses message queues. Currently, we call message queues and file systems pre-processing storage. There is no big difference between the two in terms of latency and throughput. There is a big difference from this pre-processing storage to the data computing stage, streamcompute generally reads data from the message queue in the streamcompute System (Storm) in real time for computation. The batch processing system typically collects a large number of data and imports the data to the computing system (hadoop) in batches ), there is a difference in latency.
2) In the data computing stage, the low latency of the stream Computing System (Storm) mainly involves several aspects (for subject issues)
A: The Storm Process is resident and can process data in real time.
After a batch of mapreduce data is collected, the Job Management System starts the task, jobtracker allocates the computing task, and tasktacker starts the related computing process.
B: stom data is directly transmitted between each computing unit through the network (zeromq.
The result of the mapreduce map task operation is written to HDFS, because the reduce task is dragged to the operation through the network. It is relatively slow to read and write more disks.
C: complex operations
Storm's calculation model directly supports DAG (directed acyclic graph)
Mapreduce requires multiple Mr processes. Some map operations are meaningless.

3) Data Result Presentation
Streamcompute generally reports the computing results directly to the final result set (display pages, databases, and search engine indexes ). However, mapreduce generally needs to import the results to the result set in batches after the entire operation is completed.

There is no essential difference between the actual streamcompute and the batch processing system. Storm Trident also has a batch concept, while mapreduce can reduce the dataset of each operation (for example, it can be started once in a few minutes ), facebook's Puma is a stream Computing System Based on hadoop.

Thinking: what projects are suitable for hadoop and what projects are suitable for storm

Big Data Architect: Which one should hadoop and storm choose?

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Big Data Architect: Which one should hadoop and storm choose?

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Big Data Architect: Which one should hadoop and storm choose?

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support