Analysis of the Reason Why Hadoop is not suitable for processing Real-time Data

Last Update:2015-02-27 Source: Internet

Author: User

1. Overview

Hadoop is widely regarded as the undisputed king of big data analysis. It focuses on batch processing. That model is sufficient in many cases (for example, building an index of web pages), but other use cases need real-time information from highly dynamic sources. To address this, we can use Storm, which was open-sourced by Twitter. Storm does not process static data; it processes continuous streams of data that are expected to keep arriving. Considering that Twitter users generate 140 million tweets every day, it is easy to see how useful this technology is.

Storm is not just a traditional big data analysis system: it is an example of a complex event processing (CEP) system. CEP systems are generally classified as computation-oriented or detection-oriented, and either kind can be implemented in Storm through user-defined algorithms. For example, CEP can be used to identify meaningful events in a flood of events and then act on them in real time.

2. Why is Hadoop not suitable for real-time computing?

"Not suitable" is a relative notion. If the business does not have strict latency requirements, the problem does not arise. In practice, however, some enterprise workloads do have strict latency requirements. The reasons are as follows:

2.1 Latency

Storm transmits data directly over the network and computes in memory, which gives it much lower latency than Hadoop's transfers through HDFS. When the computing model suits streaming, Storm's stream processing also saves the time a batch system spends collecting data. And because a Storm job runs as a long-lived service, it avoids job scheduling latency as well. From a latency perspective Storm is faster than Hadoop, so it is better suited to real-time stream processing. The business scenario below illustrates the latency problem.

2.1.1 Business scenario

Thousands of log producers generate log files, and these logs must go through ETL and be stored in a database.

Consider this scenario under both Hadoop and Storm. With Hadoop, the logs must first be stored in HDFS, cut into one file per minute for computation. (Second-level granularity is not achievable; a minute is the smallest practical granularity.) This is already extremely fine: anything smaller leaves a pile of small files on HDFS. By the time Hadoop starts computing, one minute has passed, and scheduling the job takes roughly another minute. Then the job runs. Assuming the cluster is large, the computation takes a few seconds, and writing to the database takes a little more time (in the ideal case). In total, at least two minutes pass between data generation and the final result.

With stream computing, a program continuously monitors the logs as they are produced. Each new line is sent through a transport system to the stream computing system, which processes it directly and writes the result to the database. When resources are sufficient (the cluster is large), each record can be completed in milliseconds.
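
The arithmetic behind the two paths can be sketched as follows. The one-minute file window and the roughly one-minute scheduling delay come from the scenario above; the compute time, database write time, and per-record millisecond costs are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
# Back-of-the-envelope latency model for the log ETL scenario.
# All default figures are illustrative assumptions, not measurements.

def batch_latency_s(window_s=60.0, scheduling_s=60.0, compute_s=5.0, db_write_s=1.0):
    """Worst-case delay for a record written at the start of a batch window."""
    return window_s + scheduling_s + compute_s + db_write_s

def stream_latency_s(transport_ms=2.0, compute_ms=1.0, db_write_ms=2.0):
    """Delay for a single record flowing through a stream pipeline."""
    return (transport_ms + compute_ms + db_write_ms) / 1000.0

batch = batch_latency_s()
stream = stream_latency_s()
print(f"batch:  {batch:.1f} s")
print(f"stream: {stream * 1000:.1f} ms")
```

Whatever the exact figures, the batch path cannot drop below the window length plus the scheduling delay, while the stream path has no such floor.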

2.1.2 Throughput

In terms of throughput, Hadoop has the advantage over Storm. Because Hadoop is built for batch processing, its throughput is higher than that of Storm's stream processing.
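
One way to see why batching helps throughput is to model a fixed overhead (network round trip, serialization, acknowledgement) that a batch pays once and a stream pays per record. This is a simplified sketch: the `throughput` function and every number in it are hypothetical, chosen only to illustrate the trade-off.

```python
def throughput(records, per_record_s, overhead_s, batch_size):
    """Records per second when a fixed overhead is paid once per batch."""
    batches = -(-records // batch_size)  # ceiling division
    total_s = records * per_record_s + batches * overhead_s
    return records / total_s

n = 1_000_000
per_record = 1e-6   # assumed: 1 microsecond of pure compute per record
overhead = 1e-4     # assumed: 100 microseconds of fixed cost per batch

# batch_size=1 models record-at-a-time streaming; 10_000 models a batch job.
streaming = throughput(n, per_record, overhead, batch_size=1)
batched = throughput(n, per_record, overhead, batch_size=10_000)
print(f"streaming: {streaming:,.0f} records/s")
print(f"batched:   {batched:,.0f} records/s")
```

The larger the batch, the more the fixed cost is amortized, which is the same mechanism that raises latency in section 2.1.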

2.2 Application fields

Hadoop is an offline analysis tool that processes massive data sets using the MapReduce model, while Storm is a distributed, real-time analysis tool for data that is generated continuously, such as Twitter's timeline. The M/R model is hard to apply in the real-time field: its design dictates that the data source must be static.

2.3 Hardware

Hadoop computes at the disk level: data is read from disk and results are written back to disk during computation. Storm computes at the memory level: data arrives over the network directly into memory. Reading and writing memory is several orders of magnitude faster than reading and writing disk. Commonly cited figures put disk access latency on the order of tens of thousands of times memory access latency, which is why Storm is faster.

3. Detailed Analysis

Before the analysis, let's first look at the models of the two computing frameworks, starting with MapReduce and taking WordCount as an example:

[Figure: MapReduce model, WordCount example]

Readers who have read the code in hadoop-mapreduce-project in the Hadoop source tree should be familiar with this process, so I will not repeat it here.
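
For readers who have not seen it, the WordCount flow can be sketched in a few lines of plain Python. This is not Hadoop code: the function names are hypothetical and everything runs in one process, but the three phases (map, shuffle, reduce) follow the MapReduce model.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

def word_count(lines):
    # The whole input must be available before the shuffle can finish.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())

counts = word_count(["hello storm", "hello hadoop"])
print(counts)
```

Note that `word_count` needs the complete input before it can return anything, which is the static-data assumption discussed in section 2.2.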

Next, let's take a look at the Storm model:

[Figure: Storm model]
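
In Storm's terminology, a topology connects spouts (stream sources) to bolts (processing steps). The sketch below imitates that shape with Python generators. It is not the Storm API: the function names are hypothetical and everything runs in a single process, but it shows how each record moves through the whole pipeline as soon as it is emitted.

```python
from collections import Counter

def sentence_spout(source):
    """Spout: emits tuples one at a time from a (potentially unbounded) source."""
    for sentence in source:
        yield sentence

def split_bolt(sentences):
    """Bolt: splits each incoming sentence into words as soon as it arrives."""
    for sentence in sentences:
        for word in sentence.split():
            yield word.lower()

def count_bolt(words):
    """Bolt: keeps a running count and emits an updated result per word."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]

# Wire the pipeline: spout -> split -> count.
topology = count_bolt(split_bolt(sentence_spout(["hello storm", "hello hadoop"])))
updates = list(topology)
print(updates)
```

Unlike the batch version of WordCount, a result is available after every word, without waiting for the input to end.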

Next we discuss two metrics: latency and throughput.

Latency: the time from data generation to the computed result. It is closely related to "speed".
Throughput: the amount of data the system processes per unit time.
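
Both metrics can be computed from per-record timestamps. The following is a minimal sketch with made-up sample data; the function name and the choice of average latency (rather than, say, a percentile) are assumptions for illustration.

```python
def latency_and_throughput(events):
    """events: list of (created_at_s, result_at_s) pairs, one per record."""
    latencies = [done - created for created, done in events]
    avg_latency = sum(latencies) / len(latencies)
    # Elapsed time runs from the first record created to the last result produced.
    elapsed = max(done for _, done in events) - min(created for created, _ in events)
    return avg_latency, len(events) / elapsed

# Made-up sample: four records, the last one slower than the rest.
events = [(0.0, 0.5), (1.0, 1.5), (2.0, 2.5), (3.0, 4.0)]
avg_latency, records_per_s = latency_and_throughput(events)
print(avg_latency, records_per_s)
```

The two numbers are independent: a system can hold many records in flight and achieve high throughput even when each individual record waits a long time.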

Given the same resources, Storm generally has lower latency than MapReduce, but also lower throughput. Next I describe how stream computing and batch computing proceed. The whole data processing pipeline can be divided into three phases:

1. Data collection
2. Data computing (including intermediate storage during computation)
3. Result presentation (feedback)

3.1.1 Data collection phase

A typical processing strategy today: the data generally comes from web logs and from parsing database logs. Stream computing collects the data into a message queue (such as Kafka or RabbitMQ). A batch system generally collects the data into a distributed file system (such as HDFS), although message queues are used there too. Let us call the message queue and the file system pre-processing storage. Up to this point the two approaches differ little in latency and throughput. The big difference appears between pre-processing storage and the computing phase: a stream computing system (Storm) reads data from the message queue in real time and computes on it, while a batch system accumulates a large amount of data and then imports it into the computing system (Hadoop) in bulk. This is where the latency diverges.
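
The hand-off from pre-processing storage to the computing system can be sketched as follows. The arrival times and the 60-second window are made-up values, and the function names are hypothetical; the point is only how long each record waits before computation can begin.

```python
import math

def stream_handoff_times(arrivals_s):
    """A stream consumer takes each record off the queue as it arrives."""
    return list(arrivals_s)

def batch_handoff_times(arrivals_s, window_s):
    """A batch loader hands records over only when their window closes."""
    return [(math.floor(t / window_s) + 1) * window_s for t in arrivals_s]

arrivals = [5.0, 20.0, 59.0, 61.0]  # made-up arrival times in seconds
stream_wait = [h - a for h, a in zip(stream_handoff_times(arrivals), arrivals)]
batch_wait = [h - a for h, a in zip(batch_handoff_times(arrivals, 60.0), arrivals)]
print(stream_wait)
print(batch_wait)
```

In the batch case the wait depends on where in the window a record happens to arrive, so the earliest records in each window wait the longest.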

3.1.2 Data computing phase

The stream computing system (Storm) has lower latency mainly for the following reasons:

Storm processes are resident, so data is processed as soon as it arrives. With MapReduce, a batch of data is accumulated first; then the job management system starts the job, the JobTracker assigns the computing tasks, and the TaskTrackers start the corresponding processes.
Storm passes data between computing units directly over the network (ZeroMQ). In MapReduce, the output of the map tasks is written to disk and then pulled over the network by the reduce tasks, and disk reads and writes are relatively slow.
For complex computations, Storm's model directly supports a DAG (directed acyclic graph: multiple processing steps depend on one another, with the output of one step serving as the input of the next). MapReduce needs several chained MR jobs for the same work, and some of the map steps in them do nothing useful.
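
The DAG idea in the last point can be sketched with the standard library's `graphlib` (Python 3.9 or later). The step names, the sample log records, and the `run_dag` helper are all hypothetical, and this is not how Storm schedules work internally; the sketch only shows steps wired by dependency, with one step's output feeding the next and two steps sharing one upstream result.

```python
from graphlib import TopologicalSorter

# Hypothetical processing steps; each is a function of the results so far.
steps = {
    "parse":  lambda r: [line.split(",") for line in r["source"]],
    "filter": lambda r: [row for row in r["parse"] if row[1] == "ERROR"],
    "count":  lambda r: len(r["filter"]),
    "hosts":  lambda r: sorted({row[0] for row in r["filter"]}),
}

# step -> the steps whose output it consumes
depends_on = {
    "parse": {"source"},
    "filter": {"parse"},
    "count": {"filter"},
    "hosts": {"filter"},
}

def run_dag(source_records):
    """Run every step once, in an order that respects the dependencies."""
    results = {"source": source_records}
    for step in TopologicalSorter(depends_on).static_order():
        if step in steps:
            results[step] = steps[step](results)
    return results

out = run_dag(["web1,ERROR", "web2,INFO", "web1,ERROR", "db1,ERROR"])
print(out["count"], out["hosts"])
```

Here `count` and `hosts` both consume the output of `filter` directly. Expressed as chained MapReduce jobs, each stage would be a separate job with its own map and reduce phases, even where one of them only passes data through.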

3.1.3 Data presentation

Stream computing generally writes results straight to the final result set (display pages, databases, or search engine indexes) as they are produced. MapReduce generally has to wait until the whole job finishes and then import the results into the result set in bulk.

4. Summary

Storm makes it easy to write and scale complex real-time computations on a cluster of computers. Storm is to real-time processing what Hadoop is to batch processing. Storm guarantees that every message is processed, and it is fast: a small cluster can process millions of messages per second.

Storm has the following features:

Simple programming model. Just as MR reduces the complexity of parallel batch processing, Storm reduces the complexity of real-time processing.
Multiple programming languages. Any language can be used as long as it follows Storm's communication protocol.
Fault tolerance. Storm manages failures of worker processes and nodes.
Horizontal scaling. Computation runs in parallel across multiple threads, processes, and servers.
Reliable message processing. Storm guarantees that each message is fully processed at least once, and it uses ZeroMQ as its underlying message queue.
Local mode. Storm has a "local mode" that fully simulates a Storm cluster in process, which lets you develop and unit test quickly.
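
The "at least once" guarantee in the list above has a practical consequence: a message whose processing fails is replayed, so a handler may see the same message more than once. The sketch below shows only that semantics. It is not Storm's actual acknowledgement mechanism, and `process_at_least_once` and `flaky_handler` are hypothetical names.

```python
from collections import deque

def process_at_least_once(messages, handler, max_attempts=3):
    """Replay each message until the handler acknowledges it (returns True)."""
    pending = deque((msg, 1) for msg in messages)
    acked, failed = [], []
    while pending:
        msg, attempt = pending.popleft()
        try:
            ok = handler(msg)
        except Exception:
            ok = False
        if ok:
            acked.append(msg)
        elif attempt < max_attempts:
            pending.append((msg, attempt + 1))  # put back for a later replay
        else:
            failed.append(msg)
    return acked, failed

seen = []

def flaky_handler(msg):
    """Hypothetical handler that fails the first time it sees message 'b'."""
    seen.append(msg)
    return not (msg == "b" and seen.count("b") == 1)

acked, failed = process_at_least_once(["a", "b", "c"], flaky_handler)
print(acked, failed, seen)
```

Because "b" reaches the handler twice, any side effect in the handler (such as a database write) should be idempotent or deduplicated.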

In conclusion, Hadoop's MR is built on HDFS and must split the input data, generate intermediate data files, sort, compress, and replicate data across nodes. Storm is built on ZeroMQ, a high-performance messaging library, and does not persist data. That concludes this article. If you have any questions, you can join the QQ group or send me an email, and I will do my best to help.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
