Thoughts on real-time analysis and offline analysis (2)

Source: Internet
Author: User

This post follows up on the previous blog post, "Thoughts on real-time and offline analysis".

Yesterday I looked at the designs of S4 and Storm. Combined with what I already knew about Microsoft Dryad, I feel that some of their commonalities need to be spelled out.

Before "split-merge" models such as MapReduce emerged, we all used a "one-layer computing" approach. For example, count how often each word occurs in the sentence "What I Have Done". Because the problem is simple and the data volume is small, it poses no challenge to our computing resources.
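The word-frequency example can be sketched in a few lines of Python. This is only an illustration of "one-layer computing" on a single machine; the function name `word_frequencies` is made up for this sketch, and lowercasing the words is an assumption the post does not state.

```python
from collections import Counter


def word_frequencies(sentence: str) -> Counter:
    """Count how often each word occurs, in one pass on one machine."""
    # Assumption: words are separated by whitespace and compared case-insensitively.
    return Counter(sentence.lower().split())


freq = word_frequencies("What I Have Done")
for word, count in freq.items():
    print(word, count)
```

Everything happens in one function call: there is no splitting of the input and no merging of partial results.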


The idea of divide and conquer has existed for a long time, but the example that impressed me most was network computing: split a large computing task into small tasks, hand them to volunteers' computers for execution, and then merge the results. This is also what MapReduce does. MapReduce can be called "two-layer computing". Its core idea is to use parallelism to handle large data volumes or heavy computing workloads.

MapReduce has only two steps, which makes complex computations such as Join and Sort difficult: they require several MapReduce jobs in sequence. The weakness of current Hadoop MapReduce on such complex tasks is that each MapReduce job is independent of the others, with its own start and end. In a sequence of jobs, a later job cannot effectively exploit the locality of the previous job's output.
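A minimal sketch of the "two-layer" split-merge idea, again using word counting. It is not the Hadoop API: the names `map_phase`, `shuffle`, `reduce_phase` and `map_reduce` are hypothetical, and a local thread pool stands in for the worker machines of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Layer 1: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the values of all map outputs by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(item):
    """Layer 2: merge all values collected for one key."""
    key, values = item
    return key, sum(values)


def map_reduce(chunks):
    # Assumption: threads play the role of independent workers.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
        grouped = shuffle(mapped)
        reduced = list(pool.map(reduce_phase, grouped.items()))
    return dict(reduced)


counts = map_reduce(["what i have done", "what have you done"])
print(counts)
```

A Join or Sort would need several such `map_reduce` calls in a row. In this sketch the output of one call is simply handed to the next as a fresh input, which mirrors the point above that each job starts and ends on its own.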


These real-time computing frameworks also follow the divide-and-conquer idea, but their basic model is a directed acyclic graph (DAG). As the computation grows more complex, the data processing flow can be extended easily. This model can be called "multi-layer computing". The whole computation is still parallel, but the data is not written to disk: it flows through memory and over the network. You can plan the topology of a computing flow according to the complexity of the computation. This addresses two problems of MapReduce: 1. Complex computations are hard to express in MapReduce, and the resulting process is lengthy and difficult to code. 2. Jobs run in sequence, but each is computed independently and cannot take advantage of locality.
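The "multi-layer" DAG model can be sketched as a small topology whose stages pass their results to downstream stages in memory. This is not the API of S4, Storm or Dryad: the stage names, the `UPSTREAM` table and `run_dag` are all invented for illustration, and the stages run one after another here, whereas a real framework would run them in parallel and stream records between them. It needs Python 3.9 or later for `graphlib`.

```python
from collections import Counter
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def source(_inputs):
    return ["what i have done", "what have you done"]


def split(inputs):
    (lines,) = inputs
    return [word for line in lines for word in line.split()]


def count(inputs):
    (words,) = inputs
    return Counter(words)


def total(inputs):
    (words,) = inputs
    return len(words)


def report(inputs):
    counts, total_words = inputs
    return {"distinct": len(counts), "total": total_words}


STAGES = {"source": source, "split": split, "count": count,
          "total": total, "report": report}

# Each stage lists the stages whose output it consumes, in argument order.
UPSTREAM = {"split": ["source"], "count": ["split"],
            "total": ["split"], "report": ["count", "total"]}


def run_dag(stages, upstream):
    """Run every stage after its upstream stages, keeping results in memory."""
    results = {}
    for name in TopologicalSorter(upstream).static_order():
        inputs = [results[dep] for dep in upstream.get(name, [])]
        results[name] = stages[name](inputs)
    return results


print(run_dag(STAGES, UPSTREAM)["report"])
```

Adding another layer only means adding one function and one entry in `UPSTREAM`. The output of `split` is reused by both `count` and `total` without being written out and read back, which is the locality a chain of separate MapReduce jobs cannot exploit.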

Both the "two-layer model" and the "multi-layer model" are valuable in their own business scenarios. I do not agree with forcing a lot of complicated computation onto the MapReduce model, because it makes the logic hard to understand and the code unpleasant to write. These real-time computing frameworks also seem to provide topology planning tools, which is a very considerate touch.

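To summarize, these architectures seem to have the following in common:

1. All of their computation models are directed acyclic graphs (DAGs).

2. The computation is still parallel.

3. Data is processed on the fly, in memory and over the network, instead of being written to disk.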

Continue learning...
