Introduction to Hadoop Practice (Part 1): An Open-Source Framework for Distributed Computing


During the design of the SIP project, we initially considered using a multithreaded, task-decomposition approach to analyze and aggregate its huge volume of logs, as described in my earlier article "Tiger Concurrency Practice: Parallel Decomposition Design and Implementation of Log Analysis". Because the statistics we needed were still quite simple, we ended up using memcached as a counter, combined with MySQL, to handle access control and statistics. Even so, we have to prepare for massive log analysis in the future. The hottest technical buzzword right now is "cloud computing", and as open APIs grow ever more popular, the data generated by Internet applications will become more and more valuable. Analyzing that data and mining its intrinsic value requires distributed computing to support analysis at massive scale.
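As an aside, the counter approach mentioned above can be sketched in a few lines. The snippet below uses the spymemcached client; the host, port, and key name are illustrative assumptions for this post, not the actual SIP code.

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

// Hypothetical sketch of the memcached-as-counter idea described above;
// the host, port, and key are placeholders, not the SIP configuration.
public class ApiHitCounter {
    public static void main(String[] args) throws Exception {
        MemcachedClient client =
                new MemcachedClient(new InetSocketAddress("localhost", 11211));

        // Atomically increment the per-API counter; if the key does not
        // exist yet, start it at 1. Aggregated totals would later be
        // flushed to MySQL for access control and reporting.
        long hits = client.incr("api:getUserInfo:hits", 1, 1);
        System.out.println("current hits = " + hits);

        client.shutdown();
    }
}
```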

Looking back, that earlier multithreaded, multi-task-decomposition log analysis design was really a single-machine miniature of the same idea, and turning that single-machine job into the cooperative work of a cluster is exactly the problem a distributed computing framework has to solve. At the BEA conference last year, BEA and VMware used virtual machines to build clusters, hoping that computer hardware could be treated like the resource pools we are used to in applications, so that users would not have to care how resources are allocated and the value of the hardware could be maximized. Distributed computing works the same way: which machine a given computing task runs on, and who aggregates the results afterward, are choices made by the framework's master. The user simply feeds the content to be analyzed into the distributed computing system as input and gets back the computed results.

Hadoop is an open-source distributed computing framework from the Apache open-source organization that has already been adopted by many large web sites, such as Amazon, Facebook, and Yahoo. For me, the most immediate use case is log analysis for the service integration platform. The platform's log volume will be very large, and that fits the classic application scenarios of distributed computing (log analysis and indexing are the two major ones).

I am not yet using Hadoop formally, so this is my own spare-time exploration. The content that follows records a beginner's learning process, and mistakes are inevitable; I simply want to write things down and share them with like-minded friends.

What is Hadoop?

Before doing anything, the first step is to know what, then why, and finally how. Yet after years of projects, many developer friends have become used to the opposite order: first how, then what, and finally why. That only makes you impatient, and it often leads to applying a technology in scenarios where it does not fit.

The central design of the Hadoop framework consists of two parts: MapReduce and HDFS. The idea of MapReduce was popularized by a Google paper; in one sentence, MapReduce is "the decomposition of tasks and the aggregation of results". HDFS is short for Hadoop Distributed File System, and it provides the underlying storage support for distributed computing.
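To make the storage side concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API. The namenode address and the file path are assumptions for the example, not values from this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS usage sketch; the namenode address and path below are
// illustrative assumptions, not part of the original article.
public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);

        // Write a small log line into the distributed file system...
        Path file = new Path("/logs/demo.log");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("GET /api/getUserInfo 200");
        }

        // ...and read it back; the blocks are served by whichever
        // datanodes hold the replicas.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}
```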

The name MapReduce itself suggests roughly how it works: two verbs, Map and Reduce. "Map" breaks one task into multiple tasks; "Reduce" aggregates the results of those decomposed tasks to produce the final analysis. This is not a new idea; you can find its shadow in the multithreaded, multi-task design mentioned earlier. Whether in the real world or in program design, a job can be split into multiple tasks, and the relationships between tasks fall into two kinds: tasks that are unrelated to each other and can execute in parallel, and tasks that depend on each other, whose order cannot be reversed and which therefore cannot be processed in parallel. Back in college, professors had us analyze critical paths, which was nothing more than finding the most time-saving way to decompose and execute a job. In a distributed system, the machine cluster can be regarded as a hardware resource pool: split out the parallelizable tasks and hand them to whatever machine resources are idle, and computational efficiency improves enormously; at the same time, this independence between resources provides the best possible design guarantee for expanding the computing cluster. (In fact, I have always thought the Hadoop mascot should not be a little elephant but an ant: distributed computing is ants eating an elephant, and a group of cheap machines can match any high-performance computer; the cost curve of vertical scaling can never beat the straight diagonal of horizontal scaling.) After the tasks are decomposed, the processed results must be consolidated, and that is what Reduce does. An example follows below.
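To make the decompose-and-aggregate idea concrete, here is the classic word-count example sketched against Hadoop's MapReduce API. It is a minimal illustration of the Map and Reduce roles described above, not code from the SIP project.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// "Map" decomposes the job: each mapper takes one slice of the input
// and emits (word, 1) pairs independently, so slices run in parallel.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }
}

// "Reduce" aggregates: all counts for the same word are routed to one
// reducer, which sums them into the final result.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        ctx.write(word, new IntWritable(sum));
    }
}
```

Between the two phases, the framework itself handles the shuffle: every (word, 1) pair is grouped by key and delivered to the right reducer, which is exactly the "aggregation of results" described above.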

Figure 1: MapReduce structure schematic diagram
