Introduction to Hadoop Practice (Part 1): An Open-Source Framework for Distributed Computing


During the design of the SIP project, we initially considered using a multithreaded, task-decomposition approach to analyze and aggregate its huge volume of logs, as described in my earlier article "Tiger Concurrency Practice: Parallel Decomposition Design and Implementation of Log Analysis". Because the statistics we needed were still quite simple, we ended up using memcached as a counter, combined with MySQL, to handle access control and statistics. Even so, we have to prepare for massive log analysis in the future. The hottest technical buzzword right now is "cloud computing", and as open APIs grow ever more popular, the data generated by Internet applications will become more and more valuable. Analyzing that data and mining its intrinsic value requires distributed computing to support analysis at massive scale.
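As an aside, the counter approach mentioned above can be sketched in a few lines. The snippet below uses the spymemcached client; the host, port, and key name are illustrative assumptions for this post, not the actual SIP code.

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

// Hypothetical sketch of the memcached-as-counter idea described above;
// the host, port, and key are placeholders, not the SIP configuration.
public class ApiHitCounter {
    public static void main(String[] args) throws Exception {
        MemcachedClient client =
                new MemcachedClient(new InetSocketAddress("localhost", 11211));

        // Atomically increment the per-API counter; if the key does not
        // exist yet, start it at 1. Aggregated totals would later be
        // flushed to MySQL for access control and reporting.
        long hits = client.incr("api:getUserInfo:hits", 1, 1);
        System.out.println("current hits = " + hits);

        client.shutdown();
    }
}
```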

Looking back, that earlier multithreaded, multi-task-decomposition log analysis design was really a single-machine miniature of the same idea, and turning that single-machine job into the cooperative work of a cluster is exactly the problem a distributed computing framework has to solve. At the BEA conference last year, BEA and VMware used virtual machines to build clusters, hoping that computer hardware could be treated like the resource pools we are used to in applications, so that users would not have to care how resources are allocated and the value of the hardware could be maximized. Distributed computing works the same way: which machine a given computing task runs on, and who aggregates the results afterward, are choices made by the framework's master. The user simply feeds the content to be analyzed into the distributed computing system as input and gets back the computed results.

Hadoop is an open-source distributed computing framework from the Apache open-source organization that has already been adopted by many large web sites, such as Amazon, Facebook, and Yahoo. For me, the most immediate use case is log analysis for the service integration platform. The platform's log volume will be very large, and that fits the classic application scenarios of distributed computing (log analysis and indexing are the two major ones).

I am not yet using Hadoop formally, so this is my own spare-time exploration. The content that follows records a beginner's learning process, and mistakes are inevitable; I simply want to write things down and share them with like-minded friends.

What is Hadoop?

Before doing anything, the first step is to know what, then why, and finally how. Yet after years of projects, many developer friends have become used to the opposite order: first how, then what, and finally why. That only makes you impatient, and it often leads to applying a technology in scenarios where it does not fit.

The central design of the Hadoop framework consists of two parts: MapReduce and HDFS. The idea of MapReduce was popularized by a Google paper; in one sentence, MapReduce is "the decomposition of tasks and the aggregation of results". HDFS is short for Hadoop Distributed File System, and it provides the underlying storage support for distributed computing.
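To make the storage side concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API. The namenode address and the file path are assumptions for the example, not values from this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS usage sketch; the namenode address and path below are
// illustrative assumptions, not part of the original article.
public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);

        // Write a small log line into the distributed file system...
        Path file = new Path("/logs/demo.log");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("GET /api/getUserInfo 200");
        }

        // ...and read it back; the blocks are served by whichever
        // datanodes hold the replicas.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}
```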

The name MapReduce itself suggests roughly how it works: two verbs, Map and Reduce. "Map" breaks one task into multiple tasks; "Reduce" aggregates the results of those decomposed tasks to produce the final analysis. This is not a new idea; you can find its shadow in the multithreaded, multi-task design mentioned earlier. Whether in the real world or in program design, a job can be split into multiple tasks, and the relationships between tasks fall into two kinds: tasks that are unrelated to each other and can execute in parallel, and tasks that depend on each other, whose order cannot be reversed and which therefore cannot be processed in parallel. Back in college, professors had us analyze critical paths, which was nothing more than finding the most time-saving way to decompose and execute a job. In a distributed system, the machine cluster can be regarded as a hardware resource pool: split out the parallelizable tasks and hand them to whatever machine resources are idle, and computational efficiency improves enormously; at the same time, this independence between resources provides the best possible design guarantee for expanding the computing cluster. (In fact, I have always thought the Hadoop mascot should not be a little elephant but an ant: distributed computing is ants eating an elephant, and a group of cheap machines can match any high-performance computer; the cost curve of vertical scaling can never beat the straight diagonal of horizontal scaling.) After the tasks are decomposed, the processed results must be consolidated, and that is what Reduce does. An example follows below.
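To make the decompose-and-aggregate idea concrete, here is the classic word-count example sketched against Hadoop's MapReduce API. It is a minimal illustration of the Map and Reduce roles described above, not code from the SIP project.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// "Map" decomposes the job: each mapper takes one slice of the input
// and emits (word, 1) pairs independently, so slices run in parallel.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }
}

// "Reduce" aggregates: all counts for the same word are routed to one
// reducer, which sums them into the final result.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        ctx.write(word, new IntWritable(sum));
    }
}
```

Between the two phases, the framework itself handles the shuffle: every (word, 1) pair is grouped by key and delivered to the right reducer, which is exactly the "aggregation of results" described above.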

Figure 1: MapReduce structure schematic diagram
