Getting Started with Hadoop: An Open Source Framework for Distributed Computing
Source: Internet
Author: User
Keywords: distributed computing, open source
During the design of the SIP project, I initially considered analyzing and aggregating its large logs with a multithreaded, task-decomposition approach, part of which I described in an earlier blog post. Because the statistics we needed were still quite simple, memcache used as a counter, combined with MySQL, was enough to handle access control and statistics. But the work of analyzing massive logs still lies ahead, and we need to prepare for it now. The hottest technical buzzword today is "cloud computing"; as open APIs grow ever more popular, the data produced by Internet applications becomes more and more valuable. Analyzing that data and mining its intrinsic value requires distributed computing to support analysis at massive scale.
Looking back, the earlier multithreaded, multi-task log analysis design was really just a single-machine version of the same computation. How to split that single-machine work into coordinated work across a cluster is exactly what a distributed computing framework must address. At last year's BEA conference, BEA and VMware collaborated on using virtual machines to build clusters, in the hope that computer hardware would become like the resources in an application's resource pool, so that users need not care how resources are allocated and can maximize the value of their hardware. Distributed computing works the same way: which machine executes a specific computing task, and who aggregates the results afterwards, is chosen by the framework's Master. The user simply provides the content to be analyzed as input to the distributed computing system and gets the analysis results back.
Hadoop is an open source distributed computing framework from the Apache organization that has already been applied at many large web sites, such as Amazon, Facebook, and Yahoo. For me, the most immediate use case is log analysis for the Service Integration Platform, whose log volume will be very large; this happens to match a classic distributed computing scenario (log analysis and index building are its two major application scenarios).
I am not yet using Hadoop formally, so this is my own amateur exploration in my spare time. The follow-up posts record a novice's learning process, and mistakes are inevitable; I just want to write things down to share with like-minded friends.
What is Hadoop
Before doing something, the first step is to know what it is, then why you need it, and finally how to use it. But many friends who have worked on projects for years are used to starting with how, then what, and finally why. That only breeds impatience, and often leads to applying a technology to a scenario it does not suit.
The two most central designs in the Hadoop framework are MapReduce and HDFS. The idea of MapReduce was widely spread by a Google paper; in one sentence, MapReduce is the decomposition of tasks and the aggregation of results. HDFS, short for Hadoop Distributed File System, provides the underlying storage support for distributed computing.
From its name you can roughly see what MapReduce does: two verbs, Map and Reduce. Map breaks one task down into multiple tasks, and Reduce aggregates the results of those decomposed tasks to arrive at the final analysis result. This is not a new idea; you can find its shadow in the multithreaded, multi-task design mentioned earlier.

Whether in the real world or in programming, a job can be split into multiple tasks, and the relationships between tasks come in two kinds: tasks that are unrelated to each other and can be executed in parallel, and tasks that depend on each other, whose order cannot be reversed and which therefore cannot be parallelized. Back at university, when our teachers had us analyze critical paths in class, it was nothing more than finding the decomposition and execution order that saves the most time. In a distributed system, the machine cluster can be regarded as a hardware resource pool: splitting the parallel tasks across idle machine resources greatly improves computational efficiency, and the independence of those resources provides the best design guarantee for scaling the computing cluster out. (In fact, I have always thought Hadoop's mascot should be an ant rather than a little elephant: distributed computing is ants eating an elephant, and a group of cheap machines can match any high-performance computer; the curve of vertical scaling never beats the straight line of horizontal scaling.) After the tasks are decomposed, the processed results need to be aggregated, and that is what Reduce does.
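To make the two verbs concrete, here is a minimal sketch of the classic word-count job, written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); it is a simplified version of the well-known example, not code from this project:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: the "decomposition" step. Each mapper independently processes
    // one piece of the input and emits a (word, 1) pair for every word.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: the "aggregation" step. All partial counts for the same word
    // arrive at one reducer, which sums them into the final total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```

Note how the mappers never talk to each other: each works on its own slice of the input, which is exactly the "unrelated tasks can execute in parallel" case described above.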
Figure 1: MapReduce structure
The diagram above shows the general structure of MapReduce. Before Map there may also be a Split step that divides the input data, which guarantees the parallel efficiency of the map tasks; after Map there is a Shuffle step, which greatly helps improve the efficiency of Reduce and relieve the pressure of data transmission. The details of these parts will be covered specifically later.
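A driver sketch for the word-count job above shows where those steps are configured. This assumes the WordCount classes from the previous sketch; the input and output paths come from the command line. Registering the reducer as a combiner is one common way to address exactly the data-transmission pressure mentioned above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        // Combiner: runs a local "mini-reduce" on each mapper's output
        // before the shuffle sends it across the network to the reducers.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The input path is automatically divided into splits, one map
        // task per split, so the maps run in parallel across the cluster.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```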
HDFS is the storage cornerstone of distributed computing, and Hadoop's distributed file system shares many qualities with other distributed file systems.
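To applications, HDFS looks much like an ordinary file system accessed through a Java API. Here is a minimal sketch of reading a file through org.apache.hadoop.fs.FileSystem; the NameNode address and the file path are placeholders for illustration, not values from this project:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point at the NameNode; hostname and port are placeholders.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        // The client sees one logical file, even though HDFS stores it
        // as replicated blocks spread across many machines.
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(
                             fs.open(new Path("/logs/access.log"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```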