Big Data Processing with Hadoop (II): MapReduce

Source: Internet
Author: User
Keywords: big data, distributed computing

The Big Data Processing Model: MapReduce

(Continued from "Big Data Processing with Hadoop (I)")

The data produced in the big data era ultimately needs to be computed; the purpose of storing it is to make it available for analysis. The significance of big data lies in computing, analyzing, and mining what lies behind the data. Hadoop provides not only a distributed file system for storing data, but also a distributed programming model and a distributed computing framework, which together solve the data processing problem of the big data era. That distributed computing model and framework are the famous MapReduce. MapReduce also came from Google: the well-known MapReduce paper published in 2004 was adopted by Hadoop's predecessor project. Google applies the MapReduce distributed computing model to its data search and data mining business, and many Internet companies likewise use MapReduce for big data processing.

The MapReduce computing model divides a computation into two phases: a map phase and a reduce phase, and the two phases can be chained one after another. Under the MapReduce programming model, an algorithm is therefore decomposed into a map function and a reduce function. The map function performs a distributed, parallel operation on the input data; the reduce function merges the intermediate results of the map function and outputs the final data. The MapReduce execution model is shown in the following illustration:
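To make the decomposition concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (the class names WordCountMapper and WordCountReducer are illustrative, not taken from the original article): the map function emits a (word, 1) pair for every word in its input split, and the reduce function sums those counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: split each input line into words and emit (word, 1) for every word.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);          // intermediate (word, 1) pair
            }
        }
    }

    // Reduce phase: merge all counts emitted for the same word into a single total.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum)); // final (word, total) pair
        }
    }

Because each map call sees only its own input split and each reduce call sees only one key's values, both phases can run concurrently on many nodes, which is exactly what the framework exploits.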

MapReduce is a distributed programming model. To support it, the Hadoop project implements the distribution, scheduling, execution, and fault tolerance of computing tasks. The MapReduce runtime framework is shown in the following illustration:

There are two main roles in this distributed computing model: the Job Tracker and the Task Tracker. The Job Tracker's main responsibility is to coordinate the execution of MapReduce jobs, specifically the scheduling and dispatching of tasks; it is the master node of the MapReduce computing architecture. Task Trackers execute the tasks assigned by the Job Tracker, including both map tasks and reduce tasks, so a Task Tracker can act as a map Task Tracker or a reduce Task Tracker. The Job Tracker is critical to the overall framework: it keeps a constant connection to every Task Tracker, and once a Task Tracker is found to have failed, the Job Tracker reschedules the failed tasks on other nodes. In addition, because computing tasks need to fetch data from a database or file system, the Job Tracker must take the distance between a node and the data source into account when scheduling a task. A computing task can usually be scheduled directly onto the node that stores its data, which avoids transferring the data over the network and reduces the latency introduced by data I/O.
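The locality consideration described above can be illustrated with a small, purely hypothetical sketch (this is not Hadoop's actual Job Tracker code): when handing out a map task, a locality-aware scheduler prefers a Task Tracker running on a node that already holds a replica of the task's input data, then one in the same rack, and only then a remote node.

    import java.util.List;

    // Hypothetical sketch of locality-aware task assignment; the real Job Tracker
    // implements this idea internally, but not with this code.
    class LocalityAwareScheduler {

        // Preference order: same node as the data, then same rack, then anywhere.
        enum Locality { DATA_LOCAL, RACK_LOCAL, REMOTE }

        static class TaskTracker {
            String host;
            String rack;
            int freeMapSlots;
        }

        static class MapTask {
            List<String> replicaHosts;  // nodes storing the task's input split
            List<String> replicaRacks;  // racks containing those nodes
        }

        // Pick the free tracker that is "closest" to the task's input data.
        TaskTracker assign(MapTask task, List<TaskTracker> trackers) {
            TaskTracker best = null;
            Locality bestLocality = null;
            for (TaskTracker tt : trackers) {
                if (tt.freeMapSlots == 0) {
                    continue;                      // no capacity on this node
                }
                Locality locality = task.replicaHosts.contains(tt.host) ? Locality.DATA_LOCAL
                        : task.replicaRacks.contains(tt.rack) ? Locality.RACK_LOCAL
                        : Locality.REMOTE;
                if (best == null || locality.ordinal() < bestLocality.ordinal()) {
                    best = tt;
                    bestLocality = locality;
                }
            }
            return best;                           // null if no tracker has a free slot
        }
    }

Scheduling a map task onto a data-local tracker means the input split is read from local disk instead of over the network, which is the main reason computing-storage integration pays off under MapReduce.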

More than ten years ago, when storage had grown to a certain scale, the model of separating storage from computation was proposed, which allowed storage technology to evolve independently into large storage systems such as SAN and NAS. In the era of big data computing, rigidly separating storage from computation can have negative effects, namely reduced computational performance, so integrating computing with storage is valuable for big data analysis under the MapReduce framework. In fact, separation itself is not the point; the key is that the system architecture must meet the needs of the application. Storage networks such as SAN and NAS usually need to be connected to the compute servers through high-speed interconnect technology (FC, InfiniBand, or 10 GbE Ethernet) in order to reduce I/O latency. In a big data environment, where costs must be kept down, this kind of high-speed interconnect is not necessarily the best choice. In a cost-oriented distributed architecture, therefore, integrating computing and storage is a good choice.

Here is a quick description of how a distributed computation proceeds under the MapReduce architecture:

1. When a client needs to perform a computation, it submits a computing job to the Job Tracker, and the job is stored as files in HDFS.

2. The Job Tracker acts as a scheduler of computing resources: following a certain strategy, it first dispatches the map tasks of the client-submitted job to Task Trackers.

3. The map tasks execute concurrently on multiple Task Trackers, and the intermediate results of the computation are written out as files (in Hadoop's implementation, to the local disks of the map nodes).

4. When the map tasks are complete, the Job Tracker assigns reduce tasks to Task Tracker servers.

5. When all the reduce tasks are complete, the results are written back to the Hadoop Distributed File System (HDFS).

As this process shows, the Job Tracker resembles a CPU resource scheduler, the Task Trackers are the CPU resources, and data input and output go through the Hadoop file system.
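To connect the steps above to actual code, here is a minimal client-side driver sketch, assuming the WordCountMapper and WordCountReducer classes sketched earlier and the Hadoop 2.x Java API. It packages the map and reduce classes into a job and submits it to the cluster (step 1 above); both the input and the output paths live in HDFS.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Client-side driver: builds the job description and submits it to the cluster.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);    // map phase
            job.setReducerClass(WordCountReducer.class);  // reduce phase
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            // Submit the job and block until every map and reduce task has finished.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The driver is typically packaged into a jar and launched with something like "hadoop jar wordcount.jar WordCountDriver /input /output"; from that point on, scheduling, retries, and fault tolerance are handled by the framework as described above.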

This framework shows that the computing nodes of the system scale out very well; the only potential bottleneck is the Job Tracker, and a single Job Tracker is also a single point of failure. As with the distributed file system, the Job Tracker plays the same kind of role as the NameNode: once the Job Tracker fails, the entire computing system stops functioning properly. The design of a distributed computing system with this architecture therefore centers on the Job Tracker: first, the Job Tracker must have strong transaction-processing capability, and second, it must be made highly available.

A distributed computing system is only a tool. To actually process data in a distributed way, the MapReduce idea must be adopted in the programming model, and in particular the MapReduce model must be applied in algorithm design: every algorithm has to be decomposed into map and reduce methods, and those methods must be able to execute concurrently. In the field of data mining there are open-source MapReduce algorithm resources; the Mahout project, for example, is a very representative open-source library.

This article is from the "Storage Path" blog; please be sure to keep this source: http://alanwu.blog.51cto.com/3652632/1418021
