Principles of Hadoop Map/Reduce

Source: Internet
Author: User
Tags: shuffle, hadoop, mapreduce


Hadoop is an Apache project. It consists of HDFS, MapReduce, HBase, Hive, ZooKeeper, and other members, of which HDFS and MapReduce are the two most basic and important.

HDFS is an open-source counterpart of Google's GFS. It is a highly fault-tolerant distributed file system that provides high-throughput data access and is suitable for storing massive (PB-scale) data, as shown in the following figure:

HDFS uses a Master/Slave structure. The NameNode maintains the cluster's metadata and provides operations for creating, opening, deleting, and renaming files and directories; DataNodes store the actual data and serve read/write requests. Each DataNode periodically reports a heartbeat to the NameNode, and the NameNode controls the DataNodes through its responses to these heartbeats.

MapReduce is a powerful tool for large-scale (TB-scale) data computing. Its two main ideas, Map and Reduce, are derived from functional programming languages. The principle is simple: Map divides the data, and Reduce aggregates it. You only need to implement the map and reduce interfaces to compute over TB-scale data. Common applications include log analysis, data mining, and other data-analysis workloads; it can also be used for scientific computing, such as estimating the value of Pi.

The Hadoop MapReduce implementation also adopts a Master/Slave structure. The Master is called the JobTracker and each Slave is called a TaskTracker. A computation submitted by the user is called a Job, and each Job is divided into several Tasks. The JobTracker is responsible for Job and Task scheduling, while TaskTrackers are responsible for executing Tasks.
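The map/reduce programming model described above can be sketched in a few lines of plain Python. This is only an illustration of the model (word counting, the classic example), not the Hadoop API; the function names are made up for this sketch:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    """Map: turn one line of input into intermediate (word, 1) pairs."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce: aggregate all counts that share the same key."""
    return (word, sum(counts))

def run_job(lines):
    # Map phase: in Hadoop, each Map task would process one input split.
    intermediate = [pair for line in lines for pair in map_fn(line)]
    # Shuffle/sort: group the intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reduce call per distinct key.
    return [reduce_fn(key, (v for _, v in group))
            for key, group in groupby(intermediate, key=itemgetter(0))]

print(run_job(["hello hadoop", "hello mapreduce"]))
# → [('hadoop', 1), ('hello', 2), ('mapreduce', 1)]
```

In a real Hadoop job the map and reduce calls run on different machines and the grouping step is the distributed Shuffle, but the data flow is the same.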

Shuffle and Sort Analysis in MapReduce

MapReduce is a very popular distributed computing framework designed for parallel computation over massive data. Google was the first to propose this framework, drawing inspiration from functional programming languages such as Lisp, Scheme, and ML. The framework consists of two core steps: Map and Reduce. When you submit a computing job to the MapReduce framework, it first splits the job into several Map tasks and distributes them to different nodes for execution; each Map task processes a part of the input data. When a Map task completes, it generates intermediate files that serve as the input to the Reduce tasks. The main goal of a Reduce task is to aggregate the outputs of the preceding Maps and emit the combined result. At a high level of abstraction, Figure 1 shows the data flow of MapReduce:

[Figure: MapReduce job running process]
The focus of this article is the core MapReduce process: Shuffle and Sort. Here, Shuffle refers to the whole process starting from the Map output, including the system's sorting of that output and its transfer to the Reducer as input. We will explore how Shuffle works, since a basic understanding of it helps when tuning MapReduce programs.
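Part of the Shuffle is deciding which Reducer receives each key. Hadoop's default HashPartitioner assigns a key to reducer `key.hashCode() % numReduceTasks`; the sketch below mimics that idea in Python, using CRC32 instead of Python's built-in `hash()` only so the result is deterministic across runs (the names here are illustrative, not the Hadoop API):

```python
import zlib

def partition(key, num_reducers):
    """Mimic Hadoop's default HashPartitioner: map a key to a
    reduce-task index. CRC32 keeps this sketch deterministic."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Every Map task routes a given key to the same reducer index,
# so all values for one key end up on a single Reduce node.
pairs = [("hello", 1), ("hadoop", 1), ("hello", 1)]
buckets = {}
for key, value in pairs:
    buckets.setdefault(partition(key, 2), []).append((key, value))
print(buckets)
```

The important property is not which bucket a key lands in, but that the assignment is a pure function of the key, so identical keys from different Map tasks always meet at the same Reducer.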


The analysis starts from the Map side. When a Map task starts producing output, it does not simply write the data straight to disk, because frequent disk operations would cause serious performance degradation; the processing is more involved. Data is first written to an in-memory buffer and pre-sorted there to improve efficiency.

Each Map task has a circular memory buffer into which it writes its output. The buffer defaults to 100 MB, and its size can be set through the io.sort.mb property. When the amount of data in the buffer reaches a specific threshold (io.sort.mb * io.sort.spill.percent, where io.sort.spill.percent defaults to 0.80), a background thread starts to spill the buffer's contents to disk. While the spill is in progress, Map output continues to be written to the buffer, but if the buffer fills up, the Map task blocks until the spill completes.

Before writing the buffered data to disk, the spill thread performs a secondary sort: it first sorts the data by the partition it belongs to, and then sorts the data within each partition by key. The output consists of an index file and a data file. If a Combiner is configured, it runs on this sorted output. A Combiner is essentially a mini Reducer that runs on the node executing the Map task; it performs a preliminary Reduce over the Map output to make it more compact, so that less data is written to disk and transmitted to the Reducer. Spill files are saved in the directory specified by mapred.local.dir and are deleted after the Map task completes.
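The spill logic above can be sketched as follows, assuming the default values the text mentions (io.sort.mb = 100 MB, io.sort.spill.percent = 0.80). This is an illustration of the threshold arithmetic and the secondary sort plus optional combiner pass, not the real MapTask implementation:

```python
from itertools import groupby

IO_SORT_MB = 100            # io.sort.mb default (MB)
SPILL_PERCENT = 0.80        # io.sort.spill.percent default
# Spill starts once the buffer holds this many bytes (~83.9 MB):
SPILL_THRESHOLD = IO_SORT_MB * 1024 * 1024 * SPILL_PERCENT

def spill(records, combiner=None):
    """Sketch of one spill: records are (partition, key, value) triples.
    Secondary sort: by partition first, then by key within a partition.
    If a combiner is given, it aggregates each (partition, key) run."""
    records.sort(key=lambda r: (r[0], r[1]))
    if combiner is None:
        return records
    out = []
    for (part, key), run in groupby(records, key=lambda r: (r[0], r[1])):
        out.append((part, key, combiner([v for _, _, v in run])))
    return out

recs = [(1, "b", 1), (0, "a", 1), (1, "b", 1), (0, "c", 1)]
print(spill(recs, combiner=sum))
# → [(0, 'a', 1), (0, 'c', 1), (1, 'b', 2)]
```

Note how the combiner collapsed the two (1, "b", 1) records into a single (1, "b", 2) before anything reaches disk, which is exactly the saving in disk and network traffic described above.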
