Hadoop Versions, the Hadoop Ecosystem, and the MapReduce Model

(1) Introduction to Apache Hadoop Versions

The Apache open-source project development process:

--Trunk branch: new features are developed on the trunk branch;

--Feature branches: many new features are initially unstable or incomplete, so they are developed on their own feature branches and merged into trunk once they mature;

--Candidate branches: candidate branches are split off from trunk periodically; a candidate branch stops receiving new features, and each round of bug fixes produces a new candidate release; stable releases are made from candidate branches;

Causes of the Hadoop version confusion:

--Major features developed on a branch: after the 0.20 branch was released, major features continued to be developed on that branch instead of being merged back into trunk, so the 0.20 branch became the mainstream line;

--Release order does not follow version order: 0.22 was released later than 0.23;

--Version renaming: the 0.20.205 release on the 0.20 branch was renamed to 1.0, so the two versions are the same code under different names;

Apache Hadoop version diagram: (figure omitted)

(2) Introduction to Apache Hadoop Version Features

First-generation Hadoop features:

--Append: supports appending to files, which prevents data loss when using HBase and is a prerequisite for running HBase on HDFS;

--RAID: keeps data reliable while reducing the number of data blocks by introducing parity (check) blocks;

--Symlink: supports symbolic links in HDFS;

--Security: the Hadoop security mechanism;

--NameNode HA: to avoid the NameNode being a single point of failure, an HA cluster runs two NameNodes;

Second-generation Hadoop features:

--HDFS Federation: the single NameNode limits how far HDFS can scale; federation lets multiple NameNodes each manage a different part of the namespace (different directories), providing access isolation and horizontal scaling;

--YARN: addresses MapReduce scalability and support for multiple frameworks; YARN is a new resource management framework that separates the JobTracker's resource management and job control functions: the ResourceManager is responsible for resource management and an ApplicationMaster is responsible for job control;

0.20 version branch: only this branch produced stable releases; the other branches are unstable;

--0.20.2 (stable): contains all the features; the classic release;

--0.20.203 (stable): contains Append; does not contain Symlink, RAID, or NameNode HA;

--0.20.205 / 1.0.0 (stable): contains Append and Security; does not contain Symlink, RAID, or NameNode HA;

--1.0.1 ~ 1.0.4 (stable): fix 1.0.0 bugs and add some performance improvements;

0.21 version branch (unstable): contains Append, RAID, Symlink, and NameNode HA; no Security;

0.22 version branch (unstable): contains Append, RAID, Symlink, and NameNode HA; does not contain MapReduce Security;

0.23 version branch:

--0.23.0 (unstable): second-generation Hadoop, adding HDFS Federation and YARN;

--0.23.1 ~ 0.23.5 (unstable): fix some 0.23.0 bugs and add some optimizations;

--2.0.0-alpha ~ 2.0.2-alpha (unstable): add NameNode HA and wire-compatibility;

(3) Correspondence between Cloudera Hadoop (CDH) and Apache Hadoop versions: (comparison table omitted)

2. The Hadoop Ecosystem

Apache support: Hadoop's core projects are hosted by Apache, and besides Hadoop itself several related projects form an integral part of the Hadoop ecosystem;

--HDFS: a distributed file system for reliable storage of massive data sets;

--MapReduce: a distributed data processing model that runs on large clusters of commodity machines;

--Pig: a data flow language and execution environment for querying very large data sets;

--HBase: a distributed, column-oriented database; HBase uses HDFS as its underlying storage and supports both batch computation with MapReduce and random reads;

--ZooKeeper: a distributed coordination service for Hadoop clusters, used when building distributed applications to avoid the inconsistency caused by partial failures;

--Sqoop: a tool for efficient bulk data transfer between relational databases and HDFS/HBase;

--Common: the components and interfaces shared by the distributed file systems and general I/O, including serialization, Java RPC, and persistent data structures;

--Avro: a serialization system supporting efficient cross-language RPC and persistent data storage;

Two. Introduction to the MapReduce model

MapReduce introduction: MapReduce is a programming model for data processing;

Multi-language support: MapReduce programs can be written in a variety of languages, such as Java, Ruby, Python, and C++;

Inherently parallel: MapReduce programs are inherently parallel;

1. MapReduce Data Model Analysis

MapReduce Data Model:

--Two phases: a MapReduce job is divided into two phases, the map phase and the reduce phase;

--Input and output: each phase uses key-value pairs as input and output, and the programmer chooses their types;

--Two functions: the programmer supplies a map function and a reduce function (see the sketch after this list);
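As a concrete illustration of these two functions (not part of the original article), the sketch below shows a minimal word-count Mapper and Reducer written against Hadoop's org.apache.hadoop.mapreduce Java API; the class names are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for every word in an input line, emit the key-value pair (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);               // intermediate key-value output
        }
    }
}

// Reduce phase: sum the counts collected for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));  // final key-value output
    }
}
```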

MapReduce job composition: a MapReduce job is a unit of work consisting of the input data, the MapReduce program, and configuration information (a driver that wires these together is sketched below);
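To show how these three parts fit together, here is a hedged driver sketch that reuses the illustrative classes above; the input and output paths are taken from the command line and the job name is arbitrary.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // configuration information
        Job job = Job.getInstance(conf, "word count");   // the unit of work

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);       // the MapReduce program
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // the input data
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // reduce output lands in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```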

Job control: job execution is controlled by one JobTracker and multiple TaskTrackers;

--JobTracker role: the JobTracker coordinates the tasks running on the TaskTrackers and schedules them centrally;

--TaskTracker role: TaskTrackers run the actual map and reduce tasks;

--Central scheduling: while running, each TaskTracker sends progress reports to the JobTracker, which keeps a record of the progress of every TaskTracker;

--Task failure handling: if a task on a TaskTracker fails, the JobTracker reschedules the task on another TaskTracker;

2. Map Data Flow

Input splits: when a MapReduce program runs, the input data is divided into fixed-size pieces; these pieces are the input splits;

--One map task per split: each split is processed by one map task, which runs the user-defined map function;

--Parallel processing: running one map task per split takes less time than processing all of the input in a single pass;

--Load balancing: the machines in a cluster vary in performance; assigning split work according to machine capability is more efficient than spreading it evenly and makes full use of the cluster;

--Reasonable split size: smaller splits give better load balancing, but the total time spent managing splits and map tasks grows, so a reasonable split size has to be chosen; the default is 64 MB, the same as the HDFS block size (a sketch of bounding the split size appears at the end of this section);

Data locality optimization: for best efficiency, a map task runs on the node where its input data is stored;

--Split = block: when a split equals one HDFS block, it is stored on a single node, which is the most efficient case;

--Split > block: when a split is larger than a block, its data is stored on multiple nodes and the map task has to fetch data over the network from several nodes, which reduces efficiency;

Map task output: when a map task finishes, its result is written to the local disk, not to HDFS;

--Intermediate result: map output is only an intermediate result; it is passed to the reduce tasks, whose output is the final result, and the map output is eventually deleted;

--Map task failure: if a map task fails, it is rerun on another node and the intermediate result is recomputed;
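As a hedged illustration of influencing the split size (the exact configuration property names differ between the Hadoop releases discussed above), the newer Java API lets the driver bound split sizes directly; the values below are only an example.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    // By default a split matches the HDFS block size (64 MB in these versions);
    // these static helpers bound the size the input format will choose.
    static void configureSplits(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 32 * 1024 * 1024L);  // 32 MB minimum
        FileInputFormat.setMaxInputSplitSize(job, 64 * 1024 * 1024L);  // 64 MB maximum
    }
}
```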

3. Reduce Data Flow

Reduce tasks: there are usually far more map tasks than reduce tasks;

--No data locality advantage: a reduce task's input is the output of the map tasks, so most of its input data is not local;

--Data merging: map outputs are transferred over the network to the node running the reduce task, merged there, and then fed as the input of the reduce task;

--Result output: reduce output is written directly to HDFS;

--Number of reducers: the number of reduce tasks is specified explicitly, in the configuration or in the job driver (see the sketch below);
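A small hedged sketch of setting the reducer count from the driver; the equivalent configuration property is mapred.reduce.tasks in the older releases discussed here and mapreduce.job.reduces in newer ones.

```java
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    static void configureReducers(Job job) {
        // Explicitly choose how many reduce tasks (and therefore output files) to run.
        job.setNumReduceTasks(3);

        // Setting the count to 0 gives a map-only job: map output is written straight to HDFS.
        // job.setNumReduceTasks(0);
    }
}
```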

MapReduce data flow diagram analysis (figures omitted):

--data flow of a single MapReduce job;

--data flow with multiple reduce tasks;

--data flow with no reduce task;

Map output partitioning: when there is more than one reduce task, each map task partitions its output, creating one partition for every reduce task;

--Partition contents: map output contains many different keys; all the records for a given key go to a single reduce task, and one map task may produce output for several reduce tasks;

--Partition function: the partition function can be defined by the user, but normally the default Partitioner, which partitions by hashing the key, is used (a sketch of a custom Partitioner follows this list);

Shuffle: the data flow between the map tasks and the reduce tasks is called the shuffle;

--Reduce data sources: each reduce task receives input from multiple map tasks;

--Map data destinations: each map task's results are sent to multiple reduce tasks;

No reduce: when the data can be processed fully in parallel, no shuffle is needed and only map tasks are run;
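To make the partition function concrete, here is a hedged sketch of a custom Partitioner; the default HashPartitioner behaves essentially the same way, hashing the key modulo the number of reduce tasks.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reduce task receives each map output record; every record
// with the same key hashes to the same partition, so all of a key's data
// ends up at a single reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

It would be registered in the driver with job.setPartitionerClass(WordPartitioner.class).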

4. Introduction to the Combiner

MapReduce bottleneck: network bandwidth limits the number of MapReduce tasks the cluster can usefully run, because a large amount of data has to be transferred while the map and reduce phases execute;

Solution: a combiner function merges the output of the map tasks locally and sends the merged result on to the reduce tasks, cutting the amount of data transferred (see the sketch below);
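Because word counts are plain sums (associative and commutative), the reducer logic can usually double as the combiner; a hedged sketch reusing the illustrative WordCountReducer from earlier:

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerExample {
    static void configureCombiner(Job job) {
        // Run the summing logic on each map node's local output first,
        // so far less data has to cross the network to the reducers.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
    }
}
```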

5. Hadoop Streaming

Hadoop multi-language support: Java, Python, Ruby, C++;

Other languages: Hadoop allows MapReduce functions to be written in languages other than Java;

--Standard streams: Hadoop uses Unix standard streams as the interface between Hadoop and the application, so MapReduce programs can be written in any language that can read standard input and write standard output;

--Streaming and text: in text mode, Streaming presents the data one line at a time, which makes it very well suited to text processing;

--Map function input and output: the map function reads its input from the standard input stream and writes its results to the standard output stream;

--Map output format: each output key-value pair is written to the standard output stream as a single tab-delimited line;

--Reduce function input and output: the input is tab-delimited key-value lines on the standard input stream, sorted by key by the Hadoop framework, and the result is written to the standard output stream (a streaming-style mapper is sketched below);
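The article names Python and Ruby as typical Streaming languages; to keep the examples in a single language, here is a hedged sketch of the Streaming contract written as a standalone Java program: it reads lines from standard input and writes tab-delimited key-value lines to standard output, exactly as a Streaming mapper would.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.StringTokenizer;

// A Streaming-style word-count mapper: any executable that reads lines from
// stdin and writes "key<TAB>value" lines to stdout can fill this role.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                // One tab-delimited key-value pair per output line.
                System.out.println(tokens.nextToken() + "\t" + 1);
            }
        }
    }
}
```

Such an executable would typically be wired into a job through the hadoop-streaming jar's -input, -output, -mapper, and -reducer options.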

6. Hadoop Pipes

Pipes concept: Pipes is the C++ interface to Hadoop MapReduce;

--Common misconception: Pipes does not use standard input and output streams to communicate with the map and reduce code, nor does it use JNI;

--How it works: Pipes uses sockets as the channel over which the TaskTracker communicates with the process running the C++ map or reduce function;

Copyright notice: unless otherwise stated, this article is original work by Hanshuliang; please retain a link to the source when reposting.
