MapReduce Principles

Introduction

MapReduce is a programming framework for distributed computing programs, and the core framework with which users develop Hadoop-based data analysis applications.
The core function of MapReduce is to integrate user-written business logic code with its own built-in components into a complete distributed computing program that runs concurrently on a Hadoop cluster.

MapReduce Framework Structure and Core Operating Mechanism

A complete MapReduce program has three types of instance processes at distributed runtime:
1. MRAppMaster: responsible for process scheduling and state coordination of the entire program
2. MapTask: responsible for the entire data processing flow of the map phase
3. ReduceTask: responsible for the entire data processing flow of the reduce phase
(Figure: process flow of an MR program)

Process Analysis

1. When an MR program starts, the MRAppMaster starts first. After it starts, the MRAppMaster reads the job description information, calculates the number of MapTask instances required, and then applies to the cluster for machines to start the corresponding number of MapTask processes.

2. After a MapTask process starts, it processes data according to the data slice (split) range assigned to it. The main flow is:
a) Use the client-specified InputFormat to obtain a RecordReader, read the data, and form input KV pairs
b) Pass the input KV pairs to the client-defined map() method, perform the logic, and collect the KV pairs output by map() into a cache
c) The KV pairs in the cache are partitioned and sorted by key, then continually spilled to disk files

3. The MRAppMaster monitors the MapTask processes until all of them have completed; it then starts the corresponding number of ReduceTask processes according to the client-specified parameter and tells each ReduceTask the range of data (data partition) it is to process.

4. After a ReduceTask process starts, it fetches, according to the data locations given by the MRAppMaster, a number of MapTask output result files from the machines where those MapTasks ran, and performs a local re-merge sort on them. It then calls the client-defined reduce() method on each group of KV pairs with the same key, collects the output KV results, and finally calls the client-specified OutputFormat to write the result data to external storage.
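To make this flow concrete, here is a minimal word-count job sketch (not part of the original text; class names and token handling are illustrative) showing where the client-defined map() and reduce() methods and the InputFormat/OutputFormat defaults fit in:

// A minimal word-count sketch; class names and paths are illustrative.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Step 2b: the client-defined map() method; input KV is (byte offset, line)
    public static class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);   // collected into the in-memory buffer (steps 2b/2c)
            }
        }
    }

    // Step 4: the client-defined reduce() method; called once per group of equal keys
    public static class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // default InputFormat: TextInputFormat
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // default OutputFormat: TextOutputFormat
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}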

MapTask Parallelism Decision Mechanism

The parallelism of MapTask determines the concurrency of task processing in the map phase, which in turn affects the processing speed of the whole job.
So, are more MapTask parallel instances always better? And how is the degree of parallelism decided?

The map-phase parallelism of a job is determined by the client when the job is submitted.
The basic logic by which the client plans the map-phase parallelism is:
perform a logical slicing of the data to be processed (that is, logically divide the data into multiple splits according to a specific split size), and then assign one MapTask parallel instance to process each split.
This logic, and the split plan description file it produces, are implemented by the getSplits() method of the FileInputFormat implementation class.

FileInputFormat Slicing Mechanism

The default slicing mechanism is defined in the getSplits() method of FileInputFormat, an implementation of the InputFormat class:
a) Slices are made simply according to the content length of each file
b) The slice size defaults to the block size
c) Slicing does not consider the dataset as a whole; each file is sliced separately
For example, suppose the data to be processed consists of two files:

file1.txt 320M
file2.txt 10M

After the FileInputFormat slicing mechanism runs, the following slice information is formed:

file1.txt.split1 -- 0M~128M
file1.txt.split2 -- 128M~256M
file1.txt.split3 -- 256M~320M
file2.txt.split1 -- 0M~10M
Parameter Configuration for Slice Size in FileInputFormat
Analysis of the source code shows that the slice size in FileInputFormat is computed as: Math.max(minSize, Math.min(maxSize, blockSize)). The slice size is determined mainly by these three values:

minSize: default value 1
    configuration parameter: mapreduce.input.fileinputformat.split.minsize
maxSize: default value Long.MAX_VALUE
    configuration parameter: mapreduce.input.fileinputformat.split.maxsize
blockSize: the HDFS block size of the file (128M by default in Hadoop 2.x)

So, by default, slice size = blockSize.
maxSize (maximum slice size): if this parameter is set smaller than blockSize, the slices become smaller, equal to the configured value.
minSize (minimum slice size): if this parameter is adjusted to be larger than blockSize, the slices become larger than blockSize.
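As a minimal sketch of the calculation described above (not the actual Hadoop source, which also handles details such as the roughly 10% slack allowed on the last slice, compressed files, and data locality), using the file1.txt figures from the earlier example:

// Simplified sketch of the slice-size logic; illustrative, not Hadoop source.
public class SplitSizeSketch {

    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        // The formula quoted above from FileInputFormat
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long minSize = 1L;                    // default of ...split.minsize
        long maxSize = Long.MAX_VALUE;        // default of ...split.maxsize
        long blockSize = 128L * 1024 * 1024;  // 128M HDFS block
        long splitSize = computeSplitSize(minSize, maxSize, blockSize);

        // Slice the 320M file1.txt from the earlier example, file by file
        long fileLength = 320L * 1024 * 1024;
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            System.out.printf("split: %dM~%dM%n",
                    offset / (1024 * 1024), (offset + length) / (1024 * 1024));
        }
        // Prints 0M~128M, 128M~256M, 256M~320M, matching the slice table above
    }
}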

Factors to consider when choosing the degree of concurrency:
1. The hardware configuration of the compute nodes
2. The type of task: CPU-intensive or IO-intensive
3. The data volume of the computing task

ReduceTask Parallelism Decision Mechanism

The parallelism of ReduceTask also affects the execution concurrency and efficiency of the whole job. However, unlike the number of MapTasks, which is determined by the number of slices, the number of ReduceTasks can be set manually:

// The default value is 1; here it is manually set to 4
job.setNumReduceTasks(4);

If the data is unevenly distributed, data skew may occur in the reduce phase.
Note: the number of ReduceTasks cannot be set arbitrarily; business logic requirements must also be considered. In some cases a global summary result must be computed, and then there can be only one ReduceTask.
Also try not to run too many reduce tasks. For most jobs, the best number of reduces is equal to, or slightly smaller than, the number of reduce slots in the cluster. This is especially important for small clusters.

The Shuffle Mechanism of MapReduce

How the data processed in the map phase is passed to the reduce phase is one of the most critical processes in the MapReduce framework. This process is called shuffle.
Shuffle: literally, "shuffling and dealing cards" (core mechanisms: data partitioning, sorting, and caching). Concretely, the result data output by MapTask is distributed to the ReduceTasks, and in the process of distribution the data is partitioned and sorted by key.
Its main steps are:

1. Partitioning the data
2. Sorting by key
3. Combiner: local aggregation of values
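As an illustration of points 1 and 3, here is a minimal sketch (not from the original text) of a custom Partitioner, plus the driver calls that register it and a Combiner; the WordCount classes from the earlier sketch are assumed:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// A custom Partitioner decides which ReduceTask (partition) each map-output
// KV pair is routed to during the shuffle. This example is hypothetical: it
// sends words beginning with a-m to partition 0 and the rest to partition 1.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
        // The returned partition number must lie in [0, numPartitions)
        return (first <= 'm' ? 0 : 1) % numPartitions;
    }
}

In the driver (the main() of the earlier word-count sketch):

job.setPartitionerClass(FirstLetterPartitioner.class);
// The Combiner performs local aggregation of map output before the shuffle;
// for word count the Reducer itself can serve, because summing is associative.
job.setCombinerClass(WordCount.WcReducer.class);
job.setNumReduceTasks(2);   // one ReduceTask per partition produced above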

Detailed Process

1. MapTask collects the KV pairs output by our map() method and puts them into a memory buffer
2. Local disk files are continually spilled from the memory buffer; multiple files may be spilled
3. The multiple spill files are merged into one large spill file
4. During the spilling and merging processes, the Partitioner is called to partition the data, which is also sorted by key
5. Each ReduceTask fetches, according to its own partition number, the corresponding partition of result data from each MapTask machine
6. A ReduceTask collects result files belonging to the same partition from the different MapTasks, then merges these files again (merge sort)
7. After merging into one large file, the shuffle process ends and the logical operation of the ReduceTask begins (reading KV pairs from the file one group of equal keys at a time and calling the user-defined reduce() method)

The size of the buffer used in the shuffle affects the execution efficiency of the MapReduce program: in principle, the larger the buffer, the less disk IO and the faster the execution.
The buffer size can be adjusted with the parameter io.sort.mb (default 100M; in Hadoop 2.x and later the parameter is named mapreduce.task.io.sort.mb).
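A minimal sketch of adjusting this buffer from the driver, assuming the Hadoop 2.x property name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleBufferExample {
    public static void main(String[] args) throws Exception {
        // Raise the shuffle sort buffer from the 100M default to 200M before
        // the job is submitted; mapreduce.task.io.sort.mb is the Hadoop 2.x
        // name of the io.sort.mb parameter mentioned above.
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        Job job = Job.getInstance(conf, "shuffle buffer example");
        // ... set mapper, reducer, and paths as in the earlier word-count sketch ...
    }
}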

MapReduce and YARN

YARN Overview

YARN is a resource scheduling platform responsible for providing server computing resources to running programs. It is equivalent to a distributed operating-system platform, while MapReduce and other computing programs are equivalent to applications running on top of that operating system.

Important Concepts of YARN

1. YARN does not know the operating mechanism of user-submitted programs
2. YARN only provides scheduling of computing resources (user programs apply to YARN for resources, and YARN is responsible for allocating them)
3. The supervising role in YARN is called ResourceManager
4. The role in YARN that concretely provides computing resources is called NodeManager
5. In this way, YARN is completely decoupled from the running user programs, which means that various types of distributed computing programs (MapReduce is just one of them) can run on YARN, such as MapReduce, Storm programs, Spark programs, Tez ...
6. Therefore, computing frameworks such as Spark and Storm can be integrated to run on YARN, as long as each framework has a YARN-compliant resource request mechanism
7. YARN thus becomes a general resource scheduling platform; all kinds of computing clusters in an enterprise can then be integrated into one physical cluster, improving resource utilization and facilitating data sharing

Example of Running a Computing Program on YARN

(Figure: the scheduling process of a MapReduce program on YARN)
