Hadoop in the Big Data Era (5): The Hadoop Distributed Computing Framework (MapReduce)


 

Hadoop in the Big Data Era (1): Hadoop Installation

Hadoop in the Big Data Era (2): Hadoop Script Parsing

Hadoop in the Big Data Era (3): The Hadoop Data Flow (Life Cycle)

Hadoop in the Big Data Era (4): The Hadoop Distributed File System (HDFS)

 

Hadoop has two core modules: one is the distributed storage system, HDFS, which was introduced in the previous article; the other is the Hadoop computing framework, MapReduce.

MapReduce is, in essence, a distributed computing framework that moves the computation to the data and processes everything as key-value pairs.

 

A computation is divided into two stages, the map stage and the reduce stage, and both operate on data as key-value pairs. The model itself is simple, but understanding every link in the chain and its implementation details still takes some effort. In this article I therefore pick out the core pieces of MapReduce and walk through them.
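To make the two stages concrete, here is a minimal word-count sketch written against the old org.apache.hadoop.mapred API used throughout this article. It is an illustration, not code from the original article; the class names are mine and the input/output paths come from the command line.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Map stage: one (offset, line) record in, one (word, 1) pair out per token.
  public static class TokenizerMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce stage: all counts for one word arrive together and are summed.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setMapperClass(TokenizerMapper.class);
    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    // Input and output paths are placeholders supplied on the command line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}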


 

1. Default values of the MapReduce driver

 

One reason MapReduce programs are easy to write is that the framework supplies sensible defaults for almost everything, so a minimal job runs with very little configuration. Being easy to get started with, however, is not the same as being understood: these defaults are the MapReduce engine, and only by knowing what each one does can you be confident that the program you write behaves the way you intend. The following code spells out the defaults explicitly:

public int run(String[] args) throws IOException {
  JobConf conf = new JobConf();

  /**
   * The default input format, i.e. the format of the data the mapper will process.
   * Hadoop supports many input formats (explained in detail below), but
   * TextInputFormat is the most common: for ordinary text files the key is the
   * start offset of each line within the file (LongWritable) and the value is
   * the line itself (Text).
   */
  conf.setInputFormat(org.apache.hadoop.mapred.TextInputFormat.class);

  /**
   * The actual number of map tasks depends on the size of the input files and
   * on the block size; this value is only a hint.
   */
  conf.setNumMapTasks(1);

  /**
   * The default mapper class. If we do not specify our own mapper,
   * this IdentityMapper class is used.
   */
  conf.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);

  /**
   * Map tasks are run by MapRunner, the default implementation of MapRunnable.
   * It calls the mapper's map() method once for each record in turn;
   * its source code is shown below and is a key point.
   */
  conf.setMapRunnerClass(org.apache.hadoop.mapred.MapRunner.class);

  /**
   * The key and value types of the map task's output.
   */
  conf.setMapOutputKeyClass(org.apache.hadoop.io.LongWritable.class);
  conf.setMapOutputValueClass(org.apache.hadoop.io.Text.class);

  /**
   * HashPartitioner is the default partitioner. It partitions the map output,
   * i.e. divides the result data into blocks, one partition per reduce task.
   * HashPartitioner hashes the key of each record to decide which partition
   * the record belongs to.
   */
  conf.setPartitionerClass(org.apache.hadoop.mapred.lib.HashPartitioner.class);

  /**
   * The number of reduce tasks.
   */
  conf.setNumReduceTasks(1);

  /**
   * The default reducer class. If we do not specify our own reducer,
   * this IdentityReducer class is used.
   */
  conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);

  /**
   * The key and value types of the job's final output.
   */
  conf.setOutputKeyClass(org.apache.hadoop.io.LongWritable.class);
  conf.setOutputValueClass(org.apache.hadoop.io.Text.class);

  /**
   * The final output is written as a text file.
   */
  conf.setOutputFormat(org.apache.hadoop.mapred.TextOutputFormat.class);

  JobClient.runJob(conf);
  return 0;
}

 

Most of what I want to say is already in the code comments. One additional point: because of type erasure, Java generic type information is not available at run time, so Hadoop cannot infer the map and reduce input and output types from the generic parameters; they have to be set explicitly on the JobConf.
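A short, hedged illustration of why those explicit setters are needed; the mapper and driver class here are invented for the example, not taken from Hadoop or the original article.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Illustrative only: shows why the type setters are needed.
public class ErasureExample {

  // The generic parameters say this mapper emits (Text, IntWritable) pairs ...
  public static class MyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      output.collect(new Text(value.toString()), new IntWritable(1));
    }
  }

  public static JobConf configureJob() {
    JobConf conf = new JobConf(ErasureExample.class);
    conf.setMapperClass(MyMapper.class);
    // ... but the parameters are erased at run time, so the framework cannot
    // read them from MyMapper.class; they must be repeated here explicitly.
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);
    return conf;
  }
}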

 

The MapRunner class set above with setMapRunnerClass() is the engine that runs a map task. Its default implementation is shown below:

public class MapRunner<K1, V1, K2, V2> implements MapRunnable<K1, V1, K2, V2> {

  private Mapper<K1, V1, K2, V2> mapper;
  private boolean incrProcCount;

  @SuppressWarnings("unchecked")
  public void configure(JobConf job) {
    // Obtain the mapper instance by reflection.
    this.mapper = ReflectionUtils.newInstance(job.getMapperClass(), job);
    // Increment processed counter only if skipping feature is enabled.
    this.incrProcCount = SkipBadRecords.getMapperMaxSkipRecords(job) > 0 &&
        SkipBadRecords.getAutoIncrMapperProcCount(job);
  }

  public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
                  Reporter reporter) throws IOException {
    try {
      // Allocate key & value instances that are re-used for all entries.
      K1 key = input.createKey();
      V1 value = input.createValue();
      while (input.next(key, value)) {
        // Map the pair to the output: call the mapper's map() once per record.
        mapper.map(key, value, output, reporter);
        if (incrProcCount) {
          reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
              SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
        }
      }
    } finally {
      mapper.close();
    }
  }

  protected Mapper<K1, V1, K2, V2> getMapper() {
    return mapper;
  }
}
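Because MapRunnable is only an interface, the run engine can be swapped out via conf.setMapRunnerClass(). Below is a hedged sketch of a custom runner, invented purely for illustration (the class name and the 10,000-record threshold are not from Hadoop or the original article), that behaves like MapRunner but also reports progress periodically.

import java.io.IOException;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.ReflectionUtils;

// Illustrative only: a MapRunnable that reports progress every 10,000 records.
public class ProgressReportingRunner<K1, V1, K2, V2>
    implements MapRunnable<K1, V1, K2, V2> {

  private Mapper<K1, V1, K2, V2> mapper;

  @SuppressWarnings("unchecked")
  public void configure(JobConf job) {
    this.mapper = ReflectionUtils.newInstance(job.getMapperClass(), job);
  }

  public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
                  Reporter reporter) throws IOException {
    try {
      K1 key = input.createKey();
      V1 value = input.createValue();
      long processed = 0;
      while (input.next(key, value)) {
        mapper.map(key, value, output, reporter);
        if (++processed % 10000 == 0) {
          reporter.progress();   // tell the framework the task is still alive
        }
      }
    } finally {
      mapper.close();
    }
  }
}

// Plugged in from the driver:  conf.setMapRunnerClass(ProgressReportingRunner.class);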


 

Believe me, sometimes reading the source code is the quickest way to understand!

 

 

2. Shuffle

The shuffle is the process that carries map output to reduce input. It is often called the "heart" of MapReduce and can be divided into three stages: partitioning (and sorting) on the map side, copying on the reduce side, and sorting (merging) on the reduce side.

 

2.1 Map-side partitioning

A MapReduce job usually has many map tasks and several reduce tasks, and each task may run on a different machine, so getting the map output to the right reduce input is not a trivial step.

 


When the map function produces output, it is not written straight to disk. For efficiency it is first written to an in-memory buffer and pre-sorted. The process works as follows:

 

 

 

Before the buffered data is written to disk, the thread first divides it into the appropriate partitions (one per reducer). Within each partition a background thread sorts the records by key, and if a combiner is configured it runs on the sorted output.
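To make the partition step concrete, here is a hedged sketch of a partitioner in the old API. HashPartitioner essentially applies the hash-based rule in the fallback branch to every key; the "US-" routing rule and the class name are invented for illustration.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Illustrative partitioner: keys starting with "US-" go to partition 0,
// all other keys are spread over the remaining partitions by hash.
public class CountryPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // no configuration needed for this example
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions == 1 || key.toString().startsWith("US-")) {
      return 0;
    }
    // hash-based fallback, masked to keep the result non-negative
    return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
  }
}

// In the driver:
//   conf.setPartitionerClass(CountryPartitioner.class);
//   conf.setCombinerClass(MyReducer.class);   // optional combiner (MyReducer is a placeholder)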

 

2.2 The reduce-side copy phase

The output of a map task is written to local disk, already divided into as many partitions as there are reducers (each reducer needs one partition from every map). Because map tasks finish at different times, a reduce task starts copying a map's output as soon as that map completes; this is the copy phase of the reduce task.

 

2.3 The reduce-side sort (merge) phase

 

After all the map outputs have been copied, the reduce task enters the sort phase, in which the map outputs are merged while preserving their sort order.

 


3. Input and Output formats

As data volumes grow exponentially, so does the variety of data formats, which makes big data ever harder to process. To cope with this variety, Hadoop provides a family of input and output format classes whose job is to parse the various kinds of input files and to produce output in the required format.

Whatever the format, the data ultimately has to be fed through MapReduce to get the most out of Hadoop.

This part, too, is at the core of Hadoop!

 

3.1 Input splits and records

 

As mentioned in the discussion of HDFS, an input split is the chunk of input processed by a single map task, and a split should normally be the same size as an HDFS block.

Each split is divided into records; each record is a key-value pair, and the map function processes the records one after another.

In a database context, an input split can correspond to a range of rows in a table and a record to a single row (this is what DBInputFormat does).

In Hadoop an input split is represented by the InputSplit interface and is created by an InputFormat.

The InputFormat is responsible for generating the input splits and dividing them into records. It is only an interface; the concrete work is done by its implementations.

 

 

3.2 FileInputFormat

FileInputFormat is the base class for InputFormat implementations that use files as their data source. It provides two things: it defines which files are included in the input of a job, and it generates the splits for those input files. Dividing the splits into records is left to its subclasses; FileInputFormat itself is an abstract class.

FileInputFormat implements the split generation, but how? Three configuration parameters need to be introduced first:

Property                Type    Default value     Description
mapred.min.split.size   int     1                 Smallest valid size, in bytes, for a file split
mapred.max.split.size   long    Long.MAX_VALUE    Largest valid size, in bytes, for a file split
dfs.block.size          long    64 MB             Block size in HDFS

 

The split size is computed with the following formula (see the computeSplitSize() method of the FileInputFormat class):

    max(minimumSize, min(maximumSize, blockSize))

With the default values, minimumSize < blockSize < maximumSize, so the split size works out to the block size.
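A minimal sketch of that computation; the method and variable names follow the formula above rather than the exact Hadoop source.

// Sketch of the split-size rule.
public class SplitSizeDemo {
  static long computeSplitSize(long minSize, long maxSize, long blockSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long min = 1L;                    // mapred.min.split.size default
    long max = Long.MAX_VALUE;        // mapred.max.split.size default
    long block = 64L * 1024 * 1024;   // dfs.block.size default (64 MB)
    // With the defaults the split size equals the block size: prints 67108864.
    System.out.println(computeSplitSize(min, max, block));
  }
}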

 

 

FileInputFormat only splits "large" files, that is, files larger than an HDFS block.

The InputSplit generated by FileInputFormat is therefore either an entire file (the file is too small to be split, so the whole file becomes one split for a map task) or a part of a file (the file is large enough to be split).

3.3 Common InputFormat implementations

 

 

Small files and CombineFileInputFormat

Although Hadoop is best suited to large files, large numbers of small files are unavoidable in practice. Hadoop therefore provides CombineFileInputFormat, which is designed for small files: it packs many files into a single split, so each mapper gets more data to process.

 

 

TextInputFormat

Hadoop's default InputFormat. The key of each record is the byte offset of the line within the file, and the value is the content of the line.

 


KeyValueTextInputFormat

Suited to files, such as configuration files, in which each line is already a key-value pair: the key is the part of the line before the separator and the value is the part after it.
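A hedged configuration sketch for the old API; the separator property name used here is the one I believe the old mapred KeyValueTextInputFormat reads, so treat it as an assumption and verify it against your Hadoop version.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class KeyValueJobSetup {
  static void configure(JobConf conf) {
    conf.setInputFormat(KeyValueTextInputFormat.class);
    // Split each line at '=' instead of the default tab; a line such as
    // "log.level=INFO" then yields key "log.level" and value "INFO".
    conf.set("key.value.separator.in.input.line", "=");
  }
}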

 


NLineInputFormat

Also intended for text files. Its distinguishing feature is that each mapper receives a fixed number of input lines; otherwise it is similar to TextInputFormat.

 


SequenceFileInputFormat (binary input)

Reads Hadoop's sequence file format, which stores binary key-value pairs. Because a sequence file already holds its data as key-value pairs, SequenceFileInputFormat can hand the records straight to MapReduce.

 


DBInputFormat

As the name implies, it reads data from a relational database over JDBC.

 

 

Multiple inputs

The MultipleInputs class handles jobs whose input arrives in more than one format. For example, if the input contains both text and binary data, MultipleInputs lets you specify, for each input path, the InputFormat and the mapper used to parse it, as sketched below.
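A hedged sketch of wiring up two differently formatted inputs with the old-API MultipleInputs. The paths, the mapper classes, and the assumption that the sequence file holds (Text, Text) records are all mine, not from the original article.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleInputs;

// Illustrative only: two input directories in different formats, each with its own mapper.
public class MultipleInputsSetup {

  // Mapper for plain text input: key is the line offset, value is the line.
  public static class TextLogMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      output.collect(new Text(value.toString()), new IntWritable(1));
    }
  }

  // Mapper for sequence-file input; the records are assumed to be (Text, Text).
  public static class BinaryEventMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, IntWritable> {
    public void map(Text key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      output.collect(key, new IntWritable(1));
    }
  }

  static void configure(JobConf conf) {
    // Each path gets its own InputFormat and mapper; both mappers must emit the
    // same intermediate key/value types so one reducer can merge their output.
    MultipleInputs.addInputPath(conf, new Path("/input/text-logs"),
        TextInputFormat.class, TextLogMapper.class);
    MultipleInputs.addInputPath(conf, new Path("/input/binary-events"),
        SequenceFileInputFormat.class, BinaryEventMapper.class);
  }
}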

 

3.4 Output formats

Where there are input formats there are output formats, and each output format corresponds to an input format.

The default output format is TextOutputFormat, which writes each record as a line of text. Its keys and values may be of any type, since it turns them into strings; by default the key and value on each line are separated by a tab.
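A hedged sketch of changing that separator in the old API. The property name below is the one I believe the old mapred TextOutputFormat reads; treat it as an assumption and check it against your Hadoop version.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;

public class OutputSetup {
  static void configure(JobConf conf) {
    conf.setOutputFormat(TextOutputFormat.class);
    // Assumed property name for the old API: write "key,value" instead of "key<TAB>value".
    conf.set("mapred.textoutputformat.separator", ",");
  }
}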


 

 

3.5 Hadoop features

Beyond the points above, there are counters, sorting, joins, and more worth looking at; details will follow in a later article.

 

 


 

