Hadoop in the Big Data Era (5): The Hadoop Distributed Computing Framework (MapReduce)


 

Hadoop in the Big Data Era (1): Hadoop Installation

Hadoop in the Big Data Era (2): Hadoop Script Parsing

Hadoop in the Big Data Era (3): The Hadoop Data Flow (Life Cycle)

Hadoop in the Big Data Era (4): The Hadoop Distributed File System (HDFS)

 

Hadoop has two core modules: one is the distributed storage system, HDFS, which was introduced in the previous article; the other is the Hadoop computing framework, MapReduce.

MapReduce is, in essence, a distributed computing framework that moves the computation to the data and processes everything as key-value pairs.

 

A computation is divided into two stages, the map stage and the reduce stage, and both operate on data as key-value pairs. The model itself is simple, but understanding every link in the chain and its implementation details still takes some effort. In this article I therefore pick out the core pieces of MapReduce and walk through them.
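To make the two stages concrete, here is a minimal word-count sketch written against the old org.apache.hadoop.mapred API used throughout this article. It is an illustration, not code from the original article; the class names are mine and the input/output paths come from the command line.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Map stage: one (offset, line) record in, one (word, 1) pair out per token.
  public static class TokenizerMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce stage: all counts for one word arrive together and are summed.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setMapperClass(TokenizerMapper.class);
    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    // Input and output paths are placeholders supplied on the command line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}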


 

1. Default values of the MapReduce driver

 

One reason MapReduce programs are easy to write is that the framework supplies sensible defaults for almost everything, so a minimal job runs with very little configuration. Being easy to get started with, however, is not the same as being understood: these defaults are the MapReduce engine, and only by knowing what each one does can you be confident that the program you write behaves the way you intend. The following code spells out the defaults explicitly:

public int run(String[] args) throws IOException {
  JobConf conf = new JobConf();

  /**
   * The default input format, i.e. the format of the data the mapper will process.
   * Hadoop supports many input formats (explained in detail below), but
   * TextInputFormat is the most common: for ordinary text files the key is the
   * start offset of each line within the file (LongWritable) and the value is
   * the line itself (Text).
   */
  conf.setInputFormat(org.apache.hadoop.mapred.TextInputFormat.class);

  /**
   * The actual number of map tasks depends on the size of the input files and
   * on the block size; this value is only a hint.
   */
  conf.setNumMapTasks(1);

  /**
   * The default mapper class. If we do not specify our own mapper,
   * this IdentityMapper class is used.
   */
  conf.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);

  /**
   * Map tasks are run by MapRunner, the default implementation of MapRunnable.
   * It calls the mapper's map() method once for each record in turn;
   * its source code is shown below and is a key point.
   */
  conf.setMapRunnerClass(org.apache.hadoop.mapred.MapRunner.class);

  /**
   * The key and value types of the map task's output.
   */
  conf.setMapOutputKeyClass(org.apache.hadoop.io.LongWritable.class);
  conf.setMapOutputValueClass(org.apache.hadoop.io.Text.class);

  /**
   * HashPartitioner is the default partitioner. It partitions the map output,
   * i.e. divides the result data into blocks, one partition per reduce task.
   * HashPartitioner hashes the key of each record to decide which partition
   * the record belongs to.
   */
  conf.setPartitionerClass(org.apache.hadoop.mapred.lib.HashPartitioner.class);

  /**
   * The number of reduce tasks.
   */
  conf.setNumReduceTasks(1);

  /**
   * The default reducer class. If we do not specify our own reducer,
   * this IdentityReducer class is used.
   */
  conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);

  /**
   * The key and value types of the job's final output.
   */
  conf.setOutputKeyClass(org.apache.hadoop.io.LongWritable.class);
  conf.setOutputValueClass(org.apache.hadoop.io.Text.class);

  /**
   * The final output is written as a text file.
   */
  conf.setOutputFormat(org.apache.hadoop.mapred.TextOutputFormat.class);

  JobClient.runJob(conf);
  return 0;
}

 

Most of what I want to say is already in the code comments. One additional point: because of type erasure, Java generic type information is not available at run time, so Hadoop cannot infer the map and reduce input and output types from the generic parameters; they have to be set explicitly on the JobConf.
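A short, hedged illustration of why those explicit setters are needed; the mapper and driver class here are invented for the example, not taken from Hadoop or the original article.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Illustrative only: shows why the type setters are needed.
public class ErasureExample {

  // The generic parameters say this mapper emits (Text, IntWritable) pairs ...
  public static class MyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      output.collect(new Text(value.toString()), new IntWritable(1));
    }
  }

  public static JobConf configureJob() {
    JobConf conf = new JobConf(ErasureExample.class);
    conf.setMapperClass(MyMapper.class);
    // ... but the parameters are erased at run time, so the framework cannot
    // read them from MyMapper.class; they must be repeated here explicitly.
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);
    return conf;
  }
}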

 

The MapRunner class set above with setMapRunnerClass() is the engine that runs a map task. Its default implementation is shown below:

public class MapRunner<K1, V1, K2, V2> implements MapRunnable<K1, V1, K2, V2> {

  private Mapper<K1, V1, K2, V2> mapper;
  private boolean incrProcCount;

  @SuppressWarnings("unchecked")
  public void configure(JobConf job) {
    // Obtain the mapper instance by reflection.
    this.mapper = ReflectionUtils.newInstance(job.getMapperClass(), job);
    // Increment processed counter only if skipping feature is enabled.
    this.incrProcCount = SkipBadRecords.getMapperMaxSkipRecords(job) > 0 &&
        SkipBadRecords.getAutoIncrMapperProcCount(job);
  }

  public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
                  Reporter reporter) throws IOException {
    try {
      // Allocate key & value instances that are re-used for all entries.
      K1 key = input.createKey();
      V1 value = input.createValue();
      while (input.next(key, value)) {
        // Map the pair to the output: call the mapper's map() once per record.
        mapper.map(key, value, output, reporter);
        if (incrProcCount) {
          reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
              SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
        }
      }
    } finally {
      mapper.close();
    }
  }

  protected Mapper<K1, V1, K2, V2> getMapper() {
    return mapper;
  }
}
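Because MapRunnable is only an interface, the run engine can be swapped out via conf.setMapRunnerClass(). Below is a hedged sketch of a custom runner, invented purely for illustration (the class name and the 10,000-record threshold are not from Hadoop or the original article), that behaves like MapRunner but also reports progress periodically.

import java.io.IOException;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.ReflectionUtils;

// Illustrative only: a MapRunnable that reports progress every 10,000 records.
public class ProgressReportingRunner<K1, V1, K2, V2>
    implements MapRunnable<K1, V1, K2, V2> {

  private Mapper<K1, V1, K2, V2> mapper;

  @SuppressWarnings("unchecked")
  public void configure(JobConf job) {
    this.mapper = ReflectionUtils.newInstance(job.getMapperClass(), job);
  }

  public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
                  Reporter reporter) throws IOException {
    try {
      K1 key = input.createKey();
      V1 value = input.createValue();
      long processed = 0;
      while (input.next(key, value)) {
        mapper.map(key, value, output, reporter);
        if (++processed % 10000 == 0) {
          reporter.progress();   // tell the framework the task is still alive
        }
      }
    } finally {
      mapper.close();
    }
  }
}

// Plugged in from the driver:  conf.setMapRunnerClass(ProgressReportingRunner.class);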


 

Believe me, sometimes reading the source code is the quickest way to understand!

 

 

2. Shuffle

The shuffle is the process that carries map output to reduce input. It is often called the "heart" of MapReduce and can be divided into three stages: partitioning (and sorting) on the map side, copying on the reduce side, and sorting (merging) on the reduce side.

 

2.1 Map-side partitioning

A MapReduce job usually has many map tasks and several reduce tasks, and each task may run on a different machine, so getting the map output to the right reduce input is not a trivial step.

 


When the map function produces output, it is not written straight to disk. For efficiency it is first written to an in-memory buffer and pre-sorted. The process works as follows:

 

 

 

Before the buffered data is written to disk, the thread first divides it into the appropriate partitions (one per reducer). Within each partition a background thread sorts the records by key, and if a combiner is configured it runs on the sorted output.
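To make the partition step concrete, here is a hedged sketch of a partitioner in the old API. HashPartitioner essentially applies the hash-based rule in the fallback branch to every key; the "US-" routing rule and the class name are invented for illustration.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Illustrative partitioner: keys starting with "US-" go to partition 0,
// all other keys are spread over the remaining partitions by hash.
public class CountryPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // no configuration needed for this example
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions == 1 || key.toString().startsWith("US-")) {
      return 0;
    }
    // hash-based fallback, masked to keep the result non-negative
    return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
  }
}

// In the driver:
//   conf.setPartitionerClass(CountryPartitioner.class);
//   conf.setCombinerClass(MyReducer.class);   // optional combiner (MyReducer is a placeholder)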

 

2.2 The reduce-side copy phase

The output of a map task is written to local disk, already divided into as many partitions as there are reducers (each reducer needs one partition from every map). Because map tasks finish at different times, a reduce task starts copying a map's output as soon as that map completes; this is the copy phase of the reduce task.

 

2.3 The reduce-side sort (merge) phase

 

After all the map outputs have been copied, the reduce task enters the sort phase, in which the map outputs are merged while preserving their sort order.

 


3. Input and Output formats

As data volumes grow exponentially, so does the variety of data formats, which makes big data ever harder to process. To cope with this variety, Hadoop provides a family of input and output format classes whose job is to parse the various kinds of input files and to produce output in the required format.

Whatever the format, the data ultimately has to be fed through MapReduce to get the most out of Hadoop.

This part, too, is at the core of Hadoop!

 

3.1 Input splits and records

 

As mentioned in the discussion of HDFS, an input split is the chunk of input processed by a single map task, and a split should normally be the same size as an HDFS block.

Each split is divided into records; each record is a key-value pair, and the map function processes the records one after another.

In a database context, an input split can correspond to a range of rows in a table and a record to a single row (this is what DBInputFormat does).

In Hadoop an input split is represented by the InputSplit interface and is created by an InputFormat.

The InputFormat is responsible for generating the input splits and dividing them into records. It is only an interface; the concrete work is done by its implementations.

 

 

3.2 FileInputFormat

FileInputFormat is the base class for InputFormat implementations that use files as their data source. It provides two things: it defines which files are included in the input of a job, and it generates the splits for those input files. Dividing the splits into records is left to its subclasses; FileInputFormat itself is an abstract class.

FileInputFormat implements the split generation, but how? Three configuration parameters need to be introduced first:

Property                Type    Default value     Description
mapred.min.split.size   int     1                 Smallest valid size, in bytes, for a file split
mapred.max.split.size   long    Long.MAX_VALUE    Largest valid size, in bytes, for a file split
dfs.block.size          long    64 MB             Block size in HDFS

 

The split size is computed with the following formula (see the computeSplitSize() method of the FileInputFormat class):

    max(minimumSize, min(maximumSize, blockSize))

With the default values, minimumSize < blockSize < maximumSize, so the split size works out to the block size.
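A minimal sketch of that computation; the method and variable names follow the formula above rather than the exact Hadoop source.

// Sketch of the split-size rule.
public class SplitSizeDemo {
  static long computeSplitSize(long minSize, long maxSize, long blockSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long min = 1L;                    // mapred.min.split.size default
    long max = Long.MAX_VALUE;        // mapred.max.split.size default
    long block = 64L * 1024 * 1024;   // dfs.block.size default (64 MB)
    // With the defaults the split size equals the block size: prints 67108864.
    System.out.println(computeSplitSize(min, max, block));
  }
}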

 

 

FileInputFormat only splits "large" files, that is, files larger than an HDFS block.

The InputSplit generated by FileInputFormat is therefore either an entire file (the file is too small to be split, so the whole file becomes one split for a map task) or a part of a file (the file is large enough to be split).

3.3 Common InputFormat implementations

 

 

Small files and CombineFileInputFormat

Although Hadoop is best suited to large files, large numbers of small files are unavoidable in practice. Hadoop therefore provides CombineFileInputFormat, which is designed for small files: it packs many files into a single split, so each mapper gets more data to process.

 

 

TextInputFormat

Hadoop's default InputFormat. The key of each record is the byte offset of the line within the file, and the value is the content of the line.

 


KeyValueTextInputFormat

Suited to files, such as configuration files, in which each line is already a key-value pair: the key is the part of the line before the separator and the value is the part after it.
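A hedged configuration sketch for the old API; the separator property name used here is the one I believe the old mapred KeyValueTextInputFormat reads, so treat it as an assumption and verify it against your Hadoop version.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class KeyValueJobSetup {
  static void configure(JobConf conf) {
    conf.setInputFormat(KeyValueTextInputFormat.class);
    // Split each line at '=' instead of the default tab; a line such as
    // "log.level=INFO" then yields key "log.level" and value "INFO".
    conf.set("key.value.separator.in.input.line", "=");
  }
}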

 


NLineInputFormat

Also intended for text files. Its distinguishing feature is that each mapper receives a fixed number of input lines; otherwise it is similar to TextInputFormat.

 


SequenceFileInputFormat (binary input)

Reads Hadoop's sequence file format, which stores binary key-value pairs. Because a sequence file already holds its data as key-value pairs, SequenceFileInputFormat can hand the records straight to MapReduce.

 


DBInputFormat

As the name implies, it reads data from a relational database over JDBC.

 

 

Multiple inputs

The MultipleInputs class handles jobs whose input arrives in more than one format. For example, if the input contains both text and binary data, MultipleInputs lets you specify, for each input path, the InputFormat and the mapper used to parse it, as sketched below.
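A hedged sketch of wiring up two differently formatted inputs with the old-API MultipleInputs. The paths, the mapper classes, and the assumption that the sequence file holds (Text, Text) records are all mine, not from the original article.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleInputs;

// Illustrative only: two input directories in different formats, each with its own mapper.
public class MultipleInputsSetup {

  // Mapper for plain text input: key is the line offset, value is the line.
  public static class TextLogMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      output.collect(new Text(value.toString()), new IntWritable(1));
    }
  }

  // Mapper for sequence-file input; the records are assumed to be (Text, Text).
  public static class BinaryEventMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, IntWritable> {
    public void map(Text key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      output.collect(key, new IntWritable(1));
    }
  }

  static void configure(JobConf conf) {
    // Each path gets its own InputFormat and mapper; both mappers must emit the
    // same intermediate key/value types so one reducer can merge their output.
    MultipleInputs.addInputPath(conf, new Path("/input/text-logs"),
        TextInputFormat.class, TextLogMapper.class);
    MultipleInputs.addInputPath(conf, new Path("/input/binary-events"),
        SequenceFileInputFormat.class, BinaryEventMapper.class);
  }
}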

 

3.4 Output formats

Where there are input formats there are output formats, and each output format corresponds to an input format.

The default output format is TextOutputFormat, which writes each record as a line of text. Its keys and values may be of any type, since it turns them into strings; by default the key and value on each line are separated by a tab.
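A hedged sketch of changing that separator in the old API. The property name below is the one I believe the old mapred TextOutputFormat reads; treat it as an assumption and check it against your Hadoop version.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;

public class OutputSetup {
  static void configure(JobConf conf) {
    conf.setOutputFormat(TextOutputFormat.class);
    // Assumed property name for the old API: write "key,value" instead of "key<TAB>value".
    conf.set("mapred.textoutputformat.separator", ",");
  }
}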


 

 

3.5 Hadoop features

Beyond the points above, there are counters, sorting, joins, and more worth looking at; details will follow in a later article.

 

 


 

