Hadoop Map/Reduce Tutorial

Objective

This tutorial provides a comprehensive overview of the various aspects of the Hadoop Map/Reduce framework from a user's perspective.

Prerequisites

Make sure that Hadoop is installed, configured, and running correctly. For more information, see:

Hadoop Quick Start for first-time users.
Hadoop Cluster Setup for large, distributed clusters.

Overview

Hadoop Map/Reduce is a software framework for easily writing applications that process terabyte-scale datasets in parallel on large clusters (thousands of nodes) of commodity machines in a reliable, fault-tolerant manner.

A Map/Reduce job typically splits the input dataset into independent chunks that are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then fed as input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing tasks that have failed.

Typically the Map/Reduce framework and the distributed file system run on the same set of nodes; that is, the compute nodes and the storage nodes are usually the same. This configuration allows the framework to schedule tasks efficiently on the nodes where data is already present, so that the aggregate network bandwidth of the cluster is used very efficiently.

The Map/Reduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling all the tasks that make up a job onto the slaves, monitoring their execution, and re-executing failed tasks. The slaves simply execute the tasks assigned by the master.

At a minimum, an application specifies the input/output locations (paths) and supplies the map and reduce functions by implementing the appropriate interfaces and/or abstract classes. These, together with other job parameters, comprise the job configuration. Hadoop's job client then submits the job (jar/executable, etc.) and configuration to the JobTracker, which distributes the software and configuration to the slaves, schedules the tasks, monitors their execution, and provides status and diagnostic information back to the job client.

Although the Hadoop framework is implemented in Java™, Map/Reduce applications need not be written in Java.

Hadoop Streaming is a utility which allows users to create and run jobs with any executables (for example, shell utilities) as the mapper and/or the reducer. Hadoop Pipes is a SWIG-compatible C++ API (not based on JNI™) that can also be used to implement Map/Reduce applications.

Input and Output

The Map/Reduce framework operates exclusively on <key, value> pairs: the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
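As an illustration (not part of the original tutorial code), a hypothetical custom key type that satisfies both requirements might look like the following sketch:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: a year carried as a serializable, sortable key.
public class YearKey implements WritableComparable<YearKey> {
  private int year;

  public YearKey() {}                        // no-arg constructor required for deserialization
  public YearKey(int year) { this.year = year; }

  public void write(DataOutput out) throws IOException {
    out.writeInt(year);                      // serialization used by the framework
  }

  public void readFields(DataInput in) throws IOException {
    year = in.readInt();                     // deserialization
  }

  public int compareTo(YearKey other) {      // ordering used during the sort phase
    return (year < other.year) ? -1 : ((year == other.year) ? 0 : 1);
  }

  @Override
  public int hashCode() { return year; }     // keeps hash-based partitioning consistent

  @Override
  public boolean equals(Object o) {
    return (o instanceof YearKey) && ((YearKey) o).year == year;
  }
}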

The input and output types for a map/reduce job are as follows:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Example: WordCount v1.0

Before delving into the details, let's look at an example of a map/reduce application to have a rudimentary understanding of how they work.

WordCount is a simple application that counts the number of occurrences of each word in a given input dataset.

This application works with a local standalone, pseudo-distributed, or fully-distributed Hadoop installation.

Source code: WordCount.java

1.  package org.myorg;
2.
3.  import java.io.IOException;
4.  import java.util.*;
5.
6.  import org.apache.hadoop.fs.Path;
7.  import org.apache.hadoop.conf.*;
8.  import org.apache.hadoop.io.*;
9.  import org.apache.hadoop.mapred.*;
10. import org.apache.hadoop.util.*;
11.
12. public class WordCount {
13.
14.   public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
15.     private final static IntWritable one = new IntWritable(1);
16.     private Text word = new Text();
17.
18.     public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
19.       String line = value.toString();
20.       StringTokenizer tokenizer = new StringTokenizer(line);
21.       while (tokenizer.hasMoreTokens()) {
22.         word.set(tokenizer.nextToken());
23.         output.collect(word, one);
24.       }
25.     }
26.   }
27.
28.   public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
29.     public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
30.       int sum = 0;
31.       while (values.hasNext()) {
32.         sum += values.next().get();
33.       }
34.       output.collect(key, new IntWritable(sum));
35.     }
36.   }
37.
38.   public static void main(String[] args) throws Exception {
39.     JobConf conf = new JobConf(WordCount.class);
40.     conf.setJobName("WordCount");
41.
42.     conf.setOutputKeyClass(Text.class);
43.     conf.setOutputValueClass(IntWritable.class);
44.
45.     conf.setMapperClass(Map.class);
46.     conf.setCombinerClass(Reduce.class);
47.     conf.setReducerClass(Reduce.class);
48.
49.     conf.setInputFormat(TextInputFormat.class);
50.     conf.setOutputFormat(TextOutputFormat.class);
51.
52.     FileInputFormat.setInputPaths(conf, new Path(args[0]));
53.     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
54.
55.     JobClient.runJob(conf);
56.
57.   }
58. }

Usage

Assuming the environment variable HADOOP_HOME points to the root of the installation and HADOOP_VERSION is the installed Hadoop version, compile WordCount.java and create a jar as follows:

$ mkdir wordcount_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .

Assuming that:

/usr/joe/wordcount/input  - the input directory in HDFS
/usr/joe/wordcount/output - the output directory in HDFS

Use the following sample text files as input:

$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop

To run the application:

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

The output is:

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2

Applications can use the -files option to specify a comma-separated list of paths that will be made available in the current working directory of each task. The -libjars option adds jars to the classpaths of the map and reduce tasks. The -archives option passes archives as arguments; these are unarchived, and a link named after the archive is created in the current working directory of the task. For more details about command-line options, see the Commands Manual.

Running the WordCount example with the -libjars and -files options:
hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars mylib.jar input output

Walk-through

The WordCount application is straightforward.

The map method (lines 18-25) in the Mapper implementation (lines 14-26) processes one line at a time, as provided by the specified TextInputFormat (line 49). It then splits the line into tokens separated by whitespace via the StringTokenizer and emits a key-value pair of the form <<word>, 1>.

For the first input in the example, the map output is:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>

For the second input, the map output is:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

We will learn more about the number of maps spawned for a given job, and how to control them in a fine-grained manner, later in the tutorial.

WordCount also specifies a combiner (line 46). Hence, the output of each map is passed through the local combiner (which, per the job configuration, is the same as the Reducer) for local aggregation after being sorted by key.

The output of the first map is:
< Bye, 1>
< Hello, 1>
< World, 2>

The output of the second map is:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>

The reduce method (lines 29-35) in the Reducer implementation (lines 28-36) just sums up the values, which are the occurrence counts for each key (i.e. the words in this example).

So the output of this job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>

Several facets of the job are specified in the main method, such as the input/output paths passed on the command line, key/value types, input/output formats, and other JobConf configuration information. The program then calls JobClient.runJob (line 55) to submit the job and monitor its progress.

We will learn more about JobConf, JobClient, Tool, and other interfaces and classes in the remainder of this tutorial.

Map/Reduce - User Interfaces

This section of the document provides a reasonable amount of detail on every user-facing aspect of the Map/Reduce framework. This should help users implement, configure, and tune their jobs in a fine-grained manner. However, please note that the Javadoc for each class/interface remains the most comprehensive documentation available.

We'll look at the Mapper and Reducer interfaces first. Applications typically implement them by providing the map and reduce methods.

We then discuss other core interfaces, including JobConf, JobClient, Partitioner, OutputCollector, Reporter, InputFormat, OutputFormat, and so on.

Finally, we wrap up by discussing some useful features of the framework (e.g., DistributedCache, IsolationRunner, and so on).

Core Function Description

Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods, which form the core of the job.

Mapper

Mapper maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks that transform an input record set into an intermediate record set. The transformed intermediate records need not be of the same type as the input records. A given input key/value pair may map to zero or more output key/value pairs.

The Hadoop Map/Reduce framework spawns one map task for each InputSplit, and each InputSplit is generated by the InputFormat of the job.

In a nutshell, a Mapper implementation can override the JobConfigurable.configure(JobConf) method, which is passed the JobConf for the job, to initialize itself. The framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for each key/value pair in the InputSplit for that task. The application can perform cleanup work by overriding the Closeable.close() method.
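A minimal sketch of this lifecycle, using hypothetical names (the TokenMapper class and the tokenmapper.lowercase parameter are illustrative assumptions, not part of the original example):

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Hypothetical Mapper illustrating the configure/map/close lifecycle.
public class TokenMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private boolean toLowerCase;                 // initialized from the JobConf
  private final IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void configure(JobConf job) {
    // Called once per task with the job configuration.
    toLowerCase = job.getBoolean("tokenmapper.lowercase", false);  // hypothetical parameter
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = toLowerCase ? value.toString().toLowerCase() : value.toString();
    for (String token : line.split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      output.collect(word, one);               // collect an output pair
    }
  }

  @Override
  public void close() throws IOException {
    // Called once per task after the last map() call; release resources here.
  }
}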

Output key/value pairs do not need to be of the same types as input key/value pairs. A given input pair may map to zero or more output pairs. Output pairs are collected by calling OutputCollector.collect(WritableComparable, Writable).

Applications can use the Reporter to report progress, set application-level status messages, update Counters, or simply indicate that they are alive.

All intermediate values associated with a given key are subsequently grouped by the framework and passed to the Reducer(s) to produce the final output. Users can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).

The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job. Users can control which keys go to which Reducer by implementing a custom Partitioner.
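For instance, a hypothetical Partitioner (a sketch for illustration, not from the original tutorial) could route all keys that start with the same letter to the same reduce task:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical Partitioner: buckets keys by their first character.
public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) { }       // no configuration needed

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    char first = Character.toLowerCase(key.toString().charAt(0));
    // Non-negative value mapped onto the available partitions.
    return (first & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be enabled in the job driver with conf.setPartitionerClass(FirstCharPartitioner.class).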

Users can optionally specify a combiner via JobConf.setCombinerClass(Class) to perform local aggregation of the intermediate outputs, which helps reduce the amount of data transferred from the Mapper to the Reducer.

The sorted intermediate outputs are stored in a simple (key-len, key, value-len, value) format. Applications can control, via the JobConf, whether the intermediate outputs are compressed, and which CompressionCodec to use.
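A small sketch of what that configuration might look like with the org.apache.hadoop.mapred API used in this tutorial (the helper class name is hypothetical; GzipCodec is used here purely as an example codec):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class CompressionConfig {
  // Sketch: enable compression of the intermediate map outputs on a JobConf.
  public static void enableMapOutputCompression(JobConf conf) {
    conf.setCompressMapOutput(true);                    // compress intermediate outputs
    conf.setMapOutputCompressorClass(GzipCodec.class);  // which CompressionCodec to use
  }
}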

How Many Maps?

The number of maps is usually driven by the total size of the input data, that is, the total number of blocks of the input files.

The right level of parallelism for maps is around 10 to 100 maps per node, although it can be set up to about 300 for very CPU-light map tasks. Since task setup takes a certain amount of time, it is best if each map takes at least a minute to execute.

Thus, if you have 10TB of input data and a block size of 128MB, you will end up with about 82,000 maps, unless setNumMapTasks(int) (which provides only a hint to the framework) is used to set it even higher.
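Spelling out that estimate: 10 TB is 10 x 1024 x 1024 MB = 10,485,760 MB, and 10,485,760 MB / 128 MB per block = 81,920 blocks, hence roughly 82,000 maps.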

Reducer

Reducer reduces the set of intermediate values which share a key to a smaller set of values.

The user can set the number of reduce tasks for a job via JobConf.setNumReduceTasks(int).

In a nutshell, a Reducer implementation can override the JobConfigurable.configure(JobConf) method, which is passed the JobConf for the job, to initialize itself. The framework then calls reduce(WritableComparable, Iterator, OutputCollector, Reporter) for each <key, (list of values)> pair in the grouped input data. The application can perform cleanup work by overriding Closeable.close().

Reducer has three main phases: shuffle, sort, and reduce.

Shuffle

The input of the Reducer is the sorted output of the Mappers. In this phase, the framework fetches the relevant partition of the output of every Mapper for each Reducer via HTTP.

Sort

In this phase, the framework groups the Reducer inputs by key (since the same key may appear in the output of different Mappers).

The shuffle and sort phases occur simultaneously: map outputs are merged while they are being fetched.

Secondary Sort

If the rules for grouping the intermediate keys before the reduce are required to be different from those used for sorting, one can specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) controls how the intermediate keys are sorted, the two can be used in conjunction to simulate a secondary sort on values.
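As a hedged sketch of how the two calls combine (the "primary#secondary" key layout and the class names below are illustrative assumptions, not from the original text):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;

// Sketch of a secondary sort. It assumes (for illustration only) that the map output key
// is a Text of the form "primary#secondary": the sort orders on the full string, while the
// grouping comparator compares only the part before '#', so all secondaries of one primary
// arrive, already ordered, in a single reduce() call.
public class SecondarySortConfig {

  public static class PrimaryGroupingComparator extends WritableComparator {
    public PrimaryGroupingComparator() { super(Text.class, true); }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
      String left = a.toString().split("#", 2)[0];
      String right = b.toString().split("#", 2)[0];
      return left.compareTo(right);
    }
  }

  public static void configure(JobConf conf) {
    // Sort on the full "primary#secondary" key (the default Text ordering already does this);
    // group reduce() inputs by the primary part only.
    conf.setOutputKeyComparatorClass(Text.Comparator.class);
    conf.setOutputValueGroupingComparator(PrimaryGroupingComparator.class);
  }
}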

Reduce

In this phase, the framework calls the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method once for each <key, (list of values)> pair in the grouped input data.

The output of the reduce task is typically written to the file system by calling OutputCollector.collect(WritableComparable, Writable).

Applications can use the Reporter to report progress, set application-level status messages, update Counters, or simply indicate that they are alive.

Reducer output is not sorted.

How Many Reduces?

The recommended number of reduces is 0.95 or 1.75 times (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).

With 0.95, all of the reduces can launch immediately and start transferring map outputs as soon as the maps finish. With 1.75, the faster nodes finish their first round of reduce tasks and then launch a second wave, which achieves much better load balancing.
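As a worked example, on a hypothetical 10-node cluster with mapred.tasktracker.reduce.tasks.maximum set to 2, the cluster offers 10 x 2 = 20 reduce slots, so the guideline suggests either 0.95 x 20 = 19 reduces (a single wave) or 1.75 x 20 = 35 reduces (two waves).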

Increasing the number of reduces increases the overhead of the framework, but it improves load balancing and lowers the cost of failures.

The scaling factors above are slightly less than whole numbers in order to reserve a few reduce slots in the framework for speculative tasks and failed tasks.

Without a Reducer

It is legal to set the number of reduce tasks to zero if no reduction is desired.

In this case, the output of the map tasks is written directly to the output path specified by setOutputPath(Path). The framework does not sort the map outputs before writing them to the FileSystem.
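A minimal driver sketch for such a map-only job, reusing the Map class from the WordCount example above (the MapOnlyJob class itself is a hypothetical name):

package org.myorg;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Sketch of a map-only job: zero reduces, so map outputs go straight to the FileSystem.
public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setJobName("map-only");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(WordCount.Map.class);   // Mapper from the WordCount example
    conf.setNumReduceTasks(0);                  // zero reduces: no shuffle, no sort

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // unsorted map outputs land here

    JobClient.runJob(conf);
  }
}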

Partitioner

Partitioner partitions the key space.

Partitioner controls the partitioning of the keys of the intermediate map outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence it controls which of the m reduce tasks an intermediate key (and thus the record) is sent to for reduction.

HashPartitioner is the default Partitioner.

Reporter

Reporter is a mechanism for Map/Reduce applications to report progress, set application-level status messages, and update Counters.

Mapper and Reducer implementations can use the Reporter to report progress or simply indicate that they are alive. This is critical in scenarios where an application takes a significant amount of time to process an individual key/value pair, because otherwise the framework might assume the task has timed out and kill it. Another way to avoid this is to set the configuration parameter mapred.task.timeout to a sufficiently high value (or even to zero, which disables the timeout altogether).

Applications can also update Counters using the Reporter.
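The following sketch shows the idea inside a deliberately slow, hypothetical Mapper (the class, the counter enum, and the expensive per-record work are all illustrative assumptions):

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Hypothetical Mapper whose per-record work is slow; it keeps the framework informed via Reporter.
public class SlowRecordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private enum MapCounters { RECORDS_PROCESSED }   // illustrative custom counter

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    reporter.setStatus("processing offset " + key.get());   // application-level status message

    long result = expensiveAnalysis(value, reporter);        // may take a long time per record
    output.collect(value, new LongWritable(result));

    reporter.incrCounter(MapCounters.RECORDS_PROCESSED, 1);  // update a custom counter
  }

  private long expensiveAnalysis(Text value, Reporter reporter) {
    long acc = 0;
    for (int i = 0; i < value.getLength(); i++) {
      acc += value.getBytes()[i] & 0xff;
      if (i % 1000 == 0) {
        reporter.progress();   // tell the framework the task is still alive
      }
    }
    return acc;
  }
}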

OutputCollector

OutputCollector is a generalized facility provided by the Map/Reduce framework for collecting data output by the Mapper or the Reducer (either intermediate outputs or the output of the job).

The Hadoop Map/Reduce framework comes with a library of generally useful Mappers, Reducers, and Partitioners.

Job Configuration

JobConf represents the configuration of a Map/Reduce job.

JobConf is the primary interface by which a user describes to the Hadoop framework how a Map/Reduce job should execute. The framework tries to faithfully execute the job as described by the JobConf, however:

Some configuration parameters may have been marked as final by administrators, which means they cannot be altered. While some job parameters are straightforward to set (for example, setNumReduceTasks(int)), others interact subtly with the framework or with other job parameters and are more complex to set (for example, setNumMapTasks(int)).

Typically, JobConf is used to specify the concrete implementations of the Mapper, combiner (if any), Partitioner, Reducer, InputFormat, and OutputFormat. JobConf also indicates the set of input files (setInputPaths(JobConf, Path...)/addInputPath(JobConf, Path) and setInputPaths(JobConf, String)/addInputPaths(JobConf, String)) and where the output files should be written (setOutputPath(Path)).

Optionally, JobConf can be used to specify other advanced facets of the job, such as the Comparator to be used, files to be placed in the DistributedCache, whether intermediate results and/or job outputs are to be compressed (and how), debugging via user-provided scripts (setMapDebugScript(String)/setReduceDebugScript(String)), whether speculative execution of tasks is allowed (setMapSpeculativeExecution(boolean)/setReduceSpeculativeExecution(boolean)), the maximum number of attempts per task (setMaxMapAttempts(int)/setMaxReduceAttempts(int)), and the percentage of task failures the job can tolerate (setMaxMapTaskFailuresPercent(int)/setMaxReduceTaskFailuresPercent(int)).

Of course, users can use set(String, String)/get(String, String) to set or get any parameters their applications require. However, use the DistributedCache for large amounts of read-only data.
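A sketch of that pattern (the file path, link name, and class names are illustrative assumptions): the driver registers a read-only file with the DistributedCache, and the Mapper looks up its localized copy in configure():

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class CacheExample {

  // In the job driver: register a read-only lookup file that every task will need.
  public static void addLookupFile(JobConf conf) throws IOException {
    DistributedCache.addCacheFile(URI.create("/data/lookup.txt#lookup"), conf);  // illustrative path
  }

  // In a task: locate the localized copy during configure().
  public static class CacheAwareMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private Path lookupFile;

    @Override
    public void configure(JobConf job) {
      try {
        Path[] cached = DistributedCache.getLocalCacheFiles(job);
        if (cached != null && cached.length > 0) {
          lookupFile = cached[0];   // local path of the cached lookup file
        }
      } catch (IOException e) {
        throw new RuntimeException("Could not read the distributed cache", e);
      }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // ... use lookupFile to enrich the record, then emit; emitting the path here only as a placeholder.
      output.collect(value, new Text(lookupFile == null ? "" : lookupFile.toString()));
    }
  }
}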

Task Execution and Environment

The TaskTracker executes Mapper/Reducer tasks as child processes in separate JVMs.

The child tasks inherit the environment of the parent TaskTracker. The user can specify additional options for the child JVM via the mapred.child.java.opts configuration parameter in the JobConf, for example setting non-standard paths for the runtime linker to search for shared libraries via -Djava.library.path=<>, and so on. If mapred.child.java.opts contains the symbol @taskid@, it is replaced with the taskid value of the Map/Reduce task.

Here is an example with multiple arguments and substitutions. It enables JVM GC logging, starts the JVM JMX agent in password-free mode so that it can connect with jconsole (to watch the child process's memory and threads and obtain thread dumps), sets the maximum heap size of the child JVM to 512MB, and adds an additional path to the java.library.path of the child JVM.

<property>
<name>mapred.child.java.opts</name>
<value>
-Xmx512M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@taskid@.gc
-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
</value>
</property>

Users or administrators can also use mapred.child.ulimit to set the maximum virtual memory of the running child tasks. The value of mapred.child.ulimit is specified in kilobytes (KB) and must be greater than or equal to the -Xmx value passed to the JVM, otherwise the VM will fail to start.

Note: mapred.child.java.opts is used only to configure the child tasks launched by the TaskTracker. Configuring memory options for the daemons is documented in cluster_setup.html.

${mapred.local.dir}/taskTracker/ is the TaskTracker's local directory, used to create localized caches and localized jobs. Multiple directories (spanning multiple disks) can be specified, in which case files are stored in a semi-random directory under one of the local paths. When the job starts, the TaskTracker creates a localized job directory based on the configuration, with the following directory structure:

${mapred.local.dir}/taskTracker/archive/ : the distributed cache. This directory holds the localized distributed cache; the localized cache is therefore shared among all tasks and jobs.
${mapred.local.dir}/taskTracker/jobcache/$jobid/ : the localized job directory.
${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ : the job-specific shared directory. The tasks can use this space as scratch space and share files among themselves. This directory is exposed to users through the job.local.dir parameter. The path can be accessed via the JobConf.getJobLocalDir() API. It is also available as a system property, so a user (for example, a streaming job) can call System.getProperty("job.local.dir") to obtain the directory.
${mapred.local.dir}/taskTracker/jobcache/$jobid/jars/ : the jars directory, which holds the job's jar file and the expanded jar. job.jar is the application's jar file, which is automatically distributed to each machine and expanded before the tasks start. The location of job.jar can be obtained with the JobConf.getJar() API, and JobConf.getJar().getParent() gives the directory where the expanded jar resides.
${mapred.local.dir}/taskTracker/jobcache/$jobid/job.xml : the job.xml file, the generic job configuration localized for this job.
${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid : one directory per task, with the following structure:
${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/job.xml : a job.xml file, the task-localized job configuration. Task localization means that specific property values are set for this particular task.
${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/output : a directory for intermediate output files. It holds the temporary map/reduce data generated by the framework, such as map output files.
${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/work : the task's current working directory.
${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/work/tmp : the task's temporary directory. (Users can set the property mapred.child.tmp to choose the temporary directory for map and reduce tasks. The default value is ./tmp. If this value is not an absolute path, it is prepended with the task's working directory; if it is an absolute path, it is used directly. The directory is created if it does not exist. The child Java tasks are then executed with the option -Djava.io.tmpdir='the absolute path of the tmp dir'.)
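A small sketch of how a running task might resolve these locations at runtime, using only the properties and APIs listed above (the helper class name is hypothetical):

import java.io.File;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical helper showing how a running task can locate its scratch directories.
public class TaskDirs {

  public static File jobSharedDir(JobConf conf) {
    // The job-specific shared directory, exposed via job.local.dir.
    return new File(conf.getJobLocalDir());
  }

  public static File jobSharedDirFromSystemProperty() {
    // The same directory is also published as a system property (useful e.g. for streaming).
    return new File(System.getProperty("job.local.dir"));
  }

  public static File tmpDir() {
    // The task's temporary directory becomes java.io.tmpdir of the child JVM.
    return new File(System.getProperty("java.io.tmpdir"));
  }
}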