Objective
This tutorial provides a comprehensive overview of various aspects of the Hadoop map-reduce framework from a user perspective.
Prerequisites
Make sure that Hadoop is installed, configured, and running correctly. For more information, see:
Hadoop QuickStart for first-time users. Hadoop Cluster Setup for large, distributed clusters.
Overview
Hadoop MapReduce is a software framework for easily writing applications that process terabyte-scale datasets in parallel, in a reliable and fault-tolerant manner, on large clusters of thousands of commodity machines.
A MapReduce job usually splits the input dataset into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing any tasks that fail.
Typically the MapReduce framework and the distributed file system run on the same set of nodes; that is, the compute nodes and the storage nodes are usually the same. This configuration allows the framework to schedule tasks on the nodes where the data already resides, resulting in very efficient use of the aggregate network bandwidth of the cluster.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling all the tasks that make up a job onto the different slaves, monitoring their execution, and re-executing failed tasks. The slaves simply execute the tasks assigned by the master.
At a minimum, applications specify the input/output locations (paths) and supply the map and reduce functions by implementing the appropriate interfaces or abstract classes. These, together with other job parameters, make up the job configuration (JobConf). Hadoop's job client then submits the job (jar/executable, etc.) and configuration to the JobTracker, which distributes the software and configuration to the slaves, schedules the tasks, monitors them, and provides status and diagnostic information to the job client.
Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.
Hadoop Streaming is a utility which allows users to create and run jobs with any executables (for example, shell utilities) as the mapper and/or the reducer. Hadoop Pipes is a SWIG-compatible C++ API (not based on JNI™) that can also be used to implement MapReduce applications.
Input and Output
The MapReduce framework operates exclusively on <key, value> pairs: the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The framework needs to serialize the key and value classes, so these classes must implement the Writable interface. Additionally, to facilitate sorting by the framework, the key classes must implement the WritableComparable interface.
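To make this concrete, here is a minimal sketch (not part of the tutorial's WordCount example) of what a custom key class might look like; YearMonthKey is a hypothetical type that implements WritableComparable so that the framework can serialize it and sort on it.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearMonthKey implements WritableComparable<YearMonthKey> {
  private int year;
  private int month;

  public YearMonthKey() {}                       // no-arg constructor required by the framework

  public YearMonthKey(int year, int month) {
    this.year = year;
    this.month = month;
  }

  public void write(DataOutput out) throws IOException {    // serialization used by the framework
    out.writeInt(year);
    out.writeInt(month);
  }

  public void readFields(DataInput in) throws IOException { // deserialization
    year = in.readInt();
    month = in.readInt();
  }

  public int compareTo(YearMonthKey other) {                // ordering used when sorting map output
    if (year != other.year) {
      return year < other.year ? -1 : 1;
    }
    if (month != other.month) {
      return month < other.month ? -1 : 1;
    }
    return 0;
  }

  @Override
  public int hashCode() { return year * 31 + month; }       // used by the default HashPartitioner

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof YearMonthKey)) return false;
    YearMonthKey k = (YearMonthKey) o;
    return year == k.year && month == k.month;
  }

  @Override
  public String toString() { return year + "-" + month; }
}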
The input and output types for a map-reduce job are as follows:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
Example: WordCount v1.0
Before delving into the details, let's look at an example of a map-reduce application to have a rudimentary understanding of how they work.
WordCount is a simple application that calculates the number of occurrences of each word in the specified dataset.
This application works with a standalone, pseudo-distributed, or fully distributed Hadoop installation.
Source Code: WordCount.java
1.  package org.myorg;
2.  
3.  import java.io.IOException;
4.  import java.util.*;
5.  
6.  import org.apache.hadoop.fs.Path;
7.  import org.apache.hadoop.conf.*;
8.  import org.apache.hadoop.io.*;
9.  import org.apache.hadoop.mapred.*;
10. import org.apache.hadoop.util.*;
11. 
12. public class WordCount {
13. 
14.   public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
15.     private final static IntWritable one = new IntWritable(1);
16.     private Text word = new Text();
17. 
18.     public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
19.       String line = value.toString();
20.       StringTokenizer tokenizer = new StringTokenizer(line);
21.       while (tokenizer.hasMoreTokens()) {
22.         word.set(tokenizer.nextToken());
23.         output.collect(word, one);
24.       }
25.     }
26.   }
27. 
28.   public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
29.     public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
30.       int sum = 0;
31.       while (values.hasNext()) {
32.         sum += values.next().get();
33.       }
34.       output.collect(key, new IntWritable(sum));
35.     }
36.   }
37. 
38.   public static void main(String[] args) throws Exception {
39.     JobConf conf = new JobConf(WordCount.class);
40.     conf.setJobName("WordCount");
41. 
42.     conf.setOutputKeyClass(Text.class);
43.     conf.setOutputValueClass(IntWritable.class);
44. 
45.     conf.setMapperClass(Map.class);
46.     conf.setCombinerClass(Reduce.class);
47.     conf.setReducerClass(Reduce.class);
48. 
49.     conf.setInputFormat(TextInputFormat.class);
50.     conf.setOutputFormat(TextOutputFormat.class);
51. 
52.     FileInputFormat.setInputPaths(conf, new Path(args[0]));
53.     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
54. 
55.     JobClient.runJob(conf);
56. 
57.   }
58. }
59. 
Usage
Assuming that the environment variable HADOOP_HOME points to the root of the installation and HADOOP_VERSION is the installed Hadoop version, compile WordCount.java and create a jar as follows:
$ mkdir wordcount_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .
Assuming that:
/usr/joe/wordcount/input - the input directory in HDFS
/usr/joe/wordcount/output - the output directory in HDFS
Use sample text files as input:
$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
To run the application:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
The output is:
$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
Walk-through
The WordCount application is straightforward.
The map method (lines 18-25) in the Mapper implementation (lines 14-26) processes one line at a time, as provided by the specified TextInputFormat (line 49). It then splits the line into tokens separated by whitespace via StringTokenizer and emits key-value pairs of the form <<word>, 1>.
For the first input in the example, the map output is:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
For the second input, the map output is:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
We will learn more later in the tutorial about how the number of maps for a given job is determined, and how to control that number at a finer granularity.
WordCount also specifies a combiner (line 46). Hence, the output of each map is passed through the local combiner (which is the same as the Reducer as per the job configuration, Reduce.class) for local aggregation, after being sorted on keys.
The output of the first map is:
< Bye, 1>
< Hello, 1>
< World, 2>
The output of the second map is:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
The reduce method (lines 29-35) in the Reducer implementation (lines 28-36) simply sums up the values, which are the occurrence counts for each key (i.e. the words in this example).
So the output of this job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
The main method specifies various facets of the job in the JobConf, such as the input/output paths (passed via the command line), key/value types, input/output formats, and so on. It then calls JobClient.runJob (line 55) to submit the job and monitor its progress.
We will learn more about JobConf, JobClient, Tool, and other interfaces and classes in the rest of the tutorial.
MapReduce - User Interfaces
This section of the document provides a reasonable amount of detail on every user-facing aspect of the MapReduce framework. This should help users implement, configure, and tune their jobs in a fine-grained manner. However, please note that the Javadoc for each class/interface remains the most comprehensive documentation available.
We'll first look at the Mapper and Reducer interfaces. Applications typically implement them to provide the map and reduce methods.
We then discuss other core interfaces, including JobConf, JobClient, Partitioner, OutputCollector, Reporter, InputFormat, OutputFormat, and others.
Finally, we conclude with a discussion of some useful features of the framework, such as DistributedCache and IsolationRunner.
Core Function Description
Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods. These form the core of the job.
Mapper
Mapper maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
The Hadoop MapReduce framework spawns one map task for each InputSplit, and the InputSplits are generated by the InputFormat for the job.
Overall, a Mapper implementation is passed the JobConf for the job via the JobConfigurable.configure(JobConf) method, which it can override to initialize itself. The framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for each key/value pair in the InputSplit for that task. Applications can then override the Closeable.close() method to perform any required cleanup.
Output pairs need not be of the same types as input pairs. A given input pair may map to zero or many output pairs. Output pairs are collected by calling OutputCollector.collect(WritableComparable, Writable).
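The following is a minimal sketch of such a Mapper, using the same classic org.apache.hadoop.mapred API as the WordCount example above; the class name LineLengthMapper and the job parameter min.line.length are illustrative, not part of Hadoop.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LineLengthMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private int minLength;

  @Override
  public void configure(JobConf job) {               // called once per task for initialization
    minLength = job.getInt("min.line.length", 0);    // hypothetical application parameter
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int length = value.toString().length();
    if (length >= minLength) {                       // zero or more output pairs per input pair
      output.collect(value, new IntWritable(length));
    }
  }

  @Override
  public void close() throws IOException {           // cleanup hook after all records are processed
    // release any per-task resources here
  }
}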
Applications can use reporter to report progress, set application-level status messages, update counters (counters), or simply indicate that they are functioning properly.
All intermediate values associated with a given output key are subsequently grouped by the framework and passed to the Reducer(s) to determine the final output. Users can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).
The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
Users can optionally specify a combiner via JobConf.setCombinerClass(Class) to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
The sorted intermediate outputs are stored in files of SequenceFile format. Applications can control, via the JobConf, whether (and how) these intermediate outputs are compressed and which CompressionCodec is used.
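As a rough sketch of what that configuration might look like (GzipCodec is just one of the available CompressionCodec implementations, and the helper class here is illustrative):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class CompressionConfig {
  public static void enableCompression(JobConf conf) {
    // compress the intermediate (map) outputs
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);

    // optionally compress the final job output as well
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
  }
}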
How Many Maps?
The number of maps is usually driven by the total size of the input data, that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10 to 100 maps per node, although it has been set as high as 300 for very CPU-light map tasks. Since each task takes some time to initialize, it is best if the maps take at least a minute to execute.
Thus, if you have 10TB of input data and a block size of 128MB, you will end up with about 82,000 maps, unless setNumMapTasks(int) (note: this only provides a hint to the framework; the actual number is determined by other factors) is used to set it even higher.
Reducer
Reducer reduces the set of intermediate values which share a key to a smaller set of values.
The number of reduce tasks for a job is set by the user via JobConf.setNumReduceTasks(int).
Overall, a Reducer implementation is passed the JobConf for the job via the JobConfigurable.configure(JobConf) method, which it can override to initialize itself. The framework then calls reduce(WritableComparable, Iterator, OutputCollector, Reporter) for each <key, (list of values)> pair in the grouped inputs. Applications can then override Closeable.close() to perform any required cleanup.
Reducer has 3 main stages: Shuffle, sort, and reduce.
Shuffle
The input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of every mapper, via HTTP.
Sort
In this phase the framework groups the Reducer inputs by key (since different mappers may have output the same key).
The shuffle and sort phases occur simultaneously: while map outputs are being fetched, they are merged.
Secondary Sort
If equivalence rules for grouping the intermediate keys are required to be different from those used to group keys before reduction, one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) controls how the intermediate keys are sorted and grouped, the two can be used in combination to simulate a secondary sort on values.
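A sketch of how these two comparators might be wired together is shown below; the composite Text key of the form "primary<TAB>secondary" and both comparator classes are illustrative choices, not something the framework prescribes.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;

public class SecondarySortConfig {

  // Orders composite Text keys of the form "primary\tsecondary" by both fields.
  public static class CompositeKeyComparator extends WritableComparator {
    public CompositeKeyComparator() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return a.toString().compareTo(b.toString());   // full composite ordering
    }
  }

  // Groups keys by the primary field only, so all secondary values for one primary
  // key arrive in a single reduce() call, sorted by the comparator above.
  public static class PrimaryKeyGroupingComparator extends WritableComparator {
    public PrimaryKeyGroupingComparator() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      String pa = a.toString().split("\t", 2)[0];
      String pb = b.toString().split("\t", 2)[0];
      return pa.compareTo(pb);
    }
  }

  public static void configure(JobConf conf) {
    conf.setOutputKeyComparatorClass(CompositeKeyComparator.class);            // sort order
    conf.setOutputValueGroupingComparator(PrimaryKeyGroupingComparator.class); // grouping for reduce()
  }
}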
Reduce
In this phase the framework calls the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method once for each <key, (list of values)> pair in the grouped inputs.
The output of the reduce task is typically written to the file system by calling OutputCollector.collect(WritableComparable, Writable).
Applications can use reporter to report progress, set application-level status messages, update counters (counters), or simply indicate that they are functioning properly.
Reducer output is not sorted.
How Many Reduces?
The recommended number of reduces is 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes finish their first round of reduces and then launch a second wave, which achieves much better load balancing.
Increasing the number of reduces increases the framework overhead, but it improves load balancing and lowers the cost of failures.
The scaling factors above are slightly less than whole numbers in order to reserve a few reduce slots in the framework for speculative tasks and failed tasks.
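As a small illustration of the rule of thumb above (the node count is an assumed value; the per-node slot count is read from the configuration with a fallback default):

import org.apache.hadoop.mapred.JobConf;

public class ReduceCountExample {
  public static void setReduceCount(JobConf conf) {
    int nodes = 20;                                                    // assumed cluster size
    int reduceSlotsPerNode = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
    conf.setNumReduceTasks((int) (0.95 * nodes * reduceSlotsPerNode)); // e.g. 0.95 * 20 * 2 = 38
  }
}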
Without a Reducer
It is legal to set the number of reduce tasks to zero if no reduction is desired.
In this case, the outputs of the map tasks go directly to the output path specified by setOutputPath(Path). The framework does not sort the map outputs before writing them to the FileSystem.
Partitioner
Partitioner partitions the key space.
Partitioner controls the partitioning of the keys of the intermediate map outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks an intermediate key (and hence the record) is sent to for reduction.
HashPartitioner is the default Partitioner.
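Here is a minimal sketch of a custom Partitioner using the classic API; routing records by the first character of the key is purely illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // no per-job setup needed for this example
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    // & Integer.MAX_VALUE keeps the result non-negative, as HashPartitioner does
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}

// Registered on the job with: conf.setPartitionerClass(FirstCharPartitioner.class);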
Reporter
Reporter is a facility for MapReduce applications to report progress, set application-level status messages, and update Counters.
Mapper and Reducer implementations can use the Reporter to report progress or simply indicate that they are alive. This mechanism is crucial in scenarios where an application spends a significant amount of time processing individual key/value pairs, since otherwise the framework might assume the task has timed out and kill it. Another way to avoid this is to set the configuration parameter mapred.task.timeout to a high enough value (or even to zero, for no timeout).
Applications can also update Counters using the Reporter.
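A sketch of how a map implementation might use the Reporter inside a potentially long-running loop follows; the counter enum and the tokenizing logic are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SlowRecordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Application-defined counters, visible in the job's counter output.
  enum MyCounters { GOOD_RECORDS, BAD_RECORDS }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    reporter.setStatus("processing offset " + key.get());   // application-level status message
    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) {
        reporter.incrCounter(MyCounters.BAD_RECORDS, 1);     // update a counter
        continue;
      }
      output.collect(new Text(token), new IntWritable(1));
      reporter.incrCounter(MyCounters.GOOD_RECORDS, 1);
      reporter.progress();                                   // tell the framework we are still alive
    }
  }
}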
Outputcollector
OutputCollector is a generalization of the facility provided by the MapReduce framework to collect the data output by the Mapper or the Reducer (either the intermediate outputs or the output of the job).
The Hadoop MapReduce framework comes bundled with a library of generally useful Mappers, Reducers, and Partitioners.
Job Configuration
JobConf represents the configuration of a MapReduce job.
JobConf is the primary interface by which a user describes a MapReduce job to the Hadoop framework for execution. The framework tries to faithfully execute the job as described by the JobConf, however:
Some configuration parameters may have been marked as final by administrators, which means they cannot be altered. While some job parameters are straightforward to set (for example, setNumReduceTasks(int)), others interact subtly with the rest of the framework and/or other job parameters and are more complex to set (for example, setNumMapTasks(int)).
JobConf is typically used to specify the Mapper, combiner (if any), Partitioner, Reducer, InputFormat, and OutputFormat implementations. JobConf also indicates the set of input files (setInputPaths(JobConf, Path...)/addInputPath(JobConf, Path)) and (setInputPaths(JobConf, String)/addInputPaths(JobConf, String)), and where the output files should be written (setOutputPath(Path)).
Optionally, JobConf is used to specify other advanced facets of the job, such as the Comparator to be used, files to be placed in the DistributedCache, whether intermediate results and/or job outputs are to be compressed (and how), debugging via user-provided scripts (setMapDebugScript(String)/setReduceDebugScript(String)), whether speculative execution of tasks is allowed (setMapSpeculativeExecution(boolean)/setReduceSpeculativeExecution(boolean)), the maximum number of attempts per task (setMaxMapAttempts(int)/setMaxReduceAttempts(int)), and the percentage of task failures the job can tolerate (setMaxMapTaskFailuresPercent(int)/setMaxReduceTaskFailuresPercent(int)).
Of course, users can also use set(String, String)/get(String, String) to set or get any arbitrary parameters their applications require. However, use the DistributedCache for large amounts of read-only data.
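For example, an application-specific parameter might be passed through the JobConf as sketched below; the parameter name wordcount.skip.punctuation and the helper classes are illustrative.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class ParamExample {

  // At submission time: store an arbitrary application parameter in the job configuration.
  public static void submitSide(JobConf conf) {
    conf.set("wordcount.skip.punctuation", "true");
  }

  // Inside a task: read the parameter back when the framework calls configure().
  public static class MyTask extends MapReduceBase {
    private boolean skipPunctuation;

    @Override
    public void configure(JobConf job) {
      skipPunctuation = job.getBoolean("wordcount.skip.punctuation", false);
    }
  }
}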
Task Execution and Environment
The TaskTracker executes Mapper/Reducer tasks as child processes in separate JVMs.
The child tasks inherit the environment of the parent TaskTracker. The user can specify additional options for the child JVM via the mapred.child.java.opts configuration parameter in the JobConf, for example using -Djava.library.path=<> to set a non-standard path to search for shared libraries at runtime, and so on. If mapred.child.java.opts contains the symbol @taskid@, it is replaced with the taskid value of the map/reduce task.
Here is an example with multiple arguments and substitutions: it enables JVM GC logging, starts a passwordless JVM JMX agent so that it can connect with jconsole to watch the child's memory and threads and get thread dumps, sets the child JVM's maximum heap size to 512MB, and adds an additional path to the child JVM's java.library.path.
<property>
<name>mapred.child.java.opts</name>
<value>
-Xmx512M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@taskid@.gc
-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
</value>
</property>
Users or administrators can also specify the maximum virtual memory of a launched child task via mapred.child.ulimit.
When a job starts running, the localized job directory ${mapred.local.dir}/taskTracker/jobcache/$jobid/ contains the following:
- A job-specific shared directory, ${mapred.local.dir}/taskTracker/jobcache/$jobid/work/. This directory is exposed to the user through the job.local.dir parameter. The tasks can use this space as scratch space and to share files among themselves. The path can be accessed via the API JobConf.getJobLocalDir(). It is also available as a system property, so users can call System.getProperty("job.local.dir").
- A jars directory, which holds the job's jar file and the expanded jar.
- A job.xml file, the generic job configuration.
- A per-task directory, task-id, with the following structure:
  - A job.xml file, the task-localized job configuration.
  - A directory for the intermediate output files.
  - The working directory of the task, with a tmp directory for creating temporary files.
The DistributedCache can also be used as a basic software distribution mechanism in map/reduce tasks; it can be used to distribute jars and native libraries. The DistributedCache.addArchiveToClassPath(Path, Configuration) and DistributedCache.addFileToClassPath(Path, Configuration) APIs can be used to cache files/jars and add them to the classpath of the child JVM. The DistributedCache creates symbolic links to the cached files in the working directory of the task, so this mechanism can also be used to distribute native libraries and load them. The detail behind this is that the child JVM that executes the task always adds its current working directory to java.library.path, so the cached libraries can be loaded via System.loadLibrary or System.load.
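A sketch of distributing a jar and a native library this way might look as follows; the HDFS paths are illustrative, and the files are assumed to already exist in the FileSystem.

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetupExample {
  public static void setupCache(JobConf conf) throws Exception {
    // add a jar to the classpath of the child JVMs
    DistributedCache.addFileToClassPath(new Path("/myapp/lib/helper.jar"), conf);

    // cache a native library; a symlink named "libfoo.so" is created in the task's working
    // directory, which is already on java.library.path, so System.loadLibrary("foo") can find it
    DistributedCache.createSymlink(conf);
    DistributedCache.addCacheFile(new URI("/myapp/native/libfoo.so#libfoo.so"), conf);
  }
}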
Job Submission and Monitoring
JobClient is the primary interface by which a user's job interacts with the JobTracker.
JobClient provides facilities to submit jobs, track their progress, access component-task logs, and obtain MapReduce cluster status information.
The job submission process includes:
1. Checking the input and output specifications of the job.
2. Computing the InputSplit values for the job.
3. Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
4. Copying the job's jar and configuration to the MapReduce system directory on the FileSystem.
5. Submitting the job to the JobTracker and monitoring its status.
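As an alternative to the blocking JobClient.runJob(conf) call used by WordCount, a job can be submitted asynchronously and polled, roughly as sketched below (the polling interval and print format are arbitrary choices):

import java.io.IOException;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitAndMonitor {
  public static void submit(JobConf conf) throws IOException, InterruptedException {
    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);          // returns immediately
    while (!job.isComplete()) {
      System.out.printf("map %.0f%%  reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);                             // poll every few seconds
    }
    System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");
  }
}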
The job's history files are recorded in the "_logs/history/" subdirectory of a specified directory. This directory is set via hadoop.job.history.user.location and defaults to the job output directory, so by default the files are stored under mapred.output.dir/_logs/history. Users can set hadoop.job.history.user.location to none to disable this logging.