MapReduce Programming Basics
1. WordCount Sample and MapReduce program framework
2. MapReduce Program Execution Flow
3. Learning MapReduce Programming in Depth (1)
4. References and code download

<1>. WordCount Sample and MapReduce Program Framework
We first run a simple MapReduce program end to end, and then use this program to summarize the MapReduce programming model.
Download the source program: /files/xuqiang/wordcount.rar, and package the program into wordcount.jar. Then write a text file, here called wordcountmrtrial, and upload it to HDFS, where its path is /tmp/wordcountmrtrial. Run the following command:
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ ./bin/hadoop jar wordcount.jar wordcount /tmp/wordcountmrtrial /tmp/result
If the task completes, a result similar to this will be generated in the HDFS /tmp/result directory:
Gentleman give 8 Go 6 good 9 Government 16
Running the program is basically that simple. Now let's look at the program itself:
The main function first creates a Job object: Job job = new Job(conf, "word count"). It then sets the mapper class and reducer class for the job, sets the input path with FileInputFormat.addInputPath(job, new Path(otherArgs[0])), sets the output path with FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])), and finally waits for the job to finish: System.exit(job.waitForCompletion(true) ? 0 : 1). As you can see, main merely starts a job and sets its parameters; the actual MapReduce work is implemented by the Mapper class and the Reducer class.
The map function in the TokenizerMapper class splits each line into <k2, v2> pairs, and IntSumReducer then reduces <k2, list(v2)> to the final result <k3, v3>.
This example sums up the basic MapReduce programming model: a Mapper class, a Reducer class, and a driver class.
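For reference, here is a minimal sketch of what such a program looks like. It follows the standard WordCount example shipped with Hadoop; the downloadable source may differ slightly (for instance, the original parses otherArgs with GenericOptionsParser, while this sketch reads the paths directly from args).

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: splits each line into words and emits <word, 1>
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer: sums the counts for each word and emits <word, total>
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: configures and submits the job
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }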
<2>. MapReduce Program Execution Process
The execution process described here focuses on a conceptual understanding; a more complete description can be found in the reference [here].
First the user specifies the file to be processed; in WordCount this is the file wordcountmrtrial. Hadoop splits the input file into records (key/value pairs) according to the configured input data format and passes each record to the map function. In the WordCount example, the record is <line_number, line_content>, i.e. the line number and the content of that line;
the map function then produces <k2, v2> from each input record; in the WordCount example <k2, v2> is <single_word, word_count>, for example <"a", 1>;
once the map phase is complete, Hadoop groups the generated <k2, v2> pairs by k2 to form <k2, list(v2)>, which is handed to the reduce function; in the reduce function the program's final output <k3, v3> is produced.

<3>. Learning MapReduce Programming in Depth (1)

3.1 Hadoop Data Types
Because Hadoop needs to serialize keys and values and send them over the network to other machines in the cluster, types used in Hadoop must be serializable.
Specifically, for custom types: if a class implements the Writable interface, it can be used as a value type; if it implements the WritableComparable<T> interface, it can be used as either a value type or a key type.
Hadoop itself ships with a number of predefined classes (such as IntWritable, LongWritable, and Text), and these types implement the WritableComparable<T> interface.
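As an illustration, here is a minimal sketch of a custom type; the class name PointWritable and its fields are made up for this example. Implementing write/readFields (the Writable part) is enough for a value type; adding compareTo via WritableComparable also makes it usable as a key.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical custom type that can serve as a key or a value.
    public class PointWritable implements WritableComparable<PointWritable> {
        private int x;
        private int y;

        public PointWritable() {}                      // Hadoop needs a no-arg constructor

        public PointWritable(int x, int y) { this.x = x; this.y = y; }

        public void write(DataOutput out) throws IOException {    // serialization
            out.writeInt(x);
            out.writeInt(y);
        }

        public void readFields(DataInput in) throws IOException { // deserialization
            x = in.readInt();
            y = in.readInt();
        }

        public int compareTo(PointWritable other) {    // required for use as a key
            if (x != other.x) return x < other.x ? -1 : 1;
            if (y != other.y) return y < other.y ? -1 : 1;
            return 0;
        }
    }

In practice a key type should also provide a consistent hashCode(), since the default HashPartitioner uses it to assign keys to reducers.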
3.2 Mapper
For a class to act as a mapper, it needs to implement the Mapper interface and extend MapReduceBase. In the MapReduceBase class, two methods deserve particular attention:
void configure(JobConf job): this method is called before the task runs;
void close(): called after the task has run.
The remaining work is to write the map method, whose prototype is as follows:

void map(Object key, Text value, Context context) throws IOException, InterruptedException;

This method produces <k2, v2> from <k1, v1> and writes the output through the Context.
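The prototype above is from the newer org.apache.hadoop.mapreduce API. As a minimal sketch of the older org.apache.hadoop.mapred style this section describes (implement Mapper, extend MapReduceBase), here is a hypothetical mapper that emits <line length, 1> for every input line:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical mapper: emits <line length, 1> for every input line.
    public class LineLengthMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, IntWritable, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        public void configure(JobConf job) {
            // called once before any map() calls; read job parameters here if needed
        }

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<IntWritable, IntWritable> output,
                        Reporter reporter) throws IOException {
            output.collect(new IntWritable(value.toString().length()), ONE);
        }

        @Override
        public void close() throws IOException {
            // called once after the last map() call; release resources here
        }
    }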
Hadoop likewise ships with several predefined Mapper implementations.
3.3 Reducer
For a class to act as a reducer, it needs to implement the Reducer interface and extend MapReduceBase.
The reducer receives the key/value pairs passed from the mappers, sorts and groups them by key to form <k2, list(v2)>, and then produces <k3, v3> from <k2, list(v2)>.
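Continuing the hypothetical example from section 3.2, a matching old-API reducer might sum the 1s emitted for each line length, producing <line length, number of lines>:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical reducer paired with LineLengthMapper above.
    public class LineLengthReducer extends MapReduceBase
            implements Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

        @Override
        public void reduce(IntWritable key, Iterator<IntWritable> values,
                           OutputCollector<IntWritable, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }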
Hadoop also ships with several predefined Reducer implementations.
3.4 Partitioner
The main function of a Partitioner is to "direct" the mapper's output to the appropriate reducer. For a class to act as a partitioner, it needs to implement the Partitioner interface, which extends JobConfigurable and is defined as follows:

public interface Partitioner<K2, V2> extends JobConfigurable {
  /**
   * Get the partition number for a given key (hence record) given the total
   * number of partitions i.e. number of reduce-tasks for the job.
   *
   * <p>Typically a hash function on all or a subset of the key.</p>
   *
   * @param key the key to be partitioned.
   * @param value the entry value.
   * @param numPartitions the total number of partitions.
   * @return the partition number for the <code>key</code>.
   */
  int getPartition(K2 key, V2 value, int numPartitions);
}
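If no partitioner is configured, Hadoop uses HashPartitioner, which hashes the key. As a minimal sketch of a custom partitioner (the class name and the first-letter scheme are made up for illustration), here is one that sends all keys starting with the same character to the same reducer:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Hypothetical partitioner: keys with the same first letter go to the same reducer.
    public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

        @Override
        public void configure(JobConf job) {
            // read configuration parameters here if the partitioner needs any
        }

        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String s = key.toString();
            char first = s.isEmpty() ? ' ' : s.charAt(0);
            return Character.toLowerCase(first) % numPartitions;
        }
    }

Such a class would be registered on the job with JobConf.setPartitionerClass(FirstLetterPartitioner.class).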
Hadoop decides which reducer a mapper's output record is sent to based on the return value of getPartition; key/value pairs for which getPartition returns the same value are "directed" to the same reducer.

3.5 Input Data Format and Output Data Format

3.5.1 Input Data Format
So far we have assumed that the input to a MapReduce program already consists of key/value pairs, i.e. <k1, v1>, but in practice the input is usually a large file. How is that file converted into <k1, v1>, i.e. file -> <k1, v1>? This is what the InputFormat interface is for.
Hadoop provides several common InputFormat implementation classes.
Of course, besides using Hadoop's predefined input data formats, you can define your own, which requires implementing the InputFormat interface. The interface contains only two methods:
InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

This method splits a large file into smaller pieces (splits).

RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;

This method takes a split as input and returns a RecordReader, which is used to iterate over the records within the split.

3.5.2 Output Data Format
Each reducer writes its output to its own result file; the output data format configures the format of these files. Hadoop provides several implementations out of the box.
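To show how these formats are wired into a job, here is a minimal sketch using the older JobConf API (the class name FormatConfigExample is made up for illustration); with no mapper or reducer set, Hadoop's identity mapper and reducer simply pass each parsed record through:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class FormatConfigExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(FormatConfigExample.class);
            conf.setJobName("format example");

            // parse each input line at the first tab into <key, value>
            // instead of the default <byte offset, whole line>
            conf.setInputFormat(KeyValueTextInputFormat.class);
            // write the results back out as plain "key <tab> value" text
            conf.setOutputFormat(TextOutputFormat.class);

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // a real job would also set its mapper and reducer here, as in 3.2 and 3.3
            JobClient.runJob(conf);
        }
    }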
3.6 Streaming in Hadoop

3.6.1 Execution Process
We know that Linux has the notion of a "stream" (pipe), which lets us write commands like the following:
cat input.txt | randomsample.py > sampled_output.txt
Similarly, we can use this style of command in Hadoop, which can obviously speed up program development considerably. Let's look at how streaming works in Hadoop:
Hadoop streaming reads data from standard input (stdin) and by default splits each line on \t; if no tab is present, the entire line is treated as the key and the value is empty;
then the mapper program is invoked, which outputs <k2, v2>;
after that, the partitioner is called to route each <k2, v2> to the corresponding reducer;
finally, the reducer takes <k2, list(v2)> as input, produces the final result <k3, v3>, and writes it to stdout.

3.6.2 A Simple Sample Program
Suppose we need to complete the following task: given an input file in which each line is a number, find the largest number in the file (of course, you could use the aggregate package that ships with streaming). First we write a Python script (if you are not very familiar with Python, see [here]):
3.6.2.1 Prepare data
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ echo "http://www.cs.brandeis.edu" > url1
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ echo "http://www.nytimes.com" > url2
Upload them to HDFS:
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ bin/hadoop dfs -mkdir urls
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ bin/hadoop dfs -put url1 urls/
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ bin/hadoop dfs -put url2 urls/
3.6.2.2 Write the Mapper: multifetch.py

#!/usr/bin/env python
import sys, urllib, re

# regular expression that extracts the contents of the <title> tag
title_re = re.compile("<title>(.*?)</title>",
                      re.MULTILINE | re.DOTALL | re.IGNORECASE)

for line in sys.stdin:
    # we assume that we are fed a series of urls, one per line
    url = line.strip()
    # fetch the content and output the title (pairs are tab-delimited)
    match = title_re.search(urllib.urlopen(url).read())
    if match:
        print url, "\t", match.group(1).strip()
The main job of this script is: given a URL, output the title of the HTML page that the URL points to.
Test the program locally:

xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ echo "http://www.cs.brandeis.edu" > urls
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ echo "http://www.nytimes.com" >> urls
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ sudo chmod u+x ./multifetch.py
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ cat urls | ./multifetch.py

This will output:
http://www.cs.brandeis.edu     Computer Science Department | Brandeis University
http://www.nytimes.com         The New York Times - Breaking News, World News & Multimedia
3.6.2.3 Write the Reducer: reducer.py

#!/usr/bin/env python
from operator import itemgetter
import sys

# simply echo each input line (an identity reducer)
for line in sys.stdin:
    line = line.strip()
    print line
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ chmod u+x ./reducer.py
Now that our mapper and reducer are ready, we first test the program locally; the following pipeline simulates the process that runs on Hadoop:
First, the mapper reads data from stdin, one line at a time;
then each line's content is treated as a URL, the title of the HTML page at that URL is fetched, and <url, url-title-content> is output;
next, the sort command sorts the mapper's output;
finally, the sorted result is fed to the reducer, which here simply echoes the results.

xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ cat urls | ./multifetch.py | sort | ./reducer.py
http://www.cs.brandeis.edu     Computer Science Department | Brandeis University
http://www.nytimes.com         The New York Times - Breaking News, World News & Multimedia
Clearly, the program works correctly.
3.6.2.4 Run on Hadoop Streaming

xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ bin/hadoop jar ./mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
> -mapper /home/xuqiang/hadoop/src/hadoop-0.21.0/multifetch.py \
> -reducer /home/xuqiang/hadoop/src/hadoop-0.21.0/reducer.py \
> -input urls/* \
> -output titles
After the program has finished running, view the results of the operation:
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ bin/hadoop dfs -cat titles/part-00000

<4>. References and Code Download
http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-example/
http://hadoop.apache.org/mapreduce/docs/r0.21.0/streaming.html#Hadoop+Streaming
Origin: http://www.cnblogs.com/xuqiang/archive/2011/06/05/2071935.html