Article Directory
1. WordCount Example and the MapReduce Program Framework
2. MapReduce Program Execution Process
3. In-Depth Study of MapReduce Programming (1)
- 3.5.1 Input Data Format
- 3.5.2 Output Data Format
- 3.6.1 Execution Process
- 3.6.2 A Simple Example Program
4. References and Code Download
<1>. WordCount Example and the MapReduce Program Framework
First, let's run a simple MapReduce program, and then use it to illustrate the MapReduce programming model.
Download the source program (/files/xuqiang/wordcount.rar) and package it into wordcount.jar. Write a text file named wordcountmrtrial and upload it to HDFS; the path here is /tmp/wordcountmrtrial. Then run the following command:
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ ./bin/hadoop jar wordcount.jar wordcount /tmp/wordcountmrtrial /tmp/result
If the task completes, a result similar to the following is generated in the /tmp/result directory on HDFS:
gentleman 11
get 12
give 8
go 6
good 9
government 16
That is roughly how the program is run. Now let's look at the program itself:
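The original listing is not reproduced here, so what follows is a sketch equivalent to the standard Hadoop WordCount example (new org.apache.hadoop.mapreduce API) that the text below walks through:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  // Map: split each line into words and emit <word, 1>
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts for each word and emit <word, total>
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}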
In the main function, a Job object is created first: job = new Job(conf, "word count"). The job's Mapper class and Reducer class are then set, the input file path is set with FileInputFormat.addInputPath(job, new Path(otherArgs[0])), the output file path is set with FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])), and finally the program waits for the job to finish: System.exit(job.waitForCompletion(true) ? 0 : 1). As you can see, main only starts a job and sets its parameters; the actual MapReduce logic lives in the Mapper class and the Reducer class.
The map function in the TokenizerMapper class splits a line into <K2, V2> pairs, and the reduce function in IntSumReducer converts <K2, list<V2>> into the final result <K3, V3>.
This example basically sums up the simple MapReduce programming model: a Mapper class, a Reducer class, and a driver class.
<2>. MapReduce Program Execution Process
The execution process described here focuses on understanding it from the program's perspective; for a more complete description, refer to [here].
First, the user specifies the file to be processed; in WordCount this is the file wordcountmrtrial. Hadoop splits the input file into records (key/value pairs) according to the configured input data format and passes these records to the map function. In the WordCount example, each record is <line_number, line_content>;
The map function then produces <K2, V2> pairs from each input record. In the WordCount example, <K2, V2> is <single_word, word_count>, for example <"good", 1>;
After the map phase completes, Hadoop groups the generated <K2, V2> pairs by K2 into <K2, list(V2)> and passes them to the reduce function, which produces the final output <K3, V3>.
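For example (using a made-up input line), the data flows through WordCount like this:

input record:       <1, "go get good go">
map output:         <"go", 1>, <"get", 1>, <"good", 1>, <"go", 1>
grouped for reduce: <"get", [1]>, <"go", [1, 1]>, <"good", [1]>
reduce output:      <"get", 1>, <"go", 2>, <"good", 1>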
<3>. In-Depth Study of MapReduce Programming (1)
3.1 Hadoop Data Types
Since key/value pairs in Hadoop must be serialized so that they can be sent over the network to other machines in the cluster, the types used in Hadoop need to be serializable.
Specifically, a class that implements the Writable interface can be used as a value type, and a class that implements the WritableComparable<T> interface can be used as either a value type or a key type.
Hadoop already ships with predefined classes for the common types, and these classes implement the WritableComparable<T> interface.
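Besides the predefined types, you can define your own. Here is a minimal sketch of a custom key type (the class name WordPair and its fields are made up for illustration): it implements write/readFields for serialization, plus compareTo so it can also be used as a key.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom type: usable as a key because it implements WritableComparable.
public class WordPair implements WritableComparable<WordPair> {
  private String first = "";
  private String second = "";

  public WordPair() {}                      // required no-arg constructor

  public WordPair(String first, String second) {
    this.first = first;
    this.second = second;
  }

  @Override
  public void write(DataOutput out) throws IOException {    // serialization
    out.writeUTF(first);
    out.writeUTF(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException {  // deserialization
    first = in.readUTF();
    second = in.readUTF();
  }

  @Override
  public int compareTo(WordPair other) {    // ordering used when keys are sorted
    int cmp = first.compareTo(other.first);
    return cmp != 0 ? cmp : second.compareTo(other.second);
  }

  @Override
  public int hashCode() {                   // used by the default hash partitioning
    return first.hashCode() * 163 + second.hashCode();
  }
}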
3.2 Mapper
If a class wants to become a Mapper, it must implement the Mapper interface and extend MapReduceBase. In the MapReduceBase class, pay special attention to the following two methods:
void configure(JobConf job): called before the task runs;
void close(): called after the task completes.
The remaining work is to write the map method, whose prototype is as follows:
void map(Object key, Text value, Context context) throws IOException, InterruptedException;
This method produces <K2, V2> from the input <K1, V1> and writes it out through the Context object.
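To make this concrete, here is a minimal Mapper sketch using the old org.apache.hadoop.mapred API that MapReduceBase belongs to (the class name LineLengthMapper and the linelength.min parameter are made up; in that API the map method uses an OutputCollector instead of a Context):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of a Mapper in the old API: extend MapReduceBase and implement Mapper<K1, V1, K2, V2>.
public class LineLengthMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private int minLength = 0;   // hypothetical parameter read in configure()

  @Override
  public void configure(JobConf job) {
    // called once before the task runs; read job parameters here
    minLength = job.getInt("linelength.min", 0);
  }

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    if (line.length() >= minLength) {
      // emit <line, length> as the intermediate <K2, V2> pair
      output.collect(new Text(line), new IntWritable(line.length()));
    }
  }

  @Override
  public void close() throws IOException {
    // called once after the task finishes; release resources here
  }
}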
Similarly, Hadoop predefines several Mapper implementations (for example IdentityMapper and TokenCountMapper).
3.3 Reducer
If a class wants to become a Reducer, it must implement the Reducer interface and extend MapReduceBase.
When the key/value pairs emitted by the Mapper reach the Reducer, the framework has already sorted and grouped them by key into <K2, list<V2>>; the Reducer then produces <K3, V3> from each <K2, list<V2>>.
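A minimal Reducer sketch in the same old API (the class name SumReducer is made up): for each <K2, list<V2>> it receives, it sums the values and emits one <K3, V3> pair.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {     // list<V2> arrives as an iterator
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));   // emit <K3, V3>
  }
}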
Some Reducers are also predefined in Hadoop (for example IdentityReducer and LongSumReducer).
3.4 Partitioner
The role of the Partitioner is to "direct" the output of the Mapper to the Reducers. If a class wants to become a Partitioner, it needs to implement the Partitioner interface, which extends JobConfigurable and is defined as follows:
public interface Partitioner<K2, V2> extends JobConfigurable {
  /**
   * Get the partition number for a given key (hence record) given the total
   * number of partitions i.e. number of reduce-tasks for the job.
   *
   * <p>Typically a hash function on all or a subset of the key.</p>
   *
   * @param key the key to be partitioned.
   * @param value the entry value.
   * @param numPartitions the total number of partitions.
   * @return the partition number for the <code>key</code>.
   */
  int getPartition(K2 key, V2 value, int numPartitions);
}
Hadoop decides which Reducer each Mapper output record is sent to based on the return value of getPartition; key/value pairs with the same return value are "directed" to the same Reducer.
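Here is a sketch of a custom Partitioner (the class name FirstLetterPartitioner and its rule, grouping keys by their first letter, are made up for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Records whose key starts with the same letter go to the same reducer.
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

  @Override
  public void configure(JobConf job) {
    // inherited from JobConfigurable; nothing to configure in this sketch
  }

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    char first = s.isEmpty() ? ' ' : s.charAt(0);
    // mask out the sign bit so the partition number is never negative
    return (Character.toLowerCase(first) & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered on the job with conf.setPartitionerClass(FirstLetterPartitioner.class).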
3.5 Input Data Format and Output Data Format
3.5.1 Input Data Format
So far we have assumed that the input of a MapReduce program is a set of key/value pairs, i.e. <K1, V1>. In general, however, the input of a MapReduce program is a large file, so the file has to be converted into <K1, V1> pairs, that is, file -> <K1, V1>. This is what the InputFormat interface is for.
Hadoop provides several commonly used InputFormat implementations (for example TextInputFormat, KeyValueTextInputFormat, and SequenceFileInputFormat).
Of course, besides Hadoop's predefined input formats, you can also define your own. This requires implementing the InputFormat interface, which contains only two methods:
InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
This method divides the large input file into splits.
RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
This method takes a split and returns a RecordReader, which is then used to iterate over the records in that split.
3.5.2 Output Data Format
Each Reducer writes its output to a result file, and the output data format controls the format of those files. Hadoop provides predefined implementations here as well (for example TextOutputFormat and SequenceFileOutputFormat).
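As a sketch of how the input and output formats are selected in an old-API driver (the class name FormatDemo is made up; with no Mapper or Reducer set, the identity implementations are used):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class FormatDemo {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(FormatDemo.class);
    conf.setJobName("format demo");

    // file -> <K1, V1>: each line is split at the first tab into key and value
    conf.setInputFormat(KeyValueTextInputFormat.class);
    // <K3, V3> -> "key \t value" lines in the result files
    conf.setOutputFormat(TextOutputFormat.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}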
3.6 Streaming in Hadoop
3.6.1 Execution Process
We know that Linux has the concept of a "stream", which lets us write commands such as:
cat input.txt | randomsample.py 10 > sampled_output.txt
Similar commands can be used in Hadoop, which can speed up program development considerably. Let's look at how Hadoop Streaming executes:
Hadoop Streaming reads data from stdin. By default each line is split at the first \t into key and value; if there is no \t, the whole line is treated as the key and the value is empty;
The mapper program is then called and outputs <K2, V2>;
The partitioner then routes each <K2, V2> to a reducer;
The reducer produces the final result <K3, V3> from the input <K2, list(V2)> and writes it to stdout.
3.6.2 A Simple Example Program
Suppose we want to do the following: given a file in which every line is a number, find the largest number in the file (of course, the aggregate package in Streaming could also be used for this). First, write a Python file (if you are not familiar with Python, refer to [here]):
3.6.2.1 Prepare the Data
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ echo "http://www.cs.brandeis.edu" > url1
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ echo "http://www.nytimes.com" > url2
Upload to HDFS:
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ bin/hadoop dfs -mkdir urls
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ bin/hadoop dfs -put url1 urls/
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ bin/hadoop dfs -put url2 urls/
3.6.2.2 Write the Mapper: multifetch.py
#!/usr/bin/env python

import sys, urllib, re

title_re = re.compile("<title>(.*?)</title>",
                      re.MULTILINE | re.DOTALL | re.IGNORECASE)

for line in sys.stdin:
    # we assume that we are fed a series of URLs, one per line
    url = line.strip()
    # fetch the content and output the title (pairs are tab-delimited)
    match = title_re.search(urllib.urlopen(url).read())
    if match:
        print url, "\t", match.group(1).strip()
This script takes a URL on each input line and outputs the title of the HTML page that the URL points to.
Test the program locally:
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ echo "http://www.cs.brandeis.edu" > urls
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ echo "http://www.nytimes.com" >> urls
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ sudo chmod u+x ./multifetch.py
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ cat urls | ./multifetch.py
This will output:
http://www.cs.brandeis.edu	Computer Science Department | Brandeis University
http://www.nytimes.com	The New York Times - Breaking News, World News & Multimedia
3.6.2.3 Write the Reducer: reducer.py
Write the reducer.py file:
#!/usr/bin/env python

from operator import itemgetter
import sys

for line in sys.stdin:
    line = line.strip()
    print line

xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ chmod u+x ./reducer.py
Now that our mapper and reducer are ready, we can first verify them locally; the following command simulates the process of running on Hadoop:
First, the mapper reads data from stdin, one line at a time;
It treats each line as a URL, fetches the title of the HTML page that the URL points to, and outputs <URL, title>;
The sort command is called to sort the mapper's output;
The sorted result is handed to the reducer, which here simply prints its input.
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ cat urls | ./multifetch.py | sort | ./reducer.py
http://www.cs.brandeis.edu	Computer Science Department | Brandeis University
http://www.nytimes.com	The New York Times - Breaking News, World News & Multimedia
The program clearly works correctly.
3.6.2.4 Run on Hadoop Streaming
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ bin/hadoop jar ./mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
> -mapper /home/xuqiang/hadoop/src/hadoop-0.21.0/multifetch.py \
> -reducer /home/xuqiang/hadoop/src/hadoop-0.21.0/reducer.py \
> -input urls/* \
> -output titles
After the program has run, view the result:
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ bin/hadoop dfs -cat titles/part-00000
<4>. References and Code Download
http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-example/
http://hadoop.apache.org/mapreduce/docs/r0.21.0/streaming.html#Hadoop+Streaming