Hadoop is becoming increasingly popular, and at its core is MapReduce, which plays a central role in Hadoop's parallel computing and is the basis of program development on Hadoop. To learn more about it, let's take a look at WordCount, a simple MapReduce example.
First, let's get to know what MapReduce is.
MapReduce is the combination of two words: "map" (mapping) and "reduce" (simplification). It is a programming model for parallel operations on large-scale datasets (larger than 1 TB), and its main idea comes from functional programming.
In Hadoop, the MapReduce process is divided into three steps: map (decompose the job into tasks that run in parallel), combine (a local pre-aggregation, mainly to improve the efficiency of reduce), and reduce (summarize the processed results). In word counting, for example, map emits a (word, 1) pair for each word, combine pre-sums those pairs on each node, and reduce produces the final totals.
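As a loose analogy (this is not Hadoop code, just a single-machine sketch of the functional idea), word counting can be expressed with Java streams, where splitting into words plays the role of map and the grouping/counting collector plays the role of reduce:

import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MapReduceAnalogy {
    public static void main(String[] args) {
        String text = "one fish two fish red fish blue fish";
        // "map": split the line into words; "reduce": group equal words and count them
        Map<String, Long> counts = Arrays.stream(text.split("\\s+"))
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
        System.out.println(counts); // prints {blue=1, fish=4, one=1, red=1, two=1}
    }
}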
Another post explains how to build a Hadoop runtime environment: http://www.cnblogs.com/taven/archive/2012/08/12/2634145.html
Now let's take a look at the startup code for running a Hadoop job:
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
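For completeness, here is a minimal self-contained version of the driver, sketched after the classic WordCount example shipped with Hadoop distributions of this era; the main method, argument handling, and imports are filled in here and were not part of the snippet above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // separate generic Hadoop options from the input/output path arguments
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it would typically be launched with something like "hadoop jar wordcount.jar WordCount <in> <out>", where the jar name and the input/output paths are placeholders.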
As you can see, before running a Hadoop job you must specify the mapper class, the combiner class, and the reducer class.
Assume the text we hand to Hadoop for analysis is the following (note the empty second line):
lixy csy lixy zmde nitamade hehe

realy amoeba woyou weibo hehe
The content is very simple: three lines of text, where line 1 contains several words, line 2 is empty, and line 3 contains several words, with the words separated by spaces. Let's look at how the mapper class is implemented and how it runs, starting with the code of TokenizerMapper:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        System.out.println("TokenizerMapper.map...");
        System.out.println("map key: " + key.toString() + " map value: " + value.toString());
        // split the line into words and emit (word, 1) for each occurrence
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            String tmp = itr.nextToken();
            word.set(tmp);
            context.write(word, one);
            System.out.println("tmp: " + tmp + " one: " + one);
        }
        System.out.println("context: " + context.toString());
    }
}
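If StringTokenizer is unfamiliar: it simply splits a string on whitespace, which is all the mapper relies on. A standalone snippet (plain JDK, no Hadoop needed) shows the behavior:

import java.util.StringTokenizer;

public class TokenizerDemo {
    public static void main(String[] args) {
        // splits on spaces, tabs and newlines by default; empty tokens are skipped
        StringTokenizer itr = new StringTokenizer("lixy csy lixy zmde nitamade hehe");
        while (itr.hasMoreTokens()) {
            System.out.println(itr.nextToken()); // prints lixy, csy, lixy, zmde, nitamade, hehe
        }
    }
}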
Note the intention of "IntWritable one = new IntWritable(1)": no matter how many times a word appears, each occurrence counts exactly once, so the line "context.write(word, one)" always writes the value 1 along with each word.
Depending on the content of your file, the map(Object key, Text value, Context context) method above may be called multiple times. For the file content in this example, the console outputs the following (I added some line breaks for readability):
TokenizerMapper.map...
map key: 0 map value: lixy csy lixy zmde nitamade hehe
tmp: lixy one: 1
tmp: csy one: 1
tmp: lixy one: 1
tmp: zmde one: 1
tmp: nitamade one: 1
tmp: hehe one: 1
context: org.apache.hadoop.mapreduce.Mapper$Context@1af0b4a3
TokenizerMapper.map...
map key: 34 map value:
context: org.apache.hadoop.mapreduce.Mapper$Context@1af0b4a3
TokenizerMapper.map...
map key: 36 map value: realy amoeba woyou weibo hehe
tmp: realy one: 1
tmp: amoeba one: 1
tmp: woyou one: 1
tmp: weibo one: 1
tmp: hehe one: 1
context: org.apache.hadoop.mapreduce.Mapper$Context@1af0b4a3
IntSumReducer.reduce...
val.get(): 1
reduce key: amoeba reduce result: 1
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@d5d4de6 result: 1
IntSumReducer.reduce...
val.get(): 1
reduce key: csy reduce result: 1
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@d5d4de6 result: 1
IntSumReducer.reduce...
val.get(): 1
val.get(): 1
reduce key: hehe reduce result: 2
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@d5d4de6 result: 2
IntSumReducer.reduce...
val.get(): 1
val.get(): 1
reduce key: lixy reduce result: 2
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@d5d4de6 result: 2
IntSumReducer.reduce...
val.get(): 1
reduce key: nitamade reduce result: 1
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@d5d4de6 result: 1
IntSumReducer.reduce...
val.get(): 1
reduce key: realy reduce result: 1
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@d5d4de6 result: 1
IntSumReducer.reduce...
val.get(): 1
reduce key: weibo reduce result: 1
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@d5d4de6 result: 1
IntSumReducer.reduce...
val.get(): 1
reduce key: woyou reduce result: 1
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@d5d4de6 result: 1
IntSumReducer.reduce...
val.get(): 1
reduce key: zmde reduce result: 1
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@d5d4de6 result: 1
From the console output of the map(Object key, Text value, Context context) calls, we can analyze what happened. The file has three lines (counting the empty second line), so the method is called three times in total, because TextInputFormat feeds the input to the mapper line by line.
Each line's content is passed in the value parameter, and each line gets a key: the byte offset at which the line starts in the file. That is why line 1's key is 0, line 2's key is 34, and line 3's key is 36 (line 1 is 32 characters long, so these offsets suggest two-character \r\n line endings). The key here is always numeric.
In this map method we add our own processing logic, in this case splitting each line into words on whitespace, and write each intermediate result to the context parameter so that Hadoop can take care of the subsequent processing. (Remember that the variable "one" always carries the value 1.)
One more thing worth noticing in the log above: the IntSumReducer.reduce lines with the Reducer$Context instance @d5d4de6 come from the combine phase. Because the job registers IntSumReducer as the combiner (job.setCombinerClass), it already runs against the map output at this point, which is why "hehe" and "lixy" each show two separate val.get(): 1 calls here.
Having seen the map process, let's now look at the reduce process, starting with the IntSumReducer code:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        System.out.println("IntSumReducer.reduce...");
        int sum = 0;
        // sum every count that was emitted for this key
        for (IntWritable val : values) {
            sum += val.get();
            System.out.println("val.get(): " + val.get());
        }
        result.set(sum);
        context.write(key, result);
        System.out.println("reduce key: " + key.toString() + " reduce result: " + result.get());
        System.out.println("reduce context: " + context + " result: " + result);
    }
}
When the reduce phase runs, the console outputs the following:
IntSumReducer.reduce...
val.get(): 1
reduce key: amoeba reduce result: 1
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@6c04ab2f result: 1
IntSumReducer.reduce...
val.get(): 1
reduce key: csy reduce result: 1
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@6c04ab2f result: 1
IntSumReducer.reduce...
val.get(): 2
reduce key: hehe reduce result: 2
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@6c04ab2f result: 2
IntSumReducer.reduce...
val.get(): 2
reduce key: lixy reduce result: 2
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@6c04ab2f result: 2
IntSumReducer.reduce...
val.get(): 1
reduce key: nitamade reduce result: 1
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@6c04ab2f result: 1
IntSumReducer.reduce...
val.get(): 1
reduce key: realy reduce result: 1
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@6c04ab2f result: 1
IntSumReducer.reduce...
val.get(): 1
reduce key: weibo reduce result: 1
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@6c04ab2f result: 1
IntSumReducer.reduce...
val.get(): 1
reduce key: woyou reduce result: 1
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@6c04ab2f result: 1
IntSumReducer.reduce...
val.get(): 1
reduce key: zmde reduce result: 1
reduce context: org.apache.hadoop.mapreduce.Reducer$Context@6c04ab2f result: 1
When the reduce(Text key, Iterable<IntWritable> values, Context context) method is called, the interesting part has already happened: Hadoop has grouped the map output by key before handing it to us. In other words, the key parameter holds the word itself, and if a word was emitted twice, the values parameter holds two values while the key appears only once. For example, the word "lixy" appears twice in the first line, so in the combine phase the key "lixy" arrives once with two IntWritable values, both 1 (the 1 we set ourselves in map). In the final reduce phase, because the combiner has already summed them, "lixy" arrives once with the single value 2, exactly as the second log shows.
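To make that grouping concrete, here is a minimal sketch in plain Java (no Hadoop involved; class and variable names are made up for illustration) that imitates what the shuffle phase does with the map output before reduce is called:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {
    public static void main(String[] args) {
        // map output for line 1: each word paired with the count 1
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("lixy", 1), Map.entry("csy", 1), Map.entry("lixy", 1),
                Map.entry("zmde", 1), Map.entry("nitamade", 1), Map.entry("hehe", 1));
        // the shuffle groups all values by key, so each key reaches reduce exactly once
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        // "reduce": sum the values for each key; lixy -> [1, 1] -> 2
        grouped.forEach((word, ones) ->
                System.out.println(word + " = " + ones.stream().mapToInt(Integer::intValue).sum()));
    }
}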
This example only illustrates a simple use of MapReduce; its capabilities go far beyond this, for instance sorting and aggregating data, especially over very large tables. If you have 1 TB of text and need to count how many times each word appears, a single computer could take a very long time just to load the text, let alone compute over it. If you hand the job to Hadoop running on several machines, Hadoop splits the 1 TB of text into multiple chunks, distributes them to different physical machines, executes your MapReduce logic on each chunk in parallel, and then merges the results from all the machines for you. Splitting the work and processing it in parallel greatly improves efficiency.