Having covered the MapReduce programming model and some programming examples, in this lesson we practice together by completing the following project.
Project requirements
An English book contains thousands of words or phrases. Given a large collection of words, we need to find all anagrams, that is, groups of words made up of exactly the same letters.
Data set
The following is an excerpt of the words taken from an English book, one word per line. The full dataset can be downloaded from the link.
initiate
initiated
initiates
initiating
initiation
initiations
initiative
initiatives
initiator
initiators
initiatory
inject
injectant
injected
injecting
injection
injections
injector
injectors
injects
Thinking analysis
Based on the requirements above, the job can be split into two steps:
1. In the Map phase, the letters of each word (word) are sorted alphabetically to produce sortedWord, and the key/value pair (sortedWord, word) is emitted.
2. In the Reduce phase, all words that share the same sorted-letter key, i.e. all words that are anagrams of one another, are collected together.
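The key idea in step 1 can be tried out without Hadoop. The following is a minimal sketch; the class name AnagramKey and the method name sortLetters are made up for illustration and are not part of the project code.

```java
import java.util.Arrays;

class AnagramKey {
    // Sort the letters of a word so that all anagrams share the same key.
    static String sortLetters(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(sortLetters("cat")); // act
        System.out.println(sortLetters("act")); // act
        System.out.println(sortLetters("tar")); // art
    }
}
```

Because "cat" and "act" produce the same key, the framework delivers them to the same reduce call.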
Data processing schematic flow
Suppose we need to find the anagrams among the following words.
cat tar bar act rat
Step one: process through the Map phase
<act, cat> <art, tar> <abr, bar> <act, act> <art, rat>
Step two: process through the Reduce phase
<abr, bar> <act, cat,act> <art, tar,rat>
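The flow above can be simulated in plain Java. This is only a sketch: a TreeMap stands in for the sort-and-group step that Hadoop performs between the Map and Reduce phases, and the names AnagramFlowSketch and groupByKey are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AnagramFlowSketch {
    // Map step: derive the sorted-letter key for one word.
    static String sortLetters(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Stand-in for the shuffle: collect words under their key, keys in sorted order.
    static Map<String, List<String>> groupByKey(List<String> words) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String word : words) {
            groups.computeIfAbsent(sortLetters(word), k -> new ArrayList<>()).add(word);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cat", "tar", "bar", "act", "rat");
        // Prints one line per group: <abr, bar>, <act, cat,act>, <art, tar,rat>
        groupByKey(words).forEach((key, group) ->
                System.out.println("<" + key + ", " + String.join(",", group) + ">"));
    }
}
```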
Program Development
1. Write the main class that runs the program: AnagramMain
package com.hadoop.test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AnagramMain extends Configured implements Tool {

    @SuppressWarnings("deprecation")
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Delete the output directory if it already exists
        Path myPath = new Path(args[1]);
        FileSystem hdfs = myPath.getFileSystem(conf);
        if (hdfs.isDirectory(myPath)) {
            hdfs.delete(myPath, true);
        }

        Job job = new Job(conf, "testAnagram");
        job.setJarByClass(AnagramMain.class);       // set the main class
        job.setMapperClass(AnagramMapper.class);    // Mapper
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(AnagramReducer.class);  // Reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // set the input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // set the output path

        // Report failure through the exit code instead of always returning 0
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // Input and output paths are hard-coded for this exercise
        String[] args0 = {
            "hdfs://cloud004:9000/anagram/anagram.txt",
            "hdfs://cloud004:9000/anagram/output"
        };
        int ec = ToolRunner.run(new Configuration(), new AnagramMain(), args0);
        System.exit(ec);
    }
}
2. Write the Mapper: AnagramMapper
package com.hadoop.test;

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AnagramMapper extends Mapper<Object, Text, Text, Text> {

    private Text sortedText = new Text();
    private Text originalText = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String word = value.toString();
        char[] wordChars = word.toCharArray();        // convert the word to a character array
        Arrays.sort(wordChars);                       // sort the characters alphabetically
        String sortedWord = new String(wordChars);    // convert the character array back to a string
        sortedText.set(sortedWord);                   // set the output key
        originalText.set(word);                       // set the output value
        context.write(sortedText, originalText);      // map output
    }
}
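The Mapper above uses each input line exactly as it is read. If the input might contain surrounding whitespace or mixed case, which is an assumption and not something the dataset excerpt shows, the key could be normalized first. A possible sketch, with the hypothetical names NormalizedKeySketch and normalizedKey:

```java
import java.util.Arrays;
import java.util.Locale;

class NormalizedKeySketch {
    // Trim and lower-case the line before sorting its letters, so that
    // "Cat " and "act" end up with the same key.
    static String normalizedKey(String line) {
        String word = line.trim().toLowerCase(Locale.ROOT);
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(normalizedKey(" Cat ")); // act
        System.out.println(normalizedKey("act"));   // act
    }
}
```

Without this step, Arrays.sort places uppercase letters before lowercase ones, so "Cat" would get the key "Cat" rather than "act".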
3. Write the Reducer: AnagramReducer
package com.hadoop.test;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AnagramReducer extends Reducer<Text, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();

    @Override
    public void reduce(Text anagramKey, Iterable<Text> anagramValues, Context context)
            throws IOException, InterruptedException {
        String output = "";
        // Join the words that share the same letters, separated by "~"
        for (Text anagram : anagramValues) {
            if (!output.equals("")) {
                output = output + "~";
            }
            output = output + anagram.toString();
        }
        StringTokenizer outputTokenizer = new StringTokenizer(output, "~");
        // Only emit groups that contain at least two words
        if (outputTokenizer.countTokens() >= 2) {
            output = output.replace("~", ",");
            outputKey.set(anagramKey.toString());    // set the output key
            outputValue.set(output);                 // set the output value
            context.write(outputKey, outputValue);   // reduce output
        }
    }
}
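The Reducer's filtering rule can also be expressed without the intermediate "~" separator by counting words while joining them. This is an alternative sketch in plain Java, not the code used in this project; the names ReduceStepSketch and joinIfAnagramGroup are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Optional;

class ReduceStepSketch {
    // Join the words with commas and keep the result only when the group
    // holds at least two words, mirroring the ">= 2" check in the Reducer.
    static Optional<String> joinIfAnagramGroup(Iterable<String> words) {
        StringBuilder joined = new StringBuilder();
        int count = 0;
        for (String word : words) {
            if (count > 0) {
                joined.append(',');
            }
            joined.append(word);
            count++;
        }
        return count >= 2 ? Optional.of(joined.toString()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(joinIfAnagramGroup(Arrays.asList("cat", "act")));     // Optional[cat,act]
        System.out.println(joinIfAnagramGroup(Collections.singletonList("bar"))); // Optional.empty
    }
}
```

Under this rule a single-word group such as <abr, bar> does not appear in the final output.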
Compiling and executing the MapReduce job
1. Compile and package the project as Anagram.jar, and use an SSH client to upload Anagram.jar to the /home/hadoop/djt directory on the Hadoop machine.
2. Use cd /home/hadoop/djt to switch to that directory and run the job from the command line.
hadoop jar Anagram.jar com.hadoop.test.AnagramMain
Viewing the results
The final result of the job is written to HDFS; use the following command to view it.
[hadoop-2.2.0-x64]$ hadoop fs -cat /anagram/output/part-r-00000
The partial result set is shown below.
cehors	cosher,chores,ochres,ochers
cehorst	troches,hectors,torches
cehort	troche,hector
cehortu	toucher,couther,retouch
cehoss	coshes,choses
cehrt	chert,retch
cehstu	chutes,tusche
cehsty	chesty,scythe
ceht	etch,tech
ceiijstu	jesuitic,juiciest
ceiikst	ickiest,ekistic
ceiilnos	isocline,silicone
ceiilnoss	isoclines,silicones
ceiimmnoorss	commissioner,recommission
ceiimmnoorsss	recommissions,commissioners
ceiimorst	isometric,eroticism
ceiimost	semiotic,comities
ceiinnopst	inceptions,inspection
ceiinrsstu	scrutinies,scrutinise
ceiinrst	citrines,crinites,inciters
ceiinrt	citrine,inciter
ceiinss	iciness,incises
ceiintz	citizen,zincite
ceiist	iciest,cities
ceikln	nickel,nickle
ceiklnr	crinkle,clinker
ceiklnrs	clinkers,crinkles
ceiklns	nickels,nickles
ceiklrs	slicker,lickers
ceiklrsst	sticklers,strickles
ceiklrst	trickles,ticklers,stickler
ceiklrt	tickler,trickle
ceiklsst	slickest,stickles
ceiklst	keltics,stickle,tickles
ceiklt	tickle,keltic
ceiknrs	nickers,snicker
ceikorr	rockier,corkier
ceikorst	stockier,corkiest,rockiest
ceikpst	skeptic,pickets
ceikrst	rickets,tickers,sticker
ceil	lice,ceil
ceilmop	compile,polemic
ceilmopr	compiler,complier
ceilmoprs	compliers,compilers
ceilmops	polemics,complies,compiles
ceilnoos	colonise,colonies
ceilnors	incloser,licensor
ceilnorss	inclosers,licensors
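A quick way to sanity-check such output is to confirm that every word on a line sorts back to that line's key. The sketch below assumes each result line has the form key, a tab, then a comma-separated word list (the default key/value separator of Hadoop's text output is a tab); the names ResultLineCheck and isConsistent are hypothetical.

```java
import java.util.Arrays;

class ResultLineCheck {
    static String sortLetters(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Returns true when the line is "key<TAB>w1,w2,..." and every word
    // sorts to exactly the key.
    static boolean isConsistent(String line) {
        String[] parts = line.split("\t", 2);
        if (parts.length != 2) {
            return false;
        }
        String key = parts[0];
        for (String word : parts[1].split(",")) {
            if (!sortLetters(word).equals(key)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isConsistent("ceht\tetch,tech")); // true
        System.out.println(isConsistent("ceht\tetch,chat")); // false
    }
}
```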
Hadoop hands-on project: finding anagrams made up of the same letters