Hadoop Hands-On Project: Finding Anagrams (Words Made of the Same Letters)

Source: Internet
Author: User
Tags hadoop fs

Earlier lessons covered the MapReduce programming model and some programming examples. In this lesson we practice together by working through the following project.

Project requirements

An English book contains thousands of words or phrases. We need to find all the anagrams, that is, words made up of exactly the same letters, in this large word list.

Data set

The following is an excerpt of the word list taken from an English book. Click this link to download the dataset.

initiate
initiated
initiates
initiating
initiation
initiations
initiative
initiatives
initiator
initiators
initiatory
inject
injectant
injected
injecting
injection
injections
injector
injectors
injects

Analysis

Based on the above requirements, the problem can be solved in two steps:

1. In the Map phase, the letters of each word (word) are sorted alphabetically to produce sortedWord, and the key/value pair (sortedWord, word) is output.

2. In the Reduce phase, all words that share the same key, that is, the same group of letters, are collected together as anagrams.

Data processing flow

Find the anagrams among the following words.

cat
tar
bar
act
rat

Step one: the map phase

<act, cat>
<art, tar>
<abr, bar>
<act, act>
<art, rat>

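Step two: the reduce phase

<abr, bar>
<act, cat,act>
<art, tar,rat>

Note that the reducer in the program that follows only outputs groups containing at least two words, so a single-word group such as <abr, bar> does not appear in the final result.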
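The two steps above can be tried out without a cluster. The following is a minimal in-memory sketch in plain Java, not the Hadoop program itself: the class name AnagramFlowSketch and its methods are hypothetical, and a TreeMap stands in for the shuffle that groups map output by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone class; it does not use the Hadoop API.
class AnagramFlowSketch {

    // "Map" step: the key is the word's letters in sorted order.
    static String sortLetters(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // "Shuffle" step: collect all words that share a key.
    // A TreeMap keeps the keys in sorted order.
    static Map<String, List<String>> groupByKey(List<String> words) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String word : words) {
            groups.computeIfAbsent(sortLetters(word), k -> new ArrayList<>()).add(word);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cat", "tar", "bar", "act", "rat");
        // "Reduce" step: print each key followed by its comma-joined words.
        for (Map.Entry<String, List<String>> entry : groupByKey(words).entrySet()) {
            System.out.println(entry.getKey() + "\t" + String.join(",", entry.getValue()));
        }
    }
}
```

In this sketch the words within a group keep their input order. In a real MapReduce job the order of values within one key is generally not guaranteed.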

Program Development

1. Write the main class that runs the program: AnagramMain

package com.hadoop.test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AnagramMain extends Configured implements Tool {

    @SuppressWarnings("deprecation")
    @Override
    public int run(String[] args) throws Exception {
        // Use the configuration that ToolRunner passed to this Tool
        Configuration conf = getConf();

        // Delete an already existing output directory
        Path myPath = new Path(args[1]);
        FileSystem hdfs = myPath.getFileSystem(conf);
        if (hdfs.isDirectory(myPath)) {
            hdfs.delete(myPath, true);
        }

        Job job = new Job(conf, "testanagram");
        job.setJarByClass(AnagramMain.class); // Set the main class

        job.setMapperClass(AnagramMapper.class); // Mapper
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setReducerClass(AnagramReducer.class); // Reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0])); // Set the input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // Set the output path

        // Wait for the job to finish; return a non-zero code if it failed
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // The input and output paths are hard-coded here
        String[] args0 = {
            "hdfs://cloud004:9000/anagram/anagram.txt",
            "hdfs://cloud004:9000/anagram/output"
        };
        int ec = ToolRunner.run(new Configuration(), new AnagramMain(), args0);
        System.exit(ec);
    }
}

2. Write the Mapper: AnagramMapper

package com.hadoop.test;

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AnagramMapper extends Mapper<Object, Text, Text, Text> {

    private Text sortedText = new Text();
    private Text orginalText = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String word = value.toString();
        char[] wordChars = word.toCharArray(); // Convert the word to a character array
        Arrays.sort(wordChars); // Sort the character array alphabetically
        String sortedWord = new String(wordChars); // Convert the character array back to a string
        sortedText.set(sortedWord); // Set the output key
        orginalText.set(word); // Set the output value
        context.write(sortedText, orginalText); // Map output
    }
}
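One detail of the mapper is worth trying in isolation: Arrays.sort on a char[] orders characters by their numeric value, so uppercase letters sort before lowercase ones, and "Cat" and "act" would end up with different keys. The sketch below illustrates this in plain Java. The normalizedKey variant, which trims and lower-cases the word first, is an assumption for input that may contain mixed case or stray whitespace; it is not part of the original mapper, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical stand-alone class; it does not use the Hadoop API.
class KeyNormalizationSketch {

    // Same logic as the mapper: sort the characters by their char value.
    static String rawKey(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Assumed variant: trim and lower-case the word before sorting.
    static String normalizedKey(String word) {
        return rawKey(word.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(rawKey("Cat"));         // prints "Cat": 'C' sorts before 'a' and 't'
        System.out.println(rawKey("act"));         // prints "act"
        System.out.println(normalizedKey("Cat ")); // prints "act"
    }
}
```

If the word list is already lowercase with one word per line, the original mapper logic is sufficient.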

3. Write the Reducer: AnagramReducer

package com.hadoop.test;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AnagramReducer extends Reducer<Text, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();

    public void reduce(Text anagramKey, Iterable<Text> anagramValues, Context context)
            throws IOException, InterruptedException {
        String output = "";
        // Join the words that share the same letters with the ~ symbol
        for (Text anagam : anagramValues) {
            if (!output.equals("")) {
                output = output + "~";
            }
            output = output + anagam.toString();
        }
        StringTokenizer outputTokenizer = new StringTokenizer(output, "~");
        // Output only the groups that contain at least 2 words
        if (outputTokenizer.countTokens() >= 2) {
            output = output.replace("~", ",");
            outputKey.set(anagramKey.toString()); // Set the output key
            outputValue.set(output); // Set the output value
            context.write(outputKey, outputValue); // Reduce output
        }
    }
}
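The reducer joins the words with "~" and then tokenizes the string again only to count them. As a possible alternative, the sketch below counts the words while joining them directly with commas. It is plain Java operating on strings rather than Hadoop Text values, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.StringJoiner;

// Hypothetical stand-alone class; it does not use the Hadoop API.
class ReducerJoinSketch {

    // Returns the comma-joined words, or empty if the group has fewer than two words.
    static Optional<String> joinGroup(Iterable<String> words) {
        StringJoiner joined = new StringJoiner(",");
        int count = 0;
        for (String word : words) {
            joined.add(word);
            count++;
        }
        return count >= 2 ? Optional.of(joined.toString()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(joinGroup(Arrays.asList("cat", "act"))); // prints Optional[cat,act]
        System.out.println(joinGroup(Arrays.asList("bar")));        // prints Optional.empty
    }
}
```

Counting during the loop avoids the intermediate "~" separator, which would miscount if a word ever contained that character.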

Compiling and executing the MapReduce job

1. Compile and package the project as Anagram.jar, then use an SSH client to upload Anagram.jar to the /home/hadoop/djt directory on the Hadoop machine.

2. Use cd /home/hadoop/djt to switch to that directory, then run the job from the command line.

hadoop jar Anagram.jar com.hadoop.test.AnagramMain

Viewing the results

The final result of the job is written to HDFS. Use the following command to view it.

[hadoop-2.2.0-x64]$ hadoop fs -cat /anagram/output/part-r-00000

Part of the result set is shown below.

cehors	cosher,chores,ochres,ochers
cehorst	troches,hectors,torches
cehort	troche,hector
cehortu	toucher,couther,retouch
cehoss	coshes,choses
cehrt	chert,retch
cehstu	chutes,tusche
cehsty	chesty,scythe
ceht	etch,tech
ceiijstu	jesuitic,juiciest
ceiikst	ickiest,ekistic
ceiilnos	isocline,silicone
ceiilnoss	isoclines,silicones
ceiimmnoorss	commissioner,recommission
ceiimmnoorsss	recommissions,commissioners
ceiimorst	isometric,eroticism
ceiimost	semiotic,comities
ceiinnopst	inceptions,inspection
ceiinrsstu	scrutinies,scrutinise
ceiinrst	citrines,crinites,inciters
ceiinrt	citrine,inciter
ceiinss	iciness,incises
ceiintz	citizen,zincite
ceiist	iciest,cities
ceikln	nickel,nickle
ceiklnr	crinkle,clinker
ceiklnrs	clinkers,crinkles
ceiklns	nickels,nickles
ceiklrs	slicker,lickers
ceiklrsst	sticklers,strickles
ceiklrst	trickles,ticklers,stickler
ceiklrt	tickler,trickle
ceiklsst	slickest,stickles
ceiklst	keltics,stickle,tickles
ceiklt	tickle,keltic
ceiknrs	nickers,snicker
ceikorr	rockier,corkier
ceikorst	stockier,corkiest,rockiest
ceikpst	skeptic,pickets
ceikrst	rickets,tickers,sticker
ceil	lice,ceil
ceilmop	compile,polemic
ceilmopr	compiler,complier
ceilmoprs	compliers,compilers
ceilmops	polemics,complies,compiles
ceilnoos	colonise,colonies
ceilnors	incloser,licensor
ceilnorss	inclosers,licensors
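A quick way to sanity-check such output is to confirm that every word on a line sorts back to the line's key. The sketch below does this in plain Java for a single line. It assumes the key and the value are separated by a tab, which is the default for Hadoop's text output, and that the words are comma-separated as produced by the reducer. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical stand-alone checker; it does not use the Hadoop API.
class ResultLineCheck {

    static String sortLetters(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // A line is assumed to look like "<key><TAB><word1>,<word2>,...".
    static boolean isConsistent(String line) {
        String[] parts = line.split("\t", 2);
        if (parts.length != 2) {
            return false; // no tab separator found
        }
        String[] words = parts[1].split(",");
        if (words.length < 2) {
            return false; // the reducer only emits groups of two or more words
        }
        for (String word : words) {
            if (!sortLetters(word).equals(parts[0])) {
                return false; // this word is not an anagram of the key
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isConsistent("ceht\tetch,tech"));          // prints true
        System.out.println(isConsistent("ceilmop\tcompile,polemic")); // prints true
        System.out.println(isConsistent("ceht\tetch,teach"));         // prints false
    }
}
```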
