Environment:
Hadoop 1.x, CentOS 6.5, a simulated distributed cluster built from three virtual machines, gnuplot
Data: http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html
Program objectives:
The provided web log data records simple file requests, one request per line:
205.189.154.54 - - [01/Jul/1995:00:00:29 -0400] "GET /shuttle/countdown/count.gif HTTP/1.0" 200 40310
Each line is one record in the form shown above. The goal is to calculate the number of accesses per file and the frequency distribution of those access counts.
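The mapper for the first job is not listed in this post, but pulling the requested file out of a record like the one above could look roughly like the following. This is only a sketch: the class name LogLineParser, the method requestedPath and the regular expression are assumptions made for illustration, not the author's code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineParser {
    // Assumed record layout: host - - [timestamp] "METHOD path PROTOCOL" status bytes
    // Group 2 captures the requested path; the protocol part is optional.
    private static final Pattern REQUEST =
            Pattern.compile("\"([A-Z]+)\\s+(\\S+)[^\"]*\"\\s+(\\d{3})\\s+(\\S+)\\s*$");

    // Returns the requested path, or an empty Optional for lines that do not match.
    static Optional<String> requestedPath(String line) {
        Matcher m = REQUEST.matcher(line);
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "205.189.154.54 - - [01/Jul/1995:00:00:29 -0400] "
                + "\"GET /shuttle/countdown/count.gif HTTP/1.0\" 200 40310";
        System.out.println(requestedPath(line).orElse("(unparsed)"));
    }
}
```

In a mapper, each parsed path would be emitted as the key with a count of 1, and lines that do not match would simply be skipped.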
Ideas:
This goal is easy to achieve. The main point of knowledge involved is job dependency. The solution uses two MapReduce jobs: the first calculates the number of accesses per file, the second computes the frequency distribution from the first job's output, and finally the distribution graph is drawn with the gnuplot tool.
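The data flow of the two jobs can be sketched in plain Java, without Hadoop, to make the idea concrete. The class HitFrequency and its method names are hypothetical, and the second pass reflects one reading of "counts the frequency" (how many files share each access count); the actual second job in this post may instead only sort the counts.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class HitFrequency {
    // Pass 1 (first job): number of accesses per requested file.
    static Map<String, Integer> hitsPerFile(List<String> paths) {
        Map<String, Integer> hits = new HashMap<>();
        for (String path : paths) {
            hits.merge(path, 1, Integer::sum);
        }
        return hits;
    }

    // Pass 2 (second job, assumed meaning): for each access count,
    // how many files were accessed exactly that many times, ordered by count.
    static TreeMap<Integer, Integer> frequencyOfHits(Map<String, Integer> hits) {
        TreeMap<Integer, Integer> frequency = new TreeMap<>();
        for (int count : hits.values()) {
            frequency.merge(count, 1, Integer::sum);
        }
        return frequency;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/a", "/a", "/b", "/b", "/c");
        Map<String, Integer> hits = hitsPerFile(paths);
        System.out.println(frequencyOfHits(hits));
    }
}
```

The second pass depends on the complete result of the first, which is why the real jobs need a declared dependency rather than being submitted side by side.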
First, the MapReduce program
The mappers and reducers in this program are very simple, so they are not listed here. What matters is the framework of the main program, shown below.
package ren.snail;

import java.util.regex.Matcher;   // presumably used by the omitted mapper
import java.util.regex.Pattern;   // presumably used by the omitted mapper

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;   // presumably used by the omitted reducers
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;   // same package as ControlledJob
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// GroupMapper, GroupReducer, SortMapper and SortReducer are not listed in this post.
public class Main extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int result = ToolRunner.run(new Configuration(), new Main(), args);
        System.exit(result);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration configuration = getConf();

        // Job 1: count the number of accesses per file.
        Job job1 = new Job(configuration, "groupby");
        job1.setJarByClass(Main.class);
        FileInputFormat.addInputPath(job1, new Path(arg0[0]));
        FileOutputFormat.setOutputPath(job1, new Path(arg0[1]));
        job1.setMapperClass(GroupMapper.class);
        job1.setReducerClass(GroupReducer.class);
        job1.setOutputFormatClass(TextOutputFormat.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);

        // Job 2: reads the output file of job 1 and writes to a subdirectory of it.
        Job job2 = new Job(configuration, "sort");
        job2.setJarByClass(Main.class);
        FileInputFormat.addInputPath(job2, new Path(arg0[1] + "/part-r-00000"));
        FileOutputFormat.setOutputPath(job2, new Path(arg0[1] + "/out2"));
        job2.setMapperClass(SortMapper.class);
        job2.setReducerClass(SortReducer.class);
        job2.setInputFormatClass(KeyValueTextInputFormat.class);
        job2.setOutputFormatClass(TextOutputFormat.class);
        // These classes describe the reduce output; they also apply to the map output
        // unless setMapOutputKeyClass/setMapOutputValueClass are called separately.
        job2.setOutputKeyClass(IntWritable.class);
        job2.setOutputValueClass(IntWritable.class);

        ControlledJob controlledJob1 = new ControlledJob(job1.getConfiguration());
        ControlledJob controlledJob2 = new ControlledJob(job2.getConfiguration());
        // Job dependency: job2 starts only after job1 has finished, so it can use job1's output.
        controlledJob2.addDependingJob(controlledJob1);

        JobControl jobControl = new JobControl("JobControlDemoGroup");
        jobControl.addJob(controlledJob1);
        jobControl.addJob(controlledJob2);

        // JobControl is a Runnable; run it in its own thread and poll until all jobs are done.
        Thread jobControlThread = new Thread(jobControl);
        jobControlThread.start();
        while (!jobControl.allFinished()) {
            Thread.sleep(500);
        }
        jobControl.stop();
        return 0;
    }
}
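The link between the two jobs is the text file written by the first one. By default TextOutputFormat writes each record as key, tab, value, and KeyValueTextInputFormat splits each input line at the first tab into a key and a value, both as text. The plain Java sketch below mimics that split; the class TabSeparatedLine is hypothetical and only illustrates the behavior, it is not Hadoop code.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

class TabSeparatedLine {
    // Splits "key<TAB>value" at the first tab. If there is no tab,
    // the whole line becomes the key and the value is empty.
    static Map.Entry<String, String> split(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> record = split("/shuttle/countdown/count.gif\t42");
        // The value arrives as text, so the second mapper has to parse it into a number.
        int hits = Integer.parseInt(record.getValue());
        System.out.println(record.getKey() + " -> " + hits);
    }
}
```

Because the value is delivered as text, a mapper reading this input would need to convert the count back to an integer before emitting it as an IntWritable.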
Finally, we get the data we want, including the frequency distribution. Next, use gnuplot to draw it.
Second, gnuplot
Installing gnuplot is simple: run yum install gnuplot.
Once installed, write the following script:
set terminal png   # assumed: select the PNG terminal so that the output file is an image
set output "freqdist.png"   # output filename
set title "Frequency Distribution of Hits by Url"   # title of the plot
set ylabel "Number of Hits"
set xlabel "Urls (Sorted by Hits)"
plot "~/test/data.txt" using 2 title "Frequency" with linespoints
Problems that may occur:
Could not find/open font when opening font "arial", using internal non-scalable font
Solution:
yum install wqy-zenhei-fonts.noarch   # this installs the font, although it is usually installed already
Enter the gnuplot shell and run set term png font "/usr/share/fonts/wqy-zenhei/wqy-zenhei.ttc" 10 to set the font for PNG images. gnuplot may print
Options are 'nocrop font /usr/share/fonts/wqy-zenhei/wqy-zenhei.ttc 12'
This can be ignored. Run the script again and the picture you want will be generated.
gnuplot can draw not only scatter charts but also histograms, line charts and so on, mainly by changing the plot command. They are not tried one by one here.