MapReduce: Simple Frequency Analysis of NASA Log Data

Source: Internet
Author: User

Environment:
Hadoop 1.x, CentOS 6.5, a distributed environment simulated with three virtual machines, gnuplot

Data: http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html

Program objectives:

The provided log data consists of simple HTTP file requests. Each line is one record in the format shown below:

205.189.154.54 - - [01/Jul/1995:00:00:29 -0400] "GET /shuttle/countdown/count.gif HTTP/1.0" 200 40310

The goal is to calculate the number of accesses per file, and the frequency distribution of those access counts.
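Before writing the MapReduce code, it helps to see how a single record can be parsed. Below is a minimal standalone sketch; the regular expression is my own assumption rather than code from the original post, and it simply pulls the requested file out of the quoted request field.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineDemo {
    // Rough pattern for the request field: "METHOD path HTTP/x.y"
    private static final Pattern REQUEST =
            Pattern.compile("\"\\w+\\s+(\\S+)\\s+HTTP/[\\d.]+\"");

    public static void main(String[] args) {
        String line = "205.189.154.54 - - [01/Jul/1995:00:00:29 -0400] "
                + "\"GET /shuttle/countdown/count.gif HTTP/1.0\" 200 40310";
        Matcher m = REQUEST.matcher(line);
        if (m.find()) {
            System.out.println(m.group(1)); // prints /shuttle/countdown/count.gif
        }
    }
}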

Approach:
This goal is actually quite easy to achieve. The main piece of knowledge involved is job dependency. The solution uses two MapReduce jobs: the first calculates the number of accesses per file, the second counts how often each access count occurs, and finally the distribution graph is drawn with the gnuplot tool.

First, the MapReduce program

In this program the Mapper and Reducer classes are very simple, so the original post does not show them (a possible sketch of them follows); only the framework of the main (driver) program is given.
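Since those classes are omitted, here is a minimal sketch of what GroupMapper and GroupReducer might look like, reusing the request pattern from the parsing sketch above. Apart from the class names referenced by the driver below, the details are my assumptions, not the author's code.

// GroupMapper.java
package ren.snail;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (requested file, 1) for every log line whose request field parses
public class GroupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Pattern REQUEST =
            Pattern.compile("\"\\w+\\s+(\\S+)\\s+HTTP/[\\d.]+\"");
    private static final IntWritable ONE = new IntWritable(1);
    private final Text file = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Matcher m = REQUEST.matcher(value.toString());
        if (m.find()) {
            file.set(m.group(1));
            context.write(file, ONE);
        }
    }
}

// GroupReducer.java
package ren.snail;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the 1s for each file, producing (file, number of hits)
public class GroupReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}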

  

package ren.snail;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Main extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int result = ToolRunner.run(new Configuration(), new Main(), args);
        System.exit(result);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration configuration = getConf();

        // Job 1: count the number of hits per file
        Job job1 = new Job(configuration, "GroupBy");
        job1.setJarByClass(Main.class);
        FileInputFormat.addInputPath(job1, new Path(arg0[0]));
        FileOutputFormat.setOutputPath(job1, new Path(arg0[1]));
        job1.setMapperClass(GroupMapper.class);
        job1.setReducerClass(GroupReducer.class);
        job1.setOutputFormatClass(TextOutputFormat.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);

        // Job 2: count how many files share each hit count
        Job job2 = new Job(configuration, "sort");
        job2.setJarByClass(Main.class);
        FileInputFormat.addInputPath(job2, new Path(arg0[1] + "/part-r-00000"));
        FileOutputFormat.setOutputPath(job2, new Path(arg0[1] + "/out2"));
        job2.setMapperClass(SortMapper.class);
        job2.setReducerClass(SortReducer.class);
        job2.setInputFormatClass(KeyValueTextInputFormat.class);
        job2.setOutputFormatClass(TextOutputFormat.class);
        // These types apply to the map output fed to the reducer (and, since
        // setMapOutputKeyClass is not called, to the reduce output as well)
        job2.setOutputKeyClass(IntWritable.class);
        job2.setOutputValueClass(IntWritable.class);

        ControlledJob controlledJob1 = new ControlledJob(job1.getConfiguration());
        ControlledJob controlledJob2 = new ControlledJob(job2.getConfiguration());
        // Job dependency: job2 consumes the data produced by job1
        controlledJob2.addDependingJob(controlledJob1);

        JobControl jobControl = new JobControl("JobControlDemoGroup");
        jobControl.addJob(controlledJob1);
        jobControl.addJob(controlledJob2);

        Thread jobControlThread = new Thread(jobControl);
        jobControlThread.start();
        while (!jobControl.allFinished()) {
            Thread.sleep(500);
        }
        jobControl.stop();
        return 0;
    }
}
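The second job's classes are also omitted in the original. Because the driver configures KeyValueTextInputFormat, each line of job1's part-r-00000 output arrives already split at the tab into (file, hits). A possible sketch of SortMapper and SortReducer (again an assumption, not the author's code) flips the pair around and counts how many files share each hit count:

// SortMapper.java
package ren.snail;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input is (file, hits) from job1; emit (hits, 1) so the shuffle groups by hit count
public class SortMapper extends Mapper<Text, Text, IntWritable, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final IntWritable hits = new IntWritable();

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        hits.set(Integer.parseInt(value.toString().trim()));
        context.write(hits, ONE);
    }
}

// SortReducer.java
package ren.snail;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// For each hit count, sum how many files had exactly that many hits
public class SortReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    private final IntWritable files = new IntWritable();

    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        files.set(sum);
        context.write(key, files);
    }
}

Since the shuffle sorts the IntWritable keys, job2's output comes out ordered by hit count, which is convenient for plotting later.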

This gives us the data we want: the per-file access counts and the frequency distribution. Next, gnuplot is used to draw the graph.

Second, gnuplot

Installing gnuplot is simple; it can be installed with yum install gnuplot.

Once installed, write the following code:

"Freqdist.png"     // output filename  "Frequnecy distribution of Hits by Url";  // drawing the image name set ylabel "Number of Hits""Urls (Sorted by Hits)""~/test/data.txt" using 2 title "Freq Uency "with Linespoints

A problem that may occur:

Could not find/open font when opening font "Arial", using internal non-scalable font

Solution:

yum install wqy-zenhei-fonts.noarch  # this installs the font, though it is usually already installed

Enter the gnuplot shell and input set term png font "/usr/share/fonts/wqy-zenhei/wqy-zenhei.ttc" 10 to set the font used for PNG images. gnuplot may print

Options are 'nocrop font /usr/share/fonts/wqy-zenhei/wqy-zenhei.ttc 12'

Ignore it; when the program runs, the picture you want is in fact generated.

gnuplot can draw not only scatter plots but also histograms, line charts, and so on; this mostly comes down to changing the plot command, and the variations are not experimented with here one by one.
