MapReduce: a simple frequency analysis of NASA web log data

Last Update:2016-04-16 Source: Internet

Author: User

Environment:
Hadoop 1.x, CentOS 6.5 (a simulated distributed environment built from three virtual machines), and gnuplot.

Data: http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html

Program objectives:

The provided web log data is a plain record of file requests, for example:

205.189.154.54 - - [01/Jul/1995:00:00:29 -0400] "GET /shuttle/countdown/count.gif HTTP/1.0" 200 40310

Each line is a record in the format shown above. The goal is to count the number of accesses per file, and then to compute the frequency distribution of those access counts.
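
As a minimal sketch of how the requested file could be pulled out of one log line, the following plain Java class uses a regular expression. The class name LogLineParser, the method extractPath, and the pattern itself are assumptions made for illustration; the article does not show the parsing code it actually used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
    // Hypothetical pattern: captures the path that follows the HTTP method
    // inside the quoted request; the protocol part is optional.
    private static final Pattern REQUEST =
            Pattern.compile("\"[A-Z]+\\s+(\\S+)(?:\\s+HTTP/[0-9.]+)?\\s*\"");

    /** Returns the requested path, or null when the line has no parsable request. */
    public static String extractPath(String line) {
        Matcher m = REQUEST.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "205.189.154.54 - - [01/Jul/1995:00:00:29 -0400] "
                + "\"GET /shuttle/countdown/count.gif HTTP/1.0\" 200 40310";
        System.out.println(extractPath(line)); // prints /shuttle/countdown/count.gif
    }
}
```

In a mapper, a line for which extractPath returns null would simply be skipped.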

Idea:
This goal is easy to achieve. The main concept involved is job dependency. The solution uses two MapReduce jobs: the first counts the number of accesses per file, the second counts how often each access count occurs, and finally the distribution is plotted with gnuplot.
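
The two stages can be sketched in plain Java with in-memory maps standing in for the two jobs. This is only an illustration of the logic, not the Hadoop code; the class name TwoStageCount, its method names, and the sample paths are made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TwoStageCount {
    /** Stage 1: number of accesses per file (the role of the first job). */
    public static Map<String, Integer> hitsPerFile(List<String> paths) {
        Map<String, Integer> hits = new TreeMap<>();
        for (String p : paths) {
            hits.merge(p, 1, Integer::sum);
        }
        return hits;
    }

    /** Stage 2: how many files were accessed exactly n times (the role of the second job). */
    public static Map<Integer, Integer> frequencyOfHits(Map<String, Integer> hits) {
        Map<Integer, Integer> freq = new TreeMap<>();
        for (int n : hits.values()) {
            freq.merge(n, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // Made-up request paths, one per log line.
        List<String> paths = Arrays.asList("/a.gif", "/b.html", "/a.gif", "/c.gif");
        Map<String, Integer> hits = hitsPerFile(paths);
        System.out.println(hits);                  // {/a.gif=2, /b.html=1, /c.gif=1}
        System.out.println(frequencyOfHits(hits)); // {1=2, 2=1}
    }
}
```

The second stage consumes only the output of the first, which is why the second job has to wait for the first one to finish.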

First, the MapReduce program

The mapper and reducer classes in this program are very simple, so they are not listed here. What follows is essentially the framework of the main driver program.

package ren.snail;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
// JobControl is taken from the same package as ControlledJob so the two work together.
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// GroupMapper, GroupReducer, SortMapper and SortReducer are the mapper and
// reducer classes that the article does not list.
public class Main extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int result = ToolRunner.run(new Configuration(), new Main(), args);
        System.exit(result);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration configuration = getConf();

        // Job 1: count the number of accesses per file.
        Job job1 = new Job(configuration, "groupby");
        job1.setJarByClass(Main.class);
        FileInputFormat.addInputPath(job1, new Path(arg0[0]));
        FileOutputFormat.setOutputPath(job1, new Path(arg0[1]));
        job1.setMapperClass(GroupMapper.class);
        job1.setReducerClass(GroupReducer.class);
        job1.setOutputFormatClass(TextOutputFormat.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);

        // Job 2: read the output of job 1 and count the frequencies.
        Job job2 = new Job(configuration, "sort");
        job2.setJarByClass(Main.class);
        FileInputFormat.addInputPath(job2, new Path(arg0[1] + "/part-r-00000"));
        FileOutputFormat.setOutputPath(job2, new Path(arg0[1] + "/out2"));
        job2.setMapperClass(SortMapper.class);
        job2.setReducerClass(SortReducer.class);
        job2.setInputFormatClass(KeyValueTextInputFormat.class);
        job2.setOutputFormatClass(TextOutputFormat.class);
        // These classes describe the reduce output; they also apply to the map
        // output unless setMapOutputKeyClass/setMapOutputValueClass are called.
        job2.setOutputKeyClass(IntWritable.class);
        job2.setOutputValueClass(IntWritable.class);

        ControlledJob controlledJob1 = new ControlledJob(job1.getConfiguration());
        ControlledJob controlledJob2 = new ControlledJob(job2.getConfiguration());
        // Job dependency: job 2 starts only after job 1 has produced its data.
        controlledJob2.addDependingJob(controlledJob1);

        JobControl jobControl = new JobControl("JobControlDemoGroup");
        jobControl.addJob(controlledJob1);
        jobControl.addJob(controlledJob2);

        // JobControl is a Runnable; run it in its own thread and poll until done.
        Thread jobControlThread = new Thread(jobControl);
        jobControlThread.start();
        while (!jobControl.allFinished()) {
            Thread.sleep(500);
        }
        jobControl.stop();
        return 0;
    }
}

Finally, we get the data we want, along with the frequency distribution. Next, we use gnuplot to plot it.

Second, gnuplot

Installing gnuplot is simple; it can be installed with yum install gnuplot.

Once installed, write the following script:

set terminal png                                         # render to a PNG file
set output "freqdist.png"                                # output filename
set title "Frequency Distribution of Hits by Url"        # title of the plot
set ylabel "Number of Hits"
set xlabel "Urls (Sorted by Hits)"
plot "~/test/data.txt" using 2 title "Frequency" with linespoints

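The article does not show how ~/test/data.txt was produced. As a hedged sketch, assume it holds one path and its hit count per line, sorted by hits, so that gnuplot can read the count from column 2. The class PlotDataPrep and its method are hypothetical; the tab-separated input mirrors the default key/value separator of TextOutputFormat.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PlotDataPrep {
    /**
     * Takes "path<TAB>hits" lines and returns "path hits" lines sorted by
     * hits, highest first. Lines without exactly two fields are skipped;
     * the second field is assumed to be an integer.
     */
    public static List<String> sortByHits(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length == 2) {
                rows.add(parts);
            }
        }
        rows.sort((a, b) -> Integer.compare(
                Integer.parseInt(b[1].trim()), Integer.parseInt(a[1].trim())));
        List<String> out = new ArrayList<>();
        for (String[] r : rows) {
            out.add(r[0] + " " + r[1].trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample of the first job's output.
        List<String> sample = Arrays.asList("/a.gif\t2", "/b.html\t5", "/c.gif\t1");
        for (String row : sortByHits(sample)) {
            System.out.println(row);
        }
    }
}
```

Writing the returned lines to a file gives a two-column data file of the kind the plot command above reads with using 2.
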
A problem that may occur:

Could not find/open font when opening font "Arial", using internal non-scalable font

Solution:

yum install wqy-zenhei-fonts.noarch    # installs the font, although it is usually already installed

Enter the gnuplot shell and run the following command to set the font for PNG images:

set term png font "/usr/share/fonts/wqy-zenhei/wqy-zenhei.ttc" 10

gnuplot may print something like:

Options are 'nocrop font /usr/share/fonts/wqy-zenhei/wqy-zenhei.ttc 12'

Ignore this and run the script again; the image you want will in fact be generated.

gnuplot can draw not only scatter plots but also histograms, line charts, and so on. This mainly comes down to changing the plot command, so they are not tried one by one here.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
