Recommended related articles:
Spark Implementation of Hadoop Classic Cases (i): Analyzing the maximum temperature per year from collected meteorological data
Spark Implementation of Hadoop Classic Cases (ii): Data deduplication
Spark Implementation of Hadoop Classic Cases (iii): Data sorting
Spark Implementation of Hadoop Classic Cases (iv): Average score
Spark Implementation of Hadoop Classic Cases (v): Finding the maximum and minimum values
Spark Implementation of Hadoop Classic Cases (vi): Finding the top K values and sorting
Spark Implementation of Hadoop Classic Cases (vii): Log analysis of an unstructured file
1. Requirements: compute URL access statistics from the Tomcat log. The specific log content is shown below.
Requirement: count GET and POST URL accesses separately.
The result format is: access method, URL, number of visits.
Test data set:
196.168.2.1 - - [03/Jul/2014:23:36:38 +0800] "GET /course/detail/3.htm HTTP/1.0" 38435 0.038
182.131.89.195 - - [03/Jul/2014:23:37:43 +0800] "GET /html/notes/20140617/888.html HTTP/1.0" 301 - 0.000
196.168.2.1 - - [03/Jul/2014:23:38:27 +0800] "POST /service/notes/addviewtimes_23.htm HTTP/1.0" 2 0.003
196.168.2.1 - - [03/Jul/2014:23:39:03 +0800] "GET /html/notes/20140617/779.html HTTP/1.0" 69539 0.046
196.168.2.1 - - [03/Jul/2014:23:43:00 +0800] "GET /html/notes/20140318/24.html HTTP/1.0" 67171 0.049
196.168.2.1 - - [03/Jul/2014:23:43:59 +0800] "POST /service/notes/addviewtimes_779.htm HTTP/1.0" 1 0.003
196.168.2.1 - - [03/Jul/2014:23:45:51 +0800] "GET /html/notes/20140617/888.html HTTP/1.0" 200 70044 0.060
196.168.2.1 - - [03/Jul/2014:23:46:17 +0800] "GET /course/list/73.htm HTTP/1.0" 12125 0.010
196.168.2.1 - - [03/Jul/2014:23:46:58 +0800] "GET /html/notes/20140609/542.html HTTP/1.0" 94971 0.077
196.168.2.1 - - [03/Jul/2014:23:48:31 +0800] "POST /service/notes/addviewtimes_24.htm HTTP/1.0" 2 0.003
196.168.2.1 - - [03/Jul/2014:23:48:34 +0800] "POST /service/notes/addviewtimes_542.htm HTTP/1.0" 2 0.003
196.168.2.1 - - [03/Jul/2014:23:49:31 +0800] "GET /notes/index-top-3.htm HTTP/1.0" 200 53494 0.041
196.168.2.1 - - [03/Jul/2014:23:50:55 +0800] "GET /html/notes/20140609/544.html HTTP/1.0" 200 183694 0.076
196.168.2.1 - - [03/Jul/2014:23:53:32 +0800] "POST /service/notes/addviewtimes_544.htm HTTP/1.0" 200 2 0.004
196.168.2.1 - - [03/Jul/2014:23:54:53 +0800] "GET /service/notes/addviewtimes_900.htm HTTP/1.0" 151770 0.054
196.168.2.1 - - [03/Jul/2014:23:57:42 +0800] "GET /html/notes/20140620/872.html HTTP/1.0" 52373 0.034
196.168.2.1 - - [03/Jul/2014:23:58:17 +0800] "POST /service/notes/addviewtimes_900.htm HTTP/1.0" 2 0.003
196.168.2.1 - - [03/Jul/2014:23:58:51 +0800] "GET /html/notes/20140617/888.html HTTP/1.0" 70044 0.057
186.76.76.76 - - [03/Jul/2014:23:48:34 +0800] "POST /service/notes/addviewtimes_542.htm HTTP/1.0" 2 0.003
186.76.76.76 - - [03/Jul/2014:23:46:17 +0800] "GET /course/list/73.htm HTTP/1.0" 200 12125 0.010
8.8.8.8 - - [03/Jul/2014:23:46:58 +0800] "GET /html/notes/20140609/542.html HTTP/1.0" 200 94971 0.077
Because the Tomcat log lines are not all in the same format, the data has to be filtered and cleaned first. For example, the line from 182.131.89.195 carries a 301 status code and a "-" byte count, while several other lines omit the status code entirely, so the code extracts only the request section between the HTTP method (GET/POST) and the "HTTP/1.0" marker instead of parsing fields by position.
2. Hadoop MapReduce implementation:
Map class
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private IntWritable val = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        String tmp = handlerLog(line);
        if (tmp.length() > 0) {
            context.write(new Text(tmp), val);
        }
    }

    // Extracts the "METHOD URL" part from a line such as:
    // 127.0.0.1 - - [03/Jul/2014:23:36:38 +0800] "GET /course/detail/3.htm HTTP/1.0" 200 38435 0.038
    private String handlerLog(String line) {
        String result = "";
        try {
            if (line.length() > 20) {
                if (line.indexOf("GET") > 0) {
                    result = line.substring(line.indexOf("GET"), line.indexOf("HTTP/1.0")).trim();
                } else if (line.indexOf("POST") > 0) {
                    result = line.substring(line.indexOf("POST"), line.indexOf("HTTP/1.0")).trim();
                }
            }
        } catch (Exception e) {
            System.out.println(line); // print and skip lines that cannot be parsed
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - [03/Jul/2014:23:36:38 +0800] \"GET /course/detail/3.htm HTTP/1.0\" 200 38435 0.038";
        System.out.println(new LogMapper().handlerLog(line));
    }
}
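Running the main method is a quick local check of the extraction logic: for the sample line it should print GET /course/detail/3.htm, which is exactly the key the mapper emits for that record.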
Reduce class
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the occurrences of each "METHOD URL" key
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
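For example, the key GET /course/list/73.htm reaches the reducer with the values [1, 1] (one from each of the two matching log lines), so the reducer writes 2 for it.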
Start class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobMain {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = new Job(configuration, "log_job");
        job.setJarByClass(JobMain.class);

        job.setMapperClass(LogMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setReducerClass(LogReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Delete the output directory if it already exists, so the job can be rerun
        Path path = new Path(args[1]);
        FileSystem fs = FileSystem.get(configuration);
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
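Assuming the three classes are packaged into a jar (the jar name and the output path below are only placeholders), the job can be submitted with something like: hadoop jar log-analysis.jar JobMain /spark/seven.txt /spark/out. Because the driver deletes an existing output directory before starting, the job can be rerun without manually cleaning up the output path.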
3. Spark implementation (Scala version)
// textFile() loads the data (assumes the spark-shell, where sc, the SparkContext, is predefined)
val data = sc.textFile("/spark/seven.txt")

// Filter out empty lines and lines that contain neither GET nor POST
val filtered = data.filter(_.length() > 0)
                   .filter(line => (line.indexOf("GET") > 0 || line.indexOf("POST") > 0))

// Convert each line to a (method + URL, 1) pair, then sum the counts with reduceByKey
val res = filtered.map(line => {
  if (line.indexOf("GET") > 0) {
    // Extract the GET request: from "GET" up to "HTTP/1.0"
    (line.substring(line.indexOf("GET"), line.indexOf("HTTP/1.0")).trim, 1)
  } else {
    // Extract the POST request: from "POST" up to "HTTP/1.0"
    (line.substring(line.indexOf("POST"), line.indexOf("HTTP/1.0")).trim, 1)
  }
}).reduceByKey(_ + _)

// Trigger an action to execute the job
res.collect()
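res.collect() returns the result to the driver as an Array[(String, Int)], which is fine here because the number of distinct URLs is small; for large result sets, writing the RDD out with saveAsTextFile would be the safer choice.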
Scala's functional style keeps the code concise and elegant, and JDK 1.8 introduces similar features (lambda expressions and the Stream API).
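As a rough sketch of that remark (this example is not from the original article; the local file path is a placeholder, and the snippet assumes the log lines look like the cleaned sample data above), the same counting could be written with Java 8 streams:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LogStreamDemo {
    public static void main(String[] args) throws Exception {
        // Keep only lines with a GET or POST request, map each to its
        // "METHOD URL" part, and count occurrences per key.
        try (Stream<String> lines = Files.lines(Paths.get("/tmp/seven.txt"))) {
            Map<String, Long> counts = lines
                    .filter(line -> line.indexOf("GET") > 0 || line.indexOf("POST") > 0)
                    .map(line -> {
                        int start = line.indexOf("GET") > 0 ? line.indexOf("GET") : line.indexOf("POST");
                        return line.substring(start, line.indexOf("HTTP/1.0")).trim();
                    })
                    .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
            counts.forEach((url, count) -> System.out.println("(" + url + "," + count + ")"));
        }
    }
}

Like the Spark version, this groups the extracted request keys and counts them, but it runs on a single machine rather than on a cluster.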
For comparison, the output is consistent with the MapReduce result:
(POST /service/notes/addviewtimes_779.htm,1),
(GET /service/notes/addviewtimes_900.htm,1),
(POST /service/notes/addviewtimes_900.htm,1),
(GET /notes/index-top-3.htm,1),
(GET /html/notes/20140318/24.html,1),
(GET /html/notes/20140609/544.html,1),
(POST /service/notes/addviewtimes_542.htm,2),
(POST /service/notes/addviewtimes_544.htm,1),
(GET /html/notes/20140609/542.html,2),
(POST /service/notes/addviewtimes_23.htm,1),
(GET /html/notes/20140617/888.html,3),
(POST /service/notes/addviewtimes_24.htm,1),
(GET /course/detail/3.htm,1),
(GET /course/list/73.htm,2),
(GET /html/notes/20140617/779.html,1),
(GET /html/notes/20140620/872.html,1)