Related articles:
Hadoop classic cases implemented in Spark (i): finding the highest temperature of each year from collected weather data
Hadoop classic cases implemented in Spark (ii): data deduplication
Hadoop classic cases implemented in Spark (iii): data sorting
Hadoop classic cases implemented in Spark (iv): average score
Hadoop classic cases implemented in Spark (v): maximum and minimum values
Hadoop classic cases implemented in Spark (vi): finding the top K values and sorting them
Hadoop classic cases implemented in Spark (vii): log analysis of unstructured files
Hadoop Classic case Spark implementation (vii)--Log analysis: Analyzing unstructured files
1. Requirement: analyze URL traffic from Tomcat access logs. Specifically, count the number of visits per URL, with GET and POST requests counted separately.
The result format is: access method and URL, number of visits.
Test data set:
196.168.2.1 - - [03/Jul/2014:23:36:38 +0800] "GET /course/detail/3.htm HTTP/1.0" 200 38435 0.038
182.131.89.195 - - [03/Jul/2014:23:37:43 +0800] "GET /html/notes/20140617/888.html HTTP/1.0" 301 - 0.000
196.168.2.1 - - [03/Jul/2014:23:38:27 +0800] "POST /service/notes/addviewtimes_23.htm HTTP/1.0" 200 2 0.003
196.168.2.1 - - [03/Jul/2014:23:39:03 +0800] "GET /html/notes/20140617/779.html HTTP/1.0" 200 69539 0.046
196.168.2.1 - - [03/Jul/2014:23:43:00 +0800] "GET /html/notes/20140318/24.html HTTP/1.0" 200 67171 0.049
196.168.2.1 - - [03/Jul/2014:23:43:59 +0800] "POST /service/notes/addviewtimes_779.htm HTTP/1.0" 200 1 0.003
196.168.2.1 - - [03/Jul/2014:23:45:51 +0800] "GET /html/notes/20140617/888.html HTTP/1.0" 200 70044 0.060
196.168.2.1 - - [03/Jul/2014:23:46:17 +0800] "GET /course/list/73.htm HTTP/1.0" 200 12125 0.010
196.168.2.1 - - [03/Jul/2014:23:46:58 +0800] "GET /html/notes/20140609/542.html HTTP/1.0" 200 94971 0.077
196.168.2.1 - - [03/Jul/2014:23:48:31 +0800] "POST /service/notes/addviewtimes_24.htm HTTP/1.0" 200 2 0.003
196.168.2.1 - - [03/Jul/2014:23:48:34 +0800] "POST /service/notes/addviewtimes_542.htm HTTP/1.0" 200 2 0.003
196.168.2.1 - - [03/Jul/2014:23:49:31 +0800] "GET /notes/index-top-3.htm HTTP/1.0" 200 53494 0.041
196.168.2.1 - - [03/Jul/2014:23:50:55 +0800] "GET /html/notes/20140609/544.html HTTP/1.0" 200 183694 0.076
196.168.2.1 - - [03/Jul/2014:23:53:32 +0800] "POST /service/notes/addviewtimes_544.htm HTTP/1.0" 200 2 0.004
196.168.2.1 - - [03/Jul/2014:23:54:53 +0800] "GET /service/notes/addviewtimes_900.htm HTTP/1.0" 200 151770 0.054
196.168.2.1 - - [03/Jul/2014:23:57:42 +0800] "GET /html/notes/20140620/872.html HTTP/1.0" 200 52373 0.034
196.168.2.1 - - [03/Jul/2014:23:58:17 +0800] "POST /service/notes/addviewtimes_900.htm HTTP/1.0" 200 2 0.003
196.168.2.1 - - [03/Jul/2014:23:58:51 +0800] "GET /html/notes/20140617/888.html HTTP/1.0" 200 70044 0.057
186.76.76.76 - - [03/Jul/2014:23:48:34 +0800] "POST /service/notes/addviewtimes_542.htm HTTP/1.0" 200 2 0.003
186.76.76.76 - - [03/Jul/2014:23:46:17 +0800] "GET /course/list/73.htm HTTP/1.0" 200 12125 0.010
8.8.8.8 - - [03/Jul/2014:23:46: +0800] "GET /html/notes/20140609/542.html HTTP/1.0" 200 94971 0.077
Since the Tomcat logs are unstructured, the data has to be cleaned and filtered first.
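The cleaning step boils down to extracting the request method and URL from each raw line. Below is a minimal Scala sketch of that idea, run on the first sample line above; the object name ExtractDemo is made up for illustration, and the handlerLog method in the Mapper below and the Spark code in section 3 use the same substring approach.

// a minimal, illustrative sketch of the cleaning idea: pull "METHOD URL" out of one raw log line
object ExtractDemo {

  def extract(line: String): String = {
    if (line.indexOf("GET") > 0)
      line.substring(line.indexOf("GET"), line.indexOf("HTTP/1.0")).trim
    else if (line.indexOf("POST") > 0)
      line.substring(line.indexOf("POST"), line.indexOf("HTTP/1.0")).trim
    else
      "" // lines without GET/POST are dropped by the later filter step
  }

  def main(args: Array[String]): Unit = {
    val sample = "196.168.2.1 - - [03/Jul/2014:23:36:38 +0800] \"GET /course/detail/3.htm HTTP/1.0\" 200 38435 0.038"
    println(extract(sample)) // prints: GET /course/detail/3.htm
  }
}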
2. The Hadoop MapReduce implementation
Map class
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private IntWritable val = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        String tmp = handlerLog(line);
        if (tmp.length() > 0) {
            context.write(new Text(tmp), val);
        }
    }

    // e.g. 127.0.0.1 - - [03/Jul/2014:23:36:38 +0800] "GET /course/detail/3.htm HTTP/1.0" 200 38435 0.038
    private String handlerLog(String line) {
        String result = "";
        try {
            if (line.length() > 20) {
                if (line.indexOf("GET") > 0) {
                    result = line.substring(line.indexOf("GET"), line.indexOf("HTTP/1.0")).trim();
                } else if (line.indexOf("POST") > 0) {
                    result = line.substring(line.indexOf("POST"), line.indexOf("HTTP/1.0")).trim();
                }
            }
        } catch (Exception e) {
            System.out.println(line);
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - [03/Jul/2014:23:36:38 +0800] \"GET /course/detail/3.htm HTTP/1.0\" 200 38435 0.038";
        System.out.println(new LogMapper().handlerLog(line));
    }
}
Reduce class
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Start class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobMain {

    /**
     * @param args args[0] is the input path of the log files, args[1] is the output path
     */
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = new Job(configuration, "log_job");
        job.setJarByClass(JobMain.class);

        job.setMapperClass(LogMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setReducerClass(LogReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // delete the output path if it already exists so the job can be rerun
        Path path = new Path(args[1]);
        FileSystem fs = FileSystem.get(configuration);
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
3. The Spark implementation in Scala
// textFile() loads the data
val data = sc.textFile("/spark/seven.txt")
// filter out empty lines and lines that contain neither GET nor POST
val filtered = data.filter(_.length() > 0)
                   .filter(line => line.indexOf("GET") > 0 || line.indexOf("POST") > 0)
// map each line to a (URL, 1) key-value pair
val res = filtered.map(line => {
  if (line.indexOf("GET") > 0) {
    // extract the "GET <url>" part of the line
    (line.substring(line.indexOf("GET"), line.indexOf("HTTP/1.0")).trim, 1)
  } else {
    // extract the "POST <url>" part of the line
    (line.substring(line.indexOf("POST"), line.indexOf("HTTP/1.0")).trim, 1)
  }
  // finally sum the counts per key with reduceByKey
}).reduceByKey(_ + _)
// trigger the action
res.collect()
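The snippet above assumes an interactive spark-shell session where sc already exists. Packaged as a standalone application, the same pipeline might look roughly like the following sketch; the object name LogAnalysis, the app name, and the use of args for the input and output paths are illustrative assumptions, not taken from the original post.

import org.apache.spark.{SparkConf, SparkContext}

// a minimal standalone sketch of the same pipeline; names and paths are illustrative
object LogAnalysis {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("log_job")
    val sc = new SparkContext(conf)

    val data = sc.textFile(args(0))        // e.g. /spark/seven.txt
    val res = data.filter(_.length > 0)
      .filter(line => line.indexOf("GET") > 0 || line.indexOf("POST") > 0)
      .map { line =>
        if (line.indexOf("GET") > 0)
          (line.substring(line.indexOf("GET"), line.indexOf("HTTP/1.0")).trim, 1)
        else
          (line.substring(line.indexOf("POST"), line.indexOf("HTTP/1.0")).trim, 1)
      }
      .reduceByKey(_ + _)

    res.saveAsTextFile(args(1))            // write the (URL, count) pairs out
    sc.stop()
  }
}

Submitted with spark-submit, args(0) and args(1) play the same role as args[0] and args[1] in JobMain above.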
Scala's functional style keeps the code short and readable; JDK 1.8 introduced similar features (lambdas and the Stream API) to Java.
For comparison, the output is consistent with the MapReduce result:
(POST /service/notes/addviewtimes_779.htm,1), (GET /service/notes/addviewtimes_900.htm,1), (POST /service/notes/addviewtimes_900.htm,1), (GET /notes/index-top-3.htm,1), (GET /html/notes/20140318/24.html,1), (GET /html/notes/20140609/544.html,1), (POST /service/notes/addviewtimes_542.htm,2), (POST /service/notes/addviewtimes_544.htm,1), (GET /html/notes/20140609/542.html,2), (POST /service/notes/addviewtimes_23.htm,1), (GET /html/notes/20140617/888.html,3), (POST /service/notes/addviewtimes_24.htm,1), (GET /course/detail/3.htm,1), (GET /course/list/73.htm,2), (GET /html/notes/20140617/779.html,1), (GET /html/notes/20140620/872.html,1)