Hadoop Classic Case, Spark Implementation (vii): Log Analysis of Unstructured Files


Related articles in this series:

Hadoop Classic Case, Spark Implementation (i): Analyzing the Highest Temperature per Year from Collected Meteorological Data
Hadoop Classic Case, Spark Implementation (ii): Data Deduplication
Hadoop Classic Case, Spark Implementation (iii): Data Sorting
Hadoop Classic Case, Spark Implementation (iv): Average Score
Hadoop Classic Case, Spark Implementation (v): Maximum and Minimum Values
Hadoop Classic Case, Spark Implementation (vi): Finding the Largest K Values and Sorting
Hadoop Classic Case, Spark Implementation (vii): Log Analysis of Unstructured Files



1. Requirement: compute URL access statistics from a Tomcat access log (sample below).
Specifically: count GET and POST requests to each URL separately.
Each result record takes the form: access method, URL, number of visits.

Test data set:

196.168.2.1 - - [03/Jul/2014:23:36:38 +0800] "GET /course/detail/3.htm HTTP/1.0" 200 38435 0.038
182.131.89.195 - - [03/Jul/2014:23:37:43 +0800] "GET /html/notes/20140617/888.html HTTP/1.0" 301 - 0.000
196.168.2.1 - - [03/Jul/2014:23:38:27 +0800] "POST /service/notes/addViewTimes_23.htm HTTP/1.0" 200 2 0.003
196.168.2.1 - - [03/Jul/2014:23:39:03 +0800] "GET /html/notes/20140617/779.html HTTP/1.0" 200 69539 0.046
196.168.2.1 - - [03/Jul/2014:23:43:00 +0800] "GET /html/notes/20140318/24.html HTTP/1.0" 200 67171 0.049
196.168.2.1 - - [03/Jul/2014:23:43:59 +0800] "POST /service/notes/addViewTimes_779.htm HTTP/1.0" 200 1 0.003
196.168.2.1 - - [03/Jul/2014:23:45:51 +0800] "GET /html/notes/20140617/888.html HTTP/1.0" 200 70044 0.060
196.168.2.1 - - [03/Jul/2014:23:46:17 +0800] "GET /course/list/73.htm HTTP/1.0" 200 12125 0.010
196.168.2.1 - - [03/Jul/2014:23:46:58 +0800] "GET /html/notes/20140609/542.html HTTP/1.0" 200 94971 0.077
196.168.2.1 - - [03/Jul/2014:23:48:31 +0800] "POST /service/notes/addViewTimes_24.htm HTTP/1.0" 200 2 0.003
196.168.2.1 - - [03/Jul/2014:23:48:34 +0800] "POST /service/notes/addViewTimes_542.htm HTTP/1.0" 200 2 0.003
196.168.2.1 - - [03/Jul/2014:23:49:31 +0800] "GET /notes/index-top-3.htm HTTP/1.0" 200 53494 0.041
196.168.2.1 - - [03/Jul/2014:23:50:55 +0800] "GET /html/notes/20140609/544.html HTTP/1.0" 200 183694 0.076
196.168.2.1 - - [03/Jul/2014:23:53:32 +0800] "POST /service/notes/addViewTimes_544.htm HTTP/1.0" 200 2 0.004
196.168.2.1 - - [03/Jul/2014:23:54:53 +0800] "GET /service/notes/addViewTimes_900.htm HTTP/1.0" 200 151770 0.054
196.168.2.1 - - [03/Jul/2014:23:57:42 +0800] "GET /html/notes/20140620/872.html HTTP/1.0" 200 52373 0.034
196.168.2.1 - - [03/Jul/2014:23:58:17 +0800] "POST /service/notes/addViewTimes_900.htm HTTP/1.0" 200 2 0.003
196.168.2.1 - - [03/Jul/2014:23:58:51 +0800] "GET /html/notes/20140617/888.html HTTP/1.0" 200 70044 0.057
186.76.76.76 - - [03/Jul/2014:23:48:34 +0800] "POST /service/notes/addViewTimes_542.htm HTTP/1.0" 200 2 0.003
186.76.76.76 - - [03/Jul/2014:23:46:17 +0800] "GET /course/list/73.htm HTTP/1.0" 200 12125 0.010
8.8.8.8 - - [03/Jul/2014:23:46:58 +0800] "GET /html/notes/20140609/542.html HTTP/1.0" 200 94971 0.077


Each line follows the common Tomcat access-log layout: client IP, two placeholder fields, a timestamp, the quoted request ("METHOD URL protocol"), the status code, the response size in bytes, and the response time. Because the log lines are irregular (fields can be missing or malformed), the data must be cleaned and filtered before it can be analyzed.


2. The Hadoop MapReduce implementation:

Map class

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private IntWritable val = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        String tmp = handlerLog(line);
        if (tmp.length() > 0) {
            context.write(new Text(tmp), val);
        }
    }

    // Example line: 127.0.0.1 - - [03/Jul/2014:23:36:38 +0800] "GET /course/detail/3.htm HTTP/1.0" 200 38435 0.038
    private String handlerLog(String line) {
        String result = "";
        try {
            if (line.length() > 20) {
                if (line.indexOf("GET") > 0) {
                    result = line.substring(line.indexOf("GET"), line.indexOf("HTTP/1.0")).trim();
                } else if (line.indexOf("POST") > 0) {
                    result = line.substring(line.indexOf("POST"), line.indexOf("HTTP/1.0")).trim();
                }
            }
        } catch (Exception e) {
            System.out.println(line);
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - [03/Jul/2014:23:36:38 +0800] \"GET /course/detail/3.htm HTTP/1.0\" 200 38435 0.038";
        System.out.println(new LogMapper().handlerLog(line));
    }
}
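Note that the emitted key is the method together with the URL, e.g. GET /course/detail/3.htm, so GET and POST hits on the same URL are counted separately, exactly as the requirement asks. The substring extraction relies on both markers ("GET"/"POST" and "HTTP/1.0") being present; the try/catch simply prints any line that does not fit, and the main() method gives a quick local test of the extraction.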

Reduce class

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Driver (start) class

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobMain {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = new Job(configuration, "log_job");
        job.setJarByClass(JobMain.class);

        job.setMapperClass(LogMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setReducerClass(LogReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        Path path = new Path(args[1]);
        FileSystem fs = FileSystem.get(configuration);
        if (fs.exists(path)) {
            // delete an existing output directory so the job can rerun
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
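A minimal way to run the job, assuming the three classes above are packaged into a jar (the jar name and HDFS paths below are hypothetical placeholders, not from the original article):

hadoop jar logjob.jar JobMain /input/tomcat_logs /output/url_counts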


3. The Spark implementation in Scala:

// textFile() loads the data
val data = sc.textFile("/spark/seven.txt")
// keep non-empty lines that contain a GET or POST request
val filtered = data.filter(_.length() > 0).filter(line => (line.indexOf("GET") > 0 || line.indexOf("POST") > 0))
// map each line to a ("METHOD URL", 1) key-value pair
val res = filtered.map(line => {
  if (line.indexOf("GET") > 0) {
    // extract the "GET url" portion of the request
    (line.substring(line.indexOf("GET"), line.indexOf("HTTP/1.0")).trim, 1)
  } else {
    // extract the "POST url" portion of the request
    (line.substring(line.indexOf("POST"), line.indexOf("HTTP/1.0")).trim, 1)
  }
// finally sum the counts per key with reduceByKey
}).reduceByKey(_ + _)
// collect() is an action and triggers execution
res.collect()
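The snippet above is written for the interactive spark-shell, where sc (the SparkContext) is already defined; res.collect() returns the (method URL, count) pairs to the driver as an array. The indexOf-based extraction mirrors the MapReduce code, but it can produce odd keys if a marker is missing. As a sketch of a sturdier variant (my own illustration, not the original article's code; only the input path /spark/seven.txt is taken from the article), the request can be pulled out with a regular expression:

// Hedged sketch: regex-based extraction of "METHOD URL" from each log line
val data = sc.textFile("/spark/seven.txt")
// match the request section: METHOD URL HTTP/1.0
val request = """(GET|POST)\s+(\S+)\s+HTTP/1\.0""".r
val counts = data.flatMap { line =>
  // findFirstMatchIn returns None for malformed lines, which flatMap drops
  request.findFirstMatchIn(line).map(m => (m.group(1) + " " + m.group(2), 1))
}.reduceByKey(_ + _)
counts.collect().foreach(println)

Because non-matching lines become None and are dropped by flatMap, no separate filter pass is needed.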


Scala's functional style keeps the code concise and elegant; JDK 1.8 introduced similar features (lambda expressions and the Stream API).




For comparison, the Spark output is consistent with the MapReduce result:

(POST /service/notes/addViewTimes_779.htm,1)
(GET /service/notes/addViewTimes_900.htm,1)
(POST /service/notes/addViewTimes_900.htm,1)
(GET /notes/index-top-3.htm,1)
(GET /html/notes/20140318/24.html,1)
(GET /html/notes/20140609/544.html,1)
(POST /service/notes/addViewTimes_542.htm,2)
(POST /service/notes/addViewTimes_544.htm,1)
(GET /html/notes/20140609/542.html,2)
(POST /service/notes/addViewTimes_23.htm,1)
(GET /html/notes/20140617/888.html,3)
(POST /service/notes/addViewTimes_24.htm,1)
(GET /course/detail/3.htm,1)
(GET /course/list/73.htm,2)
(GET /html/notes/20140617/779.html,1)
(GET /html/notes/20140620/872.html,1)
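A quick spot check against the test data confirms the counts: /html/notes/20140617/888.html was requested via GET three times (once returning a 301 redirect) and /course/list/73.htm twice, matching the 3 and 2 above.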




