Hadoop Classic Cases, Spark Implementation (VII): Log Analysis of Unstructured Files


Related articles in this series:

Hadoop Classic Cases, Spark Implementation (I): Finding the Maximum Temperature per Year from Collected Weather Data
Hadoop Classic Cases, Spark Implementation (II): Data Deduplication
Hadoop Classic Cases, Spark Implementation (III): Data Sorting
Hadoop Classic Cases, Spark Implementation (IV): Average Score
Hadoop Classic Cases, Spark Implementation (V): Finding the Maximum and Minimum Values
Hadoop Classic Cases, Spark Implementation (VI): Finding the Top K Values and Sorting
Hadoop Classic Cases, Spark Implementation (VII): Log Analysis of Unstructured Files



1. Requirements: compute URL access statistics from a Tomcat access log (sample data below).
The statistics must distinguish GET requests from POST requests.
Each result record has the form: access method, URL, number of visits, e.g. (GET /course/detail/3.htm, 1).

Test data set:

196.168.2.1 - - [03/Jul/2014:23:36:38 +0800] "GET /course/detail/3.htm HTTP/1.0" 38435 0.038
182.131.89.195 - - [03/Jul/2014:23:37:43 +0800] "GET /html/notes/20140617/888.html HTTP/1.0" 301 - 0.000
196.168.2.1 - - [03/Jul/2014:23:38:27 +0800] "POST /service/notes/addviewtimes_23.htm HTTP/1.0" 2 0.003
196.168.2.1 - - [03/Jul/2014:23:39:03 +0800] "GET /html/notes/20140617/779.html HTTP/1.0" 69539 0.046
196.168.2.1 - - [03/Jul/2014:23:43:00 +0800] "GET /html/notes/20140318/24.html HTTP/1.0" 67171 0.049
196.168.2.1 - - [03/Jul/2014:23:43:59 +0800] "POST /service/notes/addviewtimes_779.htm HTTP/1.0" 1 0.003
196.168.2.1 - - [03/Jul/2014:23:45:51 +0800] "GET /html/notes/20140617/888.html HTTP/1.0" 200 70044 0.060
196.168.2.1 - - [03/Jul/2014:23:46:17 +0800] "GET /course/list/73.htm HTTP/1.0" 12125 0.010
196.168.2.1 - - [03/Jul/2014:23:46:58 +0800] "GET /html/notes/20140609/542.html HTTP/1.0" 94971 0.077
196.168.2.1 - - [03/Jul/2014:23:48:31 +0800] "POST /service/notes/addviewtimes_24.htm HTTP/1.0" 2 0.003
196.168.2.1 - - [03/Jul/2014:23:48:34 +0800] "POST /service/notes/addviewtimes_542.htm HTTP/1.0" 2 0.003
196.168.2.1 - - [03/Jul/2014:23:49:31 +0800] "GET /notes/index-top-3.htm HTTP/1.0" 200 53494 0.041
196.168.2.1 - - [03/Jul/2014:23:50:55 +0800] "GET /html/notes/20140609/544.html HTTP/1.0" 200 183694 0.076
196.168.2.1 - - [03/Jul/2014:23:53:32 +0800] "POST /service/notes/addviewtimes_544.htm HTTP/1.0" 200 2 0.004
196.168.2.1 - - [03/Jul/2014:23:54:53 +0800] "GET /service/notes/addviewtimes_900.htm HTTP/1.0" 151770 0.054
196.168.2.1 - - [03/Jul/2014:23:57:42 +0800] "GET /html/notes/20140620/872.html HTTP/1.0" 52373 0.034
196.168.2.1 - - [03/Jul/2014:23:58:17 +0800] "POST /service/notes/addviewtimes_900.htm HTTP/1.0" 2 0.003
196.168.2.1 - - [03/Jul/2014:23:58:51 +0800] "GET /html/notes/20140617/888.html HTTP/1.0" 70044 0.057
186.76.76.76 - - [03/Jul/2014:23:48:34 +0800] "POST /service/notes/addviewtimes_542.htm HTTP/1.0" 2 0.003
186.76.76.76 - - [03/Jul/2014:23:46:17 +0800] "GET /course/list/73.htm HTTP/1.0" 200 12125 0.010
8.8.8.8 - - [03/Jul/2014:23:46:58 +0800] "GET /html/notes/20140609/542.html HTTP/1.0" 200 94971 0.077


Because the Tomcat log lines are irregular (fields such as the status code are sometimes missing), the data must be cleaned and filtered before it is counted.
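
For illustration, a stricter way to do this cleaning than the simple substring checks used below is to validate each line against a regular expression. A minimal Scala sketch (the pattern, object name, and helper name are assumptions, not part of the original solution):

import scala.util.matching.Regex

object LogCleaner {
  // matches lines such as:
  //   196.168.2.1 - - [03/Jul/2014:23:36:38 +0800] "GET /course/detail/3.htm HTTP/1.0" 200 38435 0.038
  // capture group 1 = HTTP method, capture group 2 = URL
  val LogPattern: Regex = """.*"(GET|POST)\s+(\S+)\s+HTTP/1\.0.*""".r

  // returns Some((method, url)) for a well-formed line, None otherwise
  def parse(line: String): Option[(String, String)] =
    line match {
      case LogPattern(method, url) => Some((method, url))
      case _ => None
    }
}

An RDD could then be cleaned with data.flatMap(LogCleaner.parse), which silently drops any line that does not match.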


2. Hadoop MapReduce implementation:

Map class

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

	private IntWritable val = new IntWritable(1);

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		String line = value.toString().trim();
		String tmp = handlerLog(line);
		if (tmp.length() > 0) {
			context.write(new Text(tmp), val);
		}
	}

	// e.g. 127.0.0.1 - - [03/Jul/2014:23:36:38 +0800] "GET /course/detail/3.htm HTTP/1.0" 200 38435 0.038
	private String handlerLog(String line) {
		String result = "";
		try {
			if (line.length() > 20) {
				if (line.indexOf("GET") > 0) {
					// take the substring from "GET" up to "HTTP/1.0"
					result = line.substring(line.indexOf("GET"), line.indexOf("HTTP/1.0")).trim();
				} else if (line.indexOf("POST") > 0) {
					result = line.substring(line.indexOf("POST"), line.indexOf("HTTP/1.0")).trim();
				}
			}
		} catch (Exception e) {
			System.out.println(line);
		}
		return result;
	}

	public static void main(String[] args) {
		String line = "127.0.0.1 - - [03/Jul/2014:23:36:38 +0800] \"GET /course/detail/3.htm HTTP/1.0\" 200 38435 0.038";
		System.out.println(new LogMapper().handlerLog(line));
	}
}

Reduce class

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		// sum the 1s emitted by the mapper for this access method + URL
		int sum = 0;
		for (IntWritable val : values) {
			sum += val.get();
		}
		context.write(key, new IntWritable(sum));
	}
}



Driver class

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobMain {

	public static void main(String[] args) throws Exception {
		Configuration configuration = new Configuration();
		Job job = new Job(configuration, "log_job");

		job.setJarByClass(JobMain.class);
		job.setMapperClass(LogMapper.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);

		job.setReducerClass(LogReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		FileInputFormat.addInputPath(job, new Path(args[0]));

		// delete the output directory if it already exists
		Path path = new Path(args[1]);
		FileSystem fs = FileSystem.get(configuration);
		if (fs.exists(path)) {
			fs.delete(path, true);
		}
		FileOutputFormat.setOutputPath(job, path);

		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}


3. Spark implementation (Scala version)

// textFile() loads the data
val data = sc.textFile("/spark/seven.txt")

// keep non-empty lines that contain either GET or POST
val filtered = data.filter(_.length > 0)
  .filter(line => line.indexOf("GET") > 0 || line.indexOf("POST") > 0)

// convert each line into a (method + URL, 1) key-value pair
val res = filtered.map(line => {
  if (line.indexOf("GET") > 0) {
    // take the substring from "GET" up to "HTTP/1.0"
    (line.substring(line.indexOf("GET"), line.indexOf("HTTP/1.0")).trim, 1)
  } else {
    // take the substring from "POST" up to "HTTP/1.0"
    (line.substring(line.indexOf("POST"), line.indexOf("HTTP/1.0")).trim, 1)
  }
}).reduceByKey(_ + _)  // finally sum per key with reduceByKey

// trigger the action
res.collect()
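
If the report should list the most visited URLs first, the pair RDD can be sorted on the count before it is collected or saved. A small sketch (sortBy and saveAsTextFile are standard RDD operations; the output path is a hypothetical example):

// sort by visit count in descending order, then persist the report
val sorted = res.sortBy(_._2, ascending = false)
sorted.saveAsTextFile("/spark/seven_result")  // hypothetical output path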


Scala's functional style keeps the code concise and elegant; JDK 1.8 adds similar features to Java (lambda expressions and the Stream API).
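
To illustrate that conciseness, the whole job can also be written as one chained expression; this sketch is equivalent to the step-by-step version above:

// load, clean, map, and aggregate in a single expression
val counts = sc.textFile("/spark/seven.txt")
  .filter(line => line.length > 0 && (line.indexOf("GET") > 0 || line.indexOf("POST") > 0))
  .map { line =>
    val method = if (line.indexOf("GET") > 0) "GET" else "POST"
    (line.substring(line.indexOf(method), line.indexOf("HTTP/1.0")).trim, 1)
  }
  .reduceByKey(_ + _)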


For comparison, the output matches the MapReduce version:

(POST /service/notes/addviewtimes_779.htm,1),
(GET /service/notes/addviewtimes_900.htm,1),
(POST /service/notes/addviewtimes_900.htm,1),
(GET /notes/index-top-3.htm,1),
(GET /html/notes/20140318/24.html,1),
(GET /html/notes/20140609/544.html,1),
(POST /service/notes/addviewtimes_542.htm,2),
(POST /service/notes/addviewtimes_544.htm,1),
(GET /html/notes/20140609/542.html,2),
(POST /service/notes/addviewtimes_23.htm,1),
(GET /html/notes/20140617/888.html,3),
(POST /service/notes/addviewtimes_24.htm,1),
(GET /course/detail/3.htm,1),
(GET /course/list/73.htm,2),
(GET /html/notes/20140617/779.html,1),
(GET /html/notes/20140620/872.html,1)



