Reprinting: please credit the source: http://blog.csdn.net/xiaojimanman/article/details/40372189
The WordCount example in the Hadoop source code implements word counting, but it writes its output to an HDFS file; a production program that wants to use the computed results would have to write yet another program to read them back. So I looked into how MapReduce output works. Below is a simple example of writing the results of a MapReduce computation directly to a database.
Requirements Description:
Analyze the Apache logs on a web server, count the number of times each IP accesses a resource, and write the results to a MySQL database.
Data format:
The Apache log data is as follows:
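The original screenshot of the log data is not reproduced here; an illustrative line in the standard Apache access-log format (not the author's actual data) looks like this:

192.168.1.101 - - [21/Oct/2014:10:23:45 +0800] "GET /index.html HTTP/1.1" 200 2326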
A row of data is one HTTP request record; this case only performs a simple count per client IP.
Requirements Analysis:
MapReduce processes the log files in a distributed fashion: the map phase does a simple split-and-count on each log line, and the reduce phase sums the map output.
The map program splits a row of log data, extracts the client IP, and emits the client IP as the key and the number 1 as the value. A sample of the map output follows:
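The original sample screenshot is not reproduced here; based on the description, the map output consists of (IP, 1) pairs, for example (illustrative values):

192.168.1.101	1
192.168.1.101	1
192.168.1.102	1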
The reduce program sums the values for each key and emits the client IP as the key and the total number of occurrences of that IP as the value. A sample of the reduce output follows:
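Again illustrative rather than the author's actual data, the reduce output consists of (IP, total) pairs:

192.168.1.101	2
192.168.1.102	1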
The MapReduce program above is similar to WordCount, just a simple summation per IP; what remains is to write a custom output format for reduce so that the results end up in the MySQL database.
MapReduce supports user-defined output formats; the custom class only needs to extend FileOutputFormat. The custom output format must implement the getRecordWriter method, which here returns a custom RecordWriter defined as an inner class; that MysqlRecordWriter class performs the actual output, writing the reduce result records to the database.
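A minimal skeleton of this structure, with the database call elided (the complete implementation appears in the code section below):

import java.io.IOException;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MysqlOutputFormat<K, V> extends FileOutputFormat<K, V> {

	// the RecordWriter is where each reduce output record is handled
	private static class MysqlRecordWriter<K, V> extends RecordWriter<K, V> {
		@Override
		public void write(K key, V value) throws IOException, InterruptedException {
			// insert (key, value) into MySQL here
		}

		@Override
		public void close(TaskAttemptContext context) throws IOException, InterruptedException {
			// release database resources here
		}
	}

	@Override
	public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
			throws IOException, InterruptedException {
		return new MysqlRecordWriter<K, V>();
	}
}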
In the MapReduce program itself, all that is needed for the reduce results to be written to the database is to set this class as the job's output format:
job.setOutputFormatClass(MysqlOutputFormat.class);
Code implementation:
The log line analysis class TextLine extracts the IP from a log record together with its count (each row counts as 1 occurrence). The code is as follows:
/**
 * Log line analysis
 * @author lulei
 */
package com;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class TextLine {
	private String ip;
	private IntWritable one = new IntWritable(1);
	// flags whether this line of data is usable
	private boolean right = true;

	public TextLine(String textLine) {
		// verify that the log line meets the expected format; if not, mark it unusable
		if (textLine == null || "".equals(textLine)) {
			this.right = false;
			return;
		}
		String[] strs = textLine.split(" ");
		if (strs.length < 2) {
			this.right = false;
			return;
		}
		this.ip = strs[0];
	}

	public boolean isRight() {
		return this.right;
	}

	/**
	 * Returns the map output key (the client IP)
	 * @return
	 */
	public Text getIpCountMapOutKey() {
		return new Text(this.ip);
	}

	/**
	 * Returns the map output value (the count 1)
	 * @return
	 */
	public IntWritable getIpCountMapOutValue() {
		return this.one;
	}
}
The IPCountMR class implements the MapReduce job that counts how many times each IP appears in the log data. The code is as follows:
/**
 * Counts the occurrences of each IP
 */
package com;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class IPCountMR extends Configured implements Tool {
	/**
	 * IP count map
	 * @author lulei
	 */
	public static class IPCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
		@Override
		public void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			TextLine textLine = new TextLine(value.toString());
			if (textLine.isRight()) {
				context.write(textLine.getIpCountMapOutKey(), textLine.getIpCountMapOutValue());
			}
		}
	}

	/**
	 * IP count reduce
	 * @author lulei
	 */
	public static class IPCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
		@Override
		public void reduce(Text key, Iterable<IntWritable> values, Context context)
				throws IOException, InterruptedException {
			int sum = 0;
			for (IntWritable value : values) {
				sum += value.get();
			}
			context.write(key, new IntWritable(sum));
		}
	}

	@SuppressWarnings("deprecation")
	@Override
	public int run(String[] arg0) throws Exception {
		Configuration conf = new Configuration();
		Job job = new Job(conf);
		job.setJobName("ipcount");
		job.setInputFormatClass(TextInputFormat.class);
		// set the output format to MysqlOutputFormat
		job.setOutputFormatClass(MysqlOutputFormat.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		job.setMapperClass(IPCountMap.class);
		job.setCombinerClass(IPCountReduce.class);
		job.setReducerClass(IPCountReduce.class);
		FileInputFormat.addInputPath(job, new Path(arg0[0]));
		// this output path should not be necessary for a database sink, but the
		// job fails without it (likely because FileOutputFormat.checkOutputSpecs
		// requires an output directory to be set)
		MysqlOutputFormat.setOutputPath(job, new Path(arg0[1]));
		job.waitForCompletion(true);
		return job.isSuccessful() ? 0 : 1;
	}

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		try {
			int res = ToolRunner.run(new Configuration(), new IPCountMR(), args);
			System.exit(res);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}
The MysqlOutputFormat class implements the custom reduce output, writing the reduce results to the database. The code is as follows:
package com;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

@SuppressWarnings("hiding")
public class MysqlOutputFormat<Text, IntWritable> extends FileOutputFormat<Text, IntWritable> {

	// MySQL RecordWriter
	private static class MysqlRecordWriter<Text, IntWritable> extends RecordWriter<Text, IntWritable> {
		private LogDB logDB;

		/**
		 * uses the LogDB object passed in from outside
		 * @param logDB
		 */
		MysqlRecordWriter(LogDB logDB) {
			this.logDB = logDB;
		}

		@Override
		public void close(TaskAttemptContext arg0) throws IOException, InterruptedException {
			// nothing to clean up here
		}

		/**
		 * writes the key-value pair to the database
		 */
		@Override
		public void write(Text key, IntWritable value) throws IOException, InterruptedException {
			logDB.insert(key.toString(), value.toString());
		}
	}

	@Override
	public RecordWriter<Text, IntWritable> getRecordWriter(TaskAttemptContext arg0)
			throws IOException, InterruptedException {
		// return a MysqlRecordWriter object
		return new MysqlRecordWriter<Text, IntWritable>(new LogDB());
	}
}
The LogDB class used above is the author's own wrapper for database operations and implements the data insertion. The code is as follows:
package com;

import java.sql.SQLException;

import com.lulei.db.manager.DBServer;

public class LogDB {
	// create a new connection pool
	private DBServer dbServer = new DBServer("proxool.log");

	/**
	 * Inserts a record into the database
	 * @param ip
	 * @param num
	 */
	public void insert(String ip, String num) {
		try {
			dbServer.insert("insert into logmp(ip, num) values ('" + ip + "', '" + num + "')");
		} catch (SQLException e) {
			// ignored in the original code
		} finally {
			dbServer.close();
		}
	}

	public static void main(String[] args) {
		new LogDB().insert("127.0.0.2", "1");
	}
}
The DBServer class used in this program wraps database operations on top of the proxool-0.9.1.jar connection pool. It is not described in detail here; you could also skip the connection pool and write the data to the database directly. Since this is not the focus of this case, it is not covered in depth.
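Since DBServer itself is not shown, here is a minimal sketch of what a direct-JDBC insert without a connection pool could look like. The class name, connection URL, user, and password are hypothetical placeholders, and it assumes the MySQL Connector/J driver is on the classpath; using a PreparedStatement also avoids the SQL injection risk of the string concatenation in LogDB above.

package com;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical direct-JDBC alternative to LogDB; not the author's DBServer.
public class LogDBDirect {
	// placeholder connection settings; replace with your own
	private static final String URL = "jdbc:mysql://localhost:3306/log";
	private static final String USER = "root";
	private static final String PASSWORD = "123456";

	/**
	 * Inserts one (ip, num) record. Opens and closes a connection per call,
	 * which is exactly the cost a connection pool avoids.
	 */
	public void insert(String ip, String num) {
		try (Connection conn = DriverManager.getConnection(URL, USER, PASSWORD);
				PreparedStatement ps = conn.prepareStatement(
						"insert into logmp(ip, num) values (?, ?)")) {
			ps.setString(1, ip);
			ps.setString(2, num);
			ps.executeUpdate();
		} catch (SQLException e) {
			e.printStackTrace();
		}
	}
}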
Upload and run:
For the specific commands, refer to the upload-and-run section of the blog post http://blog.csdn.net/xiaojimanman/article/details/40184581.
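For reference, the launch command generally has the following shape; the jar name and HDFS paths below are placeholders, not the author's actual values:

hadoop jar ipcount.jar com.IPCountMR /user/hadoop/log/input /user/hadoop/log/output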
Execution Result:
After the program finishes, querying the corresponding table from the command line shows that the results were written to the database correctly.
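The original screenshot of the table contents is not reproduced here; a query of the following form (against the logmp table used by LogDB) verifies the result:

mysql> select ip, num from logmp order by num desc;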
The log files came from clients accessing over the intranet, so the recorded addresses are intranet IPs.
Note: the related database connection pool code is available in the resource at http://download.csdn.net/detail/xiaojimanman/6920219