I. Requirements Analysis
1. A log file is generated every day (and needs to be uploaded to HDFS).
2. Fields contained in each log record: access IP, access time, access URL, access status, access traffic (a sample record is shown after this list).
3. The "yesterday" log file and Logclean.jar (the MapReduce cleaning program) are already available.
4. Required indicators: a. PV (page view) count; b. number of registrations; c. number of distinct IPs; d. bounce rate; e. two-hop rate.
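For reference, the sample record used by the LogParser self-test in section V shows the five fields in a raw log line:

27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127

which parses to access IP 27.19.74.143, access time 30/May/2013:17:38:20 (stored as 20130530173820 after conversion), access URL GET /static/image/common/faq.gif HTTP/1.1, access status 200, and access traffic 1127.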
II. Data Analysis
1. Data acquisition: upload the log file with a shell script.
2. Data cleansing: filter out unwanted fields and reformat the time and other fields.
3. Data analysis: use a single-level Hive partition keyed by date (see the variable sketch after this list).
4. Data export: Sqoop.
5. Frameworks used: shell scripts, HDFS, MapReduce, Hive, Sqoop, MySQL.
Expected results: pv, register, ip, jumpprob, two_jumpprob.
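The date partition key that drives every later step is simply yesterday's date. A minimal sketch of that setup, reusing the variable names and paths from the full script in section VI:

#!/bin/bash
# partition key: yesterday's date, formatted to match the Hive partition value (e.g. 2016_11_13)
yesterday=$(date -d "-1 day" +%Y_%m_%d)

# locations used throughout the pipeline (values taken from the full script in section VI)
LOG_PATH=/home/liuwl/opt/datas/weblog/access_$yesterday.log   # raw log produced for that day
HDFS_INPUT_PATH=/weblog/source                                # raw logs on HDFS, one directory per day
HDFS_OUTPUT_PATH=/weblog/clean                                # cleaned output, one Hive partition directory per day
JAR_PATH=/home/liuwl/opt/datas/logclean.jar                   # the cleaning MapReduce job
ENTRANCE=org.apache.hadoop.log.project.LogClean               # its main class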
III. Implementation
1. Automatically upload the log file to HDFS:

$HADOOP_HOME/bin/hdfs dfs -rm -r $HDFS_INPUT_PATH > /dev/null 2>&1
$HADOOP_HOME/bin/hdfs dfs -mkdir -p $HDFS_INPUT_PATH/$yesterday > /dev/null 2>&1
$HADOOP_HOME/bin/hdfs dfs -put $LOG_PATH $HDFS_INPUT_PATH/$yesterday > /dev/null 2>&1

2. Data cleansing (use MapReduce to filter out dirty data and unwanted static-resource requests, strip the double quotes, and convert the date):

$HADOOP_HOME/bin/hdfs dfs -rm -r $HDFS_OUTPUT_PATH > /dev/null 2>&1
$HADOOP_HOME/bin/yarn jar $JAR_PATH $ENTRANCE $HDFS_INPUT_PATH/$yesterday $HDFS_OUTPUT_PATH/date=$yesterday

3. Create the log database and partitioned table in Hive and add the cleaned files to the new partition:

$HIVE_HOME/bin/hive -e "create database if not exists $HIVE_DATABASE" > /dev/null 2>&1
$HIVE_HOME/bin/hive --database $HIVE_DATABASE -e "create external table if not exists $HIVE_TABLE (ip string, day string, url string) partitioned by (date string) row format delimited fields terminated by '\t' location '$HDFS_OUTPUT_PATH'"
$HIVE_HOME/bin/hive --database $HIVE_DATABASE -e "alter table $HIVE_TABLE add partition (date='$yesterday')"

4. Analyze the data in Hive and export the results to MySQL with Sqoop (the export command itself is sketched after these queries; the full version is in section VI).

pv:
create table if not exists pv_tb (pv string) row format delimited fields terminated by '\t';
insert overwrite table pv_tb select count(1) from weblog_clean where date='2016_11_13';

register:
create table if not exists register_tb (register string) row format delimited fields terminated by '\t';
insert overwrite table register_tb select count(1) from weblog_clean where date='2016_11_13' and instr(url,'member.php?mod=register') > 0;

ip:
create table if not exists ip_tb (ip string) row format delimited fields terminated by '\t';
insert overwrite table ip_tb select count(distinct ip) from weblog_clean where date='2016_11_13';

jumpprob (bounce rate: IPs with fewer than two records, divided by the total number of records):
create table if not exists jumpprob_tb (jump double) row format delimited fields terminated by '\t';
insert overwrite table jumpprob_tb
select ghip.singleip/aip.ips from
  (select count(1) singleip from
    (select count(ip) ips from weblog_clean where date='2016_11_13' group by ip having ips < 2) gip) ghip,
  (select count(ip) ips from weblog_clean where date='2016_11_13') aip;

two_jumpprob (two-hop rate: IPs with at least two records, divided by the total number of records):
create table if not exists two_jumpprob_tb (jump double) row format delimited fields terminated by '\t';
insert overwrite table two_jumpprob_tb
select ghip.singleip/aip.ips from
  (select count(1) singleip from
    (select count(ip) ips from weblog_clean where date='2016_11_13' group by ip having ips >= 2) gip) ghip,
  (select count(ip) ips from weblog_clean where date='2016_11_13') aip;

Merged table (note: creating the tables above separately is more efficient than the merged query below, but it consumes more storage):
create table if not exists log_result (pv string, register string, ip string, jumpprob double, two_jumpprob double) row format delimited fields terminated by '\t';
insert overwrite table log_result
select log_pv.pv, log_register.register, log_ip.ip, log_jumpprob.jumpprob, log_two_jumpprob.two_jumpprob from
  (select count(1) pv from weblog_clean where date='2016_11_13') log_pv,
  (select count(1) register from weblog_clean where date='2016_11_13' and instr(url,'member.php?mod=register') > 0) log_register,
  (select count(distinct ip) ip from weblog_clean where date='2016_11_13') log_ip,
  (select ghip.singleip/aip.ips jumpprob from
    (select count(1) singleip from
      (select count(ip) ips from weblog_clean where date='2016_11_13' group by ip having ips < 2) gip) ghip,
    (select count(ip) ips from weblog_clean where date='2016_11_13') aip) log_jumpprob,
  (select ghip.singleip/aip.ips two_jumpprob from
    (select count(1) singleip from
      (select count(ip) ips from weblog_clean where date='2016_11_13' group by ip having ips >= 2) gip) ghip,
    (select count(ip) ips from weblog_clean where date='2016_11_13') aip) log_two_jumpprob;
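The Sqoop export mentioned in step 4 is a minimal sketch of the command used in the full script in section VI, assuming the variables defined there ($SQOOP_HOME, $SQOOP_JDBC, $MYSQL_USERNAME, $MYSQL_PASSWORD, $HIVE_RSTABLE, $EXPORT_DIR, $NUM_MAPPERS) and that the target MySQL table already exists:

$SQOOP_HOME/bin/sqoop export \
  --connect $SQOOP_JDBC \
  --username $MYSQL_USERNAME --password $MYSQL_PASSWORD \
  --table $HIVE_RSTABLE \
  --export-dir $EXPORT_DIR \
  --num-mappers $NUM_MAPPERS \
  --input-fields-terminated-by '\t'

Here --export-dir points at the Hive warehouse directory of the result table, and --input-fields-terminated-by must match the '\t' delimiter used when that Hive table was created.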
IV. Display of Results
mysql> select * from weblog_result;
+--------+----------+-------+----------+--------------+
| pv     | register | ip    | jumpprob | two_jumpprob |
+--------+----------+-------+----------+--------------+
| 169857 |          | 10411 |     0.02 |         0.04 |
+--------+----------+-------+----------+--------------+
1 row in set (0.00 sec)
V. Logclean.jar (filters the log fields: converts the date, strips the double quotes, and trims the URL down to the part after the site root)
package org.apache.hadoop.log.project;

import java.net.URI;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LogClean extends Configured implements Tool {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        try {
            int res = ToolRunner.run(conf, new LogClean(), args);
            System.exit(res);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "LogClean");
        // set so the job can run from the packaged jar
        job.setJarByClass(LogClean.class);
        FileInputFormat.setInputPaths(job, args[0]);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // clean up an existing output directory
        FileSystem fs = FileSystem.get(new URI(args[0]), getConf());
        Path outPath = new Path(args[1]);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        boolean success = job.waitForCompletion(true);
        if (success) {
            System.out.println("Clean process success!");
        } else {
            System.out.println("Clean process failed!");
        }
        return 0;
    }

    static class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        LogParser logParser = new LogParser();
        Text outputValue = new Text();

        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            final String[] parsed = logParser.parse(value.toString());

            // Step 1. Filter out static resource access requests
            if (parsed[2].startsWith("GET /static/") || parsed[2].startsWith("GET /uc_server")) {
                return;
            }
            // Step 2. Strip the leading "GET /" or "POST /"
            if (parsed[2].startsWith("GET /")) {
                parsed[2] = parsed[2].substring("GET /".length());
            } else if (parsed[2].startsWith("POST /")) {
                parsed[2] = parsed[2].substring("POST /".length());
            }
            // Step 3. Strip the trailing " HTTP/1.1"
            if (parsed[2].endsWith(" HTTP/1.1")) {
                parsed[2] = parsed[2].substring(0, parsed[2].length() - " HTTP/1.1".length());
            }
            // Step 4. Write only the first three fields (ip, time, url)
            outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
            context.write(key, outputValue);
        }
    }

    static class MyReducer extends Reducer<LongWritable, Text, Text, NullWritable> {
        protected void reduce(LongWritable k2, java.lang.Iterable<Text> v2s, Context context)
                throws java.io.IOException, InterruptedException {
            for (Text v2 : v2s) {
                context.write(v2, NullWritable.get());
            }
        }
    }

    /* Log parsing class */
    static class LogParser {
        public static final SimpleDateFormat FORMAT =
                new SimpleDateFormat("d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        public static final SimpleDateFormat dateformat1 =
                new SimpleDateFormat("yyyyMMddHHmmss");

        public static void main(String[] args) throws ParseException {
            final String S1 = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
            LogParser parser = new LogParser();
            final String[] array = parser.parse(S1);
            System.out.println("Sample data: " + S1);
            System.out.format("Parse result: ip=%s, time=%s, url=%s, status=%s, traffic=%s",
                    array[0], array[1], array[2], array[3], array[4]);
        }

        /**
         * Parse the English time string.
         */
        private Date parseDateFormat(String string) {
            Date parse = null;
            try {
                parse = FORMAT.parse(string);
            } catch (ParseException e) {
                e.printStackTrace();
            }
            return parse;
        }

        /**
         * Parse one log line.
         *
         * @return array of 5 elements: ip, time, url, status, traffic
         */
        public String[] parse(String line) {
            String ip = parseIP(line);
            String time = parseTime(line);
            String url = parseURL(line);
            String status = parseStatus(line);
            String traffic = parseTraffic(line);
            return new String[] { ip, time, url, status, traffic };
        }

        private String parseTraffic(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1).trim();
            return trim.split(" ")[1];
        }

        private String parseStatus(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1).trim();
            return trim.split(" ")[0];
        }

        private String parseURL(String line) {
            final int first = line.indexOf("\"");
            final int last = line.lastIndexOf("\"");
            return line.substring(first + 1, last);
        }

        private String parseTime(String line) {
            final int first = line.indexOf("[");
            final int last = line.indexOf("+0800]");
            String time = line.substring(first + 1, last).trim();
            Date date = parseDateFormat(time);
            return dateformat1.format(date);
        }

        private String parseIP(String line) {
            return line.split("- -")[0].trim();
        }
    }
}
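Packaging is not covered in the original, but once the class above is built into Logclean.jar (for example by exporting a runnable jar from the IDE) the job can be exercised by hand; the paths and date below are the example values used elsewhere in this post:

# run the cleaning job for one day's raw logs; the output directory becomes a Hive partition
$HADOOP_HOME/bin/yarn jar /home/liuwl/opt/datas/logclean.jar \
  org.apache.hadoop.log.project.LogClean \
  /weblog/source/2016_11_13 /weblog/clean/date=2016_11_13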
VI. Full Shell Script (note: prepare Logclean.jar (the log-filtering MapReduce program), the "yesterday" log file, and the file locations in advance)
#!/bin/bash
# (ASCII-art "Buddha bless, never error" banner omitted)

# yesterday's date, used as the partition key
yesterday=$(date -d "-1 day" +%Y_%m_%d)
echo $yesterday

############## define #############
HADOOP_HOME=/opt/cdh-5.6.3/hadoop-2.5.0-cdh5.3.6
HIVE_HOME=/opt/cdh-5.6.3/hive-0.13.1-cdh5.3.6
SQOOP_HOME=/opt/cdh-5.6.3/sqoop-1.4.5-cdh5.2.6
HIVE_DATABASE=weblog
HIVE_TABLE=weblog_clean
HIVE_RSTABLE=weblog_result
MYSQL_USERNAME=root
MYSQL_PASSWORD=root
EXPORT_DIR=/user/hive/warehouse/weblog.db/weblog_result
NUM_MAPPERS=1

########################### get logfile path ###########################
LOG_PATH=/home/liuwl/opt/datas/weblog/access_$yesterday.log
JAR_PATH=/home/liuwl/opt/datas/logclean.jar
ENTRANCE=org.apache.hadoop.log.project.LogClean
HDFS_INPUT_PATH=/weblog/source
HDFS_OUTPUT_PATH=/weblog/clean
SQOOP_JDBC=jdbc:mysql://hadoop09-linux-01.ibeifeng.com:3306/$HIVE_DATABASE

############################## upload logfile to HDFS ##############################
echo "start to upload logfile"
# $HADOOP_HOME/bin/hdfs dfs -rm -r $HDFS_INPUT_PATH > /dev/null 2>&1
HSFiles=`$HADOOP_HOME/bin/hdfs dfs -ls $HDFS_INPUT_PATH/$yesterday`
if [ -z "$HSFiles" ]; then
  $HADOOP_HOME/bin/hdfs dfs -mkdir -p $HDFS_INPUT_PATH/$yesterday > /dev/null 2>&1
  $HADOOP_HOME/bin/hdfs dfs -put $LOG_PATH $HDFS_INPUT_PATH/$yesterday > /dev/null 2>&1
  echo "upload ok"
else
  echo "exists"
fi

############################# clean the source file ############################
echo "start to clean logfile"
HCFiles=`$HADOOP_HOME/bin/hdfs dfs -ls $HDFS_OUTPUT_PATH`
if [ -z "$HCFiles" ]; then
  $HADOOP_HOME/bin/yarn jar $JAR_PATH $ENTRANCE $HDFS_INPUT_PATH/$yesterday $HDFS_OUTPUT_PATH/date=$yesterday
  echo "clean ok"
fi

############################# create the hive table ############################
echo "start to create the hive table"
$HIVE_HOME/bin/hive -e "create database if not exists $HIVE_DATABASE" > /dev/null 2>&1
$HIVE_HOME/bin/hive --database $HIVE_DATABASE -e "create external table if not exists $HIVE_TABLE (ip string, day string, url string) partitioned by (date string) row format delimited fields terminated by '\t' location '$HDFS_OUTPUT_PATH'"
echo "add partition to hive table"
$HIVE_HOME/bin/hive --database $HIVE_DATABASE -e "alter table $HIVE_TABLE add partition (date='$yesterday')"

#################################### create the hive result table ####################################
echo "start to create the hive result table"
$HIVE_HOME/bin/hive --database $HIVE_DATABASE -e "create table if not exists $HIVE_RSTABLE (pv string, register string, ip string, jumpprob double, two_jumpprob double) row format delimited fields terminated by '\t'"

################### insert data ###################
echo "start to insert data"
HTFiles=`$HADOOP_HOME/bin/hdfs dfs -ls $EXPORT_DIR`
if [ -z "$HTFiles" ]; then
  $HIVE_HOME/bin/hive --database $HIVE_DATABASE -e "insert overwrite table $HIVE_RSTABLE select log_pv.pv, log_register.register, log_ip.ip, log_jumpprob.jumpprob, log_two_jumpprob.two_jumpprob from (select count(1) pv from $HIVE_TABLE where date='$yesterday') log_pv, (select count(1) register from $HIVE_TABLE where date='$yesterday' and instr(url,'member.php?mod=register') > 0) log_register, (select count(distinct ip) ip from $HIVE_TABLE where date='$yesterday') log_ip, (select ghip.singleip/aip.ips jumpprob from (select count(1) singleip from (select count(ip) ips from $HIVE_TABLE where date='$yesterday' group by ip having ips < 2) gip) ghip, (select count(ip) ips from $HIVE_TABLE where date='$yesterday') aip) log_jumpprob, (select ghip.singleip/aip.ips two_jumpprob from (select count(1) singleip from (select count(ip) ips from $HIVE_TABLE where date='$yesterday' group by ip having ips >= 2) gip) ghip, (select count(ip) ips from $HIVE_TABLE where date='$yesterday') aip) log_two_jumpprob"
fi

##################################### create the mysql result table ####################################
mysql -u$MYSQL_USERNAME -p$MYSQL_PASSWORD -e "create database if not exists $HIVE_DATABASE default character set utf8 collate utf8_general_ci; use $HIVE_DATABASE; create table if not exists $HIVE_RSTABLE (pv varchar(20) not null, register varchar(20) not null, ip varchar(20) not null, jumpprob double(6,2) not null, two_jumpprob double(6,2) not null) default character set utf8 collate utf8_general_ci; truncate table $HIVE_RSTABLE; quit"

######################################### export hive result table to mysql ########################################
echo "start to export hive result table to mysql"
$SQOOP_HOME/bin/sqoop export --connect $SQOOP_JDBC --username $MYSQL_USERNAME --password $MYSQL_PASSWORD --table $HIVE_RSTABLE --export-dir $EXPORT_DIR --num-mappers $NUM_MAPPERS --input-fields-terminated-by '\t'
echo "shell finished"
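The post does not say how this script is triggered; since a new log file arrives every day, a cron entry is the natural choice. A minimal sketch, assuming the script is saved as /home/liuwl/opt/datas/weblog_analysis.sh (a hypothetical name) and runs shortly after midnight:

# crontab -e  (the time and the script path are assumptions)
30 1 * * * /bin/bash /home/liuwl/opt/datas/weblog_analysis.sh >> /home/liuwl/opt/datas/weblog_analysis.cron.log 2>&1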