Website Log Analysis Project Case (i) Project Description: http://www.cnblogs.com/edisonchou/p/4449082.html
Website Log Analysis Project Case (ii) Data Cleansing: current page
Website Log Analysis Project Case (iii) Statistical Analysis: http://www.cnblogs.com/edisonchou/p/4464349.html
I. Data Situation Analysis
1.1 Data Review
There are two parts to the forum data:
(1) Historical data totaling about 56 GB, covering records up to 2012-05-29. Before that date, all logs were written to a single file in append mode.
(2) From 2013-05-30 onward, a separate data file of roughly 150 MB is generated each day; that is, the logs are no longer kept in a single file.
Figure 1 shows the format of the log data. Each record has five fields: the visitor's IP, the access time, the requested resource, the access status (HTTP status code), and the traffic (bytes) of the request.
Figure 1 Logging data format
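For illustration, a record in this format might look like the following (a made-up example in the common Apache access-log layout, not a line quoted from the actual data set):

110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /forum.php?mod=viewthread&tid=123 HTTP/1.1" 200 1292

Here 110.52.250.126 is the visitor's IP, [30/May/2013:17:38:20 +0800] is the access time, the quoted string is the requested resource, 200 is the HTTP status code, and 1292 is the traffic of the request.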
The data used here consists of two log files from 2013, access_2013_05_30.log and access_2013_05_31.log, available at: http://pan.baidu.com/s/1pJE7XR9
1.2 Data to clean up
(1) According to the key-indicator analysis in the previous article, the access status (HTTP status code) and the traffic of each request are not needed for our statistics, so these two fields can be cleaned out first;
(2) According to the format of the log records, the date needs to be converted into a more usual format such as 20150426, so we can write a class to convert the logged date (see the sketch after this list);
(3) Access requests for static resources are meaningless for our analysis, so records beginning with "GET /staticsource/" can be filtered out; and since the strings GET and POST themselves carry no information for us, they can be omitted as well.
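As a quick illustration of item (2), the conversion can be done with two SimpleDateFormat instances. The following is a minimal standalone sketch (the class and variable names are only for illustration); the full parser class actually used by the job appears in section 2.2:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class DateFormatDemo {
    public static void main(String[] args) throws Exception {
        // source format used in the log, e.g. "30/May/2013:17:38:20" (English month abbreviation)
        SimpleDateFormat source = new SimpleDateFormat("d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        // target format, e.g. "20130530173820"
        SimpleDateFormat target = new SimpleDateFormat("yyyyMMddHHmmss");
        Date date = source.parse("30/May/2013:17:38:20");
        System.out.println(target.format(date)); // prints 20130530173820
    }
}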
II. The Data Cleaning Process
2.1 Regularly Uploading Logs to HDFS
First, the log data must be uploaded to HDFS for processing. Depending on the situation, this can be done in one of the following ways:
(1) If the log server holds little data and is under light load, the data can be uploaded to HDFS directly with a shell command;
(2) If the log server holds a lot of data and is under heavy load, use NFS to upload the data from another server;
(3) If there are many log servers and a large volume of data, use Flume to collect and transfer the data;
Our experimental data files are small, so we directly adopt the first approach, the shell command. Since a log file is generated every day, we need a timed task that automatically uploads the previous day's log file to the specified directory in HDFS at 1 o'clock the next morning. We therefore create the script techbbs_core.sh and schedule it with crontab; its contents are as follows:
#!/bin/sh

#step1. get yesterday's date string
yesterday=$(date --date='1 days ago' +%Y_%m_%d)

#step2. upload the log file to HDFS
hadoop fs -put /usr/local/files/apache_logs/access_${yesterday}.log /project/techbbs/data
Then use crontab to register the script as a recurring task that runs automatically at 1:00 every day: edit the crontab with crontab -e and add the following entry (which executes techbbs_core.sh at 1:00 every day):
0 1 * * * techbbs_core.sh
Verification: you can view the scheduled tasks that have been set with the command crontab -l.
2.2 Writing the MapReduce Program to Clean the Logs
(1) Write the log parsing class to parse the five components of each row of records separately
static class LogParser {
    public static final SimpleDateFormat FORMAT = new SimpleDateFormat("d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    public static final SimpleDateFormat dateformat1 = new SimpleDateFormat("yyyyMMddHHmmss");

    /**
     * Parse the English time string in the log
     *
     * @param string
     * @return
     * @throws ParseException
     */
    private Date parseDateFormat(String string) {
        Date parse = null;
        try {
            parse = FORMAT.parse(string);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return parse;
    }

    /**
     * Parse one log record
     *
     * @param line
     * @return an array of 5 elements: ip, time, url, status, traffic
     */
    public String[] parse(String line) {
        String ip = parseIP(line);
        String time = parseTime(line);
        String url = parseURL(line);
        String status = parseStatus(line);
        String traffic = parseTraffic(line);
        return new String[] { ip, time, url, status, traffic };
    }

    private String parseTraffic(String line) {
        final String trim = line.substring(line.lastIndexOf("\"") + 1).trim();
        String traffic = trim.split(" ")[1];
        return traffic;
    }

    private String parseStatus(String line) {
        final String trim = line.substring(line.lastIndexOf("\"") + 1).trim();
        String status = trim.split(" ")[0];
        return status;
    }

    private String parseURL(String line) {
        final int first = line.indexOf("\"");
        final int last = line.lastIndexOf("\"");
        String url = line.substring(first + 1, last);
        return url;
    }

    private String parseTime(String line) {
        final int first = line.indexOf("[");
        final int last = line.indexOf("+0800]");
        String time = line.substring(first + 1, last).trim();
        Date date = parseDateFormat(time);
        return dateformat1.format(date);
    }

    private String parseIP(String line) {
        String ip = line.split("- -")[0].trim();
        return ip;
    }
}
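For reference, a hypothetical usage example of this parser (the sample line and the main method below are illustrative only, placed inside the enclosing job class, and are not part of the job's actual code):

public static void main(String[] args) {
    String sample = "110.52.250.126 - - [30/May/2013:17:38:20 +0800] "
            + "\"GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1\" 200 1292";
    String[] fields = new LogParser().parse(sample);
    // expected fields: 110.52.250.126, 20130530173820,
    // GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1, 200, 1292
    System.out.println(java.util.Arrays.toString(fields));
}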
(2) Write a MapReduce program to filter the records of the specified log file
Mapper class:
static class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    LogParser logParser = new LogParser();
    Text outputValue = new Text();

    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, LongWritable, Text>.Context context)
            throws java.io.IOException, InterruptedException {
        final String[] parsed = logParser.parse(value.toString());

        // step1. filter out static resource access requests
        if (parsed[2].startsWith("GET /static/") || parsed[2].startsWith("GET /uc_server")) {
            return;
        }
        // step2. strip the specified leading strings
        if (parsed[2].startsWith("GET /")) {
            parsed[2] = parsed[2].substring("GET /".length());
        } else if (parsed[2].startsWith("POST /")) {
            parsed[2] = parsed[2].substring("POST /".length());
        }
        // step3. strip the specified trailing string
        if (parsed[2].endsWith(" HTTP/1.1")) {
            parsed[2] = parsed[2].substring(0, parsed[2].length() - " HTTP/1.1".length());
        }
        // step4. write out only the first three fields: ip, time, url
        outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
        context.write(key, outputValue);
    }
}
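For example, with a hypothetical request field "GET /forum.php?mod=viewthread&tid=123 HTTP/1.1", step 1 does not filter the record out, step 2 strips the leading "GET /", and step 3 strips the trailing " HTTP/1.1", so the value written out for that field is "forum.php?mod=viewthread&tid=123".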
Reducer class:
static class MyReducer extends Reducer<LongWritable, Text, Text, NullWritable> {
    protected void reduce(LongWritable k2, java.lang.Iterable<Text> v2s,
            Reducer<LongWritable, Text, Text, NullWritable>.Context context)
            throws java.io.IOException, InterruptedException {
        for (Text v2 : v2s) {
            context.write(v2, NullWritable.get());
        }
    }
}
(3) Complete sample code of LogCleanJob.java
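The complete class is not reproduced here. As a reference, the following is a minimal sketch of what the driver might look like, assuming the org.apache.hadoop.mapreduce API used by the Mapper and Reducer above and the two command-line arguments (HDFS input file and output directory) passed by the script in section 2.3; it is an illustration, not the author's actual LogCleanJob code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogCleanJob {

    public static void main(String[] args) throws Exception {
        // args[0] = log file in HDFS, args[1] = output directory for the cleaned data
        Configuration conf = new Configuration();
        Job job = new Job(conf, "LogCleanJob");
        job.setJarByClass(LogCleanJob.class);

        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // MyMapper, MyReducer and LogParser, as shown above, would be nested static classes here.
}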
(4) Export the jar package and upload it to the specified directory on the Linux server
2.3 Regularly Cleaning Logs in HDFS
Here we rewrite the timed-task script from above, adding a step that automatically runs the MapReduce cleanup program; the contents are as follows:
#!/bin/sh

#step1. get yesterday's date string
yesterday=$(date --date='1 days ago' +%Y_%m_%d)

#step2. upload the log file to HDFS
hadoop fs -put /usr/local/files/apache_logs/access_${yesterday}.log /project/techbbs/data

#step3. clean the log data
hadoop jar /usr/local/files/apache_logs/mycleaner.jar /project/techbbs/data/access_${yesterday}.log /project/techbbs/cleaned/${yesterday}
In other words, every day at 1:00 the log file is uploaded to HDFS, and then the data-cleaning program filters the file just stored in HDFS and writes the filtered data to the cleaned directory.
2.4 Timed Task Test
(1) Since the two log files are from 2013, their names are changed to 2015 and to the previous day's date so that the timed task can be tested.
(2) Execute the command: techbbs_core.sh 2014_04_26
The console output is as follows; you can see that the number of records is reduced considerably after filtering:
15/04/26 04:27:20 INFO input.FileInputFormat: Total input paths to process : 1
15/04/26 04:27:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/04/26 04:27:20 WARN snappy.LoadSnappy: Snappy native library not loaded
15/04/26 04:27:22 INFO mapred.JobClient: Running job: job_201504260249_0002
15/04/26 04:27:23 INFO mapred.JobClient:  map 0% reduce 0%
15/04/26 04:28:01 INFO mapred.JobClient:  map 29% reduce 0%
15/04/26 04:28:07 INFO mapred.JobClient:  map 42% reduce 0%
15/04/26 04:28:10 INFO mapred.JobClient:  map 57% reduce 0%
15/04/26 04:28:13 INFO mapred.JobClient:  map 74% reduce 0%
15/04/26 04:28:16 INFO mapred.JobClient:  map 89% reduce 0%
15/04/26 04:28:19 INFO mapred.JobClient:  map 100% reduce 0%
15/04/26 04:28:49 INFO mapred.JobClient:  map 100% reduce 100%
15/04/26 04:28:50 INFO mapred.JobClient: Job complete: job_201504260249_0002
15/04/26 04:28:50 INFO mapred.JobClient: Counters: 29
15/04/26 04:28:50 INFO mapred.JobClient:   Job Counters
15/04/26 04:28:50 INFO mapred.JobClient:     Launched reduce tasks=1
15/04/26 04:28:50 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=58296
15/04/26 04:28:50 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/04/26 04:28:50 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/04/26 04:28:50 INFO mapred.JobClient:     Launched map tasks=1
15/04/26 04:28:50 INFO mapred.JobClient:     Data-local map tasks=1
15/04/26 04:28:50 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=25238
15/04/26 04:28:50 INFO mapred.JobClient:   File Output Format Counters
15/04/26 04:28:50 INFO mapred.JobClient:     Bytes Written=12794925
15/04/26 04:28:50 INFO mapred.JobClient:   FileSystemCounters
15/04/26 04:28:50 INFO mapred.JobClient:     FILE_BYTES_READ=14503530
15/04/26 04:28:50 INFO mapred.JobClient:     HDFS_BYTES_READ=61084325
15/04/26 04:28:50 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=29111500
15/04/26 04:28:50 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=12794925
15/04/26 04:28:50 INFO mapred.JobClient:   File Input Format Counters
15/04/26 04:28:50 INFO mapred.JobClient:     Bytes Read=61084192
15/04/26 04:28:50 INFO mapred.JobClient:   Map-Reduce Framework
15/04/26 04:28:50 INFO mapred.JobClient:     Map output materialized bytes=14503530
15/04/26 04:28:50 INFO mapred.JobClient:     Map input records=548160
15/04/26 04:28:50 INFO mapred.JobClient:     Reduce shuffle bytes=14503530
15/04/26 04:28:50 INFO mapred.JobClient:     Spilled Records=339714
15/04/26 04:28:50 INFO mapred.JobClient:     Map output bytes=14158741
15/04/26 04:28:50 INFO mapred.JobClient:     CPU time spent (ms)=21200
15/04/26 04:28:50 INFO mapred.JobClient:     Total committed heap usage (bytes)=229003264
15/04/26 04:28:50 INFO mapred.JobClient:     Combine input records=0
15/04/26 04:28:50 INFO mapred.JobClient:     SPLIT_RAW_BYTES=133
15/04/26 04:28:50 INFO mapred.JobClient:     Reduce input records=169857
15/04/26 04:28:50 INFO mapred.JobClient:     Reduce input groups=169857
15/04/26 04:28:50 INFO mapred.JobClient:     Combine output records=0
15/04/26 04:28:50 INFO mapred.JobClient:     Physical memory (bytes) snapshot=154001408
15/04/26 04:28:50 INFO mapred.JobClient:     Reduce output records=169857
15/04/26 04:28:50 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=689442816
15/04/26 04:28:50 INFO mapred.JobClient:     Map output records=169857
Clean process success!
(3) View the log data in HDFS through the web interface:
Unfiltered log data is stored in: /project/techbbs/data/
Filtered log data is stored in: /project/techbbs/cleaned/
Original link: http://www.cnblogs.com/edisonchou/