Hadoop Learning Notes-20. Website Log Analysis Project case (ii) Data cleansing


Website Log Analysis Project case (i) Project description: http://www.cnblogs.com/edisonchou/p/4449082.html

Website Log Analysis Project case (ii) Data cleansing: Current Page

Website Log Analysis Project case (iii) statistical analysis: http://www.cnblogs.com/edisonchou/p/4464349.html

I. Data Situation Analysis

1.1 Data Review

There are two parts to the forum data:

(1) About 56GB of historical data, accumulated up to 2012-05-29. This also shows that before 2012-05-29 the logs were written to a single file in append mode.

(2) Since 2013-05-30, a separate data file of about 150MB has been generated every day. This indicates that from 2013-05-30 onward the logs are no longer kept in a single file.

Figure 1 shows the format of the log data. Each record consists of five parts: the visitor's IP, the access time, the requested resource, the access status (HTTP status code), and the traffic (bytes) of the request.

Figure 1: Log data format
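For reference, a record in this format looks roughly like the following hypothetical example (constructed to match the parsing code in section 2.2):

27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127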

The data used here comes from two log files from 2013, access_2013_05_30.log and access_2013_05_31.log, which can be downloaded from: http://pan.baidu.com/s/1pJE7XR9

1.2 Data to clean up

(1) According to the key-indicator analysis in the previous article, our statistics do not involve the access status (HTTP status code) or the traffic of the request, so these two fields can be removed first during cleaning;

(2) According to the format of the log records, the date needs to be converted into a more common compact format such as 20150426 (e.g. 30/May/2013:17:38:20 becomes 20130530173820), so we write a class to convert the logged date;

(3) Because access requests for static resources are meaningless for our analysis, we can filter out records whose request begins with "GET /staticsource/"; the "GET" and "POST" method strings are also of no use to us, so they can be stripped as well.

II. Data Cleaning Process

2.1 Regularly Uploading Logs to HDFS

First, the log data must be uploaded to HDFS for processing, which can be done in one of the following ways:

(1) If the log server's data volume and load are small, you can upload the data to HDFS directly with a shell command;

(2) If the log server's data volume and load are larger, use NFS to upload the data from another server;

(3) If there are many log servers and the data volume is large, use Flume to collect and transfer the data;

Here our experimental data files are small, so we simply adopt the first approach, the shell command. Because a log file is generated every day, we need a scheduled task that automatically uploads the previous day's log file to the specified directory in HDFS at 1:00 the next morning. We therefore created a shell script, techbbs_core.sh, scheduled with crontab, which reads:

#!/bin/sh

#step1. get yesterday's date string
yesterday=$(date --date='1 days ago' +%Y_%m_%d)
#step2. upload the log to HDFS
hadoop fs -put /usr/local/files/apache_logs/access_${yesterday}.log /project/techbbs/data

Then schedule it with crontab as a recurring task that runs automatically at 1:00 every day: run crontab -e and add the following entry (meaning the script techbbs_core.sh is executed at 1:00 each day):

0 1 * * * techbbs_core.sh

To verify: you can list the scheduled tasks that have been set with the command crontab -l.

2.2 Writing the MapReduce Program to Clean the Logs

(1) Write the log-parsing class, which parses the five components of each record separately:

    // requires imports: java.text.SimpleDateFormat, java.text.ParseException, java.util.Date, java.util.Locale
    static class LogParser {
        public static final SimpleDateFormat FORMAT = new SimpleDateFormat(
                "d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        public static final SimpleDateFormat DATEFORMAT1 = new SimpleDateFormat(
                "yyyyMMddHHmmss");

        /**
         * Parse the English time string
         */
        private Date parseDateFormat(String string) {
            Date parse = null;
            try {
                parse = FORMAT.parse(string);
            } catch (ParseException e) {
                e.printStackTrace();
            }
            return parse;
        }

        /**
         * Parse a log line
         *
         * @return an array of 5 elements: ip, time, url, status, traffic
         */
        public String[] parse(String line) {
            String ip = parseIP(line);
            String time = parseTime(line);
            String url = parseURL(line);
            String status = parseStatus(line);
            String traffic = parseTraffic(line);
            return new String[] { ip, time, url, status, traffic };
        }

        private String parseTraffic(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1).trim();
            String traffic = trim.split(" ")[1];
            return traffic;
        }

        private String parseStatus(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1).trim();
            String status = trim.split(" ")[0];
            return status;
        }

        private String parseURL(String line) {
            final int first = line.indexOf("\"");
            final int last = line.lastIndexOf("\"");
            String url = line.substring(first + 1, last);
            return url;
        }

        private String parseTime(String line) {
            final int first = line.indexOf("[");
            final int last = line.indexOf("+0800]");
            String time = line.substring(first + 1, last).trim();
            Date date = parseDateFormat(time);
            return DATEFORMAT1.format(date);
        }

        private String parseIP(String line) {
            String ip = line.split("- -")[0].trim();
            return ip;
        }
    }
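As an illustration, here is a small usage sketch of this parser; the sample line is hypothetical, constructed to match the format described above, and the expected field values are shown in the comments:

    LogParser parser = new LogParser();
    String[] fields = parser.parse(
            "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127");
    // fields[0] = "27.19.74.143"                               (ip)
    // fields[1] = "20130530173820"                             (time, reformatted)
    // fields[2] = "GET /static/image/common/faq.gif HTTP/1.1"  (url / request)
    // fields[3] = "200"                                        (status)
    // fields[4] = "1127"                                       (traffic)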

(2) Write a MapReduce program to filter the records of the specified log file.

Mapper class:

    static class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        LogParser logParser = new LogParser();
        Text outputValue = new Text();

        protected void map(
                LongWritable key,
                Text value,
                org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, LongWritable, Text>.Context context)
                throws java.io.IOException, InterruptedException {
            final String[] parsed = logParser.parse(value.toString());

            // step1. filter out static resource access requests
            if (parsed[2].startsWith("GET /static/")
                    || parsed[2].startsWith("GET /uc_server")) {
                return;
            }
            // step2. strip the leading "GET /" or "POST /"
            if (parsed[2].startsWith("GET /")) {
                parsed[2] = parsed[2].substring("GET /".length());
            } else if (parsed[2].startsWith("POST /")) {
                parsed[2] = parsed[2].substring("POST /".length());
            }
            // step3. strip the trailing " HTTP/1.1"
            if (parsed[2].endsWith(" HTTP/1.1")) {
                parsed[2] = parsed[2].substring(0,
                        parsed[2].length() - " HTTP/1.1".length());
            }
            // step4. write out only the first three fields (ip, time, url)
            outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
            context.write(key, outputValue);
        }
    }

Reducer class:

    static class MyReducer extends Reducer<LongWritable, Text, Text, NullWritable> {
        protected void reduce(
                LongWritable k2,
                java.lang.Iterable<Text> v2s,
                org.apache.hadoop.mapreduce.Reducer<LongWritable, Text, Text, NullWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            // drop the byte-offset key and emit only the cleaned record text
            for (Text v2 : v2s) {
                context.write(v2, NullWritable.get());
            }
        }
    }

(3) Complete sample code of LogCleanJob.java:

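The full listing is collapsed behind a "View Code" widget in the original post and is not reproduced here. The following is only a minimal driver sketch, assuming that the LogParser, MyMapper and MyReducer classes shown above are nested static classes of LogCleanJob; the actual LogCleanJob in the original post may differ in detail (for example, it may use ToolRunner):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogCleanJob {

        // ... the LogParser, MyMapper and MyReducer classes shown above go here ...

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "LogCleanJob");
            job.setJarByClass(LogCleanJob.class);

            // args[0]: input log file in HDFS, e.g. /project/techbbs/data/access_2013_05_30.log
            // args[1]: output directory for the cleaned data, e.g. /project/techbbs/cleaned/2013_05_30
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setMapperClass(MyMapper.class);
            job.setMapOutputKeyClass(LongWritable.class);
            job.setMapOutputValueClass(Text.class);

            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);

            boolean success = job.waitForCompletion(true);
            System.out.println(success ? "Clean process success!" : "Clean process failed!");
        }
    }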

(4) Export the jar package and upload it to the specified directory on the Linux server.

2.3 Regularly Cleaning the Logs on HDFS

Here we rewrite the scheduled-task script from above and add the automatic execution of the MapReduce cleaning program to it. The new contents are as follows:

#!/bin/sh

#step1. get yesterday's date string
yesterday=$(date --date='1 days ago' +%Y_%m_%d)
#step2. upload the log to HDFS
hadoop fs -put /usr/local/files/apache_logs/access_${yesterday}.log /project/techbbs/data
#step3. clean the log data
hadoop jar /usr/local/files/apache_logs/mycleaner.jar /project/techbbs/data/access_${yesterday}.log /project/techbbs/cleaned/${yesterday}

This script means that every day at 1:00, after the log file has been uploaded to HDFS, the data cleaning program filters the log file that has just been stored in HDFS and saves the filtered data under the cleaned directory.

2.4 Timed Task Test

(1) Since the two log files are from 2013, they were renamed to 2015 and to the previous day's date so that the scheduled task can find "yesterday's" file and the test can pass.

(2) Execute the command: techbbs_core.sh 2014_04_26

The console output is as follows; you can see that the number of records is greatly reduced after filtering:

15/04/26 04:27:20 INFO input.FileInputFormat: Total input paths to process : 1
15/04/26 04:27:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/04/26 04:27:20 WARN snappy.LoadSnappy: Snappy native library not loaded
15/04/26 04:27:22 INFO mapred.JobClient: Running job: job_201504260249_0002
15/04/26 04:27:23 INFO mapred.JobClient:  map 0% reduce 0%
15/04/26 04:28:01 INFO mapred.JobClient:  map 29% reduce 0%
15/04/26 04:28:07 INFO mapred.JobClient:  map 42% reduce 0%
15/04/26 04:28:10 INFO mapred.JobClient:  map 57% reduce 0%
15/04/26 04:28:13 INFO mapred.JobClient:  map 74% reduce 0%
15/04/26 04:28:16 INFO mapred.JobClient:  map 89% reduce 0%
15/04/26 04:28:19 INFO mapred.JobClient:  map 100% reduce 0%
15/04/26 04:28:49 INFO mapred.JobClient:  map 100% reduce 100%
15/04/26 04:28:50 INFO mapred.JobClient: Job complete: job_201504260249_0002
15/04/26 04:28:50 INFO mapred.JobClient: Counters: 29
15/04/26 04:28:50 INFO mapred.JobClient:   Job Counters
15/04/26 04:28:50 INFO mapred.JobClient:     Launched reduce tasks=1
15/04/26 04:28:50 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=58296
15/04/26 04:28:50 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/04/26 04:28:50 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/04/26 04:28:50 INFO mapred.JobClient:     Launched map tasks=1
15/04/26 04:28:50 INFO mapred.JobClient:     Data-local map tasks=1
15/04/26 04:28:50 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=25238
15/04/26 04:28:50 INFO mapred.JobClient:   File Output Format Counters
15/04/26 04:28:50 INFO mapred.JobClient:     Bytes Written=12794925
15/04/26 04:28:50 INFO mapred.JobClient:   FileSystemCounters
15/04/26 04:28:50 INFO mapred.JobClient:     FILE_BYTES_READ=14503530
15/04/26 04:28:50 INFO mapred.JobClient:     HDFS_BYTES_READ=61084325
15/04/26 04:28:50 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=29111500
15/04/26 04:28:50 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=12794925
15/04/26 04:28:50 INFO mapred.JobClient:   File Input Format Counters
15/04/26 04:28:50 INFO mapred.JobClient:     Bytes Read=61084192
15/04/26 04:28:50 INFO mapred.JobClient:   Map-Reduce Framework
15/04/26 04:28:50 INFO mapred.JobClient:     Map output materialized bytes=14503530
15/04/26 04:28:50 INFO mapred.JobClient:     Map input records=548160
15/04/26 04:28:50 INFO mapred.JobClient:     Reduce shuffle bytes=14503530
15/04/26 04:28:50 INFO mapred.JobClient:     Spilled Records=339714
15/04/26 04:28:50 INFO mapred.JobClient:     Map output bytes=14158741
15/04/26 04:28:50 INFO mapred.JobClient:     CPU time spent (ms)=21200
15/04/26 04:28:50 INFO mapred.JobClient:     Total committed heap usage (bytes)=229003264
15/04/26 04:28:50 INFO mapred.JobClient:     Combine input records=0
15/04/26 04:28:50 INFO mapred.JobClient:     SPLIT_RAW_BYTES=133
15/04/26 04:28:50 INFO mapred.JobClient:     Reduce input records=169857
15/04/26 04:28:50 INFO mapred.JobClient:     Reduce input groups=169857
15/04/26 04:28:50 INFO mapred.JobClient:     Combine output records=0
15/04/26 04:28:50 INFO mapred.JobClient:     Physical memory (bytes) snapshot=154001408
15/04/26 04:28:50 INFO mapred.JobClient:     Reduce output records=169857
15/04/26 04:28:50 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=689442816
15/04/26 04:28:50 INFO mapred.JobClient:     Map output records=169857
Clean process success!

(3) View the log data in HDFS via the web interface:

Unfiltered log data is stored under: /project/techbbs/data/

Filtered log data is stored under: /project/techbbs/cleaned/
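Each cleaned record keeps only the ip, time and url fields, separated by tabs, as written by the Mapper above. A hypothetical example of one cleaned line:

    110.52.250.126	20130530173820	data/cache/style_1_widthauto.css?y7a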

Original link: http://www.cnblogs.com/edisonchou/

