Website Log Analysis Project Case (II): Data Cleansing (Mini MapReduce)

Source: Internet
Author: User
Tags: hadoop fs

1. Data Situation Analysis

1.1 Data Situation Review

The forum data has two parts:

(1) Historical data, about 56GB, with statistics up to 2012-05-29. Before that date, all logs were appended to a single file.

(2) Since 2013-05-30, a separate data file of about 150MB has been generated every day; the logs are no longer kept in one file.

Figure 1 shows the record format of the log data. Each row has 5 parts: the visitor's IP, the access time, the resource accessed, the access status (HTTP status code), and the traffic of this access.

Figure 1: Log record format

The data used here comes from two 2013 log files, access_2013_05_30.log and access_2013_05_31.log, available at: http://pan.baidu.com/s/1pje7xr9

1.2 Data to Be Cleaned

(1) According to the key-indicator analysis in the previous article, our statistics do not involve the access status (HTTP status code) or the traffic of the access, so these two fields can be cleaned out first.

(2) Given the format of the log records, the date needs to be converted into a more common format such as 20150426, so we write a class that converts the date of each log record.

(3) Access requests for static resources are meaningless for our analysis, so records beginning with "GET /staticsource/" are filtered out; the "GET" and "POST" strings themselves also carry no meaning for us, so they can be omitted as well.

2. The Data Cleaning Process

2.1 Regularly Uploading Logs to HDFS

First, the log data must be uploaded to HDFS for processing. Depending on the situation:

(1) If the log server holds little data and is under little load, a shell command can upload the data to HDFS directly;

(2) If the log server holds more data and is under heavier load, use NFS to upload the data from another server;

(3) If there are many log servers and the data volume is very large, use Flume for the transfer.

Our experimental data files are small, so we simply take the first approach and use shell commands. Because a log file is generated every day, we need a scheduled task that automatically uploads the previous day's log file to the specified HDFS directory at 1 o'clock the next morning. We therefore create the shell script techbbs_core.sh and combine it with crontab. The script reads:

    #!/bin/sh

    #step1. get yesterday's date string
    yesterday=$(date --date='1 days ago' +%Y_%m_%d)

    #step2. upload the log to HDFS
    hadoop fs -put /usr/local/files/apache_logs/access_${yesterday}.log /project/techbbs/data
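For reference, the upload in step2 above could also be done from Java through Hadoop's FileSystem API. This is only a minimal sketch and not part of the article's setup: the class name UploadLog is made up, and the local and HDFS paths are simply copied from the script above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UploadLog {
        public static void main(String[] args) throws Exception {
            String day = args[0]; // e.g. "2013_05_30"
            Configuration conf = new Configuration(); // reads core-site.xml from the classpath
            FileSystem fs = FileSystem.get(conf);

            Path local = new Path("/usr/local/files/apache_logs/access_" + day + ".log");
            Path remote = new Path("/project/techbbs/data");

            // Equivalent of: hadoop fs -put <local file> /project/techbbs/data
            fs.copyFromLocalFile(local, remote);
            fs.close();
        }
    }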
The script is then combined with crontab and set as a recurring task executed automatically at 1:00 every day. Run crontab -e and add the following entry (the 1 in the hour field means 1:00 each day; techbbs_core.sh is the script to execute):

    0 1 * * * techbbs_core.sh

Verification: the command crontab -l lists the scheduled tasks that have been set.

2.2 Writing the MapReduce Program That Cleans the Logs

(1) Write a log-parsing class that separately parses the five components of each log record:

    static class LogParser {
        public static final SimpleDateFormat FORMAT = new SimpleDateFormat(
                "d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        public static final SimpleDateFormat dateformat1 = new SimpleDateFormat(
                "yyyyMMddHHmmss");

        /**
         * Parse the English date/time string of a log record
         */
        private Date parseDateFormat(String string) {
            Date parse = null;
            try {
                parse = FORMAT.parse(string);
            } catch (ParseException e) {
                e.printStackTrace();
            }
            return parse;
        }

        /**
         * Parse one log line
         * @return an array of 5 elements: ip, time, url, status, traffic
         */
        public String[] parse(String line) {
            String ip = parseIP(line);
            String time = parseTime(line);
            String url = parseURL(line);
            String status = parseStatus(line);
            String traffic = parseTraffic(line);
            return new String[] { ip, time, url, status, traffic };
        }

        private String parseTraffic(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1).trim();
            String traffic = trim.split(" ")[1];
            return traffic;
        }

        private String parseStatus(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1).trim();
            String status = trim.split(" ")[0];
            return status;
        }

        private String parseURL(String line) {
            final int first = line.indexOf("\"");
            final int last = line.lastIndexOf("\"");
            String url = line.substring(first + 1, last);
            return url;
        }

        private String parseTime(String line) {
            final int first = line.indexOf("[");
            final int last = line.indexOf("+0800]");
            String time = line.substring(first + 1, last).trim();
            Date date = parseDateFormat(time);
            return dateformat1.format(date);
        }

        private String parseIP(String line) {
            String ip = line.split("- -")[0].trim();
            return ip;
        }
    }
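Before wiring the parser into MapReduce, it can be sanity-checked locally with a throwaway main method. This is only an illustration: the class name LogParserDemo is made up, the sample line merely follows the five-field format described in section 1.1, and it assumes the LogParser class above is visible from this class (for example, copied next to it or referenced as a nested member of the job class).

    public class LogParserDemo {
        public static void main(String[] args) {
            // An illustrative line in the format from section 1.1:
            // ip - - [time +0800] "request" status traffic
            String line = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] "
                    + "\"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";

            String[] fields = new LogParser().parse(line);
            System.out.println("ip      = " + fields[0]); // 27.19.74.143
            System.out.println("time    = " + fields[1]); // 20130530173820
            System.out.println("url     = " + fields[2]); // GET /static/image/common/faq.gif HTTP/1.1
            System.out.println("status  = " + fields[3]); // 200
            System.out.println("traffic = " + fields[4]); // 1127
        }
    }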
(2) Write a MapReduce program that filters the records of a given log file.

Mapper class:

    static class MyMapper extends
            Mapper<LongWritable, Text, LongWritable, Text> {
        LogParser logParser = new LogParser();
        Text outputValue = new Text();

        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            final String[] parsed = logParser.parse(value.toString());

            // step1. filter out requests for static resources
            if (parsed[2].startsWith("GET /static/")
                    || parsed[2].startsWith("GET /uc_server")) {
                return;
            }
            // step2. strip the leading "GET /" or "POST /"
            if (parsed[2].startsWith("GET /")) {
                parsed[2] = parsed[2].substring("GET /".length());
            } else if (parsed[2].startsWith("POST /")) {
                parsed[2] = parsed[2].substring("POST /".length());
            }
            // step3. strip the trailing " HTTP/1.1"
            if (parsed[2].endsWith(" HTTP/1.1")) {
                parsed[2] = parsed[2].substring(0,
                        parsed[2].length() - " HTTP/1.1".length());
            }
            // step4. write out only the first three fields (ip, time, url)
            outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
            context.write(key, outputValue);
        }
    }

Reducer class:

    static class MyReducer extends
            Reducer<LongWritable, Text, Text, NullWritable> {
        protected void reduce(LongWritable k2, java.lang.Iterable<Text> v2s,
                Context context)
                throws java.io.IOException, InterruptedException {
            for (Text v2 : v2s) {
                context.write(v2, NullWritable.get());
            }
        }
    }

(3) The complete sample code lives in LogCleanJob.java; a minimal driver sketch is shown below for reference.
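The article does not reproduce LogCleanJob.java itself, so the following is only a minimal sketch, assuming MyMapper and MyReducer are nested inside a LogCleanJob class as above; the author's actual driver may differ (for example, it may use the Tool interface).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogCleanJob {
        public static void main(String[] args) throws Exception {
            // args[0] = input log file in HDFS, args[1] = output directory in HDFS
            Configuration conf = new Configuration();
            Job job = new Job(conf, "LogCleanJob");
            job.setJarByClass(LogCleanJob.class);

            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);

            // map emits <LongWritable, Text>, reduce emits <Text, NullWritable>
            job.setMapOutputKeyClass(LongWritable.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);

            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            boolean success = job.waitForCompletion(true);
            if (success) {
                System.out.println("Clean process success!"); // matches the console output in 2.4
            }
            System.exit(success ? 0 : 1);
        }

        // static class LogParser  { ... }  (shown above)
        // static class MyMapper   { ... }  (shown above)
        // static class MyReducer  { ... }  (shown above)
    }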
(4) Export the jar package and upload it to the specified directory on the Linux server.

2.3 Periodically Cleaning the Logs in HDFS

Here we rewrite the scheduled-task script and add the MapReduce program that automates the cleanup, so the script now reads:

    #!/bin/sh

    #step1. get yesterday's date string
    yesterday=$(date --date='1 days ago' +%Y_%m_%d)

    #step2. upload the log to HDFS
    hadoop fs -put /usr/local/files/apache_logs/access_${yesterday}.log /project/techbbs/data

    #step3. clean the log data
    hadoop jar /usr/local/files/apache_logs/mycleaner.jar /project/techbbs/data/access_${yesterday}.log /project/techbbs/cleaned/${yesterday}

This script means that at 1:00 every day the log file is uploaded to HDFS, and then the data-cleaning program filters the log file that has just been stored in HDFS, writing the filtered data into the cleaned directory.

2.4 Testing the Scheduled Task

(1) Because the two log files are from 2013, their names are changed to 2015 and to the previous day's date so that the task can be tested.

(2) Execute the command techbbs_core.sh 2014_04_26. The console output is shown below; you can see that the number of records left after filtering is much smaller:

    15/04/26 04:27:20 INFO input.FileInputFormat: Total input paths to process : 1
    15/04/26 04:27:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    15/04/26 04:27:20 WARN snappy.LoadSnappy: Snappy native library not loaded
    15/04/26 04:27:22 INFO mapred.JobClient: Running job: job_201504260249_0002
    15/04/26 04:27:23 INFO mapred.JobClient:  map 0% reduce 0%
    15/04/26 04:28:01 INFO mapred.JobClient:  map 29% reduce 0%
    15/04/26 04:28:07 INFO mapred.JobClient:  map 42% reduce 0%
    15/04/26 04:28:10 INFO mapred.JobClient:  map 57% reduce 0%
    15/04/26 04:28:13 INFO mapred.JobClient:  map 74% reduce 0%
    15/04/26 04:28:16 INFO mapred.JobClient:  map 89% reduce 0%
    15/04/26 04:28:19 INFO mapred.JobClient:  map 100% reduce 0%
    15/04/26 04:28:49 INFO mapred.JobClient:  map 100% reduce 100%
    15/04/26 04:28:50 INFO mapred.JobClient: Job complete: job_201504260249_0002
    15/04/26 04:28:50 INFO mapred.JobClient: Counters: 29
    15/04/26 04:28:50 INFO mapred.JobClient:   Job Counters
    15/04/26 04:28:50 INFO mapred.JobClient:     Launched reduce tasks=1
    15/04/26 04:28:50 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=58296
    15/04/26 04:28:50 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
    15/04/26 04:28:50 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
    15/04/26 04:28:50 INFO mapred.JobClient:     Launched map tasks=1
    15/04/26 04:28:50 INFO mapred.JobClient:     Data-local map tasks=1
    15/04/26 04:28:50 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=25238
    15/04/26 04:28:50 INFO mapred.JobClient:   File Output Format Counters
    15/04/26 04:28:50 INFO mapred.JobClient:     Bytes Written=12794925
    15/04/26 04:28:50 INFO mapred.JobClient:   FileSystemCounters
    15/04/26 04:28:50 INFO mapred.JobClient:     FILE_BYTES_READ=14503530
    15/04/26 04:28:50 INFO mapred.JobClient:     HDFS_BYTES_READ=61084325
    15/04/26 04:28:50 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=29111500
    15/04/26 04:28:50 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=12794925
    15/04/26 04:28:50 INFO mapred.JobClient:   File Input Format Counters
    15/04/26 04:28:50 INFO mapred.JobClient:     Bytes Read=61084192
    15/04/26 04:28:50 INFO mapred.JobClient:   Map-Reduce Framework
    15/04/26 04:28:50 INFO mapred.JobClient:     Map output materialized bytes=14503530
    15/04/26 04:28:50 INFO mapred.JobClient:     Map input records=548160
    15/04/26 04:28:50 INFO mapred.JobClient:     Reduce shuffle bytes=14503530
    15/04/26 04:28:50 INFO mapred.JobClient:     Spilled Records=339714
    15/04/26 04:28:50 INFO mapred.JobClient:     Map output bytes=14158741
    15/04/26 04:28:50 INFO mapred.JobClient:     CPU time spent (ms)=21200
    15/04/26 04:28:50 INFO mapred.JobClient:     Total committed heap usage (bytes)=229003264
    15/04/26 04:28:50 INFO mapred.JobClient:     Combine input records=0
    15/04/26 04:28:50 INFO mapred.JobClient:     SPLIT_RAW_BYTES=133
    15/04/26 04:28:50 INFO mapred.JobClient:     Reduce input records=169857
    15/04/26 04:28:50 INFO mapred.JobClient:     Reduce input groups=169857
    15/04/26 04:28:50 INFO mapred.JobClient:     Combine output records=0
    15/04/26 04:28:50 INFO mapred.JobClient:     Physical memory (bytes) snapshot=154001408
    15/04/26 04:28:50 INFO mapred.JobClient:     Reduce output records=169857
    15/04/26 04:28:50 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=689442816
    15/04/26 04:28:50 INFO mapred.JobClient:     Map output records=169857
    Clean process success!

(3) Viewing the log data in HDFS through the web interface:

The unfiltered log data is stored in: /project/techbbs/data/
The filtered log data is stored in: /project/techbbs/cleaned/
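As an alternative to browsing the web interface, the cleaned result can also be spot-checked from code. The sketch below is an illustration only: the class name PreviewCleaned is made up, and it assumes the job's output files follow the usual part-* naming. It prints the first few cleaned records, which should each contain only ip, time, and url separated by tabs.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PreviewCleaned {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml on the classpath
            FileSystem fs = FileSystem.get(conf);

            // pass the cleaned directory, e.g. /project/techbbs/cleaned/2015_04_25
            Path cleanedDir = new Path(args[0]);
            for (FileStatus status : fs.listStatus(cleanedDir)) {
                if (!status.getPath().getName().startsWith("part-")) {
                    continue; // skip _SUCCESS and other non-data files
                }
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(status.getPath())))) {
                    String line;
                    int shown = 0;
                    // each cleaned record should look like: ip \t time \t url
                    while ((line = reader.readLine()) != null && shown++ < 5) {
                        System.out.println(line);
                    }
                }
            }
        }
    }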
