Hadoop Applets: Data Filtering


1. There is a batch of router logs. Extract the MAC address and time from each record and discard everything else.

The log content format is as follows:

Apr 15 10:04:42 hostapd: wlan0: STA 14:7D:C5:9E:84
Apr 15 10:04:43 hostapd: wlan0: STA 14:7D:C5:9E:85
Apr 15 10:04:44 hostapd: wlan0: STA 14:7D:C5:9E:86
Apr 15 10:04:45 hostapd: wlan0: STA 14:7D:C5:9E:87
Apr 15 10:04:46 hostapd: wlan0: STA 14:7D:C5:9E:88
Apr 15 10:04:47 hostapd: wlan0: STA 14:7D:C5:9E:89
Apr 15 10:04:48 hostapd: wlan0: STA 14:7D:C5:9E:14
Apr 15 10:04:49 hostapd: wlan0: STA 14:7D:C5:9E:24
Apr 15 10:04:52 hostapd: wlan0: STA 14:7D:C5:9E:34
Apr 15 10:04:32 hostapd: wlan0: STA 14:7D:C5:9E:44
Apr 15 10:04:22 hostapd: wlan0: STA 14:7D:C5:9E:54

The filtered content format is:

Apr 15 10:04:42 14:7D:C5:9E:84
Apr 15 10:04:43 14:7D:C5:9E:85
Apr 15 10:04:44 14:7D:C5:9E:86
Apr 15 10:04:45 14:7D:C5:9E:87
Apr 15 10:04:46 14:7D:C5:9E:88
Apr 15 10:04:47 14:7D:C5:9E:89
Apr 15 10:04:48 14:7D:C5:9E:14
Apr 15 10:04:49 14:7D:C5:9E:24
Apr 15 10:04:52 14:7D:C5:9E:34
Apr 15 10:04:32 14:7D:C5:9E:44
Apr 15 10:04:22 14:7D:C5:9E:54
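
Before wiring this into MapReduce, note that the core transformation is just a whitespace split plus index selection: fields 0-2 are the date and time, field 6 is the MAC address. A minimal standalone sketch of that step (the class name SplitDemo is my own, not part of the original program):

public class SplitDemo {
    public static void main(String[] args) {
        // One sample record from the log above
        String line = "Apr 15 10:04:42 hostapd: wlan0: STA 14:7D:C5:9E:84";
        // Splitting on single spaces yields:
        // [0]=month [1]=day [2]=time [3]="hostapd:" [4]="wlan0:" [5]="STA" [6]=MAC
        String[] fields = line.split(" ");
        System.out.println(fields[0] + " " + fields[1] + " "
                + fields[2] + " " + fields[6]);
        // prints: Apr 15 10:04:42 14:7D:C5:9E:84
    }
}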

  

2. Algorithm idea

Source file -- Mapper (splits each raw record, emits the required fields, and counts malformed records) -- output to HDFS. See the note on reduce tasks below.
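
A detail worth noting: the program below never registers a reducer, so Hadoop runs its default identity reducer with a single reduce task. The same records come out (possibly in a different order), but if you want the mapper output written straight to HDFS with no shuffle/sort phase, one extra line in run() does it. This line is my addition and is not in the original program:

job.setNumReduceTasks(0);  // map-only job: mapper output goes directly to HDFS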

 

3. Write a program

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class App_1 extends Configured implements Tool {

    enum Counter {
        LINESKIP   // counts the records that failed to parse
    }

    /**
     * Mapper<LongWritable, Text, NullWritable, Text>
     * LongWritable, Text are the key/value types of the input data: the byte
     * offset of the first character of each log line is the key, and the
     * whole line is the value.
     * NullWritable, Text are the key/value types of the output data.
     */
    public static class RouterMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            try {
                String[] lineSplit = line.split(" "); // split the raw record
                String month = lineSplit[0];          // month
                String day   = lineSplit[1];          // day of month
                String time  = lineSplit[2];          // time
                String mac   = lineSplit[6];          // MAC address
                // Assemble the output record in the required format
                Text out = new Text(month + " " + day + " " + time + " " + mac);
                context.write(NullWritable.get(), out);
            } catch (ArrayIndexOutOfBoundsException e) {
                // Malformed record: increment the counter and skip this line
                context.getCounter(Counter.LINESKIP).increment(1);
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "App_1");                       // job name
        job.setJarByClass(App_1.class);                         // main class
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
        job.setMapperClass(RouterMapper.class);      // RouterMapper is the mapper
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);   // must match the mapper's output key type
        job.setOutputValueClass(Text.class);         // must match the mapper's output value type
        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    // Entry point: pass the input path and the output path as the two arguments
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new App_1(), args);
        System.exit(res);
    }
}
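If the log format varies in its whitespace, positional splitting is brittle; a regular expression is a more tolerant way to pull out the same fields. A hedged sketch (the pattern and the class name RegexDemo are my own, not from the original program):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    // Matches: <month> <day> <time> hostapd: wlan0: STA <MAC>
    private static final Pattern RECORD = Pattern.compile(
            "(\\w+)\\s+(\\d+)\\s+(\\S+)\\s+hostapd:\\s+wlan0:\\s+STA\\s+(\\S+)");

    public static void main(String[] args) {
        Matcher m = RECORD.matcher("Apr 15 10:04:42 hostapd: wlan0: STA 14:7D:C5:9E:84");
        if (m.find()) {
            // group(1)=month, group(2)=day, group(3)=time, group(4)=MAC
            System.out.println(m.group(1) + " " + m.group(2) + " "
                    + m.group(3) + " " + m.group(4));
        }
    }
}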

The router log file to be analyzed has been uploaded to the hdfs://h1:9000/user/coder/in directory of HDFS, where h1 is my NameNode host name. Configure the run parameters (the input path and the output path) before running. Note that the output directory must not already exist, or the job will fail.

 

4. After the job finishes, you can view the result directly in Eclipse.

  
