Hadoop Applets: Data Filtering


1. There is a batch of router logs. Extract the MAC address and time from each record and discard everything else.

The log content format is as follows:

Apr 15 10:04:42 hostapd: wlan0: STA 14:7D:C5:9E:84
Apr 15 10:04:43 hostapd: wlan0: STA 14:7D:C5:9E:85
Apr 15 10:04:44 hostapd: wlan0: STA 14:7D:C5:9E:86
Apr 15 10:04:45 hostapd: wlan0: STA 14:7D:C5:9E:87
Apr 15 10:04:46 hostapd: wlan0: STA 14:7D:C5:9E:88
Apr 15 10:04:47 hostapd: wlan0: STA 14:7D:C5:9E:89
Apr 15 10:04:48 hostapd: wlan0: STA 14:7D:C5:9E:14
Apr 15 10:04:49 hostapd: wlan0: STA 14:7D:C5:9E:24
Apr 15 10:04:52 hostapd: wlan0: STA 14:7D:C5:9E:34
Apr 15 10:04:32 hostapd: wlan0: STA 14:7D:C5:9E:44
Apr 15 10:04:22 hostapd: wlan0: STA 14:7D:C5:9E:54

The filtered content format is:

Apr 15 10:04:42 14:7D:C5:9E:84
Apr 15 10:04:43 14:7D:C5:9E:85
Apr 15 10:04:44 14:7D:C5:9E:86
Apr 15 10:04:45 14:7D:C5:9E:87
Apr 15 10:04:46 14:7D:C5:9E:88
Apr 15 10:04:47 14:7D:C5:9E:89
Apr 15 10:04:48 14:7D:C5:9E:14
Apr 15 10:04:49 14:7D:C5:9E:24
Apr 15 10:04:52 14:7D:C5:9E:34
Apr 15 10:04:32 14:7D:C5:9E:44
Apr 15 10:04:22 14:7D:C5:9E:54
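
Before wiring this into MapReduce, note that the core transformation is just a whitespace split plus index selection: fields 0-2 are the date and time, field 6 is the MAC address. A minimal standalone sketch of that step (the class name SplitDemo is my own, not part of the original program):

public class SplitDemo {
    public static void main(String[] args) {
        // One sample record from the log above
        String line = "Apr 15 10:04:42 hostapd: wlan0: STA 14:7D:C5:9E:84";
        // Splitting on single spaces yields:
        // [0]=month [1]=day [2]=time [3]="hostapd:" [4]="wlan0:" [5]="STA" [6]=MAC
        String[] fields = line.split(" ");
        System.out.println(fields[0] + " " + fields[1] + " "
                + fields[2] + " " + fields[6]);
        // prints: Apr 15 10:04:42 14:7D:C5:9E:84
    }
}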

  

2. Algorithm idea

Source file -- Mapper (splits each raw record, emits the required fields, and counts malformed records) -- output to HDFS. See the note on reduce tasks below.
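
A detail worth noting: the program below never registers a reducer, so Hadoop runs its default identity reducer with a single reduce task. The same records come out (possibly in a different order), but if you want the mapper output written straight to HDFS with no shuffle/sort phase, one extra line in run() does it. This line is my addition and is not in the original program:

job.setNumReduceTasks(0);  // map-only job: mapper output goes directly to HDFS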

 

3. Write a program

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class App_1 extends Configured implements Tool {

    enum Counter {
        LINESKIP   // counts the records that failed to parse
    }

    /**
     * Mapper<LongWritable, Text, NullWritable, Text>
     * LongWritable, Text are the key/value types of the input data: the byte
     * offset of the first character of each log line is the key, and the
     * whole line is the value.
     * NullWritable, Text are the key/value types of the output data.
     */
    public static class RouterMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            try {
                String[] lineSplit = line.split(" "); // split the raw record
                String month = lineSplit[0];          // month
                String day   = lineSplit[1];          // day of month
                String time  = lineSplit[2];          // time
                String mac   = lineSplit[6];          // MAC address
                // Assemble the output record in the required format
                Text out = new Text(month + " " + day + " " + time + " " + mac);
                context.write(NullWritable.get(), out);
            } catch (ArrayIndexOutOfBoundsException e) {
                // Malformed record: increment the counter and skip this line
                context.getCounter(Counter.LINESKIP).increment(1);
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "App_1");                       // job name
        job.setJarByClass(App_1.class);                         // main class
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
        job.setMapperClass(RouterMapper.class);      // RouterMapper is the mapper
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);   // must match the mapper's output key type
        job.setOutputValueClass(Text.class);         // must match the mapper's output value type
        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    // Entry point: pass the input path and the output path as the two arguments
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new App_1(), args);
        System.exit(res);
    }
}
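If the log format varies in its whitespace, positional splitting is brittle; a regular expression is a more tolerant way to pull out the same fields. A hedged sketch (the pattern and the class name RegexDemo are my own, not from the original program):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    // Matches: <month> <day> <time> hostapd: wlan0: STA <MAC>
    private static final Pattern RECORD = Pattern.compile(
            "(\\w+)\\s+(\\d+)\\s+(\\S+)\\s+hostapd:\\s+wlan0:\\s+STA\\s+(\\S+)");

    public static void main(String[] args) {
        Matcher m = RECORD.matcher("Apr 15 10:04:42 hostapd: wlan0: STA 14:7D:C5:9E:84");
        if (m.find()) {
            // group(1)=month, group(2)=day, group(3)=time, group(4)=MAC
            System.out.println(m.group(1) + " " + m.group(2) + " "
                    + m.group(3) + " " + m.group(4));
        }
    }
}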

The router log file to be analyzed has been uploaded to the hdfs://h1:9000/user/coder/in directory of HDFS, where h1 is my NameNode host name. Configure the run parameters (the input path and the output path) before running. Note that the output directory must not already exist, or the job will fail.

 

4. After the job finishes, you can view the result directly in Eclipse.

  
