Counting invalid user data in TV set-top box records and writing the output in compressed format

Source: Internet
Author: User

Previously we learned how to use MapReduce counters. The following project consolidates what we learned.

1. Introduction

In this project we take TV set-top box data, count the invalid user data records, and write the parsed valid user data out in a compressed format.

2. Data set

The data set is the processed output of the earlier "Hadoop small file merge" project.

3. Analysis

Based on the requirements, we proceed in the following steps:

1. First, use Jsoup to parse the HTML-formatted set-top box data.

2. Write a Mapper class that uses custom counters to count the invalid set-top box records and writes the valid records out in a compressed format.
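Before turning to the Hadoop code, the control flow of step 2 can be modelled in plain Java. This is only a sketch under assumptions: the class `CounterSketch`, its `map` method and the `LogCounter` enum are hypothetical stand-ins for the real Mapper and its counters, and nothing here touches the Hadoop API. It just shows the rule that an empty parse result increments a bad-record tally, while a non-empty result is emitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hadoop-free model of the mapper's control flow (illustrative only).
final class CounterSketch {
    // Plays the role of an enum-declared counter.
    enum LogCounter { BAD_RECORDS }

    final Map<LogCounter, Long> counters = new EnumMap<>(LogCounter.class);
    final List<String> output = new ArrayList<>();

    // 'parsed' stands in for the list the parser returns for one input line.
    void map(List<String> parsed) {
        if (parsed.isEmpty()) {
            // Invalid input: count it and emit nothing.
            counters.merge(LogCounter.BAD_RECORDS, 1L, Long::sum);
        } else {
            // Valid input: emit every parsed record.
            output.addAll(parsed);
        }
    }

    long badRecords() {
        return counters.getOrDefault(LogCounter.BAD_RECORDS, 0L);
    }

    public static void main(String[] args) {
        CounterSketch sketch = new CounterSketch();
        sketch.map(Collections.<String>emptyList());
        sketch.map(Arrays.asList("rec-1", "rec-2"));
        System.out.println("bad=" + sketch.badRecords() + " emitted=" + sketch.output.size());
    }
}
```

In the real job the tally lives in the framework's counters rather than in a local map, so the totals are aggregated across all map tasks.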

4. Implementation

1. First, define a ParseTVData class that parses the input data set and returns the result as a List.

    package com.buaa;

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.commons.lang.StringUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    /**
     * @ProjectName CountSetTopBoxUserData
     * @PackageName com.buaa
     * @ClassName ParseTVData
     * @Description Parses set-top box data
     * @Author Liu Jishu
     * @Date 2016-05-28 16:15:08
     */
    public class ParseTVData {
        private static Logger logger = LoggerFactory.getLogger(ParseTVData.class);

        /**
         * Use Jsoup to parse the input data.
         *
         * @param text one raw set-top box record
         * @return list of parsed records; empty if the input is invalid
         */
        public static List<String> transData(String text) {
            List<String> list = new ArrayList<String>();
            try {
                // Parse the data with Jsoup
                Document doc = Jsoup.parse(text);
                Elements content = doc.getElementsByTag("WIC");
                // Set-top box number
                String stbNum = content.get(0).attr("stbNum");
                if (StringUtils.isEmpty(stbNum)) {
                    return list;
                }
                // Date
                String date = content.get(0).attr("date");
                Elements els = doc.getElementsByTag("A");
                if (els.isEmpty()) {
                    return list;
                }
                for (Element el : els) {
                    // End time
                    String e = el.attr("e");
                    // Start time
                    String s = el.attr("s");
                    // Channel name
                    String sn = el.attr("sn");
                    // Record layout: stbNum@date@sn@s@e
                    StringBuilder rec = new StringBuilder().append(stbNum).append("@")
                            .append(date).append("@").append(sn).append("@")
                            .append(s).append("@").append(e);
                    list.add(rec.toString());
                }
            } catch (Exception e) {
                // Any parse failure (including a missing WIC tag) yields an empty list
                logger.error("", e);
                return list;
            }
            return list;
        }
    }
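Each record that the parser emits appears to be five fields joined with "@": set-top box number, date, channel name, start time and end time. As a small sketch of that layout, the hypothetical helper below (`RecordFormat` is not part of the original project, and the sample values are made up) joins the fields and splits a record back into them, which is what a downstream consumer of the output would need to do.

```java
import java.util.Arrays;

// Hypothetical helper illustrating the "@"-separated record layout.
final class RecordFormat {
    static final String SEP = "@";

    // Field order assumed here: stbNum, date, sn (channel), s (start), e (end).
    static String join(String stbNum, String date, String sn, String s, String e) {
        return String.join(SEP, stbNum, date, sn, s, e);
    }

    // A limit of -1 keeps trailing empty fields, e.g. a missing end time.
    static String[] split(String record) {
        return record.split(SEP, -1);
    }

    public static void main(String[] args) {
        // Made-up sample values, not taken from the real data set.
        String rec = join("0001", "2012-09-17", "ChannelA", "00:00:01", "00:05:00");
        System.out.println(rec);
        System.out.println(Arrays.toString(split(rec)));
    }
}
```

Note that this simple split only works if no field contains "@" itself; the original data would need to be checked for that.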

2. Write the Mapper class, which uses custom counters to count invalid set-top box records and writes the valid records out in a compressed format.

    package com.buaa;

    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    /**
     * @ProjectName CountSetTopBoxUserData
     * @PackageName com.buaa
     * @ClassName CountUserData
     * @Description Counts invalid user data in TV set-top box records and writes compressed output
     * @Author Liu Jishu
     * @Date 2016-05-28 16:11:12
     */
    public class CountUserData extends Configured implements Tool {
        // Enum that declares the counter
        public static enum LOG_PROCESSOR_COUNTER {
            BAD_RECORDS
        }

        /**
         * Parse the data, count invalid records, and output valid records
         */
        public static class CounterAndCompressionMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws java.io.IOException, InterruptedException {
                // Parse each set-top box record into a list
                List<String> list = ParseTVData.transData(value.toString());
                if (list.isEmpty()) {
                    // Invalid record: dynamically named counter
                    context.getCounter("ErrorRecordCounter", "ERROR_Record_TVData").increment(1);
                    // Invalid record: enum-declared counter
                    context.getCounter(LOG_PROCESSOR_COUNTER.BAD_RECORDS).increment(1);
                } else {
                    for (String validateRecord : list) {
                        // Output the parsed data
                        context.write(new Text(validateRecord), new Text(""));
                    }
                }
            }
        }

        @SuppressWarnings("deprecation")
        @Override
        public int run(String[] args) throws Exception {
            // Read the configuration
            Configuration conf = new Configuration();
            // If the output directory already exists, delete it
            Path myPath = new Path(args[1]);
            FileSystem hdfs = myPath.getFileSystem(conf);
            if (hdfs.isDirectory(myPath)) {
                hdfs.delete(myPath, true);
            }
            // Create a new job (Job.getInstance is the non-deprecated form in Hadoop 2.x)
            Job job = new Job(conf, "CountUserData");
            // Set the main class
            job.setJarByClass(CountUserData.class);
            // Mapper
            job.setMapperClass(CounterAndCompressionMapper.class);
            // Output key type
            job.setOutputKeyClass(Text.class);
            // Output value type
            job.setOutputValueClass(Text.class);
            // Input path
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // Output path
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // Enable compression of the output
            FileOutputFormat.setCompressOutput(job, true);
            // Set the compression codec
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            String[] date = {"20120917", "20120918", "20120919", "20120920", "20120921", "20120922", "20120923"};
            int result = 1;
            // Run one job per day; the exit code is that of the last job
            for (String dt : date) {
                String[] args0 = {
                        "hdfs://hadoop1:9000/buaa/tv/" + dt + ".txt",
                        "hdfs://hadoop1:9000/buaa/tv/out/" + dt};
                result = ToolRunner.run(new Configuration(), new CountUserData(), args0);
            }
            System.exit(result);
        }
    }

5. Results

1. View the number of invalid records reported by the counters

2. View the compressed files in the output directory
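The job writes its output through GzipCodec, so the part files are ordinary gzip files and can be read with the JDK alone once copied to a local disk. The sketch below is an illustration, not part of the original project: the class `GzipPeek` and its methods are hypothetical, and the sample content is made up. It compresses a couple of lines in memory and reads them back with `GZIPInputStream`, which is how a local copy of a part file could be inspected.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for inspecting gzip-compressed text output.
final class GzipPeek {
    // Decompress a gzip stream and return its lines.
    static List<String> readLines(InputStream gz) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gz), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Compress text in memory; used here only to produce sample input.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample content standing in for a downloaded part file.
        byte[] data = gzip("rec-1\nrec-2\n");
        for (String line : readLines(new ByteArrayInputStream(data))) {
            System.out.println(line);
        }
    }
}
```

To read a real part file, pass a `FileInputStream` for the local copy instead of the in-memory bytes. On the cluster itself, `hadoop fs -text` on the part file prints the decompressed content without copying it first.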

If you are interested in the topics covered on this blog, please keep following my later posts. I am "Liu Chao ★ljc".

The copyright of this article belongs to the author and Blog Park. Reprinting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be placed in a prominent position on the page; otherwise the author reserves the right to pursue legal liability.

