public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The Constants fields are the author's own; MAO_HAO is presumably the ":" separator.
    conf.set("mapred.job.tracker",
            Constants.HADOOP_MAIN_IP + Constants.MAO_HAO + Constants.HADOOP_MAIN_PORT);
    if (args.length != 3) {
        System.err.println("Usage: Data Deduplication <in> <out> <reduceNum>");
        System.exit(2);
    }

    Job job = Job.getInstance(conf, "EtlTld job");
    job.setJarByClass(EtlTldMain.class);
    job.setMapperClass(EtlTldMapper.class);
    job.setReducerClass(EtlTldReducer.class);
    // Read LZO-compressed text input (class from the hadoop-lzo library).
    job.setInputFormatClass(LzoTextInputFormat.class);
    job.setNumReduceTasks(Integer.parseInt(args[2]));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Compress the job output with lzop so that it can be indexed afterwards.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class);

    int result = job.waitForCompletion(true) ? 0 : 1;
    if (result == 0) {
        // On success, build index files for the .lzo files in the output directory.
        LzoIndexer lzoIndexer = new LzoIndexer(conf);
        lzoIndexer.index(new Path(args[1]));
    }
    System.exit(result);
}
If you already have an LZO file, you can add an index to it as follows:
bin/yarn jar /module/cloudera/parcels/GPLEXTRAS-5.4.0-1.cdh5.4.0.p0.27/lib/hadoop/lib/hadoop-lzo-0.4.15-cdh5.4.0.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/cndns.db/ods_cndns_log/dt=20160803/node=alicn/part-r-00000.lzo
LZO files are not splittable by default. You need to build an index file for an LZO file before multiple map tasks can process it in parallel.
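To make the role of the index concrete, here is a simplified sketch of how a list of compressed-block start offsets lets a file be cut into splits that each begin on a block boundary. It is an illustration only, not hadoop-lzo's actual code: the class LzoSplitSketch, its methods, and the offsets used are all hypothetical, and a real index is read from the .index file next to the .lzo file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: align raw byte-range splits to compressed-block boundaries.
public class LzoSplitSketch {

    // Move a raw boundary forward to the start of the next compressed block.
    // Position 0 stays at 0; a position past the last block maps to the end of the file.
    static long alignToBlock(long pos, long[] blockOffsets, long fileLength) {
        if (pos == 0) {
            return 0;
        }
        for (long offset : blockOffsets) { // offsets are assumed sorted ascending
            if (offset >= pos) {
                return offset;
            }
        }
        return fileLength;
    }

    // Returns splits as {start, end} pairs, with end exclusive.
    static List<long[]> computeSplits(long fileLength, long splitSize, long[] blockOffsets) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        // Without an index there is no safe place to cut: the whole file is one split.
        if (blockOffsets == null || blockOffsets.length == 0) {
            splits.add(new long[] {0, fileLength});
            return splits;
        }
        for (long rawStart = 0; rawStart < fileLength; rawStart += splitSize) {
            long rawEnd = Math.min(rawStart + splitSize, fileLength);
            long start = alignToBlock(rawStart, blockOffsets, fileLength);
            long end = alignToBlock(rawEnd, blockOffsets, fileLength);
            if (start < end) { // skip ranges that contain no block start
                splits.add(new long[] {start, end});
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] offsets = {40, 300, 560, 820}; // made-up block start offsets
        for (long[] s : computeSplits(1000, 400, offsets)) {
            System.out.println(s[0] + "-" + s[1]);
        }
    }
}
```

With these made-up offsets, a 1000-byte file and a 400-byte target split size give the three splits 0-560, 560-820 and 820-1000, so three map tasks can work on the file. With no offsets at all, the same call returns the single split 0-1000, which is the unindexed case described above.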
Reference: http://blog.csdn.net/wisgood/article/details/17080361
Hadoop: writing LZO output files and adding an index