Using Hadoop's MultipleOutputFormat


I. Background

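    By default, a Hadoop MapReduce job writes its output with TextOutputFormat, producing files with sequentially numbered names such as part-00000 and part-00001 (part-r-00000 and part-r-00001 under the newer org.apache.hadoop.mapreduce API). Hadoop also provides the MultipleOutputFormat class; by subclassing it and overriding its methods you can choose your own output file names.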

II. Technical Details

1. Environment: Hadoop 0.19 on Linux (Hadoop 0.20.2 currently has poor support for MultipleOutputFormat).

2. An example implementation of MultipleOutputFormat:

 

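import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Progressable;

public class WordCount {

  // Emits (word, 1) for every whitespace-separated token in the input line.
  public static class TokenizerMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable count = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, count);
      }
    }
  }

  // Sums the counts for each word; also used as the combiner.
  public static class IntSumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      result.set(sum);
      output.collect(key, result);
    }
  }

  // Routes each record to a file named after the first letter of its key.
  public static class WordCountOutputFormat
      extends MultipleOutputFormat<Text, IntWritable> {
    private TextOutputFormat<Text, IntWritable> output = null;

    // Delegates the actual writing to a lazily created TextOutputFormat.
    @Override
    protected RecordWriter<Text, IntWritable> getBaseRecordWriter(
        FileSystem fs, JobConf job, String name, Progressable arg3)
        throws IOException {
      if (output == null) {
        output = new TextOutputFormat<Text, IntWritable>();
      }
      return output.getRecordWriter(fs, job, name, arg3);
    }

    // Keys starting with a-z go to "<letter>.txt"; all others go to "result.txt".
    @Override
    protected String generateFileNameForKeyValue(Text key,
        IntWritable value, String name) {
      char c = key.toString().toLowerCase().charAt(0);
      if (c >= 'a' && c <= 'z') {
        return c + ".txt";
      }
      return "result.txt";
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(WordCount.class);
    job.setJobName("wordcount");
    String[] otherArgs = new GenericOptionsParser(job, args)
        .getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setOutputFormat(WordCountOutputFormat.class); // set the output format
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    JobClient.runJob(job);
  }
}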

3. In main, set the output format with job.setOutputFormat(WordCountOutputFormat.class). The WordCountOutputFormat class extends MultipleOutputFormat and overrides getBaseRecordWriter and generateFileNameForKeyValue. In generateFileNameForKeyValue, the String name parameter holds the default output file name, part-00000:

public static class WordCountOutputFormat
    extends MultipleOutputFormat<Text, IntWritable> {
  private TextOutputFormat<Text, IntWritable> output = null;

  @Override
  protected RecordWriter<Text, IntWritable> getBaseRecordWriter(
      FileSystem fs, JobConf job, String name, Progressable arg3)
      throws IOException {
    if (output == null) {
      output = new TextOutputFormat<Text, IntWritable>();
    }
    return output.getRecordWriter(fs, job, name, arg3);
  }

  @Override
  protected String generateFileNameForKeyValue(Text key,
      IntWritable value, String name) {
    char c = key.toString().toLowerCase().charAt(0);
    if (c >= 'a' && c <= 'z') {
      return c + ".txt";
    }
    return "result.txt";
  }
}
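To see how this naming rule distributes keys without running a job, the logic of generateFileNameForKeyValue can be tried out in plain Java. The following is only a minimal sketch: the class FileNameRuleDemo, its methods, and the sample keys are hypothetical and are not part of Hadoop or of the original program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class for illustration; it is not part of Hadoop.
class FileNameRuleDemo {

    // Same rule as generateFileNameForKeyValue, applied to a plain String key.
    static String fileNameFor(String key) {
        char c = key.toLowerCase().charAt(0);
        if (c >= 'a' && c <= 'z') {
            return c + ".txt";
        }
        return "result.txt";
    }

    // Groups keys by the name of the file they would be written to.
    static Map<String, List<String>> groupByFile(List<String> keys) {
        Map<String, List<String>> files = new TreeMap<>();
        for (String key : keys) {
            files.computeIfAbsent(fileNameFor(key), f -> new ArrayList<>()).add(key);
        }
        return files;
    }

    public static void main(String[] args) {
        // Sample keys are assumptions chosen to exercise both branches of the rule.
        List<String> keys = Arrays.asList("Hadoop", "hive", "2010", "map", "_tmp");
        groupByFile(keys).forEach((file, ks) -> System.out.println(file + " <- " + ks));
    }
}
```

Keys that differ only in the case of their first letter share a file, and any key that does not start with a letter a-z falls through to result.txt.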

4. The program produces the following files:

-rw-r--r--   2 root supergroup          7 2010-08-07 17:44  /hua/multipleoutput1/c.txt
-rw-r--r--   2 root supergroup          6 2010-08-07 17:44  /hua/multipleoutput1/h.txt
-rw-r--r--   2 root supergroup          7 2010-08-07 17:44  /hua/multipleoutput1/k.txt
-rw-r--r--   2 root supergroup          6 2010-08-07 17:44  /hua/multipleoutput1/m.txt
-rw-r--r--   2 root supergroup         28 2010-08-07 17:44  /hua/multipleoutput1/result.txt
-rw-r--r--   2 root supergroup          6 2010-08-07 17:44  /hua/multipleoutput1/t.txt

If generateFileNameForKeyValue instead returns c + "_" + name + ".txt", the result is:

-rw-r--r--   2 root supergroup          7 2010-08-07 17:23  /hua/multipleoutput/c_part-00000.txt
-rw-r--r--   2 root supergroup          6 2010-08-07 17:23  /hua/multipleoutput/h_part-00000.txt
-rw-r--r--   2 root supergroup          7 2010-08-07 17:23  /hua/multipleoutput/k_part-00000.txt
-rw-r--r--   2 root supergroup          6 2010-08-07 17:23  /hua/multipleoutput/m_part-00000.txt
-rw-r--r--   2 root supergroup         28 2010-08-07 17:23  /hua/multipleoutput/result.txt
-rw-r--r--   2 root supergroup          6 2010-08-07 17:23  /hua/multipleoutput/t_part-00000.txt
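The effect of including name in the returned file name can be sketched in plain Java as well. This is an illustration under assumptions, not the program's actual code: the class PrefixedFileNameDemo is hypothetical, and "part-00001" is an assumed default name for a second reduce task, which the listing above does not show.

```java
// Hypothetical helper class for illustration; it is not part of Hadoop.
class PrefixedFileNameDemo {

    // Variant rule: first letter, underscore, default leaf name, ".txt".
    // Keys that do not start with a-z still go to "result.txt".
    static String fileNameFor(String key, String name) {
        char c = key.toLowerCase().charAt(0);
        if (c >= 'a' && c <= 'z') {
            return c + "_" + name + ".txt";
        }
        return "result.txt";
    }

    public static void main(String[] args) {
        // "part-00000" appears in the listing above;
        // "part-00001" is an assumed name for a second reduce task.
        for (String name : new String[] {"part-00000", "part-00001"}) {
            System.out.println(fileNameFor("cloud", name));
            System.out.println(fileNameFor("2010", name));
        }
    }
}
```

Because name differs for each default output file, the letter-prefixed names stay distinct across those files, while the fallback result.txt does not include name and so is the same in every case.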
III. Summary

Although this code uses the 0.19 API, it also works on Hadoop 0.20; the only difference is that the methods are reported as deprecated.
