Removing the excess tab at the end of Hadoop Streaming output lines

One of our business groups has been using Hadoop Streaming to process compressed text logs. The jobs basically just set the output to BZ2 format and pass records straight through, with no processing logic in the mapper at all. But every output line ends with an extra tab. Now a downstream business needs the last field before that tab, so the tab has to be removed.

It is a small problem, but a round of searching the internet turned up no good solution. Plenty of people have hit it, but our jobs are a special case: map only, no reduce. The question at http://stackoverflow.com/questions/20137618/hadoop-streaming-api-how-to-remove-unwanted-delimiters even states flatly, "as I discussed with friends, there's no easy way to achieve the goal ...".

Streaming has a convention: by default, key and value are separated by a tab. If you do not configure the number of key fields, everything before the first tab in a line is the key and the rest is the value; if no tab is found, the whole line becomes the key and the value is empty. The tab at the end of our lines is exactly that separator, written between the key and the (empty) value.
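
To make the rule concrete, here is a minimal sketch (illustrative only, not streaming's actual code) of the default split and of how writing "key + tab + value" back out reproduces the trailing tab when a line contains no tab:

public class TrailingTabDemo {
  public static void main(String[] args) {
    String line = "2015-01-30 10:00:01 GET /index.html 200";  // a log line with no tab in it
    int pos = line.indexOf('\t');
    String key = (pos == -1) ? line : line.substring(0, pos);
    String value = (pos == -1) ? "" : line.substring(pos + 1);
    // The job output is written as "key <tab> value"; with an empty value
    // this leaves a lone tab at the end of every line.
    System.out.println(key + "\t" + value + "|");  // the "|" makes the trailing tab visible
  }
}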

The first step was to look at streaming's map side, in PipeMapper.java. An InputWriter is what writes the data handed to the user's mapper, so the first attempt was to plug in a custom one. In the job configuration, stream.map.input.writer.class is the parameter that names the InputWriter, and the default is TextInputWriter. Streaming is rather a trap here: adding a -D stream.map.input.writer.class=xxx option does not make streaming use the custom class. You have to implement your own IdentifierResolver, which then picks an InputWriter for each input type, and the input type has to be passed in via the stream.map.input option. Whether the settings took effect can be checked in the job's configuration parameters on the JobTracker page while the job runs.
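
For reference, a custom resolver looks roughly like the sketch below. It is a hedged sketch against the org.apache.hadoop.streaming.io classes as I read them; MyInputWriter and the "mytext" identifier are made-up names, and exact signatures may differ between Hadoop versions:

package my.streaming;   // hypothetical package

import org.apache.hadoop.io.Text;
import org.apache.hadoop.streaming.io.IdentifierResolver;
import org.apache.hadoop.streaming.io.TextOutputReader;

// Sketch: when the job sets stream.map.input=mytext, hand back our own
// InputWriter (MyInputWriter, a hypothetical class that extends InputWriter);
// otherwise fall back to the default resolution (text / rawbytes / typedbytes).
public class MyIdentifierResolver extends IdentifierResolver {
  @Override
  public void resolve(String identifier) {
    if ("mytext".equalsIgnoreCase(identifier)) {
      setInputWriterClass(MyInputWriter.class);
      setOutputReaderClass(TextOutputReader.class);
      setOutputKeyClass(Text.class);
      setOutputValueClass(Text.class);
    } else {
      super.resolve(identifier);
    }
  }
}

The resolver itself is then named with -D stream.io.identifier.resolver.class=my.streaming.MyIdentifierResolver (check the exact parameter name against your streaming version).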

Unfortunately, with the custom InputWriter in place of TextInputWriter, the trailing tab was gone but a string of numbers appeared at the beginning of each line. Presumably that is the key Hadoop passes to the mapper being printed out. Orz... but rather than guess, just read the code.

Fortunately, the code is pretty short.

Streaming packages itself, together with the files specified by the -file, -cacheFile, and -cacheArchive options, into a jar that is submitted to the cluster as an MR job. The (k,v) pairs the cluster hands out are written to the user's mapper as its input, and the mapper's output is read back as the output of the whole map job. That writing and reading is done by InputWriter and OutputReader; relative to the user's program, streaming plays the role of writer/reader. In simple terms:

Hadoop-provided (k,v) ---streaming InputWriter---> user-defined mapper ---streaming OutputReader---> Hadoop output

Streaming launches the user process via PipeMapRunner, asynchronously collects the user program's output, and reports progress back to Hadoop. The underlying setup and submission of the whole job is done by the StreamJob class.

The actual execution lives in the PipeMapRed / PipeMapper / PipeReducer / PipeCombiner classes, and that is where the fix goes. In the run() method of MROutputThread (inside PipeMapRed), right before the call outCollector.collect(key, value), add the following snippet:

if (value instanceof Text) {
  if (value.toString().isEmpty()) {
    value = NullWritable.get();
  }
}

Isn't that simple?


Why does this work? The answer is in org.apache.hadoop.mapreduce.lib.output.TextOutputFormat. Straight to the code:

package org.apache.hadoop.mapreduce.lib.output;

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.util.*;

/** An {@link OutputFormat} that writes plain text files. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
  public static String SEPERATOR = "mapreduce.output.textoutputformat.separator";

  protected static class LineRecordWriter<K, V>
      extends RecordWriter<K, V> {
    private static final String utf8 = "UTF-8";
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }

    protected DataOutputStream out;
    private final byte[] keyValueSeparator;

    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }

    public LineRecordWriter(DataOutputStream out) {
      this(out, "\t");
    }

    /**
     * Write the object to the byte stream, handling Text as a special
     * case.
     * @param o the object to print
     * @throws IOException if the write throws, we pass it on
     */
    private void writeObject(Object o) throws IOException {
      if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength());
      } else {
        out.write(o.toString().getBytes(utf8));
      }
    }

    public synchronized void write(K key, V value)
        throws IOException {
      boolean nullKey = key == null || key instanceof NullWritable;
      boolean nullValue = value == null || value instanceof NullWritable;
      if (nullKey && nullValue) {
        return;
      }
      if (!nullKey) {
        writeObject(key);
      }
      if (!(nullKey || nullValue)) {
        out.write(keyValueSeparator);
      }
      if (!nullValue) {
        writeObject(value);
      }
      out.write(newline);
    }

    public synchronized
    void close(TaskAttemptContext context) throws IOException {
      out.close();
    }
  }

  public RecordWriter<K, V>
         getRecordWriter(TaskAttemptContext job
                         ) throws IOException, InterruptedException {
    Configuration conf = job.getConfiguration();
    boolean isCompressed = getCompressOutput(job);
    String keyValueSeparator = conf.get(SEPERATOR, "\t");
    CompressionCodec codec = null;
    String extension = "";
    if (isCompressed) {
      Class<? extends CompressionCodec> codecClass =
        getOutputCompressorClass(job, GzipCodec.class);
      codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
      extension = codec.getDefaultExtension();
    }
    Path file = getDefaultWorkFile(job, extension);
    FileSystem fs = file.getFileSystem(conf);
    if (!isCompressed) {
      FSDataOutputStream fileOut = fs.create(file, false);
      return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
    } else {
      FSDataOutputStream fileOut = fs.create(file, false);
      return new LineRecordWriter<K, V>(new DataOutputStream
                                        (codec.createOutputStream(fileOut)),
                                        keyValueSeparator);
    }
  }
}

Did you notice LineRecordWriter.write? When the value is null or a NullWritable, the key/value separator is simply not written, so the trailing tab disappears.


Postscript:

A. There are plenty of suggestions out there to change the delimiter, replacing the tab with an empty string. That is a very crude practice and basically buries a landmine. Why?

Log text can be arbitrarily rich; the problem this time arose precisely because no line contained a tab. If the text does contain tabs and you set the separator to an empty string, the first tab inside each line (the one streaming used to split key from value) silently vanishes from the output, corrupting the log.
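
A tiny illustration of the landmine, assuming a hypothetical log line that does contain a tab:

public class EmptySeparatorDemo {
  public static void main(String[] args) {
    String line = "2015-01-30 10:00:01\tuid=42";   // a real tab inside the log line
    int pos = line.indexOf('\t');
    String key = line.substring(0, pos);            // "2015-01-30 10:00:01"
    String value = line.substring(pos + 1);         // "uid=42"
    String separator = "";                          // the "crude" workaround
    // The tab that belonged to the log line is gone from the output:
    System.out.println(key + separator + value);    // "2015-01-30 10:00:01uid=42"
  }
}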

B. The idea for this fix also came from a StackOverflow Q&A: http://stackoverflow.com/questions/18133290/hadoop-streaming-remove-trailing-tab-from-reducer-output. That answer likewise resorts to changing the delimiter, which, as above, is not advisable. But digging further, you can instead modify the LineRecordWriter.write method in your own rewritten copy of TextOutputFormat<K,V>.

Rewriting TextOutputFormat is actually the more elegant solution. It looks like modifying Hadoop itself, but compared with patching streaming it saves you from having to change, recompile, and repackage streaming for every new version. Besides, streaming is not a standalone project; compiling it means compiling Hadoop!
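
For the record, a sketch of what that change might look like. Assume you have copied TextOutputFormat into your own class (call it MyTextOutputFormat, a made-up name); inside its LineRecordWriter, write() only needs one extra condition so that an empty Text value is treated the same as a NullWritable:

    public synchronized void write(K key, V value) throws IOException {
      boolean nullKey = key == null || key instanceof NullWritable;
      boolean nullValue = value == null || value instanceof NullWritable
          // extra check: an empty Text value should not drag a separator with it
          || (value instanceof Text && ((Text) value).getLength() == 0);
      if (nullKey && nullValue) {
        return;
      }
      if (!nullKey) {
        writeObject(key);
      }
      if (!(nullKey || nullValue)) {
        out.write(keyValueSeparator);
      }
      if (!nullValue) {
        writeObject(value);
      }
      out.write(newline);
    }

The job then points at this class with streaming's -outputformat option (full class name), and the jar containing it goes onto the job's classpath via -libjars or however you normally ship extra jars.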

Writing and packaging Java in Vim is honestly a bit painful; I'll try this more elegant approach at work on Monday.

C. Even though I modified the streaming code, there is no need to worry about affecting other users on the same machine, and no need to overwrite the streaming jar under $HADOOP_HOME: streaming provides the stream.shipped.hadoopstreaming parameter for pointing a job at your own copy of the jar.

D. Some settings appear to apply only to the reducer output and do nothing for this map-only job. For example:

mapred.textoutputformat.ignoreseparator
mapred.textoutputformat.separator

Setting them had no visible effect.

Also, writing something like -Dxxx= on the command line does not seem to have the effect of setting the parameter to an empty string, and -Dxxx="" behaves the same way.

This article is from the "New Youth" blog; please be sure to keep this source: http://luckybins.blog.51cto.com/786164/1601722
