Hive TextInputFormat Custom Separators

Preface

In a test that used Sqoop to import data from an Oracle relational database into Hive, a separator problem occurred. Some fields in Oracle contain \n newline characters, and because Hive uses '\n' as its default row delimiter, importing the data from Oracle into Hive with Sqoop produced records in Hive that were inconsistent with the source database. At the time, the problem was handled by using Sqoop parameters to replace the newline characters in those fields before importing into HDFS.

When importing data from a relational database into HDFS, Sqoop does support replacing \n with a custom newline character (only single-character custom line breaks are supported). However, when creating a table in Hive, specifying a custom line break with the clause ROW FORMAT DELIMITED LINES TERMINATED BY produces the following error:

The LINES TERMINATED BY parameter currently only supports '\n'. Since a custom line break cannot be specified, data that uses custom line breaks cannot be imported into Hive. Based on the above considerations, this article briefly explains how to enable Hive to use multi-character custom line breaks and field separators, for your reference. Corrections and suggestions are welcome.
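For illustration, the kind of table definition that triggers this might look as follows (a hypothetical example; the exact error text may vary between Hive versions):

CREATE TABLE t_custom (id string, name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '|+|';
-- FAILED: ... LINES TERMINATED BY only supports newline '\n' right now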

Environment

Hadoop: 2.2; Hive: 0.12 (Transwarp Inceptor, with native Hive support)

Target

Analyze Hive's support for custom multi-character line breaks, implement custom multi-character field separators in Hive, and implement a custom encoding setting for Hive's TextInputFormat.

1 Serialization and Deserialization in Hive

By default, Hive uses TextInputFormat, where one line represents one record. Within each record (one line), fields are separated by the default delimiter ^A (\001); for example, a row (1, alice) is stored as the text "1\001alice".

In some cases, however, we face multi-line or otherwise structured documents and need to import them into Hive for processing. That is when we need to customize the InputFormat, OutputFormat, and SerDe.

First, to clarify the relationship among the three, we quote the Hive documentation directly:

SerDe is a short name for "Serializer and Deserializer."

Hive uses SerDe (and FileFormat) to read and write table rows.

HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object

Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files

To summarize, when reading a file on HDFS, Hive handles it as follows:

(1) Call the InputFormat to split the file into records; each record is one row.

(2) Call the SerDe's Deserializer to split a row into individual fields.

When Hive performs an INSERT and writes rows out to files, it mainly calls the OutputFormat and the SerDe's Serializer, in the reverse order of reading.

For HDFS files that contain custom line breaks and field separators, this article only describes the modifications needed on the Hive read path.

2 Hive's Default TextInputFormat Class

First create a simple table, then use the DESCRIBE EXTENDED command to view its details.

Transwarp> CREATE TABLE test1 (id int);
OK
Time taken: 0.062 seconds
Transwarp> DESCRIBE EXTENDED test1;
OK
id    int    None
Detailed Table Information  Table(tableName:test1, dbName:default, owner:root, createTime:1409300219, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:int, comment:null)], location:hdfs://leezq-vm3:8020/inceptor1/user/hive/warehouse/test1, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{transient_lastDdlTime=1409300219}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.121 seconds, Fetched: 3 row(s)

As can be seen from the above, by default the input and output classes Hive invokes are:

InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Although Hadoop has moved on to the 2.x versions, Hive still uses the old mapred interface.

What we want to rewrite is the TextInputFormat class.

2.1 The TextInputFormat Class

The TextInputFormat class is located in hadoop-mapreduce-client-core-2.2.0.jar.

Focus on the getRecordReader method of this class, which returns a LineRecordReader object. The method already contains the code needed to accept a custom string as the record delimiter: as long as you run set textinputformat.record.delimiter=<custom newline string>; in the Hive CLI before creating the table, you can use a custom multi-character line break.
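For reference, the method looks roughly like this (an approximate sketch of the Hadoop 2.2 source, not a verbatim copy):

public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit,
    JobConf job, Reporter reporter) throws IOException {
  reporter.setStatus(genericSplit.toString());
  // The custom record delimiter, if any, comes from the job configuration.
  String delimiter = job.get("textinputformat.record.delimiter");
  byte[] recordDelimiterBytes = null;
  if (null != delimiter) {
    recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);  // com.google.common.base.Charsets
  }
  return new LineRecordReader(job, (FileSplit) genericSplit, recordDelimiterBytes);
}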


2.2 The LineRecordReader Class

To look further into the implementation, we examine the LineRecordReader class (org.apache.hadoop.mapred.LineRecordReader).

Looking at the constructor of this class, it calls org.apache.hadoop.util.LineReader (in hadoop-common-2.2.0.jar) to read each row. The recordDelimiter parameter is passed into the LineReader object, and the readLine(Text str, int maxLineLength, int maxBytesToConsume) method of LineReader is responsible for returning the length of each row according to the user-defined delimiter. If the user has not set textinputformat.record.delimiter, recordDelimiter is null and readLine splits rows on the default '\n'. The code of readLine is as follows:
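(The original article shows a screenshot here; the following is an approximate sketch of the corresponding logic in the Hadoop 2.2 LineReader, for reference.)

public int readLine(Text str, int maxLineLength, int maxBytesToConsume) throws IOException {
  if (this.recordDelimiterBytes != null) {
    // A custom delimiter was configured: match the delimiter byte sequence.
    return readCustomLine(str, maxLineLength, maxBytesToConsume);
  } else {
    // No custom delimiter: split on the default '\n' (and CR/LF).
    return readDefaultLine(str, maxLineLength, maxBytesToConsume);
  }
}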


Reading the source shows that Hive can already support multi-character custom line breaks (for TEXTFILE storage) by setting this parameter: the readCustomLine method splits the input using the user-defined delimiter string, and each row can be up to 2147483647 bytes long. However, to support custom multi-character field separators and a custom encoding setting, the source still has to be rewritten. The rewriting steps are described below.
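To make the mechanism concrete, here is a small self-contained sketch of how a reader can split a stream on a multi-byte delimiter. It is not the Hadoop implementation (which also handles split boundaries and buffering), the sample data is invented, and it assumes the delimiter does not contain a repeated internal prefix:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class CustomDelimiterDemo {

  // Reads one record, consuming (but not returning) the delimiter.
  // Returns null when the stream is exhausted.
  static String readRecord(InputStream in, byte[] delim) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    int matched = 0;                        // delimiter bytes matched so far
    int b;
    boolean sawData = false;
    while ((b = in.read()) != -1) {
      sawData = true;
      if (b == (delim[matched] & 0xFF)) {
        if (++matched == delim.length) {    // full delimiter found: record complete
          return buf.toString("UTF-8");
        }
      } else {
        buf.write(delim, 0, matched);       // the partial match was ordinary data
        matched = (b == (delim[0] & 0xFF)) ? 1 : 0;
        if (matched == 0) {
          buf.write(b);
        }
      }
    }
    if (!sawData) {
      return null;                          // nothing left to read
    }
    buf.write(delim, 0, matched);           // keep a trailing partial match
    return buf.toString("UTF-8");
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "11,aa\nbb|+|12,cc|+|13,dd|+|".getBytes("UTF-8");
    InputStream in = new ByteArrayInputStream(data);
    byte[] delim = "|+|".getBytes("UTF-8");
    String rec;
    while ((rec = readRecord(in, delim)) != null) {
      System.out.println("[" + rec + "]");  // prints three records; the first one contains a '\n'
    }
  }
}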

3 Customizing TextInputFormat: Custom Multi-Character Field Separators and a Custom Encoding Setting

First create an empty Java project and add the five required jar packages:


Then create two new classes, SQPTextInputFormat and SQPRecordReader, and copy in the code of TextInputFormat and LineRecordReader respectively.

In SQPTextInputFormat, add the setting for the custom encoding format. (The line-break parameter was also renamed from textinputformat.record.delimiter to textinputformat.record.linesep.)

======================================================
  String delimiter = job.get("textinputformat.record.linesep");
  this.encoding = job.get("textinputformat.record.encoding", defaultEncoding);
  byte[] recordDelimiterBytes = null;
  if (null != delimiter) {
    recordDelimiterBytes = delimiter.getBytes(this.encoding);
  }
  return new SQPRecordReader(job, (FileSplit) genericSplit, recordDelimiterBytes);

In the SQPRecordReader constructor, add the settings for the field separator and the encoding format.

======================================================
  this.fieldSep = job.get("textinputformat.record.fieldsep", defaultFSep);
  this.encoding = job.get("textinputformat.record.encoding", defaultEncoding);

In the next() method of SQPRecordReader, add the replacement of the field separator and the handling of the encoding format.

======================================================
  // Re-decode the record if a non-default encoding was configured
  if (encoding.compareTo(defaultEncoding) != 0) {
    String str = new String(value.getBytes(), 0, value.getLength(), encoding);
    value.set(str);
  }
  // Replace the custom field separator with Hive's default \001
  if (fieldSep.compareTo(defaultFSep) != 0) {
    String replacedValue = value.toString().replace(fieldSep, defaultFSep);
    value.set(replacedValue);
  }

The detailed code is as follows:

package com.learn.util.hadoop;

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapred.*;

public class SQPTextInputFormat extends FileInputFormat<LongWritable, Text>
    implements JobConfigurable {

  private CompressionCodecFactory compressionCodecs = null;
  // Supported values: "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"
  private static final String defaultEncoding = "UTF-8";
  private String encoding = null;

  public void configure(JobConf conf) {
    this.compressionCodecs = new CompressionCodecFactory(conf);
  }

  protected boolean isSplitable(FileSystem fs, Path file) {
    CompressionCodec codec = this.compressionCodecs.getCodec(file);
    if (null == codec) {
      return true;
    }
    return codec instanceof SplittableCompressionCodec;
  }

  public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit,
      JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(genericSplit.toString());
    // Custom record (line) delimiter and encoding, taken from the job configuration
    String delimiter = job.get("textinputformat.record.linesep");
    this.encoding = job.get("textinputformat.record.encoding", defaultEncoding);
    byte[] recordDelimiterBytes = null;
    if (null != delimiter) {
      recordDelimiterBytes = delimiter.getBytes(this.encoding);
    }
    return new SQPRecordReader(job, (FileSplit) genericSplit, recordDelimiterBytes);
  }
}


package com.learn.util.hadoop;

import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Seekable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.SplitCompressionInputStream;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.util.LineReader;

public class SQPRecordReader implements RecordReader<LongWritable, Text> {

  private static final Log LOG = LogFactory.getLog(SQPRecordReader.class.getName());

  private CompressionCodecFactory compressionCodecs = null;
  private long start;
  private long pos;
  private long end;
  private LineReader in;
  private FSDataInputStream fileIn;
  private final Seekable filePosition;
  private int maxLineLength;
  private CompressionCodec codec;
  private Decompressor decompressor;

  // Custom field separator; Hive's default field separator is \001 (^A)
  private String fieldSep;
  private static final String defaultFSep = "\001";
  // Supported values: "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"
  private static final String defaultEncoding = "UTF-8";
  private String encoding = null;

  public SQPRecordReader(Configuration job, FileSplit split) throws IOException {
    this(job, split, null);
  }

  public SQPRecordReader(Configuration job, FileSplit split, byte[] recordDelimiter)
      throws IOException {
    this.maxLineLength =
        job.getInt("mapreduce.input.linerecordreader.line.maxlength", 2147483647);
    this.fieldSep = job.get("textinputformat.record.fieldsep", defaultFSep);
    this.encoding = job.get("textinputformat.record.encoding", defaultEncoding);
    this.start = split.getStart();
    this.end = this.start + split.getLength();
    Path file = split.getPath();
    this.compressionCodecs = new CompressionCodecFactory(job);
    this.codec = this.compressionCodecs.getCodec(file);

    FileSystem fs = file.getFileSystem(job);
    this.fileIn = fs.open(file);
    if (isCompressedInput()) {
      this.decompressor = CodecPool.getDecompressor(this.codec);
      if (this.codec instanceof SplittableCompressionCodec) {
        SplitCompressionInputStream cIn = ((SplittableCompressionCodec) this.codec)
            .createInputStream(this.fileIn, this.decompressor, this.start, this.end,
                SplittableCompressionCodec.READ_MODE.BYBLOCK);
        this.in = new LineReader(cIn, job, recordDelimiter);
        this.start = cIn.getAdjustedStart();
        this.end = cIn.getAdjustedEnd();
        this.filePosition = cIn;
      } else {
        this.in = new LineReader(this.codec.createInputStream(this.fileIn, this.decompressor),
            job, recordDelimiter);
        this.filePosition = this.fileIn;
      }
    } else {
      this.fileIn.seek(this.start);
      this.in = new LineReader(this.fileIn, job, recordDelimiter);
      this.filePosition = this.fileIn;
    }
    // If this is not the first split, skip the first (partial) record
    if (this.start != 0L) {
      this.start += this.in.readLine(new Text(), 0, maxBytesToConsume(this.start));
    }
    this.pos = this.start;
  }

  public SQPRecordReader(InputStream in, long offset, long endOffset, int maxLineLength) {
    this(in, offset, endOffset, maxLineLength, null);
  }

  public SQPRecordReader(InputStream in, long offset, long endOffset, int maxLineLength,
      byte[] recordDelimiter) {
    this.maxLineLength = maxLineLength;
    this.in = new LineReader(in, recordDelimiter);
    this.start = offset;
    this.pos = offset;
    this.end = endOffset;
    this.filePosition = null;
  }

  public SQPRecordReader(InputStream in, long offset, long endOffset, Configuration job)
      throws IOException {
    this(in, offset, endOffset, job, null);
  }

  public SQPRecordReader(InputStream in, long offset, long endOffset, Configuration job,
      byte[] recordDelimiter) throws IOException {
    this.maxLineLength =
        job.getInt("mapreduce.input.linerecordreader.line.maxlength", 2147483647);
    this.in = new LineReader(in, job, recordDelimiter);
    this.start = offset;
    this.pos = offset;
    this.end = endOffset;
    this.filePosition = null;
  }

  public LongWritable createKey() {
    return new LongWritable();
  }

  public Text createValue() {
    return new Text();
  }

  private boolean isCompressedInput() {
    return this.codec != null;
  }

  private int maxBytesToConsume(long pos) {
    return isCompressedInput() ? 2147483647 : (int) Math.min(2147483647L, this.end - pos);
  }

  private long getFilePosition() throws IOException {
    long retVal;
    if (isCompressedInput() && null != this.filePosition) {
      retVal = this.filePosition.getPos();
    } else {
      retVal = this.pos;
    }
    return retVal;
  }

  public synchronized boolean next(LongWritable key, Text value) throws IOException {
    while (getFilePosition() <= this.end) {
      key.set(this.pos);
      int newSize = this.in.readLine(value, this.maxLineLength,
          Math.max(maxBytesToConsume(this.pos), this.maxLineLength));
      if (newSize == 0) {
        return false;
      }
      // Re-decode the record if a non-default encoding was configured
      if (encoding.compareTo(defaultEncoding) != 0) {
        String str = new String(value.getBytes(), 0, value.getLength(), encoding);
        value.set(str);
      }
      // Replace the custom field separator with Hive's default \001
      if (fieldSep.compareTo(defaultFSep) != 0) {
        String replacedValue = value.toString().replace(fieldSep, defaultFSep);
        value.set(replacedValue);
      }
      this.pos += newSize;
      if (newSize < this.maxLineLength) {
        return true;
      }
      LOG.info("Skipped line of size " + newSize + " at pos " + (this.pos - newSize));
    }
    return false;
  }

  public synchronized float getProgress() throws IOException {
    if (this.start == this.end) {
      return 0.0F;
    }
    return Math.min(1.0F,
        (float) (getFilePosition() - this.start) / (float) (this.end - this.start));
  }

  public synchronized long getPos() throws IOException {
    return this.pos;
  }

  public synchronized void close() throws IOException {
    try {
      if (this.in != null) {
        this.in.close();
      }
    } finally {
      if (this.decompressor != null) {
        CodecPool.returnDecompressor(this.decompressor);
      }
    }
  }
}

4 Using the Custom InputFormat

1. Package the program into a jar and place it under /usr/lib/hive/lib and /usr/lib/hadoop-mapreduce on every node.

In the Hive CLI, you can then set the following parameters to specify the encoding format, the custom field separator, and the custom line break, respectively.

set textinputformat.record.encoding=UTF-8;
-- supported encodings: US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, UTF-16
set textinputformat.record.fieldsep=,;
set textinputformat.record.linesep=|+|;


2. Create the table, specifying the InputFormat and OutputFormat to use, where org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat is Hive's default OutputFormat.

CREATE TABLE test
(
  id string,
  name string
)
STORED AS
INPUTFORMAT 'com.learn.util.hadoop.SQPTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

3. Load the data with a LOAD statement.

Test data:


One field in the test data contains a line break. The field separator and row separator are ',' and '|+|' respectively.
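The original article shows the test data only as a screenshot. Purely as an illustration of the format (with invented values), such a file could look like this:

11,first line
second line|+|12,name12|+|13,name13|+|

Here the name field of record 11 contains a newline, and '|+|' terminates each of the three records.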

Set the field separator and row separator as above, and create the table specifying the InputFormat and OutputFormat, as shown in the following illustration.
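The data file itself can then be loaded with an ordinary LOAD statement, for example (the file path here is hypothetical):

LOAD DATA LOCAL INPATH '/tmp/test.data' INTO TABLE test;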

The result of select * is as follows:

The result of select count(*) is as follows:

The result is 3 rows, which is correct.

The result of select id from test1 is as follows:

select name from test1:

select count(name) from test1:


The results are correct.

select name, id from test1:

select id, name from test1:

Queried individually, the id and name fields are fine, but when they are queried together (which triggers a MapReduce job), the field containing '\n' is displayed as NULL.

select id, name from test1 where id=13:

Querying each field individually and counting the total number of rows both work correctly, which indicates that the rewritten InputFormat is working; the NULL issue above appears to be a problem with how Hive displays the result.


