Solution for Garbled Chinese Comments in Hive (2)


This article is from the NetEase Cloud community.

Wang Panyan


Execution phase

Back in the runInternal method of the Driver class, let's look at the execution process that follows. runInternal calls an execute method. That method contains a lot of code, but only the launchTask call is relevant to us. The key steps in launchTask are:

    tsk.initialize(conf, plan, cxt);
    TaskResult tskRes = new TaskResult();
    TaskRunner tskRun = new TaskRunner(tsk, tskRes);
    cxt.launching(tskRun);
    tskRun.runSequential();

Following the runSequential method, we find that it calls:

    exitVal = tsk.executeTask();

Following further, we find that this code is executed:

    int retval = execute(driverContext);

When the show create table xx command is run, this call resolves to the execute method of the DDLTask class.

Following that execute method, we find the following code:

    ShowCreateTableDesc showCreateTbl = work.getShowCreateTblDesc();
    if (showCreateTbl != null) {
      return showCreateTable(db, showCreateTbl);
    }

Looking at the showCreateTable method, we find that it stitches the fields to be returned into a template, fills in the content fetched from the metastore, and finally writes the result into a temporary file stream. The final write to the file stream is:

    outStream.writeBytes(createTab_stmt.render());

This is the likely place where Chinese text gets garbled: the writeBytes method of a DataOutputStream writes only the low-order byte of each character, so any character outside single-byte range is corrupted. We therefore change it to:

    outStream.write(createTab_stmt.render().getBytes("UTF-8"));
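The difference between the two calls can be reproduced outside Hive. The following is a minimal sketch using only the JDK; the class name, helper names, and the sample comment text are made up for illustration and are not part of Hive.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class WriteBytesDemo {
    // Writes the text with DataOutputStream.writeBytes, which keeps
    // only the low-order byte of every char.
    static byte[] withWriteBytes(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeBytes(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Encodes the text as UTF-8 first, then writes the raw bytes.
    static byte[] withUtf8(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical column comment; the escapes are three Chinese characters.
        String comment = "COMMENT '\u7528\u6237\u540d'";
        String lossy = new String(withWriteBytes(comment), StandardCharsets.UTF_8);
        String intact = new String(withUtf8(comment), StandardCharsets.UTF_8);
        System.out.println("writeBytes round trip matches: " + lossy.equals(comment));
        System.out.println("UTF-8 write round trip matches: " + intact.equals(comment));
    }
}
```

With writeBytes, each of the three Chinese characters is reduced to a single unrelated byte, so the text cannot be recovered; with the UTF-8 write, each one becomes three bytes and decodes back to the original.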

Recompile Hive:

    mvn clean package -Phadoop-2 -DskipTests

Replace hive-exec-1.2.1.jar in the lib directory of the running Hive with the one from the ql/target directory of the compiled Hive source. Create a test table that does not use JSON serialization/deserialization, and the show create table xx command now shows the Chinese field comments correctly. However, if the test table still uses the JSON SerDe, the comment still appears as "from deserializer".

Let's go back to the code and see how the showCreateTable method gets the comment information for the fields. We find the following code:

    List<FieldSchema> cols = tbl.getCols();

Following it, we discover that if a custom serialization/deserialization class is set, this line is executed:

    return MetaStoreUtils.getFieldsFromDeserializer(getTableName(), getDeserializer());

Following the getFieldsFromDeserializer method, we find these important lines of code:

    ObjectInspector oi = deserializer.getObjectInspector();
    List<? extends StructField> fields = ((StructObjectInspector) oi).getAllStructFieldRefs();
    for (int i = 0; i < fields.size(); i++) {
      StructField structField = fields.get(i);
      String fieldName = structField.getFieldName();
      String fieldTypeName = structField.getFieldObjectInspector().getTypeName();
      String fieldComment = determineFieldComment(structField.getFieldComment());
      str_fields.add(new FieldSchema(fieldName, fieldTypeName, fieldComment));
    }

This means that the comments are taken from the deserializer. So let's go back and see what parameters Hive passes to our JSON deserializer. Returning to the previous snippet, we look at what the getDeserializer method does:

    deserializer = getDeserializerFromMetaStore(false);

It is best to set a breakpoint here. Stepping in, we find that the following is executed:

    return MetaStoreUtils.getDeserializer(SessionState.getSessionConf(), tTable, skipConfError);

A Deserializer instance is then built through reflection, and its initialize method is called:

    Deserializer deserializer = ReflectionUtil.newInstance(
        conf.getClassByName(lib).asSubclass(Deserializer.class), conf);
    SerDeUtils.initializeSerDeWithoutErrorCheck(deserializer, conf,
        MetaStoreUtils.getTableMetadata(table), null);

Following the initializeSerDeWithoutErrorCheck method, we find that it executes:

    deserializer.initialize(conf, createOverlayedProperties(tblProps, partProps));
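The pattern here, creating the SerDe by class name and then handing it the table properties, can be sketched with plain JDK reflection. This is only an illustration under assumptions: SimpleDeserializer and RecordingDeserializer are hypothetical stand-ins, not Hive's Deserializer interface or ReflectionUtil.

```java
import java.util.Properties;

class ReflectiveSerdeDemo {
    // Hypothetical stand-in for a deserializer interface.
    interface SimpleDeserializer {
        void initialize(Properties tableProperties);
        String seenComments();
    }

    // Hypothetical implementation that records the comments property it receives.
    static class RecordingDeserializer implements SimpleDeserializer {
        private String comments;

        public void initialize(Properties tableProperties) {
            comments = tableProperties.getProperty("columns.comments");
        }

        public String seenComments() {
            return comments;
        }
    }

    // Instantiates the named class through reflection, then initializes it
    // with the table properties, as the call chain above does.
    static SimpleDeserializer load(String className, Properties tableProperties) {
        try {
            SimpleDeserializer d = Class.forName(className)
                    .asSubclass(SimpleDeserializer.class)
                    .getDeclaredConstructor()
                    .newInstance();
            d.initialize(tableProperties);
            return d;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("columns.comments", "user id");
        SimpleDeserializer d = load(RecordingDeserializer.class.getName(), props);
        System.out.println(d.seenComments());
    }
}
```

The point of the sketch is that whatever ends up in the properties object is all the deserializer ever learns about the table, which is why the next step is to check what those properties contain.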

Next we follow MetaStoreUtils.getTableMetadata(table) and discover that it calls the MetaStoreUtils.getSchema method. Following it, we find the crucial code. Note that all the mysteries are solved here:

    for (FieldSchema col : tblsd.getCols()) {
      if (!first) {
        colNameBuf.append(",");
        colTypeBuf.append(":");
        colComment.append('\0');
      }
      colNameBuf.append(col.getName());
      colTypeBuf.append(col.getType());
      colComment.append((null != col.getComment()) ? col.getComment() : "");
      first = false;
    }
    String colNames = colNameBuf.toString();
    String colTypes = colTypeBuf.toString();
    schema.setProperty(
        org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.META_TABLE_COLUMNS,
        colNames);
    schema.setProperty(
        org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.META_TABLE_COLUMN_TYPES,
        colTypes);
    schema.setProperty("columns.comments", colComment.toString());

That is, Hive does pass the comment information to the serialization/deserialization class: it joins the comments with \0 (the NUL character) as the separator and stores the result in the property whose key is columns.comments.
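To make the format of that property concrete, here is a minimal standalone sketch of the same joining logic. It assumes the separator is the NUL character as described above; the class and method names are hypothetical and the code does not depend on Hive.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

class ColumnCommentsProperty {
    // Joins per-column comments with '\0', writing an empty string
    // for any column whose comment is missing.
    static String joinComments(List<String> comments) {
        StringBuilder buf = new StringBuilder();
        boolean first = true;
        for (String comment : comments) {
            if (!first) {
                buf.append('\0');
            }
            buf.append(comment != null ? comment : "");
            first = false;
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        Properties schema = new Properties();
        // Three hypothetical columns; the second one has no comment.
        schema.setProperty("columns.comments",
                joinComments(Arrays.asList("user id", null, "name")));
        // Replace NUL with '|' only so that the separators are visible when printed.
        System.out.println(schema.getProperty("columns.comments").replace('\0', '|'));
    }
}
```

A column without a comment still contributes a separator, so the number of separators is always one less than the number of columns. That detail matters later when the property is split again.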


Debugging hive-json-serde

Next we open hive-json-serde to see what it does with this information. The JsonSerDe class is easy to find. Take a look at its initialize method:

    String columnNameProperty = tbl.getProperty(Constants.LIST_COLUMNS);
    String columnTypeProperty = tbl.getProperty(Constants.LIST_COLUMN_TYPES);

It does not read the comment information at all! Now let's see how it generates the rowObjectInspector:

    rowObjectInspector = (StructObjectInspector) JsonObjectInspectorFactory
        .getJsonObjectInspectorFromTypeInfo(rowTypeInfo, options);

Following the getJsonObjectInspectorFromTypeInfo method, we find:

    result = JsonObjectInspectorFactory.getJsonStructObjectInspector(fieldNames,
        fieldObjectInspectors, options);

Following further, we find:

    result = new JsonStructObjectInspector(structFieldNames,
        structFieldObjectInspectors, options);

Let's look at the JsonStructObjectInspector class. It extends StandardStructObjectInspector, and its constructor calls this parent constructor:

    protected StandardStructObjectInspector(List<String> structFieldNames,
        List<ObjectInspector> structFieldObjectInspectors) {
      init(structFieldNames, structFieldObjectInspectors, null);
    }

The last argument passed to init is null, so this is where the problem lies. The parent class actually has another constructor:

    protected StandardStructObjectInspector(List<String> structFieldNames,
        List<ObjectInspector> structFieldObjectInspectors,
        List<String> structFieldComments) {
      init(structFieldNames, structFieldObjectInspectors, structFieldComments);
    }

In other words, passing in comment information is allowed. Our approach is now clear: first, parse the comment information that was previously ignored; second, pass it into the JsonStructObjectInspector constructor. The parsing code is:

    String columnCommentProperty = tbl.getProperty("columns.comments");
    if (columnCommentProperty != null) {
      if (columnCommentProperty.length() == 0) {
        columnComments = new ArrayList<String>();
      } else {
        columnComments = Arrays.asList(columnCommentProperty.split("\0", columnNames.size()));
      }
    }

One thing to note here: the StandardStructObjectInspector class strictly checks that the number of fields equals the number of comments. The split call must therefore be given two arguments, so that fields with empty comments are still filled in with empty strings; otherwise a bug results. The next step is to pass this comment list through each function along the call chain, which is not described in detail here.
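The reason for the second argument is that String.split without a limit discards trailing empty strings, while a positive limit keeps them. A minimal sketch, assuming the NUL separator described earlier; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CommentSplitDemo {
    // Splits with a limit equal to the column count, so trailing
    // empty comments are preserved as empty strings.
    static List<String> parse(String property, int columnCount) {
        if (property.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(property.split("\0", columnCount));
    }

    // Splits without a limit; trailing empty strings are dropped.
    static List<String> parseWithoutLimit(String property) {
        return Arrays.asList(property.split("\0"));
    }

    public static void main(String[] args) {
        // Three hypothetical columns, only the first has a comment.
        String property = "user id\0\0";
        System.out.println("with limit: " + parse(property, 3).size() + " comments");
        System.out.println("without limit: " + parseWithoutLimit(property).size() + " comments");
    }
}
```

For three columns where only the first has a comment, the limited split yields three entries and the unlimited split yields one, which would fail the field-count check.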

Then change the JsonStructObjectInspector constructor to:

    public JsonStructObjectInspector(List<String> structFieldNames,
        List<ObjectInspector> structFieldObjectInspectors,
        List<String> structFieldComments, JsonStructOIOptions opts) {
      super(structFieldNames, structFieldObjectInspectors, structFieldComments);
      options = opts;
    }
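The effect of this constructor change can be illustrated with a toy model. Everything below is a hypothetical stand-in (BaseStructInspector, OldJsonInspector, NewJsonInspector); it only mimics the two parent constructors and the count check described above and is not the Hive or hive-json-serde code.

```java
import java.util.Arrays;
import java.util.List;

class CommentPassThroughDemo {
    // Stand-in for the parent inspector with its two constructors.
    static class BaseStructInspector {
        final List<String> names;
        final List<String> comments; // null means no comments were supplied

        BaseStructInspector(List<String> names) {
            this(names, null);
        }

        BaseStructInspector(List<String> names, List<String> comments) {
            // Mimics the strict check that field and comment counts are equal.
            if (comments != null && comments.size() != names.size()) {
                throw new IllegalArgumentException("field and comment counts differ");
            }
            this.names = names;
            this.comments = comments;
        }

        String commentOf(int index) {
            return comments == null ? null : comments.get(index);
        }
    }

    // Before the change: the subclass never forwards comments.
    static class OldJsonInspector extends BaseStructInspector {
        OldJsonInspector(List<String> names) {
            super(names);
        }
    }

    // After the change: comments are forwarded to the parent.
    static class NewJsonInspector extends BaseStructInspector {
        NewJsonInspector(List<String> names, List<String> comments) {
            super(names, comments);
        }
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("id", "name");
        List<String> comments = Arrays.asList("user id", "user name");
        System.out.println(new OldJsonInspector(names).commentOf(0));
        System.out.println(new NewJsonInspector(names, comments).commentOf(0));
    }
}
```

In this model the old subclass always reports a null comment, while the new one reports whatever was parsed from the table properties.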

Finally, recompile hive-json-serde:

    mvn -Phdp23 clean package

This part involves more code changes.


3. Summary

In fact, Chinese comments in Hive are garbled for two reasons. The first is that Hive does not convert the text to UTF-8 when it writes comments to the stream. The second is that the third-party plug-in hive-json-serde does not keep the comments, and when a comment is missing, Hive automatically fills in the string "from deserializer".
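The second cause can be pictured with a tiny sketch of the fallback behaviour. The helper below is hypothetical and written only from the behaviour observed in this article (a missing comment is shown as "from deserializer"); it is not copied from Hive.

```java
class CommentFallbackDemo {
    // Placeholder text that show create table displayed in this article.
    static final String FROM_DESERIALIZER = "from deserializer";

    // Hypothetical helper: a missing (null) comment is replaced by the placeholder,
    // any other comment is returned unchanged.
    static String commentOrFallback(String comment) {
        return comment == null ? FROM_DESERIALIZER : comment;
    }

    public static void main(String[] args) {
        System.out.println(commentOrFallback(null));
        System.out.println(commentOrFallback("user id"));
    }
}
```

Under this assumption, a SerDe that never supplies comments makes every column show the placeholder, regardless of what is stored in the metastore.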

So only the following small changes are needed. First, in the Hive source, go to the ql directory, find the DDLTask class in org.apache.hadoop.hive.ql.exec, and locate the showCreateTable method. Change the code at line 2110 from:

    outStream.writeBytes(createTab_stmt.render());

to:

    outStream.write(createTab_stmt.render().getBytes("UTF-8"));

For hive-json-serde, add the code that parses the field comments to the initialize method of its JsonSerDe class:

    String columnCommentProperty = tbl.getProperty("columns.comments");
    if (columnCommentProperty != null) {
      if (columnCommentProperty.length() == 0) {
        columnComments = new ArrayList<String>();
      } else {
        columnComments = Arrays.asList(columnCommentProperty.split("\0", columnNames.size()));
      }
    }

When constructing the rowObjectInspector, pass in the comment information:

    rowObjectInspector = (StructObjectInspector) JsonObjectInspectorFactory
        .getJsonObjectInspectorFromTypeInfo(rowTypeInfo, columnComments, options);

Then change the JsonStructObjectInspector constructor to:

    public JsonStructObjectInspector(List<String> structFieldNames,
        List<ObjectInspector> structFieldObjectInspectors,
        List<String> structFieldComments, JsonStructOIOptions opts) {
      super(structFieldNames, structFieldObjectInspectors, structFieldComments);
      options = opts;
    }



Related reading: Solution for Garbled Chinese Comments in Hive (1)

