Hive (VI): Extended Characteristics of Hive


Hive is a very open system, and many parts of it support user customization, including:

- File formats: text file, sequence file
- In-memory data formats: Java Integer/String, Hadoop IntWritable/Text
- User-supplied Map/Reduce scripts: written in any language, passing data through stdin/stdout
- User-defined functions (one row in, one row out): Substr, Trim, ...
- User-defined aggregate functions (n rows in, one row out): Sum, Average, ...

The built-in file formats compare as follows:

                                TextFile      SequenceFile   RCFile
Data type                       Text only     Text/Binary    Text/Binary
Internal storage order          Row-based     Row-based      Column-based
Compression                     File-based    Block-based    Block-based
Splittable                      Yes           Yes            Yes
Splittable after compression    No            Yes            Yes

For example:

CREATE TABLE mylog (user_id BIGINT, page_url STRING, unix_time INT)
STORED AS TEXTFILE;

The file format can be customized when the current Hive cannot recognize the format of the user's data files. You can refer to the example in contrib/src/java/org/apache/hadoop/hive/contrib/fileformat/base64; the sketch below illustrates the general shape of such a format.
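As an illustration only: the package and class names below, and the assumption that every input line is Base64-encoded text, are inventions of this sketch rather than the actual contrib code.

package com.example.hive.fileformat;

import java.io.IOException;
import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Decodes one Base64-encoded line per record before handing it to Hive.
public class Base64LineInputFormat extends TextInputFormat {
  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    final RecordReader<LongWritable, Text> lines =
        super.getRecordReader(split, job, reporter);
    return new RecordReader<LongWritable, Text>() {
      public boolean next(LongWritable key, Text value) throws IOException {
        if (!lines.next(key, value)) {
          return false; // no more lines in this split
        }
        // Replace the raw line with its Base64-decoded bytes.
        value.set(Base64.decodeBase64(value.toString().getBytes()));
        return true;
      }
      public LongWritable createKey() { return lines.createKey(); }
      public Text createValue() { return lines.createValue(); }
      public long getPos() throws IOException { return lines.getPos(); }
      public float getProgress() throws IOException { return lines.getProgress(); }
      public void close() throws IOException { lines.close(); }
    };
  }
}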

When you have finished writing the custom format, specify it when creating the table:

CREATE TABLE base64_test (col1 STRING, col2 STRING)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextOutputFormat';

SerDe

SerDe is short for Serialize/Deserialize, and it handles the serialization and deserialization of data. Serialization formats include:

- Delimited text (tab, comma, ctrl-A)
- Thrift protocol

Deserialized (in-memory) formats include:

- Java Integer/String/ArrayList/HashMap
- Hadoop Writable classes
- User-defined classes
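The choice between the two representations matters for performance. The short snippet below is not from the original post; it simply illustrates that Hadoop Writables are mutable objects that can be reused across rows, while plain Java objects are allocated anew each time.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class InMemoryFormats {
  public static void main(String[] args) {
    // Plain Java objects: a new Integer/String is created per row.
    Integer javaInt = Integer.valueOf(42);
    String javaStr = "page_url";

    // Hadoop Writables: one mutable instance can be reused for every row,
    // avoiding per-row allocation.
    IntWritable writableInt = new IntWritable();
    Text writableStr = new Text();
    writableInt.set(42);
    writableStr.set("page_url");

    System.out.println(javaInt + " " + javaStr + " = " + writableInt + " " + writableStr);
  }
}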

The SerDes currently available are shown in the following figure:

LazyObject deserializes a column only when it is accessed. BinarySortable is a binary format that preserves sort order.

Consider adding a new SerDe when:

- The user's data has a special serialization format that the current Hive does not support, and the user does not want to convert the data before loading it into Hive.
- The user has a more efficient way of serializing data on disk.

Users who want to add a custom SerDe for text data can refer to the example in contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java. RegexSerDe deserializes data using a regular expression supplied by the user, for example:

CREATE TABLE apache_log (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;

Users who want to add a custom SerDe for binary data can refer to the example in serde/src/java/org/apache/hadoop/hive/serde2/binarysortable, for example:

CREATE TABLE mythrift_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.thrift.ThriftSerDe'
WITH SERDEPROPERTIES (
  "serialization.class" = "com.facebook.serde.tprofiles.full",
  "serialization.format" = "com.facebook.thrift.protocol.TBinaryProtocol");

Map/Reduce scripts (Transform)

Users can supply custom Map/Reduce scripts for Hive to run, for example:

FROM (
  SELECT TRANSFORM (user_id, page_url, unix_time)
  USING 'page_url_to_id.py'
  AS (user_id, page_id, unix_time)
  FROM mylog
  DISTRIBUTE BY user_id
  SORT BY user_id, unix_time
) mylog2
SELECT TRANSFORM (user_id, page_id, unix_time)
USING 'my_python_session_cutter.py'
AS (user_id, session_info);

The Map/Reduce script reads and writes data through stdin/stdout, and writes debug information to stderr.
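For illustration, here is a minimal sketch of what a transform script like page_url_to_id.py could do, written in Java to match the other examples in this post. The tab-separated column layout follows the query above; deriving page_id from a hash of the URL is purely an assumption of this sketch.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Reads tab-separated rows (user_id, page_url, unix_time) from stdin and
// writes (user_id, page_id, unix_time) to stdout, as Hive's TRANSFORM expects.
public class PageUrlToId {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      String[] cols = line.split("\t", -1);
      if (cols.length < 3) {
        System.err.println("skipping malformed row: " + line); // debug goes to stderr
        continue;
      }
      // Assumption of this sketch: derive a stable page_id by hashing the URL.
      int pageId = cols[1].hashCode() & 0x7fffffff;
      System.out.println(cols[0] + "\t" + pageId + "\t" + cols[2]);
    }
  }
}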

UDF (user-defined function)

Users can define their own functions to process data, for example:

add jar build/ql/test/test-udfs.jar;
CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.UDFTestLength';
SELECT testlength(src.value) FROM src;
DROP TEMPORARY FUNCTION testlength;

UDFTestLength.java is:

package org.apache.hadoop.hive.ql.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFTestLength extends UDF {
  public Integer evaluate(String s) {
    if (s == null) {
      return null;
    }
    return s.length();
  }
}

Custom functions can be overloaded:

add jar build/contrib/hive_contrib.jar;
CREATE TEMPORARY FUNCTION example_add AS 'org.apache.hadoop.hive.contrib.udf.example.UDFExampleAdd';
SELECT example_add(1, 2) FROM src;
SELECT example_add(1.1, 2.2) FROM src;

UDFExampleAdd.java:

import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFExampleAdd extends UDF {
  public Integer evaluate(Integer a, Integer b) {
    if (a == null || b == null) return null;
    return a + b;
  }

  public Double evaluate(Double a, Double b) {
    if (a == null || b == null) return null;
    return a + b;
  }
}


When using a UDF, type conversions are performed automatically, much like the implicit conversions in Java or C. For example:

SELECT example_add(1, 2.1) FROM src;

The result is 3.1, because the UDF converts the int parameter 1 to a double and calls the double overload.

Implicit type conversions are controlled by the UDF method resolver and can behave differently from one UDF to another.

A UDF can also accept variable-length arguments, as in this version of UDFExampleAdd.java:

public class UDFExampleAdd extends UDF {
  public Integer evaluate(Integer... a) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != null) total += a[i];
    }
    return total;
  }

  // similarly: public Double evaluate(Double... a)
}

Usage examples:

SELECT example_add(1, 2) FROM src;
SELECT example_add(1, 2, 3) FROM src;
SELECT example_add(1, 2, 3, 4.1) FROM src;

In summary, the UDF has the following characteristics:

- It is easy to write a UDF in Java.
- Hadoop Writables/Text give high performance.
- UDFs can be overloaded.
- Hive supports implicit type conversions.
- UDFs support variable-length arguments.
- GenericUDF gives better performance by avoiding reflection (see the sketch below).
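The original post does not include a GenericUDF example, so here is a minimal illustrative sketch of one (a null-safe string length). It resolves argument types once in initialize() rather than relying on per-row reflection; the package, class, and function names are assumptions of this sketch.

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
import org.apache.hadoop.io.IntWritable;

// A null-safe string-length function written as a GenericUDF.
public class GenericUDFStrLen extends GenericUDF {
  private StringObjectInspector stringOI;
  private final IntWritable result = new IntWritable();

  @Override
  public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    if (args.length != 1 || !(args[0] instanceof StringObjectInspector)) {
      throw new UDFArgumentException("str_len() takes a single string argument");
    }
    // Argument types are resolved once here, not via reflection per row.
    stringOI = (StringObjectInspector) args[0];
    return PrimitiveObjectInspectorFactory.writableIntObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] args) throws HiveException {
    Object arg = args[0].get();
    if (arg == null) {
      return null;
    }
    result.set(stringOI.getPrimitiveJavaObject(arg).length()); // reuse one IntWritable
    return result;
  }

  @Override
  public String getDisplayString(String[] children) {
    return "str_len(" + children[0] + ")";
  }
}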

UDAF (user-defined aggregation function)

Example:

SELECT page_url, COUNT(1), COUNT(DISTINCT user_id) FROM mylog GROUP BY page_url;

UDAFCount.java:

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public class UDAFCount extends UDAF {
  public static class Evaluator implements UDAFEvaluator {
    private int mCount;

    public void init() {
      mCount = 0;
    }

    public boolean iterate(Object o) {
      if (o != null) mCount++;
      return true;
    }

    public Integer terminatePartial() {
      return mCount;
    }

    public boolean merge(Integer o) {
      mCount += o;
      return true;
    }

    public Integer terminate() {
      return mCount;
    }
  }
}

UDAF Summary:

- Writing a UDAF is similar to writing a UDF.
- UDAFs can be overloaded.
- UDAFs can return complex types.
- Partial aggregation can be disabled for a UDAF.

Comparison of UDF, UDAF, and Map/Reduce scripts:

                 UDF          UDAF         Map/Reduce scripts
Language         Java         Java         Any
Input/output     1 row to 1   n rows to 1  1-1 or n-1 (via DISTRIBUTE BY/SORT BY)
Data exchange    In-memory    In-memory    stdin/stdout
