Hive architecture (IV): considerations and scalability


Hive architecture (I): architecture and basic composition
Hive architecture (II): implementation principle of Hive and comparison with relational databases
Hive architecture (III): metadata store and basic operations
Hive architecture (IV): considerations and scalability

1. Notes on using Hive
  1. Character set
    Hadoop and Hive are both UTF-8 encoded, so all Chinese text must be UTF-8 encoded to be handled correctly.
    Note: if Chinese data loaded into a table uses a different character set, it will very likely show up as garbled characters. The data must be transcoded before loading, because Hive itself has no transcoding function.
  2. Compression
    The hive.exec.compress.output parameter defaults to false, but it is usually set explicitly rather than left to the default. If it is set to true, query results are written compressed, and you can no longer operate on the result files directly in Hadoop; leave it false in that case. For example:
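A minimal sketch of the setting discussed above (hive.exec.compress.output is the standard parameter name):

-- Keep query output uncompressed so the result files can be read
-- directly in Hadoop (e.g. with hadoop fs -text)
SET hive.exec.compress.output=false;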
  3. count(DISTINCT)
    The current Hive does not support multiple DISTINCT expressions in a single query. To implement multi-DISTINCT, you need at least n+1 queries (where n is the number of DISTINCT columns): the first n queries deduplicate their respective columns, and the last query joins the n deduplicated results to produce the final result, as in the sketch below.
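A hedged sketch of the workaround for n = 2 (the table mylog and the columns user_id and ip are hypothetical): each subquery deduplicates and counts one column, and the final query joins the results on a constant key:

SELECT t1.uv, t2.ips
FROM
  (SELECT 1 AS k, count(1) AS uv  FROM (SELECT DISTINCT user_id FROM mylog) x) t1
JOIN
  (SELECT 1 AS k, count(1) AS ips FROM (SELECT DISTINCT ip FROM mylog) y) t2
ON (t1.k = t2.k);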
  4. Join
    Only equi-joins are supported; see the example below.
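A small illustration (tables a and b are hypothetical):

-- Supported: the join condition is an equality
SELECT a.val, b.val FROM a JOIN b ON (a.id = b.id);
-- Not supported: a non-equality condition such as ON (a.id > b.id)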
  5. DML operations
    Only INSERT/LOAD operations are supported; UPDATE and DELETE are not. See the example below.
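A hedged example of loading data (the path and table name are hypothetical):

-- Load a local file into a table, replacing its current contents
LOAD DATA LOCAL INPATH '/tmp/mylog.txt' OVERWRITE INTO TABLE mylog;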
  6. Having
    The HAVING clause is not supported. If you need this function, use a WHERE clause over a nested subquery, as shown below.
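A sketch of the rewrite (the table t and the column key are hypothetical); instead of GROUP BY ... HAVING, filter the aggregated subquery with WHERE:

SELECT key, cnt
FROM (SELECT key, count(1) AS cnt FROM t GROUP BY key) sub
WHERE cnt > 10;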
  7. Subqueries
    Hive does not support subqueries in the WHERE clause.
  8. Semantic difference in processing NULL values in joins
    In standard SQL, any operation on NULL (numeric comparison, string operation, etc.) yields NULL. Hive's handling of NULL basically follows the standard, except for special logic during joins: when fields are compared as join keys, NULL = NULL is considered meaningful and returns true. See the illustration below.
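A hypothetical illustration (tables a and b are assumptions): rows whose join keys are both NULL match each other in Hive, whereas under standard semantics they would not:

-- In Hive, rows where a.key IS NULL and b.key IS NULL are joined together
SELECT a.val, b.val FROM a JOIN b ON (a.key = b.key);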
  9. Semicolon
    A semicolon marks the end of an SQL statement, and the same is true in HiveQL. However, HiveQL's semicolon recognition is not very smart. For example:

select concat(cookie_id, concat(';', 'zoo')) from c02_clickstat_fatdt1 limit 2;
FAILED: Parse Error: line 0:-1 cannot recognize input '<EOF>' in function specification

It can be inferred that when Hive parses a statement, any semicolon is treated as the end of the statement, regardless of whether it is enclosed in quotation marks.
The solution is to escape the semicolon with its octal ASCII code, so the statement above should be written as:

select concat(cookie_id, concat('\073', 'zoo')) from c02_clickstat_fatdt1 limit 2;

Why an octal ASCII code? I tried a hexadecimal ASCII code, but Hive treats it as a plain string and does not escape it; it seems only octal is supported, and the reason is unknown. This rule also applies to statements other than SELECT. For example, if you need to define a separator in CREATE TABLE, a non-printable character must be escaped with its octal ASCII code, as in the sketch below.
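A hedged example of the CREATE TABLE case (the table name is hypothetical; '\001' is the octal escape for the non-printable Ctrl-A character):

CREATE TABLE octal_sep_demo (c1 STRING, c2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;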
10. Insert
The OVERWRITE keyword must be added according to the INSERT syntax; that is, every insert overwrites the existing data. For example:

2. Extended features of Hive

Hive is a very open system, and many components can be customized, including:
* File formats: Text File, Sequence File
* In-memory data formats: Java Integer/String, Hadoop IntWritable/Text
* User-provided map/reduce scripts: transmit data via stdin/stdout, in any language
* User-defined functions (UDF): substr, trim, ... (one row in, one row out)
* User-defined aggregate functions (UDAF): sum, average, ... (N rows in, one row out)

2.1 Data File Format
|                              | TextFile   | SequenceFile | RCFile       |
|------------------------------|------------|--------------|--------------|
| Data type                    | Text only  | Text/binary  | Text/binary  |
| Internal storage order       | Row-based  | Row-based    | Column-based |
| Compression                  | File-based | Block-based  | Block-based  |
| Splittable                   | Yes        | Yes          | Yes          |
| Splittable after compression | No         | Yes          | Yes          |

For example, you can specify the storage file format when creating a table:

CREATE TABLE mylog ( user_id BIGINT, page_url STRING, unix_time INT) STORED AS TEXTFILE;

When your data is in a file format that the current Hive cannot recognize, you can define a custom file format. Refer to contrib/src/java/org/apache/hadoop/hive/contrib/fileformat/base64. After writing the custom format, you can specify it when creating the table:

CREATE TABLE base64_test(col1 STRING, col2 STRING)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextOutputFormat';
2.2 SerDe

SerDe is short for Serialize/Deserialize; it is used for serialization and deserialization.
Serialized (on-disk) formats include: delimited text (tab, comma, Ctrl-A) and the Thrift protocol.
Deserialized (in-memory) formats include: Java Integer/String/ArrayList/HashMap, Hadoop Writable, and custom classes.
LazyObject defers deserialization until a column is actually accessed. BinarySortable is a binary format that preserves sort order.
Consider adding a new SerDe when:
* Your data has a special serialization format that Hive does not support, and you do not want to convert the data before loading it into Hive.
* You can serialize the on-disk data more efficiently.

To add a custom SerDe for text data, refer to contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java. RegexSerDe deserializes data using a regular expression provided by the user, for example:

CREATE TABLE apache_log (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s")
STORED AS TEXTFILE;

To add a custom SerDe for binary data, refer to the example in serde/src/java/org/apache/hadoop/hive/serde2/binarysortable. For example:

CREATE TABLE mythrift_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.thrift.ThriftSerDe'
WITH SERDEPROPERTIES (
  "serialization.class" = "com.facebook.serde.tprofiles.full",
  "serialization.format" = "com.facebook.thrift.protocol.TBinaryProtocol");
2.3 Map/Reduce scripts (Transform)

You can plug your own map/reduce scripts into Hive, for example:

FROM (
  SELECT TRANSFORM(user_id, page_url, unix_time)
  USING 'page_url_to_id.py'
  AS (user_id, page_id, unix_time)
  FROM mylog
  DISTRIBUTE BY user_id
  SORT BY user_id, unix_time
) mylog2
SELECT TRANSFORM(user_id, page_id, unix_time)
USING 'my_python_session_cutter.py' AS (user_id, session_info);

The map/reduce script reads and writes data through stdin/stdout and outputs debugging information to stderr.

2.4 UDF (user-defined function)

You can customize functions to process data, for example:

add jar build/ql/test/test-udfs.jar;
CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.UDFTestLength';
SELECT testlength(src.value) FROM src;
DROP TEMPORARY FUNCTION testlength;

UDFTestLength.java is:

package org.apache.hadoop.hive.ql.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFTestLength extends UDF {
  // Returns the length of the input string, or NULL for NULL input
  public Integer evaluate(String s) {
    if (s == null) {
      return null;
    }
    return s.length();
  }
}

UDFs have the following characteristics:
* UDFs are easy to write in Java.
* Hadoop Writables/Text offer high performance.
* UDFs can be overloaded.
* Hive supports implicit type conversion.
* UDFs support variable-length arguments.
* GenericUDF offers better performance (it avoids reflection).

2.5 UDAF (user-defined aggregate function)

Example:

SELECT page_url, count(1), count(DISTINCT user_id) FROM mylog GROUP BY page_url;

The UDAFCount.java code is as follows:

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public class UDAFCount extends UDAF {
  public static class Evaluator implements UDAFEvaluator {
    private int mCount;

    // Reset the accumulator
    public void init() {
      mCount = 0;
    }

    // Called once per input row
    public boolean iterate(Object o) {
      if (o != null) {
        mCount++;
      }
      return true;
    }

    // Partial result handed to the combiner/reducer
    public Integer terminatePartial() {
      return mCount;
    }

    // Merge a partial result from another task
    public boolean merge(Integer o) {
      mCount += o;
      return true;
    }

    // Final result
    public Integer terminate() {
      return mCount;
    }
  }
}
