Hive architecture (I): architecture and basic composition
Hive architecture (II): implementation principles of Hive and comparison with relational databases
Hive architecture (III): the metastore database and basic operations
Hive architecture (IV): considerations and scalability
1. Notes for using hive
- Character Set
Hadoop and Hive both use UTF-8 encoding, so all Chinese (or other non-ASCII) text must be UTF-8 encoded to be handled correctly.
Note: when Chinese data is loaded into a table, if it uses a different character set it will very likely appear garbled and has to be transcoded beforehand; Hive itself provides no function to do this.
- Compression
The hive.exec.compress.output parameter defaults to false, but in practice it often needs to be set explicitly; otherwise the query results may be written compressed. If you want to operate on the output files directly in Hadoop afterwards, they must not be compressed.
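A minimal sketch of making this explicit in a session (the output directory below is just an illustration):

-- Disable output compression so the result files can be read directly by other Hadoop jobs
SET hive.exec.compress.output=false;
INSERT OVERWRITE DIRECTORY '/tmp/clickstat_export'
SELECT * FROM c02_clickstat_fatdt1;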
- COUNT(DISTINCT)
The current version of Hive does not support multiple DISTINCT expressions in a single query. To get the effect of multiple DISTINCTs, you need at least N+1 queries (where N is the number of DISTINCT columns): the first N queries deduplicate the N columns respectively, and the final query joins the N deduplicated results to produce the final result.
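A sketch of this rewrite for two DISTINCT counts (table and column names are illustrative). The two single-row results are combined with an equi-join on a constant key; depending on the Hive version you may prefer to materialize the intermediate results into temporary tables instead:

SELECT a.uv, b.pv
FROM (SELECT 1 AS k, COUNT(DISTINCT user_id) AS uv FROM mylog) a
JOIN (SELECT 1 AS k, COUNT(DISTINCT page_url) AS pv FROM mylog) b
ON (a.k = b.k);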
- Join
Only equi-joins (join conditions using equality) are supported.
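For instance (table names are illustrative), an equality condition is accepted while a range condition is not:

-- Supported: equi-join
SELECT a.user_id, b.page_id
FROM mylog a JOIN mylog2 b ON (a.user_id = b.user_id);

-- Not supported: a non-equality join condition such as
--   ... JOIN mylog2 b ON (a.unix_time > b.unix_time)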
- DML operations
Only INSERT and LOAD operations are supported; UPDATE and DELETE are not.
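A short sketch of the LOAD path (the file paths are assumptions); the INSERT form is shown under the Insert note below:

-- Copy a file from the local filesystem into the table
LOAD DATA LOCAL INPATH '/tmp/clickstat_2009-06-08.txt' INTO TABLE c02_clickstat_fatdt1;
-- Or move a file that already resides in HDFS, replacing existing data
LOAD DATA INPATH '/user/hive/staging/clickstat_2009-06-08.txt' OVERWRITE INTO TABLE c02_clickstat_fatdt1;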
- Having
HAVING is not supported. If you need this functionality, nest the aggregation in a subquery and apply a WHERE clause to the outer query.
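A sketch of the rewrite (table names and the threshold are illustrative):

-- Intended: SELECT page_url, COUNT(1) FROM mylog GROUP BY page_url HAVING COUNT(1) > 100
SELECT t.page_url, t.cnt
FROM (SELECT page_url, COUNT(1) AS cnt FROM mylog GROUP BY page_url) t
WHERE t.cnt > 100;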
- Subquery
Hive does not support subqueries in the WHERE clause (for example, WHERE col IN (SELECT ...)); such filters usually have to be rewritten as joins.
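A sketch of the usual workaround, rewriting an IN subquery as an equi-join (vip_users is a hypothetical table):

-- Not accepted: SELECT * FROM mylog WHERE user_id IN (SELECT user_id FROM vip_users)
SELECT m.*
FROM mylog m
JOIN (SELECT DISTINCT user_id FROM vip_users) v
ON (m.user_id = v.user_id);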
- Semantics of NULL values in joins
In the SQL standard, any operation involving NULL (numeric comparison, string functions, etc.) yields NULL. Hive handles NULL values in essentially the same way, with one exception during joins: when fields are compared as join keys in a Hive join, NULL = NULL is considered a match and evaluates to true.
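As a consequence, rows whose join key is NULL on both sides will match each other; if that is not what you want, filter them out explicitly (table names are illustrative):

SELECT a.user_id, b.page_id
FROM mylog a JOIN mylog2 b ON (a.user_id = b.user_id)
WHERE a.user_id IS NOT NULL;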
- Semicolon
A semicolon marks the end of a SQL statement, and the same is true in HiveQL. However, HiveQL's semicolon handling is not very smart. For example:
select concat(cookie_id, concat(';', 'zoo')) from c02_clickstat_fatdt1 limit 2;
FAILED: Parse Error: line 0:-1 cannot recognize input '<EOF>' in function specification
It can be inferred that when parsing a HiveQL statement, a semicolon is treated as the end of the statement regardless of whether it appears inside quotation marks.
The workaround is to escape the semicolon with its octal ASCII code, so the statement above should be written as:
select concat(cookie_id, concat('\073', 'zoo')) from c02_clickstat_fatdt1 limit 2;
Why an octal ASCII code? I tried a hexadecimal ASCII code, but Hive treated it as a literal string and did not escape it; apparently only octal escapes are supported, for reasons unknown. The same rule applies to statements other than SELECT. For example, if you need to define a separator in CREATE TABLE, non-printable characters must be escaped with their octal ASCII codes.
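A sketch of the CREATE TABLE case (the table is hypothetical): a Ctrl-A field separator written as its octal code \001.

CREATE TABLE raw_clickstat (cookie_id STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;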
- Insert
According to the INSERT syntax, the OVERWRITE keyword must always be specified; in other words, every insert overwrites the existing data.
2. Extended features of hive
Hive is a very open system. Much of its behavior can be customized, including:
* File formats: TextFile, SequenceFile
* In-memory data formats: Java Integer/String, Hadoop IntWritable/Text
* User-supplied map/reduce scripts: data is exchanged over stdin/stdout, so any language can be used
* User-defined functions (UDF): substr, trim, ... (one input value to one output value, 1:1)
* User-defined aggregate functions (UDAF): sum, average, ... (many input values to one output value, N:1)
2.1 Data File Format
|                              | TextFile   | SequenceFile | RCFile       |
|------------------------------|------------|--------------|--------------|
| Data type                    | Text only  | Text/binary  | Text/binary  |
| Internal storage order       | Row-based  | Row-based    | Column-based |
| Compression                  | File-based | Block-based  | Block-based  |
| Splittable                   | Yes        | Yes          | Yes          |
| Splittable after compression | No         | Yes          | Yes          |
For example, you can specify the file format when creating a table:
CREATE TABLE mylog (user_id BIGINT, page_url STRING, unix_time INT)
STORED AS TEXTFILE;
If your data files are in a format that Hive cannot recognize, you can define a custom file format. Refer to contrib/src/java/org/apache/hadoop/hive/contrib/fileformat/base64 for an example. Once the custom format classes are written, you can specify them when creating a table:
CREATE TABLE base64_test(col1 STRING, col2 STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextOutputFormat';
2.2 SerDe
SerDe is short for Serialize/Deserialize; it is the mechanism Hive uses for serialization and deserialization.
Serialization formats include: delimited text (tab, comma, Ctrl-A) and the Thrift protocol.
Deserialized (in-memory) formats include: Java Integer/String/ArrayList/HashMap, Hadoop Writable, and custom classes.
LazyObject deserializes a column only when it is accessed. BinarySortable preserves the sort order in its binary representation.
Consider adding a new SerDe when:
* Your data has a special serialization format that Hive does not support, and you do not want to convert the data before loading it into Hive.
* You have a more efficient way of serializing the data on disk.
To add a custom SerDe for text data, refer to contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java. RegexSerDe deserializes data using a regular expression supplied by the user, for example:
CREATE TABLE apache_log(
  host STRING, identity STRING, user STRING, time STRING, request STRING,
  status STRING, size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES(
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s")
STORED AS TEXTFILE;
To add a custom SerDe for binary data, refer to the example in serde/src/java/org/apache/hadoop/hive/serde2/binarysortable, for example:
CREATE TABLE mythrift_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.thrift.ThriftSerDe'
WITH SERDEPROPERTIES(
  "serialization.class" = "com.facebook.serde.tprofiles.full",
  "serialization.format" = "com.facebook.thrift.protocol.TBinaryProtocol");
2.3 MAP/reduce script (Transform)
You can plug your own map/reduce scripts into Hive, for example:
FROM (
  SELECT TRANSFORM(user_id, page_url, unix_time)
  USING 'page_url_to_id.py'
  AS (user_id, page_id, unix_time)
  FROM mylog
  DISTRIBUTE BY user_id
  SORT BY user_id, unix_time
) mylog2
SELECT TRANSFORM(user_id, page_id, unix_time)
USING 'my_python_session_cutter.py' AS (user_id, session_info);
The map/reduce scripts read and write data through stdin/stdout, and write debugging information to stderr.
2.4 UDF (user-defined-function)
You can define your own functions to process data, for example:
add jar build/ql/test/test-udfs.jar;
CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.UDFTestLength';
SELECT testlength(src.value) FROM src;
DROP TEMPORARY FUNCTION testlength;
UDFTestLength.java is:
package org.apache.hadoop.hive.ql.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFTestLength extends UDF {
  // Returns the length of the input string, or NULL for NULL input.
  public Integer evaluate(String s) {
    if (s == null) {
      return null;
    }
    return s.length();
  }
}
UDFs have the following characteristics:
* UDFs are easy to write in Java.
* Hadoop Writables/Text arguments give high performance.
* UDFs can be overloaded.
* Hive supports implicit type conversion of arguments.
* UDFs support variable-length argument lists.
* GenericUDF offers better performance (it avoids reflection).
2.5 UDAF (user-defined aggregation function)
Example:
SELECT page_url, count(1), count(DISTINCT user_id) FROM mylog GROUP BY page_url;
The UDAFCount.java code is as follows:
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public class UDAFCount extends UDAF {
  public static class Evaluator implements UDAFEvaluator {
    private int mCount;

    public void init() {
      mCount = 0;
    }

    // Called for each input row; counts non-null values.
    public boolean iterate(Object o) {
      if (o != null) {
        mCount++;
      }
      return true;
    }

    // Partial result handed from mappers to reducers.
    public Integer terminatePartial() {
      return mCount;
    }

    // Merges a partial result from another task.
    public boolean merge(Integer o) {
      mCount += o;
      return true;
    }

    // Returns the final result.
    public Integer terminate() {
      return mCount;
    }
  }
}