Type conversions
Hive atomic data types can be converted implicitly: if an expression expects INT, a TINYINT value will be promoted to INT. Hive does not perform the reverse (narrowing) conversion unless an explicit CAST is used.
Data type conversion rules:
Any integer type can be converted to a wider integer type.
All integer types, FLOAT, and string types can be implicitly converted to DOUBLE.
TINYINT, SMALLINT, and INT can all be converted to FLOAT.
BOOLEAN cannot be converted to any other data type.
Use CAST for explicit data-type conversions:
e.g. CAST('1' AS INT) converts the string '1' to the integer value 1.
If CAST fails to perform the conversion, the expression returns NULL.
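The conversion rules above can be tried directly in the Hive shell; a small sketch (the literal values are only illustrative, and the `Y` suffix for a TINYINT literal assumes a reasonably recent Hive version):

```sql
-- implicit widening: the TINYINT operand is promoted to INT for the addition
SELECT 1Y + 1;

-- explicit conversion with CAST: returns the integer 1
SELECT CAST('1' AS INT);

-- a failed conversion yields NULL rather than raising an error
SELECT CAST('hello' AS INT);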
Complex data types:
Hive has three complex data types: ARRAY, MAP, and STRUCT.
ARRAY and MAP are similar to the Java types of the same name; a STRUCT is a record type that encapsulates a named collection of fields.
Complex data types allow arbitrary levels of nesting.
A complex data type declaration must use angle-bracket notation to indicate the types of the fields within it.
e.g.
CREATE TABLE complex( col1 ARRAY<INT>, col2 MAP<STRING,INT>, col3 STRUCT<a:STRING,b:INT,c:DOUBLE>);
Access to a field of each type is shown below:
SELECT col1[0], col2['b'], col3.c FROM complex;
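To illustrate the arbitrary nesting mentioned above, here is a hypothetical table (the table and column names are made up for this example) whose single column nests ARRAY, MAP, and ARRAY again; access simply drills down level by level:

```sql
-- an array of maps, where each map value is itself an array of integers
CREATE TABLE nested (
  col1 ARRAY<MAP<STRING, ARRAY<INT>>>
);

-- index the outer array, then look up a map key, then index the inner array
SELECT col1[0]['key'][1] FROM nested;
```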
Operators and functions:
The operators provided by Hive include:
Relational operators: equality test (x = 'a'), null test (x IS NULL), pattern matching (x LIKE 'a%')
Arithmetic operators: e.g. addition (x + 1)
Logical operators: e.g. logical OR (x OR y)
As in MySQL, strings in Hive are concatenated with the concat function rather than an operator.
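The operators above can be combined freely in a query; a short sketch against a hypothetical table `t` with columns `name` and `score` (both names are assumptions for illustration):

```sql
-- string concatenation uses the concat function, not an operator
SELECT concat('foo', 'bar');

-- relational, arithmetic, logical, and null-test operators in one predicate
SELECT *
FROM t
WHERE name LIKE 'a%'                        -- pattern match
  AND (score + 1 > 60 OR score IS NULL);    -- arithmetic, comparison, OR, null test
```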
Hive built-in functions:
Hive provides a large number of built-in functions, grouped into several broad categories: mathematical and statistical functions, string functions, date functions, conditional functions, aggregate functions, and functions for processing XML (via XPath) and JSON.
In the Hive shell you can list the available functions with SHOW FUNCTIONS; to find out how a specific function is used, use the DESCRIBE FUNCTION command:
DESCRIBE FUNCTION length;
Tables:
A Hive table logically consists of the stored data and the metadata describing its layout. The data is generally stored in HDFS, but it can also be stored in other Hadoop filesystems, including the local filesystem and S3. Hive stores the metadata in a relational database.
Managed tables and external tables
By default, when Hive creates a table it manages the data itself, moving the data into its "warehouse directory". The alternative is an external table, which lets Hive access data that lives outside the warehouse directory.
The difference between the two kinds of table shows up in the semantics of LOAD and DROP:
Managed tables: when you load data into a managed table, Hive moves the data into its warehouse directory.
e.g.
CREATE TABLE t1 (dummy STRING);
LOAD DATA INPATH '/usr/tom/test.txt' INTO TABLE t1;
This moves the file hdfs://usr/tom/test.txt into the table's directory under Hive's warehouse, i.e. hdfs://usr/hive/warehouse/t1.
The load operation is simply a file move within the filesystem. Even for a managed table, Hive does not check that the data matches the schema declared for the table; a query returns NULL for any missing field, so you can verify by querying that the data is being parsed correctly.
To drop the table:
DROP TABLE t1;
This deletes the table, including both its metadata and its data. LOAD is a move operation and DROP is a delete operation, so the data disappears completely; that is what it means for Hive to manage the data.
External tables: unlike with managed tables, you control the creation and deletion of the data yourself, and the location of the external data must be specified when the table is created:
CREATE EXTERNAL TABLE t2 (dummy STRING) LOCATION '/usr/tom/t2';
LOAD DATA INPATH '/usr/tom/test.txt' INTO TABLE t2;
With the EXTERNAL keyword, Hive knows that it does not manage the data, so it does not move the data into its own warehouse directory. In fact, Hive does not even check whether the external location exists when the table is defined, so creating the data can be deferred until after the table is created.
When you drop an external table, Hive does not touch the data; it deletes only the metadata.
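The contrasting DROP semantics can be summarized side by side, reusing the paths from the examples above:

```sql
-- managed table: the file is MOVED into the warehouse,
-- and DROP deletes both metadata and data
CREATE TABLE t1 (dummy STRING);
LOAD DATA INPATH '/usr/tom/test.txt' INTO TABLE t1;
DROP TABLE t1;   -- data under hdfs://usr/hive/warehouse/t1 is deleted

-- external table: Hive never takes ownership of the files,
-- and DROP deletes only the metadata
CREATE EXTERNAL TABLE t2 (dummy STRING) LOCATION '/usr/tom/t2';
DROP TABLE t2;   -- files under /usr/tom/t2 are left intact
```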
How to choose which kind of table to use:
Apart from the DROP semantics, there is not much difference between the two. If all processing is done by Hive, managed tables should be used; if Hive and other tools process the same data, external tables should be used. A common practice is to register the initial dataset in HDFS (created by some other process) as an external table, and then use a Hive transformation to move the data into a managed Hive table; this also works in the other direction, where an external table (not necessarily in HDFS) is used to export data from Hive for other applications to use. Another reason to use external tables is to associate different schemas with the same data.
Summary:
This note covered Hive's atomic type conversions, complex data types, operators and functions, and Hive tables. Tables divide into managed tables and external tables; we looked at how each is defined, how they differ, and when to use each. Their biggest difference is the DROP semantics: dropping a managed table deletes both the table's metadata and its data, while dropping an external table deletes only the metadata.
Copyright notice: this is the author's original article and may not be reproduced without the author's permission.
Hive Learning Notes (ii)