Hive's query language is described in great detail in the official documentation; see http://wiki.apache.org/hadoop/Hive/LanguageManual. Most of this article is translated from that page, with some notes added about things to watch for in practice.
Create Table

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]
CREATE TABLE creates a table with the given name. An exception is thrown if a table of the same name already exists; the user can ignore the exception with the IF NOT EXISTS option.
The EXTERNAL keyword allows the user to create an external table, specifying a path to the actual data (LOCATION) while creating the table. When an internal (managed) table is created, Hive moves the data into the path pointed to by the data warehouse; when an external table is created, Hive only records the path where the data resides and does not change the data's location. When a table is dropped, the metadata and data of an internal table are deleted together, while dropping an external table deletes only the metadata and leaves the data in place.
LIKE allows the user to copy an existing table definition; the new table has the same structure but none of the data.
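As a minimal sketch (the table names are hypothetical), copying only the structure of a table looks like this:

```sql
-- Create empty_page_view with the same schema as page_view,
-- but without copying any of page_view's data.
CREATE TABLE empty_page_view LIKE page_view;
```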
Users can use a custom SerDe or the built-in SerDe when creating a table. If ROW FORMAT or ROW FORMAT DELIMITED is not specified, the built-in SerDe is used. When creating the table, the user also specifies the table's columns; given the column definitions and a custom SerDe, Hive uses the SerDe to determine the actual data of the table's columns.
If the file data is plain text, use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.
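Putting the clauses above together, here is a sketch of a partitioned, bucketed table (column names, bucket count, and delimiter are illustrative):

```sql
CREATE TABLE page_view (
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  ip STRING COMMENT 'IP address of the user'
)
COMMENT 'This is the page view table'
PARTITIONED BY (dt STRING, country STRING)
CLUSTERED BY (userid) SORTED BY (viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;
```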
Partitioned tables can be created with the PARTITIONED BY clause. A table can have one or more partitions, each corresponding to its own directory. In addition, tables and partitions can be clustered on a column with CLUSTERED BY, placing rows into buckets, and the data within a bucket can be sorted with SORT BY. This can improve performance for certain kinds of queries.
Table and column names are case-insensitive, while SerDe and property names are case-sensitive. Comments on tables and columns are strings.
Drop Table
Dropping an internal table deletes both the table's metadata and its data. Dropping an external table deletes only the metadata and keeps the data.
Alter Table
The ALTER TABLE statement allows the user to change the structure of an existing table: add columns or partitions, change the SerDe, add table and SerDe properties, or rename the table itself.
ADD partitions
ALTER TABLE table_name ADD partition_spec [LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...

partition_spec: PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)
Users can add partitions to a table with ALTER TABLE ADD PARTITION. Quote the partition value when it is a string.
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') LOCATION '/path/to/us/part080808' PARTITION (dt='2008-08-09', country='us') LOCATION '/path/to/us/part080809';
DROP PARTITION
ALTER TABLE table_name DROP partition_spec, partition_spec, ...
Users can delete partitions with ALTER TABLE DROP PARTITION. Both the metadata and the data of the partition are deleted.
ALTER TABLE page_view DROP PARTITION (dt='2008-08-08', country='us');
RENAME TABLE
ALTER TABLE table_name RENAME TO new_table_name
This command allows the user to rename the table. The location of the data and the partition names do not change. In other words, the old table's storage location is not "released": modifying the data at the old location changes the data seen through the new table name.
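A minimal sketch (the table names are hypothetical):

```sql
-- Rename events to archived_events; the data files stay where they are.
ALTER TABLE events RENAME TO archived_events;
```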
Change Column name/type/position/comment
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
This command allows the user to modify a column's name, data type, comment, or position.
For example:
CREATE TABLE Test_change (a int, b int, c int);
ALTER TABLE test_change CHANGE a a1 INT; changes the name of column a to a1.
ALTER TABLE test_change CHANGE a a1 STRING AFTER b; changes the name of column a to a1, changes its type to STRING, and places it after column b. The new table structure is: b int, a1 string, c int.
ALTER TABLE test_change CHANGE b b1 INT FIRST; changes the name of column b to b1 and places it in the first position. The new table structure is: b1 int, a1 string, c int.
Note: column changes only modify Hive's metadata and do not alter the actual data. Users should ensure that the metadata definition stays consistent with the actual data layout.
Add/replace Columns
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
ADD COLUMNS allows the user to add new columns after the existing columns but before the partition columns.
REPLACE COLUMNS removes the existing columns and adds the new ones. This is possible only for tables using a native SerDe (DynamicSerDe or MetadataTypedColumnsetSerDe).
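A sketch of both forms, reusing the test_change table from above (the new column names are illustrative):

```sql
-- Append a new column d after the existing columns.
ALTER TABLE test_change ADD COLUMNS (d STRING COMMENT 'a new column');

-- Replace the whole column list; the old column definitions are dropped
-- from the metadata (the underlying data files are untouched).
ALTER TABLE test_change REPLACE COLUMNS (a INT, b INT);
```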
Alter Table Properties
ALTER TABLE table_name SET TBLPROPERTIES table_properties

table_properties: (property_name = property_value, property_name = property_value, ...)
Users can add their own metadata to a table with this command. The last_modified_user and last_modified_time properties are automatically managed by Hive. Users can add custom properties to this list, and DESCRIBE EXTENDED table can be used to view this information.
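For instance (the property key and value here are illustrative):

```sql
-- Attach a custom property to the table.
ALTER TABLE page_view SET TBLPROPERTIES ('comment' = 'web traffic logs');

-- Inspect table properties, including the ones managed by Hive.
DESCRIBE EXTENDED page_view;
```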
ADD Serde Properties
ALTER TABLE table_name SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties]

ALTER TABLE table_name SET SERDEPROPERTIES serde_properties

serde_properties: (property_name = property_value, property_name = property_value, ...)
This command allows the user to add user-defined metadata to a SerDe object. When Hive serializes and deserializes data, these SerDe properties are initialized and passed to the table's SerDe. This gives users a way to store properties for a custom SerDe.
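For instance, a delimiter property passed to the table's SerDe (a sketch; the 'field.delim' key is the convention used by Hive's built-in delimited SerDe):

```sql
-- Tell the table's SerDe that fields are separated by commas.
ALTER TABLE page_view SET SERDEPROPERTIES ('field.delim' = ',');
```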
Alter Table File Format and Organization
ALTER TABLE table_name SET FILEFORMAT file_format

ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS
This command modifies the physical storage properties of the table.
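A sketch of both forms (the table name, columns, and bucket count are illustrative):

```sql
-- Switch the table's storage format.
ALTER TABLE page_view SET FILEFORMAT SEQUENCEFILE;

-- Change the bucketing definition: 48 buckets clustered on userid.
ALTER TABLE page_view CLUSTERED BY (userid) SORTED BY (viewTime) INTO 48 BUCKETS;
```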
Loading files into table
When data is loaded into a table, no transformation is applied to the data. The Load operation simply copies/moves the data files to the location corresponding to the Hive table.
Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Synopsis:
The Load operation is simply a copy/move operation that moves the data file to the location corresponding to the Hive table.
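For instance, loading a local file into one partition of a page_view table partitioned by dt and country (the file path is hypothetical):

```sql
-- Copy a file from the local file system into the dt/country partition,
-- replacing whatever that partition currently contains.
LOAD DATA LOCAL INPATH '/tmp/pv_2008-08-08_us.txt'
OVERWRITE INTO TABLE page_view PARTITION (dt='2008-08-08', country='us');
```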
filepath can be:

- a relative path, such as project/data1
- an absolute path, such as /user/hive/project/data1
- a complete URI with scheme (and optionally an authority), such as hdfs://namenode:9000/user/hive/project/data1

The target of the load can be a table or a partition. If the table is partitioned, the value of each partition column must be specified. filepath can refer to a file (in which case Hive moves that file into the table's directory) or to a directory (in which case Hive moves every file in that directory into the table's directory).

If LOCAL is specified:

- The LOAD command looks for filepath in the local file system. A relative path is interpreted relative to the user's current working directory. The user can also give a full URI for the local file, such as file:///user/hive/project/data1.
- The LOAD command copies the files in filepath to the target file system, which is determined by the table's LOCATION attribute; the copied files are then moved into the table's data location.

If the LOCAL keyword is not specified:

- If filepath is a complete URI, Hive uses it directly.
- Otherwise, if no scheme or authority is given, Hive uses the scheme and authority of the NameNode URI defined by fs.default.name in the Hadoop configuration; if the path is not absolute, Hive interprets it relative to /user/. Hive then moves the files referred to by filepath into the table's (or partition's) path.

If the OVERWRITE keyword is used, the existing content (if any) of the target table (or partition) is deleted first, and then the files or directory at filepath are added to the table/partition. Without OVERWRITE, if the target table (partition) already contains a file whose name conflicts with a file name in filepath, the existing file is replaced by the new one.

SELECT
Syntax
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]

A SELECT statement can be part of a union query or a subquery. table_reference is the input of the query; it can be a regular table, a view, a join, or a subquery. Simple query: for example, the following statement queries all columns of table t1.

SELECT * FROM t1
WHERE Clause
where_condition is a boolean expression. For example, the following query returns only sales records whose amount is greater than 10 and whose region is the US. Hive does not support IN, EXISTS, or subqueries in the WHERE clause.
SELECT * FROM sales WHERE amount > 10 AND region = "US"
All and DISTINCT clauses
Use the ALL and DISTINCT options to control how duplicate records are handled. The default is ALL, which returns all matching records; DISTINCT removes duplicate records.
hive> SELECT col1, col2 FROM t1
1 3
1 3
1 4
2 5
hive> SELECT DISTINCT col1, col2 FROM t1
1 3
1 4
2 5
hive> SELECT DISTINCT col1 FROM t1
1
2
Query based on partition
An ordinary SELECT query scans the entire table (except for sampling queries). However, if a table was created with the PARTITIONED BY clause, a query can take advantage of partition pruning (input pruning) and scan only the parts of the table it cares about. In Hive's current implementation, partition pruning is applied only when the partition predicates appear in the WHERE clause closest to the FROM clause. For example, if the page_views table is partitioned on the date column, the following statement reads only data in the partitions between '2008-03-01' and '2008-03-31'.
SELECT page_views.* FROM page_views WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31';
Having Clause
Hive currently does not support the HAVING clause, but a HAVING clause can be rewritten as a subquery. For example:
SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10
can be expressed with the following query:
SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE t2.col2sum > 10
LIMIT Clause
LIMIT restricts the number of records returned by a query. The records returned are chosen arbitrarily. The following query returns 5 arbitrary records from table t1:
SELECT * FROM t1 LIMIT 5
Top-k queries: the following statement returns the 5 sales records with the largest amounts.
SET mapred.reduce.tasks = 1
SELECT * FROM sales SORT BY amount DESC LIMIT 5
REGEX Column Specification
A SELECT statement can use regular expressions to select columns. The following statement queries all columns except ds and hr:

SELECT `(ds|hr)?+.+` FROM sales

JOIN
Syntax
join_table: table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition

table_reference: table_factor | join_table

table_factor: tbl_name [alias] | table_subquery alias | (table_references)

join_condition: ON equality_expression (AND equality_expression)*

equality_expression: expression = expression
Hive supports only equality joins, outer joins, and left semi joins. Hive does not support any non-equality join condition, because non-equality conditions are very difficult to express as map/reduce tasks. In addition, Hive supports joining more than two tables.
There are several key points to note when writing join queries:
1. Only equality joins are supported. For example:
SELECT a.* FROM a JOIN b ON (a.id = b.id)
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)
are both correct, however:
SELECT a.* FROM a JOIN b ON (a.id <> b.id)
Is wrong.
2. More than two tables can be joined, for example:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
If every join in the query uses the same join key, the join is compiled into a single map/reduce task. For example:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
is compiled into a single map/reduce task, because only b.key1 is used as the join key throughout.
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
This join is compiled into two map/reduce tasks, because b.key1 is used in the first join condition and b.key2 in the second.
Within each map/reduce task of a join, the logic is this: the reducer caches the rows of every table in the join sequence except the last one, then streams the rows of the last table through, writing join results to the file system. This implementation reduces memory usage on the reduce side. In practice, the largest table should therefore be placed last (otherwise a lot of memory is wasted on caching). For example:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
All tables use the same join key (so one map/reduce task is used). The reduce side caches the rows of tables a and b, then computes a join result for every row of table c. Similarly, for:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
two map/reduce tasks are used. The first caches table a's rows and streams table b's rows; the second caches the result of the first map/reduce task and streams table c's rows.
The LEFT, RIGHT, and FULL OUTER keywords handle join keys that have no matching row (producing NULLs in the output). For example:
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key)
produces one output row for every row in table a: the output is a.val, b.val when a.key = b.key has a match, and a.val, NULL when no matching b.key exists. "FROM a LEFT OUTER JOIN b" must be written on one line, to make clear that table a is on the left of table b, so all of table a's rows are retained; "a RIGHT OUTER JOIN b" retains all of table b's rows instead. OUTER JOIN semantics follow the standard SQL spec.
The join happens before the WHERE clause. To limit a join's output, write the filter condition in the WHERE clause, or write it in the join clause itself. A confusing case arises with partitioned tables:
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key) WHERE a.ds='2009-07-07' AND b.ds='2009-07-07'
joins table a to table b (as a LEFT OUTER JOIN), producing a.val and b.val; the other columns can be used as filters in the WHERE clause. However, as noted above, when a row of table a has no match in table b, all of table b's columns come out NULL, including the ds column. The WHERE condition on b.ds therefore filters out every join output row that had no match in table b, so the LEFT OUTER part of the join becomes irrelevant to the result. The workaround is to put the conditions in the ON clause of the OUTER JOIN:
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key AND b.ds='2009-07-07' AND a.ds='2009-07-07')
With this query the filtering happens in the join phase, so the problem does not arise. The same logic applies to RIGHT and FULL joins.
Joins are not commutative. Whether LEFT or RIGHT, joins are left-associative, i.e. evaluated from left to right.
SELECT a.val1, a.val2, b.val, c.val FROM a JOIN b ON (a.key = b.key) LEFT OUTER JOIN c ON (a.key = c.key)
first joins table a with table b, discarding every row that does not match on the join key, and then joins that intermediate result with table c. This statement has a non-obvious pitfall: when a key exists in both tables a and c but not in table b, the entire row is dropped in the first join (a JOIN b), including a.val1, a.val2, and a.key; then, in the subsequent LEFT OUTER JOIN with table c, a matching c.key produces the result: NULL, NULL, NULL, c.val.
LEFT SEMI JOIN is a more efficient implementation of the IN/EXISTS subquery. Hive does not currently implement IN/EXISTS subqueries, so such subqueries can be rewritten with LEFT SEMI JOIN. The restriction on LEFT SEMI JOIN is that the table on the right of the join clause may only be filtered in the ON clause, not in the WHERE clause, the SELECT list, or anywhere else.
SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);
can be rewritten as:
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON (a.key = b.key)