Hive discussion (4)-hive QL

Source: Internet
Author: User
Document directory
  • Create Table
  • ALTER TABLE
  • Select
  • Join

The official Hive documentation describes the query language in detail; see http://wiki.apache.org/hadoop/Hive/LanguageManual. Most of this article is translated from that page, with notes on points to watch for in practice.

Create Table
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)]
    INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]

CREATE TABLE creates a table with the specified name. If a table with the same name already exists, an exception is thrown; you can use the IF NOT EXISTS option to suppress it.

The EXTERNAL keyword lets you create an external table and specify a path pointing to the actual data. When an internal (managed) table is created in Hive, data loaded into it is moved into the warehouse directory. For an external table, only the path of the data is recorded; the data itself is not moved. When a table is dropped, an internal table's metadata and data are deleted together, while dropping an external table deletes only the metadata and leaves the data in place.
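To illustrate the difference, a minimal sketch (table names and the HDFS path are hypothetical):

```sql
-- Internal (managed) table: LOAD moves data into the warehouse directory,
-- and DROP TABLE deletes both the metadata and the data.
CREATE TABLE page_view_managed (viewTime INT, userid BIGINT);

-- External table: only the path is recorded; DROP TABLE deletes metadata only.
-- '/data/page_view' stands in for a location with existing data files.
CREATE EXTERNAL TABLE page_view_ext (viewTime INT, userid BIGINT)
  LOCATION '/data/page_view';
```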

LIKE lets you copy an existing table's structure without copying its data.
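A minimal sketch (names hypothetical): the new table gets the same column and partition layout as the original, but starts empty.

```sql
-- empty_copy has the same schema as page_view but contains no data
CREATE TABLE empty_copy LIKE page_view;
```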

You can use a custom SerDe or the built-in SerDe when creating a table. If ROW FORMAT or ROW FORMAT DELIMITED is not specified, the built-in SerDe is used. When you create a table you also specify its columns; when you specify a custom SerDe, Hive relies on that SerDe to interpret the data for the table's columns.

If the file data is plain text, use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.
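Putting the row-format and file-format clauses together, a delimited text table might look like this (the table name, columns, and delimiters are illustrative):

```sql
CREATE TABLE logs (host STRING, request STRING, bytes INT)
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'   -- one tab-separated record per line
    LINES TERMINATED BY '\n'
  STORED AS TEXTFILE;           -- use STORED AS SEQUENCEFILE for compressed data
```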

Use the PARTITIONED BY clause to create a partitioned table. A table can have one or more partitions, each stored in its own directory. In addition, tables and partitions can be bucketed with CLUSTERED BY on a set of columns, and the data within a bucket can be ordered with SORT BY. This can improve performance for certain kinds of queries.
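A sketch combining both clauses, loosely following the page_view example used later in this article (column names and the bucket count are illustrative):

```sql
CREATE TABLE page_view (viewTime INT, userid BIGINT,
                        page_url STRING, ip STRING)
  PARTITIONED BY (dt STRING, country STRING)  -- one directory per partition
  CLUSTERED BY (userid) SORTED BY (viewTime) INTO 32 BUCKETS
  STORED AS SEQUENCEFILE;
```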

Table and column names are case insensitive; SerDe and property names are case sensitive. Comments on tables and columns are strings.

Drop table

Dropping an internal table deletes both the table's metadata and its data. Dropping an external table deletes only the metadata; the data is retained.
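The statement itself is simply:

```sql
DROP TABLE page_view;  -- for an EXTERNAL table, the HDFS files remain
```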

ALTER TABLE

The ALTER TABLE statement lets you change the structure of an existing table: add columns or partitions, change the SerDe, add table and SerDe properties, or rename the table itself.

Add partitions

ALTER TABLE table_name ADD
  partition_spec [ LOCATION 'location1' ]
  partition_spec [ LOCATION 'location2' ] ...

partition_spec:
  : PARTITION (partition_col = partition_col_value,
    partition_col = partition_col_value, ...)

You can use ALTER TABLE ADD PARTITION to add partitions to a table. When a partition value is a string, enclose it in quotation marks.

  ALTER TABLE page_view ADD
    PARTITION (dt='2008-08-08', country='us')
      LOCATION '/path/to/us/part080808'
    PARTITION (dt='2008-08-09', country='us')
      LOCATION '/path/to/us/part080809';

Drop Partition

ALTER TABLE table_name DROP
  partition_spec, partition_spec, ...

You can use ALTER TABLE DROP PARTITION to delete partitions. Both the metadata and the data of the partition are deleted.

ALTER TABLE page_view
  DROP PARTITION (dt='2008-08-08', country='us');

RENAME TABLE

ALTER TABLE table_name RENAME TO new_table_name

This command renames a table. The data location and partition names remain unchanged. In other words, the old data directory is not "released": data written to the old location still belongs to the renamed table.

Change column name/type/position/comment

ALTER TABLE table_name CHANGE [COLUMN]
  col_old_name col_new_name column_type
  [COMMENT col_comment]
  [FIRST|AFTER column_name]

This command allows you to modify the name, data type, comment, or position of a column.

For example:

CREATE TABLE test_change (a int, b int, c int);

ALTER TABLE test_change CHANGE a a1 INT;

Renames column a to a1.

ALTER TABLE test_change CHANGE a a1 STRING AFTER b;

Renames column a to a1, changes its type to STRING, and moves it after column b. The new schema is: b int, a1 string, c int.

ALTER TABLE test_change CHANGE b b1 INT FIRST;

Renames column b to b1 and moves it to the first position. The new schema is: b1 int, a1 string, c int.

Note: changing a column only modifies Hive metadata; it does not change the actual data. You are responsible for keeping the metadata definition consistent with the actual data layout.

Add/replace Columns

ALTER TABLE table_name ADD|REPLACE
  COLUMNS (col_name data_type [COMMENT col_comment], ...)

ADD COLUMNS lets you append new columns after the existing columns but before the partition columns.

REPLACE COLUMNS drops the existing columns and then adds the new set. This is allowed only for tables using a native SerDe (DynamicSerDe or MetadataTypedColumnsetSerDe).
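A brief sketch against the hypothetical test_change table from the previous section:

```sql
-- Append a new column after the existing ones (before any partition columns)
ALTER TABLE test_change ADD COLUMNS (d STRING COMMENT 'new column');

-- Replace the whole column list; only valid for tables using a native SerDe
ALTER TABLE test_change REPLACE COLUMNS (a1 INT, b1 INT);
```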

Alter table Properties

ALTER TABLE table_name SET TBLPROPERTIES table_properties

table_properties:
  : (property_name = property_value, property_name = property_value, ...)

You can use this command to add your own metadata to a table. Currently the last_modified_user and last_modified_time properties are managed automatically by Hive. You can add your own properties to the list and retrieve them with DESCRIBE EXTENDED table.
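For example (the property keys here are arbitrary user-chosen names, not reserved by Hive):

```sql
ALTER TABLE page_view SET TBLPROPERTIES (
  'created_by' = 'analytics-team',  -- hypothetical custom property
  'purpose'    = 'clickstream data'
);
```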

Add serde Properties

ALTER TABLE table_name
  SET SERDE serde_class_name
  [WITH SERDEPROPERTIES serde_properties]

ALTER TABLE table_name
  SET SERDEPROPERTIES serde_properties

serde_properties:
  : (property_name = property_value,
    property_name = property_value, ...)

These commands let you attach user-defined metadata to a table's SerDe object. Hive initializes the SerDe with these properties when it serializes or deserializes data, so they provide a way to store configuration for a custom SerDe.
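A sketch using the RegexSerDe shipped in Hive's contrib package; the class name and the `input.regex` property key come from the Hive distribution and may differ by version:

```sql
-- Parse each line of the (hypothetical) logs table with a regular expression;
-- each capture group becomes one column.
ALTER TABLE logs
  SET SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
  WITH SERDEPROPERTIES ('input.regex' = '([^ ]*) ([^ ]*) (.*)');
```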

Alter table file format and organization

ALTER TABLE table_name SET FILEFORMAT file_format

ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...)
  [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS

These commands modify the table's physical storage properties.
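For instance (table name and bucket count are illustrative):

```sql
ALTER TABLE page_view SET FILEFORMAT SEQUENCEFILE;

ALTER TABLE page_view CLUSTERED BY (userid)
  SORTED BY (viewTime) INTO 32 BUCKETS;
```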

Loading files into table

When data is loaded into a table, no transformation is applied. The load operation simply copies or moves the data files into the location corresponding to the Hive table.

Syntax:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
  INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2 ...)]

Synopsis:

The load operation is just a copy/move operation. It moves the data file to the corresponding location of the hive table.

  • Filepath can be:

    • A relative path, for example: project/data1
    • An absolute path, for example: /user/hive/project/data1
    • A full URI with scheme (and optionally authority), for example: hdfs://namenode:9000/user/hive/project/data1
  • The target can be a table or a partition. If the table is partitioned, you must specify the target partition by name.
  • Filepath can refer to a single file (Hive moves the file into the table's directory) or a directory (Hive moves all files in that directory into the table's directory).
  • If LOCAL is specified:
    • The load command looks for filepath in the local file system. A relative path is interpreted relative to the user's current working directory. You can also give a full URI for a local file, such as file:///user/hive/project/data1.
    • The load command copies the files in filepath to the target file system (determined by the table's location attribute); the copied files are then moved into the table's data location.
  • If LOCAL is not specified and filepath is a full URI, Hive uses that URI directly. Otherwise:
    • If no scheme or authority is given, Hive uses the scheme and authority from the Hadoop configuration, where fs.default.name specifies the namenode URI.
    • If the path is not absolute, Hive interprets it relative to /user/<username>.
    • Hive moves the files in filepath into the path of the table (or partition).
  • If the OVERWRITE keyword is used, the existing contents of the target table (or partition), if any, are deleted first, and then the files in the file/directory pointed to by filepath are added to the table/partition.
  • If the target table (or partition) already contains a file whose name conflicts with a file in filepath, the existing file is replaced by the new one.
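Tying the rules above together, two hedged examples (paths and partition values are hypothetical):

```sql
-- Copy a local file into one partition of page_view; OVERWRITE clears
-- any existing contents of that partition first.
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt'
  OVERWRITE INTO TABLE page_view
  PARTITION (dt='2008-06-08', country='us');

-- Without LOCAL, the files are already in HDFS and are moved, not copied.
LOAD DATA INPATH '/user/hive/staging/pv' INTO TABLE page_view
  PARTITION (dt='2008-06-08', country='us');
```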
Select

Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
  [LIMIT number]
  • A SELECT statement can be part of a UNION query or a subquery of another query.
  • Table_reference is the input of the query. It can be an ordinary table, a view, a join, or a subquery.
  • Simple query. For example, the following statement retrieves all columns from table t1.
SELECT * FROM t1

Where clause

The WHERE condition is a Boolean expression. For example, the following query returns only sales records with an amount greater than 10 for the US region. Hive does not support IN, EXISTS, or subqueries in the WHERE clause.

SELECT * FROM sales WHERE amount > 10 AND region = "US"

ALL and DISTINCT Clauses

Use the ALL and DISTINCT options to control how duplicate rows are handled. The default is ALL, which returns all matching rows; DISTINCT removes duplicate rows from the result set.

hive> SELECT col1, col2 FROM t1
1 3
1 3
1 4
2 5
hive> SELECT DISTINCT col1, col2 FROM t1
1 3
1 4
2 5
hive> SELECT DISTINCT col1 FROM t1
1
2

Partition-Based Query

In general, a SELECT query scans the entire table (except for sampling queries). However, if a table was created with the PARTITIONED BY clause, a query can use input pruning to scan only the partitions it needs. Hive's current implementation enables partition pruning only when the partition predicates appear in the WHERE clause closest to the FROM clause. For example, if the page_views table is partitioned on the date column, the following statement reads only the partitions between 2008-03-01 and 2008-03-31.

SELECT page_views.*
  FROM page_views
  WHERE page_views.date >= '2008-03-01'
    AND page_views.date <= '2008-03-31';

Having clause

Hive currently does not support the HAVING clause, but it can be rewritten as a subquery. For example:

SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10

You can use the following query:

SELECT col1 FROM (
  SELECT col1, SUM(col2) AS col2sum
  FROM t1 GROUP BY col1
) t2
WHERE t2.col2sum > 10

Limit clause

LIMIT restricts the number of rows returned. The rows are chosen arbitrarily. The following query returns five arbitrary rows from table t1:

SELECT * FROM t1 LIMIT 5

Top-k queries. The following statements return the five sales records with the largest amounts:

SET mapred.reduce.tasks = 1
SELECT * FROM sales SORT BY amount DESC LIMIT 5

RegEx column specification

The SELECT statement can use regular expressions to choose columns. The following query selects all columns except ds and hr:

SELECT `(ds|hr)?+.+` FROM sales
Join

Syntax

join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER]
    JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN
    table_reference join_condition

table_reference:
    table_factor
  | join_table

table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )

join_condition:
    ON equality_expression ( AND equality_expression )*

equality_expression:
    expression = expression

Hive supports only equality joins, outer joins, and left semi joins. Hive does not support non-equi joins, because they are difficult to express as map/reduce jobs. Hive also supports joining more than two tables.

Note the following points when writing a join query:
1. Only equi-joins are supported. For example:

  SELECT a.* FROM a JOIN b ON (a.id = b.id)

  SELECT a.* FROM a JOIN b
    ON (a.id = b.id AND a.department = b.department)

are both valid. However:

  SELECT a.* FROM a JOIN b ON (a.id <> b.id)

is not allowed.

2. More than two tables can be joined. For example:

  SELECT a.val, b.val, c.val FROM a JOIN b
    ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

If every join in a query uses the same join key, the joins are compiled into a single map/reduce job. For example:

  SELECT a.val, b.val, c.val FROM a JOIN b
    ON (a.key = b.key1) JOIN c
    ON (c.key = b.key1)

is compiled into a single map/reduce job, because only b.key1 is used as a join key.

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1)
  JOIN c ON (c.key = b.key2)

This join is compiled into two map/reduce jobs, because b.key1 is used in the first join condition and b.key2 in the second.

In each map/reduce job of a join, the reducer buffers the rows of every table except the last one in the join sequence, then streams the rows of the last table and writes the results to the file system. This implementation reduces memory usage on the reduce side. In practice, the largest table should therefore be placed last in the join; otherwise a large amount of memory is wasted on buffering. For example:

 SELECT a.val, b.val, c.val FROM a
   JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

All tables use the same join key, so a single map/reduce job is used. The reduce side buffers the rows of tables a and b, and computes the join result each time it receives a row of table c. Similarly, in:

  SELECT a.val, b.val, c.val FROM a
    JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

two map/reduce jobs are used. The first buffers table a and streams table b; the second buffers the result of the first job and streams table c.

The LEFT, RIGHT, and FULL OUTER keywords control how unmatched rows are handled in a join. For example:

  SELECT a.val, b.val FROM a LEFT OUTER
    JOIN b ON (a.key=b.key)

Every row in table a produces an output row. The output is a.val, b.val when a.key = b.key, and a.val, NULL when there is no matching row in b. The phrase "FROM a LEFT OUTER JOIN b" means table a is to the left of table b, so all rows of table a are retained; "a RIGHT OUTER JOIN b" retains all rows of table b. The outer join semantics follow the standard SQL spec.

Joins occur before the WHERE clause. To limit the join output, write the filter condition in the WHERE clause, or write it into the join clause itself. A confusing case involves partitioned tables:

  SELECT a.val, b.val FROM a
    LEFT OUTER JOIN b ON (a.key=b.key)
    WHERE a.ds='2009-07-07' AND b.ds='2009-07-07'

This joins table a to table b (outer join) and lists a.val and b.val. Other columns can be used as filter conditions in the WHERE clause. However, as described above, if a row of table a has no match in table b, all columns of b are NULL, including the ds column. The WHERE condition b.ds='2009-07-07' therefore filters out every row of a that has no matching join key in b, so the LEFT OUTER part becomes irrelevant and the query no longer returns the unmatched rows of a. The solution is to move the partition filter into the join condition:

  SELECT a.val, b.val FROM a LEFT OUTER JOIN b
    ON (a.key=b.key AND
        b.ds='2009-07-07' AND
        a.ds='2009-07-07')

With this query, the filtering happens during the join itself, so the problem above does not arise. The same approach works for RIGHT and FULL joins.

Joins are not commutative: both LEFT and RIGHT joins are left-associative, i.e. evaluated from left to right.

  SELECT a.val1, a.val2, b.val, c.val
    FROM a
    JOIN b ON (a.key = b.key)
    LEFT OUTER JOIN c ON (a.key = c.key)

This first joins table a to table b, discarding every row that has no matching key in the other table, and then left-outer-joins the intermediate result with table c. The consequence is not obvious when a key exists in both table a and table c but not in table b: the entire row (including a.val1, a.val2, and a.key) is discarded in the first join, so when the intermediate result is joined with table c, a row of c whose c.key matches that a.key (or b.key) yields NULL, NULL, NULL, c.val.

LEFT SEMI JOIN is an efficient implementation of IN/EXISTS subquery semantics. Hive does not currently support IN/EXISTS subqueries, so you can use LEFT SEMI JOIN to rewrite them. The restriction is that the right-hand table may appear only in the join condition (the ON clause); it cannot be referenced in the WHERE clause, the SELECT list, or anywhere else.

  SELECT a.key, a.value
    FROM a
    WHERE a.key IN
      (SELECT b.key FROM b);

can be rewritten as:

  SELECT a.key, a.value
    FROM a LEFT SEMI JOIN b ON (a.key = b.key)
