Hive (create, alter, etc)

Source: Internet
Author: User
Article directory
    • Create table
    • Drop table
    • ALTER TABLE
    • Loading files into table
    • Select
    • Join

The official Hive documentation describes the query language in great detail; please refer to: http://wiki.apache.org/hadoop/Hive/LanguageManual. Most of the content in this article is translated from that page, with some notes added on things to watch out for in practice.

Create table
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
    [(col_name data_type [COMMENT col_comment], ...)]
    [COMMENT table_comment]
    [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
    [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC | DESC], ...)] INTO num_buckets BUCKETS]
    [ROW FORMAT row_format]
    [STORED AS file_format]
    [LOCATION hdfs_path]

CREATE TABLE creates a table with the specified name. If a table with the same name already exists, an exception is thrown; the IF NOT EXISTS option can be used to ignore the exception.

The EXTERNAL keyword lets you create an external table and specify a path (LOCATION) pointing to the actual data. When an internal table is created, Hive moves the data into the directory pointed to by its data warehouse; when an external table is created, only the path of the data is recorded, and the data's location is not changed. When a table is dropped, an internal table's metadata and data are deleted together, whereas for an external table only the metadata is deleted and the data is kept.

LIKE allows users to copy an existing table's structure without copying its data.
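For instance, a minimal sketch (the table names are illustrative, not from the original text):

```sql
-- Creates an empty table with the same column and partition definitions
-- as page_view; no data files are copied.
CREATE TABLE page_view_copy LIKE page_view;
```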

You can use a custom SerDe or the built-in SerDe when creating a table. If ROW FORMAT or ROW FORMAT DELIMITED is not specified, the built-in SerDe is used. When creating a table you also specify the table's columns; if a custom SerDe is specified as well, Hive uses the SerDe to determine how the data of the table's columns is laid out.

If the file data is plain text, use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.

Use the PARTITIONED BY clause to create a partitioned table. A table can have one or more partitions, each with its own directory. In addition, both tables and partitions can be bucketed on a column: CLUSTERED BY distributes rows into buckets by the given columns, and SORTED BY can additionally sort the data within each bucket. This can improve performance for certain kinds of queries.

Table names and column names are not case sensitive, but SerDe and property names are. Comments on tables and columns are strings.
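Putting the clauses above together, a hypothetical table declaration might look like the following (table, column, and delimiter choices are illustrative, not from the original text):

```sql
-- Partitioned, bucketed page-view table stored as delimited text.
CREATE TABLE IF NOT EXISTS page_view (
    viewtime INT,
    userid BIGINT,
    page_url STRING COMMENT 'URL of the viewed page'
)
COMMENT 'This is the page view table'
PARTITIONED BY (dt STRING, country STRING)
CLUSTERED BY (userid) SORTED BY (viewtime) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```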

Drop table

Dropping an internal table deletes both the table's metadata and its data. Dropping an external table deletes only the metadata; the data is retained.
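As a minimal sketch (the table names are illustrative):

```sql
DROP TABLE page_view;      -- internal table: metadata and data are deleted
DROP TABLE page_view_ext;  -- external table: only metadata is deleted, files stay
```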

ALTER TABLE

The ALTER TABLE statement changes the structure of an existing table. You can add columns/partitions, change the SerDe, add table and SerDe properties, or rename the table itself.

Add partitions

ALTER TABLE table_name ADD
    partition_spec [LOCATION 'location1']
    partition_spec [LOCATION 'location2'] ...

partition_spec:
    PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)

You can use ALTER TABLE ADD PARTITION to add partitions to a table. When a partition value is a string, enclose it in quotation marks.

 
ALTER TABLE page_view ADD
    PARTITION (dt = '2017-08-08', country = 'us') LOCATION '/path/to/us/part080808'
    PARTITION (dt = '2017-08-09', country = 'us') LOCATION '/path/to/us/part080809';

Drop Partition

 
ALTER TABLE table_name DROP partition_spec, partition_spec, ...

You can use alter table drop partition to delete partitions. The metadata and data of the partition will be deleted.

ALTER TABLE page_view DROP PARTITION (dt = '2017-08-08', country = 'us');

RENAME TABLE

 
ALTER TABLE table_name RENAME TO new_table_name

This command renames a table. The data location and the partition names remain unchanged. In other words, the old table name is not "released": changes made against the old location will change the data of the new table.
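A minimal sketch (the table names are illustrative):

```sql
ALTER TABLE page_view RENAME TO page_view_old;
```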

Change column name/type/position/comment

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type
    [COMMENT col_comment] [FIRST | AFTER column_name]

This command allows you to modify the name, data type, comment, or position of a column.

For example:

CREATE TABLE test_change (a INT, b INT, c INT);

ALTER TABLE test_change CHANGE a a1 INT;
This renames column a to a1.

ALTER TABLE test_change CHANGE a a1 STRING AFTER b;
This renames column a to a1, changes its type to STRING, and places it after column b. The new table structure is: b INT, a1 STRING, c INT.

ALTER TABLE test_change CHANGE b b1 INT FIRST;
This renames column b to b1 and places it in the first position. The new table structure is: b1 INT, a STRING, c INT.

Note: changing a column only modifies Hive's metadata; it does not change the actual data. You should ensure that the metadata definition stays consistent with the actual data layout.

Add/replace columns

 
ALTER TABLE table_name ADD | REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)

ADD COLUMNS lets you add new columns after the existing columns but before the partition columns.

REPLACE COLUMNS removes the existing columns and then adds the new set of columns. This is possible only for tables with a native SerDe (DynamicSerDe or MetadataTypedColumnsetSerDe).
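Continuing with the test_change table above, a hypothetical sketch (the added column names are illustrative):

```sql
-- Append a new column after the existing ones (before any partition columns).
ALTER TABLE test_change ADD COLUMNS (d STRING COMMENT 'a new column');

-- Replace the entire column list with a new definition.
ALTER TABLE test_change REPLACE COLUMNS (a INT, b STRING);
```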

Alter table Properties

 
ALTER TABLE table_name SET TBLPROPERTIES table_properties

table_properties:
    (property_name = property_value, property_name = property_value, ...)

You can use this command to add metadata to a table. Currently, the last_modified_user and last_modified_time properties are managed automatically by Hive; users can add their own properties to the list. DESCRIBE EXTENDED table can be used to view this information.
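A hypothetical sketch (the property name and value are illustrative):

```sql
ALTER TABLE page_view SET TBLPROPERTIES ('notes' = 'loaded by nightly batch');

-- The property then appears in the table metadata shown by:
DESCRIBE EXTENDED page_view;
```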

Add SerDe properties

 
ALTER TABLE table_name SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties]
ALTER TABLE table_name SET SERDEPROPERTIES serde_properties

serde_properties:
    (property_name = property_value, property_name = property_value, ...)

This command allows users to add user-defined metadata to the table's SerDe object. Hive initializes the SerDe with these properties and passes them to the table's SerDe when serializing and deserializing data. In this way, users can store properties for a custom SerDe.
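A hypothetical sketch, assuming a table whose SerDe understands a field-delimiter property (the exact property name depends on the SerDe in use):

```sql
ALTER TABLE page_view SET SERDEPROPERTIES ('field.delim' = ',');
```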

Alter table file format and organization

 
ALTER TABLE table_name SET FILEFORMAT file_format

ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)]
    INTO num_buckets BUCKETS

These commands change the table's physical storage properties.
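A minimal sketch (table name, columns, and bucket count are illustrative):

```sql
ALTER TABLE page_view SET FILEFORMAT SEQUENCEFILE;

ALTER TABLE page_view CLUSTERED BY (userid) SORTED BY (viewtime) INTO 32 BUCKETS;
```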

Loading files into table

When data is loaded into a table, no transformation is applied to the data. The load operation simply copies/moves data files into the location corresponding to the Hive table.

Syntax:

 
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
    [PARTITION (partcol1 = val1, partcol2 = val2, ...)]

Synopsis:

The load operation is just a copy/move operation that moves the data files to the location corresponding to the Hive table.

  • filepath can be:

    • a relative path, for example: project/data1
    • an absolute path, for example: /user/hive/project/data1
    • a complete URI including the scheme, for example: hdfs://namenode:9000/user/hive/project/data1
  • The load target can be a table or a partition. If the table is partitioned, you must specify the partition by providing values for all of its partition columns.
  • filepath can refer to a file (in this case, Hive moves the file into the table's directory) or a directory (in this case, Hive moves all the files in the directory into the table's directory).
  • If LOCAL is specified:
    • The load command looks for filepath in the local file system. A relative path is interpreted relative to the user's current working directory. You can also give a complete URI for a local file, such as file:///user/hive/project/data1.
    • The load command copies the files in filepath to the target file system, which is determined by the table's location attribute; the copied files are then moved into the table's data location.
  • If the LOCAL keyword is not specified and filepath is a complete URI, Hive uses that URI directly. Otherwise:
    • If no schema or authority is specified, Hive uses the schema and authority defined in the Hadoop configuration; fs.default.name specifies the NameNode URI.
    • If the path is not absolute, Hive interprets it relative to /user/<username>.
    • Hive moves the files at filepath into the path of the table (or partition).
  • If the OVERWRITE keyword is used, the existing contents of the target table (or partition), if any, are deleted first, and then the contents of the file/directory pointed to by filepath are added to the table/partition.
  • If the target table (or partition) already contains a file whose name conflicts with a file in filepath, the existing file is replaced by the new one.
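For instance, the two common forms might look like this (paths, table, and partition values are illustrative, not from the original text):

```sql
-- Copy a local file into a partition, appending to its current contents.
LOAD DATA LOCAL INPATH '/tmp/pv_2017-08-08_us.txt'
INTO TABLE page_view PARTITION (dt = '2017-08-08', country = 'us');

-- Move an HDFS directory into a partition, replacing its current contents.
LOAD DATA INPATH '/staging/page_view/2017-08-09_us'
OVERWRITE INTO TABLE page_view PARTITION (dt = '2017-08-09', country = 'us');
```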
Select

Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]

    • A SELECT statement can be part of a union query or a subquery.
    • table_reference is the input of the query. It can be a regular table, a view, a join, or a subquery.
    • Simple query example: the following statement queries all columns from table t1.
 
SELECT * FROM t1

Where clause

where_condition is a boolean expression. For example, the following query returns only the sales records with an amount greater than 10 that belong to sales representatives in the US. Hive does not support IN, EXISTS, or subqueries in the WHERE clause.

 
SELECT * FROM sales WHERE amount > 10 AND region = "US"

All and distinct clauses

The ALL and DISTINCT options control how duplicate records are handled. The default is ALL, which returns all matching records; DISTINCT removes duplicate records from the result set.

 
hive> SELECT col1, col2 FROM t1
1 3
1 3
1 4
2 5
hive> SELECT DISTINCT col1, col2 FROM t1
1 3
1 4
2 5
hive> SELECT DISTINCT col1 FROM t1
1
2

Partition-Based Query

A general SELECT query scans the entire table (except for sampling queries). However, if a table was created with the PARTITIONED BY clause, a query can use the input-pruning feature and scan only the partitions it cares about. Hive's current implementation enables partition pruning only if the partition predicates appear in the WHERE clause closest to the FROM clause. For example, if the page_views table is partitioned by the date column, the following statement reads only partitions between 2017-03-01 and 2017-03-31.

 
SELECT page_views.* FROM page_views WHERE page_views.date >= '2017-03-01' AND page_views.date <= '2017-03-31';

Having clause

Hive currently does not support the HAVING clause, but a HAVING clause can be rewritten as a subquery. For example:

 
SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10

You can use the following query:

 
SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE t2.col2sum > 10

Limit clause

LIMIT restricts the number of records returned; the records are chosen arbitrarily. The following statement returns five arbitrary records from table t1:

 
SELECT * FROM t1 LIMIT 5

Top-k query: the following statement returns the five largest sales records by amount.

 
SET mapred.reduce.tasks = 1
SELECT * FROM sales SORT BY amount DESC LIMIT 5

RegEx column specification

A SELECT statement can use regular expressions to choose columns. The following statement queries all columns except ds and hr:

 
SELECT `(ds|hr)?+.+` FROM sales
Join

Syntax

 
join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT | RIGHT | FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition

table_reference:
    table_factor
  | join_table

table_factor:
    tbl_name [alias]
  | table_subquery alias
  | (table_references)

join_condition:
    ON equality_expression (AND equality_expression)*

equality_expression:
    expression = expression

Hive supports only equality joins, outer joins, and left semi joins. Hive does not support non-equality join conditions, because they are difficult to express as map/reduce jobs. Hive also supports joining more than two tables.

Note the following key points when writing a join query:
1. Only equality joins are supported. For example:

 
SELECT a.* FROM a JOIN b ON (a.id = b.id)
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)

are both valid, whereas:

SELECT a.* FROM a JOIN b ON (a.id <> b.id)

is invalid.

2. More than two tables can be joined in a single query, for example:

 
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

If every join clause in the query uses the same join key, the join is converted into a single map/reduce job. For example:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

is converted into a single map/reduce job, because only b.key1 is used as the join key.

 
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

This join is converted into two map/reduce jobs, because b.key1 is used in the first join condition and b.key2 in the second.

In each map/reduce stage of a join, the reducer buffers the records of every table except the last one in the join sequence, and streams the records of the last table through to produce the results, which are written to the file system. This implementation reduces memory usage on the reduce side. In practice, the largest table should be written last (otherwise a large amount of memory is wasted on buffering). For example:

 
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

uses a single map/reduce job because all tables are joined on the same key. On the reduce side, the records of tables a and b are buffered, and a join result is computed for each record of table c. Similarly, for:

 
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

two map/reduce jobs are used. The first buffers table a and streams table b; the second buffers the result of the first job and streams table c.

The LEFT, RIGHT, and FULL OUTER keywords control how unmatched join records are handled. For example:

 
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key)

Every record in table a produces an output record. The output is a.val, b.val when a.key = b.key, and a.val, NULL when there is no matching key in table b; records in table b with no matching a.key are dropped. The phrase "FROM a LEFT OUTER JOIN b" must be read as a whole: table a is to the left of table b, so all of table a's records are retained. Likewise, "a RIGHT OUTER JOIN b" retains all records of table b. OUTER JOIN semantics should follow the standard SQL spec.

Joins happen before the WHERE clause. To limit a join's output, the filtering condition should be written in the WHERE clause, or in the join clause itself. A confusing case arises with partitioned tables:

 
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key) WHERE a.ds = '2017-07-07' AND b.ds = '2017-07-07'

This left-outer-joins table a to table b, producing a.val and b.val; other columns can be used as filter conditions in the WHERE clause. However, as described above, when a record of table a has no match in table b, all of table b's columns are NULL, including the ds column. As a result, the WHERE clause filters out every record of table a that has no match on the join key in table b, and the LEFT OUTER semantics become irrelevant. The solution is to put the partition conditions in the ON clause when doing an outer join:

 
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key AND b.ds = '2017-07-07' AND a.ds = '2017-07-07')

With this query, the filtering happens in advance, during the join stage, so the problem above does not occur. The same logic applies to RIGHT and FULL joins.

Joins are not commutative: joins are left-associative, regardless of whether they are LEFT or RIGHT joins.

SELECT a.val1, a.val2, b.val, c.val FROM a JOIN b ON (a.key = b.key) LEFT OUTER JOIN c ON (a.key = c.key)

This first joins table a to table b, discarding every record whose join key has no match, and then joins the result with table c. There is a non-obvious consequence when a key exists in both table a and table c but not in table b: the whole record is dropped in the first join, a JOIN b (including a.val1, a.val2, and a.key), and then, when the result is joined with table c, if c.key equals a.key or b.key, the output is: NULL, NULL, NULL, c.val.

LEFT SEMI JOIN is a more efficient implementation of the IN/EXISTS subquery. Hive does not currently support IN/EXISTS subqueries, so LEFT SEMI JOIN can be used to rewrite such statements. Its restriction is that the right-hand table in the join may appear only in the ON clause; it cannot be filtered or referenced in the WHERE clause, the SELECT clause, or anywhere else.

 
SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);

It can be rewritten as follows:

 
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON (a.key = b.key)
