A Detailed Description of the Latest Hive Data Operations

The data manipulation capability of Hive is crucial to big data analysis. Data manipulation includes data exchange (moving), sorting, and transforming. Hive provides many query statements, keywords, operators, and methods for these operations.

I. Data Change
Data changes include LOAD, INSERT, IMPORT, and EXPORT.

1. LOAD DATA
The LOAD keyword is used to move data into Hive. If the data is loaded from HDFS, the source files are removed after a successful load; if the data is loaded from the local file system, the source files are kept.
Sample data: http://pan.baidu.com/s/1c0D9TpI
Example (the commands below are entered at the Hive shell prompt; the "hive>" prefix itself is not typed):
hive> CREATE TABLE IF NOT EXISTS employee_hr (name string, employee_id int, sin_number string, start_date timestamp) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;

Example:
hive> LOAD DATA LOCAL INPATH '/apps/ca/yanh/employee_hr.txt' OVERWRITE INTO TABLE employee_hr;

Note 1: the LOCAL keyword specifies that the data is loaded from the local machine; if it is removed, the path is read from HDFS by default. The OVERWRITE keyword replaces the table's existing data; without it, the data is appended.
Note 2: when loading data into a partitioned table, the partition columns must be specified.
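A minimal sketch of loading into a partition (the table employee_hr_p and its year partition column are hypothetical, not part of the original data set):
hive> CREATE TABLE IF NOT EXISTS employee_hr_p (name string, employee_id int) PARTITIONED BY (year string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/apps/ca/yanh/employee_hr.txt' OVERWRITE INTO TABLE employee_hr_p PARTITION (year = '2015'); // the PARTITION clause is required here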
2. INSERT
Like an RDBMS, Hive supports extracting data from other Hive tables and inserting it into a specified table with the INSERT keyword. INSERT is the most common operation for filling an existing table in Hive data processing. INSERT can be combined with OVERWRITE to overwrite existing data, and it supports multi-table inserts, dynamic partition inserts, and extracting data to HDFS or the local file system.
Example:
hive> CREATE TABLE ctas_employee AS SELECT * FROM employee;
hive> TRUNCATE TABLE employee; // deletes the data in employee but keeps the table structure
Example:
hive> INSERT INTO TABLE employee SELECT * FROM ctas_employee;

Note: the beeline tool shipped with Hive is used here to connect to the database so that the tables are displayed clearly.
Example: insert data from a CTE
hive> WITH a AS (SELECT * FROM ctas_employee) FROM a INSERT OVERWRITE TABLE employee SELECT *; // same effect as the previous example

Note: Hive supports CTEs since version 0.13.0.
Example: multi-table insert
hive> CREATE TABLE employee_internal LIKE employee;
hive> FROM ctas_employee INSERT OVERWRITE TABLE employee SELECT * INSERT OVERWRITE TABLE employee_internal SELECT *;
hive> SELECT * FROM employee_internal;
In addition to inserting static data into static partitions, Hive also supports dynamic partition inserts, where partition values (such as dates) are derived from the data itself.
Example: dynamic partition insert
Dynamic partitioning is disabled by default; enable it with SET hive.exec.dynamic.partition=true; By default, Hive requires at least one partition column to be static; lift this restriction with SET hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT INTO TABLE employee_partitioned PARTITION (year, month) SELECT name, array('Toronto') AS work_place, named_struct("sex", "Male", "age", 30) AS sex_age, map("Python", 90) AS skills_score, map("R&D", array('Developer')) AS depart_title, year(start_date) AS year, month(start_date) AS month FROM employee_hr eh WHERE eh.employee_id = 102;
Example:
hive> SELECT * FROM employee_partitioned;
Example: extract data to the local file system
(By default, columns are separated by ^A and rows by newlines.)
hive> INSERT OVERWRITE LOCAL DIRECTORY '/apps/ca' SELECT * FROM employee;

Note: only OVERWRITE can be used when extracting data from Hive to a directory; INTO is not supported.
Note: some Hadoop versions limit the directory depth to 2 levels; this can be fixed with SET hive.insert.into.multilevel.dirs=true;
Note: by default, Hive generates multiple output files according to the number of reducers; they can be merged with hdfs dfs -getmerge hdfs://<host>:<port>/user/output/directory <local_destination>
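For instance, a hedged sketch of the merge step (the namenode address and both paths are hypothetical):
$ hdfs dfs -getmerge hdfs://namenode:8020/user/output/directory /tmp/merged_output.txt  # concatenates all reducer output files into one local file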

Example: use a specific delimiter between fields
hive> INSERT OVERWRITE LOCAL DIRECTORY '/apps/ca/yanh/data' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' SELECT * FROM employee;

Example: Hive can also write output to multiple directories in a single statement
hive> FROM employee INSERT OVERWRITE LOCAL DIRECTORY '/apps/ca/yanh/data1' SELECT * INSERT OVERWRITE LOCAL DIRECTORY '/apps/ca/yanh/data2' SELECT *;

3. EXPORT and IMPORT
These two commands are used by Hive to migrate or back up data to and from HDFS. They are available since Hive 0.8.0. EXPORT exports both data and metadata to HDFS: the metadata is written to a file named _metadata, and the data is placed in a directory named data.

Example:
hive> EXPORT TABLE employee TO '/apps/ca/yanh/data';

Note: the output directory must not already exist.

Example: import the exported data into Hive
(Importing into an existing table raises an error.)
hive> IMPORT FROM '/apps/ca/yanh/data';

Example: import into a new table (it can also be an EXTERNAL table)
hive> IMPORT TABLE employee_imported FROM '/apps/ca/yanh/data';

Example: export and import a partitioned table
hive> EXPORT TABLE employee_partitioned PARTITION (year=2015, month=05) TO '/apps/ca/yanh/data1';

Example:
hive> IMPORT TABLE employee_partitioned_imported FROM '/apps/ca/yanh/data1';

II. Data Sorting
Data sorting mainly includes ORDER and SORT. Sorting is frequently used to produce ordered tables for subsequent operations such as top-N, maximum, and minimum. The main operations are ORDER BY (ASC|DESC), SORT BY (ASC|DESC), DISTRIBUTE BY, and CLUSTER BY.

1. ORDER BY (ASC|DESC)
Similar to ORDER BY in an RDBMS, this operation produces a globally sorted result, so only one reducer writes the output, and the job can take very long on large data sets. The LIMIT keyword can improve output efficiency. If hive.mapred.mode=strict is set, ORDER BY must be used together with LIMIT; in the default nonstrict mode, LIMIT is optional.
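A minimal sketch of the strict-mode behavior (using the employee table from earlier):
hive> SET hive.mapred.mode=strict;
hive> SELECT name FROM employee ORDER BY name; // rejected: strict mode requires a LIMIT with ORDER BY
hive> SELECT name FROM employee ORDER BY name LIMIT 5; // accepted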

Example: sort by name in descending order
(If the data volume is large, append LIMIT n to display only the first n rows.)
hive> SELECT name FROM employee ORDER BY name DESC;

2. SORT BY (ASC|DESC)
Unlike ORDER BY (ASC|DESC), SORT BY (ASC|DESC) only produces partially ordered results: each of the multiple reducer outputs is ordered, but not the result as a whole. To get a global order, set the number of reducers to 1 with SET mapred.reduce.tasks=1; the effect is then the same as ORDER BY (ASC|DESC). SORT BY sorts on the specified columns, and sorting can start before all data has left the mapper side (as soon as those columns have been transmitted).

Example:
hive> SET mapred.reduce.tasks=2;
hive> SELECT name FROM employee SORT BY name DESC; // with two reducers, the overall result is not in descending order



Example:
hive> SET mapred.reduce.tasks=1;
hive> SELECT name FROM employee SORT BY name DESC; // with one reducer, the result is the same as ORDER BY

3. DISTRIBUTE BY
This operation is similar to GROUP BY in an RDBMS: mapper output is routed to reducers based on the specified columns rather than by the default partitioner. Note: when used together with SORT BY, DISTRIBUTE BY must come before SORT BY, and the distributed column must appear among the selected columns (because of the nature of SORT BY).

Example: fails because employee_id is distributed on but not among the selected columns
hive> SELECT name FROM employee_hr DISTRIBUTE BY employee_id;

Example:
hive> SELECT name, employee_id FROM employee_hr DISTRIBUTE BY employee_id SORT BY name;

4. CLUSTER BY
CLUSTER BY is similar to a combination of DISTRIBUTE BY and SORT BY acting on the same column, but unlike ORDER BY it only sorts within each reducer rather than globally, and ASC and DESC are not supported. To implement a global sort, you can CLUSTER BY first and then ORDER BY.

Example:
hive> SELECT name, employee_id FROM employee_hr CLUSTER BY name;
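A hedged sketch of the CLUSTER BY-then-ORDER BY idea mentioned above (the subquery alias t is mine):
hive> SELECT name, employee_id FROM (SELECT name, employee_id FROM employee_hr CLUSTER BY name) t ORDER BY name;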

[Figure omitted: the difference between ORDER BY and CLUSTER BY.]

III. Data Manipulation Functions and Methods
For further data manipulation, Hive provides expressions, operators, and functions for transforming data; see the Hive wiki for the full reference. Hive also defines relational operators, arithmetic operators, logical operators, complex type constructors, and operators on complex types. Relational, arithmetic, and logical operators are similar to the standard operators in SQL/Java. Functions in Hive can be roughly divided into the following categories:

Mathematical functions: mainly used for mathematical computation, such as RAND() and E().
Collection functions: mainly used to query complex types for their size, keys, and values, such as SIZE(Array<T>).
Type conversion functions: mainly used to convert data types, such as CAST and BINARY.
Date functions: used to operate on dates, such as YEAR(string date) and MONTH(string date).
Conditional functions: return values selected by specific conditions, such as COALESCE, IF, and CASE WHEN.
String functions: mainly used for string-related operations, such as UPPER(string A) and TRIM(string A).
Aggregate functions: mainly used for data aggregation, such as SUM() and COUNT(*).
Table-generating functions: mainly used to convert a single input row into multiple output rows, such as EXPLODE(MAP) and JSON_TUPLE(jsonString, k1, k2, ...).
Custom functions: functions written in Java that extend Hive.
You can query Hive's built-in functions from the Hive CLI with the following statements:
SHOW FUNCTIONS; // list all Hive functions
DESCRIBE FUNCTION <function_name>; // brief function description
DESCRIBE FUNCTION EXTENDED <function_name>; // more details
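For instance, using the built-in UPPER function (any built-in function name works here):
hive> DESCRIBE FUNCTION upper;
hive> DESCRIBE FUNCTION EXTENDED upper;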
1. Complex data type functions
Tip: the SIZE function computes the size of a MAP, ARRAY, or nested MAP/ARRAY. It returns -1 if the size is unknown.
Example:
hive> SELECT work_place, skills_score, depart_title FROM employee;
Example:
hive> SELECT SIZE(work_place) AS array_size, SIZE(skills_score) AS map_size, SIZE(depart_title) AS complex_size, SIZE(depart_title["Product"]) AS nest_size FROM employee;
The ARRAY_CONTAINS function checks whether the specified column contains the specified value, returning TRUE or FALSE. The SORT_ARRAY function sorts an array in ascending order.
Example:
hive> SELECT ARRAY_CONTAINS(work_place, 'Toronto') AS is_Toronto, SORT_ARRAY(work_place) AS sorted_array FROM employee;
2. Date functions
Tip: FROM_UNIXTIME(UNIX_TIMESTAMP()) behaves like the SYSDATE function in Oracle: it dynamically returns the current time of the Hive server.
Example:
hive> SELECT FROM_UNIXTIME(UNIX_TIMESTAMP()) AS current_time FROM employee LIMIT 1;
TO_DATE truncates the obtained system time to the date part.
Example:
hive> SELECT TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())) AS current_date FROM employee LIMIT 1;
3. CASE with different data types
Tip: before Hive 0.13.0, the data types after THEN and ELSE had to be the same; otherwise an exception might be raised. This was fixed in 0.13.0.
Example: an exception caused by mismatched data types
hive> SELECT CASE WHEN 1 IS NULL THEN 'TRUE' ELSE 0 END AS case_result FROM employee LIMIT 1;
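A hedged fix on older versions is to give both branches the same type, for example by returning strings from both:
hive> SELECT CASE WHEN 1 IS NULL THEN 'TRUE' ELSE '0' END AS case_result FROM employee LIMIT 1; // both branches are now strings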
4. Parsing and searching
LATERAL VIEW generates a user-defined table that shows the values of a map or array in expanded form, like EXPLODE(), but it omits rows whose source column is NULL. To keep those rows, use LATERAL VIEW OUTER (Hive 0.12.0 and later).
Example:
hive> INSERT INTO TABLE employee SELECT 'Steven' AS name, array(null) AS work_place, named_struct("sex", "Male", "age", 30) AS sex_age, map("Python", 90) AS skills_score, map("R&D", array('Developer')) AS depart_title FROM employee LIMIT 1;
hive> SELECT name, work_place, skills_score FROM employee;
Example:
hive> SELECT name, workplace, skills, score FROM employee LATERAL VIEW explode(work_place) wp AS workplace LATERAL VIEW explode(skills_score) ss AS skills, score;
Example:
hive> SELECT name, workplace, skills, score FROM employee LATERAL VIEW OUTER explode(work_place) wp AS workplace LATERAL VIEW explode(skills_score) ss AS skills, score;
REVERSE reverses the specified string, and SPLIT splits a string on the specified separator.
Example:
hive> SELECT reverse(split(reverse('/apps/ca/yanh/employee.txt'), '/')[0]) AS linux_file_name FROM employee LIMIT 1;
While EXPLODE turns each element into a separate row, COLLECT_SET and COLLECT_LIST do the opposite and combine elements into a single collection. The difference between them is that the set returned by COLLECT_SET contains no duplicate elements, whereas COLLECT_LIST can contain duplicates.
Example:
hive> SELECT collect_set(work_place[0]) AS flat_workplace FROM employee;
Example:
hive> SELECT collect_list(work_place[0]) AS flat_workplace FROM employee;

Note: Hive 0.11.0 and earlier do not support collect_list.
5. Virtual columns
Virtual columns are a special kind of function column in Hive. Currently, Hive supports two virtual columns: INPUT__FILE__NAME and BLOCK__OFFSET__INSIDE__FILE (note the double underscores). INPUT__FILE__NAME is the input file name for a mapper task, and BLOCK__OFFSET__INSIDE__FILE is the current global file position, or the block offset in the current compressed file.
Example:
hive> SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE AS offside FROM employee_internal;

Note: this test failed on Hive 0.13.0; the function was reported as not existing.
6. Functions not mentioned in the wiki:
Example: ISNULL, checks whether a value is null
hive> SELECT work_place, isnull(work_place) is_null, isnotnull(work_place) is_not_null FROM employee;
Example: assert_true, throws an exception if the condition is false
hive> SELECT assert_true(work_place IS NULL) FROM employee;
Example: ELT, returns the nth string
hive> SELECT elt(2, 'New York', 'Beijing', 'Toronto') FROM employee LIMIT 1;
Example: current_database, returns the current database name
hive> SELECT current_database();

Note: Hive 0.11.0 and earlier do not have this function.
IV. Data Transactions
Before Hive 0.13.0, Hive did not support row-level transactions, so rows could not be updated, inserted, or deleted individually; data could only be rewritten at the table or partition level, which made concurrent reads/writes and data cleansing difficult for Hive. Starting from 0.13.0, Hive provides row-level transaction processing with atomicity, consistency, isolation, and durability (ACID). Currently, transactional operations only support data in ORC files (Optimized Row Columnar, available since Hive 0.11.0) and in bucketed tables.
You need to configure the following parameters to enable Hive transactions:
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;
Show transactions:
hive> SHOW TRANSACTIONS;
From Hive 0.14.0, row-level INSERT ... VALUES, UPDATE, and DELETE can be used with the following syntax rules:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...];
UPDATE tablename SET column=value [, column=value ...] [WHERE expression];
DELETE FROM tablename [WHERE expression];
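A minimal sketch under these rules (the employee_trans table and its schema are hypothetical; a transactional table must be bucketed, stored as ORC, and created with the transactional property):
hive> CREATE TABLE employee_trans (employee_id int, name string) CLUSTERED BY (employee_id) INTO 2 BUCKETS STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> INSERT INTO TABLE employee_trans VALUES (100, 'Michael'), (101, 'Will');
hive> UPDATE employee_trans SET name = 'Lucy' WHERE employee_id = 101; // the bucketing column itself cannot be updated
hive> DELETE FROM employee_trans WHERE employee_id = 100;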
Conclusion
The above covers Hive's concrete data operations; with them, everyday Hive data manipulation should now be straightforward. All of the examples above can be tested by yourself. The test environment is Hive 0.11.0; features from newer Hive versions were tested under those versions, as noted in the text.
