Hive Data Operations in Detail

The ability to manipulate data is central to big data analysis. Data operations mainly include: exchanging, moving, sorting, and transforming data. Hive provides a variety of query statements, keywords, operators, and functions for these operations.
I. Data exchange. Data exchange mainly involves: LOAD, INSERT, IMPORT, and EXPORT.
1. LOAD. The LOAD keyword is used to move data into Hive. If the data is loaded from HDFS, the source files are removed (moved) when the load succeeds; if it is loaded from the local file system, the source files are left in place.
Data: employee_hr.txt http://pan.baidu.com/s/1c0D9TpI
Cases:
hive> (the "hive>" prompt means the command is entered in the Hive shell; same below)
CREATE TABLE IF NOT EXISTS employee_hr (name string, employee_id int, sin_number string, start_date timestamp) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
Example: hive> LOAD DATA LOCAL INPATH '/apps/ca/yanh/employee_hr.txt' OVERWRITE INTO TABLE employee_hr;

Note 1: The LOCAL keyword specifies that the data is loaded from the local file system; if it is removed, data is loaded from HDFS by default. The OVERWRITE keyword makes the load replace any existing data; without it, the data is appended.
Note 2: If the data is loaded into a partitioned table, the partition column values must be specified, as sketched below.
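A minimal sketch of Note 2, assuming a table partitioned by (year, month); the table name, partition values, and file path are illustrative:
hive> LOAD DATA LOCAL INPATH '/apps/ca/yanh/employee_hr.txt' OVERWRITE INTO TABLE employee_partitioned PARTITION (year=2015, month=5);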
2. INSERT. Like an RDBMS, Hive supports inserting data selected from other Hive tables into a target table, using the INSERT keyword. INSERT is the most commonly used Hive operation for populating a table from an existing dataset. In Hive, INSERT can be combined with OVERWRITE for overwriting inserts, and used for multi-table inserts, dynamic-partition inserts, and extracting data to HDFS or the local file system.
Example: hive> CREATE TABLE ctas_employee AS SELECT * FROM employee;
hive> TRUNCATE TABLE employee; -- deletes the data in employee but preserves the table structure
Example:
hive> INSERT INTO TABLE employee SELECT * FROM ctas_employee; Note: here the connection uses Beeline, the CLI tool provided with Hive, so that the result table is displayed clearly.
Example: inserting data from a CTE: hive> WITH a AS (SELECT * FROM ctas_employee) FROM a INSERT OVERWRITE TABLE employee SELECT *; The effect is the same as in the previous example. Note: Hive supports CTEs starting with version 0.13.0.
Example: multi-table insert: hive> CREATE TABLE employee_internal LIKE employee;
hive> FROM ctas_employee INSERT OVERWRITE TABLE employee SELECT * INSERT OVERWRITE TABLE employee_internal SELECT *;
hive> SELECT * FROM employee_internal;
Besides inserting static data into static partitions, Hive also supports dynamic-partition inserts, where partition values (such as dates) are computed from the data.
Example: dynamic partitioning is turned off by default; it can be enabled with: SET hive.exec.dynamic.partition=true; By default Hive requires at least one partition column to be static; this restriction can be lifted with: SET hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT INTO TABLE employee_partitioned PARTITION (year, month) SELECT name, array('Toronto') AS work_place, named_struct('sex','Male','age',30) AS sex_age, map('python',90) AS skills_score, map('r&d', array('Developer')) AS depart_title, year(start_date) AS year, month(start_date) AS month FROM employee_hr eh WHERE eh.employee_id = 102;
Example:
hive> SELECT * FROM employee_partitioned;
Example: extracting data to the local file system (by default, ^A separates columns and a newline separates rows). Note: Hive extraction can only use OVERWRITE, not INTO. Note: in some Hadoop versions the directory depth is only supported to 2 levels; this can be fixed with: SET hive.insert.into.multilevel.dirs=true;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/apps/ca' SELECT * FROM employee;
Note: by default Hive generates as many output files as there are reducers; they can be merged with: hdfs dfs -getmerge hdfs://<host>:<port>/user/output/directory <local_file>
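Extracting to an HDFS directory works the same way, just without the LOCAL keyword; a minimal sketch (the target path is illustrative):
hive> INSERT OVERWRITE DIRECTORY '/apps/ca/yanh/hdfs_data' SELECT * FROM employee;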

Example: using a specific delimiter to separate columns: hive> INSERT OVERWRITE LOCAL DIRECTORY '/apps/ca/yanh/data' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' SELECT * FROM employee;

Example: Hive can also write output files to multiple directories in one pass: hive> FROM employee INSERT OVERWRITE LOCAL DIRECTORY '/apps/ca/yanh/data1' SELECT * INSERT OVERWRITE LOCAL DIRECTORY '/apps/ca/yanh/data2' SELECT *;

3. EXPORT and IMPORT. These two commands are used by Hive for data migration or backup via HDFS and are available from Hive 0.8.0. EXPORT exports both data and metadata to HDFS; the metadata is written to a file named _metadata, and the data is placed in a subdirectory named data.

Example: hive> EXPORT TABLE employee TO '/apps/ca/yanh/data';

Note: the target directory must not already exist.
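If the directory was left behind by an earlier run, a minimal sketch of removing it first (path as in the example above):
hdfs dfs -rm -r /apps/ca/yanh/data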

Example: importing the exported data into Hive (importing into an existing table reports an error): hive> IMPORT FROM '/apps/ca/yanh/data';

Example: importing into a new table (the target can also be an external table, as sketched below): hive> IMPORT TABLE employee_imported FROM '/apps/ca/yanh/data';
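A minimal sketch of the external-table variant (the table name and location are illustrative):
hive> IMPORT EXTERNAL TABLE employee_imported_ext FROM '/apps/ca/yanh/data' LOCATION '/apps/ca/yanh/external_data';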

Example: exporting and importing a partitioned table: hive> EXPORT TABLE employee_partitioned PARTITION (year=2015, month=05) TO '/apps/ca/yanh/data1';

Example: hive> IMPORT TABLE employee_partitioned_imported FROM '/apps/ca/yanh/data1';

II. Data sorting. Data sorting mainly includes ordering and sorting operations, frequently used to produce sorted tables for subsequent work such as top-N, maximum, and minimum queries. The main operations are ORDER BY (ASC|DESC), SORT BY (ASC|DESC), DISTRIBUTE BY, and CLUSTER BY.

1. ORDER BY (ASC|DESC). Similar to the RDBMS ORDER BY operation, it produces a globally sorted result, so only one reducer handles the output; with a large amount of data the process is therefore very slow. In that case the LIMIT keyword can be added to improve efficiency. If Hive is configured with hive.mapred.mode = strict, ORDER BY must be used together with LIMIT (in the default nonstrict mode, LIMIT is optional).

Example: sorting by name in descending order (if the amount of data is large, LIMIT n can be appended to show just the first n rows): hive> SELECT name FROM employee ORDER BY name DESC;
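A minimal sketch of the strict-mode rule mentioned above (the LIMIT value is illustrative):
hive> SET hive.mapred.mode=strict;
hive> SELECT name FROM employee ORDER BY name DESC LIMIT 5;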

2. SORT BY (ASC|DESC). Unlike ORDER BY (ASC|DESC), SORT BY (ASC|DESC) outputs only locally sorted results: with multiple reducers, each reducer's output is sorted on its own. To get a globally ordered output, use SET mapred.reduce.tasks=1 to force a single reducer, in which case the effect is the same as ORDER BY (ASC|DESC). SORT BY sorts by the specified columns, and the sorting completes before the data leaves the mapper side (provided the column is among those transferred).

Example: hive> SET mapred.reduce.tasks=2; SELECT name FROM employee SORT BY name DESC; With 2 reducers, the results are visibly not arranged from largest to smallest overall.



Example: hive> SET mapred.reduce.tasks=1; SELECT name FROM employee SORT BY name DESC; With 1 reducer, the result is the same as with ORDER BY.

3. DISTRIBUTE BY. This operation is similar to GROUP BY in an RDBMS in that it sends the mapper output to reducers grouped by the specified columns, rather than distributing the data according to partitions. Note: when used with SORT BY, the DISTRIBUTE BY column must appear in the selected columns (because of how SORT BY works), and DISTRIBUTE BY must come before SORT BY.

Example: omitting employee_id from the select list causes an error: hive> SELECT name FROM employee_hr DISTRIBUTE BY employee_id;

Example: hive> SELECT name, employee_id FROM employee_hr DISTRIBUTE BY employee_id SORT BY name;

4. CLUSTER BY. CLUSTER BY is a combination of DISTRIBUTE BY and SORT BY acting on the same column. Unlike ORDER BY, it sorts only within each reducer, not globally, and ASC and DESC are not supported. To achieve a global sort, one can first CLUSTER BY and then ORDER BY.

Example: hive> SELECT name, employee_id FROM employee_hr CLUSTER BY name;
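A minimal sketch of the CLUSTER BY-then-ORDER BY pattern for a global sort (the subquery alias t is illustrative):
hive> SELECT name FROM (SELECT name FROM employee_hr CLUSTER BY name) t ORDER BY name;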



[Figure: the difference between ORDER BY and CLUSTER BY output]



III. Expressions, operators, and functions. To transform data further, we can apply Hive expressions, operators, and functions. Each expression and function is documented on the Hive wiki: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF. Hive also defines relational operators, arithmetic operators, logical operators, complex type constructors, and complex type operators; the relational, arithmetic, and logical operators behave much like the standard operators in SQL/Java. Hive's functions can be roughly divided into the following categories:
Mathematical functions: mainly used for mathematical calculations, such as rand() and e().
Collection functions: mainly used to query the size, keys, values, and so on of complex types, such as size(array).
Type conversion functions: mainly used to convert data types, such as cast and binary.
Date functions: used for date-related operations, such as year(string date) and month(string date).
Conditional functions: used to return values selected by specific conditions, such as coalesce, if, and CASE WHEN.
String functions: mainly used for string-related operations, such as upper(string a) and trim(string a).
Aggregate functions: mainly used for data aggregation, such as sum() and count(*).
Table-generating functions: mainly used to convert a single input row into multiple output rows, such as explode(map) and json_tuple(jsonString, k1, k2, ...).
Custom functions: functions written in Java that extend Hive's capabilities as user-defined functions.
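As a quick illustration of several of these categories together, here is a minimal sketch against the employee_hr table created earlier (the combination of functions is illustrative; results depend on the loaded data):
hive> SELECT upper(name) AS upper_name, year(start_date) AS start_year, cast(employee_id AS string) AS id_str, coalesce(sin_number, 'unknown') AS sin FROM employee_hr LIMIT 3;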
The following statements can be used in the Hive CLI to look up functions:
SHOW FUNCTIONS; -- lists all Hive functions
DESCRIBE FUNCTION <function_name>; -- describes a function briefly
DESCRIBE FUNCTION EXTENDED <function_name>; -- shows more detail
Detailed examples:
1. Complex type functions. Hint: the size function computes the size of a map, array, or nested map/array, and returns -1 if the size is unknown.
Example: hive> SELECT work_place, skills_score, depart_title FROM employee;
Example: hive> SELECT size(work_place) AS array_size, size(skills_score) AS map_size, size(depart_title) AS complex_size, size(depart_title['Product']) AS nest_size FROM employee;
The array_contains function returns TRUE or FALSE to indicate whether the specified column contains the specified value. The sort_array function sorts an array in ascending order.
Example: hive> SELECT array_contains(work_place, 'Toronto') AS is_toronto, sort_array(work_place) AS sorted_array FROM employee;
2. Date functions. Hint: from_unixtime(unix_timestamp()) behaves like the SYSDATE function in Oracle: it dynamically returns the current time of the Hive server.
Example: hive> SELECT from_unixtime(unix_timestamp()) AS current_time FROM employee LIMIT 1;
to_date is used to truncate the obtained system time to the date portion.
Example: hive> SELECT to_date(from_unixtime(unix_timestamp())) AS current_date FROM employee LIMIT 1;
3. CASE with different data types. Before Hive 0.13.0, the data types after THEN (or ELSE) had to be the same; otherwise an exceptional value could be produced (that is, the data type after ELSE had to match the one after THEN). This issue was fixed in 0.13.0.
Example: because the data types differ, an exception is returned: hive> SELECT CASE WHEN 1 IS NULL THEN 'TRUE' ELSE 0 END AS case_result FROM employee LIMIT 1;
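A minimal sketch of the corrected form, with matching types in both branches:
hive> SELECT CASE WHEN 1 IS NULL THEN 'TRUE' ELSE 'FALSE' END AS case_result FROM employee LIMIT 1;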
4. Parsing and searching. LATERAL VIEW is used to generate a user-defined table that presents the values of a map or array in expanded form, like explode(); however, it omits rows whose exploded column is NULL. To keep those rows, use LATERAL VIEW OUTER (available from Hive 0.12.0).
Example: hive> INSERT INTO TABLE employee SELECT 'Steven' AS name, array(null) AS work_place, named_struct('sex','Male','age',30) AS sex_age, map('python',90) AS skills_score, map('r&d', array('Developer')) AS depart_title FROM employee LIMIT 1;
hive> SELECT name, work_place, skills_score FROM employee;
Example: hive> SELECT name, workplace, skills, score FROM employee LATERAL VIEW explode(work_place) wp AS workplace LATERAL VIEW explode(skills_score) ss AS skills, score;
Example: hive> SELECT name, workplace, skills, score FROM employee LATERAL VIEW OUTER explode(work_place) wp AS workplace LATERAL VIEW explode(skills_score) ss AS skills, score;
reverse is used to reverse a specified string, and split separates a string on the specified delimiter.
Example: hive> SELECT reverse(split(reverse('/apps/ca/yanh/employee.txt'), '/')[0]) AS linux_file_name FROM employee LIMIT 1; This extracts the file name from a path.
While explode turns a collection into separate output rows, collect_set and collect_list do the opposite and combine elements into a collection for output. The difference between them is that the collection returned by collect_set contains no duplicate elements, whereas collect_list can contain duplicates.
Example: hive> SELECT collect_set(work_place[0]) AS flat_workplace FROM employee;
Example: hive> SELECT collect_list(work_place[0]) AS flat_workplace FROM employee; Note: Hive 0.11.0 and earlier do not support collect_list.
5. Virtual columns. Virtual columns are special columns with special functions in Hive. Hive supports just 2 virtual columns so far: INPUT__FILE__NAME and BLOCK__OFFSET__INSIDE__FILE (note the double underscores). INPUT__FILE__NAME is the name of the mapper's input file, and BLOCK__OFFSET__INSIDE__FILE is the current global file position, or the block offset within the current compressed file.
Example: hive> SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE AS offside FROM employee_internal; Note: this test failed on Hive 0.13.0, which reported no such function.
6. Functions not mentioned on the wiki:
Example: isnull, used to check whether a value is null: hive> SELECT work_place, isnull(work_place) AS is_null, isnotnull(work_place) AS is_not_null FROM employee;
Example: assert_true, which throws an exception if the condition is false: hive> SELECT assert_true(work_place IS NULL) FROM employee;
Example: elt, which returns the n-th string: hive> SELECT elt(2, 'New York', 'Beijing', 'Toronto') FROM employee LIMIT 1;
Example: current_database, which returns the current database name: hive> SELECT current_database(); Note: Hive 0.11.0 and earlier do not have this function.
IV. Transactions. Before Hive 0.13.0, row-level data changes were not supported, so updating, inserting, and deleting individual data rows was not possible; data could only be rewritten at the granularity of whole tables or partitions, which made it difficult for Hive to handle concurrent reads and writes and data cleansing. Starting with 0.13.0, however, Hive provides row-level data processing with atomicity, consistency, isolation, and durability (ACID). At present, all transactional operations are supported only on ORC (Optimized Row Columnar, available from Hive 0.11.0) files and bucketed tables.
The following configuration parameters must be set appropriately to turn on Hive's transaction feature:
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;
SHOW TRANSACTIONS queries the currently open transactions: hive> SHOW TRANSACTIONS;
Starting with Hive 0.14.0, row-level inserts, updates, and deletes can be performed with the following syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...];
UPDATE tablename SET column = value [, column = value ...] [WHERE expression];
DELETE FROM tablename [WHERE expression];
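A minimal end-to-end sketch under the configuration above (the table name and values are illustrative; the table must be bucketed, stored as ORC, and marked transactional):
hive> CREATE TABLE employee_acid (employee_id int, name string) CLUSTERED BY (employee_id) INTO 2 BUCKETS STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> INSERT INTO TABLE employee_acid VALUES (100, 'Michael'), (101, 'Will');
hive> UPDATE employee_acid SET name = 'William' WHERE employee_id = 101;
hive> DELETE FROM employee_acid WHERE employee_id = 100;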

Conclusion: the above covers Hive's concrete data operations in full; with them, routine Hive data work should go much more smoothly. All of the screenshotted use cases above were tested by me and work. The test environment was Hive 0.11.0; the features specific to Hive 0.13.0 were tested under 0.13.0, as noted in the corresponding screenshots.


