I. Semicolons in Hive
A semicolon marks the end of a SQL statement, and HiveQL uses it the same way. However, HiveQL's semicolon recognition is not very smart. For example:
select concat(cookie_id, concat(';', 'zoo')) from c02_clickstat_fatdt1 limit 2;
FAILED: Parse Error: line 0:-1 cannot recognize input '<EOF>' in function specification
The inference is that when Hive parses a statement, any semicolon is treated as the end of the statement, even when it appears inside quotation marks.
The workaround is to escape the semicolon with its octal ASCII code; the statement above should then be written as:
select concat(cookie_id, concat('\073', 'zoo')) from c02_clickstat_fatdt1 limit 2;
Why an octal ASCII code?
I tried the hexadecimal ASCII code, but Hive treated it as a plain string and did not unescape it. Apparently only octal escapes are supported; the reason is unknown. This rule also applies to statements other than SELECT. For example, if you need to define a separator in CREATE TABLE, a non-printable character must be escaped with its octal ASCII code.
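To illustrate the CREATE TABLE case, here is a minimal sketch (the table name t_demo and its columns are hypothetical) that uses the non-printable Ctrl-A character, written as the octal escape '\001', as the field delimiter:

```sql
-- Hypothetical table: '\001' is the octal escape for the
-- non-printable Ctrl-A character used as the field delimiter.
create table t_demo (
  id int,
  name string
)
row format delimited
fields terminated by '\001';
```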
II. Appending new data
Hive's INSERT syntax requires the OVERWRITE keyword, so every insert overwrites the target table. How, then, can we append data to a table?
Suppose Hive contains a table manbu:
hive> describe manbu;
id      int
value   int
hive> select * from manbu;
3 4
1 2
2 3
To append one record:
hive> insert overwrite table manbu
select id, value from (
  select id, value from manbu
  union all
  select 4 as id, 5 as value from manbu limit 1
) u;
The result is:
hive> select * from manbu;
3 4
4 5
2 3
1 2
The key is the UNION ALL keyword: it combines the original data with the new record, and the combined set then overwrites the table.
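For reference, Hive 0.8.0 and later support INSERT INTO, which appends without the UNION ALL workaround. Assuming a sufficiently recent Hive, the same record could be appended with:

```sql
-- Requires Hive 0.8.0 or later; appends rather than overwrites.
insert into table manbu
select 4 as id, 5 as value from manbu limit 1;
```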
Iii. Initial Values
When inserting data in the insert overwrite table, the initial values of the following fields should be consistent with those in the table definition. For example, when a string type field is initially NULL:
null as field_name                   -- wrong: Hive infers the type of a bare NULL as void, not string
cast(null as string) as field_name   -- correct
Similarly, to initialize a field of type bigint to 0:
cast(0 as bigint) as field_name
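Putting the two rules together, here is a minimal sketch of a full insert (the target table t_target and its columns name string / cnt bigint are hypothetical):

```sql
-- Hypothetical target table: t_target(name string, cnt bigint)
insert overwrite table t_target
select
  cast(null as string) as name,  -- typed NULL matches the string column
  cast(0 as bigint)    as cnt    -- typed literal matches the bigint column
from manbu limit 1;
```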
IV. ORDER BY / SORT BY / DISTRIBUTE BY optimization
Hive's sorting keyword is SORT BY, deliberately named differently from the ORDER BY of traditional databases to emphasize the difference between the two: SORT BY sorts only within a single reducer (a single machine), not globally.
For example:
set mapred.reduce.tasks = 2;  -- set the number of reducers to 2
Original values:
1. select cookie_id, page_id, id from c02_clickstat_fatdt1
where cookie_id in ('1.193.131.218.1288611279693.0', '1.193.148.164.1288609861509.2');
1.193.148.164.1288609861509.2 113181412886099008861288609901078194082403 684000005
1.193.148.164.1288609861509.2 127001128860563972141288609859828580660473 684000015
1.193.148.164.1288609861509.2 113181412886099165721288609915890452725326 684000018
1.193.131.218.1288611279693.0 01c183da6e4bc50712881288611540109914561053 684000114
1.193.131.218.1288611279693.0 01c183da6e4bc22412881288611414343558272134 684000118
1.193.131.218.1288611279693.0 01c183da6e4bc50712881288611511781996667988 684000121
1.193.131.218.1288611279693.0 01c183da6e4bc22412881288611523640691739999 684000126
1.193.131.218.1288611279693.0 01c183da6e4bc50712881288611540109914561053 684000128
2. select cookie_id, page_id, id from c02_clickstat_fatdt1 where
cookie_id in ('1.193.131.218.1288611279693.0', '1.193.148.164.1288609861509.2')
sort by cookie_id, page_id;
Values after SORT BY:
1.193.131.218.1288611279693.0 684000118 01c183da6e4bc22412881288611414343558272134 684000118
1.193.131.218.1288611279693.0 684000114 01c183da6e4bc50712881288611540109914561053 684000114
1.193.131.218.1288611279693.0 684000128 01c183da6e4bc50712881288611540109914561053 684000128
1.193.148.164.1288609861509.2 684000005 113181412886099008861288609901078194082403 684000005
1.193.148.164.1288609861509.2 684000018 113181412886099165721288609915890452725326 684000018
1.193.131.218.1288611279693.0 684000126 01c183da6e4bc22412881288611523640691739999 684000126
1.193.131.218.1288611279693.0 684000121 01c183da6e4bc50712881288611511781996667988 684000121
1.193.148.164.1288609861509.2 684000015 127001128860563972141288609859828580660473 684000015
select cookie_id, page_id, id from c02_clickstat_fatdt1
where cookie_id in ('1.193.131.218.1288611279693.0', '1.193.148.164.1288609861509.2')
order by page_id, cookie_id;
1.193.131.218.1288611279693.0 684000118 01c183da6e4bc22412881288611414343558272134 684000118
1.193.131.218.1288611279693.0 684000126 01c183da6e4bc22412881288611523640691739999 684000126
1.193.131.218.1288611279693.0 684000121 01c183da6e4bc50712881288611511781996667988 684000121
1.193.131.218.1288611279693.0 684000114 01c183da6e4bc50712881288611540109914561053 684000114
1.193.131.218.1288611279693.0 684000128 01c183da6e4bc50712881288611540109914561053 684000128
1.193.148.164.1288609861509.2 684000005 113181412886099008861288609901078194082403 684000005
1.193.148.164.1288609861509.2 684000018 113181412886099165721288609915890452725326 684000018
1.193.148.164.1288609861509.2 684000015 127001128860563972141288609859828580660473 684000015
As you can see, SORT BY and ORDER BY produce different results. Two reducers were specified, so the data was split between them and sorted separately. The results differ mainly because the query has no reduce key: Hive generates a random number as the reduce key, so input records are distributed to the reducers at random. To guarantee that no cookie_id is split across reducers, use the DISTRIBUTE BY keyword to make cookie_id the distribution key.
select cookie_id, country, id, page_id, id from c02_clickstat_fatdt1 where cookie_id in ('1.taobao', '1.taobao') distribute by cookie_id sort by cookie_id, page_id;
1.193.131.218.1288611279693.0 684000118 01c183da6e4bc22412881288611414343558272134 684000118
1.193.131.218.1288611279693.0 684000126 01c183da6e4bc22412881288611523640691739999 684000126
1.193.131.218.1288611279693.0 684000121 01c183da6e4bc50712881288611511781996667988 684000121
1.193.131.218.1288611279693.0 684000114 01c183da6e4bc50712881288611540109914561053 684000114
1.193.131.218.1288611279693.0 684000128 01c183da6e4bc50712881288611540109914561053 684000128
1.193.148.164.1288609861509.2 684000005 113181412886099008861288609901078194082403 684000005
1.193.148.164.1288609861509.2 684000018 113181412886099165721288609915890452725326 684000018
1.193.148.164.1288609861509.2 684000015 127001128860563972141288609859828580660473 684000015
Example 2:
create table if not exists t_order (
  id int,           -- number
  sale_id int,      -- sid
  customer_id int,  -- cid
  product_id int,   -- pid
  amount int        -- quantity
) partitioned by (ds string);
Query all records in the table, sorted by sale_id and amount:
set mapred.reduce.tasks = 2;
select sale_id, amount from t_order
sort by sale_id, amount;
This query may return results in an unexpected order. With two reducers specified, the data might be distributed (and sorted separately) as follows:
Reducer 1:
sale_id | amount
0 | 100
1 | 30
1 | 50
2 | 20
Reducer 2:
sale_id | amount
0 | 110
0 | 120
3 | 50
4 | 20
Use the DISTRIBUTE BY keyword to make sale_id the distribution key. The revised HQL is:
set mapred.reduce.tasks = 2;
select sale_id, amount from t_order
distribute by sale_id
sort by sale_id, amount;
This guarantees that within each sale_id the amounts are correctly sorted, but sale_id itself is still not globally ordered, because Hive distributes the data using Hadoop's default HashPartitioner.
This raises the problem of total ordering. There are two solutions:
1) Do not distribute the data; use a single reducer:
set mapred.reduce.tasks = 1;
The drawback of this method is that the reduce side becomes a performance bottleneck, and with large data volumes the results are often unusable. In practice, however, this is still the most common approach, because sorting queries are usually used to fetch the top-ranked results, so a LIMIT clause can drastically reduce the data volume: with LIMIT N, each mapper passes at most N records to the single reducer, i.e. N * (number of mappers) records in total.
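For example, to fetch only the 100 largest amounts, a single reducer is usually acceptable (a sketch; the threshold of 100 is arbitrary):

```sql
-- A single reducer is fine here: limit 100 caps the data each map sends to reduce.
set mapred.reduce.tasks = 1;
select sale_id, amount from t_order
order by amount desc
limit 100;
```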
2) Modify the partitioner. This approach achieves a true total sort. We can use Hadoop's built-in TotalOrderPartitioner (contributed by Yahoo!), a partitioner that distributes data across reducers in order. It requires a file in SequenceFile format that specifies the data distribution ranges. Assuming this file has already been generated (stored in /tmp/range_key_list and split for 100 reducers), the query above can be rewritten as:
set mapred.reduce.tasks = 100;
set hive.mapred.partitioner = org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set total.order.partitioner.path = /tmp/range_key_list;
select sale_id, amount from t_order
cluster by sale_id
sort by amount;
There are many ways to generate this range file (for example, Hadoop's bundled o.a.h.mapreduce.lib.partition.InputSampler tool). Here we show how to generate it with Hive itself, using a table t_sale ordered by id:
create table if not exists t_sale (
  id int,
  name string,
  loc string
);
The range file distributed by sale_id is generated as follows:
create external table range_keys (sale_id int)
row format serde
'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
stored as
inputformat
'org.apache.hadoop.mapred.TextInputFormat'
outputformat
'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
location '/tmp/range_key_list';

insert overwrite table range_keys
select distinct sale_id
from t_sale tablesample (bucket 100 out of 100 on rand()) s
sort by sale_id;
The generated file (under /tmp/range_key_list) enables TotalOrderPartitioner to distribute the reduce-side data in order by sale_id.
The main concern when building the range file is the balance of the data distribution, which requires a deep understanding of the data.