Hive separator handling and order by / sort by / distribute by optimization


I. Hive semicolons

A semicolon marks the end of a SQL statement, and HiveQL uses it the same way. However, Hive's semicolon handling is not very intelligent. For example:

select concat(cookie_id, concat(';', 'zoo')) from c02_clickstat_fatdt1 limit 2;

Failed: Parse error: line 0:-1 cannot recognize input '<EOF>' in Function Specification

It appears that when Hive parses a statement, a semicolon is treated as the end of the statement even when it appears inside quotation marks.

The workaround is to escape the semicolon with its octal ASCII code, so the statement above should be written as:

select concat(cookie_id, concat('\073', 'zoo')) from c02_clickstat_fatdt1 limit 2;

Why an octal ASCII code?

I tried a hexadecimal ASCII code, but Hive treated it as a plain string and did not escape it; apparently only octal escapes are supported, for reasons unknown. This rule also applies to statements other than select. For example, if you need to define a separator in create table, non-printable characters must be escaped with their octal ASCII codes, as shown below.
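As a minimal sketch, assuming a hypothetical log table (demo_click_log and its columns are made up for illustration) whose fields are separated by the non-printable Ctrl-A character, the octal escape would be used like this:

create table if not exists demo_click_log (
  cookie_id string,
  page_id   string
)
row format delimited
fields terminated by '\001'   -- octal escape for the non-printable Ctrl-A separator
lines terminated by '\n'
stored as textfile;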

II. Inserting new data

Hive's insert syntax requires the overwrite keyword, so every insert replaces the table's existing data. How, then, can we append data to a table?

Suppose there is a table manbu in Hive:

hive> describe manbu;
id int
value int

hive> select * from manbu;
3 4
1 2
2 3

Add a record:

hive> insert overwrite table manbu
select id, value from (
  select id, value from manbu
  union all
  select 4 as id, 5 as value from manbu limit 1
) u;

The result is:

hive> select * from manbu;
3 4
4 5
2 3
1 2

The key is the union all keyword: the original data set is combined with the new record, and the result overwrites the table.

III. Initial values

When filling a table with insert overwrite table, the initial values of constant fields must match the types in the table definition. For example, to initialize a string field to NULL:

null as field_name                 -- wrong: the expression has type void, not string
cast(null as string) as field_name -- correct

Similarly, to initialize a bigint field to 0:

cast(0 as bigint) as field_name
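As a minimal sketch (the table names t_target and t_source and their columns are made up for illustration), both casts might be used together like this:

insert overwrite table t_target
select
  id,
  cast(null as string) as remark,    -- string column initialized to NULL
  cast(0 as bigint)    as click_cnt  -- bigint column initialized to 0
from t_source;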

IV. Optimization of order by, sort by, and distribute by

Hive's sorting keyword is sort by, deliberately named differently from the order by of traditional databases to emphasize the difference: sort by only sorts the data within each reducer (a single machine), so it does not produce a globally ordered result.

For example:

set mapred.reduce.tasks = 2;  (set the number of reducers to 2)

1. The original values (query without sorting):

select cookie_id, page_id, id from c02_clickstat_fatdt1
where cookie_id in ('1.193.131.218.1288611279693.0', '1.193.148.164.1288609861509.2');

1.193.148.164.1288609861509.2 113181412886099008861288609901078194082403 684000005

1.193.148.164.1288609861509.2 127001128860563972141288609859828580660473 684000015

1.193.148.164.1288609861509.2 113181412886099165721288609915890452725326 684000018

1.193.131.218.1288611279693.0 01c183da6e4bc50712881288611540109914561053 684000114

1.193.131.218.1288611279693.0 01c183da6e4bc22412881288611414343558272134 684000118

1.193.131.218.1288611279693.0 01c183da6e4bc50712881288611511781996667988 684000121

1.193.131.218.1288611279693.0 01c183da6e4bc22412881288611523640691739999 684000126

1.193.131.218.1288611279693.0 01c183da6e4bc50712881288611540109914561053 684000128

2. The same query with sort by:

select cookie_id, page_id, id from c02_clickstat_fatdt1
where cookie_id in ('1.193.131.218.1288611279693.0', '1.193.148.164.1288609861509.2')
sort by cookie_id, page_id;

Values after sort by:

1.193.131.218.1288611279693.0 684000118 01c183da6e4bc22412881288611414343558272134 684000118

1.193.131.218.1288611279693.0 684000114 01c183da6e4bc50712881288611540109914561053 684000114

1.193.131.218.1288611279693.0 684000128 01c183da6e4bc50712881288611540109914561053 684000128

1.193.148.164.1288609861509.2 684000005 113181412886099008861288609901078194082403 684000005

1.193.148.164.1288609861509.2 684000018 113181412886099165721288609915890452725326 684000018

1.193.131.218.1288611279693.0 684000126 01c183da6e4bc22412881288611523640691739999 684000126

1.193.131.218.1288611279693.0 684000121 01c183da6e4bc50712881288611511781996667988 684000121

1.193.148.164.1288609861509.2 684000015 127001128860563972141288609859828580660473 684000015

3. The same query with order by:

select cookie_id, page_id, id from c02_clickstat_fatdt1
where cookie_id in ('1.193.131.218.1288611279693.0', '1.193.148.164.1288609861509.2')
order by page_id, cookie_id;

Values after order by:

1.193.131.218.1288611279693.0 684000118 01c183da6e4bc22412881288611414343558272134 684000118

1.193.131.218.1288611279693.0 684000126 01c183da6e4bc22412881288611523640691739999 684000126

1.193.131.218.1288611279693.0 684000121 01c183da6e4bc50712881288611511781996667988 684000121

1.193.131.218.1288611279693.0 684000114 01c183da6e4bc50712881288611540109914561053 684000114

1.193.131.218.1288611279693.0 684000128 01c183da6e4bc50712881288611540109914561053 684000128

1.193.148.164.1288609861509.2 684000005 113181412886099008861288609901078194082403 684000005

1.193.148.164.1288609861509.2 684000018 113181412886099165721288609915890452725326 684000018

1.193.148.164.1288609861509.2 684000015 127001128860563972141288609859828580660473 684000015

As you can see, sort by and order by produce different results. Two reducers were specified at the start, so the data was split and each partition was sorted separately. The main reason the results differ is that the sort by query has no reduce key, so Hive generates a random number as the reduce key and the input records are distributed to the reducers at random. To guarantee that records with the same cookie_id never end up in different reducers, use the distribute by keyword to make cookie_id the distribution key.

select cookie_id, page_id, id from c02_clickstat_fatdt1
where cookie_id in ('1.193.131.218.1288611279693.0', '1.193.148.164.1288609861509.2')
distribute by cookie_id
sort by cookie_id, page_id;

1.193.131.218.1288611279693.0 684000118 01c183da6e4bc22412881288611414343558272134 684000118

1.193.131.218.1288611279693.0 684000126 01c183da6e4bc22412881288611523640691739999 684000126

1.193.131.218.1288611279693.0 684000121 01c183da6e4bc50712881288611511781996667988 684000121

1.193.131.218.1288611279693.0 684000114 01c183da6e4bc50712881288611540109914561053 684000114

1.193.131.218.1288611279693.0 684000128 01c183da6e4bc50712881288611540109914561053 684000128

1.193.148.164.1288609861509.2 684000005 113181412886099008861288609901078194082403 684000005

1.193.148.164.1288609861509.2 684000018 113181412886099165721288609915890452725326 684000018

1.193.148.164.1288609861509.2 684000015 127001128860563972141288609859828580660473 684000015

Example 2:

create table if not exists t_order (
  id int,           -- order id
  sale_id int,      -- sale id
  customer_id int,  -- customer id
  product_id int,   -- product id
  amount int        -- quantity
) partitioned by (ds string);

Query all records in the table, sorted by sale_id and amount:

set mapred.reduce.tasks = 2;

select sale_id, amount from t_order
sort by sale_id, amount;

This query may not return the order you expect. With two reducers specified, the data might be distributed as follows (each reducer sorted separately):

Reducer 1:
sale_id | amount
0 | 100
1 | 30
1 | 50
2 | 20

Reducer 2:
sale_id | amount
0 | 110
0 | 120
3 | 50
4 | 20

Use the distribute by keyword to make sale_id the distribution key. The rewritten HQL is as follows:

set mapred.reduce.tasks = 2;

select sale_id, amount from t_order
distribute by sale_id
sort by sale_id, amount;

This guarantees that all records with the same sale_id go to the same reducer and that the amounts within each sale_id are sorted correctly, but the sale_id values themselves are still not globally ordered, because Hive distributes the data with Hadoop's default HashPartitioner.

This brings us to the problem of total ordering. There are two solutions:

1) Do not distribute the data at all; use a single reducer:

set mapred.reduce.tasks = 1;

The drawback of this method is that the reduce side becomes a performance bottleneck, and with a large data volume the query generally cannot produce a usable result. In practice, however, it is still the most commonly used approach, because sorted queries are usually run to obtain a number of top-ranked results, so a limit clause can cut the data volume dramatically. With limit N, the number of records sent to the (single-machine) reduce side drops to N * (number of map tasks).
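As a minimal sketch of this approach on the t_order table above (the limit value of 100 is arbitrary):

set mapred.reduce.tasks = 1;

-- with a single reducer, sort by yields a total order;
-- limit keeps the data volume on that reducer manageable
select sale_id, amount from t_order
sort by amount desc
limit 100;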

2) Change the partitioner. This approach can produce a genuine total order. We can use the TotalOrderPartitioner that ships with Hadoop (contributed by Yahoo!), a partitioner that distributes data across reducers while keeping the reducers' key ranges ordered. It requires a SequenceFile that specifies the intervals used to distribute the data. Assuming we have already generated this file (stored at /tmp/range_key_list, with ranges for 100 reducers), the query above can be rewritten as:

set mapred.reduce.tasks = 100;
set hive.mapred.partitioner = org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set total.order.partitioner.path = /tmp/range_key_list;

select sale_id, amount from t_order
cluster by sale_id
sort by amount;

There are many ways to generate this interval file (for example, with the o.a.h.mapreduce.lib.partition.InputSampler tool that ships with Hadoop). Here we describe how to generate it with Hive itself, taking a t_sale table to be ordered by id as an example:

create table if not exists t_sale (
  id int,
  name string,
  loc string
);

The interval file distributed by sale_id can then be generated as follows:

create external table range_keys (sale_id int)
row format serde
  'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
stored as
  inputformat 'org.apache.hadoop.mapred.TextInputFormat'
  outputformat 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
location '/tmp/range_key_list';

insert overwrite table range_keys
select distinct sale_id
from t_sale tablesample (bucket 100 out of 100 on rand()) s
sort by sale_id;

The generated file (under the /tmp/range_key_list directory) lets TotalOrderPartitioner distribute the data processed by the reducers in order of sale_id.

The main issue to consider when building the interval file is how evenly it distributes the data, which requires a thorough understanding of the data itself.
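As a hedged sketch of how one might eyeball that balance before choosing the range boundaries (using the t_order table from Example 2), a simple per-key frequency count can reveal skewed keys:

-- count rows per sale_id; heavily skewed keys mean the range
-- boundaries need to be picked with more care
select sale_id, count(*) as cnt
from t_order
group by sale_id
order by cnt desc
limit 20;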

