Hive Full Order

Last Update:2015-05-28 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

1.1 Full Sort

The sorting key for hive is sort by, which intentionally distinguishes between the order by of the traditional database and also to emphasize the difference between the two –sort by can only be sorted within a single machine scope. [/URL] [Wang Li 1]

1.1.1 Example 1Set mapred.reduce.tasks=2; original value Select Cookie_id,page_id,id fromc02_clickstat_fatdt1where cookie_idin (' 1.193.131.218.1288611279693.0 ', ' 1.193.148.164.1288609861509.2 ') 1.193.148.164.1288609861509.2 113181412886099008861288609901078194082403 6840000051.193.148.164.1288609861509.2 127001128860563972141288609859828580660473 6840000151.193.148.164.1288609861509.2 113181412886099165721288609915890452725326 6840000181.193.131.218.1288611279693.0 01c183da6e4bc50712881288611540109914561053 6840001141.193.131.218.1288611279693.0 01c183da6e4bc22412881288611414343558274174 6840001181.193.131.218.1288611279693.0 01c183da6e4bc50712881288611511781996667988 6840001211.193.131.218.1288611279693.0 01c183da6e4bc22412881288611523640691739999 6840001261.193.131.218.1288611279693.0 01c183da6e4bc50712881288611540109914561053 684000128hive > select Cookie_id,page_id,id fromc02_clickstat_fatdt1 wherecookie_idin (' 1.193.131.218.1288611279693.0 ', ' 1.193.148.164.1288609861509.2 ') SORT by cookie_id,page_id; Sort sorted by value 1.193.131.218.1288611279693.0 684000118 01c183da6e4bc22412881288611414343558274174 6840001181.193.131.218.1288611279693.0 684000114 01c183da6e4bc50712881288611540109914561053 6840001141.193.131.218.1288611279693.0 684000128 01c183da6e4bc50712881288611540109914561053 6840001281.193.148.164.1288609861509.2 684000005 113181412886099008861288609901078194082403 6840000051.193.148.164.1288609861509.2 684000018 113181412886099165721288609915890452725326 6840000181.193.131.218.1288611279693.0 684000126 01c183da6e4bc22412881288611523640691739999 6840001261.193.131.218.1288611279693.0 684000121 01c183da6e4bc50712881288611511781996667988 6840001211.193.148.164.1288609861509.2 684000015 127001128860563972141288609859828580660473 684000015select cookie_id,page_id , id fromc02_clickstat_fatdt1where cookie_idin (' 1.193.131.218.1288611279693.0 ', ' 1.193.148.164.1288609861509.2 ') ORDER by page_id,cookie_id;1.193.131.218.1288611279693.0 684000118 01c183da6e4bc22412881288611414343558274174 6840001181.193.131.218.1288611279693.0 684000126 01c183da6e4bc22412881288611523640691739999 6840001261.193.131.218.1288611279693.0 684000121 01c183da6e4bc50712881288611511781996667988 6840001211.193.131.218.1288611279693.0 684000114 01c183da6e4bc50712881288611540109914561053 6840001141.193.131.218.1288611279693.0 684000128 01c183da6e4bc50712881288611540109914561053 6840001281.193.148.164.1288609861509.2 &nbsp 684000005 113181412886099008861288609901078194082403 6840000051.193.148.164.1288609861509.2 684000018 113181412886099165721288609915890452725326 6840000181.193.148.164.1288609861509.2 684000015 127001128860563972141288609859828580660473 684000015 You can see that the sort and order values are different. I initially specified 2 reduce for data distribution (sorted separately). The main reason for the difference is that the above query does not reduce key,hive generates a random number as the reduce key. In this case, the input records are randomly distributed to different reducer machines. To ensure that there are no duplicate cookie_id records between reducer, you can use the Distribute by keyword to specify that the distribution key is cookie_id. Select Cookie_id,country,id,page_id,id fromc02_clickstat_fatdt1 where Cookie_idin (' 1.193.131.218.1288611279693.0 ', ' 1.193.148.164.1288609861509.2 ') distribute by cookie_id SORT bycookie_id,page_id; 1.193.131.218.1288611279693.0 684000118 01c183da6e4bc22412881288611414343558274174 6840001181.193.131.218.1288611279693.0 684000126 01c183da6e4bc22412881288611523640691739999 6840001261.193.131.218.1288611279693.0 684000121 01c183da6e4bc50712881288611511781996667988 6840001211.193.131.218.1288611279693.0 684000114 01c183da6e4bc50712881288611540109914561053 6840001141.193.131.218.1288611279693.0 684000128 01c183da6e4bc50712881288611540109914561053 6840001281.193.148.164.1288609861509.2 684000005 & NBsP 113181412886099008861288609901078194082403 6840000051.193.148.164.1288609861509.2 684000018 113181412886099165721288609915890452725326 6840000181.193.148.164.1288609861509.2 684000015 127001128860563972141288609859828580660473 6840000151.1.2 2 CREATE TABLE if not exists t_order (id int,--order number sale_id int,--Sales idcustomer_id int,--Customer idproduct _id int,--Product Idamou NT INT--number) partitioned by (DS STRING); All sales records are queried in the table and sorted by sales ID and quantity: set mapred.reduce.tasks=2; Select sale_id, amount from T_ordersort by sale_id, amount; This query may get a non-expected sort. The data distributed to the 2 specified reducer may be (sorted separately): reducer1:sale_id | amount0 | 1001 | 301 | 502 | 20reducer2:sale_id | amount0 | 1100 | 1203 | 504 | 20 use the Distribute by keyword to specify that the distribution key is sale_id. After the transformation of the HQL as follows: Set mapred.reduce.tasks=2; Select sale_id, amount from T_orderdistribute to Sale_idsort by sale_id, amount; This ensures that the number of sales IDs in the query's sales record collection is correctly sorted, However, the sales ID is not sorted correctly because hive uses Hadoop's default Hashpartitioner to distribute data. This involves a full-ranking problem. There are two ways to solve this problem: 1.) Do not distribute data, use a single reducer:set mapred.reduce.tasks=1; The drawback of this approach is that the reduce end becomes a performance bottleneck and is generally not available when the volume of data is large. In practice, however, this is still the most common approach, because the usual sort of query is to get some of the top results, so you can use the limit clause to significantly reduce the amount of data. With limit N, the number of data records transferred to the reduce side (stand-alone) is reduced to n (number of maps). 2.) Modify the Partitioner, this method can be done in full order. Here you can use the Totalorderpartitioner from Hadoop (the Terasort project from Yahoo!), a partitione to support the development of ordered data across reducer distributionR, which requires a sequencefile format file to specify the data interval for the distribution. If we have generated this file (stored in/tmp/range_key_list, divided into 100 reducer), you can rewrite the above query as set Mapred.reduce.tasks=100;set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.totalorderpartitioner;settotal.order.partitioner.path=/ Tmp/range_key_list; Select sale_id, amount from t_ordercluster to sale_idsort by amount; There are many ways to generate this interval file (for example, the O.a.h.mapreduce.lib.partition.inputsampler tool that comes with Hadoop). Here are the methods generated with hive, such as a t_sale table with an orderly ID: CREATE TABLE if not exists T_sale (ID int,name string,loc string); Generate Press Sale_ The interval file for ID distribution is: Create external tablerange_keys (sale_id int) row format Serde ' Org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe ' Stored Asinputformat ' Org.apache.hadoop.mapred.TextInputFormat ' OutputFormat ' Org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat ' location '/tmp/range_key_list '; insert Overwrite table range_keysselect distinct sale_idfrom source t_salesampletable (buckets out of the on Rand ()) Ssort by S ale_id; The resulting file (/tmp/range_key_list directory) allows TotalorderpartItioner distributes the data of reduce processing in an orderly manner by sale_id. The main issue to be considered in a zone file is the balance of data distribution, which relies on a deep understanding of the data. More exciting content please pay attention to:http://bbs.superwu.cn attention to Superman academy QR Code: Focus on the Superman college Java Free Learning Exchange Group: 455085757

Hive Full Order

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Hive Full Order

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Hive Full Order

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support