Hive Full Order

Source: Internet
Author: User

1.1 Full Sort

The sorting key for hive is sort by, which intentionally distinguishes between the order by of the traditional database and also to emphasize the difference between the two –sort by can only be sorted within a single machine scope. [/URL] [Wang Li 1]

1.1.1 Example 1Set mapred.reduce.tasks=2; original value Select Cookie_id,page_id,id fromc02_clickstat_fatdt1where cookie_idin (' 1.193.131.218.1288611279693.0 ', ' 1.193.148.164.1288609861509.2 ') 1.193.148.164.1288609861509.2   113181412886099008861288609901078194082403      6840000051.193.148.164.1288609861509.2   127001128860563972141288609859828580660473      6840000151.193.148.164.1288609861509.2   113181412886099165721288609915890452725326       6840000181.193.131.218.1288611279693.0  01c183da6e4bc50712881288611540109914561053      6840001141.193.131.218.1288611279693.0  01c183da6e4bc22412881288611414343558274174       6840001181.193.131.218.1288611279693.0  01c183da6e4bc50712881288611511781996667988      6840001211.193.131.218.1288611279693.0  01c183da6e4bc22412881288611523640691739999       6840001261.193.131.218.1288611279693.0  01c183da6e4bc50712881288611540109914561053      684000128hive > select Cookie_id,page_id,id fromc02_clickstat_fatdt1 wherecookie_idin (' 1.193.131.218.1288611279693.0 ', ' 1.193.148.164.1288609861509.2 ') SORT by cookie_id,page_id; Sort sorted by value 1.193.131.218.1288611279693.0           684000118       01c183da6e4bc22412881288611414343558274174       6840001181.193.131.218.1288611279693.0           684000114       01c183da6e4bc50712881288611540109914561053       6840001141.193.131.218.1288611279693.0           684000128       01c183da6e4bc50712881288611540109914561053       6840001281.193.148.164.1288609861509.2           684000005        113181412886099008861288609901078194082403      6840000051.193.148.164.1288609861509.2           684000018      113181412886099165721288609915890452725326       6840000181.193.131.218.1288611279693.0           684000126        01c183da6e4bc22412881288611523640691739999       6840001261.193.131.218.1288611279693.0           684000121       01c183da6e4bc50712881288611511781996667988       6840001211.193.148.164.1288609861509.2           684000015       127001128860563972141288609859828580660473      684000015select cookie_id,page_id , id fromc02_clickstat_fatdt1where cookie_idin (' 1.193.131.218.1288611279693.0 ', ' 1.193.148.164.1288609861509.2 ') ORDER by page_id,cookie_id;1.193.131.218.1288611279693.0           684000118       01c183da6e4bc22412881288611414343558274174      6840001181.193.131.218.1288611279693.0           684000126       01c183da6e4bc22412881288611523640691739999      6840001261.193.131.218.1288611279693.0           684000121       01c183da6e4bc50712881288611511781996667988      6840001211.193.131.218.1288611279693.0           684000114       01c183da6e4bc50712881288611540109914561053      6840001141.193.131.218.1288611279693.0           684000128       01c183da6e4bc50712881288611540109914561053      6840001281.193.148.164.1288609861509.2      &nbsp    684000005      113181412886099008861288609901078194082403       6840000051.193.148.164.1288609861509.2           684000018       113181412886099165721288609915890452725326       6840000181.193.148.164.1288609861509.2           684000015        127001128860563972141288609859828580660473      684000015 You can see that the sort and order values are different. I initially specified 2 reduce for data distribution (sorted separately). The main reason for the difference is that the above query does not reduce key,hive generates a random number as the reduce key. In this case, the input records are randomly distributed to different reducer machines. To ensure that there are no duplicate cookie_id records between reducer, you can use the Distribute by keyword to specify that the distribution key is cookie_id. Select Cookie_id,country,id,page_id,id fromc02_clickstat_fatdt1 where Cookie_idin (' 1.193.131.218.1288611279693.0 ', ' 1.193.148.164.1288609861509.2 ')   distribute by cookie_id SORT bycookie_id,page_id; 1.193.131.218.1288611279693.0           684000118       01c183da6e4bc22412881288611414343558274174      6840001181.193.131.218.1288611279693.0           684000126      01c183da6e4bc22412881288611523640691739999       6840001261.193.131.218.1288611279693.0           684000121        01c183da6e4bc50712881288611511781996667988       6840001211.193.131.218.1288611279693.0           684000114       01c183da6e4bc50712881288611540109914561053       6840001141.193.131.218.1288611279693.0           684000128       01c183da6e4bc50712881288611540109914561053       6840001281.193.148.164.1288609861509.2           684000005   & NBsP  113181412886099008861288609901078194082403      6840000051.193.148.164.1288609861509.2            684000018       113181412886099165721288609915890452725326      6840000181.193.148.164.1288609861509.2           684000015       127001128860563972141288609859828580660473      6840000151.1.2      2 CREATE TABLE if not exists t_order (id int,--order number sale_id int,--Sales idcustomer_id int,--Customer idproduct _id int,--Product Idamou NT INT--number) partitioned by (DS STRING); All sales records are queried in the table and sorted by sales ID and quantity: set mapred.reduce.tasks=2; Select sale_id, amount from T_ordersort by sale_id, amount; This query may get a non-expected sort. The data distributed to the 2 specified reducer may be (sorted separately): reducer1:sale_id | amount0 | 1001 | 301 | 502 | 20reducer2:sale_id | amount0 | 1100 | 1203 | 504 | 20 use the Distribute by keyword to specify that the distribution key is sale_id. After the transformation of the HQL as follows: Set mapred.reduce.tasks=2; Select sale_id, amount from T_orderdistribute to Sale_idsort by sale_id, amount; This ensures that the number of sales IDs in the query's sales record collection is correctly sorted, However, the sales ID is not sorted correctly because hive uses Hadoop's default Hashpartitioner to distribute data. This involves a full-ranking problem. There are two ways to solve this problem: 1.) Do not distribute data, use a single reducer:set mapred.reduce.tasks=1; The drawback of this approach is that the reduce end becomes a performance bottleneck and is generally not available when the volume of data is large. In practice, however, this is still the most common approach, because the usual sort of query is to get some of the top results, so you can use the limit clause to significantly reduce the amount of data. With limit N, the number of data records transferred to the reduce side (stand-alone) is reduced to n (number of maps). 2.) Modify the Partitioner, this method can be done in full order. Here you can use the Totalorderpartitioner from Hadoop (the Terasort project from Yahoo!), a partitione to support the development of ordered data across reducer distributionR, which requires a sequencefile format file to specify the data interval for the distribution. If we have generated this file (stored in/tmp/range_key_list, divided into 100 reducer), you can rewrite the above query as set Mapred.reduce.tasks=100;set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.totalorderpartitioner;settotal.order.partitioner.path=/ Tmp/range_key_list; Select sale_id, amount from t_ordercluster to sale_idsort by amount; There are many ways to generate this interval file (for example, the O.a.h.mapreduce.lib.partition.inputsampler tool that comes with Hadoop). Here are the methods generated with hive, such as a t_sale table with an orderly ID: CREATE TABLE if not exists T_sale (ID int,name string,loc string); Generate Press Sale_ The interval file for ID distribution is: Create external tablerange_keys (sale_id int) row format Serde ' Org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe ' Stored Asinputformat ' Org.apache.hadoop.mapred.TextInputFormat ' OutputFormat ' Org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat ' location '/tmp/range_key_list '; insert Overwrite table range_keysselect distinct sale_idfrom source t_salesampletable (buckets out of the on Rand ()) Ssort by S ale_id; The resulting file (/tmp/range_key_list directory) allows TotalorderpartItioner distributes the data of reduce processing in an orderly manner by sale_id. The main issue to be considered in a zone file is the balance of data distribution, which relies on a deep understanding of the data. More exciting content please pay attention to:http://bbs.superwu.cn  attention to Superman academy QR Code: Focus on the Superman college Java Free Learning Exchange Group: 455085757

Hive Full Order

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.