As a SQL-like query and analysis language, HQL is not yet as mature as other popular SQL dialects (such as Transact-SQL and MySQL's dialect). In a production environment, I think we should not chase every Hive release by always deploying the latest version; lagging one or two major versions behind is acceptable. After all, although open-source tools offer good transparency, their security is weak compared with closed-source commercial tools.
When using HQL for offline analysis of user data, even after the data has been aggregated, we are often only interested in the top-ranked values of a few key indicators, or in the incremental changes of those indicators. Sorting the aggregated data (in descending or ascending order of some column) is therefore a basic requirement, and even in an immature language like HQL it can be done effortlessly with the order by clause combined with limit.
We can even put multiple fields into the order by clause and specify ascending or descending order for each field, for example:
order by fieldA desc, fieldB [asc], fieldC desc
However, order by sorting is always global. Sometimes we want to sort within groups: first group by fieldA, then sort the rows belonging to each value of fieldA by fieldB and/or fieldC, rather than sorting all rows across fieldA values by fieldB and/or fieldC the way order by does.
To meet this requirement, Transact-SQL provides the over and partition by clauses together with the row_number() function, and Hive 0.11 likewise introduced the over and partition by clauses and the rank function, providing a convenient window analysis (group analysis) capability.
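For comparison, on Hive 0.11 or later the per-group ranking can be written directly with a window function. This is only a sketch; the table and column names are illustrative:

```sql
-- Hive 0.11+: rank rows within each fieldA group by fieldB descending.
-- some_table, fieldA and fieldB are placeholder names.
select
  fieldA,
  fieldB,
  rank() over (partition by fieldA order by fieldB desc) as rk
from some_table;
```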
Can Hive sort by group before version 0.11? The answer is yes, but it is less straightforward.
To achieve this, two clauses play important roles: distribute by and sort by. distribute by gives us the grouping we need by exploiting the way Hive queries map onto MapReduce jobs, and sort by then performs a partial sort within each group.
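As a minimal sketch of the two clauses working together (table and column names are illustrative):

```sql
-- distribute by routes all rows with the same fieldA to the same reducer;
-- sort by then orders the rows within each reducer's input, so rows arrive
-- grouped by fieldA and sorted by fieldB inside each group.
select fieldA, fieldB
from some_table
distribute by fieldA
sort by fieldA, fieldB desc;
```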
Of course, with only these two clauses we get a pile of sorted data, but we cannot tell the rank of each row within its group. Therefore we also need to write a UDF that determines and returns the rank; the following function seems to circulate widely on the net:
import org.apache.hadoop.hive.ql.exec.UDF;

public final class Rank extends UDF {
    private int counter;
    private String last_key = "";

    public int evaluate(final String key) {
        if (key == null) {
            this.last_key = "";
            this.counter = 0;
            return counter;
        }
        // A new key means a new group: reset the counter.
        if (!key.equalsIgnoreCase(this.last_key)) {
            this.counter = 0;
            this.last_key = key;
        }
        // Post-increment: the first row of a group gets 0.
        return this.counter++;
    }
}
We will ignore the registration process for a custom UDF here. After grouping, the Rank function is applied. It remembers the most recent argument value: as long as consecutive calls pass the same value, the counter keeps incrementing and its value is returned; when the argument differs from the value of the previous call, Rank resets its counter field and its key field (which stores the argument value). The result is an int rank value.
Hive calls this a user-defined function (UDF). In essence, every UDF is a class that implements an evaluate method.
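To see the counter semantics concretely, here is a standalone simulation of the same logic with the Hive dependency removed (the class name RankSim and the demo keys are mine, not from the original UDF):

```java
// Standalone simulation of the Rank UDF's counter logic (no Hive dependency).
public class RankSim {
    private int counter;
    private String lastKey = "";

    // Returns 0 for the first occurrence of a key, then 1, 2, ... for repeats.
    public int evaluate(String key) {
        if (key == null) {
            lastKey = "";
            counter = 0;
            return counter;
        }
        if (!key.equalsIgnoreCase(lastKey)) {
            counter = 0;   // new group: reset the counter
            lastKey = key;
        }
        return counter++;  // post-increment: first row of a group yields 0
    }

    public static void main(String[] args) {
        RankSim rank = new RankSim();
        // Keys must arrive grouped, as distribute by / sort by guarantees.
        String[] keys = {"a", "a", "a", "b", "b"};
        for (String k : keys) {
            System.out.println(k + " -> " + rank.evaluate(k));
        }
        // prints: a -> 0, a -> 1, a -> 2, b -> 0, b -> 1
    }
}
```

Note that the rank restarts at 0 for each new key, which is why the HQL script later adds 1 to the function's return value.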
With distribute by, sort by, and the Rank function, we can sort within groups. Before writing an HQL query script, though, we need to be clear about three things:
1. Grouping field: which field should be the distribute by field?
2. Sorting field: which field(s) should be the sort by field(s)?
3. Function argument: the Rank function takes a String parameter — what should we pass to it as the actual argument?
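As a sketch of how the three answers fit together for a simple per-group top-N (names illustrative): the distribute by field is the grouping field, the sort by fields are the grouping field plus the ranking metric, and the actual argument passed to Rank is the grouping field itself:

```sql
-- Rank's argument must be the grouping field: within a group the value is
-- constant, so the counter keeps incrementing; when the group changes,
-- Rank sees a new value and resets to 0.
select
  fieldA,
  fieldB,
  (Rank(fieldA) + 1) as group_rank
from (
  select fieldA, fieldB
  from some_table
  distribute by fieldA
  sort by fieldA, fieldB desc
) t;
```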
Perhaps because the three questions above are considered "easy to show", as textbooks occasionally put it, or perhaps because the bloggers merely thought about the problem without practicing it, none of the blog posts I have seen about group sorting (top N) raise these three questions explicitly. In my experience, even someone who reads those posts and gets correct results on the first try will run into various puzzles. So let us look at a real scenario.
For example, we need to query data like this: for the app myAPP, the top 10 devices by daily usage UV, and for each of those devices the top 10 forum contents by visit volume (VV), distinguished by content ID.
Assume that the Hive table required for the query is hiveTab_useraction.
Outline: first, use one subquery to find the content accessed by each device and another subquery to find the top 10 devices; then join the two result sets; finally, extract the needed fields and rank columns in the outer query.
In this process:
1) Finding the top 10 devices does not appear to involve grouping, but because we want a rank we still have to consider the three questions above: order by does not seem to cooperate well with the Rank function (or perhaps the method I used was not sound). This is also why, in the script below, we generate a constant string field distribute_key;
2) The three questions must also be considered in the outer query.
The complete HQL script:
select
  device_rank,
  device_info,
  vv_rank,
  pageID,
  act_vv
from
(
  select
    device_rank,
    device_info,
    (Rank(device_rank) + 1) as vv_rank,
    pageID,
    act_vv
  from
  (
    select
      t2.device_rank,
      t2.device_info,
      t1.pageID,
      t1.act_vv
    from
    (
      select
        fieldA as device_info,
        pageID,
        count(1) as act_vv
      from hiveTab_useraction
      where `date` >= dateStart and `date` <= dateEnd
      group by fieldA, pageID
    ) t1
    join
    (
      select
        (Rank(distribute_key) + 1) as device_rank,
        device_info,
        act_uv
      from
      (
        select
          distribute_key,
          device_info,
          act_uv
        from
        (
          select
            'topdevice' as distribute_key,
            device_info,
            act_uv
          from
          (
            select
              fieldA as device_info,
              count(distinct uid) as act_uv
            from hiveTab_useraction
            where `date` >= dateStart and `date` <= dateEnd
            group by fieldA
          ) t
          order by act_uv desc
          limit 10
        ) t
        distribute by distribute_key
        sort by act_uv desc
      ) t
    ) t2 on (t1.device_info = t2.device_info)
    distribute by t2.device_rank
    sort by t2.device_rank, t1.act_vv desc
  ) t
) t
where vv_rank <= 10
Testing this script shows that the three questions above really do matter. Note also that the Rank function's ranking starts from 0, which is why we add + 1 to its return value.
Hive also allows ordinary users to write, register, and use their own user-defined functions, which indeed provides great extensibility.