Hive face Test-hive Application Thinking
Question: There is a very large table: Trlog The table is about 2T.
Trlog:
CREATE TABLE trlog
(PLATFORM string,
user_id int,
click_time string,
Click_url string)
row format delimited fields terminated by ' \ t ';
Data:
PLATFORM user_id click_time click_url WEB 12332321 2013-03-21 13:48:3 1.324/home/web 12332321 2013-03-21 13:48:32.954/selectcat/er/web 12332321 2013-03-21 13:48:46.365/er/viewad/12.html WEB 12332321 2013-03-21 13:48:53.651/er/v iewad/13.html Web 12332321 2013-03-21 13:49:13.435/er/viewad/24.html Web 12332321 2013-03-21 13:49:35.876/selectcat/che/web 12332321 2013-03-21 13:49:56.398/che/view ad/93.html Web 12332321 2013-03-21 13:50:03.143/che/viewad/10.html Web 12332321 2013-03-21 13:50:34.265/home/wap 32483923 2013-03-21 23:58:41.123/m/home/wap 32483923 2013-03-21 23:59:16.123/m/selectcat/fang/wap 32483923 2013-03-21 23:59:45. 123/m/fang/33.htmL WAP 32483923 2013-03-22 00:00:23.984/m/fang/54.html WAP 32483923 2013-03-22 00:00:54.043/m/selectcat/er/wap 32483923 2013-03-22 00:01:16.576/m/er/49.html ...
...... ...... ......
This data processing is required to be a table allog of the following structure:
CREATE TABLE allog
(PLATFORM string,
user_id int,
SEQ int,
from_url string,
to_url string)
row format delimited fields terminated by ' \ t ';
The data structure after finishing:
PLATFORM user_id SEQ from_url to_url WEB 12332321 1 Null/home/web 12332321 2/home//sel Ectcat/er/web 12332321 3/selectcat/er//er/viewad/12.html WEB 12332321 4/er/viewad/12.html/er/viewad/13.html WEB 12332321 5 /er/viewad/13.html/er/viewad/24.html WEB 12332321 6/er/viewad/24.html /selectcat/che/web 12332321 7/selectcat/che//che/viewad/93.html WEB 12332321 8/che/viewad/93.html/che/viewad/10.html WEB 12332321 9 /che/viewad/10.html/home/wap 32483923 1 NULL/ M/home/wap 32483923 2/m/home//m/selectcat/fang/wap 32483923 3 /m/selectcat/fang//m/fang/33.html WAP 32483923 4/m/fang/33.html /m/fang/54.html WAP 32483923 5/m/fang/54.html/m/selectcat/er/wap 32483923 6/m/selectcat/er//m/er/49.html ........ .....
...... ......
Description: Platform and USER_ID also represent platform and user id;seq fields that represent the order in which users are sorted by time, and From_url and To_url on which page the user jumps to on behalf of each page. For the first access record for a user on a platform, its from_url is null (NULL).
Problem-solving requirements: need to be done in two ways:
1. Implement a hive Generic UDF that speeds up the process described above, and give hive SQL that uses this UDF to implement the ETL process.
2. Realize the ETL process based on pure hive SQL, generate Allog table from Trlog table; (The result is a set of hive SQL statements).
Explanation: The first question efficiency must be high, because the table has 2TB, the second question does not matter, as long as can use hive SQL realization on line.
The solution Idea ******************
Problem one: Write UDF simple, as long as a custom RowNumber method, loaded into the hive can produce results.
The following is a Java-written RowNumber method provided by a user to get the value of the "SEQ" column (the SEQ field represents the order in which users are sorted by time), for informational purposes only.
public class RowNumber extends Org.apache.hadoop.hive.ql.exec.UDF {private static int max_value = 50;
private static String comparedcolumn[] = new String[max_value];
private static int rownum = 1;
public int Evaluate (Object ... args) {String columnvalue[] = new String[args.length];
for (int i = 0; i < args.length i++) columnvalue[i] = args[i].tostring (); if (rownum = = 1) {for (int i = 0; i < columnvalue.length i++) Comparedcolumn[i]
= Columnvalue[i]; for (int i = 0; i < columnvalue.length i++) {if!comparedcolumn[i].equals (Columnva
Lue[i]) {for (int j = 0; J < Columnvalue.length J) {
COMPAREDCOLUMN[J] = Columnvalue[j];
} rownum = 1;
return rownum++;
} return rownum++;
}
}
Pack the Java class above and compile it into a jar package, such as Rownumber.jar. Then put it in the appropriate directory on the hive machine.
Then execute the following two statements in the hive shell:
Add Jar/root/rownumber.jar; #把RowNumber. Jar is loaded into the classpath of hive to
create temporary function row_number as ' rownumber ';
#在HIVE里创建一个新函数, called Row_number, refers to the class of RowNumber in Java code.
After the prompt is successful, execute the following hive SQL statement:
#INSERT OVERWRITE Table Allog If you want to write to the Allog table, you can remove the annotation
from SELECT t1.platform,t1.user_id,row_number (t1.user_id) SEQ, T2.click_url From_url,t1.click_url To_url from
(select *,row_number (user_id) seq-trlog) T1 left
OUTER join< c6/> (select *,row_number (user_id) seq from Trlog) t2 on
t1.user_id = t2.user_id and t1.seq = t2.seq + 1;
Successfully implemented.
Question: The first problem is solved, but one thing does not understand t1\t2 since there is row_number (user_id) Seq, why at the outermost also have to set a row_number (user_id) Seq.
The cow replied: The RN in the first question appears to be hive translates the SQL bug, you can remove the outer row_number, with T1 seq, you can find the problem. The specific situation has yet to be analyzed, interested in the foreign community to check the relevant bug LIST.
The idea of problem solving two:
INSERT OVERWRITE TABLE allog
SELECT t1.platform,t1.user_id,t1.seq,t2.click_url from_url,t1.click_url To_url from
(select Platform,user_id,click_time,click_url,count (1) Seq from (select A.*,b.click_time click_time1,b.click_ URL click_url2 from trlog a left outer join Trlog b on a.user_id = b.user_id) T WHERE click_time>=click_time1 GROUP by Platform,user_id,click_time,click_url) T1
the left OUTER JOIN
(SELECT Platform,user_id,click_time,click_url, Count (1) Seq from (SELECT a.*,b.click_time click_time1,b.click_url click_url2 from trlog a left outer join Trlog B on a.user_id = b.user_id) T WHERE click_time>=click_time1 GROUP by Platform,user_id,click_time,click_url) T2 on
t1.user_id = t2.user_id and t1.seq = t2.seq + 1;
Analysis: This method has no efficiency at all, MapReduce job to run at least 5 times, do the experiment is OK, run production environment will be free.
This blog is organized from CSDN forum: http://bbs.csdn.net/topics/390414342