Http://www.aboutyun.com/thread-7450-1-1.html
There is a very large table: Trlog The table is about 2T.
Trlog:
CREATE TABLE Trlog
(PLATFORM string,
user_id int,
Click_time String,
Click_url string)
Row format delimited
Fields terminated by ' t ';
Data:
PLATFORM user_id Click_time Click_url
WEB 12332321 2013-03-21 13:48:31.324/home/
WEB 12332321 2013-03-21 13:48:32.954/selectcat/er/
WEB 12332321 2013-03-21 13:48:46.365/er/viewad/12.html
WEB 12332321 2013-03-21 13:48:53.651/er/viewad/13.html
WEB 12332321 2013-03-21 13:49:13.435/er/viewad/24.html
WEB 12332321 2013-03-21 13:49:35.876/selectcat/che/
WEB 12332321 2013-03-21 13:49:56.398/che/viewad/93.html
WEB 12332321 2013-03-21 13:50:03.143/che/viewad/10.html
WEB 12332321 2013-03-21 13:50:34.265/home/
WAP 32483923 2013-03-21 23:58:41.123/m/home/
WAP 32483923 2013-03-21 23:59:16.123/m/selectcat/fang/
WAP 32483923 2013-03-21 23:59:45.123/m/fang/33.html
WAP 32483923 2013-03-22 00:00:23.984/m/fang/54.html
WAP 32483923 2013-03-22 00:00:54.043/m/selectcat/er/
WAP 32483923 2013-03-22 00:01:16.576/m/er/49.html
...... ...... ...... ......
The above data processing needs to be the following structure table Allog:
CREATE TABLE Allog
(PLATFORM string,
user_id int,
SEQ int,
From_url String,
To_url string)
Row format delimited
Fields terminated by ' t ';
The data structure after finishing:
PLATFORM user_id SEQ From_url to_url
WEB 12332321 1 null/home/
WEB 12332321 2/home//selectcat/er/
WEB 12332321 3/selectcat/er//er/viewad/12.html
WEB 12332321 4/er/viewad/12.html/er/viewad/13.html
WEB 12332321 5/er/viewad/13.html/er/viewad/24.html
WEB 12332321 6/er/viewad/24.html/selectcat/che/
WEB 12332321 7/selectcat/che//che/viewad/93.html
WEB 12332321 8/che/viewad/93.html/che/viewad/10.html
WEB 12332321 9/che/viewad/10.html/home/
WAP 32483923 1 null/m/home/
WAP 32483923 2/m/home//m/selectcat/fang/
WAP 32483923 3/m/selectcat/fang//m/fang/33.html
WAP 32483923 4/m/fang/33.html/m/fang/54.html
WAP 32483923 5/m/fang/54.html/m/selectcat/er/
WAP 32483923 6/m/selectcat/er//m/er/49.html
...... ...... ...... ......
Platform and USER_ID also represent platform and user id;seq fields that represent the order in which users are sorted by time, and From_url and To_url respectively on which page the user jumps from. For the first access record for a user on a platform, its from_url is null (NULL).
The interviewer says it needs to be done in two ways:
1. Implement a hive Generic UDF that speeds up the process described above, and give Hive SQL to implement the ETL process using this UDF
2. Realize the ETL process based on pure hive SQL, generate Allog table from Trlog table; (The result is a set of SQL)
I'll give you a Java-written RowNumber method.
public class RowNumber extends Org.apache.hadoop.hive.ql.exec.UDF {
private static int max_value = 50;
private static String comparedcolumn[] = new String[max_value];
private static int rownum = 1;
public int Evaluate (Object ... args) {
String columnvalue[] = new String[args.length];
for (int i = 0; i < args.length; i++)
Columnvalue[i] = args[i].tostring ();
if (rownum = 1)
{
for (int i = 0; i < columnvalue.length; i++)
Comparedcolumn[i] = Columnvalue[i];
}
for (int i = 0; i < columnvalue.length; i++)
{
if (!comparedcolumn[i].equals (Columnvalue[i]))
{
for (int j = 0; J < Columnvalue.length; J + +)
{
COMPAREDCOLUMN[J] = Columnvalue[j];
}
RowNum = 1;
return rownum++;
}
}
return rownum++;
}
}
The copy code packs this Java and compiles it into a jar package, such as Rownumber.jar. This you will always ~ ~ ~
And put it on the hive machine.
Execute the following two statements in the hive shell:
Add Jar/root/rownumber.jar;
#把RowNumber. Jar loaded into the classpath of hive
Create temporary function Row_number as ' rownumber ';
#在HIVE里创建一个新函数, called Row_number, the referenced class is the RowNumber in the Java code.
Copy Code
After the prompt is successful, execute the following hive SQL
#INSERT OVERWRITE Table Allog If you want to write to Allog tables, you can remove the annotations
SELECT T1.platform,t1.user_id,row_number (t1.user_id) Seq,t2.click_url From_url,t1.click_url TO_URL from
(select *,row_number (user_id) seq from trlog) t1
Left OUTER JOIN
(select *,row_number (user_id) seq from Trlog) T2
On t1.user_id = t2.user_id and t1.seq = t2.seq + 1;
Copy Code
The RN in the first question appears to be hive translate the SQL bug, you can remove the outer row_number, with T1 seq, you can find the problem.
Second question:
INSERT OVERWRITE TABLE Allog
SELECT T1.platform,t1.user_id,t1.seq,t2.click_url From_url,t1.click_url To_url from
(select Platform,user_id,click_time,click_url,count (1) Seq from (select A.*,b.click_time click_time1,b.click_url Click_url2 from Trlog a left outer join Trlog b in a.user_id = b.user_id) T WHERE click_time>=click_time1 GROUP by plat Form,user_id,click_time,click_url) T1
Left OUTER JOIN
(select Platform,user_id,click_time,click_url,count (1) Seq from (select A.*,b.click_time click_time1,b.click_url Click_url2 from Trlog a left outer join Trlog b in a.user_id = b.user_id) T WHERE click_time>=click_time1 GROUP by plat Form,user_id,click_time,click_url) T2
On t1.user_id = t2.user_id and t1.seq = t2.seq + 1;
Copy Code