Hive Interview topic: Table about 2T, the table data conversion __ Business Intelligence (PENTAHO)

Source: Internet
Author: User

Http://www.aboutyun.com/thread-7450-1-1.html


There is a very large table: Trlog The table is about 2T.
Trlog:
CREATE TABLE Trlog
(PLATFORM string,
user_id int,
Click_time String,
Click_url string)
Row format delimited
Fields terminated by ' t ';


Data:
PLATFORM user_id Click_time Click_url
WEB 12332321 2013-03-21 13:48:31.324/home/
WEB 12332321 2013-03-21 13:48:32.954/selectcat/er/
WEB 12332321 2013-03-21 13:48:46.365/er/viewad/12.html
WEB 12332321 2013-03-21 13:48:53.651/er/viewad/13.html
WEB 12332321 2013-03-21 13:49:13.435/er/viewad/24.html
WEB 12332321 2013-03-21 13:49:35.876/selectcat/che/
WEB 12332321 2013-03-21 13:49:56.398/che/viewad/93.html
WEB 12332321 2013-03-21 13:50:03.143/che/viewad/10.html
WEB 12332321 2013-03-21 13:50:34.265/home/
WAP 32483923 2013-03-21 23:58:41.123/m/home/
WAP 32483923 2013-03-21 23:59:16.123/m/selectcat/fang/
WAP 32483923 2013-03-21 23:59:45.123/m/fang/33.html
WAP 32483923 2013-03-22 00:00:23.984/m/fang/54.html
WAP 32483923 2013-03-22 00:00:54.043/m/selectcat/er/
WAP 32483923 2013-03-22 00:01:16.576/m/er/49.html
...... ...... ...... ......

The above data processing needs to be the following structure table Allog:
CREATE TABLE Allog
(PLATFORM string,
user_id int,
SEQ int,
From_url String,
To_url string)
Row format delimited
Fields terminated by ' t ';

The data structure after finishing:
PLATFORM user_id SEQ From_url to_url
WEB 12332321 1 null/home/
WEB 12332321 2/home//selectcat/er/
WEB 12332321 3/selectcat/er//er/viewad/12.html
WEB 12332321 4/er/viewad/12.html/er/viewad/13.html
WEB 12332321 5/er/viewad/13.html/er/viewad/24.html
WEB 12332321 6/er/viewad/24.html/selectcat/che/
WEB 12332321 7/selectcat/che//che/viewad/93.html
WEB 12332321 8/che/viewad/93.html/che/viewad/10.html
WEB 12332321 9/che/viewad/10.html/home/
WAP 32483923 1 null/m/home/
WAP 32483923 2/m/home//m/selectcat/fang/
WAP 32483923 3/m/selectcat/fang//m/fang/33.html
WAP 32483923 4/m/fang/33.html/m/fang/54.html
WAP 32483923 5/m/fang/54.html/m/selectcat/er/
WAP 32483923 6/m/selectcat/er//m/er/49.html
...... ...... ...... ......

Platform and USER_ID also represent platform and user id;seq fields that represent the order in which users are sorted by time, and From_url and To_url respectively on which page the user jumps from. For the first access record for a user on a platform, its from_url is null (NULL).


The interviewer says it needs to be done in two ways:
1. Implement a hive Generic UDF that speeds up the process described above, and give Hive SQL to implement the ETL process using this UDF

2. Realize the ETL process based on pure hive SQL, generate Allog table from Trlog table; (The result is a set of SQL)


I'll give you a Java-written RowNumber method.

public class RowNumber extends Org.apache.hadoop.hive.ql.exec.UDF {

private static int max_value = 50;
private static String comparedcolumn[] = new String[max_value];
private static int rownum = 1;

public int Evaluate (Object ... args) {
String columnvalue[] = new String[args.length];
for (int i = 0; i < args.length; i++)
Columnvalue[i] = args[i].tostring ();
if (rownum = 1)
{

for (int i = 0; i < columnvalue.length; i++)
Comparedcolumn[i] = Columnvalue[i];
}

for (int i = 0; i < columnvalue.length; i++)
{

if (!comparedcolumn[i].equals (Columnvalue[i]))
{
for (int j = 0; J < Columnvalue.length; J + +)
{
COMPAREDCOLUMN[J] = Columnvalue[j];
}
RowNum = 1;
return rownum++;
}
}
return rownum++;
}
}
The copy code packs this Java and compiles it into a jar package, such as Rownumber.jar. This you will always ~ ~ ~
And put it on the hive machine.
Execute the following two statements in the hive shell:
Add Jar/root/rownumber.jar;
#把RowNumber. Jar loaded into the classpath of hive
Create temporary function Row_number as ' rownumber ';
#在HIVE里创建一个新函数, called Row_number, the referenced class is the RowNumber in the Java code.
Copy Code
After the prompt is successful, execute the following hive SQL

#INSERT OVERWRITE Table Allog If you want to write to Allog tables, you can remove the annotations
SELECT T1.platform,t1.user_id,row_number (t1.user_id) Seq,t2.click_url From_url,t1.click_url TO_URL from
(select *,row_number (user_id) seq from trlog) t1
Left OUTER JOIN
(select *,row_number (user_id) seq from Trlog) T2
On t1.user_id = t2.user_id and t1.seq = t2.seq + 1;
Copy Code
The RN in the first question appears to be hive translate the SQL bug, you can remove the outer row_number, with T1 seq, you can find the problem.


Second question:

INSERT OVERWRITE TABLE Allog
SELECT T1.platform,t1.user_id,t1.seq,t2.click_url From_url,t1.click_url To_url from
(select Platform,user_id,click_time,click_url,count (1) Seq from (select A.*,b.click_time click_time1,b.click_url Click_url2 from Trlog a left outer join Trlog b in a.user_id = b.user_id) T WHERE click_time>=click_time1 GROUP by plat Form,user_id,click_time,click_url) T1
Left OUTER JOIN
(select Platform,user_id,click_time,click_url,count (1) Seq from (select A.*,b.click_time click_time1,b.click_url Click_url2 from Trlog a left outer join Trlog b in a.user_id = b.user_id) T WHERE click_time>=click_time1 GROUP by plat Form,user_id,click_time,click_url) T2
On t1.user_id = t2.user_id and t1.seq = t2.seq + 1;
Copy Code

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.