Hive Interview topic: Table about 2T, the table data conversion _

Hive Interview topic: Table about 2T, the table data conversion __ Business Intelligence (PENTAHO)

Last Update:2018-08-20 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Http://www.aboutyun.com/thread-7450-1-1.html

There is a very large table: Trlog The table is about 2T.
Trlog:
CREATE TABLE Trlog
(PLATFORM string,
user_id int,
Click_time String,
Click_url string)
Row format delimited
Fields terminated by ' t ';

Data:
PLATFORM user_id Click_time Click_url
WEB 12332321 2013-03-21 13:48:31.324/home/
WEB 12332321 2013-03-21 13:48:32.954/selectcat/er/
WEB 12332321 2013-03-21 13:48:46.365/er/viewad/12.html
WEB 12332321 2013-03-21 13:48:53.651/er/viewad/13.html
WEB 12332321 2013-03-21 13:49:13.435/er/viewad/24.html
WEB 12332321 2013-03-21 13:49:35.876/selectcat/che/
WEB 12332321 2013-03-21 13:49:56.398/che/viewad/93.html
WEB 12332321 2013-03-21 13:50:03.143/che/viewad/10.html
WEB 12332321 2013-03-21 13:50:34.265/home/
WAP 32483923 2013-03-21 23:58:41.123/m/home/
WAP 32483923 2013-03-21 23:59:16.123/m/selectcat/fang/
WAP 32483923 2013-03-21 23:59:45.123/m/fang/33.html
WAP 32483923 2013-03-22 00:00:23.984/m/fang/54.html
WAP 32483923 2013-03-22 00:00:54.043/m/selectcat/er/
WAP 32483923 2013-03-22 00:01:16.576/m/er/49.html
...... ...... ...... ......

The above data processing needs to be the following structure table Allog:
CREATE TABLE Allog
(PLATFORM string,
user_id int,
SEQ int,
From_url String,
To_url string)
Row format delimited
Fields terminated by ' t ';

The data structure after finishing:
PLATFORM user_id SEQ From_url to_url
WEB 12332321 1 null/home/
WEB 12332321 2/home//selectcat/er/
WEB 12332321 3/selectcat/er//er/viewad/12.html
WEB 12332321 4/er/viewad/12.html/er/viewad/13.html
WEB 12332321 5/er/viewad/13.html/er/viewad/24.html
WEB 12332321 6/er/viewad/24.html/selectcat/che/
WEB 12332321 7/selectcat/che//che/viewad/93.html
WEB 12332321 8/che/viewad/93.html/che/viewad/10.html
WEB 12332321 9/che/viewad/10.html/home/
WAP 32483923 1 null/m/home/
WAP 32483923 2/m/home//m/selectcat/fang/
WAP 32483923 3/m/selectcat/fang//m/fang/33.html
WAP 32483923 4/m/fang/33.html/m/fang/54.html
WAP 32483923 5/m/fang/54.html/m/selectcat/er/
WAP 32483923 6/m/selectcat/er//m/er/49.html
...... ...... ...... ......

Platform and USER_ID also represent platform and user id;seq fields that represent the order in which users are sorted by time, and From_url and To_url respectively on which page the user jumps from. For the first access record for a user on a platform, its from_url is null (NULL).

The interviewer says it needs to be done in two ways:
1. Implement a hive Generic UDF that speeds up the process described above, and give Hive SQL to implement the ETL process using this UDF

2. Realize the ETL process based on pure hive SQL, generate Allog table from Trlog table; (The result is a set of SQL)

I'll give you a Java-written RowNumber method.

public class RowNumber extends Org.apache.hadoop.hive.ql.exec.UDF {

private static int max_value = 50;
private static String comparedcolumn[] = new String[max_value];
private static int rownum = 1;

public int Evaluate (Object ... args) {
String columnvalue[] = new String[args.length];
for (int i = 0; i < args.length; i++)
Columnvalue[i] = args[i].tostring ();
if (rownum = 1)
{

for (int i = 0; i < columnvalue.length; i++)
Comparedcolumn[i] = Columnvalue[i];
}

for (int i = 0; i < columnvalue.length; i++)
{

if (!comparedcolumn[i].equals (Columnvalue[i]))
{
for (int j = 0; J < Columnvalue.length; J + +)
{
COMPAREDCOLUMN[J] = Columnvalue[j];
}
RowNum = 1;
return rownum++;
}
}
return rownum++;
}
}
The copy code packs this Java and compiles it into a jar package, such as Rownumber.jar. This you will always ~ ~ ~
And put it on the hive machine.
Execute the following two statements in the hive shell:
Add Jar/root/rownumber.jar;
#把RowNumber. Jar loaded into the classpath of hive
Create temporary function Row_number as ' rownumber ';
#在HIVE里创建一个新函数, called Row_number, the referenced class is the RowNumber in the Java code.
Copy Code
After the prompt is successful, execute the following hive SQL

#INSERT OVERWRITE Table Allog If you want to write to Allog tables, you can remove the annotations
SELECT T1.platform,t1.user_id,row_number (t1.user_id) Seq,t2.click_url From_url,t1.click_url TO_URL from
(select *,row_number (user_id) seq from trlog) t1
Left OUTER JOIN
(select *,row_number (user_id) seq from Trlog) T2
On t1.user_id = t2.user_id and t1.seq = t2.seq + 1;
Copy Code
The RN in the first question appears to be hive translate the SQL bug, you can remove the outer row_number, with T1 seq, you can find the problem.

Second question:

INSERT OVERWRITE TABLE Allog
SELECT T1.platform,t1.user_id,t1.seq,t2.click_url From_url,t1.click_url To_url from
(select Platform,user_id,click_time,click_url,count (1) Seq from (select A.*,b.click_time click_time1,b.click_url Click_url2 from Trlog a left outer join Trlog b in a.user_id = b.user_id) T WHERE click_time>=click_time1 GROUP by plat Form,user_id,click_time,click_url) T1
Left OUTER JOIN
(select Platform,user_id,click_time,click_url,count (1) Seq from (select A.*,b.click_time click_time1,b.click_url Click_url2 from Trlog a left outer join Trlog b in a.user_id = b.user_id) T WHERE click_time>=click_time1 GROUP by plat Form,user_id,click_time,click_url) T2
On t1.user_id = t2.user_id and t1.seq = t2.seq + 1;
Copy Code

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Hive Interview topic: Table about 2T, the table data conversion __ Business Intelligence (PENTAHO)

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Hive Interview topic: Table about 2T, the table data conversion __ Business Intelligence (PENTAHO)

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support