1. What is the LAG function?
2. What are the similarities between LEAD and LAG functions?
3. What do FIRST_VALUE and LAST_VALUE do?
This article continues the series with four more analytic functions: LAG, LEAD, FIRST_VALUE, and LAST_VALUE. Note: LAG and LEAD do not support the WINDOW clause.
Hive version: apache-hive-0.13.1
Data preparation:

cookie1,2015-04-10 10:00:02,url2
cookie1,2015-04-10 10:00:00,url1
cookie1,2015-04-10 10:03:04,1url3
cookie1,2015-04-10 10:50:05,url6
cookie1,2015-04-10 11:00:00,url7
cookie1,2015-04-10 10:10:00,url4
cookie1,2015-04-10 10:50:01,url5
cookie2,2015-04-10 10:00:02,url22
cookie2,2015-04-10 10:00:00,url11
cookie2,2015-04-10 10:03:04,1url33
cookie2,2015-04-10 10:50:05,url66
cookie2,2015-04-10 11:00:00,url77
cookie2,2015-04-10 10:10:00,url44
cookie2,2015-04-10 10:50:01,url55

CREATE EXTERNAL TABLE lxw1234 (
  cookieid string,
  createtime string, -- page access time
  url STRING -- accessed page
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/tmp/lxw11/';

hive> select * from lxw1234;
OK
cookie1  2015-04-10 10:00:02  url2
cookie1  2015-04-10 10:00:00  url1
cookie1  2015-04-10 10:03:04  1url3
cookie1  2015-04-10 10:50:05  url6
cookie1  2015-04-10 11:00:00  url7
cookie1  2015-04-10 10:10:00  url4
cookie1  2015-04-10 10:50:01  url5
cookie2  2015-04-10 10:00:02  url22
cookie2  2015-04-10 10:00:00  url11
cookie2  2015-04-10 10:03:04  1url33
cookie2  2015-04-10 10:50:05  url66
cookie2  2015-04-10 11:00:00  url77
cookie2  2015-04-10 10:10:00  url44
cookie2  2015-04-10 10:50:01  url55
LAG
LAG(col, n, DEFAULT) returns the value of col from the n-th row above the current row within the window.
The first parameter is the column name. The second parameter is the number of rows to look up (optional, default 1). The third parameter is the default value, used when there is no n-th row above the current row; if it is not specified, the result is NULL.
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM lxw1234;

cookieid  createtime           url     rn  last_1_time          last_2_time
-----------------------------------------------------------------------------------
cookie1   2015-04-10 10:00:00  url1    1   1970-01-01 00:00:00  NULL
cookie1   2015-04-10 10:00:02  url2    2   2015-04-10 10:00:00  NULL
cookie1   2015-04-10 10:03:04  1url3   3   2015-04-10 10:00:02  2015-04-10 10:00:00
cookie1   2015-04-10 10:10:00  url4    4   2015-04-10 10:03:04  2015-04-10 10:00:02
cookie1   2015-04-10 10:50:01  url5    5   2015-04-10 10:10:00  2015-04-10 10:03:04
cookie1   2015-04-10 10:50:05  url6    6   2015-04-10 10:50:01  2015-04-10 10:10:00
cookie1   2015-04-10 11:00:00  url7    7   2015-04-10 10:50:05  2015-04-10 10:50:01
cookie2   2015-04-10 10:00:00  url11   1   1970-01-01 00:00:00  NULL
cookie2   2015-04-10 10:00:02  url22   2   2015-04-10 10:00:00  NULL
cookie2   2015-04-10 10:03:04  1url33  3   2015-04-10 10:00:02  2015-04-10 10:00:00
cookie2   2015-04-10 10:10:00  url44   4   2015-04-10 10:03:04  2015-04-10 10:00:02
cookie2   2015-04-10 10:50:01  url55   5   2015-04-10 10:10:00  2015-04-10 10:03:04
cookie2   2015-04-10 10:50:05  url66   6   2015-04-10 10:50:01  2015-04-10 10:10:00
cookie2   2015-04-10 11:00:00  url77   7   2015-04-10 10:50:05  2015-04-10 10:50:01

last_1_time: the value 1 row up, with the default '1970-01-01 00:00:00'.
  cookie1, row 1: there is no row above, so the default 1970-01-01 00:00:00 is used
  cookie1, row 3: 1 row up is row 2, giving 2015-04-10 10:00:02
  cookie1, row 6: 1 row up is row 5, giving 2015-04-10 10:50:01
last_2_time: the value 2 rows up, with no default specified.
  cookie1, row 1: there is no row 2 rows up, so the result is NULL
  cookie1, row 2: there is no row 2 rows up, so the result is NULL
  cookie1, row 4: 2 rows up is row 2, giving 2015-04-10 10:00:02
  cookie1, row 7: 2 rows up is row 5, giving 2015-04-10 10:50:01
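A typical use of LAG on click data like this is measuring the gap between consecutive page views of the same cookie. The following is only a minimal sketch: it assumes every createtime is in 'yyyy-MM-dd HH:mm:ss' format so that Hive's unix_timestamp can parse it, and the aliases prev_time and seconds_since_prev are illustrative names that do not appear in the original example.

```sql
-- Sketch: seconds elapsed since the same cookie's previous page view.
-- No default is passed to LAG, so prev_time is NULL on each cookie's
-- first row and the computed difference is NULL there as well.
SELECT cookieid,
       createtime,
       url,
       prev_time,
       unix_timestamp(createtime) - unix_timestamp(prev_time) AS seconds_since_prev
FROM (
  SELECT cookieid,
         createtime,
         url,
         LAG(createtime, 1) OVER (PARTITION BY cookieid ORDER BY createtime) AS prev_time
  FROM lxw1234
) t;
```

With the sample data, the second row of cookie1 (10:00:02) follows the first (10:00:00) by 2 seconds. Computing prev_time in a subquery keeps the window expression from being repeated in the outer SELECT.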
LEAD
The opposite of LAG.
LEAD(col, n, DEFAULT) returns the value of col from the n-th row below the current row within the window.
The first parameter is the column name. The second parameter is the number of rows to look down (optional, default 1). The third parameter is the default value, used when there is no n-th row below the current row; if it is not specified, the result is NULL.
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM lxw1234;

cookieid  createtime           url     rn  next_1_time          next_2_time
-----------------------------------------------------------------------------------
cookie1   2015-04-10 10:00:00  url1    1   2015-04-10 10:00:02  2015-04-10 10:03:04
cookie1   2015-04-10 10:00:02  url2    2   2015-04-10 10:03:04  2015-04-10 10:10:00
cookie1   2015-04-10 10:03:04  1url3   3   2015-04-10 10:10:00  2015-04-10 10:50:01
cookie1   2015-04-10 10:10:00  url4    4   2015-04-10 10:50:01  2015-04-10 10:50:05
cookie1   2015-04-10 10:50:01  url5    5   2015-04-10 10:50:05  2015-04-10 11:00:00
cookie1   2015-04-10 10:50:05  url6    6   2015-04-10 11:00:00  NULL
cookie1   2015-04-10 11:00:00  url7    7   1970-01-01 00:00:00  NULL
cookie2   2015-04-10 10:00:00  url11   1   2015-04-10 10:00:02  2015-04-10 10:03:04
cookie2   2015-04-10 10:00:02  url22   2   2015-04-10 10:03:04  2015-04-10 10:10:00
cookie2   2015-04-10 10:03:04  1url33  3   2015-04-10 10:10:00  2015-04-10 10:50:01
cookie2   2015-04-10 10:10:00  url44   4   2015-04-10 10:50:01  2015-04-10 10:50:05
cookie2   2015-04-10 10:50:01  url55   5   2015-04-10 10:50:05  2015-04-10 11:00:00
cookie2   2015-04-10 10:50:05  url66   6   2015-04-10 11:00:00  NULL
cookie2   2015-04-10 11:00:00  url77   7   1970-01-01 00:00:00  NULL

The logic is the same as for LAG, except that LAG looks up (earlier rows) and LEAD looks down (later rows).
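LEAD works on any column, not just the sort column. As a small sketch, the query below looks up the page each cookie visited next. The alias next_url and the 'EXIT' marker are made up for this illustration; 'EXIT' is simply the default returned when a cookie has no later row.

```sql
-- Sketch: the page each cookie visited immediately after the current one.
-- 'EXIT' is a hypothetical placeholder for a cookie's last page view,
-- where no following row exists in the partition.
SELECT cookieid,
       createtime,
       url,
       LEAD(url, 1, 'EXIT') OVER (PARTITION BY cookieid ORDER BY createtime) AS next_url
FROM lxw1234;
```

With the sample data, the 11:00:00 rows of cookie1 and cookie2 are the last in their partitions, so they would receive the default instead of a URL.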
FIRST_VALUE
Returns the first value in the partition, after sorting, up to the current row.

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM lxw1234;

cookieid  createtime           url     rn  first1
---------------------------------------------------------
cookie1   2015-04-10 10:00:00  url1    1   url1
cookie1   2015-04-10 10:00:02  url2    2   url1
cookie1   2015-04-10 10:03:04  1url3   3   url1
cookie1   2015-04-10 10:10:00  url4    4   url1
cookie1   2015-04-10 10:50:01  url5    5   url1
cookie1   2015-04-10 10:50:05  url6    6   url1
cookie1   2015-04-10 11:00:00  url7    7   url1
cookie2   2015-04-10 10:00:00  url11   1   url11
cookie2   2015-04-10 10:00:02  url22   2   url11
cookie2   2015-04-10 10:03:04  1url33  3   url11
cookie2   2015-04-10 10:10:00  url44   4   url11
cookie2   2015-04-10 10:50:01  url55   5   url11
cookie2   2015-04-10 10:50:05  url66   6   url11
cookie2   2015-04-10 11:00:00  url77   7   url11
LAST_VALUE
Returns the last value in the partition, after sorting, up to the current row. Because the range ends at the current row, the result is the current row's own value.

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM lxw1234;

cookieid  createtime           url     rn  last1
-----------------------------------------------------------------
cookie1   2015-04-10 10:00:00  url1    1   url1
cookie1   2015-04-10 10:00:02  url2    2   url2
cookie1   2015-04-10 10:03:04  1url3   3   1url3
cookie1   2015-04-10 10:10:00  url4    4   url4
cookie1   2015-04-10 10:50:01  url5    5   url5
cookie1   2015-04-10 10:50:05  url6    6   url6
cookie1   2015-04-10 11:00:00  url7    7   url7
cookie2   2015-04-10 10:00:00  url11   1   url11
cookie2   2015-04-10 10:00:02  url22   2   url22
cookie2   2015-04-10 10:03:04  1url33  3   1url33
cookie2   2015-04-10 10:10:00  url44   4   url44
cookie2   2015-04-10 10:50:01  url55   5   url55
cookie2   2015-04-10 10:50:05  url66   6   url66
cookie2   2015-04-10 11:00:00  url77   7   url77
If ORDER BY is not specified, rows are ordered by default according to their offset in the file, which produces incorrect results.
SELECT cookieid,
createtime,
url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM lxw1234;

cookieid  createtime           url     first2
----------------------------------------------
cookie1   2015-04-10 10:00:02  url2    url2
cookie1   2015-04-10 10:00:00  url1    url2
cookie1   2015-04-10 10:03:04  1url3   url2
cookie1   2015-04-10 10:50:05  url6    url2
cookie1   2015-04-10 11:00:00  url7    url2
cookie1   2015-04-10 10:10:00  url4    url2
cookie1   2015-04-10 10:50:01  url5    url2
cookie2   2015-04-10 10:00:02  url22   url22
cookie2   2015-04-10 10:00:00  url11   url22
cookie2   2015-04-10 10:03:04  1url33  url22
cookie2   2015-04-10 10:50:05  url66   url22
cookie2   2015-04-10 11:00:00  url77   url22
cookie2   2015-04-10 10:10:00  url44   url22
cookie2   2015-04-10 10:50:01  url55   url22

SELECT cookieid,
createtime,
url,
LAST_VALUE(url) OVER(PARTITION BY cookieid) AS last2
FROM lxw1234;

cookieid  createtime           url     last2
----------------------------------------------
cookie1   2015-04-10 10:00:02  url2    url5
cookie1   2015-04-10 10:00:00  url1    url5
cookie1   2015-04-10 10:03:04  1url3   url5
cookie1   2015-04-10 10:50:05  url6    url5
cookie1   2015-04-10 11:00:00  url7    url5
cookie1   2015-04-10 10:10:00  url4    url5
cookie1   2015-04-10 10:50:01  url5    url5
cookie2   2015-04-10 10:00:02  url22   url55
cookie2   2015-04-10 10:00:00  url11   url55
cookie2   2015-04-10 10:03:04  1url33  url55
cookie2   2015-04-10 10:50:05  url66   url55
cookie2   2015-04-10 11:00:00  url77   url55
cookie2   2015-04-10 10:10:00  url44   url55
cookie2   2015-04-10 10:50:01  url55   url55
To obtain the last value of the whole partition after sorting, use FIRST_VALUE with a descending sort instead (last2 below):
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM lxw1234
ORDER BY cookieid,createtime;

cookieid  createtime           url     rn  last1   last2
-------------------------------------------------------------
cookie1   2015-04-10 10:00:00  url1    1   url1    url7
cookie1   2015-04-10 10:00:02  url2    2   url2    url7
cookie1   2015-04-10 10:03:04  1url3   3   1url3   url7
cookie1   2015-04-10 10:10:00  url4    4   url4    url7
cookie1   2015-04-10 10:50:01  url5    5   url5    url7
cookie1   2015-04-10 10:50:05  url6    6   url6    url7
cookie1   2015-04-10 11:00:00  url7    7   url7    url7
cookie2   2015-04-10 10:00:00  url11   1   url11   url77
cookie2   2015-04-10 10:00:02  url22   2   url22   url77
cookie2   2015-04-10 10:03:04  1url33  3   1url33  url77
cookie2   2015-04-10 10:10:00  url44   4   url44   url77
cookie2   2015-04-10 10:50:01  url55   5   url55   url77
cookie2   2015-04-10 10:50:05  url66   6   url66   url77
cookie2   2015-04-10 11:00:00  url77   7   url77   url77
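Another approach that may work is to give LAST_VALUE an explicit frame that spans the whole partition, so the range no longer stops at the current row. This is a sketch under an assumption: that your Hive version accepts a ROWS BETWEEN frame on LAST_VALUE. The Hive documentation describes the no-WINDOW-clause restriction for LAG and LEAD, but the behavior has not been checked here against apache-hive-0.13.1, so test it before relying on it. The alias last3 is an illustrative name.

```sql
-- Sketch: last url of each cookie, using a frame that covers the
-- entire partition instead of ending at the current row.
-- Assumes the Hive version in use accepts a frame on LAST_VALUE.
SELECT cookieid,
       createtime,
       url,
       LAST_VALUE(url) OVER (
         PARTITION BY cookieid
         ORDER BY createtime
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS last3
FROM lxw1234;
```

If the frame is accepted, last3 should match the last2 column above (url7 for cookie1, url77 for cookie2) without reversing the sort order.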