In the Outbrain case in the final chapter of the Hive Programming Guide, there is a simple implementation that accesses the network traffic, but the results are wrong by its query, and then rewrite one.
I. Questions raised (from the original text of the book)
To analyze network traffic, we often want to be able to measure heat based on a variety of criteria. One approach is to break down user behavior into sessions, where a session represents a single "use" of a series of operations. A user can visit a website several times a day or a few days in one months, but each visit must be different.
So, what is a conversation? One definition is that a series of page activities that are not more than 30 minutes apart is a conversation. That is, if you go to your 1th page, wait 5 minutes, and then go to the 2nd page, then this is the same session. Wait 25 minutes and then go to page 3rd, which is still the same session. Wait 1 minutes to jump to page 4th, this session will be broken, this will not be the 4th access page, but the first page in the 2nd session.
Once we get the interrupt information, we can look at the session's attribute information to see what happened and cause the interruption. The usual way to do this is to compare the linked pages with the length of the session.
At first glance, this seems to be a perfect iterative process. For each page, keep the count down until you find the 1th page. However, Hive does not support iterations. However, this problem can still be solved. This process can be divided into 4 stages.
1. Identify which page views are the initial or "origin" pages of the session.
2. For each page, divide it into the correct source page.
3. Aggregate all page views to each source page.
4. Mark each source page, and then calculate the heat for each session.
This will result in a table where each row represents a full session, and then the user can query the information they want to know.
Second, the realization process
1. Settings
First define the table Session_test:
CREATE TABLE session_test ( st_user_id string, st_pageview_id string, st_page_url string, st_ Referrer_url STRING, st_timestamp DOUBLE) ROW FORMAT delimited fields TERMINATED by ', ';
Then prepare the CSV file for the test data session_test.txt
user_1,2184914255, Http://baike.baidu.com/link?url= stntagubmqjuhkt4kf-p1wxr0rknyun1thcwlblz4ewqwxwly5kgtftbbyvt8bcspvpj1bqns9modytcegvgjq#3,http:// ssx4501.blog.sohu.com/321517032.html,1459070426.712user_1,2489633785,http://blog.sohu.com/,http:// news.sohu.com/20160327/n442361110.shtml,1459070327.691user_1,9638816121,http://www.sohu.com/,https:// bj.122.gov.cn/views/inquiry.html,1459070227.725user_1,7730804743,http://www.bjjtgl.gov.cn/wanchengyemian2/ hlwptzcts.html,http://www.bjjtgl.gov.cn/,1459070127.563user_1,6135947520,http://mail.sohu.com/bapp/36/main# Maillist_-2,http://baike.baidu.com/link?url= Stntagubmqjuhkt4kf-p1wxr0rknyun1thcwlblz4ewqwxwly5kgtftbbyvt8bcspvpj1bqns9modytcegvgjq#3,1459070027.165user_ 1,2005762692,http://www.56.com/u63/v_mtixotazmtk2.html,http://www.56.com/u63/v_mtixotazmtk2.html, 1459069427.065user_1,9885965958,http://v.baidu.com/movie/32210.htm?fr=open_bdps&video_uri= movie.basic.002707.1372413037.2,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=%e8%94%a1%e6%9d%8e%e4%bd%9b%e5%b0%8f%e5%ad%90&rsv_pq=da0a1d6e001a2f26&rsv_t= ae7cf77gafjqvijzzaoitr6qzqezhlos5jaoillrgrsf82o0ewqe3xyjtlq&rsv_enter=1&rsv_n=2&rsv_sug3= 1,1459060427.743user_1,7781491477,https://www.baidu.com/,http://www.youku.com/,1459060227.538user_1,4874646952 , HTTP://STSWORDMAN.CNBLOGS.COM/ARCHIVE/2006/06/12/423910.HTML,HTTP://STSWORDMAN.CNBLOGS.COM/ARCHIVE/2006/06/12 /423910.html,1459060127.538user_1,4584871291,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx =1&tn=baidu&wd=referrer_url&oq=mysql%20%e5%a4%9a%e8%a1%8cuuid&rsv_pq=b37dd6bf000f9361&rsv_ t=de34o52vuquhuv5xsbtpc8h8tcy7qosiywdwioyh04hjx%2fiyz7pbogibmla&rsv_enter=1&inputt=5220&rsv_sug3= 33&rsv_sug1=23&rsv_sug7=000&rsv_n=2&bs=mysql%20%e5%a4%9a%e8%a1%8cuuid,https://www.baidu.com/s? Ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20%e5%a4%9a%e8%a1%8cuuid&oq=mysql%20 %e9%80%90%e8%a1%8cuuid&aMp;rsv_pq=bd5a2a29000e7681&rsv_t=0eb2z5jt3mubueh%2b93%2fdjmios9knap85eroqirffsyiyoh4csfnownrh%2fcw&rsv _enter=1&inputt=5220&rsv_sug3=29&rsv_sug1=20&rsv_sug7=100&rsv_sug2=0&rsv_sug4= 5943,1459060027.378user_1,7240734452,http://zhidao.baidu.com/link?url=_cw4-_ vo4xvi1lhfpusdo7rrpeydthhzietaowhr52h_g1flv8pitwbzncdhqtgaxjsrhiox_dttbwltmzs-8hfydpxg_ijunbjryjasway,http:// Zhidao.baidu.com/link?url=_cw4-_vo4xvi1lhfpusdo7rrpeydthhzietaowhr52h_g1flv8pitwbzncdhqtgaxjsrhiox_ Dttbwltmzs-8hfydpxg_ijunbjryjasway,1459059427.334user_1,6310115480,https://www.baidu.com/s?ie=utf-8&f=8 &rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20%e9%80%90%e8%a1%8cuuid&oq=mysql%20uuid&rsv_pq= db45a04e000eb323&rsv_t=a2cexgbrnrvdtssubhpdhgtu0%2bmpbceyugeqfhh49o3stpqfbgftvqmx3lm&rsv_enter=1& inputt=23761&rsv_sug3=24&rsv_sug1=17&rsv_sug7=100&rsv_sug2=0&rsv_sug4=25872,https:// Www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&aMp;wd=mysql%20uuid&oq=%e6%97%b6%e9%97%b4%e6%88%b3%20%e6%af%ab%e7%a7%92&rsv_pq=bd2966c9000d954a&rsv _t=2591i8xuliimerz%2fcaouyilpwymhp4r4z%2bjv5%2f4%2fk3jbvshjvsil0kdx09w&rsv_enter=1&rsv_sug3=16&rsv _sug1=11&rsv_sug7=100&bs=%e6%97%b6%e9%97%b4%e6%88%b3%20%e6%af%ab%e7%a7%92,1459057527.948user_ 1,9725586547,http://bbs.aardio.com/forum.php/forum.php?mod=viewthread&tid=8236,http://blog.csdn.net/ runming918/article/details/7231259,1459057427.004user_1,1995107575,http://blog.csdn.net/superhosts/article/ details/26054997,http://www.xue163.com/19990/519968/519968758926117085.html,1459055428.541user_1,4376767486, Http://www.xue163.com/36678/22365/223657734/,http://v.youku.com/v_show/id_XMTUxMTg4MjQ3Ng==_ev_4.html?from= y1.3-idx-uhome-1519-20887.205805-205902.7-1,1459053428.252user_2,2202616558, Http://baike.baidu.com/link?url= stntagubmqjuhkt4kf-p1wxr0rknyun1thcwlblz4ewqwxwly5kgtftbbyvt8bcspvpj1bqns9modytcegvgjq#3,http:// ssx4501.blog.sohu.com/321517032.html,1459070426.913user_2,2458809722,http://blog.sohu.com/,http://news.sohu.com/20160327/n442361110.shtml,1459070327.073user_ 2,8893838191,http://www.sohu.com/,https://bj.122.gov.cn/views/inquiry.html,1459070227.167user_2,2853640637, Http://www.bjjtgl.gov.cn/wanchengyemian2/hlwptzcts.html,http://www.bjjtgl.gov.cn/,1459070127.527user_ 2,1174630884,http://mail.sohu.com/bapp/36/main#maillist_-2,http://baike.baidu.com/link?url= Stntagubmqjuhkt4kf-p1wxr0rknyun1thcwlblz4ewqwxwly5kgtftbbyvt8bcspvpj1bqns9modytcegvgjq#3,1459070027.515user_ 2,5635223546,http://www.56.com/u63/v_mtixotazmtk2.html,http://www.56.com/u63/v_mtixotazmtk2.html, 1459069427.031user_2,6497571917,http://v.baidu.com/movie/32210.htm?fr=open_bdps&video_uri= movie.basic.002707.1372413037.2,https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn= baidu&wd=%e8%94%a1%e6%9d%8e%e4%bd%9b%e5%b0%8f%e5%ad%90&rsv_pq=da0a1d6e001a2f26&rsv_t= Ae7cf77gafjqvijzzaoitr6qzqezhlos5jaoillrgrsf82o0ewqe3xyjtlq&rsv_enter=1&rsv_n=2&aMp;rsv_sug3=1,1459060427.961user_2,9736857005,https://www.baidu.com/,http://www.youku.com/,1459060227.972user_ 2,9522138344,http://stswordman.cnblogs.com/archive/2006/06/12/423910.html,http://stswordman.cnblogs.com/ archive/2006/06/12/423910.html,1459060127.819user_2,3288917999,https://www.baidu.com/s?ie=utf-8&f=8& rsv_bp=1&rsv_idx=1&tn=baidu&wd=referrer_url&oq=mysql%20%e5%a4%9a%e8%a1%8cuuid&rsv_pq= b37dd6bf000f9361&rsv_t=de34o52vuquhuv5xsbtpc8h8tcy7qosiywdwioyh04hjx%2fiyz7pbogibmla&rsv_enter=1& Inputt=5220&rsv_sug3=33&rsv_sug1=23&rsv_sug7=000&rsv_n=2&bs=mysql%20%e5%a4%9a%e8%a1%8cuuid , https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20%E5%A4%9A%E8% a1%8cuuid&oq=mysql%20%e9%80%90%e8%a1%8cuuid&rsv_pq=bd5a2a29000e7681&rsv_t=0eb2z5jt3mubueh%2b93% 2fdjmios9knap85eroqirffsyiyoh4csfnownrh%2fcw&rsv_enter=1&inputt=5220&rsv_sug3=29&rsv_sug1=20 &rsv_sug7=100&rsv_sug2=0&rsv_sug4=5943,1459060027.815user_2,3790199378,http://zhidao.baidu.com/link?url=_cw4-_ vo4xvi1lhfpusdo7rrpeydthhzietaowhr52h_g1flv8pitwbzncdhqtgaxjsrhiox_dttbwltmzs-8hfydpxg_ijunbjryjasway,http:// Zhidao.baidu.com/link?url=_cw4-_vo4xvi1lhfpusdo7rrpeydthhzietaowhr52h_g1flv8pitwbzncdhqtgaxjsrhiox_ Dttbwltmzs-8hfydpxg_ijunbjryjasway,1459059427.106user_2,6413528694,https://www.baidu.com/s?ie=utf-8&f=8 &rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20%e9%80%90%e8%a1%8cuuid&oq=mysql%20uuid&rsv_pq= db45a04e000eb323&rsv_t=a2cexgbrnrvdtssubhpdhgtu0%2bmpbceyugeqfhh49o3stpqfbgftvqmx3lm&rsv_enter=1& inputt=23761&rsv_sug3=24&rsv_sug1=17&rsv_sug7=100&rsv_sug2=0&rsv_sug4=25872,https:// www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=mysql%20uuid&oq=%E6%97%B6% e9%97%b4%e6%88%b3%20%e6%af%ab%e7%a7%92&rsv_pq=bd2966c9000d954a&rsv_t=2591i8xuliimerz% 2fcaouyilpwymhp4r4z%2bjv5%2f4%2fk3jbvshjvsil0kdx09w&rsv_enter=1&rsv_sug3=16&rsv_sug1=11&rsv_sug7=100&bs=%e6%97%b6%e9%97%b4%e6%88%b3%20%e6%af%ab%e7%a7% 92,1459057527.689user_2,6745253283,http://bbs.aardio.com/forum.php/forum.php?mod=viewthread&tid=8236,http ://blog.csdn.net/runming918/article/details/7231259,1459057427.126user_2,7438056632,http://blog.csdn.net/ Superhosts/article/details/26054997,http://www.xue163.com/19990/519968/519968758926117085.html, 1459055428.198user_2,8615798962,http://www.xue163.com/36678/22365/223657734/,http://v.youku.com/v_show/id_ xmtuxmtg4mjq3ng==_ev_4.html?from=y1.3-idx-uhome-1519-20887.205805-205902.7-1,1459053428.636
Finally import the test data into the table
LOAD DATA LOCAL inpath '/home/grid/session_test.txt ' into TABLE session_test;
2. Find the Start page for each session
CREATE TABLE Sessionization_step_one_originsas SELECT t1.st_user_id as ssoo_user_id, t1.st_pageview_id ssoo_pageview_id, T1.st_timestamp ssoo_timestamp from Session_test T1 left JOIN (SELECT DISTINCT st_pageview_id From (SELECT b.st_pageview_id st_pageview_id, IF (b. St_timestamp > A.st_timestamp and B.st_timestamp-a.st_timestamp < 1800, 1, 0) c_nonorigin_flags from ses Sion_test a INNER JOIN session_test b on a.st_user_id = b.st_user_id) T1 where c_nonorigin_flags = 1) t2 on t1.st_pageview_id = t2.st_pageview_id where T2.st_pagevie W_ID is NULL;
The browse ID of the non-starting page is queried in the most inner subquery, because Hive does not currently support a join that is not equal, so it uses an if judgment to mark a non-starter page. The outermost query uses an outer join to get the session start page.
The Query Sessionization_step_one_origins table is shown in record 1
SELECT ssoo_user_id, ssoo_pageview_id, CAST (Ssoo_timestamp as DECIMAL (3)) from Sessionization_step_ One_originsorder by ssoo_user_id, Ssoo_timestamp;
Figure 1
3. Assign PV to the start page
CREATE TABLE Sessionization_step_two_origin_identificationas SELECT st_user_id sstoi_user_id, st_pageview_id SS toi_pageview_id, ssoo_pageview_id sstoi_origin_pageview_id from (SELECT t2.st_user_id, T2.st _pageview_id, t1.ssoo_pageview_id, IF (t2.st_timestamp >= t1 . Current_ssoo_timestamp and T2.st_timestamp < T1.lead_ssoo_timestamp, 1, 0) C_nonorigin_flags from (SELECT ssoo_user_id, SSO o_pageview_id, NVL (Lead_ssoo_timestamp, 9999999999) lead_ssoo_timest AMP, Current_ssoo_timestamp from (SELECT ssoo_user_id, ssoo_pageview_id, lead (Ssoo_timestamp, 1) over (PARTITION by ssoo_user_id Orde R by Ssoo_timestamp) Lead_ssoo_timestamp, ssoo_timest AMP Current_ssoo_timestamp from Sessionization_step_one_origins) t0) T1 INNER JOIN (SELECT st_user_id, st_pageview_id, st_timestamp from Session_test) T2 On t1.ssoo_user_id = t2.st_user_id) T WHERE t.c_nonorigin_flags = 1;
The most inner subquery uses the lead window parsing function to get the next timestamp, and the record between the current timestamp and the next timestamp is the PV corresponding to the current session start page. Because the start page corresponds to the PV containing the start page itself, the if interval to determine the condition is left closed right open.
The Query sessionization_step_two_origin_identification table is shown in record 2
SELECT sstoi_user_id, sstoi_pageview_id, sstoi_origin_pageview_id from sessionization_step_two_origin_ identification;
Figure 2
4. Aggregate by Start Page
CREATE TABLE sessionization_step_three_origin_aggregation asselect sstoi_user_id sstoa_user_id, sstoi_ origin_pageview_id sstoa_origin_pageview_id, count (1) sstoa_pageview_countfrom sessionization_step_two_ Origin_identificationgroup by sstoi_user_id, sstoi_origin_pageview_id;
The Query sessionization_step_three_origin_aggregation table is shown in record 3
SELECT sstoa_user_id, sstoa_origin_pageview_id, sstoa_pageview_count from sessionization_step_three_origin_ Aggregationorder by sstoa_user_id, Sstoa_pageview_count;
Figure 35. Aggregation by Start Page properties
CREATE TABLE Sessionization_step_four_qualitative_labelingas SELECT a.a_user_id as ssfql_user_id, A.A_ORIGIN_PA geview_id as ssfql_origin_pageview_id, B.b_timestamp as Ssfql_timestamp, B.b_page_url as Ssfql_page_url, B.b_referrer_url as Ssfql_referrer_url, A.a_pageview_count as Ssqfl_pageview_count from (SELECT SST oa_user_id as a_user_id, sstoa_origin_pageview_id as a_origin_pageview_id, Sstoa_pagevie W_count as A_pageview_count from Sessionization_step_three_origin_aggregation) a JOIN (Selec T st_user_id as b_user_id, st_pageview_id as b_pageview_id, St_page_url as B_page_url, St_referrer_url as B_referrer_url, St_timestamp as B_timestamp from session_t EST) b on a.a_user_id = b.b_user_id and a.a_origin_pageview_id = b.b_pageview_id;
The Query sessionization_step_four_qualitative_labeling table is shown in record 4
SELECT ssfql_user_id, ssfql_origin_pageview_id, Ssfql_timestamp, Ssfql_page_url, Ssfql_ref Errer_url, Ssqfl_pageview_count from Sessionization_step_four_qualitative_labelingorder by ssfql_user_id, SSFQL _timestamp;
Figure 46. Measure the Heat
SELECT Parse_url (Ssfql_referrer_url, ' HOST ') as Referrer_host, COUNT (1) as Session_count, AVG (Ssqfl_pageview _count) as Avg_pvs_per_session, SUM (ssqfl_pageview_count)/count (1) as Weighted_avg_pvs_per_session, MAX ( Ssqfl_pageview_count) as Max_pvs_per_session, MIN (Ssqfl_pageview_count) as Min_pvs_per_session, count ( DISTINCT ssfql_user_id) as unique_users from Sessionization_step_four_qualitative_labelinggroup by PARSE_URL ( Ssfql_referrer_url, ' HOST ');
The results of Query 5 show
Figure 5
Using hive queries to conversational access network traffic