Today the Hive SQL (HSQL) I wrote in Hive was yet another round of PV and UV counting. On top of the per-category numbers, such as PV and UV for the PC side and PV and UV for the mobile side, I also had to compute the overall PV and UV. The total PV is easy: it is simply PC plus mobile. The UV, however, has to be deduplicated again, because the same user can appear on both sides. Every time I run into this it bothers me that I cannot get everything out of a single HSQL statement quickly (maybe I am a little obsessive), so I tried several different ways of writing it during work hours and compared their efficiency.
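To make the point about UV concrete, here is a minimal Python sketch. The sample rows and the helper name `pv_uv` are made up for illustration and are not taken from the real table; it only shows that PV adds up across platforms while UV does not.

```python
# Hypothetical page-view log: one (uid, type) pair per request.
views = [
    ("u1", "pc"), ("u1", "pc"), ("u1", "mobile"),
    ("u2", "pc"), ("u3", "mobile"),
]

def pv_uv(rows):
    """Return (pv, uv): the number of rows and the number of distinct uids."""
    return len(rows), len({uid for uid, _ in rows})

pc = pv_uv([r for r in views if r[1] == "pc"])
mobile = pv_uv([r for r in views if r[1] == "mobile"])
total = pv_uv(views)

print("pc", pc, "mobile", mobile, "total", total)
# PV is additive across platforms; UV is not, because u1 visited on both.
print(pc[0] + mobile[0] == total[0], pc[1] + mobile[1] == total[1])
```

User u1 is counted once in the PC UV and once in the mobile UV, so adding the two would count that user twice in the total.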
Okay, let's get to the code.
1. I used to compute the total PV and UV together with the PV and UV of each category like this:

    SELECT a.type, a.pv, a.uv
    FROM (
        SELECT type, count(1) AS pv, count(DISTINCT uid) AS uv
        FROM t1
        WHERE dt = '20140901' AND req_url LIKE 'mbloglist?domain=100808&ajwvr=6%'
        GROUP BY type
        UNION ALL
        SELECT 'all' AS type, count(1) AS pv, count(DISTINCT uid) AS uv
        FROM t1
        WHERE dt = '20140901' AND req_url LIKE 'mbloglist?domain=100808&ajwvr=6%'
    ) a

Note: although DISTINCT is easy to write, it is really inefficient. My advice is to avoid it.

2. So the statement can be changed to group by uid first and then count the uid rows:

    SELECT a.type, sum(a.pv), count(a.uid)
    FROM (
        SELECT type, count(1) AS pv, uid
        FROM t1
        WHERE dt = '20140901' AND req_url LIKE 'mbloglist?domain=100808&ajwvr=6%'
        GROUP BY uid, type
        UNION ALL
        SELECT 'all' AS type, count(1) AS pv, uid
        FROM t1
        WHERE dt = '20140901' AND req_url LIKE 'mbloglist?domain=100808&ajwvr=6%'
        GROUP BY uid
    ) a
    GROUP BY a.type

This is more efficient and I used it for a while, but I was still not happy with it. I always felt the UNION ALL was not being put to good use.

3. Today I realized that the GROUP BY should not be written inside the UNION ALL branches: it seriously hurts efficiency, and the statement above also launches a large number of jobs. So I changed it decisively. Each branch now only emits one row per request with pv = 1, and the grouping happens once, outside the UNION ALL:

    SELECT b.type, sum(b.pv), count(b.uid)
    FROM (
        SELECT a.type, sum(a.pv) AS pv, a.uid
        FROM (
            SELECT type, 1 AS pv, uid
            FROM t1
            WHERE dt = '20140901' AND req_url LIKE 'mbloglist?domain=100808&ajwvr=6%'
            UNION ALL
            SELECT 'all' AS type, 1 AS pv, uid
            FROM t1
            WHERE dt = '20140901' AND req_url LIKE 'mbloglist?domain=100808&ajwvr=6%'
        ) a
        GROUP BY a.uid, a.type
    ) b
    GROUP BY b.type

I tested it, and the efficiency is really good.
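As a sanity check that the three ways of writing the query return the same numbers, here is a small sketch that runs simplified versions of them with Python's built-in sqlite3 module. This is an assumption-laden stand-in, not Hive: the sample rows are made up, the req_url filter is dropped, and an ORDER BY is added so the results can be compared. It says nothing about performance or job counts, only about the results being equal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (uid TEXT, type TEXT, dt TEXT)")
# Hypothetical sample data; the last row falls outside the date filter.
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", [
    ("u1", "pc", "20140901"), ("u1", "pc", "20140901"),
    ("u1", "mobile", "20140901"), ("u2", "pc", "20140901"),
    ("u3", "mobile", "20140901"), ("u9", "pc", "20140902"),
])

W = "WHERE dt = '20140901'"

# 1. count(DISTINCT uid) in both branches.
q1 = f"""
SELECT a.type, a.pv, a.uv FROM (
    SELECT type, count(1) AS pv, count(DISTINCT uid) AS uv FROM t1 {W} GROUP BY type
    UNION ALL
    SELECT 'all' AS type, count(1) AS pv, count(DISTINCT uid) AS uv FROM t1 {W}
) a ORDER BY a.type"""

# 2. GROUP BY uid inside each branch, then count the uid rows.
q2 = f"""
SELECT a.type, sum(a.pv) AS pv, count(a.uid) AS uv FROM (
    SELECT type, count(1) AS pv, uid FROM t1 {W} GROUP BY uid, type
    UNION ALL
    SELECT 'all' AS type, count(1) AS pv, uid FROM t1 {W} GROUP BY uid
) a GROUP BY a.type ORDER BY a.type"""

# 3. Plain rows in the branches, one GROUP BY outside the UNION ALL.
q3 = f"""
SELECT b.type, sum(b.pv) AS pv, count(b.uid) AS uv FROM (
    SELECT a.type, sum(a.pv) AS pv, a.uid FROM (
        SELECT type, 1 AS pv, uid FROM t1 {W}
        UNION ALL
        SELECT 'all' AS type, 1 AS pv, uid FROM t1 {W}
    ) a GROUP BY a.uid, a.type
) b GROUP BY b.type ORDER BY b.type"""

results = [conn.execute(q).fetchall() for q in (q1, q2, q3)]
for rows in results:
    print(rows)
```

All three queries should print the same list of (type, pv, uv) rows, with the 'all' row carrying the deduplicated UV.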