Common Case 1: Data skew caused by null values
Some user_id values in the log table are empty or 0. As a result, when user_id is used for hash bucket sharding, the log data from user_id to 0 or empty is allocated to a reduce, data skew;
For example, if the user does not log on to the user, the user_id in the log is null. When user_id is associated with the user_id in the User table, the data of the user_id in the log is grouped together, leading to data skew due to excessive null keys;
Solution:Solving data skew with Random Functions
Returns the null key.Into a string with a random number (as long as it does not match the real end_user_idThe format is the same)To distribute skewed data to different reduce, Because nullThe value cannot be associated with the user table, and the final result is not affected after processing.;
Case: end_user 0.2 billion million data, dimension table; trackinfo per day, scale by daily increase;
Original statement:
select u.id, t.url, t.track_time from end_user ujoin(select end_user_id, url, track_time from trackinfo where ds=‘2013-12-01‘) ton u.id=t.end_user_id limit 2;
Note:WhereThe condition should not be placed at the end, but in a subquery or onYou can filter out a batch of data in advance..
Adjusted:
select u.id, t.url, t.track_time from end_user ujoin(select case when end_user_id=‘null‘ or end_user_id is null then cast (concat(‘00000000‘,floor(rand()*1000000)) as bigint) else end_user_id end as end_user_id , url,track_time from trackinfo where ds=‘2013-07-21‘) ton u.id=t.end_user_id limit 2;
In the same cluster, the test shows that the latter method is about twice faster than the former method;
Case... When... Then statements case:
SELECT empno, ename, sal, CASE WHEN sal < 1000.0 THEN ‘low‘ WHEN sal >= 1000.0 AND sal < 1500.0 THEN ‘middle‘ WHEN sal >= 1500.0 AND sal < 2000.0 THEN ‘high‘ ELSE ‘very high‘ END AS salary FROM emp; 7369 SMITH 800.0 low7499 ALLEN 1600.0 high7521 WARD 1250.0 middle7566 JONES 2975.0 very high7654 MARTIN 1250.0 middle7698 BLAKE 2850.0 very high7782 CLARK 2450.0 very high7788 SCOTT 3000.0 very high7839 KING 5000.0 very high7844 TURNER 1500.0 high7876 ADAMS 1100.0 middle7900 JAMES 950.0 low7902 FORD 3000.0 very high7934 MILLER 1300.0 middle