Six common cases of data skew in hive syntax Optimization

Source: Internet
Author: User
Common Case 1: Data skew caused by null values

Some user_id values in the log table are empty or 0. As a result, when user_id is used for hash bucket sharding, the log data from user_id to 0 or empty is allocated to a reduce, data skew;

For example, if the user does not log on to the user, the user_id in the log is null. When user_id is associated with the user_id in the User table, the data of the user_id in the log is grouped together, leading to data skew due to excessive null keys;

 

Solution:Solving data skew with Random Functions

Returns the null key.Into a string with a random number (as long as it does not match the real end_user_idThe format is the same)To distribute skewed data to different reduce, Because nullThe value cannot be associated with the user table, and the final result is not affected after processing.;

Case: end_user 0.2 billion million data, dimension table; trackinfo per day, scale by daily increase;

Original statement:

select u.id,  t.url,  t.track_time   from  end_user ujoin(select end_user_id,  url,  track_time   from trackinfo where ds=‘2013-12-01‘) ton u.id=t.end_user_id limit 2;

Note:WhereThe condition should not be placed at the end, but in a subquery or onYou can filter out a batch of data in advance..

Adjusted:

select u.id, t.url, t.track_time from end_user ujoin(select   case when end_user_id=‘null‘ or end_user_id is null   then cast (concat(‘00000000‘,floor(rand()*1000000)) as bigint)   else end_user_id end as end_user_id ,  url,track_time from trackinfo where ds=‘2013-07-21‘) ton u.id=t.end_user_id limit 2;

In the same cluster, the test shows that the latter method is about twice faster than the former method;

 

Case... When... Then statements case:

SELECT empno, ename, sal,  CASE  WHEN sal <  1000.0 THEN ‘low‘  WHEN sal >= 1000.0 AND sal <  1500.0 THEN ‘middle‘  WHEN sal >= 1500.0 AND sal < 2000.0 THEN ‘high‘  ELSE ‘very high‘  END AS salary FROM emp;    7369    SMITH   800.0   low7499    ALLEN   1600.0  high7521    WARD    1250.0  middle7566    JONES   2975.0  very high7654    MARTIN  1250.0  middle7698    BLAKE   2850.0  very high7782    CLARK   2450.0  very high7788    SCOTT   3000.0  very high7839    KING    5000.0  very high7844    TURNER  1500.0  high7876    ADAMS   1100.0  middle7900    JAMES   950.0   low7902    FORD    3000.0  very high7934    MILLER  1300.0  middle

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.