Data skew for hive---large table join

Source: Internet
Author: User

data skew for large table joinsThe development of code in the MapReduce programming model takes into account data skew, as is the hive code. The reasons for data skew include the following two points: 1. The number of map output keys is very small, resulting in reduced reduce to single-machine operation. 2. Map output Key distribution is uneven, a small number of keys corresponding to a large number of value, resulting in the reduce terminal single bottleneck. In hive we used mapjoin to solve the problem of data skew, and one of the tables (full volume) was distributed to all map ends for join, thus avoiding reduce. This requires that the distributed table be fully loaded into memory. In extreme cases, the tables on both sides of the join are large tables and cannot be used with Mapjoin. This problem is the most intractable, there are currently known to solve two kinds of ideas: 1. If this is the case 1, consider going to the first table in the join, and filtering the useless information with this result. This usually translates one of the large tables into a small table, and then uses the Mapjoin. An example is an ad-serving effect analysis, such as populating the information in the Advertiser information table I into the ad Exposure log table W, using a Drop ID association. Because the actual number of advertisers is very small (but the User Information table I is very large), so you can consider in the W table to re-query all the actual Advertiser ID list, as a join filter table I, this result must be a small table, you can use Mapjoin. 2. If this is the case 2, consider slicing a table in join as multiple slices in order to load all the tiles into memory and then use multiple mapjoin to get the results. An example is a commodity browsing log analysis, such as populating the information in the Commodity Information table I into the Product Browsing log table W, using the Commodity ID Association. However, some hot goods are viewed in large quantities, resulting in data skew. For example, the following statement implements a Innerjoin logic that splits the commodity information table into 2 tables: SELECT * FROM (select W.id, W.time, W.amount, I1.name, I1.loc, I1.catfrom W left Oute R join I sampletable (1 out of 2 in ID) i1) union All (select W.id, W.time, W.amount, I2.name, I2.loc, I2.catfrom W left Oute R join I sampletable (1 out of 2 on ID) i2);
The following statement implements the left OUTER join logic: Select T1.id, T1.time, T1.amount, Coalease (T1.name,t2.name), Coalease (T1.loc, T2.loc), C Oalease (T1.cat, T2.cat) from (select W.id, W.time,w.amount, I1.name, I1.loc, i1.cat from W left outer join Isample Table (1 out of 2 in ID) i1) T1 left outer join I sampletable (2 out of 2 on id) T2;
The above statement usesHiveof theSample Tableattribute to slice the table.

For more highlights, please follow: http://bbs.superwu.cn

Focus on the two-dimensional code of Superman Academy:

Data skew for hive---large table join

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.