How Pig Optimizes Joins on Skewed Data

Source: Internet
Author: User

1. Sample the input data.

2. Based on the sample, estimate the total number of records for each key and the total memory they occupy. The pig.skewedjoin.reduce.memusage property controls the fraction of a reducer's memory that may be used; from it, Pig calculates the number of reduce tasks each key requires and the total number of reduce tasks.

3. Store the result in a file, which serves as an index file. The format is: (swpv,), (swps,) (Note: each entry is <join key>, <min index of reducer>, <max index of reducer>)

4. Use a custom Partitioner that reads the index and distributes each key's records evenly across its reducers. For example, (swpv,) distributes the swpv records evenly across the reducers numbered 0-3.
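
The four steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Pig's actual implementation: the function and class names (build_index, SkewedPartitioner), the parameters (records_per_sample, bytes_per_record, reducer_mem_bytes), the round-robin strategy, and all the numbers in the example are assumptions made up for this sketch. Only the memusage ratio and the (key, min reducer, max reducer) index layout come from the description above.

```python
import math
import zlib
from collections import Counter


def build_index(sample_keys, records_per_sample, bytes_per_record,
                reducer_mem_bytes, memusage):
    """Steps 1-3: estimate per-key size from a sample and build the index.

    Returns (index, total_reducers), where index maps a skewed key to its
    (min reducer, max reducer) range. All parameters are hypothetical.
    """
    # Memory one reducer may spend on a single key (the memusage ratio).
    budget = reducer_mem_bytes * memusage
    index = {}
    next_reducer = 0
    for key, sampled in Counter(sample_keys).most_common():
        # Scale the sampled count up to an estimate for the full input.
        est_bytes = sampled * records_per_sample * bytes_per_record
        needed = math.ceil(est_bytes / budget)
        if needed > 1:  # only keys that overflow one reducer get an entry
            index[key] = (next_reducer, next_reducer + needed - 1)
            next_reducer += needed
    return index, max(next_reducer, 1)


class SkewedPartitioner:
    """Step 4: spread indexed keys round-robin over their reducer range."""

    def __init__(self, index, total_reducers):
        self.index = index
        self.total = total_reducers
        self.cursor = {}  # next round-robin offset for each skewed key

    def partition(self, key):
        if key in self.index:
            lo, hi = self.index[key]
            offset = self.cursor.get(key, 0)
            self.cursor[key] = (offset + 1) % (hi - lo + 1)
            return lo + offset
        # Keys without an index entry fall back to plain hash partitioning.
        return zlib.crc32(key.encode("utf-8")) % self.total


# Invented sample: 40 swpv records, 20 swps records, 1 rare record.
sample = ["swpv"] * 40 + ["swps"] * 20 + ["rare"]
index, total = build_index(sample, records_per_sample=100,
                           bytes_per_record=100,
                           reducer_mem_bytes=200_000, memusage=0.5)
partitioner = SkewedPartitioner(index, total)
swpv_targets = [partitioner.partition("swpv") for _ in range(8)]
```

With these invented numbers, swpv is estimated at four reducers' worth of data, so its index entry covers reducers 0-3 and successive swpv records cycle through them, matching the example in step 4. A key with no index entry, such as rare, is sent to a single reducer chosen by hash.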

