How Pig Optimizes Skewed Joins
1. Sample the input data.
2. From the sample, estimate the total number of records for each key and the memory they would occupy. The pig.skewedjoin.reduce.memusage property controls the fraction of reducer memory available for the join; using it, Pig calculates how many reduce tasks each key requires and the total number of reduce tasks.
3. Store the result in a file that serves as an index. The format is: (swpv,), (swps,) (Note: each entry is <join key>, <min index of reducer>, <max index of reducer>)
4. A custom Partitioner reads the index and distributes each key's records evenly across the reducers assigned to it. For example, the entry (swpv,) spreads the records for swpv evenly across the reducers numbered 0-3.
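The four steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Pig's actual implementation: the function build_index, the class SkewedPartitioner, and the parameters sample_every, bytes_per_record and heap_bytes are all hypothetical names, and the per-record size and heap size are assumed inputs. Only the memusage ratio corresponds to a real setting (pig.skewedjoin.reduce.memusage).

```python
import zlib
from collections import Counter


def build_index(sample_keys, sample_every, bytes_per_record, heap_bytes, memusage):
    """Steps 1-3: turn a key sample into {key: (min_reducer, max_reducer)}.

    sample_every is an assumed sampling interval (1 record kept out of N).
    """
    budget = int(heap_bytes * memusage)  # memory one reducer may use for the join
    index = {}
    next_reducer = 0
    # Visit the most frequent keys first so they get the lowest reducer numbers.
    for key, seen in Counter(sample_keys).most_common():
        est_bytes = seen * sample_every * bytes_per_record  # scale the sample up
        needed = (est_bytes + budget - 1) // budget          # ceiling division
        if needed > 1:  # only keys too large for one reducer get an index entry
            index[key] = (next_reducer, next_reducer + needed - 1)
            next_reducer += needed
    return index


class SkewedPartitioner:
    """Step 4: spread a skewed key round-robin over its reducer range."""

    def __init__(self, index, num_reducers):
        self.index = index
        self.num_reducers = num_reducers
        self.cursor = {}  # per-key position inside its reducer range

    def partition(self, key):
        if key not in self.index:
            # Ordinary keys fall back to plain hash partitioning.
            return zlib.crc32(key.encode("utf-8")) % self.num_reducers
        lo, hi = self.index[key]
        offset = self.cursor.get(key, 0)
        self.cursor[key] = (offset + 1) % (hi - lo + 1)
        return lo + offset


# Made-up sample: swpv and swps are heavily skewed, x is not.
sample = ["swpv"] * 40 + ["swps"] * 20 + ["x"] * 5
index = build_index(sample, sample_every=100, bytes_per_record=100,
                    heap_bytes=200_000, memusage=0.5)
partitioner = SkewedPartitioner(index, num_reducers=6)
```

With these made-up numbers, swpv is estimated to need four reducers and swps two, so successive swpv records cycle through reducers 0, 1, 2, 3 while the non-skewed key x is hashed as usual.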