How Hive Copes with Data Skew

Source: Internet
Author: User

Data Skew

Concept: Data skew means that when a MapReduce job runs, most of the reduce tasks finish normally, but one or a few reduce tasks run very slowly and stretch the run time of the whole job. This happens because one key has far more records than the others (sometimes hundreds or thousands of times more), so the reduce task that receives this key processes much more data than the other tasks and finishes late.
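The effect described above can be reproduced with a small simulation. This is only an illustrative sketch: the number of reduce tasks, the toy partition function, and the dataset are all assumptions made for the example, not Hive's actual partitioner.

```python
from collections import Counter

NUM_REDUCERS = 4  # hypothetical number of reduce tasks


def reducer_for(key: str) -> int:
    """Toy stand-in for a hash partitioner: deterministic byte sum modulo
    the number of reduce tasks. Every row with the same key goes to the
    same reduce task, as in a real shuffle."""
    return sum(key.encode("utf-8")) % NUM_REDUCERS


# Hypothetical dataset: one hot key with 1000 times the rows of any other key.
records = ["hot"] * 10_000 + [f"k{i}" for i in range(10) for _ in range(10)]

# Count how many rows each reduce task receives.
load = Counter(reducer_for(key) for key in records)
for r in range(NUM_REDUCERS):
    print(f"reduce task {r}: {load[r]} rows")
```

Whichever reduce task receives the hot key ends up with almost all of the rows, while the others are nearly idle. That one task determines when the job finishes.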

Operations where skew appears:

1. Join where one of the tables is small but its keys are concentrated: the data distributed to one or a few reduce tasks is far above the average.
2. Join between two large tables where the join column contains many null or 0 values: all of these rows are handled by a single reduce task, which is very slow.
3. group by on a dimension with few distinct values, where one value has too many rows: the reduce task that handles this value takes a long time.
4. count distinct where one particular value occurs too often: the reduce task that handles this value takes a long time.

Causes:

1. Uneven key distribution
2. Characteristics of the business data itself
3. Poor table design
4. Some SQL statements are inherently prone to data skew

Solution:
1. Parameter adjustment
hive.map.aggr=true
Enables partial aggregation on the map side, equivalent to a Combiner.
hive.groupby.skewindata=true

Enables load balancing when the data is skewed. When this is set to true, the generated query plan has two MR jobs. In the first MR job, the map output is distributed randomly across the reduce tasks, and each reduce task performs a partial aggregation and outputs its result. Rows with the same group by key may therefore go to different reduce tasks, which balances the load. The second MR job then distributes the pre-aggregated results to the reduce tasks by group by key (this guarantees that the same group by key always reaches the same reduce task) and completes the final aggregation.
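The two-job plan described for hive.groupby.skewindata can be sketched as a small simulation. This is a simplified illustration of the idea, not Hive's implementation: the function names, the number of reduce tasks, and the toy partition function are assumptions made for the example.

```python
import random
from collections import Counter

NUM_REDUCERS = 4  # hypothetical number of reduce tasks


def partition_by_key(key: str) -> int:
    """Toy deterministic partitioner (byte sum modulo task count)."""
    return sum(key.encode("utf-8")) % NUM_REDUCERS


def two_stage_count(keys, rng):
    """Count rows per key in two stages, mimicking the two MR jobs."""
    # Job 1: rows are distributed randomly, and each reduce task
    # computes a partial count for whatever keys it happens to receive.
    partials = [Counter() for _ in range(NUM_REDUCERS)]
    for key in keys:
        partials[rng.randrange(NUM_REDUCERS)][key] += 1
    stage1_load = [sum(p.values()) for p in partials]

    # Job 2: the partial results are distributed by key, so each key's
    # partial counts meet on one reduce task for the final aggregation.
    finals = [Counter() for _ in range(NUM_REDUCERS)]
    stage2_load = [0] * NUM_REDUCERS
    for partial in partials:
        for key, count in partial.items():
            r = partition_by_key(key)
            finals[r][key] += count
            stage2_load[r] += 1  # one pre-aggregated row received

    result = Counter()
    for final in finals:
        result.update(final)
    return result, stage1_load, stage2_load


# Hypothetical skewed input: one hot key plus ten small keys.
keys = ["hot"] * 10_000 + [f"k{i}" for i in range(10) for _ in range(10)]
result, stage1_load, stage2_load = two_stage_count(keys, random.Random(0))
print("rows per reduce task, job 1:", stage1_load)
print("rows per reduce task, job 2:", stage2_load)
```

In job 1 the hot key is spread over all reduce tasks, so the load is roughly even. In job 2 each reduce task receives at most one pre-aggregated row per key from each job 1 task, so the hot key no longer dominates. The price is an extra job.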

2. Join strategy: when choosing the driving table, pick the table whose join key is most evenly distributed. Apply column pruning and filtering first, so that the amount of data is already relatively small when the two tables are joined.

3. Use map join so that the small dimension table is loaded into memory first. The work that would otherwise be done in the reduce stage is then completed on the map side.

4. When a large table is joined with a large table, turn each null key into a string plus a random number, so that the skewed rows are distributed randomly across different reduce tasks. Because null values never match in the join, the final result is not affected by this processing.

5. For count distinct, handle the rows with an empty value separately. If only count distinct is being calculated, these rows need no processing: filter them out directly and add 1 to the final result if the empty value should count as one distinct value (count(distinct) itself ignores NULL). If other calculations that require group by are also needed, process the records with the empty value separately and union the result with the other results.

6. Replace count(distinct) with sum() plus group by, that is, deduplicate the values with group by first and then sum over the resulting groups.

7. With a small change to the business logic, skewed data can sometimes be processed separately and then unioned back at the end.

8. For skew caused by null values, solution 1: exclude the rows with null keys from the join; solution 2: use a random value (rand()) to assign each null value a new key. Turning the null key into a string plus a random number spreads the skewed data across different reduce tasks and resolves the skew.

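9. Joining columns of different data types causes data skew: the default hash operation assigns reduce tasks by the int-type id, so all records whose id is of string type are assigned to the same reduce task. Converting both join columns to the same type (for example, casting the int id to string) before the join avoids this.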

10. Use map join to solve the data skew problem of a small table (few records) joined with a large table. This method is used very frequently, but if the small table is so large that map join hits a bug or an exception, special handling is required.
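The null-key technique from items 4 and 8 above can be sketched as follows. This is a simulation under stated assumptions, not Hive code: the "null_" prefix, the table contents, the number of reduce tasks, and the toy partition function are all hypothetical, and the sketch assumes that no real key starts with the chosen prefix.

```python
import random
from collections import Counter

NUM_REDUCERS = 4  # hypothetical number of reduce tasks


def partition(key) -> int:
    """Toy partitioner: all null keys land on one reduce task,
    other keys are spread by a deterministic byte sum."""
    if key is None:
        return 0
    return sum(str(key).encode("utf-8")) % NUM_REDUCERS


def salt_null(key, rng) -> str:
    """Turn a null key into a string plus a random number.
    Real keys are kept (as strings) so they still match in the join."""
    if key is None:
        return f"null_{rng.randrange(1_000_000)}"
    return str(key)


rng = random.Random(0)
# Hypothetical log table: 8,000 rows without a user id, 2,000 rows with one.
log_keys = [None] * 8_000 + [i % 100 for i in range(2_000)]
# Hypothetical user table keys, as strings.
user_ids = {str(i) for i in range(100)}

unsalted_load = Counter(partition(k) for k in log_keys)
salted_keys = [salt_null(k, rng) for k in log_keys]
salted_load = Counter(partition(k) for k in salted_keys)

# Salted null keys never equal a real user id, so the join result
# contains exactly the rows that had a real key.
matches = sum(1 for k in salted_keys if k in user_ids)

print("without salting:", dict(unsalted_load))
print("with salting:   ", dict(salted_load))
print("matched rows:   ", matches)
```

Without salting, one reduce task receives every null row. With salting, the null rows are spread across all reduce tasks while the set of matching rows stays the same.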

In short, keep the whole picture in mind when optimizing: making a single job optimal matters less than making the overall workload optimal.
