Hive Data Skew Summary


During shuffle-stage optimization, data skew can make the optimization far less effective in some cases. The main reason is that the counters obtained after a job completes are totals for the whole job, and the optimization is based on averages derived from those counters. When data skew makes the data volumes processed by individual map tasks differ widely, those averages lose much of their meaning. Hive executes in stages, and the amount of data each map task processes depends on the reduce output of the previous stage. Distributing data evenly across the reduce tasks is therefore the root solution to data skew. Avoiding the problem so that jobs run well is more efficient than fixing failures after the fact. After reading some documents, the findings can be summarized as follows.

1. Causes of Data Skew

1.1 Operations:

| Keyword | Situation | Consequence |
| --- | --- | --- |
| Join | One of the tables is small, but its keys are concentrated | The data distributed to one or a few reduce tasks is far above the average |
| Join | A large table is joined with another large table, but the join field has too many 0 or null values | All the null values are handled by a single reduce task, which is very slow |
| Group by | The group-by dimension is too small, and some value occurs too many times | The reduce task that handles the hot value is very time-consuming |
| Count distinct | Too many rows share a special value | The reduce task that processes the special value is time-consuming |

1.2 Causes:

1) Uneven key distribution

2) Characteristics of the business data itself

3) Insufficient consideration when the tables were created

4) Some SQL statements are inherently prone to data skew

 

1.3 Symptoms:

The task progress stays at 99% (or 100%) for a long time. The task monitoring page shows that only a few (one or several) reduce subtasks remain unfinished, because the amount of data they process differs greatly from that of the other reduce tasks.

The number of records handled by a single reduce task differs greatly from the average, often by a factor of 3 or more, and the longest reduce running time is far longer than the average.

 

2. Solutions to Data Skew

2.1 Parameter tuning:

hive.map.aggr = true

Performs partial aggregation on the map side, equivalent to a combiner.

hive.groupby.skewindata = true

Performs load balancing when data skew occurs. When this option is set to true, the generated query plan contains two MR jobs. In the first MR job, the map output is randomly distributed across the reduce tasks; each reduce task performs a partial aggregation and emits the result, so rows with the same group-by key may land on different reduce tasks, which balances the load. The second MR job then distributes the pre-aggregated results to the reduce tasks by the group-by key (this guarantees that rows with the same group-by key reach the same reduce task) and completes the final aggregation.
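A minimal sketch of how these settings are used in a session (the log table and its user_id column are illustrative assumptions):

-- Enable map-side partial aggregation (acts like a combiner)
SET hive.map.aggr=true;
-- Rewrite skewed GROUP BY queries into two MR jobs for load balancing
SET hive.groupby.skewindata=true;

-- With the settings above, this aggregation runs as two jobs:
-- a randomized partial aggregation, then a final merge by user_id
SELECT user_id, COUNT(*) AS pv
FROM log
GROUP BY user_id;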

 

2.2 SQL statement adjustment:

How to join:

For the choice of the driver table, use the table with the most even join key distribution as the driver table.

Apply column pruning and filtering so that the data volume is relatively small when the two tables are joined.

Small table joined with a large table:

Use map join to load the small dimension table (fewer than 1000 records) into memory in advance, completing the join on the map side instead of in the reduce stage, as in the sketch below.
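A sketch of the hint-based form, assuming a hypothetical dim_user dimension table with a user_level column (newer Hive versions can also convert the join automatically when hive.auto.convert.join is enabled):

-- dim_user and user_level are illustrative names
SELECT /*+ MAPJOIN(b) */ a.*, b.user_level
FROM log a
JOIN dim_user b ON a.user_id = b.user_id;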

Large table joined with a large table:

Convert the null-valued keys into strings with a random number appended, so that the skewed data is spread across different reduce tasks. Because null values cannot match in a join anyway, this does not affect the final result.

Count distinct with a large number of identical special values:

When using count distinct, handle null values separately. If only the count distinct is being computed, filter the nulls out directly and add 1 to the final result. If other computations need a group by, process the null-valued records separately first, then union the result with the other computations. A sketch of the first case follows.
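A minimal sketch of the filter-then-add-1 idea, assuming a log table whose user_id may be null and should count as one distinct value:

-- Nulls are filtered before the distinct count; the + 1 adds the
-- null back as a single distinct value
SELECT COUNT(DISTINCT user_id) + 1 AS uv
FROM log
WHERE user_id IS NOT NULL;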

Group by with a dimension that is too small:

Replace count(distinct) with a sum() ... group by rewrite to complete the calculation, as sketched below.
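A sketch of the rewrite, assuming a daily unique-visitor count over a hypothetical log table with a day column; the inner group by deduplicates and the outer aggregation replaces the count(distinct):

SELECT day, SUM(1) AS uv            -- outer sum() ... group by
FROM (
  SELECT day, user_id               -- inner group by deduplicates
  FROM log
  GROUP BY day, user_id
) t
GROUP BY day;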

Special handling for special cases:

When optimizing the business logic has little effect, the skewed data can sometimes be pulled out and processed separately, then unioned back at the end, as in the sketch below.
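A sketch assuming one hot key is already known; the literal 'hot_user' and the user_name column are illustrative assumptions:

-- Normal keys go through the regular join
SELECT a.*, b.user_name
FROM log a JOIN users b ON a.user_id = b.user_id
WHERE a.user_id <> 'hot_user'
UNION ALL
-- The hot key is handled separately, e.g. with a map join
SELECT /*+ MAPJOIN(b) */ a.*, b.user_name
FROM log a JOIN users b ON a.user_id = b.user_id
WHERE a.user_id = 'hot_user';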

 

3. Typical Business Scenarios

3.1 Data skew caused by null values

Scenario: Information is often missing in logs, for example the user_id field. If the user_id in the log table is joined with the user_id in the users table, data skew may occur.

Solution 1: Do not join rows whose user_id is null:

SELECT a.*
FROM log a
JOIN users b ON a.user_id IS NOT NULL AND a.user_id = b.user_id
UNION ALL
SELECT a.*
FROM log a
WHERE a.user_id IS NULL;

 

Solution 2: Assign a new key value to the null values:

SELECT *
FROM log a
LEFT OUTER JOIN users b
  ON CASE WHEN a.user_id IS NULL THEN CONCAT('hive', RAND()) ELSE a.user_id END = b.user_id;

 

Conclusion: Method 2 is more efficient than method 1, with less I/O and fewer jobs. Solution 1 reads the log table twice and uses 2 jobs; solution 2 uses only 1 job. This optimization is suitable for skew caused by invalid IDs (such as -99, '', and null). Converting the null-valued keys into strings with random numbers appended spreads the skewed data across different reduce tasks, solving the data skew problem.

 

3.2 Data skew caused by joining different data types

Scenario: The user_id field in the users table is int, while the user_id field in the log table contains both string and int values. When the two tables are joined on user_id, the default hash partitioning operates on the int value, so all records with string-typed IDs are sent to a single reducer.

Solution: Convert the numeric type to string:

SELECT *
FROM users a
LEFT OUTER JOIN logs b
  ON a.user_id = CAST(b.user_id AS STRING);

 

3.3 When the small table is not so small: solving skew with map join

Map join solves the data skew problem of a large table joined with a small table (one with few records). This method is used frequently, but if the small table is large, map join hits a bug or exception when loading it and special processing is required. Example:

SELECT *
FROM log a
LEFT OUTER JOIN users b ON a.user_id = b.user_id;

 

The users table has more than 600 million records, so distributing users to every map task is no small overhead, and map join does not support such a large "small" table. Yet using an ordinary join brings back the data skew.

Solution:

 
SELECT /*+ MAPJOIN(x) */ *
FROM log a
LEFT OUTER JOIN (
  SELECT /*+ MAPJOIN(c) */ d.*
  FROM (SELECT DISTINCT user_id FROM log) c
  JOIN users d ON c.user_id = d.user_id
) x ON a.user_id = x.user_id;

If there were millions of distinct user_id values in the log, we would be back to the original map join problem. Fortunately, the daily UV count is never that large: the members with transactions, the members with clicks, and the members earning commissions are all limited in number. So this method solves the data skew problem in many scenarios.

4. Summary

Making the map output distribute evenly across the reduce tasks is our final goal. Because the hash algorithm partitions by key, key-based hashing will more or less cause data skew. A great deal of experience shows that data skew is mostly caused by careless table design, or by things that could have been avoided in the business logic. The general steps are:

1. Sample the log table to find which user_id values are skewed, producing a result table tmp1. The computing framework knows nothing about the distribution of its input data, so sampling is indispensable. A sketch of this step appears after the list.

2. The distribution of the data follows the statistical rules of sociology: wealth is unevenly shared. There will not be many skewed keys, just as a society does not have many rich people or many unusual people. So tmp1 has very few records. Map join tmp1 with users to produce tmp2, and load tmp2 into the distributed file cache. This is a map-only process.

3. The map tasks read both users and log. For a record from log, check whether its user_id is in tmp2. If it is, output it to local file a; otherwise, emit a <user_id, value> key/value pair. For a record from users, emit a <user_id, value> key/value pair. Then enter the reduce stage.

4. Finally, file a is merged with the files output by the reduce stage of step 3 and written to HDFS.
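A sketch of the sampling in step 1, in HiveQL; the bucket fraction and the skew threshold are illustrative assumptions that must be tuned to the data set:

CREATE TABLE tmp1 AS
SELECT user_id
FROM (
  -- Sample roughly 1% of the rows at random
  SELECT user_id, COUNT(*) AS cnt
  FROM log TABLESAMPLE(BUCKET 1 OUT OF 100 ON rand())
  GROUP BY user_id
) t
WHERE cnt > 100000;   -- skew threshold: an assumption, not a rule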

 

If you confirm that the business really requires such logic, consider the following optimizations:

1. For joins, use map join when the small table does not exceed 1 GB.

2. For group by or distinct, set hive.groupby.skewindata=true.

3. Otherwise, apply the SQL statement adjustments described above.

 

From http://www.tbdata.org/archives/2109
