Transferred from: http://www.cnblogs.com/ggjucheng/archive/2013/01/03/2842860.html
While optimizing the shuffle stage we ran into data skew, which makes the optimization much less effective in some cases. The main reason is that the counters reported after a job completes are totals for the whole job, and the optimization is based on the averages of those counters; when data skew causes large differences in how much data each map processes, those averages lose their meaning. Hive executes in stages, and the amount of data each map processes depends on the reduce output of the previous stage, so distributing data evenly across the reduces is the root issue behind data skew. It is more efficient to avoid the problem than to fix it after the fact. After reviewing some material, we summarize the following.
1 Causes of data skew
1.1 Operations:

| Keyword | Situation | Consequence |
| --- | --- | --- |
| Join | One of the tables is small, but its keys are concentrated | The data sent to one or a few reduces is far above average |
| Join | Large table joined with large table, but the join field has too many 0 or null values | All the null values are handled by a single reduce, which is invariably slow |
| Group by | The group-by dimension is too small, and one value occurs an excessive number of times | The reduce that handles this value is very time-consuming |
| Count Distinct | Too many of one special value | The reduce that handles this special value is very time-consuming |
1.2 Root causes:
1) Key distribution is not uniform
2) Characteristics of the business data itself
3) Poorly considered table design
4) Some SQL statements inherently produce data skew
1.3 Symptoms:
The task progress hangs at 99% (or 100%) for a long time; the task monitoring page shows that only a small number (one or a few) of reduce subtasks have not finished, because the amount of data they process differs too much from the other reduces.
The number of records handled by a single reduce differs greatly from the average, typically by a factor of 3 or more. The longest-running reduce runs far longer than the average.
2 Solutions for data skew
2.1 Parameter tuning:
hive.map.aggr=true
Partial aggregation on the map side, equivalent to a Combiner.
hive.groupby.skewindata=true
Load balancing for skewed data. When set to true, the generated query plan has two MR jobs. In the first MR job, the map output is distributed randomly across the reduces; each reduce performs a partial aggregation and emits its result, so rows with the same group-by key may land on different reduces, which balances the load. The second MR job then distributes the pre-aggregated results to the reduces by group-by key (guaranteeing that identical group-by keys land on the same reduce) and completes the final aggregation.
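For illustration, a minimal sketch of enabling both parameters before a skewed group-by; the table and column names are assumptions, not from the original post:

SET hive.map.aggr=true;            -- map-side partial aggregation, like a Combiner
SET hive.groupby.skewindata=true;  -- rewrite the group-by into two MR jobs for load balancing
SELECT user_id, COUNT(1) AS cnt
FROM log
GROUP BY user_id;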
2.2 SQL statement tuning:
How to join:
When choosing the driver table, pick the one whose join key is most uniformly distributed.
Apply column pruning and filtering first, so that when the two tables are joined the amount of data involved is relatively small (see the sketch below).
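A minimal sketch of pruning and filtering before the join; the partition column dt and the selected columns are illustrative assumptions:

SELECT a.user_id, b.age
FROM (SELECT user_id FROM log WHERE dt = '2013-01-03') a    -- filter and prune log first
JOIN (SELECT user_id, age FROM users) b                     -- keep only the columns needed
ON a.user_id = b.user_id;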
Small table join large table:
Use a map join to load the small dimension table (fewer than 1000 records) into memory in advance and complete the join on the map side, skipping the reduce entirely; a sketch follows.
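A minimal sketch of a map join, assuming a small dimension table dim_city (an illustrative name):

SELECT /*+ MAPJOIN(b) */ a.user_id, b.city_name    -- b is loaded into memory on each map
FROM log a
JOIN dim_city b ON a.city_id = b.city_id;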
Large table join large table:
Turn the null keys into a string plus a random number so that the skewed rows are spread across different reduces; since null values can never match in the join, the final result is unaffected (section 3.1 shows a concrete example).
Count distinct with a large number of one special value:
When computing a count distinct, handle the special (e.g. null) value separately: if only the count distinct is needed, filter it out and add 1 to the final result (see the sketch below). If other aggregations require a group by, process the records with the special value separately first, then union the result with the other computations.
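A minimal sketch of the filter-then-add-one approach, assuming the special value is null in log.user_id:

SELECT COUNT(DISTINCT user_id) + 1 AS uv    -- add 1 back for the filtered null value
FROM log
WHERE user_id IS NOT NULL;                  -- keep the skewed null rows out of the shuffle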
Group by on a dimension that is too small:
Replace count(distinct) with a sum() over a group by, as sketched below.
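A minimal sketch of the rewrite; the column name is an illustrative assumption:

-- instead of: SELECT COUNT(DISTINCT user_id) FROM log;
SELECT SUM(1) AS uv
FROM (SELECT user_id FROM log GROUP BY user_id) t;    -- the inner group by deduplicates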
Special cases, special treatment:
When business-logic optimization yields little, the skewed data can sometimes be handled separately and union-ed back at the end, as in the sketch below.
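A minimal sketch of handling a known hot key separately and unioning the results back; the hot key value 'hot_user' is an illustrative assumption:

SELECT * FROM (
  SELECT user_id, COUNT(1) AS cnt FROM log WHERE user_id <> 'hot_user' GROUP BY user_id
  UNION ALL
  SELECT user_id, COUNT(1) AS cnt FROM log WHERE user_id = 'hot_user' GROUP BY user_id
) t;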
3 Typical business scenarios
3.1 Data skew caused by null values
Scenario: logs often suffer from missing information, for example a missing user_id. If the log's user_id is joined with the users table's user_id, you will run into data skew.
Workaround 1: rows whose user_id is null do not participate in the join
SELECT * FROM log a
JOIN users b ON a.user_id IS NOT NULL AND a.user_id = b.user_id
UNION ALL
SELECT * FROM log a WHERE a.user_id IS NULL;
Workaround 2: assign a new key value to the nulls
SELECT * FROM log a
LEFT OUTER JOIN users b
ON CASE WHEN a.user_id IS NULL THEN CONCAT('hive', RAND()) ELSE a.user_id END = b.user_id;
Conclusion: method 2 is more efficient than method 1: less IO and fewer jobs. In workaround 1, log is read twice and there are 2 jobs; workaround 2 needs only 1 job. This optimization suits skew caused by invalid IDs (such as -99, "", null). Turning the null key into a string plus a random number spreads the skewed data across different reduces and resolves the skew.
3.2 Data skew caused by joining different data types
Scenario: the user_id field in the users table is int, while the user_id field in the log table contains both string and int values. When the two tables are joined on user_id, the default hash partitioning operates on the int value, so all records with string-type IDs are sent to a single reducer.
Workaround: cast the numeric type to string
SELECT * FROM users a
LEFT OUTER JOIN log b ON a.user_id = CAST(b.user_id AS STRING);
3.3 The small table is not so small: using map join to resolve the skew
Using a map join to handle data skew when a large table joins a small table (few records) is extremely common, but when the small table is not so small, the map join hits a bug or exception and special handling is needed. For example:
SELECT * FROM log a LEFT OUTER JOIN users b ON a.user_id = b.user_id;
The users table has over 6 million records. Distributing all of users to every map is a big expense, and the map join does not support such a large "small" table; yet an ordinary join runs into the data skew problem.
Workaround:
SELECT /*+ MAPJOIN(x) */ *
FROM log a
LEFT OUTER JOIN (
  SELECT /*+ MAPJOIN(c) */ d.*
  FROM (SELECT DISTINCT user_id FROM log) c
  JOIN users d ON c.user_id = d.user_id
) x ON a.user_id = x.user_id;
If log contains millions of distinct user_ids, this brings back the original map join problem. Fortunately, the daily active members (UV) will not be too many: members with transactions, with clicks, with commission, and so on are each limited in number. So this method solves the data skew problem in many scenarios.
4 Summary
It is our ultimate goal to distribute the map output evenly across the reduces. Due to the limitations of the hash algorithm, hashing on the key will always produce some degree of data skew. A great deal of experience shows that data skew is caused by human negligence or by business logic that can be worked around. The general steps are:
1. Sample the log table to find which user_ids are skewed, and collect them into a result table tmp1 (see the sketch after this list). The computing framework cannot know the data distribution without reading all the data, so sampling is indispensable.
2. Data distribution follows the sociological rule of inequality: the skewed keys are few, just as the rich in a society are few. So tmp1 has very few records. Map join tmp1 with users to generate tmp2, and read tmp2 into the distributed file cache. This is a map process.
3. The maps read both users and log. If a record comes from log, check whether its user_id is in tmp2: if so, write it to local file a; otherwise emit a <user_id, value> key/value pair. If the record comes from users, also emit a <user_id, value> key/value pair. Then enter the reduce phase.
4. Finally, merge file a with the output of the stage-3 reduce phase and write the result to HDFS.
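As a rough illustration of step 1, a sampling query that collects the skewed user_ids into tmp1. The sampling bucket and the threshold of 100000 are illustrative assumptions, not values from the original post:

CREATE TABLE tmp1 AS
SELECT user_id
FROM (
  SELECT user_id, COUNT(1) AS cnt
  FROM log TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) s
  GROUP BY user_id
) t
WHERE cnt > 100000;    -- keys this frequent in the sample are treated as skewed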
If you confirm that the business genuinely requires such skewed logic, consider the following optimizations:
1. For joins, use a map join when the small table is no larger than 1G
2. For group by or distinct, set hive.groupby.skewindata=true
3. Apply the SQL tuning techniques described above wherever possible