Hive Optimization Summary
--- By Hualien
Hive SQL is translated into MapReduce jobs, and the result often holds unexpected surprises. Understanding how Hadoop handles data is the foundation of Hive optimization. This is a summary of the valuable experience gained by all members of the project team over the past year.
Long-term observation of how Hadoop processes data reveals several notable characteristics:
1. It is not afraid of large data volumes; it is afraid of data skew.
2. Jobs made up of many MapReduce jobs run inefficiently. For example, joining a table of only a few hundred rows several times to produce several summaries can generate a dozen or more jobs that cannot finish in under half an hour; MapReduce job initialization time is long.
3. sum and count do not suffer from data skew (map-side aggregation handles them well).
4. count(distinct) is inefficient; with a large data volume it is prone to problems, and multiple count(distinct) expressions are even worse.
Optimization can be performed in the following aspects:
1. Better model design.
2. Solve the data skew problem.
3. Reduce the number of jobs.
4. Setting a reasonable number of map and reduce tasks can effectively improve performance. (For example, using 160 reducers for a small computation is quite a waste; one is enough.)
5. Writing the SQL by hand to resolve data skew is a good choice. Setting hive.groupby.skewindata=true is a general algorithm-level optimization, but a generic optimization is blind to the business and can only offer a generic solution. ETL developers know the business and the data better, so business-driven logic is usually more accurate and effective.
6. Be careful with count(distinct); when the data volume is large, it easily causes skew. Do not count on luck; handle it yourself (see the sketch after this list).
7. Merging small files is an effective way to improve scheduling efficiency. If all of our jobs produce a reasonable number of files, the overall scheduling efficiency of the cluster also benefits.
8. Optimizing the overall workflow is better than optimizing a single job.
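A rough sketch of items 4 through 7 in practice (parameter names and defaults vary with the Hive/Hadoop version in use, and the table and column names log and uid below are illustrative, not from the original):

-- item 4: explicitly cap the reducer count for a small aggregation
set mapred.reduce.tasks=1;

-- item 5: generic skew mitigation for group by (business-blind)
set hive.groupby.skewindata=true;

-- item 6: replace count(distinct uid) with a two-stage group by,
-- so the distinct values are not all funneled through one reducer
select count(1) from (select uid from log group by uid) t;

-- item 7: merge small output files at the end of the job
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;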
Cases during migration and optimization:
Question 1: logs often contain missing information, for example the user_id field in the site-wide logs. Joining the log's user_id with the bmw_users table then causes data skew.
Method: handle the data skew.
Solution 1: if user_id is null, it does not participate in the join. For example:

select *
from log a
join bmw_users b
on a.user_id is not null
and a.user_id = b.user_id
union all
select *
from log a
where a.user_id is null;
Solution 2:

select *
from log a
left outer join bmw_users b
on case when a.user_id is null then concat('dp_hive', rand()) else a.user_id end = b.user_id;
Summary: method 2 is more efficient than method 1, with less IO and fewer jobs. Method 1 reads the log table twice and runs 2 jobs; method 2 runs 1 job. This optimization suits skew caused by invalid ids (for example -99, '', null): converting the null key into a string plus a random number spreads the skewed data across different reducers, which solves the skew. Because the null values cannot match anything in the join, scattering them across different reducers does not affect the final result. Appendix: the generic Hadoop join implementation (the join is realized through secondary sorting: the join column is the partition key; the join column c1 plus a table tag form the group key used for sorting; rows are distributed to reducers by the partition key and sorted by the group key within each reducer).
Question 2: ids of different data types cause data skew.
One table is the s8 log, with one record per item, which must be joined with the items table. The join, however, is prone to skew. The s8 log contains both string item ids and numeric item ids, and the column type is string, while the id in the items table is bigint. The suspected cause: the s8 item id is converted to a number for hashing to reducers, so all the s8 log rows with string ids land on a single reducer. The solution confirmed this guess.
Method: convert the numeric type to a string.
select *
from s8_log a
left outer join r_auction_auctions b
on a.auction_id = cast(b.auction_id as string);
Question 3: take advantage of Hive's union all optimization.
Hive's union all optimization is limited to non-nested queries.
For example:

select * from
(select * from t1
 group by c1, c2, c3
 union all
 select * from t2
 group by c1, c2, c3) t3
group by c1, c2, c3;
From a business-logic point of view, the group by in the subqueries looks redundant (functionally redundant, unless a count(distinct) is needed), and would only be there because of a Hive bug or for performance reasons (there was a Hive bug where the data did not produce correct results without the group by subquery). So this Hive SQL can be converted to:
select * from
(select * from t1
 union all
 select * from t2
) t3
group by c1, c2, c3;
After testing, the union all Hive bug did not reappear and the results were consistent; the number of MR jobs dropped from 3 to 1.
t1 is equivalent to one directory and t2 to another; for a MapReduce program, t1 and t2 can be fed to one MapReduce job as multiple inputs, so the problem can be solved with a single MapReduce job. The Hadoop computing framework is not afraid of large data volumes; it is afraid of a large number of jobs.
However, on another computing platform such as Oracle this is not necessarily true, because splitting a large input into two inputs, sorting and aggregating them separately, and then merging (assuming the two sub-sorts run in parallel) may well give better performance (just as Shell sort outperforms bubble sort).
Question 4: for example, to join a promotion-effect table with the items table, the auction_id column in the effect table contains both string ids and numeric ids, and it must be joined with the items table to obtain item information. The following Hive SQL has better performance:
select *
from effect a
join (
  select auction_id as auction_id from auctions
  union all
  select auction_string_id as auction_id from auctions
) b
on a.auction_id = b.auction_id;
This is better than filtering out the numeric ids and the string ids separately and then joining each with the items table.
The advantage of writing it this way: one MR job, the items table is read only once, and the promotion-effect table is read only once. If this SQL were rewritten as MR code, the map phase would tag records from table a with tag a, and each record in the items table would emit two <key, value> pairs with tag b, <b, numeric id> and <b, string id>. So the items table is read from HDFS only once.
Question 5: should you first join into a temporary table and then union all, or write a nested query? This is a real question. For example:
select *
from (select *
      from t1
      union all
      select *
      from t4
      union all
      select *
      from t2
      join t3
      on t2.id = t3.id
) x
group by c1, c2;
This produces four jobs. If you first join to generate a temporary table t5 and then union all, it becomes two jobs.
insert overwrite table t5
select *
from t2
join t3
on t2.id = t3.id;

select * from
(select * from t1
 union all
 select * from t4
 union all
 select * from t5
) x
group by c1, c2;
It would be good if Hive's union all optimization were smarter about this (treating subqueries as temporary tables), which would lighten the burden on developers. The root cause is that the current union all optimization is limited to non-nested queries. If you write an MR program there is no problem at all; it is simply multiple inputs.
Question 6: use map join to handle the common data-skew scenario of joining a small table with a large table. But what if the small table is not so small? This comes up frequently: if the "small" table is large, map join hits a bug or an exception and requires special handling. Yun Rui and Yu Ji provided a very powerful solution. Example:
select *
from log a
left outer join members b
on a.memberid = b.memberid;
The members table has more than 600 million records. Distributing members to every map is no small overhead, and map join does not support such a large "small" table; using an ordinary join brings back the data skew problem.
Solution:

select /*+ mapjoin(x) */ *
from log a
left outer join (
  select /*+ mapjoin(c) */ d.*
  from (select distinct memberid from log) c
  join members d
  on c.memberid = d.memberid
) x
on a.memberid = x.memberid;
First take all memberids that appear in the log, then map-join that list with members to get the information of the members who have log records today, and then map-join that result with the log.
If the log contained millions of distinct memberids, this would run back into the original map join problem. Fortunately, the daily UV of members is not that large: there are not that many members with transactions, with clicks, or with commissions. So this method solves the data skew problem in many scenarios.
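For reference, whether Hive treats a table as small enough for an automatic map join is controlled by configuration; a minimal sketch (parameter names and the 25 MB default reflect common Hive versions and may differ in yours):

-- let Hive decide automatically whether a join can become a map join
set hive.auto.convert.join=true;
-- size threshold (in bytes) under which a table counts as the small side
set hive.mapjoin.smalltable.filesize=25000000;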
Question 7: a general Hive approach of doubling (replicating) the relatively small joined table. Take the same problem as before:
select *
from log a
left outer join (
  select /*+ mapjoin(e) */ memberid, number
  from members d
  join num e
) b
on a.memberid = b.memberid
and mod(a.pvtime, 30) + 1 = b.number;
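A minimal sketch of one way to prepare the num table used above (the local file path is illustrative, not from the original; the file holds the numbers 1 through 30, one per line):

-- single-column helper table holding the natural numbers 1..30
create table num (number int);
-- /tmp/num.txt is an assumed local file with one number per line
load data local inpath '/tmp/num.txt' overwrite into table num;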
The num table has a single column, number, with 30 rows containing the natural numbers 1 through 30. The effect is to expand the members table 30 times and then distribute the log data to different reducers by memberid and pvtime, so that each reducer receives a relatively even amount of data. In current tests, the map join solution performs slightly better; the scheme below suits cases that map join cannot solve.
In the long run, we would like to make the following scheme a general Hive optimization method:
1. Sample the log table by memberid and return a result table tmp1. The computing framework does not know how the data is distributed, so sampling is indispensable. (Stage 1)
2. Data distribution follows the rules of social statistics: wealth is spread unevenly. There will not be many skewed keys, just as there are not many rich people in a society. So tmp1 has very few records. Map-join tmp1 with members to produce tmp2, and read tmp2 into the distributed file cache. This is a map-only process. (Stage 2)
3. The maps read both members and log. For a record from log, check whether its memberid is in tmp2; if so, write it to a local file, otherwise emit a <memberid, value> key-value pair. For a record from members, emit a <memberid, value> key-value pair and continue into the reduce phase. (Stage 3)
4. Finally, merge the local files with the output files of the Stage 3 reducers and write the result to HDFS.
This has been implemented in Hadoop. Stage 2 is a map-only process, and its map can be merged with the map of Stage 3 into a single map process.
The goal of this scheme: handle skewed data with map join and non-skewed data with a normal join. In Hive SQL, the statement would turn into many fragments and the log table would be read multiple times. Skewed keys are always rare, which holds in the vast majority of business contexts. Could this serve as a general data-skew join algorithm for Hive?
Question 8: optimization of multi-granularity (same-level) UV computation. For example, compute a store's uv, and also the page uv and pvip.
Solution 1:

select shopid, count(distinct uid)
from log
group by shopid;

select pageid, count(distinct uid)
from log
group by pageid;
Because of data skew, these queries take a very long time to run.
Solution 2:

from log
insert overwrite table t1 partition (type='1')
select shopid
group by shopid, acookie
insert overwrite table t1 partition (type='2')
select pageid
group by pageid, acookie;
Store uv:

select shopid, sum(1)
from t1
where type = '1'
group by shopid;

Page uv:

select pageid, sum(1)
from t1
where type = '2'
group by pageid;
Multi insert is used here to effectively cut HDFS reads, but multi insert adds HDFS writes (one extra HDFS write for an additional map stage). With this method the results can be produced smoothly.
Solution 3:

insert into table t1
select type, type_name, '' as uid
from (
  select 'page' as type, pageid as type_name, uid
  from log
  union all
  select 'shop' as type, shopid as type_name, uid
  from log
) y
group by type, type_name, uid;

insert into table t2
select type, type_name, sum(1) as uv
from t1
group by type, type_name;

from t2
insert into table t3
select type, type_name, uv
where type = 'page'
insert into table t4
select type, type_name, uv
where type = 'shop';
This yields the two result tables: t3, the page uv table, and t4, the shop uv table. In terms of IO, log is read only once, and there are fewer HDFS writes than in solution 2 (multi insert sometimes adds an extra map-stage HDFS write). The number of jobs drops by one, to 3, and the number of jobs with a reduce phase drops from 4 to 2. The third step is a map-only pass over a small table that splits it into the result tables and consumes few computing resources, whereas solution 2 performs two large-scale de-duplicating aggregations.
The main idea of this optimization: a MapReduce job takes a long time to initialize, so once it is up, let it do more work. Deduplicating the pages by uid in the same pass saves one read of the log and one round of job initialization, and saves network shuffle IO, at the cost of more local disk reads and writes; the efficiency improvement is substantial.
This solution suits multi-granularity UV computations at the same level that do not need to be summarized level by level upward: the more granularities, the more resources are saved; it is fairly general.
Question 9: multi-granularity UV summarized level by level upward. For example, with four dimensions a, b, c, d, compute the uv of (a, b, c, d), of (a, b, c), of (a, b), of (a), and the total uv, producing the corresponding uv result tables. This could be done with solution 2 of Question 8, but here, because of the special nature of the uv scenario (multi-granularity summarized level by level upward), a single sort can serve all of the uv computations.
Case: mm_log currently has more than 2.5 billion PVs per day. Compute uv and ipuv from the mm logs, producing result tables at three granularities:

(memberid, siteid, adzoneid, province, uv, ipuv)   R_TABLE_4
(memberid, siteid, adzoneid, uv, ipuv)              R_TABLE_3
(memberid, siteid, uv, ipuv)                         R_TABLE_2
Step 1: deduplicate by memberid, siteid, adzoneid, province with a group by, generating a temporary table in which cookies and ips are tagged and placed together. Call the temporary table T_4:
select memberid, siteid, adzoneid, province, type, user
from (
  select memberid, siteid, adzoneid, province, 'a' as type, cookie as user
  from mm_log
  where ds = 20101205
  union all
  select memberid, siteid, adzoneid, province, 'i' as type, ip as user
  from mm_log
  where ds = 20101205
) x
group by memberid, siteid, adzoneid, province, type, user;
Step 2: ranking. Generate table T_4_NUM. Hadoop's strongest and most central capabilities are partition and sort. Group by type and user (the acookie), and rank within type, user by memberid, siteid, adzoneid, province:
select *,
  row_number(type, user, memberid, siteid, adzoneid) as adzone_num,
  row_number(type, user, memberid, siteid) as site_num,
  row_number(type, user, memberid) as member_num,
  row_number(type, user) as total_num
from (
  select *
  from T_4
  distribute by type, user
  sort by type, user, memberid, siteid, adzoneid
) x;
This gives each user's ranking at the different granularities. For the same user at a given granularity there is exactly one record whose rank equals 1. Summing over the records where the rank equals 1 produces a result equivalent to summing after deduplicating with a group by on user.
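As an aside, the row_number(...) calls above rely on a custom ranking UDF available in that environment; on Hive 0.11 and later the same idea can be expressed with the standard windowing function. A sketch under that assumption (only the fact that exactly one row per partition receives the number 1 matters, so no ordering is imposed inside each partition):

select t.*,
  row_number() over (partition by type, user, memberid, siteid, adzoneid) as adzone_num,
  row_number() over (partition by type, user, memberid, siteid) as site_num,
  row_number() over (partition by type, user, memberid) as member_num,
  row_number() over (partition by type, user) as total_num
from T_4 t;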
Step 3: compute the uv statistics at the different granularities, starting from the finest granularity and generating the result table R_TABLE_4. At this point the result set is only about 10 million rows.
For example, the uv at the granularity memberid, siteid, adzoneid, province is computed as:
select memberid, siteid, adzoneid, province,
  sum(case when type = 'a' then cast(1 as bigint) end) as province_uv,
  sum(case when type = 'i' then cast(1 as bigint) end) as province_ip,
  sum(case when adzone_num = 1 and type = 'a' then cast(1 as bigint) end) as adzone_uv,
  sum(case when adzone_num = 1 and type = 'i' then cast(1 as bigint) end) as adzone_ip,
  sum(case when site_num = 1 and type = 'a' then cast(1 as bigint) end) as site_uv,
  sum(case when site_num = 1 and type = 'i' then cast(1 as bigint) end) as site_ip,
  sum(case when member_num = 1 and type = 'a' then cast(1 as bigint) end) as member_uv,
  sum(case when member_num = 1 and type = 'i' then cast(1 as bigint) end) as member_ip,
  sum(case when total_num = 1 and type = 'a' then cast(1 as bigint) end) as total_uv,
  sum(case when total_num = 1 and type = 'i' then cast(1 as bigint) end) as total_ip
from t_4_num
group by memberid, siteid, adzoneid, province;
For the uv at the ad-zone granularity, aggregate from R_TABLE_4, which now serves as the source table for the coarser-granularity statistics:
select memberid, siteid, adzoneid, sum(adzone_uv), sum(adzone_ip)
from r_table_4
group by memberid, siteid, adzoneid;
The uv at the (memberid, siteid) granularity, the uv at the memberid granularity, and the total uv are likewise summarized from r_table_4.
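For instance, following the same pattern, the (memberid, siteid) granularity would be summarized roughly like this (a sketch reusing the site_uv and site_ip columns computed above):

select memberid, siteid, sum(site_uv), sum(site_ip)
from r_table_4
group by memberid, siteid;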