When optimizing, read Hive SQL as a MapReduce program and you will find unexpected surprises.
Understanding Hadoop's core capabilities is the foundation of Hive optimization. This is a summary of the valuable experience gained by all members of the project team over the past year.
Long-term observation of how Hadoop processes data shows several notable characteristics:
1. It is not afraid of large data volumes; it is afraid of data skew.
2. Queries that generate many jobs run inefficiently. For example, even with tables of only a few hundred rows, performing several joins can produce more than 10 jobs that take over half an hour to finish, because MapReduce job initialization time is long.
3. sum() and count() do not suffer from data skew.
4. count(distinct) is inefficient; once the data volume grows it is prone to problems, and multiple count(distinct) expressions are even less efficient.
There are several ways in which optimization can begin:
1. Good model design gets twice the result with half the effort.
2. Solve the problem of data skew.
3. Reduce the number of jobs.
4. Setting a reasonable number of map and reduce tasks can effectively improve performance. (For example, using 160 reducers for a computation at the 100K-row level is quite wasteful; 1 is enough. See the sketch after this list.)
5. Writing your own SQL to resolve data skew is a good choice. set hive.groupby.skewindata=true; is a general, algorithm-level optimization, but algorithm-level optimization ignores the business and habitually offers a one-size-fits-all solution. ETL developers know the business and the data better, so resolving skew through business logic is often more accurate and more efficient.
6. Be wary of count(distinct), especially on large data, where it easily causes skew; do not count on luck. Roll your own deduplication instead (also shown in the sketch after this list).
7. Merging small files is an effective way to improve scheduling efficiency. If all of our jobs produce a reasonable number of files, the overall scheduling efficiency of the cluster also benefits.
8. Keep the big picture when optimizing; the optimum for a single job is not as good as the optimum for the whole.
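As a hedged illustration of points 4 and 6 (a minimal sketch; the reducer count, the table name log, and the column name uid are illustrative, not taken from the cases below):

-- Point 4: cap the reducer count for a small computation.
set mapred.reduce.tasks=1;

-- Point 6: instead of   select count(distinct uid) from log;
-- deduplicate with a group by first, then count, so the work spreads across reducers.
select count(1)
from (select uid from log group by uid) t;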
Cases from the migration and optimization process:
Problem 1: Logs often have missing information, for example the user_id in the full-site log. If you join that user_id with bmw_users, you hit a data skew problem.
Method: solve the data skew problem.
Workaround 1: rows whose user_id is empty do not participate in the join, for example:
select *
from log a
join bmw_users b
on a.user_id is not null
and a.user_id = b.user_id
union all
select *
from log a
where a.user_id is null;
Workaround 2:
select *
from log a
left outer join bmw_users b
on case when a.user_id is null then concat('dp_hive', rand()) else a.user_id end = b.user_id;
Summary: method 2 is more efficient than method 1: not only is the IO lower, the number of jobs is lower too. Method 1 reads the log twice and runs 2 jobs; method 2 runs 1 job. This optimization suits skew caused by invalid IDs (such as -99, '', null). By turning the null key into a string plus a random number, the skewed data is spread across different reducers, which resolves the skew. Because null values cannot match in the join, scattering them across different reducers does not affect the final result. Attached is how a general Hadoop join is implemented: the join uses a secondary sort; the join column is the partition key, and the join column plus the table tag form the sort group key; rows are distributed to reducers by the partition key and sorted within each reducer by the group key.
Problem 2: Joining IDs of different data types causes data skew.
One table is the s8 log, with one record per item, which needs to be joined with the commodity table, but the join is skewed. The s8 log contains both string commodity IDs and numeric commodity IDs, and its column type is string, while the numeric ID in the commodity table is bigint. The guess was that casting s8's commodity ID to a number when allocating reducers sends all of the s8 log's string-ID rows to a single reducer; the solution verified this guess.
Method: convert the numeric type to a string type.
select * from s8_log a
left outer join r_auction_auctions b
on a.auction_id = cast(b.auction_id as string);
Problem 3: Hive's union all optimization
Hive's union all optimization is limited to non-nested queries.
For example:
select * from
(select * from t1
 group by c1, c2, c3
 union all
 select * from t2
 group by c1, c2, c3) t3
group by c1, c2, c3;
From the business logic, the group by in the subqueries looks superfluous (functionally superfluous, unless there is a count(distinct)). If it is not there because of a Hive bug or for performance (a Hive bug once produced incorrect results when the subqueries had no group by), then this query can safely be converted to:
select * from
(select * from t1
 union all
 select * from t2
) t3
group by c1, c2, c3;
After testing, no Hive bug with union all appeared and the data stayed consistent. The number of MR jobs dropped from 3 to 1.
t1 is equivalent to one directory and t2 to another; for a MapReduce program, t1 and t2 can be fed as multiple inputs to a single MapReduce job. So one MapReduce job solves the problem. Hadoop's computation framework is not afraid of large data; it is afraid of many jobs.
But if the target were another computing platform such as Oracle, that would not necessarily hold: splitting one big input into two inputs, sorting each separately and then merging the totals (if the two sub-sorts run in parallel) may well be better (just as Shell sort outperforms bubble sort).
Problem 4: For example, the promotion-effect table must be joined with the commodity table. The auction_id column in the effect table contains both string IDs and numeric IDs, and the join with the commodity table fetches commodity information. The following Hive SQL performs better:
select * from effect a
join (select auction_id as auction_id from auctions
      union all
      select auction_string_id as auction_id from auctions
     ) b
on a.auction_id = b.auction_id;
This beats filtering out the numeric IDs and the string IDs separately and joining each with the commodity table.
The benefit of writing it this way: 1 MR job, the commodity table is read only once, and the promotion-effect table is read only once. If this SQL were replaced by MR code, at map time the records of table a would be tagged a, and each commodity-table record read would be tagged b and emitted as two <key,value> pairs, <numeric_id, b> and <string_id, b>. So the commodity table is read from HDFS only once.
Problem 5: Should a join first generate a temporary table before the union all, or should it be written as a nested query? That is the question. For example:
select *
from (select *
      from t1
      union all
      select *
      from t4
      union all
      select *
      from t2
      join t3
      on t2.id = t3.id
     ) x
group by c1, c2;
This runs as 4 jobs. If you first join to generate a temporary table t5 and then union all, it becomes 2 jobs.
insert overwrite table t5
select *
from t2
join t3
on t2.id = t3.id;

select * from (select * from t1
               union all select * from t4
               union all select * from t5) x;
Hive could be smarter about union all optimization (treating the join subquery as a temporary table itself), which would reduce the burden on developers. The cause of this problem should be that the current union all optimization is limited to non-nested queries. If you write the MR program yourself this is not an issue; it is simply multiple inputs.
Problem 6: Use a map join to solve the data skew problem of joining a small table with a large one. But what if the small table is itself large? This scenario occurs very frequently, and when the small table is too big, map join hits a bug or an exception and needs special handling. Yun Rui and Yu Yu provided a very strong solution. For example:
select * from log a
left outer join members b
on a.memberid = b.memberid;
The members table has 6 million+ records. Distributing members to all of the maps is a huge cost, and map join does not support such a large "small" table. Yet with an ordinary join, you run into the data skew problem.
Workaround:
select /*+mapjoin(x)*/ * from log a
left outer join (select /*+mapjoin(c)*/ d.*
                 from (select distinct memberid from log) c
                 join members d
                 on c.memberid = d.memberid
                ) x
on a.memberid = x.memberid;
First take all the distinct memberids from the log, then map join them against members to get the information of only those members who appear in today's log, and finally map join that result back with the log.
If the log contains millions of distinct memberids, this runs back into the original map join problem. Fortunately, the daily UV members are not too many, the members with transactions are not too many, the members with clicks are not too many, the members earning commission are not too many, and so on. So this method solves the data skew problem in many scenarios.
Problem 7: Hive's general method for data skew: inflate the smaller of the joined tables. This method is commonly used in MR programs. Take the same problem as above:
select * from log a
left outer join (select /*+mapjoin(e)*/
                        memberid, number
                 from members d
                 join num e
                ) b
on a.memberid = b.memberid
and mod(a.pvtime, 30) + 1 = b.number;
The num table has a single column, number, holding the natural numbers 1 through 30 in 30 rows. The idea is to inflate the members table 30x and then distribute the log data to different reducers by memberid and pvtime, so that each reducer receives a comparatively even share of data. In current tests, the map join scheme of problem 6 performs slightly better.
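For completeness, one possible way to create and populate the num helper table (a minimal sketch; the local file path is hypothetical and should contain the integers 1 through 30, one per line):

create table if not exists num (number int)
row format delimited fields terminated by '\t';
-- load the 30 natural numbers from a local file (path is hypothetical):
load data local inpath '/tmp/num_1_to_30.txt' overwrite into table num;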
The scheme below is appropriate when map join cannot solve the problem. In the long run, it is worth making the following optimization into a general Hive method:
1. Sample the log table to find which memberids are skewed and write them to a result table tmp1. The computation framework does not know the distribution of the data, so sampling is indispensable. (A sampling sketch follows this scheme.) Stage 1.
2. The distribution of data follows the rules of social statistics: wealth is unequal. The skewed keys are few, just as a society has few rich people and few unusual people. So tmp1 will have very few records. Map join tmp1 with members to produce tmp2, and load tmp2 into the distributed file cache. This is a map-only process. Stage 2.
3. The map reads both members and the log. If a record comes from the log, check whether its memberid is in tmp2; if so, write it to a local file A, otherwise emit a <memberid, value> pair. If a record comes from members, emit a <memberid, value> pair. Then enter the reduce phase. Stage 3.
4. Finally, merge file A with the output of stage 3's reduce phase and write the result to HDFS.
This method should be achievable in Hadoop. Stage 2 is a map-only process and can be merged with stage 3's map into a single map process.
The goal of this scheme: handle the skewed data with map join and the non-skewed data with a normal join, then merge the two to get the complete result. Written in Hive SQL, the SQL becomes many fragments and the log table is read more than once. The skewed keys are always few, which holds in most business contexts. So could this be adopted as Hive's general algorithm for skewed joins?
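A minimal sketch of the stage 1 sampling, assuming a pre-created result table tmp1(memberid); the sampling ratio and the skew threshold are illustrative, not part of the original scheme:

insert overwrite table tmp1
select memberid
from (
  select memberid, count(1) as cnt
  from log tablesample (bucket 1 out of 100 on rand())   -- roughly a 1% random sample
  group by memberid
) t
where cnt > 100000;   -- keep only memberids that look heavily skewed; tune per dataset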
Problem 8: Optimizing multi-granularity (same-level) UV calculations, for example calculating shop UV as well as page UV and PVIP.
Scheme 1:
select shopid, count(distinct uid)
from log group by shopid;
select pageid, count(distinct uid)
from log group by pageid;
Because of data skew, these queries run for a very long time.
Scheme 2:
from log
insert overwrite table t1 partition (type='1')
select shopid
group by shopid, acookie
insert overwrite table t1 partition (type='2')
select pageid
group by pageid, acookie;
Shop UV:
select shopid, sum(1)
from t1
where type = '1'
group by shopid;
Page UV:
select pageid, sum(1)
from t1
where type = '2'
group by pageid;
Here the multi-insert method effectively reduces HDFS reads, but multi-insert adds HDFS writes (one extra HDFS write for the additional map stage). With this method the results can be produced successfully.
Scheme 3:
insert into t1
select type, type_name, '' as uid
from (
  select 'page' as type,
         pageid as type_name,
         uid
  from log
  union all
  select 'shop' as type,
         shopid as type_name,
         uid
  from log) y
group by type, type_name, uid;
insert into t2
select type, type_name, sum(1) as uv
from t1
group by type, type_name;
from t2
insert into t3
select type, type_name, uv
where type = 'page'
insert into t4
select type, type_name, uv
where type = 'shop';
This finally yields two result tables: t3, the page UV table, and t4, the shop UV table. In terms of IO, the log is read once, and there is one less HDFS write than in scheme 2 (multi-insert sometimes adds an extra map-stage HDFS write). The number of jobs drops by 1, to 3; the number of jobs with a reduce phase drops from 4 to 2; and the third step is a map-only process over a small table, so its cost is small. Scheme 2, by contrast, reprocesses the full data at every step.
The main idea of this optimization: MapReduce job initialization time is long, so once a job is up, let it do more work; do the per-page deduplication by uid along the way. That saves one read of the log and one job initialization, and saves network shuffle IO, at the cost of extra local disk reads and writes. Efficiency improves.
This scheme suits multi-granularity UV calculations at the same level; the more granularities there are, the more resources it saves, and it is fairly general.
Problem 9: Multi-granularity UV that rolls up level by level toward the total. For example, with 4 dimensions a, b, c, d, compute the UV for a,b,c,d; a,b,c; a,b; and a, plus the total UV, producing the corresponding result tables. Scheme 2 of problem 8 could be used here, but because of the specifics of the UV scenario (multiple granularities rolled up level by level), a single sort can serve all of the UV calculations.
Case: the current mm_log has 2.5 billion+ PV per day. Compute UV and IP UV from the mm log, producing
result tables at three granularities:
(memberid, siteid, adzoneid, province, uv, ipuv)  r_table_4
(memberid, siteid, adzoneid, uv, ipuv)            r_table_3
(memberid, siteid, uv, ipuv)                      r_table_2
Step 1: group by memberid, siteid, adzoneid, province to deduplicate and produce a temporary table; tag the cookies and IPs and deduplicate them together. The temporary table is called t_4:
select memberid, siteid, adzoneid, province, type, user
from (
  select memberid, siteid, adzoneid, province, 'a' as type, cookie as user from mm_log where ds = 20101205
  union all
  select memberid, siteid, adzoneid, province, 'i' as type, ip as user from mm_log where ds = 20101205
) x
group by memberid, siteid, adzoneid, province, type, user;
Step 2: rank, producing table t_4_num. Hadoop's strongest and most core capabilities are partition and sort. Distribute by type and user, and rank by type, user, memberid, siteid, adzoneid:
select *,
       row_number(type, user, memberid, siteid, adzoneid) as adzone_num,
       row_number(type, user, memberid, siteid) as site_num,
       row_number(type, user, memberid) as member_num,
       row_number(type, user) as total_num
from (select * from t_4 distribute by type, user sort by type, user, memberid, siteid, adzoneid) x;
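The row_number above takes column arguments and relies on the distribute by / sort by in the subquery, which suggests a custom UDF of that era. On Hive versions that ship window functions (0.11+), the same ranks could instead be sketched with the built-in row_number() over (...), which makes the explicit distribute by / sort by unnecessary (a sketch, not the original method; `user` is backquoted in case it is reserved on newer versions):

select t.*,
       row_number() over (partition by type, `user`, memberid, siteid, adzoneid order by province) as adzone_num,
       row_number() over (partition by type, `user`, memberid, siteid order by adzoneid)           as site_num,
       row_number() over (partition by type, `user`, memberid order by siteid)                     as member_num,
       row_number() over (partition by type, `user` order by memberid)                             as total_num
from t_4 t;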
In this way you get each user's rank at every granularity level; for the same user at a given granularity, exactly one record has rank equal to 1. Summing over the rows whose rank equals 1 gives the same result as summing after a group by user.
Step 3: compute the UV statistics at each granularity, starting from the finest granularity and producing the result table r_table_4. At this point the result set is only at the 100K level.
For example, the UV at the memberid, siteid, adzoneid, provinceid granularity is computed as:
select memberid, siteid, adzoneid, provinceid,
       sum(case when type = 'a' then cast(1 as bigint) end) as province_uv,
       sum(case when type = 'i' then cast(1 as bigint) end) as province_ip,
       sum(case when adzone_num = 1 and type = 'a' then cast(1 as bigint) end) as adzone_uv,
       sum(case when adzone_num = 1 and type = 'i' then cast(1 as bigint) end) as adzone_ip,
       sum(case when site_num = 1 and type = 'a' then cast(1 as bigint) end) as site_uv,
       sum(case when site_num = 1 and type = 'i' then cast(1 as bigint) end) as site_ip,
       sum(case when member_num = 1 and type = 'a' then cast(1 as bigint) end) as member_uv,
       sum(case when member_num = 1 and type = 'i' then cast(1 as bigint) end) as member_ip,
       sum(case when total_num = 1 and type = 'a' then cast(1 as bigint) end) as total_uv,
       sum(case when total_num = 1 and type = 'i' then cast(1 as bigint) end) as total_ip
from t_4_num
group by memberid, siteid, adzoneid, provinceid;
For UV at the ad-zone granularity, compute it from r_table_4; this is a statistic over a source table at only the 100K level:
select memberid, siteid, adzoneid, sum(adzone_uv), sum(adzone_ip)
from r_table_4
group by memberid, siteid, adzoneid;
The UV at the memberid, siteid granularity, the UV at the memberid granularity, and the total UV are likewise summarized from r_table_4, as sketched below.
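Following the same pattern, a sketch of the remaining rollups (column names as defined in step 3):

-- memberid, siteid granularity
select memberid, siteid, sum(site_uv), sum(site_ip)
from r_table_4
group by memberid, siteid;

-- memberid granularity
select memberid, sum(member_uv), sum(member_ip)
from r_table_4
group by memberid;

-- total
select sum(total_uv), sum(total_ip)
from r_table_4;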