Hive Optimization (important)

Source: Internet
Author: User
Tags: joins

Hive optimization essentials:

When optimizing, read Hive SQL as a MapReduce program; you will find unexpected insights.

Understanding Hadoop's core characteristics is fundamental to Hive optimization. Long-term observation of how Hadoop processes data reveals several notable features:

1. Hadoop is not afraid of large data volumes; it is afraid of data skew.

2. Queries that generate many jobs run inefficiently. For example, even with tables of only a few hundred rows, joining them repeatedly can produce more than 10 jobs that take over half an hour to finish, because MapReduce job initialization time is long.

3. SUM and COUNT aggregates do not suffer from data skew.

4. COUNT(DISTINCT) is inefficient and becomes a chronic problem once the data volume grows; multiple COUNT(DISTINCT) expressions in one statement are even slower.

There are several ways in which optimization can begin:

1. A good data model design gets twice the result with half the effort.

2. Solve the problem of data skew.

3. Reduce the number of jobs.

4. Setting a reasonable number of map and reduce tasks can effectively improve performance. (For example, for a computation over roughly 100,000 rows, using 160 reducers is quite wasteful; 1 is enough.)

5. Writing your own SQL to resolve data skew is a good choice. set hive.groupby.skewindata=true; is the generic, algorithm-level optimization, but algorithm-level optimizations always ignore the business and habitually offer a one-size-fits-all solution. ETL developers know the business and the data better, so resolving skew through business logic is often more accurate and more efficient.

6. Treat COUNT(DISTINCT) with a healthy disregard, especially when the data is large; it easily causes skew. Do not trust to luck; handle it yourself (see the sketch after this list).

7. Merging small files is an effective way to improve scheduling efficiency. If every job produces a reasonable number of files, the overall scheduling efficiency of the platform also benefits.

8. Optimize the whole picture: the optimum for a single job is not necessarily the optimum for the whole workload.
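
As a concrete illustration of points 5 and 6, here is a minimal HiveQL sketch; log, shopid and uid are assumed names, not tables from this article:

-- Point 5: the generic switch that splits a skewed GROUP BY into two load-balanced jobs.
set hive.groupby.skewindata=true;

-- Point 6: a hand-written alternative to COUNT(DISTINCT uid): deduplicate first, then count.
SELECT shopid, count(1) AS uv
FROM (SELECT shopid, uid FROM log GROUP BY shopid, uid) t
GROUP BY shopid;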

Optimization cases:


Question 1: Logs often suffer from missing information, for example the user_id in the full-site log. If you join that user_id with bmw_users, you will run into a data skew problem.

Method: resolve the data skew.

Workaround 1: rows with an empty user_id do not participate in the join, for example:

SELECT * FROM log a JOIN bmw_users b ON a.user_id IS NOT NULL AND a.user_id = b.user_id
UNION ALL
SELECT * FROM log a WHERE a.user_id IS NULL;

Workaround 2:

SELECT * FROM log a
LEFT OUTER JOIN bmw_users b
ON CASE WHEN a.user_id IS NULL THEN concat('dp_hive', rand()) ELSE a.user_id END = b.user_id;

Summary: workaround 2 is more efficient than workaround 1: it does less I/O and runs fewer jobs. Workaround 1 reads log twice and produces 2 jobs; workaround 2 produces only 1 job. This optimization is suitable for skew caused by invalid IDs (such as -99, '', null, etc.).

By turning the null key into a string plus a random number, the skewed data is spread across different reducers, which resolves the skew. Because the null values cannot match anything in the join, they do not affect the final result even when they are assigned to different reducers. For reference, the general Hadoop join is implemented with a secondary sort: the join column serves as the partition key, the join column plus the table tag form the sort group key, reducers are assigned by the partition key, and records are sorted within each reducer by the group key.


Issue 2: joining IDs of different data types causes data skew. Table s8_log has one record per item and must be joined with the commodity table, but the join is skewed. The s8 log contains both string commodity IDs and numeric commodity IDs, and its column type is string, while the numeric ID in the commodity table is bigint. The guess was that the join converts s8's commodity ID to a numeric ID when assigning reducers, so all the string IDs in the s8 log land on a single reducer; the fix below confirmed the guess.

Method: convert the numeric type to string:

SELECT * FROM s8_log a LEFT OUTER JOIN r_auction_auctions b ON a.auction_id = cast(b.auction_id AS string);


Question 3: Hive's UNION ALL optimization is limited to non-nested queries. For example:

SELECT * FROM (SELECT * FROM t1 GROUP BY c1, c2, c3 UNION ALL SELECT * FROM t2 GROUP BY c1, c2, c3) t3 GROUP BY c1, c2, c3;

From the business logic, the GROUP BY in the subqueries looks superfluous (functionally superfluous, unless there is a COUNT(DISTINCT)); it exists only because of a Hive bug or for performance reasons (at one point, omitting the subquery GROUP BY triggered a Hive bug that produced incorrect results). So, based on experience, the query was rewritten as:

SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2) t3 GROUP BY c1, c2, c3;

After testing, the UNION ALL no longer hits the Hive bug and the data is consistent. The number of MapReduce jobs dropped from 3 to 1. t1 is effectively one directory and t2 another; for a MapReduce program, t1 and t2 can simply be the multi-inputs of one job, so a single MapReduce pass solves the problem.

Hadoop's computing framework is not afraid of large data; it is afraid of many jobs. But if you are migrating from another computing platform such as Oracle, this does not necessarily hold there, because splitting one large input into two inputs, sorting each separately, and merging the sorted results (when the two sub-sorts run in parallel) may well be faster (just as Shell sort outperforms bubble sort).


Question 4: for example, a promotion-effect table must be joined with the commodity table. The auction_id column in the effect table contains both string IDs and numeric IDs, and the join to the commodity table fetches commodity information. The following Hive SQL performs better:

SELECT * FROM effect a
JOIN (SELECT auction_id AS auction_id FROM auctions
      UNION ALL
      SELECT auction_string_id AS auction_id FROM auctions) b
ON a.auction_id = b.auction_id;

than filtering out the numeric IDs and the string IDs separately and joining each of them to the commodity table. The benefits of this form: 1 MapReduce job, the commodity table is read only once, and the promotion-effect table is read only once. If you rewrote this SQL as MapReduce code, then at map time each record from table a would be tagged a, and each record read from the commodity table would be tagged b and emitted as two <key, value> pairs, <b, numeric id> and <b, string id>. So the commodity table is read from HDFS only once.


Question 5: a JOIN whose result is fed into a UNION ALL as a nested query is a problem. For example:

SELECT * FROM (

SELECT * FROM t1

UNION ALL

SELECT * FROM t4

UNION ALL

SELECT * FROM t2 JOIN t3 ON t2.id = t3.id) x

GROUP BY c1, c2;

This produces 4 jobs. If you first materialize the JOIN into a temporary table t5 and then UNION ALL, it becomes 2 jobs.

INSERT OVERWRITE TABLE t5 SELECT * FROM t2 JOIN t3 ON t2.id = t3.id;

SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t4 UNION ALL SELECT * FROM t5) x GROUP BY c1, c2;

Hive could be smarter about UNION ALL optimization (for example, by treating the JOIN subquery as a temporary table itself), which would reduce the burden on developers. The root cause is that the UNION ALL optimization is currently limited to non-nested queries. If you wrote the MapReduce program yourself this would not be a problem; it is simply multi-inputs.


Question 6: use a map join to solve the data skew problem in scenarios where a small table is joined to a large one; but what if the small table is not so small? This situation occurs very frequently, and when the small table is large the map join hits a bug or an exception and needs special handling. For example:

SELECT * FROM log a LEFT OUTER JOIN members b ON a.memberid = b.memberid;

The members table has more than 6 million records. Distributing members to all of the mappers is a huge cost, and map join does not support such a large "small" table; but an ordinary join runs into the data skew problem.

Workaround:

SELECT /*+ MAPJOIN(x) */ * FROM log a
LEFT OUTER JOIN (
  SELECT /*+ MAPJOIN(c) */ d.*
  FROM (SELECT DISTINCT memberid FROM log) c
  JOIN members d ON c.memberid = d.memberid
) x ON a.memberid = x.memberid;

Take all the memberids from the log, map join them against members to get the information of only the members who appear in today's log, and then map join that result with the log. If the log contained millions of distinct memberids, this would run back into the original map join problem. Fortunately, the daily member UV is never that large: members with transactions are not too many, members with clicks are not too many, members with commissions are not too many, and so on. So this method solves the data skew problem in many scenarios.

Question 7: a general Hive method for data skew is to inflate the relatively small table before joining; the technique is commonly used in MapReduce programs. Take the same problem:

SELECT * FROM log a
LEFT OUTER JOIN (
  SELECT /*+ MAPJOIN(e) */ memberid, number
  FROM members d JOIN num e
) b ON a.memberid = b.memberid AND mod(a.pvtime, 30) + 1 = b.number;

The num table has a single column, number, with 30 rows holding the natural numbers 1 through 30 (one way to build it is sketched below). The idea is to inflate the members table 30-fold and then spread the log data across reducers by memberid and pvtime, so that each reducer receives a roughly even share. As far as current tests go, the map join scheme performs slightly better. The following scheme is appropriate when a map join cannot solve the problem.
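For completeness, one hedged way to materialize the num table (the file path is illustrative; any 30-row source of the numbers 1 through 30 works):

CREATE TABLE num (number INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- numbers_1_to_30.txt holds the natural numbers 1 through 30, one per line
LOAD DATA LOCAL INPATH '/tmp/numbers_1_to_30.txt' INTO TABLE num;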

The following optimization scheme could be made into a general Hive optimization method:

1. Sample the log table to find which memberids are skewed, producing a result table tmp1. Because the computing framework knows nothing about the data distribution, sampling is unavoidable. (Stage 1)

2. Data distributions follow the sociological rule of wealth inequality: skewed keys are never numerous, just as the very rich are few and the truly strange are few. So tmp1 will have very few records. Map join tmp1 with members to produce tmp2, and ship tmp2 to the distributed file cache. This is a map-only process. (Stage 2)

3. The mappers read both members and log. For a record from log, check whether its memberid is in tmp2: if so, write it to a local file a; otherwise emit a <memberid, value> pair. For a record from members, emit a <memberid, value> pair. Then proceed to the reduce phase. (Stage 3)

4. Finally, merge file a with the output of the stage 3 reduce phase and write the result to HDFS.

This method should be achievable in Hadoop; stage 2 is a map-only process and can be merged with the map phase of stage 3. The goal of the scheme is to handle the skewed keys with a map join and the rest of the data with a normal join, then merge the two to get the complete result. Expressed in Hive SQL, the query splits into many statements and the log table is read more than once. The skewed keys are always few, which holds in most business contexts. So could this be used as Hive's general algorithm for skewed joins? A rough HiveQL sketch of the idea follows.
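A rough HiveQL sketch of this idea (not the article's code): the tables tmp1 and tmp2, the member_name column, the 1% sample rate, and the frequency threshold are all illustrative assumptions, and tmp1/tmp2 are assumed to be pre-created.

-- Stage 1: sample the log and keep the heavily skewed memberids (threshold is arbitrary).
INSERT OVERWRITE TABLE tmp1
SELECT memberid FROM (
  SELECT memberid, count(1) AS cnt
  FROM log
  WHERE rand() < 0.01
  GROUP BY memberid
) s
WHERE s.cnt > 1000;

-- Stage 2: member info for the skewed keys only; tmp1 is tiny, so map join it.
INSERT OVERWRITE TABLE tmp2
SELECT /*+ MAPJOIN(t) */ d.*
FROM tmp1 t JOIN members d ON t.memberid = d.memberid;

-- Stage 3: map join the skewed log rows against tmp2, ordinary join for the rest, then merge.
SELECT * FROM (
  SELECT /*+ MAPJOIN(b) */ a.*, b.member_name
  FROM log a JOIN tmp2 b ON a.memberid = b.memberid
  UNION ALL
  SELECT a2.*, d.member_name
  FROM (SELECT l.* FROM log l
        LEFT OUTER JOIN tmp1 t ON l.memberid = t.memberid
        WHERE t.memberid IS NULL) a2
  JOIN members d ON a2.memberid = d.memberid
) merged;

As the article notes, written this way the work splits into several statements and log is read more than once; a hand-written MapReduce job could fold stages 2 and 3 into a single pass.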

Question 8: optimize multi-granularity (same-level) UV calculations, for example computing shop UV plus page UV and page PV/IP.

Scheme 1:

SELECT shopid, count(DISTINCT uid) FROM log GROUP BY shopid;
SELECT pageid, count(DISTINCT uid) FROM log GROUP BY pageid;

Because of data skew, this runs for a very long time.

Scheme 2:

FROM log
INSERT OVERWRITE TABLE t1 PARTITION (type='1')
SELECT shopid
GROUP BY shopid, acookie
INSERT OVERWRITE TABLE t1 PARTITION (type='2')
SELECT pageid
GROUP BY pageid, acookie;

Shop UV:
SELECT shopid, sum(1) FROM t1 WHERE type = '1' GROUP BY shopid;

Page UV:
SELECT pageid, sum(1) FROM t1 WHERE type = '2' GROUP BY pageid;

This uses multi-insert to effectively reduce HDFS reads, but multi-insert adds HDFS writes (one extra HDFS write for the additional map stage). With this method you can successfully produce the results.

Scheme 3:

INSERT INTO TABLE t1
SELECT type, type_name, uid
FROM (
  SELECT 'page' AS type, pageid AS type_name, uid FROM log
  UNION ALL
  SELECT 'shop' AS type, shopid AS type_name, uid FROM log
) y
GROUP BY type, type_name, uid;

INSERT INTO TABLE t2
SELECT type, type_name, sum(1)
FROM t1
GROUP BY type, type_name;

FROM t2
INSERT OVERWRITE TABLE t3 SELECT type, type_name, uv WHERE type = 'page'
INSERT OVERWRITE TABLE t4 SELECT type, type_name, uv WHERE type = 'shop';

This finally produces two result tables: t3, the page UV table, and t4, the shop UV table. In terms of I/O, log is read once, and there are fewer HDFS writes than in scheme 2 (multi-insert sometimes adds an extra map-stage HDFS write). The number of jobs drops by one, to three, and the number of jobs with a reduce phase drops from four to two; the third step is a map-only pass over a small table with little computation. Scheme 2, by contrast, runs its large deduplication-and-aggregation passes over the full data twice. The main idea of this optimization: MapReduce job initialization is slow, so once a job is up, make it do more work; do the page-level deduplication by uid in the same pass, saving a read of log and a job initialization, and saving network shuffle I/O at the cost of extra local disk reads and writes. Efficiency improves. This scheme is suited to same-level (lateral) multi-granularity UV calculations; the more granularities there are, the more resources it saves, and it is quite general.

Question 9: multi-granularity UV rolled up level by level to the grand total. For example, with 4 dimensions a, b, c, d, compute the UV for (a, b, c, d), (a, b, c), (a, b), (a), and the total UV, each as its own result table. Scheme 2 from question 8 could be used here, but because this is a UV scenario where multiple granularities roll up level by level, a single sort can serve all of the UV calculations.

Case: the mm_log log currently has 2.5 billion+ PV per day, and UV and IP-UV must be computed from it at three granularities, producing the result tables r_table_4 (memberid, siteid, adzoneid, province, uv, ipuv), r_table_3 (memberid, siteid, adzoneid, uv, ipuv), and r_table_2 (memberid, siteid, uv, ipuv).

Step 1: group by memberid, siteid, adzoneid, province to deduplicate, producing a temporary table t_4; tag cookies and IPs, put them together, and deduplicate them in the same pass:

SELECT memberid, siteid, adzoneid, province, type, user
FROM (
  SELECT memberid, siteid, adzoneid, province, 'a' AS type, cookie AS user FROM mm_log WHERE ds = 20101205
  UNION ALL
  SELECT memberid, siteid, adzoneid, province, 'i' AS type, ip AS user FROM mm_log WHERE ds = 20101205
) x
GROUP BY memberid, siteid, adzoneid, province, type, user;

Step 2: rank, producing table t_4_num. Hadoop's strongest and most central capabilities are partition and sort. Distribute by (type, user) and, within each group, rank by (type, user, memberid, siteid, adzoneid):

SELECT *,
  row_number(type, user, memberid, siteid, adzoneid) AS adzone_num,
  row_number(type, user, memberid, siteid) AS site_num,
  row_number(type, user, memberid) AS member_num,
  row_number(type, user) AS total_num
FROM (SELECT * FROM t_4 DISTRIBUTE BY type, user SORT BY type, user, memberid, siteid, adzoneid) x;

This yields each user's rank at every granularity level; for the same user ID at a given granularity, exactly one record has rank 1. Summing over the records whose rank equals 1 gives the same result as summing after a GROUP BY user.

Step 3: compute the UV statistics at the different granularities, starting from the finest granularity and producing r_table_4; at that point the result set is only on the order of 100,000 rows. For example, the UV at (memberid, siteid, adzoneid, province) granularity is computed as:

SELECT memberid, siteid, adzoneid, province,
  sum(CASE WHEN type = 'a' THEN cast(1 AS bigint) END) AS province_uv,
  sum(CASE WHEN type = 'i' THEN cast(1 AS bigint) END) AS province_ip,
  sum(CASE WHEN adzone_num = 1 AND type = 'a' THEN cast(1 AS bigint) END) AS adzone_uv,
  sum(CASE WHEN adzone_num = 1 AND type = 'i' THEN cast(1 AS bigint) END) AS adzone_ip,
  sum(CASE WHEN site_num = 1 AND type = 'a' THEN cast(1 AS bigint) END) AS site_uv,
  sum(CASE WHEN site_num = 1 AND type = 'i' THEN cast(1 AS bigint) END) AS site_ip,
  sum(CASE WHEN member_num = 1 AND type = 'a' THEN cast(1 AS bigint) END) AS member_uv,
  sum(CASE WHEN member_num = 1 AND type = 'i' THEN cast(1 AS bigint) END) AS member_ip,
  sum(CASE WHEN total_num = 1 AND type = 'a' THEN cast(1 AS bigint) END) AS total_uv,
  sum(CASE WHEN total_num = 1 AND type = 'i' THEN cast(1 AS bigint) END) AS total_ip
FROM t_4_num
GROUP BY memberid, siteid, adzoneid, province;

For the ad-zone granularity UV, aggregate from r_table_4; the source table of this step is only on the order of 100,000 rows:

SELECT memberid, siteid, adzoneid, sum(adzone_uv), sum(adzone_ip)
FROM r_table_4
GROUP BY memberid, siteid, adzoneid;

The (memberid, siteid) UV, the memberid UV, and the total UV are likewise rolled up from r_table_4.


One. Join optimization

The basic principle for optimizing join operations: put the table or subquery with fewer rows on the left side of the join operator. The reason is that in the reduce phase of a join, the contents of the table on the left side of the join operator are loaded into memory, so putting the smaller table on the left effectively reduces the chance of an out-of-memory error.

If a query contains more than one join and all the joins use the same key, the joins are merged into a single MapReduce job. Case:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);

executes the joins in one MapReduce job, while

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2);

executes the joins in two MapReduce jobs.

Map join: the key point is that one of the tables in the join has a small amount of data. Case:

SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key;

Map join limitation: it cannot perform a FULL or RIGHT OUTER JOIN on table b. Related Hive parameters: hive.join.emit.interval, hive.mapjoin.size.key, hive.mapjoin.cache.numrows.

Because the join is executed before the WHERE clause, the WHERE condition does not reduce the amount of data entering the join. Case:

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key) WHERE a.ds = '2009-07-07' AND b.ds = '2009-07-07';

is best rewritten as:

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key AND b.ds = '2009-07-07' AND a.ds = '2009-07-07');

In each MapReduce job of a join, Hive streams the last table that appears in the JOIN clause and caches the data of the earlier tables in memory. You can also manually specify the streamed table:

SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);


Two. GROUP BY optimization

Map-side aggregation: do a partial aggregation on the map side first, and produce the final result on the reduce side. Related parameters:

hive.map.aggr = true -- whether to aggregate on the map side; default true
hive.groupby.mapaggr.checkinterval = 100000 -- number of rows processed on the map side before the aggregation check

Skewed-data aggregation optimization: set hive.groupby.skewindata = true. When this option is true, the generated query plan contains two MR jobs. In the first MR job, the map output is distributed randomly to the reducers; each reducer performs a partial aggregation and emits its result, so records with the same GROUP BY key may go to different reducers, achieving load balancing. The second MR job then distributes the pre-aggregated results to the reducers by the GROUP BY key (which guarantees that identical GROUP BY keys end up on the same reducer) and completes the final aggregation.
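A minimal usage sketch (t1 and c1 are assumed names, not objects from this article):

set hive.map.aggr=true;                          -- partial aggregation on the map side
set hive.groupby.mapaggr.checkinterval=100000;   -- rows processed before the map-side aggregation check
set hive.groupby.skewindata=true;                -- generate the two-job, load-balanced plan described above
SELECT c1, count(1) FROM t1 GROUP BY c1;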


Three. Merging small files

Too many small files put pressure on HDFS and hurt processing efficiency. The map and reduce output files can be merged to eliminate this effect:

hive.merge.mapfiles = true -- merge the output files of map-only jobs; default true
hive.merge.mapredfiles = false -- merge the output files of jobs with a reduce phase; default false
hive.merge.size.per.task = 256*1000*1000 -- target size of the merged files
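A hedged usage sketch (t_out and t_in are illustrative table names; the size value matches the default mentioned above):

set hive.merge.mapfiles=true;            -- merge small files produced by map-only jobs
set hive.merge.mapredfiles=true;         -- also merge the output files of jobs with a reduce phase
set hive.merge.size.per.task=256000000;  -- target size of each merged file, in bytes
INSERT OVERWRITE TABLE t_out SELECT * FROM t_in;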


Four. Implementing (NOT) IN in Hive

NOT IN is implemented with a LEFT OUTER JOIN (assuming table b also contains a field key1):

SELECT a.key FROM a LEFT OUTER JOIN b ON a.key = b.key WHERE b.key1 IS NULL;

IN is implemented with LEFT SEMI JOIN:

SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON (a.key = b.key);

LEFT SEMI JOIN restriction: columns of the table on the right side of the join may appear only in the join condition.


Five. Sort optimization

ORDER BY produces a global ordering with a single reducer and is inefficient. SORT BY produces a partial ordering: the output of each individual reducer is ordered, which is efficient; it is usually used together with DISTRIBUTE BY (DISTRIBUTE BY specifies the key used to distribute rows from the map side to the reducers). CLUSTER BY col1 is equivalent to DISTRIBUTE BY col1 SORT BY col1.
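For example (t1 and col1 are assumed names):

SELECT * FROM t1 ORDER BY col1;                    -- global order, single reducer
SELECT * FROM t1 DISTRIBUTE BY col1 SORT BY col1;  -- rows with the same col1 go to one reducer, and each reducer's output is sorted
SELECT * FROM t1 CLUSTER BY col1;                  -- shorthand for the previous statement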


Six. Partitions

Every partition in Hive corresponds to a directory on HDFS. Partition columns are not actual fields in the table but one or more pseudo-columns; the values of the partition columns are not actually stored in the table's data files. The first column named in the partition specification is the primary partition (there can be only one), and the columns after it are secondary partitions.

Static partitions: static partitions require the partition values to be specified in the SQL statement when loading data. Usage example: (stat_date='20120625', province='Hunan')

Dynamic partitions: using dynamic partitions requires hive.exec.dynamic.partition to be set to true (the default is false). By default, Hive assumes that the primary partition is static and the secondary partition may be dynamic; to use dynamic partitioning for the primary partition as well, you must also set hive.exec.dynamic.partition.mode=nonstrict (the default is strict). Usage example: (stat_date='20120625', province)
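A sketch of both styles, assuming a hypothetical table t_part partitioned by (stat_date string, province string) and a source table t_src with columns col1, col2, prov, dt:

-- Static partition: both partition values are fixed in the statement.
INSERT OVERWRITE TABLE t_part PARTITION (stat_date='20120625', province='Hunan')
SELECT col1, col2 FROM t_src WHERE dt='20120625' AND prov='Hunan';

-- Dynamic secondary partition: province is taken from the last column of the SELECT.
set hive.exec.dynamic.partition=true;
INSERT OVERWRITE TABLE t_part PARTITION (stat_date='20120625', province)
SELECT col1, col2, prov FROM t_src WHERE dt='20120625';

-- To make the primary partition dynamic as well:
set hive.exec.dynamic.partition.mode=nonstrict;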


Seven. DISTINCT

Hive supports multiple DISTINCT aggregates on the same column in a GROUP BY query, but does not support DISTINCT aggregates on different columns in the same statement.
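For example, with assumed names t1(c1, uid, ip) and a numeric uid, the first statement is supported while the second is not (in the Hive versions this article describes):

SELECT c1, count(DISTINCT uid), sum(DISTINCT uid) FROM t1 GROUP BY c1;      -- several DISTINCT aggregates on the same column: supported
-- SELECT c1, count(DISTINCT uid), count(DISTINCT ip) FROM t1 GROUP BY c1;  -- DISTINCT on different columns: not supported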


Eight. Using custom MapReduce scripts in HQL

Note: when using a custom MapReduce script, the keywords MAP and REDUCE are only syntactic shorthand for SELECT TRANSFORM (...); using the MAP keyword does not force an extra map phase, nor does the REDUCE keyword necessarily produce a reduce phase. A custom MapReduce script lets an HQL statement implement more complex logic, but it performs worse than plain HQL and should be avoided where possible; if you can, use a UDTF instead of a custom MapReduce script.
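A minimal sketch of the TRANSFORM form (my_mapper.py is a hypothetical script that reads tab-separated uid/pageid lines from stdin and writes uid and a count; the path is illustrative):

ADD FILE /path/to/my_mapper.py;
SELECT TRANSFORM (uid, pageid)
USING 'python my_mapper.py'
AS (uid, page_count)
FROM log;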


Nine. UDTF

A UDTF converts a single input row into multiple output rows. When using a UDTF, the SELECT statement cannot contain other columns, UDTFs cannot be nested, and clauses such as GROUP BY and SORT BY are not supported. To get around these limitations, use the LATERAL VIEW syntax. Case:

SELECT a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appevents, '$.eventname') FROM log a;

SELECT a.timestamp, b.* FROM log a LATERAL VIEW json_tuple(a.appevents, 'eventid', 'eventname') b AS f1, f2;

Here get_json_object is a UDF and json_tuple is a UDTF. In scenarios such as parsing multiple fields out of JSON or XML data, a UDTF can greatly improve the performance of HQL statements.






