Hive optimization (important)


Hive optimization essentials: when optimizing, read Hive SQL as the MapReduce program it compiles into, and you will find unexpected insights.

Understanding the core capabilities of Hadoop is fundamental to Hive optimization. Long-term observation of how Hadoop processes data reveals several notable characteristics:

1. Hadoop is not afraid of large data volumes; it is afraid of data skew.

2. Queries that break into many MapReduce jobs run inefficiently. For example, even if every table involved has only a few hundred rows, a query with multiple joins and multiple aggregations can produce a dozen or more jobs and take well over half an hour, because MapReduce job initialization is relatively slow.

3. Aggregations such as SUM and COUNT do not suffer from data skew.

4. COUNT(DISTINCT) is inefficient: the larger the data volume, the worse the problem, and multiple COUNT(DISTINCT)s make it even slower (a common rewrite is sketched after this list).
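For illustration, a common way to sidestep a slow or skewed COUNT(DISTINCT) is to de-duplicate with GROUP BY first and then count; the table log and the column uid below are hypothetical:

    -- Single-stage form: all distinct uids funnel through very few reducers.
    -- SELECT count(DISTINCT uid) FROM log;
    -- Two-stage rewrite: the GROUP BY spreads de-duplication across many reducers,
    -- and the final count runs over a much smaller intermediate set.
    SELECT count(1)
    FROM (SELECT uid FROM log GROUP BY uid) t;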

Optimization can be undertaken in several ways:

1. Good data model design makes the optimization effort much smaller.

2. Resolve data skew issues.

3. Reduce job count.

4. Set a reasonable number of MapReduce tasks; this can improve performance noticeably (for example, using 160 reducers for a calculation over roughly 100,000 rows is wasteful; one reducer is enough). See the sketch after this list.

5. Writing your own SQL to resolve data skew is a good choice. Setting hive.groupby.skewindata=true is a generic, algorithm-level optimization, but algorithm-level optimization ignores the business and habitually offers a one-size-fits-all solution. ETL developers know the business and the data better, so resolving skew through business logic is often more accurate and more effective.

6. Do not take COUNT(DISTINCT) lightly: when the data volume is large it is prone to skew, so do not count on luck; handle it explicitly yourself.

7. Merging small files is an effective way to improve scheduling efficiency. If our jobs produce a reasonable number of files, the overall scheduling efficiency of the platform also benefits.

8. Look at the whole picture when optimizing: making a single job optimal is inferior to making the overall workflow optimal.
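As an illustration of point 4, the reducer count can be pinned per session before the query it should affect; the value and the query on the hypothetical table shop_stats are only examples:

    -- For a calculation over roughly 100,000 rows, one reducer is enough.
    SET mapred.reduce.tasks = 1;
    SELECT shopid, count(1) FROM shop_stats GROUP BY shopid;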

Optimization cases:

Issue 1: information is often missing from logs. For example, user_id is frequently null in the site-wide log; if you join that log with bmw_users on user_id, you will run into data skew.

Method: resolve the data skew.

Workaround 1: rows whose user_id is null do not participate in the join, for example:

SELECT * FROM log a JOIN bmw_users b ON a.user_id IS NOT NULL AND a.user_id = b.user_id
UNION ALL
SELECT * FROM log a WHERE a.user_id IS NULL;

Workaround 2:

SELECT *
FROM log a
LEFT OUTER JOIN bmw_users b
  ON CASE WHEN a.user_id IS NULL THEN concat('dp_hive', rand()) ELSE a.user_id END = b.user_id;

Summary: workaround 2 is more efficient than workaround 1; it does less IO and needs fewer jobs. Workaround 1 reads the log table twice and produces 2 jobs, while workaround 2 produces 1 job. This optimization fits skew caused by invalid IDs (such as -99, "", null, and so on): turning the null key into a string plus a random number spreads the skewed data across different reducers and resolves the skew. Because the null values cannot match anything in the join, distributing them across different reducers does not affect the final result. For reference, the common Hadoop implementation of a join uses a secondary sort: the join column is the partition key, the join column plus the table's tag form the group (sort) key, reducers are assigned by the partition key, and rows within the same reducer are sorted by the group key.

Issue 2: joining IDs of different data types causes data skew. A log table s8 has one record per item and must be joined with the commodity table, and the join hit skew. The s8 log contains both string commodity IDs and numeric commodity IDs, and its column type is string, while the ID column in the commodity table is bigint. The guess was that Hive converts s8's commodity ID to a number and hashes that to assign reducers, so all of the string IDs in the s8 log land on a single reducer; the fix below confirmed the guess.

Method: convert the numeric type to a string:

SELECT *
FROM s8_log a
LEFT OUTER JOIN r_auction_auctions b
  ON a.auction_id = cast(b.auction_id AS string);

Issue 3: UNION ALL optimization in Hive. Hive's UNION ALL optimization is limited to non-nested queries. For example:
SELECT *
FROM (
  SELECT * FROM t1 GROUP BY c1, c2, c3
  UNION ALL
  SELECT * FROM t2 GROUP BY c1, c2, c3
) t3
GROUP BY c1, c2, c3;

From the business logic, the GROUP BY in the subqueries looks redundant (it is functionally redundant unless there is a COUNT(DISTINCT)); it was there either because of a Hive bug or for performance (there was once a Hive bug where, without the GROUP BY in the subquery, the data did not come out correct). So, based on experience, it is rewritten as:

SELECT *
FROM (
  SELECT * FROM t1
  UNION ALL
  SELECT * FROM t2
) t3
GROUP BY c1, c2, c3;

After testing, the UNION ALL Hive bug no longer appears and the results are consistent. The number of MapReduce jobs drops from 3 to 1. t1 is equivalent to one directory and t2 to another, so for a MapReduce program t1 and t2 can serve as multiple inputs of a single job; the whole query is then solved by one MapReduce job. Hadoop's computing framework is not afraid of large data, it is afraid of many jobs. On a different computing platform such as Oracle this is not necessarily true: splitting a large input into two inputs, sorting each, and merging the aggregated results afterwards (if the two sub-sorts run in parallel) may perform better, in the same way that Shell sort outperforms bubble sort.
Issue 4: for example, the promotion-effect table must be joined with the commodity table. The auction_id column of the effect table contains both string IDs and numeric IDs, and it is joined with the commodity table to pick up product information. The following Hive SQL performs better:

SELECT *
FROM effect a
JOIN (
  SELECT auction_id AS auction_id FROM auctions
  UNION ALL
  SELECT auction_string_id AS auction_id FROM auctions
) b
ON a.auction_id = b.auction_id;

This is better than filtering out the numeric IDs and the string IDs separately and joining each with the commodity table. The benefits of this form: 1 MapReduce job, the commodity table is read only once, and the promotion-effect table is read only once. Turning this SQL into MapReduce code: in the map phase, each record of a is tagged a; for every record read from the commodity table, tagged b, two <key, value> pairs are emitted, one keyed by the numeric ID and one keyed by the string ID. So the commodity table is read from HDFS only once.

Issue 5: when UNION ALL over a nested query that contains a join is a problem, first materialize the join into a temporary table and then UNION ALL. For example:
SELECT *
FROM (
  SELECT * FROM t1
  UNION ALL
  SELECT * FROM t4
  UNION ALL
  SELECT * FROM t2 JOIN t3 ON t2.id = t3.id
) x
GROUP BY c1, c2;

This form needs 4 jobs. If the join first generates a temporary table t5 and the UNION ALL runs afterwards, it becomes 2 jobs:

INSERT OVERWRITE TABLE t5
SELECT * FROM t2 JOIN t3 ON t2.id = t3.id;

SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t4 UNION ALL SELECT * FROM t5) x GROUP BY c1, c2;

Hive could be smarter about UNION ALL optimization (treating each subquery as a temporary table), which would reduce the burden on developers. The root cause should be that the current UNION ALL optimization is limited to non-nested queries. If you were writing the MapReduce program yourself this would not be a problem; it is simply multiple inputs.

Issue 6: use map join to solve the skew of joining a small table with a large table, but what if the small table is not that small? This situation comes up very often, and when the "small" table is large, map join hits a bug or an exception and special handling is required. The following example:
SELECT * FROM log a LEFT OUTER JOIN members b ON a.memberid = b.memberid;

The members table has more than 6 million rows, so distributing it to all of the map tasks is not a small cost, and map join does not support tables that large. Yet an ordinary join runs into data skew. Workaround:

SELECT /*+mapjoin(x)*/ *
FROM log a
LEFT OUTER JOIN (
  SELECT /*+mapjoin(c)*/ d.*
  FROM (SELECT DISTINCT memberid FROM log) c
  JOIN members d ON c.memberid = d.memberid
) x
ON a.memberid = x.memberid;

First take all the distinct memberid values from the log, map-join them with members to pick up the member information of today's active members, and then map-join that result back with the log. If the log contained millions of distinct memberid values, this would run back into the original map-join problem. Fortunately the daily member UV is never that large: there are not that many members with transactions, not that many members with clicks, not that many members earning commission, and so on. So this approach solves the data-skew problem in many scenarios.

Issue 7: a generic Hive data-skew solution is to multiply (replicate) the relatively small table; this method is also commonly used in MapReduce programs. For the same problem as above:
SELECT *
FROM log a
LEFT OUTER JOIN (
  SELECT /*+mapjoin(e)*/ memberid, number
  FROM members d
  JOIN num e
) b
ON a.memberid = b.memberid AND mod(a.pvtime, 30) + 1 = b.number;

The num table has a single column, number, with 30 rows holding the natural numbers 1 to 30. The idea is to inflate the members table 30 times and then distribute the log data across reducers by memberid and pvtime, so that the data assigned to each reducer is relatively even. In the tests so far, the map-join scheme of issue 6 performs slightly better; the replication scheme is appropriate when map join cannot solve the problem.

The following scheme could be turned into a generic Hive optimization for skewed joins:

1. Sample the log table to find which memberid values are skewed and store them in a result table tmp1. (The computing framework does not know the data distribution when the data arrives, so the sampling step cannot be skipped.) This is stage 1.

2. Data distribution follows the usual sociological rule of inequality between rich and poor: the skewed keys are few, just as the rich are few in any society. So tmp1 has very few records. Map-join tmp1 with members to produce tmp2, and load tmp2 into the distributed file cache. This is a map-only process, stage 2.

3. The map phase reads both members and log. For a record from log, check whether its memberid is in tmp2; if it is, write the record to local file A, otherwise emit a <memberid, value> pair. For a record from members, emit a <memberid, value> pair. Both go on to the reduce phase. This is stage 3.

4. Finally, merge file A with the output of stage 3's reduce phase and write the result to HDFS.

This approach should be achievable in Hadoop, and the stage 2 map can be folded into the stage 3 map so they form a single map phase. The goal is: handle the skewed keys with map join and the non-skewed keys with an ordinary join, then merge the two to obtain the complete result. Expressed in Hive SQL, this turns into several statements and the log table is read more than once. The skewed keys are always few, and that holds in most business contexts, so could this be adopted as a generic Hive algorithm for skewed joins?
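A minimal HiveQL sketch of that split, assuming hypothetical tables tmp_skew_keys, result_skew and result_rest, a hypothetical column members.nick, and an illustrative skew threshold; it is not the author's exact code, only the shape of the idea (skewed keys via map join, the rest via an ordinary join, results merged at the end):

    -- Stage 1: sample/count to find the skewed join keys (threshold is illustrative).
    INSERT OVERWRITE TABLE tmp_skew_keys
    SELECT memberid FROM log GROUP BY memberid HAVING count(1) > 100000;

    -- Skewed keys: only a handful of members rows match, so map join avoids the skew.
    INSERT OVERWRITE TABLE result_skew
    SELECT /*+mapjoin(x)*/ a.*, x.nick
    FROM log a
    JOIN (SELECT /*+mapjoin(k)*/ d.memberid, d.nick
          FROM members d JOIN tmp_skew_keys k ON d.memberid = k.memberid) x
      ON a.memberid = x.memberid;

    -- Non-skewed keys: an ordinary join is fine because no single key dominates.
    INSERT OVERWRITE TABLE result_rest
    SELECT a.*, d.nick
    FROM (SELECT /*+mapjoin(k)*/ l.*
          FROM log l LEFT OUTER JOIN tmp_skew_keys k ON l.memberid = k.memberid
          WHERE k.memberid IS NULL) a
    JOIN members d ON a.memberid = d.memberid;

    -- The complete join result is result_skew UNION ALL result_rest.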
Issue 8: optimization of multi-granularity (same-level) UV calculations, for example computing the UV of shops while also computing the UV and PV-IP of pages.

Scenario 1:

SELECT shopid, count(DISTINCT uid) FROM log GROUP BY shopid;
SELECT pageid, count(DISTINCT uid) FROM log GROUP BY pageid;

Because of data skew, these queries run for a very long time.

Scenario 2:

FROM log
INSERT OVERWRITE TABLE t1 PARTITION (type='1')
SELECT shopid, acookie GROUP BY shopid, acookie
INSERT OVERWRITE TABLE t1 PARTITION (type='2')
SELECT pageid, acookie GROUP BY pageid, acookie;

Shop UV:
SELECT shopid, sum(1) FROM t1 WHERE type = '1' GROUP BY shopid;

Page UV:
SELECT pageid, sum(1) FROM t1 WHERE type = '2' GROUP BY pageid;

This uses the multi-insert form, which effectively reduces HDFS reads, but multi-insert adds HDFS writes: one extra HDFS write for the additional map phase. With this method the results can be produced smoothly.

Scenario 3:

INSERT OVERWRITE TABLE t1
SELECT type, type_name, uid
FROM (
  SELECT 'page' AS type, pageid AS type_name, uid FROM log
  UNION ALL
  SELECT 'shop' AS type, shopid AS type_name, uid FROM log
) y
GROUP BY type, type_name, uid;

INSERT OVERWRITE TABLE t2
SELECT type, type_name, sum(1) AS uv
FROM t1
GROUP BY type, type_name;

FROM t2
INSERT OVERWRITE TABLE t3 SELECT type, type_name, uv WHERE type = 'page'
INSERT OVERWRITE TABLE t4 SELECT type, type_name, uv WHERE type = 'shop';

This finally yields two result tables: t3, the page UV table, and t4, the shop UV table. In terms of IO, log is read only once, yet there are fewer HDFS writes than in scenario 2 (multi-insert sometimes adds an extra map-phase HDFS write). The number of jobs drops by 1, to 3, and the number of jobs with a reduce phase drops from 4 to 2; the third step is a map-only process over a small table that splits it into the result tables and consumes little compute. Scenario 2, by contrast, performs two massive de-duplication aggregations.

The main idea of this optimization: MapReduce job initialization is expensive, so once a job is up, let it do more work; while computing the shop UV, also do the per-page de-duplication by uid. This saves one read of log and one job initialization, and saves shuffle network IO, at the cost of more local disk reads and writes. Overall it improves efficiency. This scheme suits same-level multi-granularity UV calculations that do not need to be rolled up into each other; the more granularities, the more resources it saves, and it is fairly general.
Issue 9: multi-granularity UV calculations that roll up level by level. For example, with four dimensions a, b, c, d, compute the UV for (a,b,c,d), (a,b,c), (a,b), (a), and the total UV. Scenario 2 of issue 8 could be used here, but because of the special nature of the UV case (multi-granularity, rolled up level by level), a single sort can serve every one of the UV calculations.

Case: the mm_log log currently has 2.5 billion+ PV per day, and the goal is to compute UV and IP-UV from it at three granularities, into three result tables:

r_table_4 (memberid, siteid, adzoneid, province, uv, ipuv)
r_table_3 (memberid, siteid, adzoneid, uv, ipuv)
r_table_2 (memberid, siteid, uv, ipuv)

Step 1: de-duplicate with GROUP BY on memberid, siteid, adzoneid, province and generate a temporary table t_4, putting the cookie and IP labels together and de-duplicating them together:

SELECT memberid, siteid, adzoneid, province, type, user
FROM (
  SELECT memberid, siteid, adzoneid, province, 'a' AS type, cookie AS user
  FROM mm_log WHERE ds = 20101205
  UNION ALL
  SELECT memberid, siteid, adzoneid, province, 'i' AS type, ip AS user
  FROM mm_log WHERE ds = 20101205
) x
GROUP BY memberid, siteid, adzoneid, province, type, user;

Step 2: rank, producing table t_4_num. Hadoop's strongest and most central capabilities are partitioning and sorting; rows are distributed by (type, user) and ranked within (type, user), (type, user, memberid), (type, user, memberid, siteid), and (type, user, memberid, siteid, adzoneid):

SELECT *,
  row_number(type, user, memberid, siteid, adzoneid) AS adzone_num,
  row_number(type, user, memberid, siteid) AS site_num,
  row_number(type, user, memberid) AS member_num,
  row_number(type, user) AS total_num
FROM (
  SELECT * FROM t_4
  DISTRIBUTE BY type, user
  SORT BY type, user, memberid, siteid, adzoneid
) x;

This ranks each user at every granularity: at a given granularity, each distinct user ID has exactly one record whose rank equals 1. Summing over the rows whose rank equals 1 is therefore equivalent to doing the sum after a GROUP BY user.

Step 3: compute the UV statistics granularity by granularity, starting from the finest one, which yields the result table r_table_4; at this point the result set is only on the order of 100,000 rows. The UV at the (memberid, siteid, adzoneid, province) granularity is computed as:

SELECT memberid, siteid, adzoneid, province,
  sum(CASE WHEN type = 'a' THEN cast(1 AS bigint) END) AS province_uv,
  sum(CASE WHEN type = 'i' THEN cast(1 AS bigint) END) AS province_ip,
  sum(CASE WHEN adzone_num = 1 AND type = 'a' THEN cast(1 AS bigint) END) AS adzone_uv,
  sum(CASE WHEN adzone_num = 1 AND type = 'i' THEN cast(1 AS bigint) END) AS adzone_ip,
  sum(CASE WHEN site_num = 1 AND type = 'a' THEN cast(1 AS bigint) END) AS site_uv,
  sum(CASE WHEN site_num = 1 AND type = 'i' THEN cast(1 AS bigint) END) AS site_ip,
  sum(CASE WHEN member_num = 1 AND type = 'a' THEN cast(1 AS bigint) END) AS member_uv,
  sum(CASE WHEN member_num = 1 AND type = 'i' THEN cast(1 AS bigint) END) AS member_ip,
  sum(CASE WHEN total_num = 1 AND type = 'a' THEN cast(1 AS bigint) END) AS total_uv,
  sum(CASE WHEN total_num = 1 AND type = 'i' THEN cast(1 AS bigint) END) AS total_ip
FROM t_4_num
GROUP BY memberid, siteid, adzoneid, province;

The ad-zone granularity UV can then be computed from r_table_4, that is, from a source table of only about 100,000 rows:

SELECT memberid, siteid, adzoneid, sum(adzone_uv), sum(adzone_ip)
FROM r_table_4
GROUP BY memberid, siteid, adzoneid;

The (memberid, siteid) UV, the memberid UV, and the total UV are likewise aggregated from r_table_4.
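For reference only: on Hive releases that include windowing functions (0.11 and later), the ranking in step 2 could be expressed with the standard row_number() window function rather than the row_number UDF used above; this rewrite is an assumption about newer Hive versions, not part of the original scheme:

    SELECT t.*,
      row_number() OVER (PARTITION BY type, user, memberid, siteid, adzoneid ORDER BY user) AS adzone_num,
      row_number() OVER (PARTITION BY type, user, memberid, siteid ORDER BY adzoneid) AS site_num,
      row_number() OVER (PARTITION BY type, user, memberid ORDER BY siteid, adzoneid) AS member_num,
      row_number() OVER (PARTITION BY type, user ORDER BY memberid, siteid, adzoneid) AS total_num
    FROM t_4 t;
    -- Note: on newer Hive versions the column name user may be a reserved word
    -- and need backquotes.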
I. Join optimization

The basic principle of optimizing join operations: put the table or subquery with fewer rows on the left-hand side of the JOIN operator. The reason is that in the reduce phase of the join, the content of the table on the left of the JOIN operator is loaded into memory, so putting the smaller table on the left effectively reduces the chance of out-of-memory errors.

If a statement contains multiple joins and all of them join on the same key, the joins are merged into a single MapReduce program. Case:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);

executes the joins in one MapReduce program, while

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2);

executes the joins in two MapReduce programs.

MAP JOIN: the key point is that the data volume of one table in the join is very small. Case:
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key;

A limitation of map join is that it cannot perform a FULL or RIGHT OUTER JOIN. Hive parameters related to map join: hive.join.emit.interval, hive.mapjoin.size.key, hive.mapjoin.cache.numrows.

Because the join is executed before the WHERE clause, the WHERE condition does not reduce the amount of data taking part in the join. Case:

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key) WHERE a.ds = '2009-07-07' AND b.ds = '2009-07-07';

is best rewritten as:

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key AND b.ds = '2009-07-07' AND a.ds = '2009-07-07');

In each MapReduce program of a join, Hive streams the data of the last table appearing in the JOIN clause and caches the data of the earlier tables in memory. You can also specify the streamed table manually:

SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);
II. GROUP BY optimization

Map-side aggregation: perform a partial aggregation on the map side first and obtain the final result on the reduce side. Related parameters:

hive.map.aggr = true: whether to aggregate on the map side; the default is true.
hive.groupby.mapaggr.checkinterval = 100000: the number of rows at which the map-side aggregation takes place.

Skew-aware aggregation: set hive.groupby.skewindata = true. When this option is true, the generated query plan contains two MR jobs. In the first job, the map output is distributed to reducers randomly and each reducer does a partial aggregation and emits its result; because rows with the same GROUP BY key can end up on different reducers, the load is balanced. The second MR job then distributes the pre-aggregated results to reducers by the GROUP BY key (which guarantees that identical keys reach the same reducer) and completes the final aggregation.
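For illustration, these parameters could be set per session before the query they should affect; the values mirror the ones above, and the query on the hypothetical log table is only an example:

    SET hive.map.aggr = true;
    SET hive.groupby.mapaggr.checkinterval = 100000;
    SET hive.groupby.skewindata = true;
    SELECT memberid, count(1) FROM log GROUP BY memberid;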
III. Merging small files

Too many small files put pressure on HDFS and hurt processing efficiency. The effect can be eliminated by merging the map and reduce result files:

hive.merge.mapfiles = true: whether to merge map output files; the default is true.
hive.merge.mapredfiles = false: whether to merge reduce output files; the default is false.
hive.merge.size.per.task = 256*1000*1000: the size of the merged files.
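For example, merging of reduce-side output could be switched on for a session like this; the size value repeats the one above and the target table daily_stats is hypothetical:

    SET hive.merge.mapfiles = true;
    SET hive.merge.mapredfiles = true;   -- also merge the files produced by reducers
    SET hive.merge.size.per.task = 256000000;
    INSERT OVERWRITE TABLE daily_stats
    SELECT shopid, count(1) FROM log GROUP BY shopid;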
IV. Implementing (NOT) IN with Hive

NOT IN can be implemented through a LEFT OUTER JOIN (assuming table b has an additional field key1):

SELECT a.key FROM a LEFT OUTER JOIN b ON a.key = b.key WHERE b.key1 IS NULL;

IN can be implemented through a LEFT SEMI JOIN:

SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON (a.key = b.key);

LEFT SEMI JOIN restriction: columns of the right-hand table may appear only in the JOIN condition.
V. Ordering optimization

ORDER BY produces a globally ordered result but is implemented with a single reducer, so it is inefficient. SORT BY produces a partial order: the output of each reducer is ordered, which is efficient; it is usually used together with DISTRIBUTE BY (DISTRIBUTE BY specifies the key by which map output is distributed to the reducers). CLUSTER BY col1 is equivalent to DISTRIBUTE BY col1 SORT BY col1.
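A small contrast on a hypothetical log table; the column names are illustrative:

    -- Global order, single reducer: slow on large data.
    SELECT memberid, pvtime FROM log ORDER BY pvtime;

    -- Per-reducer order: rows with the same memberid go to the same reducer
    -- (DISTRIBUTE BY) and are sorted within it (SORT BY).
    SELECT memberid, pvtime FROM log DISTRIBUTE BY memberid SORT BY memberid, pvtime;

    -- Shorthand when the distribute key and the sort key are identical.
    SELECT memberid, pvtime FROM log CLUSTER BY memberid;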
VI. Using partitions

Each partition in Hive corresponds to a directory on HDFS. Partition columns are not actual fields of the table but one or more pseudo-columns: the partition column's information and data are not actually stored in the table's data files. The first name after the PARTITION keyword is the primary partition (there can be only one), followed by the secondary partition.

Static partitioning: the partition must be specified in the SQL statement when the data is loaded. Case and usage: (stat_date='20120625', province='hunan').

Dynamic partitioning: to use dynamic partitioning, set hive.exec.dynamic.partition to true (the default is false). By default Hive assumes that the primary partition is static and only the secondary partition is dynamic; if you want the primary partition to be dynamic as well, you also need hive.exec.dynamic.partition.mode=nonstrict (the default is strict). Case: (stat_date='20120625', province).
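A sketch of both forms, assuming a hypothetical table visit_stat(uid) partitioned by (stat_date, province) and a hypothetical source table visit_src; the names and columns are illustrative:

    -- Static partition: both partition values are given literally.
    INSERT OVERWRITE TABLE visit_stat PARTITION (stat_date='20120625', province='hunan')
    SELECT uid FROM visit_src WHERE dt='20120625' AND province='hunan';

    -- Dynamic secondary partition: province is taken from the selected data.
    SET hive.exec.dynamic.partition = true;
    INSERT OVERWRITE TABLE visit_stat PARTITION (stat_date='20120625', province)
    SELECT uid, province FROM visit_src WHERE dt='20120625';
    -- If the primary partition should be dynamic too, additionally:
    -- SET hive.exec.dynamic.partition.mode = nonstrict;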
VII. DISTINCT usage

Hive supports multiple DISTINCT aggregations over the same column in one GROUP BY statement, but it does not support DISTINCT aggregations over different columns in the same statement.
VIII. Notes on using custom MapReduce scripts in HQL

When using a custom MapReduce script, the keywords MAP and REDUCE are merely syntactic sugar for SELECT TRANSFORM (...); using the MAP keyword does not force a new map phase, and using the REDUCE keyword does not force a reduce phase. A custom MapReduce script can implement functions that are hard to express in an HQL statement, but its performance is somewhat worse than native HQL, so it should be avoided where possible; prefer UDTF functions over custom MapReduce scripts.
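A minimal TRANSFORM sketch for reference; the script name my_parser.py and the columns are hypothetical, not from the source:

    -- Ship the script with the job, then stream rows through it.
    ADD FILE my_parser.py;
    SELECT TRANSFORM (memberid, pvtime)
           USING 'python my_parser.py'
           AS (memberid, pvhour)
    FROM log;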
IX. UDTF

A UDTF turns a single input row into multiple output rows. When a UDTF is used, the SELECT statement cannot contain other columns, UDTFs cannot be nested, and GROUP BY, SORT BY and so on are not supported. To get around these limitations, use the LATERAL VIEW syntax. Case:

SELECT a.timestamp,
       get_json_object(a.appevents, '$.eventid'),
       get_json_object(a.appevents, '$.eventname')
FROM log a;

SELECT a.timestamp, b.*
FROM log a
LATERAL VIEW json_tuple(a.appevents, 'eventid', 'eventname') b AS f1, f2;

Here get_json_object is a UDF and json_tuple is a UDTF. In some scenarios, such as applications that need to parse JSON or XML data repeatedly, UDTF functions can greatly improve the performance of HQL statements.
