Hive tuning practices

 

Hive optimization Summary

--- By Hualien

When Hive SQL is executed as MapReduce, there are often unexpected surprises.

 

Understanding Hadoop is the foundation of Hive optimization. This is a summary of the valuable experience accumulated by all members of the project team over the past year.

 

 

 

Long-term observation of how Hadoop processes data reveals several notable characteristics:

 

1. It is not a large volume of data that hurts, but data skew.

 

2. Jobs that generate many MapReduce jobs run inefficiently. For example, joining and summarizing a table of only a few hundred rows several times can produce a dozen or more jobs, which will not finish within half an hour, because the initialization time of each MapReduce job is long.

 

3. There is no data skew problem for sum and count.

 

4. count(distinct) is inefficient; with a large data volume it is prone to problems, and multiple count(distinct) expressions are even worse.

 

 

 

Optimization can be performed in the following aspects:

 

1. Better model design.

 

2. Solve the data skew problem.

 

3. Reduce the number of jobs.

 

4. Setting a reasonable number of map and reduce tasks can effectively improve performance (for example, launching 160-plus reduce tasks for a small computation is a waste; one is enough).

 

5. Writing the SQL yourself to resolve data skew is a good choice. set hive.groupby.skewindata=true; is a general algorithmic optimization, but an algorithmic optimization always ignores the business and can only offer a generic solution. ETL developers know the business and the data better, so solutions driven by business logic are often more accurate and more effective. (A sketch of this setting, together with a count(distinct) rewrite, follows this list.)

 

6. Be wary of the count(distinct) approach, especially on large data sets, where it easily causes skew; do not trust to luck. Roll your own de-duplication instead (see the sketch after this list).

 

7. Merging small files is an effective way to improve scheduling efficiency. If all of our jobs produce a reasonable number of files, the overall scheduling efficiency of the cluster also benefits.

 

8. Optimizing the overall workflow is better than optimizing a single job in isolation.
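As a minimal sketch of points 5 and 6 (the table name log and the column uid are assumptions used only for illustration): enable the skew-tolerant group-by, and replace a skew-prone count(distinct) with an explicit de-duplication followed by a count, which runs as two stable jobs.

-- point 5: add an extra randomized-distribution stage for skewed group-bys
set hive.groupby.skewindata=true;

-- point 6: instead of the skew-prone form
--   select count(distinct uid) from log;
-- de-duplicate first, then count the de-duplicated rows
select count(1)
from (select uid from log group by uid) t;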

 

 

 

Cases during migration and optimization:

 

 

 

Question 1: Fields are often missing in logs, for example the user_id in site-wide logs. If such logs are joined with bmw_users on user_id, data skew occurs.

 

Method: Solve the data skew problem.

 

Solution 1: rows whose user_id is null do not participate in the join. For example:

 

select *
from log a
join bmw_users b
  on a.user_id is not null
 and a.user_id = b.user_id
union all
select *
from log a
where a.user_id is null;
Solution 2:

 

select *
from log a
left outer join bmw_users b
  on case when a.user_id is null then concat('dp_hive', rand()) else a.user_id end = b.user_id;

 

 

 

Summary: solution 2 is more efficient than solution 1, with less I/O and fewer jobs. In solution 1 the log table is read twice and 2 jobs are generated; solution 2 needs only 1 job. This optimization is suited to skew caused by invalid IDs (such as -99, '', null). By converting the null key into a string plus a random number, the skewed data is spread over different reducers, which solves the data skew. Because null values cannot match anything in the join, the final result is unaffected even though they are scattered across reducers. As an aside, the general Hadoop join is implemented through secondary sort: the join column is the partition key, the join column plus a table tag forms the group key used for sorting, records are distributed to reducers by the partition key, and the group keys are sorted within each reducer.

 

 

 

Question 2: Joining IDs of different data types causes data skew.

 

One table, the S8 log, has one record per item and must be joined with the items table. However, the join is prone to skew. The S8 log contains both string item IDs and numeric item IDs; its ID column is of type string, while the numeric ID in the items table is bigint. The guessed cause of the problem: when hashing the S8 item IDs for distribution to reducers, they are converted to numeric IDs, so all S8 log rows with string IDs end up on a single reducer. The solution below verified this guess.
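For contrast, the skew-prone form before the fix would presumably have looked like the following (a reconstruction, not shown in the original); comparing a string column with a bigint column makes Hive coerce the keys numerically, so all the string IDs hash to the same reducer.

-- hypothetical original join: string auction_id compared with bigint auction_id
select *
from s8_log a
left outer join r_auction_auctions b
  on a.auction_id = b.auction_id;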

 

Method: Convert the numeric type to the string type.

 

select *
from s8_log a
left outer join r_auction_auctions b
  on a.auction_id = cast(b.auction_id as string);

 

 

 

Question 3: Making use of Hive's union all optimization

 

Hive's union all optimization is limited to non-nested queries.

 

For example:

 

select *
from (
  select * from t1
  group by c1, c2, c3
  union all
  select * from t2
  group by c1, c2, c3
) t3
group by c1, c2, c3;

 

From a business-logic point of view, the group by in the subqueries looks redundant (functionally redundant, unless count(distinct) is involved) and exists only because of a historical Hive bug or for performance reasons (there was once a Hive bug that produced wrong results when the group by was not written in the subqueries). So this Hive SQL is rewritten as:

 

select *
from (
  select * from t1
  union all
  select * from t2
) t3
group by c1, c2, c3;

 

After testing, the Hive union all bug was not observed and the data was consistent. The number of MR jobs dropped from 3 to 1.

 

t1 is equivalent to one directory and t2 to another. For a MapReduce program, t1 and t2 can serve as multiple inputs to a single job, so the whole problem can be solved in one MapReduce pass. The Hadoop computing framework does not mind a large amount of data; what it minds is a large number of jobs.

 

On another computing platform such as Oracle, however, this is not necessarily so: splitting the large input into two inputs, sorting and aggregating each separately, and then merging (if the two sub-sorts run in parallel) may well perform better (just as Shell sort outperforms bubble sort).

 

 

 

Question 4: For example, a promotion-effect table must be joined with the items table. The auction_id column in the effect table contains both string item IDs and numeric item IDs, and it has to be joined with the items table to fetch item information. The following Hive SQL performs better:

 

select *
from effect a
join (
  select auction_id as auction_id from auctions
  union all
  select auction_string_id as auction_id from auctions
) b
on a.auction_id = b.auction_id;

 

This is better than filtering out the numeric IDs and the string IDs separately and joining each with the items table.

 

The advantage of this formulation is that in one MR job the items table is read only once and the promotion-effect table is read only once. If this SQL were rewritten as MR code, the map phase would tag each record of table a with tag a, and for each record of the items table it would emit two <key, value> pairs, <numeric id, tag b> and <string id, tag b>. So the items table is read from HDFS only once.

 

 

 

Question 5: Should you join first and materialize a temporary table before the union all, or write a nested query? That is the question. For example:

 

select *
from (
  select * from t1
  union all
  select * from t4
  union all
  select * from t2
  join t3
  on t2.id = t3.id
) x
group by c1, c2;

 

This produces four jobs. If the join is run first to produce a temporary table t5, then t5 plus the union all amount to only two jobs:

 

insert overwrite table t5
select *
from t2
join t3
on t2.id = t3.id;

select * from (select * from t1 union all select * from t4 union all select * from t5) x;

 

It would reduce the developers' burden if Hive's union all optimization were smarter (treating subqueries as temporary tables automatically). The root cause of the problem is that union all optimization is currently limited to non-nested queries. If you write the MR program yourself, this is not an issue at all: it is simply multiple inputs.

 

 

 

Question 6: map join is the usual solution to the common skew scenario of joining a small table with a large table, and it is used very frequently. But what if the small table is not so small? When the small table is large, map join hits bugs or exceptions and needs special handling. Yun Rui and Yu Ji provided a very powerful solution. Example:

 

select *
from log a
left outer join members b
on a.memberid = b.memberid;

 

The members table has over 600 million records; distributing it to every map task is no small overhead, and map join does not support such a large small table. Yet with an ordinary join, data skew appears again.

 

Solution:

 

select /*+ mapjoin(x) */ *
from log a
left outer join (
  select /*+ mapjoin(c) */ d.*
  from (select distinct memberid from log) c
  join members d
  on c.memberid = d.memberid
) x
on a.memberid = x.memberid;

 

First take all the distinct memberids from the log, then map join them with members to obtain the information of only those members that appear in today's log, and then map join that result back with the log.

 

If the log contained millions of distinct memberids, this would bring back the original map join problem. Fortunately, the daily member UV is never that large: there are not that many members with transactions, clicks, or commissions on any given day. So this method solves the data skew problem in many scenarios.

 

 

 

Question 7: A general data-skew technique under Hive: double (replicate) the relatively small table being joined. This trick is common in MR programs. Take the same question as before:

 

select *
from log a
left outer join (
  select /*+ mapjoin(e) */
         memberid, number
  from members d
  join num e
) b
on a.memberid = b.memberid
and mod(a.pvtime, 30) + 1 = b.number;

 

The num table has a single column, number, containing 30 rows that form the natural-number sequence 1 through 30. The idea is to expand the members table 30-fold and then distribute the log data across the reducers by memberid and pvtime, so that each reducer gets a relatively even share. In current tests, the map join solution of question 6 still performs slightly better; the following approach suits the cases where map join cannot solve the problem.
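The 30-row num table itself only needs to be prepared once. A minimal sketch, assuming the helper file path and the load method (neither appears in the original):

-- hypothetical one-time setup of the helper table holding the numbers 1..30
create table if not exists num (number int);
-- the local file contains thirty lines: 1, 2, ..., 30
load data local inpath '/tmp/num_1_to_30.txt' overwrite into table num;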

 

 

 

In the long run, the following optimization scheme is expected to become a general Hive method for handling data skew:

 

1. Sample the log table to find the skewed memberids and write them to a result table tmp1. To the computing framework all data looks alike; it does not know the data distribution, so sampling is indispensable. (Stage 1. A HiveQL sketch of this sampling step appears after this list.)

 

2. The distribution of the data follows the laws of social statistics: wealth is unevenly distributed. There will not be many skewed keys, just as a society does not have many rich people or many outliers. So tmp1 has very few records. Map join tmp1 with members to generate tmp2, and read tmp2 into the distributed file cache. This is a map-only process. (Stage 2)

 

3. A map phase reads both members and the log. For a record that comes from the log, check whether its memberid is in tmp2: if it is, write the record to a local file; otherwise emit a <memberid, value> key/value pair. For a record that comes from members, emit a <memberid, value> key/value pair, and proceed to the reduce phase. (Stage 3)

 

4. Finally, merge the local files from stage 3 with the output files of stage 3's reduce phase and write the result to HDFS.
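A HiveQL approximation of the sampling in step 1, referenced above. This is a sketch only: the table tmp1 and the threshold of 100,000 occurrences are assumptions, and the real scheme described here operates at the MapReduce level rather than in SQL.

-- stage 1 sketch: collect the skewed memberids from the log into tmp1
create table tmp1 as
select memberid
from log
group by memberid
having count(1) > 100000;  -- hypothetical skew threshold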

 

When this is implemented in Hadoop, stage 2 is a map-only process and can be merged with the map process of stage 3 into a single map process.

 

The goal of this scheme: map join the skewed data, and use an ordinary join for the non-skewed data. Expressed in Hive SQL it would turn into many SQL fragments, and the log table would be read several times. Skewed keys are always rare, which holds for the vast majority of business scenarios. Could this become Hive's general algorithm for joins on skewed data?

 

 

 

Question 8: Optimizing multi-granularity UV computation. For example, we need to compute the UV of each shop, and also the page UV and the PVIP.

 

Solution 1:

 

select shopid, count(distinct uid)
from log group by shopid;

select pageid, count(distinct uid)
from log group by pageid;

 

Because of data skew, this takes a very long time to run.

 

Solution 2:

 

from log
insert overwrite table t1 partition (type='1')
select shopid
group by shopid, acookie
insert overwrite table t1 partition (type='2')
select pageid
group by pageid, acookie;

 

Store UV:

 

select shopid, sum(1)
from t1
where type = '1'
group by shopid;

 

Page UV:

 

select pageid, sum(1)
from t1
where type = '2'
group by pageid;

 

Multi-insert is used here to effectively reduce the number of HDFS reads, although multi-insert adds HDFS writes: one extra HDFS write for an additional map stage. With this method the results can be produced reliably.

 

Solution 3:

 

insert into table t1
select type, type_name, '' as uid
from (
  select 'page' as type, pageid as type_name, uid from log
  union all
  select 'shop' as type, shopid as type_name, uid from log
) y
group by type, type_name, uid;

 

insert into table t2
select type, type_name, sum(1) as uv
from t1
group by type, type_name;

 

from t2
insert into table t3
select type, type_name, uv
where type = 'page'
insert into table t4
select type, type_name, uv
where type = 'shop';

 

This yields the two result tables: t3, the page UV table, and t4, the shop result table. In terms of I/O, the log is read only once, and there are fewer HDFS writes than in solution 2 (whose multi-insert sometimes adds an extra map-stage HDFS write). The number of jobs decreases by one, to three, and the number of jobs with a reduce phase drops from four to two. The third step is a map-only pass over a small table that splits it into the result tables, consuming few computing resources, whereas solution 2 performs two large-scale de-duplicating aggregations.

 

The main idea of this optimization: a MapReduce job takes a long time to initialize, so once it is up, let it do more work. Here, the de-duplication of pages by uid is done in the same pass, saving one read of the log and one job initialization, and saving network shuffle I/O at the cost of more local disk reads and writes. The efficiency improves greatly.

 

This solution is suitable for multi-granularity UV computation where a level-by-level roll-up is not needed: the more granularities there are, the more resources it saves, and it is fairly general.

 

 

 

Question 9: Multi-granularity UV computation rolled up layer by layer. For example, with four dimensions a, b, c, and d, compute the UV of (a, b, c, d), (a, b, c), (a, b), and (a), plus the total UV. This could be done with solution 2 of question 8, but here, because of the special nature of UV, the layer-by-layer roll-up can be built on a single sort from which all of the UV computations benefit.

 

Case: the mm_log log currently has more than 2.5 billion PVs per day. From mm_log, UV and IPUV are to be computed, producing result tables at three granularities:

 

(memberid, siteid, adzoneid, province, uv, ipuv)  r_table_4
(memberid, siteid, adzoneid, uv, ipuv)            r_table_3
(memberid, siteid, uv, ipuv)                      r_table_2

 

Step 1: group by memberid, siteid, adzoneid, and province to de-duplicate, generating a temporary table in which cookies and IPs are tagged and stored together in one column. Call this temporary table t_4:

 

select memberid, siteid, adzoneid, province, type, user
from (
  select memberid, siteid, adzoneid, province, 'A' type, cookie as user from mm_log where ds = 20101205
  union all
  select memberid, siteid, adzoneid, province, 'I' type, ip as user from mm_log where ds = 20101205
) x
group by memberid, siteid, adzoneid, province, type, user;

 

Step 2: ranking. Generate the table t_4_num. Hadoop's strongest and most central capabilities are partition and sort. Group by type and user (acookie), and rank by type, user, memberid, siteid, adzoneid, and province:

 

select *,
       row_number(type, user, memberid, siteid, adzoneid) as adzone_num,
       row_number(type, user, memberid, siteid) as site_num,
       row_number(type, user, memberid) as member_num,
       row_number(type, user) as total_num
from (select * from t_4 distribute by type, user sort by type, user, memberid, siteid, adzoneid) x;

 

In this way we obtain each user's rank at every level of granularity. For a given user at a given granularity, there is exactly one record whose rank equals 1. Summing over the records whose rank equals 1 gives the same result as summing after a group-by-user de-duplication.
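For reference, on Hive versions with standard windowing functions (0.11 and later), the same per-granularity ranks can presumably be produced with row_number() over a partition instead of the custom row_number UDF used above. This is a sketch, not the author's original form; the ordering inside each partition does not matter here, because only the existence of exactly one rank-1 row per partition is used.

-- `user` is backticked because it is a reserved word on newer Hive versions
select t.*,
       row_number() over (partition by type, `user`, memberid, siteid, adzoneid order by `user`) as adzone_num,
       row_number() over (partition by type, `user`, memberid, siteid order by `user`) as site_num,
       row_number() over (partition by type, `user`, memberid order by `user`) as member_num,
       row_number() over (partition by type, `user` order by `user`) as total_num
from t_4 t;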

 

Step 3: compute the UV statistics at the different granularities, starting from the finest granularity to generate the result table r_table_4. At this point the result set is only on the order of 10 million rows.

 

For example, the UV at the (memberid, siteid, adzoneid, province) granularity is computed as:

 

select memberid, siteid, adzoneid, province,
       sum(case when type = 'A' then cast(1 as bigint) end) as province_uv,
       sum(case when type = 'I' then cast(1 as bigint) end) as province_ip,
       sum(case when adzone_num = 1 and type = 'A' then cast(1 as bigint) end) as adzone_uv,
       sum(case when adzone_num = 1 and type = 'I' then cast(1 as bigint) end) as adzone_ip,
       sum(case when site_num = 1 and type = 'A' then cast(1 as bigint) end) as site_uv,
       sum(case when site_num = 1 and type = 'I' then cast(1 as bigint) end) as site_ip,
       sum(case when member_num = 1 and type = 'A' then cast(1 as bigint) end) as member_uv,
       sum(case when member_num = 1 and type = 'I' then cast(1 as bigint) end) as member_ip,
       sum(case when total_num = 1 and type = 'A' then cast(1 as bigint) end) as total_uv,
       sum(case when total_num = 1 and type = 'I' then cast(1 as bigint) end) as total_ip
from t_4_num
group by memberid, siteid, adzoneid, province;

 

For the UV at the ad-zone granularity, aggregate directly from r_table_4, which now plays the role of the source table:

 

select memberid, siteid, adzoneid, sum(adzone_uv), sum(adzone_ip)
from r_table_4
group by memberid, siteid, adzoneid;

 

The UV at the (memberid, siteid) granularity, the UV at the memberid granularity, and the total UV are likewise rolled up from r_table_4.

 
