Summary: A summary of Hive performance optimization

Source: Internet
Author: User
Tags: joins, subqueries

Note: roughly 90% of this post comes from the article Hive performance optimization. Many thanks to its author for the careful write-up; I have added some supplements of my own. If anything here is wrong, please leave a comment, thank you.

Preface

In a phone interview today I was suddenly asked what optimizations I had done on Hive. Having just woken up, I stumbled through some of the problems I ran into during my previous internship, so I am reposting that article here and adding a summary of my own as an introduction.

First, let's look at the characteristics of Hadoop's computing framework, from which this optimization work derives. Data volume is not the problem; data skew is. More jobs means lower efficiency: even for tables with only a few hundred rows, joining them many times can produce a dozen or more jobs, and the run takes a long time, because MapReduce job initialization is relatively expensive. UDAFs such as sum, count, max and min are not afraid of data skew, because Hadoop aggregates and optimizes them on the map side, so skew is not a problem for them.

COUNT(DISTINCT) is inefficient on large data volumes, and multiple COUNT(DISTINCT)s are even worse, because COUNT(DISTINCT) groups by the GROUP BY field and sorts by the DISTINCT field, and that distribution is usually very skewed. For example, computing male UV and female UV over something like Taobao's 3 billion PV per day: grouping by sex distributes the work to 2 reducers, each of which must process 1.5 billion records.

Facing these problems, we have some effective means of optimization. Here are some that work well:

- Good model design gets you halfway there.
- Solve the data skew problem.
- Reduce the number of jobs.
- Set a reasonable number of map and reduce tasks; this can noticeably improve performance (for example, using 160 reducers for a job at the 100k-row scale is quite wasteful; 1 is enough).
- Understand the data distribution and solve skew yourself when needed. Setting hive.groupby.skewindata=true is a generic algorithmic optimization, but a generic optimization cannot adapt to a specific business context; developers who understand the business and the data can often solve skew accurately and effectively through business logic.
- Be careful with COUNT(DISTINCT) on large data volumes; it is prone to skew.
- Merging small files is an effective way to improve scheduling efficiency; if all jobs are given a reasonable number of files, the overall scheduling of the cluster also benefits.

Optimize for the whole: a locally optimal single job is inferior to a globally optimal plan.

By now there should be a question in our minds: what is the root cause of poor performance?

The root cause of low performance

To optimize Hive performance, read the HiveQL as a MapReduce program: consider optimization from the point of view of how M/R actually runs and think about operational performance from the bottom up, rather than only rearranging things at the level of the query logic.

RAC (Real Application Clusters) is like a nimble, flexible pickup truck that responds quickly; Hadoop is like a huge freighter with a large start-up overhead, which is very inefficient if each trip carries only a small amount of input and output. So the first task when using Hadoop is to load more data onto each task.

The core capabilities of Hadoop are partition and sort, so these are the root of optimization.

There are several notable characteristics of how Hadoop processes data. The sheer volume of data is not the real load; the pressure comes from skew in the data being processed. More jobs means lower efficiency: even for tables with only a few hundred rows, joining them repeatedly for a rollup can produce dozens of jobs that take more than half an hour, with most of the time spent on job assignment, initialization and data output. M/R job initialization itself consumes a noticeable share of the time. UDAFs such as sum, count, max and min are not afraid of skew, because Hadoop merges and optimizes them on the map side. COUNT(DISTINCT) is inefficient on large data, and multiple COUNT(DISTINCT)s are worse, because COUNT(DISTINCT) groups by the GROUP BY field and sorts by the DISTINCT field, and that distribution is usually very skewed: for the male/female UV example over Taobao's 3 billion daily PV, grouping by sex sends the work to 2 reducers, each handling 1.5 billion records.

Data skew is the main cause of drastic efficiency drops; adding one more map/reduce pass can be used to avoid the skew.

Finally, the conclusion: use methods such as increasing the number of jobs, increasing the input volume, accepting more storage, and making full use of idle CPU to break up the burden caused by data skew.

Optimizing the performance configuration

Map phase optimization

Map phase optimization is mainly about determining an appropriate number of maps. First, understand the formula for the map count; note that this discussion applies to Hive 0.9.

num_map_tasks = max[${mapred.min.split.size}, min(${dfs.block.size}, ${mapred.max.split.size})]

mapred.min.split.size: the minimum size of a split; the default is 1B.
mapred.max.split.size: the maximum size of a split; the default is 256MB.
dfs.block.size: the HDFS block size, a value already fixed on the cluster; by default Hive does not override it.

You adjust the number of maps by adjusting mapred.max.split.size: reducing it increases the number of maps, and increasing it reduces the number of maps. Note that directly setting mapred.map.tasks has no effect.
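As a hedged illustration (Hive 0.9-era parameter names as used above; the table name and sizes below are only placeholders):

set mapred.max.split.size=134217728;  -- 128MB per split (illustrative); smaller => more maps
set mapred.min.split.size=1;          -- keep the minimum at its default of 1 byte

-- the number of maps for the next query then follows
-- max[${mapred.min.split.size}, min(${dfs.block.size}, ${mapred.max.split.size})]
SELECT COUNT(1) FROM some_table;      -- some_table is a placeholder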

Reduce phase optimization

The reduce phase described here refers to the reduce computation proper, not the entire reduce task. The main job of reduce phase optimization is likewise to choose an appropriate number of reduce tasks; unlike the map side, here you can directly set mapred.reduce.tasks to specify the reduce count.

num_reduce_tasks = min[${hive.exec.reducers.max}, ${input.size} / ${hive.exec.reducers.bytes.per.reducer}]

hive.exec.reducers.max: introduced in Hive 0.2.0. The default is 999 before Hive 0.14.0 and 1009 from Hive 0.14.0 onward. It is the maximum number of reducers that will be started.

hive.exec.reducers.bytes.per.reducer: introduced in Hive 0.2.0. The default is 1G (1,000,000,000) before Hive 0.14.0 and 256M (256,000,000) from Hive 0.14.0 onward; see HIVE-7158 and HIVE-7917. It is the number of bytes each reducer processes; for example, with the 256M default, a 1GB input file will start 4 reducers.

In other words, Hive estimates the reduce count from the input size: by default each reducer handles hive.exec.reducers.bytes.per.reducer bytes (1G here), and the count cannot exceed the upper bound hive.exec.reducers.max (999 here). So we can adjust hive.exec.reducers.bytes.per.reducer to control the number of reducers.
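A hedged sketch of the two knobs just described (the values are purely illustrative):

-- let Hive keep estimating, but make each reducer handle fewer bytes (so more reducers start)
set hive.exec.reducers.bytes.per.reducer=500000000;

-- or pin the reducer count explicitly, which bypasses the estimate altogether
set mapred.reduce.tasks=15;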

A few things to note: the number of reducers has a significant impact on the performance of the whole job. If it is set too high, many small files are produced, which affects the NameNode, and the total running time may not even decrease; if it is set too low, each reducer processes more data and an OOM error becomes likely. If mapred.reduce.tasks / mapreduce.job.reduces is set, Hive uses that value directly as the reduce count; if it is not set (that is, it is -1), Hive estimates the count from the input file size. Estimating from the input may be inaccurate, because the input to reduce is the output of map, and the map output may be smaller than its input; the most accurate estimate would be based on the map output.

Column pruning

When reading data, Hive can read only the columns needed by the query and ignore the others. For example, for the following query:

SELECT a,b from Q WHERE e<10;

When executing this query, table Q has 5 columns (a, b, c, d, e); Hive reads only the 3 columns a, b and e that the query logic actually needs and ignores c and d. This saves read overhead, intermediate table storage and data merging overhead.

The corresponding parameter is hive.optimize.cp=true (the default is true).

Add: during my internship I also made use of this: when doing multiple joins, select only the metrics that are actually needed instead of lazily using SELECT * in the subqueries, as sketched below.
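A hedged sketch of that habit (table and column names are made up for illustration): list the needed columns in the subquery instead of SELECT *.

-- wasteful: every column of t2 is dragged through the shuffle
-- SELECT t1.id, s.amt FROM t1 JOIN (SELECT * FROM t2) s ON t1.id = s.id;

-- better: only the join key and the metric that is actually used leave the subquery
SELECT t1.id, s.amt
FROM t1
JOIN (SELECT id, amt FROM t2) s
  ON t1.id = s.id;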

Partition pruning

Partition pruning reduces the number of unnecessary partitions read during a query. For example, for the following queries:

SELECT
*
FROM (
  SELECT
    a1,
    COUNT(1)
  FROM t
  GROUP BY a1
) subq             -- tip: keep closing brackets flush left, so it is easy to spot full-width (Chinese) brackets
WHERE subq.prtn = 100;   -- (redundant partitions)

SELECT
*
FROM t1
JOIN (
  SELECT
  *
  FROM t2
) subq ON (t1.a1 = subq.a2)
WHERE subq.prtn = 100;

These queries are more efficient when the condition "subq.prtn=100" is placed inside the subquery, since fewer partitions are read. Hive performs this pruning optimization automatically.

The corresponding parameter is hive.optimize.pruner=true (the default is true).

Add: in actual cluster work, adding the partition filter is the most important thing. Without it, a query can easily fill the whole queue's resources and cause abnormal I/O, to the point where you cannot even log on to the server or Hive. Always remember the partition filter and the LIMIT clause; a hedged illustration follows.
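A hedged sketch of that habit (table, partition column and value are illustrative):

SELECT *
FROM some_log_table
WHERE ds = '20120329'   -- partition predicate: only one partition is scanned
LIMIT 100;              -- cap the output while exploring the data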

JOIN operations

When writing a statement with a JOIN, place the table or subquery with fewer rows on the left side of the JOIN operator. In the reduce phase, the contents of the table on the left side of the JOIN are loaded into memory, so putting the smaller table there effectively reduces the chance of OOM (out of memory) errors. In other words, for the same key, put the smaller set of values first and the larger one after: the "small table first" principle. If a statement contains more than one JOIN, the approach differs depending on whether the join conditions are the same.

Join principle

The principle when writing a JOIN query: place the table or subquery with fewer rows on the left side of the JOIN operator, because in the reduce phase of a join the contents of the left-side table are loaded into memory, and a smaller left-side table effectively reduces the chance of OOM errors. For multiple joins in a single statement, if the join conditions are the same, the small tables still go on the left, as in this query:

INSERT OVERWRITE TABLE pv_users
SELECT
    pv.pageid,
    u.age
FROM page_view pv
JOIN user u ON (pv.userid = u.userid)
JOIN newuser x ON (u.userid = x.userid);

If the join keys are the same, no matter how many tables are involved they are merged into a single map-reduce job rather than n jobs; the same holds when doing an OUTER JOIN.

If the conditions of the Join are not the same, for example:

INSERT OVERWRITE TABLE pv_users
SELECT
   pv.pageid,
   u.age
FROM page_view pv
JOIN user u ON (pv.userid = u.userid)
JOIN newuser x ON (u.age = x.age);

The number of map-reduce tasks matches the number of distinct join conditions; the query above is equivalent to the following two queries:

INSERT OVERWRITE TABLE tmptable
SELECT *
FROM page_view pv
JOIN user u ON (pv.userid = u.userid);

INSERT OVERWRITE TABLE pv_users
SELECT
    x.pageid,
    x.age
FROM tmptable x
JOIN newuser y ON (x.age = y.age);
MAP Join Operation

If one table is very small and the other table in the join is very large, you can use MAPJOIN to complete the join in the map phase: no reduce step and no shuffle are needed, which saves resources and improves join efficiency. The prerequisite is that the small table's data can be held in memory during the map. For example:

INSERT OVERWRITE TABLE pv_users
SELECT /*+ MAPJOIN(pv) */
   pv.pageid,
   u.age
FROM page_view pv
JOIN user u ON (pv.userid = u.userid);

The join is then completed entirely in the map phase (the original article illustrates this with a figure).

The related parameters are: hive.join.emit.interval = 1000, hive.mapjoin.size.key = 10000, hive.mapjoin.cache.numrows = 10000.

See also Hive Mapjoin. Note that since Hive 0.11 this optimization is on by default: no explicit MAPJOIN hint is needed; Hive converts an ordinary join into a map join when it decides the conversion is worthwhile.

Two properties control when this optimization is triggered:

hive.auto.convert.join

The default is true: eligible joins are automatically converted to map joins.

hive.mapjoin.smalltable.filesize

The default is 25000000 (about 25MB): this threshold decides which tables the optimization applies to; a table smaller than this value is loaded into memory.
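A hedged sketch of using these two switches (the threshold is the default mentioned above; table names are illustrative):

set hive.auto.convert.join=true;                -- let Hive rewrite eligible joins as map joins
set hive.mapjoin.smalltable.filesize=25000000;  -- tables below ~25MB are loaded into memory

-- an ordinary join: if small_dim is below the threshold, Hive converts it
-- to a map join without any explicit MAPJOIN hint
SELECT f.id, d.name
FROM fact_table f
JOIN small_dim d ON (f.id = d.id);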

GROUP BY operations

There are a few things to note when you perform a group by operation:

Map-side Partial aggregation

Not all aggregation operations have to be done on the reduce side; many can be partially aggregated on the map side, with the reduce side producing the final result.

The parameters to be modified here are:

hive.map.aggr=true (whether to aggregate on the map side; the default is true); hive.groupby.mapaggr.checkinterval=100000 (the number of rows used for the map-side aggregation check).
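A hedged sketch (the values are the defaults quoted above; page_view is reused from the earlier examples purely for illustration):

set hive.map.aggr=true;
set hive.groupby.mapaggr.checkinterval=100000;

-- each map task pre-aggregates locally, so far less data reaches the reducers
SELECT userid, COUNT(1)
FROM page_view
GROUP BY userid;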

Load balancing when data is skewed

Here you set hive.groupby.skewindata to true, and the resulting query plan contains two MapReduce jobs. In the first job, the map output is distributed randomly to the reducers, each reducer does a partial aggregation and emits its result; because rows with the same GROUP BY key can end up on different reducers, the load is balanced. The second job then distributes the partially aggregated results to the reducers by the GROUP BY key (this time rows with the same key are guaranteed to reach the same reducer) and completes the final aggregation.
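A hedged sketch, reusing the male/female UV scenario from the beginning of the article (table and column names are illustrative):

set hive.groupby.skewindata=true;

-- planned as two MR jobs: a randomly distributed partial aggregation, then the final
-- aggregation keyed by sex, so the two huge groups no longer land on two reducers alone
SELECT sex, COUNT(1) AS pv
FROM page_view
GROUP BY sex;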

Merging small files

We know that a large number of small files easily creates a bottleneck on the storage side, puts pressure on HDFS and hurts processing efficiency. You can remove this effect by merging the result files of the map and reduce stages.

The parameters for merging are: whether to merge the output files of map-only jobs: hive.merge.mapfiles=true (default true); whether to merge the output files of the reduce side: hive.merge.mapredfiles=false (default false); the size of the merged files: hive.merge.size.per.task=256*1000*1000 (default 256000000).
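A hedged sketch of the three settings (the size is the quoted default):

set hive.merge.mapfiles=true;            -- merge the small files produced by map-only jobs
set hive.merge.mapredfiles=true;         -- also merge the files written by reducers (off by default)
set hive.merge.size.per.task=256000000;  -- target size, in bytes, of the merged files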

Add: in actual cluster work, the "small table on the left of the join" rule was the first one I came across; after checking some material and asking colleagues, it is the quickest way to optimize a join. I have hardly ever used MAPJOIN, perhaps because even the "small" tables I deal with are fairly large, and the company's Hive seems to be 0.12, which should already apply this conversion automatically.

Optimization from the program's point of view

Use SQL skillfully to improve queries

Skilled use of SQL lets you write efficient query statements.

Scenario: there is a user table with one row per seller per day, keyed by user_id and ds (date); its attributes include the main category, and its metrics include transaction amount and transaction count. Every day we need, for each user, the total amount and total count over the previous 10 days, plus the main category of the most recent day.

Common Methods

-- Step 1: using the aggregation trick, take each user_id's most recent day's main category into temporary table t1.
CREATE TABLE t1 AS
SELECT
    user_id,
    substr(MAX(CONCAT(ds, cat)), 9) AS main_cat
FROM users
WHERE ds = 20120329   -- 20120329 is the value of the date column; real code would use a variable for the current date
GROUP BY user_id;


-- Step 2: sum the 10 days' transaction amount and transaction count into temporary table t2.
CREATE TABLE t2 AS
SELECT
    user_id,
    sum(qty) AS qty,
    sum(amt) AS amt
FROM users
WHERE ds BETWEEN 20120301 AND 20120329
GROUP BY user_id;

-- Step 3: join t1 and t2 to get the final result.
SELECT
    t1.user_id,
    t1.main_cat,
    t2.qty,
    t2.amt
FROM t1
JOIN t2 ON t1.user_id = t2.user_id;

Optimization method 

SELECT
    user_id,
    substr(MAX(CONCAT(ds, cat)), 9) AS main_cat,
    sum(qty),
    sum(amt)
FROM users
WHERE ds BETWEEN 20120301 AND 20120329
GROUP BY user_id;

In our work we concluded that the cost of scheme 2 is roughly the cost of step 2 of scheme 1, and the run time improved from the original 25 minutes to about 10 minutes. Saving the reads and writes of two temporary tables is the key reason. The same holds for data work in Oracle: SQL is universal, and many generic SQL optimizations also pay off under Hadoop's distributed computation.

Add: in actual cluster work, I was teased by colleagues about the first point: when writing composite join logic we try to keep it in one piece of code, so a single Hive script may contain seven or eight joins; only when an intermediate table has to be output, or the business logic gets confusing, do we materialize the intermediate table and continue from it. In general, when extracting data we start from the core table and LEFT JOIN the other tables onto it, which guarantees the core table's fields are always present; even when there is no match we get NULLs rather than losing rows, which matters when computing metrics.

Data skew caused by invalid IDs in joins

Problem: information is often missing in the logs. For example, in roughly 2 billion rows of site-wide logs per day keyed by user_id, the user_id can be lost during log collection, leaving NULL keys. If we join this log on user_id with bmw_users, we hit a data skew problem, because in Hive all rows whose key is NULL are treated as the same key and assigned to the same task.

Workaround 1: rows with NULL user_id do not take part in the join; filter the NULLs with a subquery

SELECT *
FROM log a
JOIN bmw_users b
  ON a.user_id IS NOT NULL AND a.user_id = b.user_id
UNION ALL
SELECT *
FROM log a
WHERE a.user_id IS NULL;
Workaround 2: handle the NULLs with a function in the join condition
SELECT *
FROM log a
LEFT OUTER JOIN bmw_users b
  ON CASE WHEN a.user_id IS NULL THEN CONCAT('dp_hive', RAND()) ELSE a.user_id END = b.user_id;

This is said to work well; I have not tried it myself.

Tuning result: with the data skew, the job used to run for more than 1 hour; workaround 1 runs in about 25 minutes per day and workaround 2 in about 20 minutes per day. The optimization effect is obvious.

In our work we concluded that workaround 2 is better than workaround 1: it does less I/O and needs fewer jobs. Workaround 1 reads the log twice and uses 2 jobs; workaround 2 uses 1 job. This optimization suits skew caused by invalid IDs (such as -99, '', NULL). Turning the NULL key into a string plus a random number spreads the skewed data across different reducers, which resolves the skew. Because the NULL values cannot match anything, they do not affect the final result even though they end up on different reducers. The usual Hadoop implementation of a join is a secondary sort: the join column is the partition key, the join column plus the table tag form the sort group key, reducers are assigned by the partition key, and within a reducer rows are sorted by the group key.

Skew caused by joining columns of different data types

Problem: joining IDs of different data types produces data skew.

A table of s8 logs has one record per item and must be joined with the commodity table, but the join is skewed. The s8 log contains both 32-character string commodity IDs and numeric commodity IDs; the type in the log is string, while the commodity table's ID is bigint. The guess was that the skew arises because s8's string commodity IDs are converted to numbers when assigning rows to reducers, so all the string-ID rows of the s8 log land on a single reducer; the workaround below confirmed the guess.

Workaround: convert the numeric type to string

SELECT *
FROM s8_log a
LEFT OUTER JOIN r_auction_auctions b
  ON a.auction_id = CAST(b.auction_id AS STRING);   -- join key names are illustrative; the point is casting the bigint ID to string
Tuning result: after this adjustment, the daily table processing that used to take 1 hour 30 minutes completes in 20 minutes.

Add: it is said that Hive does not do the implicit conversion you might want here, so you need to explicitly cast the bigint column to string.

Using Hive's UNION ALL behaviour for optimization

Problem: for example, a promotion-effect table has to be joined with the commodity table; the auction_id column in the effect table contains both 32-character string commodity IDs and numeric IDs, and we want to join it to the commodity table to get commodity information.

Solution: UNION ALL

SELECT *
FROM effect a
JOIN (
  SELECT auction_id AS auction_id FROM auctions
  UNION ALL
  SELECT auction_string_id AS auction_id FROM auctions
) b
ON a.auction_id = b.auction_id;

A UNION ALL over multiple tables is optimized into a single job. This performs better than filtering out the numeric IDs and the string IDs separately and joining each of them to the commodity table.

Advantages of writing it this way: 1 MapReduce job; the commodity table is read only once and the promotion-effect table is read only once. If this SQL were written as Map/Reduce code, at map time each record of table a would be tagged a and each record of the commodity table tagged b, producing two kinds of key/value pairs.

Limitations of Hive's UNION ALL optimization

Hive's optimization of UNION ALL is limited to non-nested queries.

Eliminating GROUP BY inside subqueries

Example 1: the subqueries contain GROUP BY

SELECT *
FROM (
    SELECT *
    FROM t1
    GROUP BY c1, c2, c3
  UNION ALL
    SELECT *
    FROM t2
    GROUP BY c1, c2, c3
) t3;

From the business logic, the GROUP BY in the subqueries looks superfluous (functionally superfluous, unless there is a COUNT(DISTINCT)), unless it is there because of a Hive bug or for performance reasons (i.e., the data would be wrong without the subquery GROUP BY because of a Hive bug). So, based on experience, this Hive query can be rewritten as:

SELECT *
FROM (
    SELECT *
    FROM t1
  UNION ALL
    SELECT *
    FROM t2
) t3;

Tuning result: testing showed that the UNION ALL Hive bug did not appear and the data stayed consistent. The number of MapReduce jobs dropped from 3 to 1.

t1 is equivalent to one directory and t2 to another; for a Map/Reduce program, t1 and t2 can serve as multiple inputs of a single map/reduce job, so this can be solved with one map/reduce pass. Hadoop's computing framework is not afraid of a lot of data; it is afraid of a lot of jobs.

But on another computing platform such as Oracle that is not necessarily the case, because splitting a big input into two inputs, sorting them separately and summarizing after a merge (if the two sub-sorts run in parallel) may well perform better (just as Shell sort beats bubble sort).

Eliminating COUNT(DISTINCT), MAX and MIN inside subqueries

SELECT *
FROM (
    SELECT *
    FROM t1
  UNION ALL
    SELECT
      c1,
      c2,
      c3,
      COUNT(DISTINCT c4)
    FROM t2
    GROUP BY c1, c2, c3
) t3;

Because the subquery contains a COUNT(DISTINCT), simply switching to a plain GROUP BY does not meet the business goal. Using a temporary table to eliminate the COUNT(DISTINCT) not only removes the skew but also effectively reduces the number of jobs.

INSERT OVERWRITE TABLE t4 SELECT c1, c2, c3, c4 FROM t2 GROUP BY c1, c2, c3, c4;

SELECT
  c1,
  c2,
  c3,
  sum(income),
  sum(uv)
FROM (
    SELECT
      c1,
      c2,
      c3,
      income,
      0 AS uv
    FROM t1
  UNION ALL
    SELECT
      c1,
      c2,
      c3,
      0 AS income,
      1 AS uv
    FROM t4            -- t4 holds the deduplicated (c1, c2, c3, c4) rows, so summing 1 per row reproduces COUNT(DISTINCT c4)
) t3
GROUP BY c1, c2, c3;

The job count is 2, half of before, and the two map/reduce passes are more efficient than the COUNT(DISTINCT).

Tuning result: a join between a tens-of-millions-row category/member table and a billion-row commodity table; the task that originally took 1963s completes in 1152s after the adjustment.

Eliminating JOINs inside subqueries

SELECT *
FROM (
    SELECT *
    FROM t1
  UNION ALL
    SELECT *
    FROM t4
  UNION ALL
    SELECT *
    FROM t2
    JOIN t3 ON t2.id = t3.id
) x;

The code above runs as 5 jobs. If the JOIN is first materialized into a temporary table t5 and the UNION ALL is done afterwards, it becomes 2 jobs.

INSERT OVERWRITE TABLE t5
SELECT * FROM t2 JOIN t3 ON t2.id = t3.id;
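The article shows only the first job; a hedged sketch of the second job it describes (the UNION ALL over t1, t4 and the materialized t5) would be:

SELECT *
FROM (
    SELECT * FROM t1
  UNION ALL
    SELECT * FROM t4
  UNION ALL
    SELECT * FROM t5   -- t5 already holds the joined rows of t2 and t3
) x;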

Tuning result: for a tens-of-millions-row advertising table, the original 5 jobs took about 15 minutes in total; decomposed into 2 jobs, one runs 8-10 minutes and the other 3 minutes.

Add: I think this second trick, eliminating the JOIN inside the subquery, is of limited value. First, it needs a temporary table that also has to be cleaned up regularly; second, it is not an optimization in principle, it just splits the steps apart. If the goal is to reduce the number of jobs there is indeed an improvement, but if the whole thing is one block of business logic, writing it apart does not make much sense; that is my personal understanding.

Rewriting COUNT(DISTINCT) with GROUP BY for better results

COUNT(DISTINCT) is often used to compute UV, but it becomes slow when the data is skewed. In that case you can try to compute the UV with a GROUP BY rewrite.

Original code

INSERT OVERWRITE TABLE s_dw_tanx_adzone_uv PARTITION (ds=20120329)
SELECT
  20120329 AS thedate,
  adzoneid,
  COUNT(DISTINCT acookie) AS uv
FROM s_ods_log_tanx_pv t
WHERE t.ds = 20120329
GROUP BY adzoneid;
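The rewritten form is not shown at this point in the article; a hedged sketch of the GROUP BY rewrite the heading refers to, reusing the same illustrative table and columns, is:

SELECT
  20120329 AS thedate,
  adzoneid,
  COUNT(1) AS uv
FROM (
  SELECT DISTINCT adzoneid, acookie   -- deduplicate per (adzone, cookie) first ...
  FROM s_ods_log_tanx_pv t
  WHERE t.ds = 20120329
) tmp
GROUP BY adzoneid;                    -- ... then count rows instead of COUNT(DISTINCT)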

The data skew problem with COUNT(DISTINCT) cannot be generalized; it depends on the situation. Here is a set of data I tested:

Test data: 169,857 Records

-- Count the distinct IPs for the day
CREATE TABLE ip_2014_12_29 AS
SELECT
  COUNT(DISTINCT ip) AS ip
FROM logdfs
WHERE logdate = '2014_12_29';
Time taken: 24.805 seconds

-- Count the distinct IPs for the day (rewritten)
CREATE TABLE ip_2014_12_29 AS
SELECT
  COUNT(1) AS ip
FROM (
  SELECT DISTINCT ip
  FROM logdfs
  WHERE logdate = '2014_12_29'
) tmp;
Time taken: 46.833 seconds

Test result: the rewritten statement clearly takes longer than the original, because it contains 2 SELECTs and therefore one more job; with data volumes this small there is no skew in the first place, so the rewrite does not help.

Add: it amounts to doing extra subquery work, which is naturally slower here.

Optimization summary

When optimizing, read the Hive SQL as a MapReduce program and there will be unexpected insights. Understanding the core capabilities of Hadoop is fundamental to Hive optimization. This is the valuable experience the whole project team has accumulated over the past year.

Long-term observation of how Hadoop processes data shows several notable characteristics. It is not afraid of a lot of data; it is afraid of data skew. More jobs means lower efficiency: even for tables with only a few hundred rows, joining them many times produces a dozen or more jobs that take half an hour or more, because MapReduce job initialization is relatively long. For sum and count there is no data skew problem. COUNT(DISTINCT) is inefficient: the larger the data volume, the more pronounced the problem, and multiple COUNT(DISTINCT)s are even worse.

Optimization can be approached from several directions:
- Good model design gets you halfway there.
- Solve the data skew problem.
- Reduce the number of jobs.
- Set a reasonable number of map and reduce tasks; this can noticeably improve performance (for example, using 160 reducers for a job at the 100k-row scale is quite wasteful; 1 is enough).
- Writing the SQL yourself to resolve the skew is a good choice. Setting hive.groupby.skewindata=true is a generic algorithmic optimization, but a generic optimization always ignores the business and habitually offers a one-size-fits-all solution; ETL developers know the business and the data better, so solving skew through business logic is often more accurate and more efficient.
- Be very careful with COUNT(DISTINCT), especially on large data, where it easily skews; do not count on luck, handle it yourself.
- Merging small files is an effective way to improve scheduling efficiency; if we give jobs a reasonable number of files, the overall scheduling of the cluster also benefits.

Optimize for the whole: a locally optimal single job is inferior to a globally optimal plan.

Update

This is an ongoing accumulation, so there will inevitably be updates: 2017.07.29 first update; 2017.08.04 second update.

Thanks

Hive performance optimization

Hive insert and insert overwrite

The difference between split and block, and setting the number of map tasks and reduce tasks
