Improvements to the Hive Optimizer


Hive optimizes queries automatically, and several optimization cases were improved in Hive 0.11:

1. Joins where one side is small enough to fit in memory have a new optimization scheme:

a) the small table is read into memory as a hash table

b) only the large table is scanned

c) the fact table uses only a small amount of memory

2. Star joins

3. In many cases, hints are no longer needed

4. MapJoins are applied automatically

Star Join Optimization

First, an introduction to the star model and the snowflake model.

=================== start =======================

1. Introduction

A star schema is a multidimensional data model that consists of a fact table and a set of dimension tables. Each dimension table has one dimension as its primary key, and the primary keys of all these dimension tables together make up the composite primary key of the fact table. The non-key attributes of the fact table are called facts; they are generally numeric or otherwise computable data, while dimensions are mostly categorical data such as type or time. By organizing the data this way, we can compute aggregations such as sum, average, count, and percent across different dimensions (some or all of the fact table's key), and even perform 80/20 (Pareto) analysis. This allows a business topic to be analyzed from many different angles.
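As a sketch of such aggregation (all table and column names here are hypothetical, not taken from the text above), summing and counting facts by dimension attributes might look like:

```sql
-- Hypothetical star schema: a sales fact table joined to two dimension tables.
-- Aggregates the fact (amount) by product category and calendar month.
SELECT p.category,
       t.month,
       SUM(s.amount) AS total_amount,
       AVG(s.amount) AS avg_amount,
       COUNT(*)      AS cnt
FROM sales_fact s
JOIN product_dim p ON (s.product_sk = p.product_sk)
JOIN time_dim    t ON (s.time_sk = t.time_sk)
GROUP BY p.category, t.month;
```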

In multidimensional-analysis business intelligence solutions, models are commonly divided into star models and snowflake models according to the relationship between the fact table and the dimension tables. When designing a logical data model, you should consider whether the data is organized as a star or as a snowflake.

When all the dimension tables are connected directly to the fact table, the diagram looks like a star, so the model is called the star model (Figure 2).

The star schema is a denormalized structure: every dimension of the cube connects directly to the fact table, and there are no snowflaked dimensions, so the data carries some redundancy. For example, in the geography dimension table, if the records (country A, province B, city C) and (country A, province B, city D) both exist, then the information for country A and province B is stored twice; that is, redundancy exists.

A star model in a sales data warehouse:

When one or more dimension tables are not connected directly to the fact table but instead connect through other dimension tables, the diagram looks like several snowflakes joined together, so the model is called the snowflake model. The snowflake model is an extension of the star model: it further normalizes the dimension tables of the star model into hierarchies, and an original dimension table may be expanded into small local "hierarchical" areas that connect to the main dimension table instead of the fact table. In Figure 1, for example, the geography dimension is split into country, province, and city dimension tables. Its advantage is improved query performance through minimized data storage and joins against smaller dimension tables; the snowflake structure removes data redundancy.

A snowflake model in a sales data warehouse:

Because the star model tolerates data redundancy, many statistical queries need no additional joins, so its efficiency is generally higher than the snowflake model's. The star structure involves few normalization considerations, so design and implementation are relatively simple. Because the snowflake model removes redundancy, some statistics require joins to produce, so its efficiency is not necessarily as high as the star model's. Normalization is also a relatively complex process, and the corresponding database schema design, data ETL, and later maintenance are all more complicated. Therefore, where redundancy is acceptable, the star model is used more often in practice and is also more efficient.
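To illustrate the difference in join depth (with hypothetical table and column names), the same question, total sales per country, needs one join in a star model but two in a snowflake model:

```sql
-- Star model: the denormalized geography dimension answers the question in one join.
SELECT g.country, SUM(f.amount) AS total
FROM sales_fact f
JOIN geo_dim g ON (f.geo_sk = g.geo_sk)
GROUP BY g.country;

-- Snowflake model: the geography hierarchy is split into separate tables,
-- so the same question needs an extra join through the city dimension.
SELECT c.country_name, SUM(f.amount) AS total
FROM sales_fact f
JOIN city_dim ci   ON (f.city_sk = ci.city_sk)
JOIN country_dim c ON (ci.country_sk = c.country_sk)
GROUP BY c.country_name;
```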

2. Use selection

The star model (star schema) and the snowflake model (snowflake schema) are two common approaches in data warehousing; the contrast between them can be discussed from four angles.

  1) Data optimization

The snowflake model uses normalized data, meaning the data is organized inside the database to eliminate redundancy, which effectively reduces the data volume. Through referential integrity, its business hierarchies and dimensions are stored in the data model.


▲ Figure 1 Snowflake model

In comparison, the star model uses denormalized data. In a star model, every dimension refers directly to the fact table, and the business hierarchy is not expressed through referential integrity between dimensions.


▲ Figure 2 Star model

  2) Business model

A primary key is a single unique key (data attribute) selected for particular data. In the example above, advertiser_id would be a primary key. A foreign key (reference attribute) is simply a field in one table that matches the primary key of another dimension table. In our example, advertiser_id would be a foreign key to account_dimension.

In a snowflake model, the business hierarchy of the data model is represented through primary key-foreign key relationships between different dimension tables. In a star model, all the necessary dimension tables have only foreign keys in the fact table.
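As a rough sketch of these key relationships (hypothetical DDL; note that Hive itself does not enforce primary or foreign key constraints, so the keys below are logical only):

```sql
-- Dimension table: advertiser_id is the (logical) primary key.
CREATE TABLE advertiser_dim (
  advertiser_id   BIGINT,
  advertiser_name STRING,
  address         STRING
);

-- Fact table in a star model: it carries only foreign keys to the
-- dimension tables plus the numeric facts to be aggregated.
CREATE TABLE impressions_fact (
  advertiser_id BIGINT,   -- logical foreign key referencing advertiser_dim
  time_sk       BIGINT,   -- logical foreign key referencing a time dimension
  impressions   BIGINT    -- the fact: a computable measure
);
```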

 3) Performance

The third difference is performance. The snowflake model has many joins between dimension tables and the fact table, so performance is relatively low. For example, to get details about an advertiser, such as the advertiser name, ID, and address, the snowflake model requires the advertiser and client tables to be joined to each other and then joined to the fact table.

The star model requires far fewer joins: in this model, to get the same information you simply join the advertiser dimension table to the fact table.

  4) ETL

The snowflake model loads data into the data marts, so the ETL operations are more complex in design and, because of the dependencies between tables in the model, cannot be fully parallelized.

The star model loads the dimension tables directly, with no need for intermediate models between dimensions, so the ETL is relatively simple and can achieve a high degree of parallelism.

  Summary

The snowflake model makes dimensional analysis easier, for example, "Which customers or companies are online for a particular advertiser?" The star model is better suited to metric analysis, such as "What is the revenue of a given customer?"

=================== End =======================

In decision support systems and data warehouses, a simple and common schema is the star schema, where events are stored in large fact tables and many small dimension tables describe the data in the fact table.

TPC-DS is an example of a star schema.

1. Star Schema Example

select count(*) cnt
from store_sales ss
join household_demographics hd on (ss.ss_hdemo_sk = hd.hd_demo_sk)
join time_dim t on (ss.ss_sold_time_sk = t.t_time_sk)
join store s on (s.s_store_sk = ss.ss_store_sk)
where
  t.t_hour = 8
  and t.t_minute >= 30
  and hd.hd_dep_count = 2
order by cnt;

2. Prior Support for MapJoin

Hive supports MapJoins, which are well suited to this scenario, at least as long as the dimension tables are small enough to fit in memory.

Before Hive 0.11, hive.auto.convert.join defaulted to false. To use a MapJoin, you either gave an optimizer hint:

select /*+ MAPJOIN(time_dim) */ count(*)
from store_sales join time_dim on (ss_sold_time_sk = t_time_sk);

or enabled automatic MapJoin conversion with a parameter:

set hive.auto.convert.join=true;

select count(*)
from store_sales join time_dim on (ss_sold_time_sk = t_time_sk);

Starting with Hive 0.11.0, hive.auto.convert.join defaults to true.

A MapJoin hashes the small table into memory and then matches the big table's keys against it. The processing divides into the following stages:

1) Local work (on the local machine):

- read records via a standard table scan (including filters and projections) from the source
- build a hash table in memory
- write the hash table to local disk
- upload the hash table to HDFS
- add the hash table to the distributed cache

2) Map task:

- read the hash table from local disk (distributed cache) into memory
- match the records' keys against the hash table
- combine the matches and write the output

3) No reduce task: a characteristic of MapJoin is that there is no reduce phase.

Limitations of the Prior Implementation

Before Hive 0.11, MapJoin had the following limitations:

1) A MapJoin can handle only one key at a time. It can perform a multi-table join, but only if all the tables are joined on the same key. (Typical star joins do not fall into this category.)

2) Even with a hint, a MapJoin may not actually be used.

3) A series of MapJoins is never merged into a single map-only job unless the query is written as a cascading MapJoin: mapjoin(table, subquery(table, mapjoin(...))). Auto conversion never produces a single map-only job.

4) The hash table used by a MapJoin is regenerated on every sub-query run, and must first be downloaded and then distributed to the map tasks.

Enhancements for Star Joins

The enhancements cover three aspects:

1) When MapJoin hints are used, turn a series of MapJoin operations into a single map-only job.

2) Apply the optimization as widely as possible (with a backup execution plan to fall back on).

3) Generate the hash tables directly on the task side (in the map tasks). In the current scheme they are generated locally on the client, uploaded to HDFS, and then sent to each map task via the distributed cache; task-side generation will be implemented in a future version.

The following sections describe each of these optimization enhancements:

Optimize Chains of Map Joins

The following SQL is decomposed into two separate map-only jobs:

select /*+ MAPJOIN(time_dim, date_dim) */ count(*)
from store_sales
join time_dim on (ss_sold_time_sk = t_time_sk)
join date_dim on (ss_sold_date_sk = d_date_sk)
where t_hour = 8 and d_year = 2002;

With the small tables read into memory, scanning the fact table only once, rather than twice, can greatly reduce execution time.

Current and Future Optimizations

1) Merge M*-MR patterns into a single MR job: turn a chain of map-only jobs followed by an MR job into one MR job.

2) Merge MJ->MJ into a single MJ when possible: collapse nested MapJoins into one MapJoin.

3) Merge MJ* patterns into a single map stage as a chain of MapJoin operators (not yet implemented; this is why the example above still runs as two independent map-only jobs rather than a single chain).

If hive.auto.convert.join is true, joins are not only converted to MapJoins; where possible, they are also merged into an MJ* pattern.

Optimize Auto Join Conversion

When auto join conversion is enabled, hints are no longer necessary. It is controlled by two parameters:

set hive.auto.convert.join.noconditionaltask = true;

(default true starting with Hive 0.11.0)

set hive.auto.convert.join.noconditionaltask.size = 10000000;

Tables smaller than this size are placed in memory. The size refers to the sum of the sizes of the hash tables placed in memory; in the current version, n-1 of the tables can be held in memory while the largest table is kept on disk for matching. There is no check for whether a table is compressed; the file size is obtained directly from HDFS.

The previous example can be turned into:

select count(*)
from store_sales
join time_dim on (ss_sold_time_sk = t_time_sk)
join date_dim on (ss_sold_date_sk = d_date_sk)
where t_hour = 8 and d_year = 2002;

If the sizes of these two dimension tables fit the configured limit, the joins are converted to MapJoins. (The size here, I believe, refers to hive.smalltable.filesize, whose default is 25MB.)

If the combined size of the dimension tables is less than noconditionaltask.size, the two MapJoins are merged into one. This reduces the number of MR jobs and significantly increases query speed. The example extends easily to multi-way joins and will work as expected.

Outer joins cannot be converted to MapJoins, because a MapJoin can have only one streamed table, all columns of the streamed table must be populated, and an outer join may produce NULLs. This means that only inner joins can be map-joined; for an outer join, the only tuning option is which table to stream. Cartesian products cannot be map-joined at all.

The automatic conversion can also be applied to sort-merge-bucket joins.

Current optimization: combine multiple MJs into a single MJ.

Auto Conversion to SMB (Sort-Merge-Bucket) Map Join

A bucket-based join can be converted to a bucket-based map join.

The prerequisite is that the tables are bucketed. A sorted table performs a map join faster than an unsorted one. If a table is both partitioned and bucketed, it may be slower, because each mapper would need to fetch a very small chunk of a partition that has a single key.
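For reference, a table satisfying these prerequisites could be declared like this (a hypothetical sketch; the columns and bucket count are made up):

```sql
-- Both join-side tables must be bucketed (and ideally sorted) on the join key,
-- with compatible bucket counts, for an SMB map join to apply.
CREATE TABLE store_sales_bucketed (
  ss_sold_time_sk BIGINT,
  ss_item_sk      BIGINT,
  ss_sales_price  DECIMAL(7,2)
)
CLUSTERED BY (ss_sold_time_sk)
SORTED BY (ss_sold_time_sk ASC)
INTO 32 BUCKETS;
```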

The following configuration parameters enable conversion of an SMB join to an SMB map join:

set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

There is also an option to set the big table selection policy:

set hive.auto.convert.sortmerge.join.bigtable.selection.policy
    = org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;

By default, the policy is based on average partition size. The big table policy helps determine which table to stream, as opposed to hashing and streaming it.

The list of available selection policies is:

org.apache.hadoop.hive.ql.optimizer.AvgPartitionSizeBasedBigTableSelectorForAutoSMJ (default)

org.apache.hadoop.hive.ql.optimizer.LeftmostBigTableSelectorForAutoSMJ

org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ

Each policy's name indicates its use. This is especially relevant for fact-to-fact joins.

Generate Hash Tables on the Task Side

In a future version, hash table generation may move to the task side (it currently happens on the client).

Pros and Cons of Client-Side Hash Tables

Generating the hash table (or several hash tables for a multi-table join) on the client machine is problematic, because the client is the machine that runs the Hive client and submits the job.

Disadvantages:

- Data locality: the client machine is typically not a data node, so all the data it reads is remote and must be fetched over the network.

- Specs: for the same reason, the specifications of the machine doing this processing are unknown; it may have limitations in memory, disk, or CPU that the task nodes do not.

- HDFS upload: the data has to be brought back to the cluster and replicated via the distributed cache in order to be used by the task nodes.

Benefits:

- Because of filters and projections, what is stored in the distributed cache is likely to be smaller than the original table.

- In contrast, loading the hash tables directly on the task nodes via the distributed cache would mean larger objects in the cache, potentially reducing the opportunities for using MapJoin.

Task-Side Generation of Hash Tables

When the hash tables are generated on the task side, every task node must access the original data source to build its hash table (hitting the same source at the same time). Normally this happens in parallel and causes no delay, but Hive can also read external data sources such as HBase or a relational database, and many tasks accessing such a source simultaneously can cause delays.

Further Options for Optimization

1. Increase the replication factor on dimension tables.

2. Use the distributed cache to hold dimension tables.
