Improvements to the Hive Optimizer, improvementshive

Source: Internet
Author: User

Improvements to the Hive Optimizer, improvementshive

LanguageManual JoinOptimization

Improvementsto the Hive Optimizer

Hive can be automatically optimized, and some optimization cases are improved in Hive 0.11.

1. The JOIN side is suitable for storing in memory and there are new optimization solutions

A) read the table into the memory as a hash table

B) scan only large tables

C) fact tables only use a small amount of memory

2. Star join

3. In many cases, hint is no longer required.

4. Automatic Optimization of Map Join

 

StarJoin Optimization

First, we will introduce the star and snow pattern models.

==============================Start =

1. Introduction

The star mode is a multidimensional data relationship that consists of a fact Table and a set of Dimension tables. Each dimension table has a dimension as the primary key. The primary keys of all these dimensions are combined into the primary keys of the fact table. The non-primary key attribute of a Fact table is called a Fact (Fact). They are generally numerical values or other data that can be computed. dimension values are data of the text, time, and other types, after data is organized in this way, we can sum (summary) and average (average) The fact data according to different dimensions (some or all of the primary keys of the fact table), count, percentage, or even 20-20 ~ 80 analysis. In this way, you can analyze the business theme from different perspectives.

In the business intelligence solution of multidimensional analysis, common models can be divided into stars based on the relationship between fact tables and dimension tables.Model and snow Pattern. When designing a model for logical data, consider whether the data is organized according to the star or snowflake model.

When all dimension tables are directly connected to the "fact table", the entire graph is like a star, so this model is called a star model, 2.

The star schema is an informal structure. Each dimension of a multi-dimensional dataset is directly connected to a fact table and there is no gradient dimension. Therefore, data is redundant.For example, in the region dimension table, there are two records of City C in province A and City D in province B in country A, and the information of country A and province B is stored twice, respectively, redundancy exists.

Is a star model in the sales data warehouse.:

When one or more dimension tables are not directly connected to the fact table, but are connected to the fact table through other dimension tables, the diagram is like a snowflake model. The snowflake model is an extension of the star model. The dimension table of the star model is further layered, and the original dimension tables may be expanded to small fact tables to form some partial "layers" areas, these decomposed tables are connected to the primary dimension table instead of the fact table. 2. Divide the region dimension table into the dimension table of country, province, and city. It has the following advantages:Query performance can be improved by minimizing data storage and combining small dimension tables.The snowflake structure removes data redundancy.

For the snowflake model in the sales data warehouse:

Because of data redundancy, many statistical queries do not require external connections. Therefore, the star model is generally more efficient than the snowflake model. The star structure does not need to consider many normalization factors, so the design and implementation are relatively simple. Due to the de-redundancy of the snowflake model, some statistics need to be generated through table join, so the efficiency is not necessarily high in the star model. Normalization is also a complicated process. The database structure design, data ETL, and later maintenance are all complicated. Therefore, when redundancy is acceptable, the star model is used more efficiently.

 

2. Usage Selection

The StarSchema and snowflake models are two common methods in data warehouses, and the comparison between them should be discussed from four perspectives.

  1) Data Optimization

The snowflake model uses standardized data, which means that the data is organized within the database to eliminate redundancy. Therefore, it can effectively reduce the data volume.By referencing integrity, both the business level and dimension are stored in the data model.


▲Figure 1 snowflake model

In comparison, the star model is used to normalize data.In a star model, a dimension directly refers to a fact table, and the business level is not deployed by reference integrity between dimensions.


▲Figure 2 Star Model

 2) Business Model

A primary key is a unique key (Data Attribute) selected for special data. In the preceding example, Advertiser_ID is a primary key. The foreign key (reference attribute) is only a field in a table, used to match the primary key in other dimension tables. In the example we reference, Advertiser_ID is a foreign key of Account_dimension.

In the snowflake model, the business level of the data model is represented by the relationship between the primary key and the foreign key of a table of different dimensions. In a star model, all necessary dimension tables only have foreign keys in fact tables.

 3) Performance

The third difference is performance.The snowflake model has many connections between dimension tables and fact tables, so the performance is relatively low.For example, if you want to know the details of Advertiser, the snowflake model will request a lot of information, such as AdvertiserName and ID, and the addresses of advertisers and customer tables need to be connected, connect to the fact table.

In this model, if you need the preceding information, you only need to connect the Advertiser dimension table to the fact table.

  4) ETL

The snowflake model loads the data mart. Therefore, the ETL operation is more complicated in design and cannot be parallel due to the limitations of the affiliated model.

When a star model loads dimension tables, you do not need to add additional models between dimensions. Therefore, ETL is relatively simple and highly parallel.

  Summary

The snowflake model makes dimension analysis easier. For example, "which customers or companies are online for specific advertisers ?" The star model is more suitable for index analysis. For example, "what is the revenue of a given customer ?"

 

=========================================================== ===

In decision-making support systems or data warehouses, a simple mode is the star mode, where events are stored in large fact tables (Facttables), Many small dimension tables (dimensions) to describe the data in the fact table.

Tpc ds is an example of the star mode.

1. StarSchema Example

Selectcount (*) cnt

Fromstore_sales ss

Join household_demographics hd on (ss. ss_hdemo_sk = hd. hd_demo_sk)

Join time_dim t on (ss. ss_sold_time_sk = t. t_time_sk)

Join store s on (s. s_store_sk = ss. ss_store_sk)

Where

T. t_hour = 8

T. t_minute> = 30

Hd. hd_dep_count = 2

Order by cnt;

 

2. PriorSupport for MAPJOIN

Hive supports MAPJOINS, which is suitable for this solution-at least for dimensions, which is smaller than enough for memory.

Before Hive 0.11, the default value of hive. auto. convert. join is false. If MAPJOIN is required, use the hint method of the optimizer:

Select/* + MAPJOIN (time_dim) */count (*) from

Store_sales join time_dimon (ss_sold_time_sk = t_time_sk );

Alternatively, mapjoin is automatically performed after parameters are set:

Sethive. auto. convert. join = true;

Selectcount (*) from

Store_sales join time_dimon (ss_sold_time_sk = t_time_sk );

 

Starting from Hive 0.11.0, the default value of hive. auto. convert. join is true.

MAPJOINS reads the hash map form of a small table into the memory, and then matches the key with the big table. The following is the division of labor in each stage:

1) Localwork: Local

• Readrecords via standard table scan (including filters and projections) from sourceon local machine -------- scan a table

• Buildhashtable in memory ------- create a hash table in the memory

• Writehashtable to local disk -------- write the hash table to the local disk

• Uploadhashtable to dfs ----------- upload a hash table to hdfs

• Add hashtable to distributed cache -------- add the hash table to the distributed cache

2) Maptask: Map task

• Readhashtable from local disk (distributed cache) into memory ------ read the hash table into the memory from the local disk (distributed cache)

• Matchrecords 'keys against hashtable -------- match the key of the hash table

• Combine matches and write to output -------- merge and match and write output

3) Noreduce task: MapJoin feature, no reduce

 

Limitationsof Prior Implementation

MAPJOINThere are some restrictions before Hive 0.11:

1) A mapjoin can only process one key at a time. It can perform multi-table join operations, but only when all tables have the same key added. (Typical star connections do not belong to this category)

2) even if hint is added, mapjoin may not be used.

3) A series of mapjoins will not be merged into a single map job, unless the query is written into a cascading mapjoin (table, subquery (mapjoin (table, subquery ....). after automatic conversion, it will not become a single map job.

4) the hash table used in mapjoin. Each sub-QUERY is generated and downloaded before being distributed to map.

 

Enhancementsfor Star Joins

Optimization mainly starts from three aspects:

1) When MapJoinHint is used, a series of MapJoin operations are converted into a map-only job.

2) Change the optimization solution to automatic optimization as much as possible (by the way, back up the execution plan ).

3) make hashtable generate directly on taskside (map side). The current solution is to generate it locally, then upload it to HDFS, and then distribute the cache to each map. It will be implemented in future versions.

The following sections describe each optimization enhancement aspect:

OptimizeChains of Map Joins

The following SQL statement is divided into two independent map-only jobs for execution:

Select/* + MAPJOIN (time_dim, date_dim) */count (*) from

Store_sales

Jointime_dim on (ss_sold_time_sk = t_time_sk)

Joindate_dim on (ss_sold_date_sk = d_date_sk)

Where t_hour = 8 andd_year = 2002;

Read small tables into the memory. If the fact is read-only once instead of twice, the execution time will be greatly reduced.

 

Current and Future Optimizations Current and Future tuning directions

1) MergeM *-MR patterns into a single MR. ---- convert the mode of multiple map-only jobs + mrjobs into a single MR

2) MergeMJ-> MJ into a single MJ when possible. ----- convert the nested mapjoin mode into a mapjoin as much as possible.

3) Merge MJ * patterns implements a single Map stage as a chain of MJ operators. (Not yet implemented .) ------------ concatenate multiple mapjoins into a series of mapjoins (the above example is divided into two independent map-only jobs, rather than a series of jobs, which are not implemented yet)

If hive. auto. convert. join is true, it not only converts join to mapjoin, but also converts it to MJ * mode.

 

OptimizeAuto Join Conversion

When auto join is enabled, the hint is no longer needed. The parameters include:

Sethive. auto. convert. join. noconditionaltask = true;

The default value of Hive0.11.0 is true.

Sethive. auto. convert. join. noconditionaltask. size = 10000000;

A table smaller than this size is put into the memory. This size refers to the total size of the hash table put into the memory. In the current version, n-1 tables can all be put into the memory, the largest table is placed on the disk to match. In this case, we will not check whether the file is compressed and the file size is obtained directly from HDFS.

The previous example can be changed:

Selectcount (*) from

Store_sales

Jointime_dim on (ss_sold_time_sk = t_time_sk)

Joindate_dim on (ss_sold_date_sk = d_date_sk)

Where t_hour = 8 andd_year = 2002;

 

If the sizes of the two dimension tables match the config size, they are converted to map-join (2 ).. I think the size here refers to hive. smalltable. filesize. The default value is 25 MB.

If the sum of dimension tables is smaller than noconditionaltask. size, two map-join statements are merged into one. This reduces the number of MR Jobs and significantly increases the query speed .. This example can be easily extended to muti-way join and will work as expected.

Map-join cannot be used for outer join. Because map-join can only have one steam table, all columns in the steam table should be full, and null may occur in Outer join.

This means that the outer join cannot be used. Only the inner join can be map-join. The external connection can only be optimized in the form of stream table. Let alone cartesian products. map-jon cannot be used.

The automatic switch can also be used in sort-merge-bucketjoins.

 

CurrentOptimizationCurrent optimization scheme

Merge multiple MJ instances into one MJ instance.

 

AutoConversion to SMB (Sort-Merge-Bucket) Map Join

A bucket-based join can be converted into a bucket-Based map join.

The premise is that the table is divided by bucket. A sorted table performs map join Operations faster than a non-sorted table. If the table is a partition table or a bucket table, it may be slower, because each mapper needs to obtain a small block in a single key partition (eachmapper wocould need to get a very small chunk of a partition which has a singlekey ).

The following configuration parameters convert an SMB instance to map-joinSMB:

Sethive. auto. convert. sortmerge. join = true;

Sethive. optimize. bucketmapjoin = true;

Sethive. optimize. bucketmapjoin. sortedmerge = true;

Sethive. auto. convert. sortmerge. join. noconditionaltask = true;

Here is an option to set the big table selection policy ):

Set hive. auto. convert. sortmerge. join. bigtable. selection. policy

= Org. apache. hadoop. hive. ql. optimizer. TableSizeBasedBigTableSelectorForAutoSMJ;

By default, it is the average partition size. This large table policy helps you determine whether to select stream, which is better than hash or stream.

The list of available selection policies is:

Org. apache. hadoop. hive. ql. optimizer. AvgPartitionSizeBasedBigTableSelectorForAutoSMJ (default)

Org. apache. hadoop. hive. ql. optimizer. LeftmostBigTableSelectorForAutoSMJ

Org. apache. hadoop. hive. ql. optimizer. TableSizeBasedBigTableSelectorForAutoSMJ

 

The purpose can be determined from the name, especially in the join of fact and fact.

 

GenerateHash Tables on the Task Side

In future versions, hash may be placed on the task side (currently generated on the client ).

Prosand Cons of Client-Side Hash Tables(Advantages and Disadvantages of generating hash tables on the client)

There is a problem in both generating hash and multi-Table hash join. Because the client machines are used to run the hive client or submit jobs.

Disadvantages:

• Data locality: The client machine typically is not a data node. Allthe data accessed is remote and has to be read via the network.

Data Distribution: client machines are generally not data nodes, and all data access is remote and must be read through the network.

• Specs: For the same reason, it is not clear what the specificationsof the machine running this processing will be. it might have limitations inmemory, hard drive, or CPU that the task nodes do not have.

Space: For the same reason, it is unclear about the machine. The memory, hard disk, and cpu conditions on the task node are unclear.

 

• HDFS upload: The data has to be brought back to the clusterand replicated via the distributed cache to be used by task nodes.

HDFS data upload: data is returned to the cluster and replicated through the distributed cache of the task node.

 

Benefits:

• What is stored in the distributed cache islikely to be smaller than the original table (filter and projection ).

Because filter or projection is performed, the generated hash table (to the distributed cache) may be smaller than the original table.

• In contrast, loading hashtables directly onthe task nodes using the distributed cache means larger objects in the cache, potentially cing opportunities for using MAPJOIN.

In contrast, if the task side directly uses the distributed cache to load the hash table, it means that the cache occupies a large table and indirectly reduces the possibility of using mapjoin.

 

Task-SideGeneration of Hash TablesGenerate a hash on the task side

When a task generates a hash, all task nodes must access the hash table generated by the original data source (and access unified resources at the same time ). Under normal circumstances, this operation is parallel and will not lead to latency. However, hive has the concept that multiple tasks can simultaneously access external data sources, such as HBase and Database, this may lead to latency.

 

FurtherOptions for OptimizationFuture Optimization Direction

1. Increasethe replication factor on dimension tables. ---- increase the replication factor of the dimension table.

2. Use the distributedcache to hold dimension tables. ---- Use distributed cache to store dimension tables

 

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.