Hive Data Skew

Source: Internet
Author: User

Causes of data skew:

The ultimate goal is to distribute the map output as evenly as possible across the reducers. Because of the limitations of hashing, partitioning by key hash always produces some degree of skew. Extensive experience shows, however, that most data skew is caused by human oversight or by business logic that can be worked around.

Solution approach:

Hive executes a query in stages, and the amount of data each map task processes depends on the reduce output of the previous stage. Distributing data evenly across the reducers is therefore the key to solving data skew.
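
Before changing anything, it helps to check how unevenly the keys are actually distributed. The query below is a minimal sketch of such a check; the table user_log and the column user_id are hypothetical names, not taken from the original article.

```sql
-- Hypothetical table and column: user_log, user_id.
-- Count the rows per key and list the heaviest keys first.
-- A handful of keys with counts far above the rest means the
-- reducers that receive those keys will do most of the work.
SELECT user_id, COUNT(*) AS row_cnt
FROM user_log
GROUP BY user_id
ORDER BY row_cnt DESC
LIMIT 20;
```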

Specific measures:

Memory and I/O optimization:

Driving table: use the large table as the driving table to prevent memory overflow. The right-most table in a join is the driving table. Map join ignores the join order and uses the large table as the driving table. The STREAMTABLE hint can also be used to choose it.

1. Mapjoin is a means of avoiding data skew

A map join performs the join in the map phase. The small table is loaded entirely into memory, and each mapper matches the rows of the other table directly against the in-memory table. Because the join happens in the map phase, the reduce phase is skipped, which makes the operation much more efficient.
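
As a rough illustration, a map join can be requested either through automatic conversion or through a hint. The table names fact_orders and dim_region are hypothetical; the two configuration properties are standard Hive settings, but whether the hint is honored depends on the Hive version and its settings.

```sql
-- Let Hive convert a join to a map join when one side is small enough.
SET hive.auto.convert.join=true;
-- Size threshold in bytes for the "small" table (25000000 is the documented default).
SET hive.mapjoin.smalltable.filesize=25000000;

-- Alternatively, name the small table explicitly in a MAPJOIN hint.
-- Hypothetical tables: fact_orders (large), dim_region (small).
SELECT /*+ MAPJOIN(d) */ f.order_id, d.region_name
FROM fact_orders f
JOIN dim_region d ON f.region_id = d.region_id;
```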

The article "Hive: join encounter problems" describes the specific steps.

When joining multiple tables, place the small tables on the left side of the join and the large table on the right side. The data of the tables on the left is cached in memory during such a join, so putting the small tables there effectively reduces the chance of out-of-memory errors.
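
A minimal sketch of this ordering, assuming three hypothetical tables of increasing size (small_dim, mid_ref, big_fact):

```sql
-- In a reduce-side join Hive buffers the rows of the earlier tables
-- and streams the last table through the reducers, so the largest
-- table is written last (right-most).
SELECT b.id, s.name, m.label
FROM small_dim s
JOIN mid_ref m ON s.id = m.id
JOIN big_fact b ON m.id = b.id;
```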

2. Setting parameters

hive.map.aggr = true

hive.groupby.skewindata = true

Other parameters are available as well.
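
The two settings above might be applied per session as sketched below. The table user_log and its columns are hypothetical. hive.map.aggr enables partial aggregation on the map side, and hive.groupby.skewindata makes Hive plan the GROUP BY as two MapReduce jobs: the first spreads rows randomly across the reducers and pre-aggregates, the second aggregates by the real key.

```sql
-- Partial aggregation in the mappers, before rows are shuffled.
SET hive.map.aggr=true;
-- Two-job plan for GROUP BY so that a hot key is not handled by a single reducer.
SET hive.groupby.skewindata=true;

-- Hypothetical table: user_log(user_id, ...).
SELECT user_id, COUNT(*) AS pv
FROM user_log
GROUP BY user_id;
```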

3. SQL statement adjustments

For example, when the GROUP BY dimension is too small (it produces only a few groups), replace count(distinct) with sum() ... group by to complete the calculation.
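
The count(distinct) replacement can be sketched as follows. The table user_log and the columns dt and user_id are hypothetical. The inner GROUP BY removes duplicates using many reducers, and the outer sum() only adds up one row per distinct value.

```sql
-- Original form: all user_id values of one dt are deduplicated in one reducer.
SELECT dt, COUNT(DISTINCT user_id) AS uv
FROM user_log
GROUP BY dt;

-- Rewritten form: deduplicate with GROUP BY first, then sum.
-- The CASE expression skips the NULL group so the result matches
-- COUNT(DISTINCT ...), which ignores NULL values.
SELECT dt,
       SUM(CASE WHEN user_id IS NOT NULL THEN 1 ELSE 0 END) AS uv
FROM (
  SELECT dt, user_id
  FROM user_log
  GROUP BY dt, user_id
) t
GROUP BY dt;
```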

4. STREAMTABLE

When a join is performed in the reducer, the small table is held in memory and the large table is read as a stream.
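
When the large table cannot be written last, the STREAMTABLE hint names the table to stream. A minimal sketch with hypothetical tables big_fact and small_dim:

```sql
-- big_fact is listed first, but the hint tells Hive to stream it
-- and buffer small_dim instead.
SELECT /*+ STREAMTABLE(b) */ b.id, s.name
FROM big_fact b
JOIN small_dim s ON b.id = s.id;
```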

5. Index

Hive has provided bitmap indexes since version 0.8.0. They can accelerate GROUP BY queries, but they are rarely used.
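
For reference, a bitmap index is created roughly as shown below. The table user_log, the column city and the index name are hypothetical. Note that index support was removed in Hive 3.0, so this sketch applies to older releases only.

```sql
-- Define a bitmap index on a low-cardinality column; the index is
-- not populated until it is rebuilt.
CREATE INDEX idx_user_log_city
ON TABLE user_log (city)
AS 'BITMAP'
WITH DEFERRED REBUILD;

-- Build (or refresh) the index data.
ALTER INDEX idx_user_log_city ON user_log REBUILD;
```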

Other optimizations:

1. Column pruning: read only the columns that are actually needed

2. Predicate pushdown: filter data as early as possible (see Figure 7, where the logic lower in the plan is processed first) to reduce the amount of data processed later

3. Partition pruning: read only the files that match the partition conditions

4. Map join: when a join involves small tables, the join can be performed in the map phase; see section 3.2.2, Map-join

5. Join reordering: when the join is performed in the reducer, put the small tables into memory and read the large table as a stream

6. Group-by optimization: perform partial aggregation (hash-based or sort-based); for skewed keys (where the row count and size per key are very uneven at the reducers), the aggregation can be split into two MapReduce jobs
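
Several of the optimizations in this list are applied by the optimizer to an ordinary query, provided the query gives it something to work with. The sketch below assumes a hypothetical table orders partitioned by stat_date and a hypothetical table customers.

```sql
SELECT o.order_id, o.amount          -- column pruning: only the named columns are read
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.stat_date = '20120625'       -- partition pruning: only this partition's files are read
  AND c.country = 'CN';              -- predicate pushdown: customers is filtered before the join
```

Running EXPLAIN on such a query shows where the filters end up in the plan.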

Hive's default configuration parameters are conservative, so queries run less efficiently than they could, and adjusting the configuration can make them faster. The parameters below are the ones with the largest effect on query efficiency.

Meta Data:

Nested SQL parallel execution optimizations:

Set hive.exec.parallel=true;

Set hive.exec.parallel.thread.number=16;
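
Parallel execution only helps when a query contains stages that do not depend on each other. A minimal sketch, with hypothetical tables orders and refunds, where the two branches of the UNION ALL can run at the same time:

```sql
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=16;

-- The two aggregations are independent stages, so Hive may run them
-- concurrently before combining the results.
SELECT t.src, t.cnt
FROM (
  SELECT 'orders' AS src, COUNT(*) AS cnt FROM orders
  UNION ALL
  SELECT 'refunds' AS src, COUNT(*) AS cnt FROM refunds
) t;
```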

Sorting optimization

ORDER BY produces a global ordering using a single reducer, which is inefficient.

SORT BY produces a partial ordering: the output of each individual reducer is sorted, which is efficient. It is usually used together with DISTRIBUTE BY, which specifies the key used to distribute rows from the map side to the reducers.

CLUSTER BY col1 is equivalent to DISTRIBUTE BY col1 SORT BY col1.
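
The three forms side by side, as a sketch over a hypothetical table orders(user_id, amount):

```sql
-- Global ordering: all rows pass through one reducer.
SELECT user_id, amount
FROM orders
ORDER BY amount DESC;

-- Partial ordering: rows with the same user_id go to the same reducer,
-- and each reducer sorts only its own rows.
SELECT user_id, amount
FROM orders
DISTRIBUTE BY user_id
SORT BY user_id, amount DESC;

-- Shorthand for DISTRIBUTE BY user_id SORT BY user_id.
SELECT user_id, amount
FROM orders
CLUSTER BY user_id;
```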

Merging small files

An excessive number of files puts pressure on HDFS and hurts processing efficiency. Merging the result files of map and reduce tasks reduces this effect.

hive.merge.mapfiles = true: whether to merge map output files; the default is true

hive.merge.mapredfiles = false: whether to merge reduce output files; the default is false

hive.merge.size.per.task = 256*1000*1000: the size of the merged files

These parameters are not listed in the table above because they can be set temporarily for an individual task rather than globally. A global setting can sometimes hurt performance when working with large files.
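
Set for a single session or script, the merge parameters might look like this. The values are the ones given above, with reduce-side merging switched on as an example.

```sql
-- Merge small files produced by map-only jobs.
SET hive.merge.mapfiles=true;
-- Also merge small files produced by jobs with a reduce phase.
SET hive.merge.mapredfiles=true;
-- Target size of the merged files in bytes (256*1000*1000).
SET hive.merge.size.per.task=256000000;
```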

Using partitions, RCFile, LZO, ORCFile, etc.

Each partition in Hive corresponds to a directory on HDFS. A partition column is not an actual field in the table but one or more pseudo-columns; the values of the partition columns are not stored in the table's data files. In the PARTITION clause, the primary partition comes first (there is only one), followed by the secondary partitions.
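
A partitioned table with a primary and a secondary partition could be declared as sketched below. The table name sales_part and its data columns are hypothetical; the partition columns follow the examples in this article. STORED AS ORC assumes a Hive release with ORC support.

```sql
-- stat_date is the primary partition, province the secondary one.
-- Each (stat_date, province) pair becomes a directory such as
-- .../sales_part/stat_date=20120625/province=Hunan on HDFS.
CREATE TABLE sales_part (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (stat_date STRING, province STRING)
STORED AS ORC;
```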

Static partitioning: the values of static partitions must be specified in the SQL statement, both when loading data and when querying it.

Example: (stat_date='20120625', province='Hunan')
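
In a full statement, the static partition specification appears in the PARTITION clause. The sketch assumes a hypothetical target table sales_part partitioned by (stat_date, province) and a hypothetical source table sales_staging.

```sql
-- Both partition values are fixed in the statement, so every selected
-- row is written to the same partition directory.
INSERT OVERWRITE TABLE sales_part
PARTITION (stat_date='20120625', province='Hunan')
SELECT order_id, amount
FROM sales_staging
WHERE stat_date = '20120625'
  AND province = 'Hunan';
```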

Dynamic partitioning: to use dynamic partitions, set the hive.exec.dynamic.partition parameter to true (the default is false in releases before Hive 0.9.0). By default, Hive requires the primary partition to be static and allows only the secondary partitions to be dynamic. To make every partition dynamic, set hive.exec.dynamic.partition.mode=nonstrict; the default is strict.

Example: (stat_date='20120625', province)
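
A sketch of the same kind of insert with a dynamic secondary partition, again assuming the hypothetical tables sales_part and sales_staging. The dynamic partition column must come last in the SELECT list.

```sql
SET hive.exec.dynamic.partition=true;
-- Only needed when no partition column is given a static value:
-- SET hive.exec.dynamic.partition.mode=nonstrict;

-- stat_date is static; province is taken from the last selected column,
-- so one partition is created per province found in the data.
INSERT OVERWRITE TABLE sales_part
PARTITION (stat_date='20120625', province)
SELECT order_id, amount, province
FROM sales_staging
WHERE stat_date = '20120625';
```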
