Usage and optimization notes for Hive

hive.optimize.cp=true: column pruning
When reading data, only the columns needed by the query are read; the other columns are ignored. For example, for the query SELECT a, b FROM t WHERE e < 10, where t contains five columns (a, b, c, d, e), columns c and d are ignored and only a, b, and e are read.
hive.optimize.pruner=true: partition pruning
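For example, with a hypothetical table logs partitioned by dt, a filter on the partition column lets Hive skip every other partition directory entirely:

SELECT a, b FROM logs WHERE dt = '2013-01-01';   -- only the dt='2013-01-01' partition is scanned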
 
LIMIT
hive.limit.optimize.enable=true: optimize LIMIT N statements
When this option is on, a simple LIMIT samples the data instead of scanning the whole table. The default is false. On the drawback of this optimization, the Programming Hive book explains that because the input is only sampled, useful rows may never be seen, so the results carry the risk of differing from an unsampled run.
hive.limit.row.max.size=1000000: the size guaranteed to each row when sampling data for a simple LIMIT
hive.limit.optimize.limit.file=10: the maximum number of files to sample
hive.limit.optimize.fetch.max: the maximum number of rows allowed when fetching sampled data for a simple LIMIT; default 50000. It applies to queries only; INSERT statements are not affected.
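A minimal usage sketch (table name is hypothetical): with the option on, the LIMIT below reads only a sample of the input files rather than the full table, so rows that would appear in a full scan may be missing:

set hive.limit.optimize.enable=true;
SELECT * FROM t LIMIT 10;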
 
1. Local mode (small jobs): run the whole job on a single machine.
The following conditions must all be met:
1. The job's input size must be less than hive.exec.mode.local.auto.inputbytes.max (default 128MB)
2. The job's number of maps must be less than hive.exec.mode.local.auto.tasks.max (default 4)
3. The job's number of reduces must be 0 or 1
hive.exec.mode.local.auto.inputbytes.max=134217728
hive.exec.mode.local.auto.tasks.max=4
hive.exec.mode.local.auto=true
hive.mapred.local.mem: JVM memory size used in local mode

2. Parallel execution: controls whether independent jobs in the same SQL statement can run concurrently; false by default.
   http://blog.csdn.net/jiedushi/article/details/7965604
hive.exec.parallel=true (default false)
hive.exec.parallel.thread.number=8: the maximum number of jobs of one SQL statement that may run concurrently; defaults to 8, i.e. up to 8 jobs can run at the same time.
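For instance (hypothetical tables t1 and t2), the two GROUP BY branches below are independent jobs, so with the option on they can run at the same time:

set hive.exec.parallel=true;
SELECT u.key, u.cnt FROM (
  SELECT key, count(*) AS cnt FROM t1 GROUP BY key
  UNION ALL
  SELECT key, count(*) AS cnt FROM t2 GROUP BY key
) u;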
3. Strict mode: nonstrict or strict; the default is nonstrict.
hive.mapred.mode=strict. Strict mode refuses to execute the following queries (sketches with fixes follow this list):
a query on a partitioned table that does not filter on a partition
an ORDER BY statement without a LIMIT clause
a Cartesian product: a JOIN with no ON clause
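Sketches of the three rejected patterns and one possible fix for each (table names are hypothetical):

set hive.mapred.mode=strict;
-- rejected: SELECT * FROM logs;               (no partition filter)
SELECT * FROM logs WHERE dt = '2013-01-01';    -- OK: partition specified
-- rejected: SELECT * FROM t ORDER BY key;     (no LIMIT)
SELECT * FROM t ORDER BY key LIMIT 100;        -- OK: LIMIT added
-- rejected: SELECT * FROM a JOIN b;           (Cartesian product)
SELECT * FROM a JOIN b ON a.key = b.key;       -- OK: ON clause added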

4. Dynamic partitioning:
hive.exec.dynamic.partition.mode=strict: in this mode at least one static partition value must be specified
hive.exec.max.dynamic.partitions=1000: the maximum number of partitions a statement may create
hive.exec.max.dynamic.partitions.pernode=100: the maximum number of partitions each mapper/reducer node may create
dfs.datanode.max.xcievers=8192 (DataNode): how many files a DataNode is allowed to have open at once
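A minimal dynamic-partition insert, assuming a hypothetical target table sales partitioned by dt and a staging table whose last selected column supplies the partition value:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;   -- or stay strict and give one static partition value
INSERT OVERWRITE TABLE sales PARTITION (dt)
SELECT s.id, s.amount, s.dt FROM staging s;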

5. Speculative execution: slow tasks get backup attempts launched for them; the first attempt to finish wins.
mapred.map.tasks.speculative.execution=true
mapred.reduce.tasks.speculative.execution=true
hive.mapred.reduce.tasks.speculative.execution=true;

6. Single MapReduce job for multiple GROUP BYs
hive.multigroupby.singlemr=true: when multiple GROUP BY statements share the same grouping columns, they are optimized into a single MR job.
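A sketch (hypothetical tables): both inserts below group on the same column key, so with the option on they can be served by one MR job instead of two:

set hive.multigroupby.singlemr=true;
FROM src
INSERT OVERWRITE TABLE t1 SELECT key, count(*)   GROUP BY key
INSERT OVERWRITE TABLE t2 SELECT key, sum(value) GROUP BY key;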

7. hive.exec.rowoffset: whether to provide virtual columns
Hive provides three virtual columns: INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE and ROW__OFFSET__INSIDE__BLOCK. ROW__OFFSET__INSIDE__BLOCK is unavailable by default; set hive.exec.rowoffset=true to enable it. They can be used to track down problematic input data.
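For example, to locate the file and offset of bad rows (table and column names are hypothetical):

set hive.exec.rowoffset=true;
SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, ROW__OFFSET__INSIDE__BLOCK, id
FROM t WHERE id IS NULL;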
8. Grouping
Two aggregate functions in one query cannot use different DISTINCT columns; the following expression is invalid:
INSERT OVERWRITE TABLE pv_gender_agg SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip) FROM pv_users GROUP BY pv_users.gender;
Only GROUP BY columns or aggregate functions may appear in the SELECT list.
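One valid rewrite is to keep a single DISTINCT column per statement, for example:

SELECT pv_users.gender, count(DISTINCT pv_users.userid) FROM pv_users GROUP BY pv_users.gender;
SELECT pv_users.gender, count(DISTINCT pv_users.ip) FROM pv_users GROUP BY pv_users.gender;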

9. Data Skew

Keyword        | Situation                                                                                  | Consequence
Join           | one of the tables is small, but its keys are concentrated                                  | the data sent to one or a few reduces is far above average
Join           | a large table joined with a large table, but the join field has too many 0 or NULL values | all the NULL values are handled by a single reduce, which is usually slow
Group by       | the group-by dimension is small, but some values occur a huge number of times             | the reduce that handles the hot value is often time-consuming
Count Distinct | too many special values                                                                    | the reduce that handles the special values is time-consuming

For the Join/NULL case above, replacing NULL join keys with distinct random values spreads them across reduces:

SELECT *
FROM log a
LEFT OUTER JOIN bmw_users b
ON CASE WHEN a.user_id IS NULL THEN concat('dp_hive_', rand()) ELSE a.user_id END = b.user_id;

10. Map-side aggregation
hive.map.aggr=true: do partial aggregation inside the map task; faster, but needs more memory.
hive.map.aggr controls how aggregation is performed. The default is false; when set to true, Hive does a first level of aggregation on the map side. This usually gives better performance, but requires more memory to run successfully.

set hive.map.aggr=true;
SELECT count(*) FROM table2;
set hive.exec.reducers.max=200;
set mapred.reduce.tasks=<N>;                    -- increase the number of reduces
set hive.groupby.mapaggr.checkinterval=100000;  -- a group key whose record count exceeds this value is split; set it according to the data volume
set hive.skewjoin.key=100000;                   -- a join key whose record count exceeds this value is split; set it according to the data volume
set hive.optimize.skewjoin=true;                -- if the skew occurs in a join, set this to true

hive.groupby.skewindata=true: load balancing when data is skewed. When set to true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly across the reduces; each reduce does a partial aggregation and emits its result, so rows with the same GROUP BY key may land on different reduces, which balances the load. The second job then distributes the pre-aggregated results to the reduces by GROUP BY key (this guarantees that identical GROUP BY keys go to the same reduce) and completes the final aggregation.

11. Multi-group-by inserts

Multi-insert is suited to writing into different tables at different granularities from the same source table: the source table is scanned only once, so while the number of jobs is unchanged, the number of source-table scans drops.
UNION ALL is also useful: it reduces the number of table scans and jobs. A common pattern is to pre-generate queries with different logic or conditions, UNION ALL them, and then do a single GROUP BY on the result. A UNION ALL over different tables is equivalent to multiple inputs; a UNION ALL over the same table amounts to the map emitting several records per row in a single pass.
The output of a simple query or aggregation can be sent to multiple tables or even to HDFS files (which can then be managed with the HDFS tools). For example, to count the number of distinct users broken down by gender and, separately, by age, the following does both in a single scan of pv_users:

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
  SELECT pv_users.age, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;
12. Sorting
ORDER BY colname ASC/DESC: global ordering
hive.mapred.mode=strict requires it to be followed by a LIMIT clause
hive.mapred.mode=nonstrict uses a single reduce to do the ordering
SORT BY colname ASC/DESC: ordering within each reduce
DISTRIBUTE BY (used in subqueries): controls which reducer a given row goes to, but does not guarantee ordering inside the reduce
CLUSTER BY: shorthand for SORT BY and DISTRIBUTE BY on the same column
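A sketch contrasting the forms (hypothetical table t): the first query distributes rows by key and sorts within each reduce; the second is the equivalent shorthand:

SELECT key, value FROM t DISTRIBUTE BY key SORT BY key;
SELECT key, value FROM t CLUSTER BY key;   -- same as DISTRIBUTE BY key SORT BY key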

 

13. Merging Small Files
hive.merge.mapfiles=true: merge the outputs of map-only jobs
hive.merge.mapredfiles=false: merge the outputs of map-reduce jobs
hive.merge.size.per.task=256*1000*1000: target size of the merged files
hive.mergejob.maponly=true: generate a map-only job to do the merge, if CombineHiveInputFormat is supported
hive.merge.smallfiles.avgsize=16000000: when the average output file size is below this value, an extra MR job is started to perform the merge.

 

14. Number of maps/reduces
Decreasing the number of maps:
set mapred.max.split.size
set mapred.min.split.size
set mapred.min.split.size.per.node
set mapred.min.split.size.per.rack
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
Increasing the number of maps:
When the input files are large, the task logic is complex, and the maps run slowly, consider increasing the number of maps so that each map processes less data, which improves job efficiency.
Suppose you have a task like this:
SELECT data_desc, count(1), count(DISTINCT id), sum(CASE WHEN ...), sum(CASE WHEN ...), sum(...) FROM a GROUP BY data_desc
If table a has only one file, 120MB in size but containing tens of millions of records, finishing this with a single map is certainly slow. In that case, consider splitting the one file into several, so that several map tasks can be used:
set mapred.reduce.tasks=10;
CREATE TABLE a_1 AS SELECT * FROM a DISTRIBUTE BY rand(123);
This randomly scatters the records of table a into the table a_1, which consists of 10 files. Replacing a with a_1 in the SQL above then uses 10 map tasks, each handling a bit more than 12MB (several million records) of data, which is far more efficient.

Setting the number of reduces:
Parameter 1: hive.exec.reducers.bytes.per.reducer=1G: the amount of data each reduce task processes
Parameter 2: hive.exec.reducers.max=999 (about 0.95 * the number of TaskTrackers): the maximum number of reduces per job
Number of reducers = min(parameter 2, total input size / parameter 1)
mapred.reduce.tasks: the default number of reduces per job, typically 0.99 * the number of reduce slots; Hive sets it to -1 so that the number of reduces is determined automatically.
hive.exec.reducers.max: the maximum number of reducers. If mapred.reduce.tasks is set to a negative value, Hive treats this as the upper bound on the reducer count, also considering (input size / hive.exec.reducers.bytes.per.reducer), and uses the smaller of the two as the number of reducers. Hive's default is 999.
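A worked example under the defaults above: with 10GB of total input, number of reducers = min(999, 10GB / 1GB) = 10. Setting a fixed count bypasses the formula:

set hive.exec.reducers.bytes.per.reducer=1000000000;  -- 1GB per reduce
set mapred.reduce.tasks=15;                           -- forces exactly 15 reduces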
 

15. Using indexes:
hive.optimize.index.filter: use indexes automatically
hive.optimize.index.groupby: use aggregate indexes to optimize GROUP BY operations
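A minimal sketch of building a compact index and letting the planner use it (Hive 0.7-2.x syntax; table and index names are hypothetical):

CREATE INDEX t_key_idx ON TABLE t (key)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX t_key_idx ON t REBUILD;       -- materialize the index data
set hive.optimize.index.filter=true;      -- let queries on t.key use it automatically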

16. Compression:
hive.exec.compress.output: whether the output of the query's final map/reduce job is compressed. Default false, but it is generally worth turning on: besides the space savings, it reduces IO at negligible CPU cost.
hive.exec.compress.intermediate: similar to the previous one, but for the outputs of the intermediate map/reduce jobs of a query; default false.
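A typical combination (the Gzip codec is one common choice, not mandated here):

set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.exec.compress.intermediate=true;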



17. General notes:
Filter data as early as possible to reduce the amount of data at each stage; use partition pruning on partitioned tables and select only the columns that are needed.
Keep statements atomic: avoid packing complex logic into one SQL statement; use intermediate tables to carry the complex logic.
Keep the number of jobs in a single SQL statement to 5 or fewer where possible.
Use MAPJOIN carefully: it generally suits only small tables, under about 2000 rows and 1MB (these limits can be raised somewhat after cluster expansion), and the small table should be placed on the left side of the join (in TCL many queries currently put the small table on the right); a sketch follows this list.
Failing to observe this can cause heavy disk and memory consumption.
Understand the characteristics of the data before writing the SQL; wherever there are JOIN or GROUP BY operations, watch for data skew.
If a UNION ALL has more than two branches, or each branch carries a large amount of data, split it into multiple INSERT INTO statements; in actual tests this improved execution time by 50%.
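The MAPJOIN sketch referenced above (hypothetical tables): the hint loads the small table dim into each map task's memory, so the join completes map-side with no reduce phase:

SELECT /*+ MAPJOIN(d) */ f.id, d.name
FROM fact f JOIN dim d ON f.dim_id = d.id;   -- dim is the small (<1MB) table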
