Hive is the most widely used SQL-on-Hadoop tool. Recently, several major data companies have introduced newer SQL engines such as Impala, Tez, and Spark, built around columnar storage or keeping hot data in memory, and many people criticize Hive as inefficient, slow to query, and buggy. Even so, Hive remains the most widely deployed and ubiquitous SQL-on-Hadoop tool. According to an earlier survey report, about 90% of Taobao's analysis jobs run on Hive; at Storm (Baofeng) audio and video the proportion is even higher, with more than 95% running on Hive. Undeniably, Hive is a groundbreaking tool that leaves plenty of room for imagination. Moreover, for developing data-query programs on a production cluster, Hive is far more efficient than writing MapReduce jobs by hand.
By default, Hive's configuration parameters are rather conservative, so out-of-the-box efficiency is mediocre; adjusting the configuration can improve query efficiency considerably. Recorded below are several of the parameters that have the greatest impact on query efficiency.
The first thing to change after installing Hive is its metadata store. By default Hive uses the embedded Derby database to store metadata (it is unclear why Hive chose Derby rather than SQLite, since both are embedded databases). Switching the metadata store to an external database was covered in an earlier article, so it is not repeated here; see that article for how to change the metadata store.
Hive parameter optimization:
The most important are the first few and the last ones, especially the last two, which give the most obvious performance improvement. The trade-off is that more MapReduce tasks run at the same time, so this needs to be balanced against cluster capacity.
Parallel execution of independent stages in nested SQL:
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;
Efficiency can be increased by at least 100%.
For a job that requires 11 stages:
Non-parallel: 35 minutes
Parallelism of 8: 10 minutes
Parallelism of 16: 6 minutes
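As a minimal sketch (the table and column names are hypothetical), the two settings above can be enabled per session before a query whose stages have no dependency on each other, such as the branches of a UNION ALL:

```sql
-- Enable parallel execution of independent stages for this session only.
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;

-- The two subqueries below do not depend on each other,
-- so Hive can run their stages concurrently.
SELECT dt, COUNT(*) AS cnt FROM log_pv    GROUP BY dt
UNION ALL
SELECT dt, COUNT(*) AS cnt FROM log_click GROUP BY dt;
```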
Hive Query Optimization:
One, when joining a large table with a small table, list the small table first (to the left of the JOIN).
The reason is that in the reduce phase of a join, the rows from the tables to the left of the join operator are buffered in memory, so putting the table with fewer rows on the left effectively reduces the chance of an out-of-memory error.
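A minimal sketch (the table names small_dim and big_fact are hypothetical):

```sql
-- small_dim has few rows, big_fact has many:
-- listing small_dim first keeps the buffered side small.
SELECT s.name, b.amount
FROM small_dim s
JOIN big_fact b
  ON s.id = b.dim_id;
```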
Two, join optimization
If a query contains more than one join and all the joins use the same join key, Hive merges all of them into a single MapReduce job.
Example:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1) -- executed as one MapReduce job
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2) -- executed as two MapReduce jobs
The key to a map join is that one table in the join is small enough to be loaded into memory on every mapper, so the join is done on the map side and the shuffle is avoided.
Example:
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key
Three, replace COUNT(DISTINCT ...) with SUM(...) GROUP BY.
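A minimal sketch (the table log and column uid are hypothetical). The inner GROUP BY deduplicates across many reducers, avoiding the single-reducer bottleneck of COUNT(DISTINCT):

```sql
-- Instead of:
--   SELECT COUNT(DISTINCT uid) FROM log;
-- use:
SELECT SUM(1) FROM (
  SELECT uid FROM log GROUP BY uid
) t;
```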
Four, sorting optimization
ORDER BY performs a global sort in a single reducer, which is inefficient.
SORT BY produces a partial order: the output of each individual reducer is sorted, which is efficient. It is typically used together with DISTRIBUTE BY (DISTRIBUTE BY specifies the key used to distribute rows from the map side to the reduce side).
CLUSTER BY col1 is equivalent to DISTRIBUTE BY col1 SORT BY col1.
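A minimal sketch (table and column names are hypothetical): all rows for the same user land in the same reducer, and each reducer's output is sorted by time.

```sql
SELECT uid, event_time, action
FROM user_log
DISTRIBUTE BY uid         -- route all rows of a uid to one reducer
SORT BY uid, event_time;  -- sort within each reducer's output
```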
Five, merging of small files
An excessive number of small files puts pressure on HDFS (the NameNode) and hurts processing efficiency; merging the map and reduce output files minimizes this impact.
hive.merge.mapfiles = true -- merge the small files produced by map-only jobs; defaults to true
hive.merge.mapredfiles = false -- merge the small files produced by reduce output; defaults to false
hive.merge.size.per.task = 256*1000*1000 -- target size of the merged files
These parameters are not listed in the table above because they can be set temporarily per task and do not necessarily need to be set globally; sometimes a global setting hurts the performance of jobs over large files.
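As a sketch, the merge settings can be enabled per session just before a job known to produce many small output files, leaving the global defaults untouched:

```sql
-- Session-only overrides; global defaults remain unchanged.
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
```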
Six, use partitioning, RCFile, LZO, ORCFile, and so on.
Each partition in Hive corresponds to a directory on HDFS. Partition columns are not actual fields in the table but one or more pseudo-columns; the values of the partition columns are not stored in the table's data files. The partition listed first after the partition keyword is the primary partition (there is only one), followed by the secondary partitions.
Static partitioning: with static partitions, the partition values must be specified explicitly in the SQL statement when loading and querying data.
Example: (stat_date='20120625', province='Hunan')
Dynamic partitioning: dynamic partitions require hive.exec.dynamic.partition to be set to true (the default is false). By default, Hive assumes the primary partition is static and only the secondary partitions use dynamic partitioning; to make all partitions dynamic, you must set hive.exec.dynamic.partition.mode=nonstrict (the default is strict).
Example: (stat_date='20120625', province)
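A minimal sketch of loading into a dynamic secondary partition (the table names daily_stats and raw_log are hypothetical); note that the dynamic partition column must come last in the SELECT list:

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE daily_stats
PARTITION (stat_date='20120625', province)  -- province filled per row
SELECT uid, pv, province                    -- dynamic column last
FROM raw_log
WHERE dt = '20120625';
```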
Seven, use external tables and minimize the use of internal (managed) tables, mainly for data-safety reasons: dropping an external table removes only the metadata, not the underlying data.
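A minimal sketch of an external table (the path and columns are hypothetical); a DROP TABLE on it deletes only the metadata and leaves the HDFS files intact:

```sql
CREATE EXTERNAL TABLE access_log (
  uid STRING,
  url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/access_log';
```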
This article comes from the "Practice Test Truth" blog, http://slaytanic.blog.51cto.com/2057708/1295222.