An explanation of Hive under Hadoop


Hive is the most widely used SQL-on-Hadoop tool. Many big-data companies have recently introduced newer SQL engines such as Impala, Tez, and Spark, built around columnar storage or in-memory hot data, and many people criticize Hive as inefficient, slow to query, and full of bugs. Even so, Hive remains the most widely used and most ubiquitous SQL-on-Hadoop tool. According to an earlier survey report, about 90% of Taobao's jobs run on Hive; at Storm (the audio/video company) the proportion is even higher, with more than 95% running on Hive. Hive is undeniably a groundbreaking tool that leaves plenty of room for the imagination. Moreover, when developing data-query programs on a production cluster, development efficiency with Hive is much higher than writing your own MapReduce jobs.

By default, Hive's configuration parameters are fairly conservative, so out-of-the-box efficiency is mediocre; adjusting the configuration can improve query efficiency considerably. Here I record several of the parameters that have the biggest impact on query efficiency.

The first thing to change after getting Hive is its metadata store. By default Hive stores its metadata in the embedded Derby database. (I do not quite understand why Hive chose Derby rather than SQLite, since both are embedded databases.) Switching the metadata store was covered in an earlier article, so I will not repeat it here; see that article for the details.

Hive parameter optimization:

The more important ones are the first few and the last ones; the last two in particular give the most obvious performance improvement. They also open more MapReduce tasks at the same time, however, so a balance has to be struck.

Parallel execution of independent stages in nested SQL:

set hive.exec.parallel=true;

set hive.exec.parallel.thread.number=16;

Efficiency can be increased by at least 100%.

For a job with 11 stages:

Non-parallel: 35 minutes

8 jobs in parallel: 10 minutes

16 jobs in parallel: 6 minutes
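As a sketch of how this is typically used (the table and column names below are hypothetical, not from the original article), the settings can be applied per session before a query whose branches are independent of each other, such as a UNION ALL:

set hive.exec.parallel=true;              -- allow independent stages to run concurrently
set hive.exec.parallel.thread.number=16;  -- at most 16 stages at a time

-- The two branches of the UNION ALL do not depend on each other, so with
-- parallel execution enabled their MapReduce jobs can run at the same time.
SELECT src, COUNT(*) AS cnt
FROM (
  SELECT 'web' AS src, user_id FROM web_logs
  UNION ALL
  SELECT 'app' AS src, user_id FROM app_logs
) t
GROUP BY src;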

Hive Query Optimization:

1. When joining a table with a large amount of data and a table with a small amount of data, put the small table before the large table in the query.

The reason is that in the reduce phase of a common join, the contents of the tables on the left of the join operator are buffered in memory while the rightmost table is streamed; putting the table with fewer rows on the left therefore effectively reduces the chance of an out-of-memory error.
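A minimal sketch (the names dim_city, fact_orders, city_id and city_name are hypothetical):

-- Small dimension table first, large fact table last, so the reducers
-- only need to buffer the rows of dim_city in memory.
SELECT c.city_name, COUNT(*) AS order_cnt
FROM dim_city c
JOIN fact_orders o ON (c.city_id = o.city_id)
GROUP BY c.city_name;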

2. Join optimization

If a query contains several joins and all of them use the same join key, the joins are merged into a single MapReduce job.

Examples:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1) is executed as one MapReduce job, because both joins use b.key1.

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2) is executed as two MapReduce jobs, because the second join uses a different key (b.key2).

Map join: the key requirement is that one of the tables in the join holds a small amount of data, so the join can be done entirely on the map side.

Example:

SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key
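As a side note that is not from the original article and depends on the Hive version: newer Hive releases can also convert a common join into a map join automatically when one side is small enough, controlled by hive.auto.convert.join and hive.mapjoin.smalltable.filesize. A session-level sketch:

set hive.auto.convert.join=true;               -- let Hive turn eligible joins into map joins automatically
set hive.mapjoin.smalltable.filesize=25000000; -- size threshold (bytes) below which a table counts as small
SELECT a.key, a.value FROM a JOIN b ON a.key = b.key;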

3. Replace COUNT(DISTINCT ...) with a SUM() ... GROUP BY rewrite.
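A minimal sketch of the rewrite (the table logs and column user_id are hypothetical): COUNT(DISTINCT ...) without a GROUP BY forces the final aggregation through a single reducer, while the rewritten form deduplicates with a parallel GROUP BY first and then counts the groups.

-- Original form: the global distinct concentrates work on one reducer
SELECT COUNT(DISTINCT user_id) FROM logs;

-- Rewritten form: GROUP BY deduplicates in parallel, SUM(1) counts the groups
SELECT SUM(1) FROM (SELECT user_id FROM logs GROUP BY user_id) t;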

4. Sorting optimization

ORDER BY performs a global sort with a single reducer, which is inefficient.

SORT BY gives a partial ordering: the output of each individual reducer is sorted, which is efficient. It is typically used together with the DISTRIBUTE BY keyword (DISTRIBUTE BY specifies the key used to distribute rows from the map side to the reducers).

CLUSTER BY col1 is equivalent to DISTRIBUTE BY col1 SORT BY col1.
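A sketch contrasting the two (the table page_views and its columns are hypothetical):

-- Global sort: every row passes through a single reducer
SELECT user_id, view_time FROM page_views ORDER BY user_id, view_time;

-- Partial sort: rows with the same user_id go to the same reducer and each
-- reducer's output is sorted, so many reducers can work in parallel
SELECT user_id, view_time FROM page_views
DISTRIBUTE BY user_id
SORT BY user_id, view_time;

-- Shorthand when distributing and sorting on the same column:
SELECT user_id, view_time FROM page_views CLUSTER BY user_id;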

5. Merging small files

Too many files puts pressure on HDFS and hurts processing efficiency; merging the output files of the map and reduce stages minimizes this impact.

hive.merge.mapfiles = true -- merge the small output files of map-only jobs; default is true

hive.merge.mapredfiles = false -- merge the output files of MapReduce jobs; default is false

hive.merge.size.per.task = 256*1000*1000 -- the target size of the merged files

These parameters are not written into the table above because they can be set temporarily per task rather than necessarily globally; a global setting can sometimes hurt the performance of jobs that work with large files.
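A session-level sketch of turning the merge on for a single job (the target table daily_summary and source raw_events are hypothetical):

set hive.merge.mapfiles=true;            -- merge small files produced by map-only jobs
set hive.merge.mapredfiles=true;         -- also merge small files produced by MapReduce jobs
set hive.merge.size.per.task=256000000;  -- aim for merged files of roughly 256 MB
INSERT OVERWRITE TABLE daily_summary
SELECT stat_date, COUNT(*) FROM raw_events GROUP BY stat_date;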

6. Use partitioning, RCFile, LZO, ORCFile, and so on

Each partition in Hive corresponds to a directory on HDFS. The partition columns are not actual fields of the table but one or more pseudo columns; the values of the partition columns are not saved in the table's data files. The partition listed first after the partition keyword is the primary partition (there can be only one), and the ones after it are secondary partitions.

Static partitioning: the partition values must be specified explicitly in the SQL statement when loading and querying the data.

Example: (stat_date='20120625', province='Hunan')

Dynamic partitioning: using dynamic partitions requires the hive.exec.dynamic.partition parameter to be true (the default is false). By default Hive assumes the primary partition is static and the secondary partitions are dynamic; if you want all partitions to be dynamic, you also need to set hive.exec.dynamic.partition.mode=nonstrict (the default is strict).

Example: (stat_date='20120625', province)
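A sketch of a dynamic-partition insert under these settings (the tables user_stats and user_stats_staging and their columns are hypothetical):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- stat_date and province are partition pseudo columns of the target table;
-- their values are taken from the trailing columns of the SELECT list.
INSERT OVERWRITE TABLE user_stats PARTITION (stat_date, province)
SELECT user_id, pv, stat_date, province FROM user_stats_staging;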

7. Use external tables and minimize the use of internal (managed) tables, mainly for data safety: dropping an external table removes only the metadata, while dropping a managed table also deletes the underlying data.
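A minimal sketch of an external table (the table name, schema and HDFS location are hypothetical):

CREATE EXTERNAL TABLE web_logs (
  user_id STRING,
  url     STRING,
  ts      BIGINT
)
PARTITIONED BY (stat_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/warehouse/web_logs';

-- DROP TABLE web_logs removes only the table definition;
-- the files under /data/warehouse/web_logs remain on HDFS.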

This article comes from the "Practice Test Truth" blog, http://slaytanic.blog.51cto.com/2057708/1295222.

"Edit Recommendation"

Data analysis ≠hadoop+nosql, 10 shortcuts to improve existing technology from the business opportunities brought about by the security risks, Hadoop is just as beautiful. How to guarantee Hadoop security about Hadoop in the great data age Twitter releases Storm-hadoop hybrid system " Responsible editor: Xiao Yun TEL: (010) 68476606 "

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.