An explanation of Hive under Hadoop

Hive is the most widely used SQL-on-Hadoop tool. Recently, many big-data companies have introduced new SQL tools based on columnar storage or in-memory hot data, such as Impala, Tez, and Spark. Although many people have complaints about Hive, inefficiency, slow queries, and plenty of bugs, it is still the most widely used and ubiquitous SQL-on-Hadoop tool. According to an earlier Taobao survey report, 90% of Taobao's jobs run on Hive; at Storm Audio and Video the proportion is even higher, with more than 95% running on Hive. Hive is undeniably a groundbreaking tool that leaves plenty of room for imagination. Moreover, for developing data query programs on a production cluster, Hive is far more efficient than writing MapReduce jobs by hand.

By default, Hive's configuration parameters are rather conservative, so efficiency is mediocre. Modifying the configuration can give queries a fairly large efficiency boost. Below are several parameters that have a significant impact on query efficiency.

The first thing to change in Hive is its metadata store. By default, Hive stores metadata in the embedded Derby database (it is not clear why Hive did not use SQLite instead, since both are embedded databases). How to change the metadata store was covered in an earlier article, so it is not repeated here; see that article for the details.

Hive parameter optimization:

The most important ones are the first few and the last few, especially the last two, which give the most obvious performance improvement. However, they also launch more MapReduce tasks at the same time, so a balance has to be struck.

Nested SQL parallel execution optimization:

set hive.exec.parallel=true;

set hive.exec.parallel.thread.number=16;

Efficiency can be increased by at least 100%

For a job that requires 11 stages:

Non-parallel execution: 35 minutes

8 jobs in parallel: 10 minutes

16 jobs in parallel: 6 minutes
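
As a rough illustration of the kind of statement that benefits (the orders and clicks tables and the dt column are hypothetical), a query whose subqueries do not depend on each other can have its stages run as parallel jobs once the two settings above are in place:

SELECT t.src, t.cnt
FROM (
  SELECT 'orders' AS src, COUNT(1) AS cnt FROM orders WHERE dt = '20120625'
  UNION ALL
  SELECT 'clicks' AS src, COUNT(1) AS cnt FROM clicks WHERE dt = '20120625'
) t;  -- the two inner aggregations are independent, so their MapReduce jobs can run concurrently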

Hive Query Optimization:

First, when joining a table with a large amount of data to a table with a small amount of data, put the small table in front of (to the left of) the join.

The reason is that in the reduce phase of a join, the contents of the tables to the left of the join operator are loaded into memory, so putting the table with fewer rows on the left effectively reduces the chance of out-of-memory errors.
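
A minimal sketch, assuming a small dimension table dim_city and a large fact table fact_orders (both names are hypothetical): the small table is written on the left of the join so that it is the side buffered in memory during the reduce phase.

SELECT c.city_name, o.order_id, o.amount
FROM dim_city c        -- small table, placed in front of the join
JOIN fact_orders o     -- large table, streamed through the reducers
ON c.city_id = o.city_id;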

Second, join optimization

If a query contains more than one join and all of the joins use the same join key, all of the joins are merged into a single MapReduce job.

Examples:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1) is executed as a single MapReduce job

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2) is executed as two MapReduce jobs

The key condition for a map join is that one of the tables in the join operation contains only a small amount of data.

Example:

SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key

Third, replace COUNT(DISTINCT) with a GROUP BY plus SUM() aggregation.
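
As an illustrative sketch (the visits table and uid column are hypothetical): COUNT(DISTINCT) funnels every distinct value through a single reducer, whereas the GROUP BY form deduplicates across many reducers and the outer aggregate only totals the groups.

SELECT COUNT(DISTINCT uid) FROM visits;                      -- all distinct uids pass through one reducer

SELECT SUM(1) FROM (SELECT uid FROM visits GROUP BY uid) t;  -- GROUP BY deduplicates in parallel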

Fourth, sorting optimization

ORDER BY performs a global sort with a single reducer, which is inefficient.

SORT BY produces a partial ordering: the output of each individual reducer is sorted, which is efficient. It is typically used together with the DISTRIBUTE BY keyword (DISTRIBUTE BY specifies the key used to distribute rows from the map side to the reduce side).

CLUSTER BY col1 is equivalent to DISTRIBUTE BY col1 SORT BY col1.
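
A short sketch over a hypothetical logs table with columns uid and ts, contrasting the three forms:

SELECT * FROM logs ORDER BY ts;                        -- global sort, one reducer

SELECT * FROM logs DISTRIBUTE BY uid SORT BY uid, ts;  -- rows routed by uid, each reducer's output sorted

SELECT * FROM logs CLUSTER BY uid;                     -- shorthand for DISTRIBUTE BY uid SORT BY uid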

Fifth, merging small files

Too many files put pressure on HDFS and hurt processing efficiency; merging the map and reduce output files minimizes this impact.

hive.merge.mapfiles = true: merge the map output files; defaults to true.

hive.merge.mapredfiles = false: merge the reduce output files; defaults to false.

hive.merge.size.per.task = 256*1000*1000: the target size of the merged files.

These parameters are not listed in the table above because they can be set temporarily depending on the task; they do not have to be set globally. Sometimes setting them globally hurts the performance of jobs that operate on large files.
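
For example, the merge settings can be enabled only for the session that produces many small output files (the order_summary and fact_orders tables below are hypothetical):

set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
INSERT OVERWRITE TABLE order_summary
SELECT province, COUNT(1) FROM fact_orders GROUP BY province;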

Sixth, use partitioning, RCFile, LZO, ORCFile, and so on.

Each partition in Hive corresponds to a directory on HDFS. Partition columns are not actual fields of the table but one or more pseudo-columns; the values of the partition columns are not actually stored in the table's data files. The partition that comes first after the partition keyword is the primary partition (there can be only one), followed by the secondary partitions.

Static partitioning: with static partitions, the partition values must be specified in the SQL statement both when loading data and when querying.

Example: (stat_date='20120625', province='Hunan')

Dynamic partitioning: using dynamic partitioning requires the hive.exec.dynamic.partition parameter to be true (the default is false). By default, Hive assumes that the primary partition is static and the secondary partitions are dynamic; if you want the primary partition to be dynamic as well, you must set hive.exec.dynamic.partition.mode=nonstrict (the default is strict).

Example: (stat_date='20120625', province)
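
A sketch of loading with a static primary partition and a dynamic secondary partition (the sales and sales_staging tables are hypothetical); note that the dynamic partition column must come last in the SELECT list:

set hive.exec.dynamic.partition=true;
-- hive.exec.dynamic.partition.mode=nonstrict would also be needed if stat_date were dynamic
INSERT OVERWRITE TABLE sales PARTITION (stat_date='20120625', province)
SELECT order_id, amount, province FROM sales_staging;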

Seventh, use external tables and minimize the use of internal (managed) tables, mainly for reasons of data safety.
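
A minimal sketch of an external table, assuming the data files already sit under a hypothetical HDFS path; dropping the table removes only the metadata and leaves the files in place:

CREATE EXTERNAL TABLE page_view (
  uid STRING,
  url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/external/page_view';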

This article comes from the "Practice Test Truth" blog, http://slaytanic.blog.51cto.com/2057708/1295222.

"Edit Recommendation"

Data analysis ≠hadoop+nosql, 10 shortcuts to improve existing technology from the business opportunities brought about by the security risks, Hadoop is just as beautiful. How to guarantee Hadoop security about Hadoop in the great data age Twitter releases Storm-hadoop hybrid system " Responsible editor: Xiao Yun TEL: (010) 68476606 "

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.