12 tips for easy survival in Apache Hive

Source: Internet
Author: User

12 tips for easy survival in Apache Hive

Learn to live with Apache Hive in 12 easy steps

Hive allows you to use SQL on Hadoop, but optimizing SQL on a distributed system is different. Here are 12 tips that allow you to easily master Hive.

Hive is not a relational database service (RDBMS), but it most often acts like a relational database. It has tables, can run SQL, and supports JDBC and ODBC.

Hive does not execute SQL queries like relational databases. I spent a lot of time on Hive and spent more than 80 hours optimizing it at work alone. If you don't know it, you know, I just have my head buzzing like staying next to Hive. So, to protect you from this kind of pain, I decided to write them out so that you can escape this kind of suffering from your next Hive project.

 

1. Do not use MapReduce

Whether you think Tez, Spark, or Impala can work, don't count on MapReduce. It is slower than Hive. If you are using the Hortonwork version, you can write it in front of the script.set hive.execution.engine=tez; If Cloudera is used, Impala is used. If Impala is not applicable, I hope it can be used at that time.hive.execution.engine=spark.

 

2. do not match strings in SQL.

Never, especially in Hive! If you insist on using LIKE matching in the WHERE statement, a cross-product warning will be generated. Originally, your query may only take a few seconds, but it will take several minutes to use string matching. The best way is to use tools that can be searched in Hadoop. You can try the Hive integrated version of Elasticsearch or the Solr of Lucidwork, and Cloudera Search. RDS is not doing well in this aspect, but Hive is worse.

 

3. Do not use tables to connect to subqueries.

You 'd better create a temporary table and connect it, instead of letting Hive intelligently process subqueries. Do not do this:

  1. select a.*from something a inner join
  2. (select ...from somethingelse union b select ...from anotherthing c) d
  3. on a.key1 = d.key1 and a.key2 = b.key2 where a.condition=1

It should be like this:

  1. create var_temp as select ...from somethingelse b
  2. union select ...from anotherthing c
  3. andthen
  4. select a.*from something a inner joinfrom var_temp b
  5. where a.key1=b.key1 and a.key2=b.key2 where a.condition=1

Generally, this is much faster than Hive's own processing of subqueries.

4. Use Parquet or ORC, but do not convert and use

That is to say, Parquet or ORC is used instead of TEXTFILE. However, if you want to import text data to more structured data, you should convert the data and import it to the target table. You should not useLOAD DATATo load a text file to ORC, you should load it into a text file.

If you want to create another table and ultimately perform most analysis on it, you should perform ORC-Based on the table, because it takes a lot of time to convert to ORC or Parquet, it is not worth putting in your ETL processing. If you want to import a simple normal text without any optimization, you should load it into a temporary table and passselect createPut it in ORC or Parquet. However, this is a little slow.

 

5. Switch to vectoring.

Addset hive.vectorized.execution.enabled = trueAndset hive.vectorized.execution.reduce.enabled = trueAnd then try to open or close them. This is because the vectorization of Hive in the latest version is a bit problematic.

6. Do not use structs in table join

I have to admit that the SQL format in my brain is still in the SQL-92 era, so I don't want to use structs anyway. However, if you do some super complex operations, such as using the Union primary keyONStatement, so structs is very convenient. Unfortunately, Hive is not suitable for them, especially inONStatement. Of course, in most cases, there is no error in small datasets and yields. In Tez, you will get an interesting vector error. This restriction is not found in any of the documents I know. Maybe this is a good way to explore your execution engine.

 

7. Check the container size

You may need to increase the size of your container for Impala or Tez. If your node size is large, the recommended container size may not be suitable for your system. You may need to make sure that your YARN queue and the general YARN memory size are appropriate. You may be aware that the default queue is not suitable for all conventional use.

 

8. Enable statistics

Hive does some stupid things during table join unless statistics are enabled. You can also use the query prompt in Impala.

9. MapJoin Optimization

If you analyze your query, you may find that the latest Hive can be automatically optimized intelligently. But you may need to adjust it again.

 

10. If possible, put the large table to the end.

For example, title.

 

11. Partition will always help you, no matter how much

If you have something that appears in many places, such as a date in a statement (but not a date range) or a repeated location, you may need to partition it. Partitioning basically means "splitting to its own directory", rather than searching in a large file. When you retrieve onlylocation=’NC’In such a small data set, Hive can be searched in a file. In addition, unlike the column value, you canLOAD DATAStatement. In addition, remember that HDFS does not like small files.

 

12. Use hash to compare Columns

If you want to compare the same 10 fields in each query, consider usinghash()To compare their verification values. It may be useful to display them in an output table. Note: In Hive 0.12, the hash function is relatively poor, and the hash in Hive 0.13 is better.

The above is my 12-Point experience. I hope this will help you escape from the buzz of Hive.

Hive programming guide PDF (Chinese Version)

Hadoop cluster-based Hive Installation

Differences between Hive internal tables and external tables

Hadoop + Hive + Map + reduce cluster installation and deployment

Install in Hive local standalone Mode

WordCount word statistics for Hive Learning

Hive operating architecture and configuration and deployment

Hive details: click here
Hive: click here

This article permanently updates the link address:

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.