12 tips for easy survival in Apache Hive
Learn to live with Apache Hive in 12 easy steps
Hive lets you run SQL on Hadoop, but optimizing SQL on a distributed system is a different game. Here are 12 tips to help you get on top of Hive.
Hive is not a relational database management system (RDBMS), but most of the time it acts like one: it has tables, runs SQL, and supports both JDBC and ODBC.
However, Hive does not execute SQL the way a relational database does. I have spent a lot of time with Hive, including more than 80 hours optimizing it at work alone, and the lessons still buzz around my head like a swarm of bees. To spare you that pain, I wrote them down so you can avoid the same suffering on your next Hive project.
1. Do not use MapReduce
Whether you go with Tez, Spark, or Impala, do not count on MapReduce: it is slower than any of them. If you are on the Hortonworks distribution, put `set hive.execution.engine=tez;` at the top of your script. On Cloudera, use Impala; where Impala does not apply, use `set hive.execution.engine=spark;` instead.
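As a quick sketch, you can check the current engine and switch it per session (shown for Tez; substitute `spark` where appropriate):

```sql
-- Print the current value of the setting
set hive.execution.engine;

-- Switch this session to Tez instead of MapReduce
set hive.execution.engine=tez;
```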
2. Do not match strings in SQL
Never, and especially not in Hive! If you insist on LIKE matching in a WHERE clause, you risk generating a cross-product warning. A query that should take seconds can take minutes once string matching is involved. The better approach is to use a tool built for search on Hadoop: try the Hive integration for Elasticsearch, Lucidworks' Solr, or Cloudera Search. RDBMSes are not great at this either, but Hive is worse.
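As a sketch of the anti-pattern (table and column names are hypothetical), this is the kind of join that triggers the cross-product behavior:

```sql
-- Anti-pattern: a non-equi join via LIKE pushes Hive toward a cross product,
-- comparing every row of events against every row of labels
select a.id, b.label
from events a
join labels b
  on a.message like concat('%', b.keyword, '%');
```

Offload this kind of substring search to a search engine and keep equi-joins in Hive.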
3. Do not join a table to a subquery
You are better off creating a temporary table and joining against that than letting Hive try to be clever with a subquery. Do not do this:
```sql
select a.*
from something a
inner join (
  select ... from somethingelse b
  union
  select ... from anotherthing c
) d
  on a.key1 = d.key1 and a.key2 = d.key2
where a.condition = 1;
```
It should be like this:
```sql
create table var_temp as
select ... from somethingelse b
union
select ... from anotherthing c;
```

and then:

```sql
select a.*
from something a
inner join var_temp b
  on a.key1 = b.key1 and a.key2 = b.key2
where a.condition = 1;
```
Generally, this is much faster than letting Hive handle the subquery itself.
4. Use Parquet or ORC, but do not convert blindly
In other words, use Parquet or ORC rather than TEXTFILE. However, when importing text data into a more structured format, do the conversion as a separate step into the target table. Do not use LOAD DATA to push a text file straight into an ORC table; load it into a table stored as text first.
If you are going to create another table and run most of your analysis against it, make that table ORC. Converting to ORC or Parquet takes time, so it is not worth putting in your ETL pipeline for data you will only touch once. For plain text you just want to import without any optimization, load it into a temporary table and use a CREATE TABLE ... AS SELECT to write it out as ORC or Parquet. This is a bit slow, though.
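A minimal sketch of that staging pattern, assuming a hypothetical comma-delimited file and table names:

```sql
-- 1. Load raw text into a plain text staging table
create table staging_events (id int, name string)
row format delimited fields terminated by ','
stored as textfile;

load data inpath '/data/events.csv' into table staging_events;

-- 2. Convert once into the ORC table you will actually query
create table events stored as orc as
select * from staging_events;
```

The one-time conversion cost is repaid on every subsequent query against the ORC table.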
5. Try vectorization on and off
Add `set hive.vectorized.execution.enabled = true;` and `set hive.vectorized.execution.reduce.enabled = true;` to your script, then try your queries with them both on and off. Vectorization in recent Hive versions is a bit problematic, so which setting wins can vary by query.
6. Do not use structs in table joins
I have to admit that the SQL in my brain is still stuck in the SQL-92 era, so I am not inclined to use structs anyway. But if you are doing something like a compound key in an ON clause, structs are very convenient. Unfortunately, Hive does not handle them well, especially in ON clauses. Of course, with small data sets it often works without complaint; on Tez, you will get an interesting vector error instead. This restriction does not appear in any documentation I know of. Maybe it is a good way to explore the internals of your execution engine.
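A sketch of the difference, with hypothetical tables t1 and t2; the struct comparison is the form that can misbehave or throw vector errors on Tez:

```sql
-- Risky: comparing structs in the join condition
select a.*
from t1 a
join t2 b
  on named_struct('k1', a.k1, 'k2', a.k2) = named_struct('k1', b.k1, 'k2', b.k2);

-- Safer: compare the key fields individually
select a.*
from t1 a
join t2 b
  on a.k1 = b.k1 and a.k2 = b.k2;
```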
7. Check the container size
You may need to increase your container sizes for Impala or Tez. If your nodes are large, the recommended container sizes may not suit your system. You may also need to make sure your YARN queues and overall YARN memory are sized appropriately. Be aware that the default queue setup rarely fits every regular workload.
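If you do need to raise the Tez container size, a session-level sketch looks like this (the values are hypothetical and should be tuned to your node memory):

```sql
-- Hypothetical sizes; tune to your nodes
set hive.tez.container.size=4096;    -- MB per Tez container
set hive.tez.java.opts=-Xmx3276m;    -- heap roughly 80% of the container size
```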
8. Enable statistics
Hive does some dumb things during table joins unless statistics are enabled. In Impala you can also use query hints.
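A sketch of gathering and using statistics for a hypothetical table mytable:

```sql
-- Gather table-level and column-level statistics
analyze table mytable compute statistics;
analyze table mytable compute statistics for columns;

-- Let the optimizer actually use them
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
```

With statistics in place, the optimizer can pick sensible join orders and join algorithms instead of guessing.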
9. MapJoin Optimization
If you profile your queries, you may find that recent versions of Hive are smart enough to apply this optimization automatically. But you may still need to tune it.
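A sketch of the knobs involved (the threshold value and table names are hypothetical):

```sql
-- Let Hive convert joins against small tables into map-side joins
set hive.auto.convert.join=true;
-- Size threshold (bytes) below which a table counts as small
set hive.auto.convert.join.noconditionaltask.size=100000000;

-- Or force it with a hint when you know small_dim fits in memory
select /*+ MAPJOIN(small_dim) */ f.*
from big_fact f
join small_dim s on f.k = s.k;
```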
10. If possible, put the largest table last
The heading says it all: in a multi-table join, list the largest table last. Hive buffers the earlier tables and streams the last one through the reducers, so putting the biggest table at the end reduces memory pressure.
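For instance, with hypothetical dimension and fact tables, write the join so the biggest table comes last:

```sql
-- Hive buffers small_dim and medium_dim and streams big_fact,
-- so the largest table belongs at the end of the join
select f.*
from small_dim s
join medium_dim m on s.k = m.k
join big_fact f on m.k = f.k;
```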
11. Partitioning will always help, a little or a lot
If you have a value that appears in many rows, such as a date (but not a date range) or a repeated location, you should probably partition on it. Partitioning basically means "split this out into its own directory," so instead of searching one big file, Hive looks only in the relevant piece. When you retrieve only `location='NC'`, a small slice of your data, Hive can search just that one partition. Moreover, unlike an ordinary column value, you can push data directly into a partition with LOAD DATA statements. Finally, remember that HDFS does not like small files.
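A sketch, using a hypothetical table partitioned by location:

```sql
create table readings (id int, value double)
partitioned by (location string)
stored as orc;

-- LOAD DATA can target a partition directly
load data inpath '/data/nc_readings'
into table readings partition (location='NC');

-- This scans only the location=NC directory, not the whole table
select * from readings where location = 'NC';
```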
12. Use hashes to compare columns
If you compare the same 10 fields in every query, consider using hash() and comparing the checksums instead. It can also be useful to surface them in an output table. Note that the hash function in Hive 0.12 is fairly poor; the one in Hive 0.13 is better.
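A sketch of the idea, with hypothetical tables and columns:

```sql
-- Instead of comparing c1..c3 field by field in every query,
-- compare one checksum per row
select a.id
from t1 a
join t2 b on a.id = b.id
where hash(a.c1, a.c2, a.c3) <> hash(b.c1, b.c2, b.c3);
```

Materializing the hash in an output table saves recomputing it on every comparison.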
Those are my 12 hard-won lessons. I hope they help you escape the buzzing of Hive.