Hive Usage Summary and Optimization

Main features of each Hive version

The key new features of each Hive version are introduced on the downloads page of the official website.


Hive Basics
Command-line interface

The user interfaces provided by Hive include the CLI, clients, and the WebUI. We mainly use the CLI; after future cluster upgrades there may be a visual interface that lets us access Hive directly through the WebUI.

Earlier versions of Hive mainly used the old Hive CLI, which was later superseded by the new Beeline CLI, the one we currently use. In CLI mode, our commonly used commands include the following:

set

Used to view and set Hive configuration parameters, such as controlling the number of reducers, the compression format, and so on.

add jars

Adds extra jar packages to the Hive runtime environment; for example, custom UDF/UDAF jars can be added this way.

hive> !ls; runs a shell command directly inside Hive, and the dfs command does the same for HDFS operations: for example, hive> dfs -ls /src/gamein/g4_sdc/ saves you from opening an extra session just to run hadoop fs -ls.
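A minimal sketch of these CLI commands put together (the parameter values and the jar path are placeholders, not our actual settings):

hive> set mapred.reduce.tasks=32;              -- control the number of reducers
hive> set hive.exec.compress.output=true;      -- turn on output compression
hive> add jar /path/to/our_udfs.jar;           -- register a custom UDF/UDAF jar (hypothetical path)
hive> !date;                                   -- run a shell command without leaving the CLI
hive> dfs -ls /src/gamein/g4_sdc/;             -- run a hadoop fs command in the same session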

Hive SQL

Hive SQL syntax is similar to the MySQL syntax we already use; see the official "DDL" and "DML" pages for the supported operators, strings, numbers, dates, and so on. For the built-in UDFs, UDAFs, and UDTFs, refer to the official "UDF wiki". We use UDTFs less often, but they can be very useful; for instance, lateral view combined with explode can be used to expand nested data (see "a lateral view example" and the short sketch at the end of this subsection). Besides the UDF wiki, you can also inspect Hive functions from the command line:

hive> show functions; -- show all functions
hive> describe function substr; -- view how to use the function [substr]
OK
substr(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len, or substr(bin, pos[, len]) - Returns the slice of byte array that starts at pos and is of length len
Time taken: 1.368 seconds, Fetched: 1 row(s)
This feature is not as powerful as MySQL's "?" syntax; for example, to view the usage of LOAD DATA on the MySQL command line: mysql> ? load data;
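As a sketch of the UDTF usage mentioned above (the table and column names are made up for illustration), a lateral view with explode looks roughly like this:

SELECT page_id, tag
FROM page_views                            -- hypothetical table with tag_list ARRAY<STRING>
LATERAL VIEW explode(tag_list) t AS tag;   -- one output row per element of tag_list
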
Hive SerDe

We will use Hive SerDes more and more often. A Hive SerDe handles the serialization and deserialization of data IO; its main job is to parse HDFS files and recognize their format. Some of the early hive-serde packages have been deprecated; for the latest documentation refer to the official "SerDe wiki". The built-in SerDes supported by Hive are listed below, and all of them are available after we upgrade to Hive 1.0: Avro (Hive 0.9.1 and later), ORC (Hive 0.11 and later), RegEx, Thrift, Parquet (Hive 0.13 and later), CSV (Hive 0.14 and later), JsonSerDe (Hive 0.12 and later, in hcatalog-core).

Many teams process data directly with Hive SQL, and the core of that is custom SerDes; the colleagues on the US side use them more than we do. The most typical scenario in which we use a Hive SerDe is to quickly query and filter today's HDFS logs, using RegEx (regular expressions) to compute certain statistics (such as counts and recharge amounts) by log type. This is much faster than using hadoop fs -text directly; for specific usage, refer to the internal ETL wiki "HDFS data query method".
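A hedged sketch of that pattern (the table name, HDFS path, and regular expression below are illustrative only): define an external table over the raw logs with RegexSerDe, then run ordinary SQL on it.

CREATE EXTERNAL TABLE IF NOT EXISTS raw_log_today (   -- hypothetical table
  log_time STRING,
  log_type STRING,
  detail   STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "^(\\S+)\\t(\\S+)\\t(.*)$")   -- adjust to the real log layout
STORED AS TEXTFILE
LOCATION '/src/gamein/g4_sdc/2016-01-01';             -- hypothetical HDFS path

SELECT log_type, count(*) FROM raw_log_today GROUP BY log_type;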

When it comes to parsing file formats, you need to be aware of Hive's default delimiter settings for text logs, which are as follows. Field separator: '\x01'; after exporting to plain text you can view it with cat -A, and it shows up in the Linux shell as ^A. Map item separator: '\x02', i.e. the separator between one kv item of a dictionary and the next (kv '\x02' kv). Map key-value separator: '\x03', i.e. the separator between a key and its value (k '\x03' v).

Mastering these details helps us avoid mistakes when exporting Hive data to plain text.
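For reference, here is a hedged sketch that spells out those defaults explicitly when creating a text table (the table itself is hypothetical); omitting the ROW FORMAT clause gives the same result:

CREATE TABLE delim_demo (
  id    STRING,
  props MAP<STRING, STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'            -- ^A, the field separator
  COLLECTION ITEMS TERMINATED BY '\002'  -- separator between map items
  MAP KEYS TERMINATED BY '\003'          -- separator between a key and its value
STORED AS TEXTFILE;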


Transform & UDF & UDAF
Transform

If you want to embed a custom MapReduce script in Hive, you can use transform to accomplish some complex operations, including some of what UDFs and UDAFs can do. Our BI team uses it less; the colleagues on the US side used it a bit more. Transform's execution efficiency is actually not high, but you can write a feature quickly in Python, which is much faster than writing UDF/UDAF extension classes in Java and then packaging and publishing them. Below is a hive + python example. With the transform feature, it is easy to do things like grouping data, building ordered dictionaries, and other complex log-splitting work.
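A hedged sketch of that hive + python pattern (the script name, columns, and table are invented; the script would read tab-separated lines from stdin and write tab-separated lines to stdout):

ADD FILE /path/to/split_log.py;        -- hypothetical python script

SELECT TRANSFORM (uid, raw_line)
       USING 'python split_log.py'
       AS (uid, event, ts)
FROM raw_log_today;                    -- hypothetical source table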

"Reference manual-transform" UDF & UDAF

Custom functions (UDF) and custom aggregate functions (UDAF) depend on some jar packages and must inherit the appropriate interfaces. You can search the web for concrete examples; the process is very similar to writing MapReduce code. UDFs and UDAFs execute more efficiently than transform. Our etl_hadoop_lib project has some examples, such as the MaxRow method: MaxRow(compare_col, col1, col2, col3, ...) returns the row with the maximum value of the specified compare column. The result contains the values of compare_col, col1, col2, ..., returned as a struct, and the individual fields are then read out of that struct.
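If the MaxRow UDAF is registered, using it might look roughly like this (the jar path and class name below are assumptions, not the real ones in etl_hadoop_lib):

ADD JAR /path/to/etl_hadoop_lib.jar;                             -- hypothetical path
CREATE TEMPORARY FUNCTION maxrow AS 'com.example.udaf.MaxRow';   -- hypothetical class name

SELECT uid, maxrow(login_time, ip, device) AS last_row           -- row with the largest login_time per uid
FROM login_log                                                   -- hypothetical table
GROUP BY uid;
-- individual fields are then read out of the returned struct by name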

A comparison between the transform approach and UDF/UDAF can be found on the right side of the figure above.

Hive Optimization
Understanding Hive

Hive's detailed design framework is shown below.

After a period of development, Hive has now formed the following layers:

The top layer is SQL parsing and execution planning; the middle layer is the computation framework, which has evolved from the initial MapReduce to also support DAG-based engines such as Tez and Spark; the bottom layer is data storage, which has extended from the initial TextFile/SequenceFile to faster and better formats such as ORC File.
Hive on MapReduce

In Hive, MapReduce is the main execution engine, and Tez is an optimization of the MR execution process. Therefore, to optimize Hive's execution efficiency, the core is to optimize the MapReduce process in Hive's actual execution environment. The detailed MapReduce process is shown in the following figure (see "Hadoop: The Definitive Guide"). Keeping this picture in mind helps with many things: for example, data skew makes a single reducer too busy, which shows up as a job that gets stuck after reduce reaches 98% or 99%; to spread the data more evenly, you need to hash the keys into more partitions. Understanding each stage of MapReduce also helps us understand the direction of Hive's optimization work: Tez turns the traditional reduce -> local file -> copy -> map -> reduce chain into reduce -> reduce; for Hive to read and write HDFS with better compression and decompression efficiency, a better file format such as ORC is needed; and Spark simply avoids writing intermediate data to disk, keeping it in memory instead.
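As a hedged example of the knobs typically involved when a single reducer gets stuck (the values below are illustrative, not recommendations):

hive> set hive.exec.reducers.bytes.per.reducer=256000000;  -- smaller value means more reducers
hive> set mapred.reduce.tasks=200;                          -- or fix the reducer count directly
hive> set hive.groupby.skewindata=true;                     -- two-stage aggregation for skewed GROUP BY keys
hive> set hive.optimize.skewjoin=true;                      -- handle skewed join keys in a follow-up job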

Hive JOIN

Understanding Hive's several join approaches helps us better understand most data-association problems; interested readers can try to implement Hive-join-like functionality with raw MapReduce. It is said that interviewers especially like to use Hive joins to probe a candidate's understanding of MR.

Reduce-side join (Common join)

Note the execution order: rows with the same key are brought together on the reduce side of the MapReduce job, where the join is performed.

Map Join

One of the tables (the small one) is loaded into memory. There is a limit on how much data can be loaded, and occasionally this is what triggers errors; a later Hive version extends the memory limit by implementing a Bloom filter.

Bucket Map Join
Left-semi Join

For the specific processes of Hive joins, refer to the "Official LanguageManual JoinOptimization", "Join Details", and "Detailed usage of Hive join". Incidentally, for sorting, Hive has sort by / order by / distribute by / cluster by; for the differences, see "manual-sortby".
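A hedged sketch contrasting a map join hint with the sort variants (the table and column names are invented):

set hive.auto.convert.join=true;                -- let Hive convert to a map join automatically
SELECT /*+ MAPJOIN(d) */ f.uid, d.channel_name  -- or hint the small table explicitly
FROM fact_login f JOIN dim_channel d ON f.channel_id = d.channel_id;

-- ORDER BY: one reducer, global order; SORT BY: order within each reducer;
-- DISTRIBUTE BY: controls which reducer a row goes to; CLUSTER BY = DISTRIBUTE BY + SORT BY on the same columns
SELECT uid, login_time FROM fact_login DISTRIBUTE BY uid SORT BY uid, login_time;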

For Hive SQL optimization you can refer to the article "Data Warehouse SQL Performance Optimization Made Simple: the Hive Chapter", which is quite good and covers both the principles and detailed Hive parameter settings. The core Hive parameter settings are all covered in the articles above, so this article will not explain specific parameters again.


file format and compression

Hive data is stored on HDFS, and by default three replicas are kept. Different compression methods lead to very different disk requirements. In addition, the execution efficiency of Hive statements is closely related to the file format. In general, if the compressed size is especially small, decompression efficiency tends to be poor, so ultimately you need to strike a balance between compression ratio and compression/decompression efficiency. The figure below uses circle areas to give an intuitive sense of the relative sizes produced by combining different file formats (TextFile, SequenceFile, RCFile, Avro, Parquet, ORCFile) with different compression algorithms (gz, snappy, bzip2, LZO).


TextFile is the only format we can read with the naked eye without decompressing. However, once a TextFile is compressed it is no longer splittable, meaning that during MapReduce execution it can only be treated as one large block, which greatly hurts efficiency. From a performance point of view, future log formats will more often be ORC (the successor to RCFile) or Parquet (the open-source implementation of Google's Dremel, which legend says can query 1 PB of data in 3 seconds). Interested readers can study both of these storage formats carefully.

Our current intermediate results are stored as SequenceFile; in MapReduce scripts the chosen compression is gz (high compression ratio, slow decompression), while Hive's default is snappy. Future adjustments may move to ORC plus the Tez engine to further improve Hive execution speed.
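A hedged sketch of what an ORC table with explicit compression would look like, together with the kind of settings that control compressed output from ordinary queries (the table name and codec choice are illustrative):

CREATE TABLE login_log_orc (
  uid        STRING,
  login_time STRING,
  amount     DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");   -- ZLIB / SNAPPY / NONE

set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;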

Evolution of the Hive execution engine: Hive/Tez

Tez is an open-source Apache computation framework that supports DAG jobs and derives directly from the MapReduce framework. Chinese readers can see Dong Xicheng's blog post "Apache Tez: a computation framework running on YARN that supports DAG jobs". The following diagram uses a SQL statement as an example to describe the main differences between Hive/MR and Hive/Tez.

The current version of Hive already supports the Tez engine; a single Hive command is enough to switch engines.

hive> set hive.execution.engine=tez;

In addition, Hive also supports the Spark engine. This article does not introduce Spark further, but Hive-on-Spark performance is not very ideal; it is less efficient than using Spark SQL directly. We have already converted some of our SQL to Spark SQL, and the main problems we ran into with Spark SQL are: 1. It is very memory-hungry, and different jobs need different memory parameter configurations. 2. With large data volumes it easily runs out of memory, and even when it does not, execution efficiency is not ideal; for example, when computing silent (inactive) users across all clients, it is slow enough to make you want to cry.

Our Hive tasks already use Tez as the default execution engine, but we have not yet tested it together with ORCFile. For small jobs you can use Spark SQL directly; for jobs with particularly large data volumes, the better choice at present is the Tez + ORCFile combination. So there is no need to change unless Spark pulls out something much bigger in the future. Below is an excerpt from a performance comparison of Hive/Tez, Hive/Spark, and Spark SQL published by Hortonworks, the main developer and optimizer of Tez; the rough conclusion is that Spark is impressive, but Hive on Tez is no pushover (note how Hive on Tez claims a 100x improvement). Spark SQL, Cloudera Impala, and Hortonworks Tez all claim to be the fastest, so we need to compare them ourselves in our actual environment, based on our own machines and data.


NEXT

The past, present, and future of Hive are all in the picture below. If you run into unfamiliar terms, please Google them (or even use Baidu), and don't ask the SA, or they will look down on you. Sorry, this has gotten a bit long and off-topic.

Perhaps before too long, Hive will truly enter the sub-second era. By then it may no longer be called Hive; it might be Spark SQL or something else. To encourage everyone to keep using Hive, here is one last big picture.

Other similar tools
Impala

Open-sourced by Cloudera, Impala is an 'interactive' SQL query engine that can query PB-scale data stored in Hadoop's HDFS and HBase, can run on YARN, and is relatively convenient to deploy; the team has previously published related research material. A benchmark released on Cloudera's official blog in April 2016 claimed that Impala's real-time SQL response can beat Spark SQL and Hive/Tez by seconds; see "New SQL benchmarks: Apache Impala (incubating) uniquely delivers analytic database performance". I will not include the chart, lest nobody wants to use Hive afterwards. Impala does not use the Hive + MapReduce framework; it was inspired by Google's Dremel and uses the MPP parallel-database approach. (Sure enough, copying Google is the right move.)

Drill

The official website says it can "query any non-relational datastore". It does not seem very popular domestically, which confirms the saying: a tool that claims to do anything often excels at nothing.

Presto

An open-source 'interactive' SQL query engine built by Facebook, supporting interactive queries over petabytes of data. It does not use the MapReduce framework. Airbnb and Dropbox use it, and apparently Meituan does as well.

Shark -> Spark SQL

Shark is essentially dead; it belonged to the Spark project but has gradually been replaced by Spark SQL. The domestic Spark community is extremely hot, and everyone thinks this is the future, so Spark SQL will probably be used more and more. Others: BigSQL (a distributed SQL query engine provided by IBM; not open source, little public information, and seemingly little interest), Apache Phoenix (also a SQL engine, mainly for querying HBase, so not covered here), HAWQ (provided by EMC, not open source).


Personally, I feel that focusing on Spark SQL and Impala is enough.


Interaction with other databases
Interaction with HBase

We can use Hive to manipulate HBase. This includes reading HBase data through Hive SQL and updating HBase by updating Hive tables, provided that a mapping between the Hive table and the HBase table is established first. There are usually two approaches: 1. Map an entire HBase column family to a Hive MAP column, where the map contains all of the family's column information. 2. Map each HBase column of each column family to its own Hive column. Each has its merits; choose according to the actual situation. For how to integrate Hive and HBase, refer to "Hive-HBase Integration".
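A hedged sketch showing both mapping styles in one table definition (all of the names below are invented): one column family mapped to a Hive MAP, plus one single column mapped to a plain Hive column.

CREATE EXTERNAL TABLE hive_on_hbase (
  rowkey  STRING,
  profile MAP<STRING, STRING>,   -- whole column family cf1 as a map
  score   STRING                 -- single column cf2:score
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:,cf2:score")
TBLPROPERTIES ("hbase.table.name" = "user_profile");   -- hypothetical HBase table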

In addition, Hive can also help generate HFiles, with certain restrictions, such as allowing only a single column family. Once the HFiles are generated, bulk load is currently the fastest way to write a large amount of data into HBase.

Interaction with MySQL

The most widely used tool here is Sqoop, which runs as a MapReduce job under the hood. "Apache Sqoop (TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases." Sqoop is not limited to Hive; it can also be used with HBase and so on. The latest stable version is 1.4.6; for detailed parameter explanations and usage, refer to the official "Sqoop user guide".

The pitfalls we usually run into mainly concern support for HDFS file formats other than TextFile, such as SequenceFile, and how to handle NULL data.
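A hedged sketch of a typical import (the connection string, table names, and delimiter are placeholders):

sqoop import \
  --connect jdbc:mysql://db_host:3306/game_db \   # hypothetical MySQL instance
  --username etl --password '***' \
  --table user_info \
  --hive-import --hive-table dw.user_info \       # load straight into a Hive table
  --null-string '\\N' --null-non-string '\\N' \   # map SQL NULLs to Hive's \N
  --fields-terminated-by '\001' \
  -m 4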


Description

1. All of the pictures in this article are copied; they were clipped and collected earlier and I no longer remember the sources. Probable sources: "Hadoop: The Definitive Guide", "Hadoop in Action", "Hadoop Technology Insider", Dong Xicheng's blog, the Hortonworks and Cloudera blogs, some Hadoop Summit slides, and Google/Baidu image search.

2. Explanation of the SA's Hive startup command

hive --config /home/hadoop/hiveconf/metastore/sy_sdc/ \
  --auxpath /home/hadoop/hbase/lib \  # to access HBase inside Hive, the HBase packages must be added
  -hiveconf hive.new.job.grouping.set.cardinality=32 \  # upper limit on cube dimensions; with N dimensions it must be set > 2^N
  -hiveconf hive.exec.scratchdir=/tmp/hive-${whoami}  # the scheduler account and personal accounts need different scratch directories

3. Exporting SELECT results

The common approach is INSERT OVERWRITE LOCAL DIRECTORY, invoked via Python's os.popen() method, which also lets us read additional Hive execution information.
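A hedged example of the export statement itself (the directory, delimiter, and query are placeholders; the ROW FORMAT clause on this statement requires Hive 0.11 or later, otherwise the default ^A delimiter is used):

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive_export/login_stat'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT log_date, count(*) FROM login_log GROUP BY log_date;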
