Hive Basics and Optimization (Interview Essentials)

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides a simple SQL-like query language (HQL) whose statements are translated into MapReduce jobs.

Metastore (Hive metadata)
Hive stores its metadata in a relational database such as MySQL or Derby. The metadata includes table names, the tables' columns and partitions with their attributes, and the HDFS directory where each table's data lives.

Hive data is stored in HDFS, and most queries and computations are executed as MapReduce jobs. Differences between the Hive data warehouse and a traditional database:

(1) Query language. Because Hive uses the SQL-like query language HQL, it is easy to mistake Hive for a database. In fact, apart from the similar query language, Hive and databases have little in common.
(2) Data storage location. Hive stores its data in HDFS; databases store data in the local file system or on raw devices.
(3) Data format. Hive lets the user define the data format (e.g. field delimiters); databases use their own storage formats.
(4) Data updates. Hive is read-mostly: rewriting and appending existing rows is not supported in classic Hive, so all data is fixed at load time. INSERT INTO ... VALUES to add rows and UPDATE ... SET to modify rows are not supported, which matches HDFS's write-once, read-many model.
(5) Execution. Hive executes queries through MapReduce, while databases usually have their own execution engines.
(6) Execution latency. Hive has no indexes and must scan whole tables, so latency is high; the overhead of the MapReduce framework also adds to the latency.
(7) Scalability. Hive scales out with the underlying Hadoop cluster; databases scale far less.
(8) Data scale. Hive handles very large data sets; databases handle comparatively small ones.
Hive's basic table types: internal (managed) table, external table, partition table, bucket table.

The difference between an internal (managed) table and an external table:

Creating a table
When an external table is created, the data is not moved into the warehouse directory (/user/hive/warehouse); only the path where the table's data is stored is recorded.
An internal table copies or moves the data into the table's own directory.

Dropping a table
Dropping an external table deletes only the table's metadata; the table data is left untouched.
Dropping an internal table deletes both the metadata and the table data.

Table type one: managed (internal) table (Table Type: MANAGED_TABLE)

create table if not exists dept (
  deptno int,
  deptname string,
  address string
)
row format delimited fields terminated by '\t';

Load an HDFS file into the Hive table:
load data inpath '/input/dept.txt' into table dept;

Use row format to specify the column delimiter of the source file:
row format delimited fields terminated by '\t';
When the file being loaded is on HDFS, load moves (cuts) the file into the table directory.

desc formatted dept;   -- describe the table structure
Location:       hdfs://bigdata/user/hive/warehouse/db01.db/dept
Table Type:     MANAGED_TABLE
Table type two: external table
create external table emp (
  empno int,
  empname string,
  empjob string,
  mgno int,
  birthday string,
  salary float,
  bonus float,
  depno int
)
row format delimited fields terminated by '\t'
location '/input/demo';

Describe the table structure:
desc formatted emp;
Location:       hdfs://bigdata/input/demo
Table Type:     EXTERNAL_TABLE

Drop the internal table:
drop table dept;

Drop the external table:
drop table emp;

Clear table data:
truncate table student;

Table type three: partition table

When creating a partition table you must specify the partition field. The difference between a partition field and a normal field: a partition field creates a sub-directory named after the partition value under the table's HDFS directory, while a normal field does not. In queries a partition field can be used like a normal field; it generally carries no direct business meaning.

create table emp_part (
  empno int,
  empname string,
  empjob string,
  mgrno int,
  birthday string,
  salary float,
  bonus float,
  deptno int
)
partitioned by (province string)
row format delimited fields terminated by '\t';

Load data into the partition table:
load data local inpath '/home/user01/emp.txt' into table emp_part partition (province='CHICAGO');

Describe the table:
desc formatted emp_part;

Query the full table:
select * from emp_part;

Query by the partition field:
select * from emp_part where province='CHICAGO';

View partition information:
show partitions emp_part;

Add a partition:
alter table emp_part add if not exists partition (province='Zhejiang');

Drop a partition:
alter table emp_part drop if exists partition (province='Zhejiang');
External partition table:
create external table dept_part (
  deptno int,
  deptname string,
  address string
)
partitioned by (province string)
row format delimited fields terminated by '\t'
location '/input/demo';

Manually add a partition and point it at an external directory:
alter table dept_part add partition (province='BOSTON') location '/input/demo/boston';

Manually add a partition (the partition directory is generated automatically):
alter table dept_part add partition (province='NEW YORK');

Table type four: bucket table

Internal tables, external tables, and partition tables can be further organized into bucket tables:
a table's data is decomposed into different files by hashing on a chosen column.

create table test_bucket_table (
  id int,
  name string
)
clustered by (id) into 5 buckets;
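As a minimal sketch of how rows typically get into such a bucketed table (the staging table test_tmp is hypothetical): bucketing only takes effect when rows are written through an insert ... select with bucketing enforced, not through a plain load.

-- enable bucketing enforcement (older Hive versions; newer versions bucket automatically)
set hive.enforce.bucketing=true;
-- assume a plain staging table test_tmp(id int, name string) already holds the raw rows
insert overwrite table test_bucket_table
select id, name from test_tmp;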
How to create a table — mode one: create + load
    create [external] table table_name (
      col1_name col1_type,
      ...
      coln_name coln_type
    )
    row format delimited fields terminated by '\t';

    Load the data:
    load data [local] inpath 'local file (Linux) / HDFS path' [overwrite] into table table_name;
Mode two: like + load
    Duplicate an existing table's structure:
    create table tableB like tableA;    -- tableA must already exist

    Load the data:
    load data [local] inpath 'local file (Linux) / HDFS path' [overwrite] into table table_name;
Mode three: as select — create the table and load the data at the same time
create table tableB row format delimited fields terminated by ',' as select * from tableA;   -- tableA must already exist

create table emp_as row format delimited fields terminated by ',' as select empno, empname, salary from emp_part1;
Mode four: create + insert
Create the table:
create table emp_insert (
  id int,
  name string,
  job string,
  salary float
)
row format delimited fields terminated by ',';

Load data with insert:
insert into table emp_insert select empno, empname, empjob, salary from emp_part1 where day='20170308' and hour='14';
How data is loaded
Load mode one
Load a local file into a Hive table (e.g. data delivered on a storage medium such as a removable disk):
load data local inpath 'local file (Linux)' [overwrite] into table table_name;
Load mode two
Load an HDFS file into a Hive table (e.g. data collected into HDFS by a log-collection framework such as Flume):
load data inpath 'HDFS file' [overwrite] into table table_name;
Load mode three
Load data and overwrite the existing data:
load data [local] inpath 'file' overwrite into table table_name;

You can also hdfs dfs -put a local directory or file into the table's directory on HDFS.
desc formatted table_name;   -- find the table's directory on HDFS
Load mode four
Create the table with a select query and load the data at the same time (create-as-select):
create table tableB row format delimited fields terminated by ',' as select * from tableA;
Load mode five
First create the table, then: insert into table table_nameA select * from tableB;
Load mode six
Create an external table and specify the data directory with location:
create external table table_name (
  col1_name col1_type,
  ...
  coln_name coln_type
)
row format delimited fields terminated by '\t'
location 'directory on HDFS';
Several ways to export data
    1. insert overwrite ... export to a local directory:
    insert overwrite local directory '/home/user01/export' row format delimited fields terminated by '\t' select * from emp_part1;

    2. insert overwrite ... export to HDFS:
    insert overwrite directory '/export' select * from emp_part1 where day='20170308';

    3. hive -e 'HQL query' >> file:
    bin/hive -e 'select * from db01.student' >> test.txt

    4. Sqoop
Hive custom functions

UDF: one row in, one value out — e.g. cleaning raw-file fields that contain [] or "".
UDAF: many rows in, one value out — sum(), avg(), max(), min().
UDTF: one row in, many rows out — e.g. mapping an IP address to country/province/city.
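For illustration only: the built-in UDTF explode, used with lateral view, shows the one-in/many-out behaviour. The table page_views and its array column tags are hypothetical.

-- each row's tags array is expanded into one output row per element
select pv.page_id, t.tag
from page_views pv
lateral view explode(pv.tags) t as tag;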

Hive's four sorting clauses

Order by    -- DESC (descending) or ASC (ascending) can be specified
Order by performs a global sort over all input, so there is only one reducer (multiple reducers cannot guarantee a global order); with large inputs, that single reducer makes the query take a long time.

Sort by    -- sorts the data within each reducer
Sort by is not a global sort; it sorts data before it enters each reducer. If you use sort by and set mapred.reduce.tasks > 1, sort by only guarantees that each reducer's output is ordered, not that the overall result is ordered. Unlike order by, sort by is not affected by the hive.mapred.mode property; it only guarantees that rows inside the same reducer are ordered by the specified field. With sort by you can set the number of reducers (set mapred.reduce.tasks=n) and then merge-sort the output files to obtain the complete ordered result.

Distribute by    -- partitions the map output
Distribute by controls how rows are split on the map side and sent to reducers. Hive distributes rows to the configured number of reducers according to the columns after distribute by, using a hash by default. Sort by then produces one sorted file per reducer. In some cases you need to control which reducer a particular row goes to, usually for a subsequent aggregation; distribute by does exactly that, so it is often used together with sort by.
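A small illustrative query (the table emp_part and its columns come from the examples above; the reducer count is an arbitrary choice): each reducer receives whole departments, and rows inside a reducer are sorted by salary.

set mapred.reduce.tasks=3;
select deptno, empname, salary
from emp_part
distribute by deptno
sort by salary desc;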

Cluster by
Cluster by combines the functionality of distribute by with that of sort by. When distribute by and sort by use the same field, cluster by can replace them. However, the sort order cannot be chosen: you cannot specify ASC or DESC with cluster by.

Difference between the three ranking functions
row_number: the row number always increments, even when the values of the col2 field are equal; e.g. two records with the same value still get ranks one and two.
rank: when two adjacent records have equal col2, they get the same rank, and the next distinct col2 value jumps ahead by n (n = number of ties); e.g. two records tie for first, the next is third, and there is no second.
dense_rank: when two records have equal col2, they get the same rank and the next distinct col2 value increments by 1; e.g. two records tie for first and the next is second.
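An illustrative comparison of the three functions (the table emp_part and the salary column come from the examples above; ties on salary within a department show the difference):

select deptno, empname, salary,
       row_number() over (partition by deptno order by salary desc) as rn,
       rank()       over (partition by deptno order by salary desc) as rk,
       dense_rank() over (partition by deptno order by salary desc) as drk
from emp_part;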

Hive optimization

1. Fetch task: a fetch task does not go through MapReduce. The fetch-task conversion level (minimal / more) can be set in the Hive configuration; it is usually adjusted to more when using HiveServer2.

Parameter priority: parameters set on the command line or in code > hive-site.xml > hive-default.xml.
 set hive.fetch.task.conversion=more;    -- set in an interactive session
 bin/hive --hiveconf hive.fetch.task.conversion=more    -- set when launching the CLI

Both methods enable the fetch task, but only temporarily. To enable it permanently, add the following to ${HIVE_HOME}/conf/hive-site.xml:
<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Some select queries can be converted to a single FETCH task,
    minimizing latency. Currently the query should be single sourced,
    not have any subquery, and should not have any aggregations or
    distincts (which incur RS), lateral views or joins.
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (+TABLESAMPLE, virtual columns)
  </description>
</property>
2. Strict mode: restricts some risky query operations (file formats such as ORC and Parquet, partition tables; a select without a where filter on the partition column will not execute).

Enabling strict mode: Hive's strict mode forbids three kinds of queries.
A: On a partitioned table, a query is not allowed unless the where clause contains a filter on the partition field.
B: A query that uses order by must also use limit, because order by produces only one reduce task.
C: Cartesian-product queries are restricted: a join without a where/on join condition will not execute.

<property>
  <name>hive.mapred.mode</name>
  <value>nonstrict</value>
  <description>The mode in which the Hive operations are being performed.
     In strict mode, some risky queries are not allowed to run. They include:
       Cartesian Product.
       No partition being picked up for a query.
       Comparing bigints and strings.
       Comparing bigints and doubles.
       Orderby without limit.
  </description>
</property>
3. Optimize the SQL statement itself, e.g. filter before joining, and group before doing a distinct:
select count(*) cnt
from store_sales ss
     join household_demographics hd on (ss.ss_hdemo_sk = hd.hd_demo_sk)
     join time_dim t on (ss.ss_sold_time_sk = t.t_time_sk)
     join store s on (s.s_store_sk = ss.ss_store_sk)
where
     t.t_hour = 8
     and t.t_minute >= ...
     and hd.hd_dep_count = 2
order by cnt;
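To illustrate "filter first, then join", here is a hedged rewrite of the query above with the dimension filters pushed into subqueries so the joins scan fewer rows (the t_minute condition is omitted because its value is not given above):

select count(*) cnt
from store_sales ss
     join (select hd_demo_sk
           from household_demographics
           where hd_dep_count = 2) hd on (ss.ss_hdemo_sk = hd.hd_demo_sk)
     join (select t_time_sk
           from time_dim
           where t_hour = 8) t on (ss.ss_sold_time_sk = t.t_time_sk)
     join store s on (s.s_store_sk = ss.ss_store_sk)
order by cnt;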

4. Snappy compression of the map, shuffle, and reduce stages of the MapReduce process

Hadoop's native libraries must be in place for Snappy compression to work.
Enable compression and the compression codec in mapred-site.xml, or set them when executing the SQL:

set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
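If intermediate (map-output/shuffle) compression is also wanted, these commonly used settings cover that side as well; treat them as a sketch to adapt rather than the article's original list:

-- compress map output before the shuffle
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Hive-level switches for intermediate and final job output
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;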

5. Split a large table into sub-tables: extract intermediate result sets so each job loads less data.
Multi-dimensional analysis involves multiple analysis modules;
each analysis module touches different fields, not the full set of the table's columns.
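A minimal sketch of the idea (table and column names below are hypothetical): materialize only the columns and rows one analysis module needs into an intermediate table, then run that module's queries against it.

create table ods_orders_mid
row format delimited fields terminated by '\t'
as
select order_id, user_id, amount, day
from ods_orders
where day = '20170308';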

6. Partition tables and external tables
Design two-level partition tables (first-level field day, second-level field hour).
Create external tables and specify the directory directly at creation time, so data no longer has to be loaded with load.
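A sketch of such a two-level partitioned external table (the table name, columns, and location are hypothetical):

create external table log_part (
  ip string,
  url string,
  status int
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t'
location '/input/logs';

-- register one hour's directory as a partition
alter table log_part add partition (day='20170308', hour='14')
  location '/input/logs/20170308/14';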

7. Set the number of map and reduce tasks. By default one HDFS block corresponds to one map task; the number of maps is usually left alone, while the number of reducers is adjusted to the data volume, reflecting the "divide and conquer" idea.

    set mapred.reduce.tasks=3;
8. JVM reuse. A job may contain many map and reduce tasks; by default each task starts its own JVM, and the JVM is destroyed when the task finishes. Setting the JVM reuse parameter (generally no more than 5) lets several tasks run consecutively inside one JVM.

JVM reuse is a Hadoop tuning parameter with a large impact on Hive performance, especially in scenarios where small files are hard to avoid or where most tasks finish quickly. Hadoop's default is to fork a new JVM for each map or reduce task, and JVM startup becomes considerable overhead when a job contains tens of thousands of tasks.
JVM reuse lets JVM instances be reused n times within the same job; n can be set in Hadoop's mapred-site.xml (a value of about 5-10 is a common recommendation).
mapred.job.reuse.jvm.num.tasks (old name)
mapreduce.job.jvm.numtasks (new name)
http://hadoop.apache.org/docs/r2.5.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

You can also set it inside Hive:

set mapred.job.reuse.jvm.num.tasks=10;
hive (default)> set mapred.job.reuse.jvm.num.tasks;
mapred.job.reuse.jvm.num.tasks=1    -- the default value

9. Speculative execution. For example, if a job has 10 MapReduce tasks (map and reduce) and 9 have finished, the application master starts a duplicate of the unfinished task on another node; whichever copy finishes first is used, and the other is killed.

The biggest advantage of speculative execution is that when one map attempt runs slowly, the system starts an identical map on another idle server; whichever returns a result first wins, and the slower attempt is killed once the result exists.

hive-site.xml
hive.mapred.reduce.tasks.speculative.execution=true;
<property>
  <name>hive.mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
  <description>Whether speculative execution for reducers should be turned on.</description>
</property>
Data skew

In an ordinary join, each key is hashed on the map side and shuffled to a particular reducer, where the join is performed; the table on the left side of the join is cached in memory and the table on the right side is streamed through it. So when joining, put the table with more data on the right side of the join.

When the data volume is large and the keys are unevenly distributed, a huge number of rows for some keys are shuffled to a single reducer, and the data skews.

Data skew commonly appears in group by and in join ... on ... statements.

Join (data skew)
When two tables are joined, Hive evaluates from left to right, so put the small table before the large table (the small table is cached first).

When a MapReduce program runs, most reduce nodes finish, but one or a few reducers run much more slowly and stretch out the whole job, because some key has far more rows than the others (sometimes hundreds or thousands of times more); the reducer holding that key processes far more data than the rest. This is data skew. In Hive it often shows up as a job stuck at 99% of the reduce phase, with the last 1% taking hours; that situation is very likely caused by data skew.

hive.groupby.skewindata=true;
If the group by stage skews, set this to true.
<property>
  <name>hive.groupby.skewindata</name>
  <value>false</value>
  <description>Whether there is skew in data to optimize group by queries</description>
</property>
hive.optimize.skewjoin.compiletime=true;
If the join stage skews, set this to true.
Null values that do not affect the result can be filtered out.
<property>
  <name>hive.optimize.skewjoin.compiletime</name>
  <value>false</value>
</property>

With skew-join optimization enabled, the join statement is converted into two MapReduce jobs; the first adds a random prefix (hash) to the skewed join keys so they spread across reducers.
set hive.skewjoin.key=100000;   -- keys with more than this many records are treated as skewed and optimized
Null join keys can likewise be given a random prefix so they do not all land on one reducer (see the sketch below).
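A hedged sketch of that null-salting trick (table and column names are hypothetical; user_id is assumed to be a string): null user_ids are replaced by distinct random keys so they no longer pile onto one reducer, while still never matching any real key on the other side.

select a.*, b.user_name
from log a
left join users b
  on (case when a.user_id is null
           then concat('hive_null_', rand())
           else a.user_id end) = b.user_id;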

Three common joins

Map-side join

The idea of a map join: when one of the two tables being joined is small and the other is very large, the small table is loaded directly into memory and the large table is scanned on the map side. The join happens during the map operation: every time a row of the big table is read, it is looked up against the in-memory small table and joined if it matches. No reduce phase is involved, so the big advantage of the map-side join is that there is no shuffle. In practice we set:
1. set hive.auto.convert.join=true;
This lets Hive automatically recognise the smaller table and use a map join to join the two tables.

<property>
  <name>hive.auto.convert.join.noconditionaltask.size</name>
  <value>10000000</value>    <!-- the default is 10 MB -->
</property>

DistributedCache is MapReduce's distributed caching mechanism; it plays a very important role in the whole framework and lets us implement fairly complex and efficient distributed logic (here, shipping the small table to every mapper).

In the execution plan, a local task first runs the map-join build: it dumps the small table into a hash-table file (xxx.hashtable), uploads that file for the map join, and only then runs the MapReduce job that does the actual counting. Bluntly, the map join never runs a reduce phase at all; it is just an in-memory join of the two tables.

Map join usage scenarios
1. One of the tables in the join is very small.
2. Non-equi join operations.

Automatic conversion

set hive.auto.convert.join=true;
hive.mapjoin.smalltable.filesize=25000000;   -- the default is 25 MB
<property>
  <name>hive.mapjoin.smalltable.filesize</name>
  <value>25000000</value>
</property>

Manual conversion: if table a exceeds 25 MB but you still want a map join with a as the small table, use the hint:
select /*+ mapjoin(a) */ f.a, f.b from a join b f on (f.a = a.a);

Reference: Hive Getting Started — three ways to optimize joins (CSDN blog): http://blog.csdn.net/liyaohhh/article/details/50697519

Reduce-side join

Hive's join uses the reduce-side join by default.
The principle of the reduce-side join is simpler: it does not require rows with the same key from different datasets to reach the same mapper; the whole dataset is sorted during the shuffle that follows the map phase. Compared with the map-side join, it does not force every mapper to read an entire dataset, which is an advantage; the downside is that the data to be sorted after the map phase is very large, so the shuffle is much less efficient than in a map-side join.
The reduce-side join is one of the simplest ways to join. The main idea: in the map phase, the map function reads both files, File1 and File2, and tags each record with its source so the key/value pairs from the two files can be told apart on the reduce side.
Semi join: a small-table-to-large-table variant of the reduce join; the map phase filters out records that do not participate in the join, equivalent to adding a where filter in Hive SQL.

SMB join (sort merge bucket join)

SMB joins exist mainly to solve joins between two large tables: bucketing turns each "large table" into many "small tables" that can then be joined with a map-side join on matching buckets — a typical divide-and-conquer idea.

Set hive.enforce.bucketing=true;
Set hive.enforce.sorting=true;
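A hedged sketch of tables prepared for an SMB join (names and bucket count are hypothetical; the sortmerge conversion flag is an additional, commonly used setting not listed above):

set hive.auto.convert.sortmerge.join=true;

-- both tables bucketed and sorted on the join key, with the same bucket count
create table orders_smb (order_id int, user_id int, amount float)
clustered by (user_id) sorted by (user_id) into 8 buckets;

create table users_smb (user_id int, user_name string)
clustered by (user_id) sorted by (user_id) into 8 buckets;

select o.order_id, u.user_name
from orders_smb o
join users_smb u on o.user_id = u.user_id;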

Table and data optimization goal: keep related data together as much as possible.
