Use of Hive 03


1. Four kinds of sorting in Hive

1.1 ORDER BY: sorts globally, with only one reducer

select * from hive.employee order by id;

1.2 SORT BY: sorts the data within each reducer; the global result set is not sorted

set mapreduce.job.reduces=3;  -- sets the number of reducers to 3

insert overwrite local directory '/opt/data/employee_sort_by'
row format delimited fields terminated by '\t' collection items terminated by '\n'
select * from hive.employee sort by dept_id;

1.3 DISTRIBUTE BY: partitions the data and is combined with sort by, similar to the partitioner in MapReduce; it must be written before sort by

insert overwrite local directory '/opt/data/employee_distribute_by'
row format delimited fields terminated by '\t' collection items terminated by '\n'
select * from hive.employee distribute by dept_id sort by id asc;

1.4 CLUSTER BY: when the distribute by and sort by columns are the same, cluster by can be used instead, as in the sketch below
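
A minimal sketch of the equivalence, following the earlier examples (the output directory name here is assumed):

insert overwrite local directory '/opt/data/employee_cluster_by'
row format delimited fields terminated by '\t' collection items terminated by '\n'
select * from hive.employee cluster by dept_id;
-- equivalent to: distribute by dept_id sort by dept_id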

2. Using custom UDF functions

2.1 Writing a UDF

Extend the UDF class (org.apache.hadoop.hive.ql.exec.UDF)

Implement the evaluate() method

2.2 Importing custom functions into the Hive function library

Method One:

add jar /opt/data/jars/my_lower.jar;
create temporary function my_lower as 'com.ibeifeng.hive.udf.LowerUdf';

Method Two:

create function self_lower as 'com.ibeifeng.hive.udf.LowerUdf' using jar 'hdfs://life-hadoop.life.com:8020/user/yanglin/data/jars/my_lower.jar';
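
Either way, the registered function can then be called like a built-in one. A minimal usage sketch, assuming the hive.employee table from section 1 and that LowerUdf simply lower-cases a string:

select my_lower(name) from hive.employee limit 10;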

      

      

3. Use of HiveServer2

3.1 Starting HiveServer2

bin/hiveserver2

3.2 Connecting using Beeline

!connect jdbc:hive2://life-hadoop.life.com:10000 yanglin life@one

Here the last two arguments are the username and the password.
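
Beeline can also connect directly from the shell; a sketch using the same host and credentials (-u, -n and -p pass the JDBC URL, username and password):

bin/beeline -u jdbc:hive2://life-hadoop.life.com:10000 -n yanglin -p life@one

Once connected, HiveQL statements such as select * from hive.employee limit 5; run exactly as in the Hive CLI.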

    

4. Data compression

4.1 Compression of map output results

set mapreduce.map.output.compress=true;

set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4.2 Compression of reduce output results

set mapreduce.output.fileoutputformat.compress=true;

set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4.3 Map input data compression

Store the data in a compressed columnar format (for example, ORC or Parquet):

create table if not exists hive.employee_orc_snappy (
    id int, name string, job string, manager_id int, apply_date string,
    salary double, reward double, dept_id int
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress" = "SNAPPY");

Here the table's storage format is ORC and the file compression codec is Snappy.
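
To populate the compressed table, data is typically inserted from an existing table; a minimal sketch, assuming the plain hive.employee table from section 1 has matching columns:

insert overwrite table hive.employee_orc_snappy
select * from hive.employee;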

5. Hive Tuning

5.1 Modify the hive.fetch.task.conversion parameter to minimize the use of MapReduce

<!-- use MapReduce as little as possible -->
<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
</property>
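
The same setting can also be applied for a single session from the Hive CLI instead of editing hive-site.xml:

set hive.fetch.task.conversion=more;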

5.2 Splitting large tables into smaller tables and sub-tables

5.3 Using external tables and partitioned tables

5.4 Using ORC or Parquet as the table storage format, with Snappy compression

5.5 Optimizing SQL

Common join / shuffle join / reduce join: the join happens in the reduce task phase

Used between two large tables; the data of each table is read from files

Map join: the join happens in the map task phase

Used between a small table and a large table; the large table's data is read from files, while the small table's data is loaded into memory via the DistributedCache

Note: by setting hive.auto.convert.join=true, Hive automatically decides whether to use a map join or a reduce join. An explicit hint can also force a map join, as in the sketch below.
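
A minimal sketch of forcing a map join with the older MAPJOIN hint (the hive.dept table and its dept_name column are assumed here for illustration):

select /*+ mapjoin(d) */ e.id, e.name, d.dept_name
from hive.employee e
join hive.dept d on e.dept_id = d.dept_id;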

SMB join: the sort-merge-bucket join is an optimization of the reduce join

Declare the bucketing when creating the table with [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS], and the bucketing columns of the two tables must match; see the DDL sketch after the settings below.

set hive.auto.convert.sortmerge.join=true;

set hive.optimize.bucketmapjoin=true;

set hive.optimize.bucketmapjoin.sortedmerge=true;
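
A minimal DDL sketch of a bucketed, sorted table suitable for an SMB join (the table name and bucket count are assumed, following the employee example; the other join-side table must be bucketed and sorted on the same column):

create table if not exists hive.employee_bucketed (
    id int, name string, dept_id int
)
clustered by (dept_id) sorted by (dept_id asc) into 4 buckets
row format delimited fields terminated by '\t';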

5.6 Setting up parallel job execution

set hive.exec.parallel=true;

set hive.exec.parallel.thread.number=8;  -- recommended 10~20, generally no more than 20

5.7 Setting up JVM reuse

set mapreduce.job.jvm.numtasks=1;  -- generally no more than 9

5.8 Setting the number of reducers

set mapreduce.job.reduces=1;

5.9 Setting speculative execution

set hive.mapred.reduce.tasks.speculative.execution=true;

set mapreduce.map.speculative=true;

set mapreduce.reduce.speculative=true;

5.10 Setting the number of map tasks

set hive.merge.size.per.task=256000000;
