1. Four kinds of sorting in Hive
1.1 order by: global sort, performed by a single reducer
select * from hive.employee order by id;
1.2 sort by: sorts the data within each reducer; the global result set is not sorted
set mapreduce.job.reduces=3; -- set the number of reducers to 3
insert overwrite local directory '/opt/data/employee_sort_by'
row format delimited fields terminated by '\t' collection items terminated by '\n'
select * from hive.employee sort by dept_id;
1.3 distribute by: partitions the data, similar to the partitioner in MapReduce; used together with sort by, and must precede it
insert overwrite local directory '/opt/data/employee_distribute_by'
row format delimited fields terminated by '\t' collection items terminated by '\n'
select * from hive.employee distribute by dept_id sort by id asc;
1.4 cluster by: when distribute by and sort by use the same column, cluster by can be used instead
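For example, the following two queries are equivalent (using the same employee table as above; note that cluster by only sorts in ascending order):

```sql
select * from hive.employee distribute by dept_id sort by dept_id;
-- equivalent shorthand:
select * from hive.employee cluster by dept_id;
```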
2. Using custom UDF functions
2.1 Writing a UDF
Extend the UDF class
Implement the evaluate method
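A minimal sketch of the evaluate logic in plain Java. In a real Hive UDF the class extends org.apache.hadoop.hive.ql.exec.UDF and typically returns org.apache.hadoop.io.Text, which requires the hive-exec jar on the classpath; plain String is used here so the sketch compiles standalone. The class name matches the my_lower example registered below.

```java
// Sketch of a lower-casing UDF. A real Hive UDF would declare:
//   public class LowerUdf extends org.apache.hadoop.hive.ql.exec.UDF
// and use Hadoop's Text type; String is used here so the sketch
// compiles without the hive-exec dependency.
public class LowerUdf {
    // Hive calls evaluate() once per row; null in, null out.
    public String evaluate(String input) {
        if (input == null) {
            return null;
        }
        return input.toLowerCase();
    }

    public static void main(String[] args) {
        LowerUdf udf = new LowerUdf();
        System.out.println(udf.evaluate("HeLLo Hive")); // hello hive
    }
}
```

Package the compiled class into a jar (my_lower.jar in the commands below) before registering it.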
2.2 Importing the custom function into the Hive function library
Method one:
add jar /opt/data/jars/my_lower.jar;
create temporary function my_lower as 'com.ibeifeng.hive.udf.LowerUdf';
Method two:
create function self_lower as 'com.ibeifeng.hive.udf.LowerUdf' using jar 'hdfs://life-hadoop.life.com:8020/user/Yanglin/data/jars/my_lower.jar';
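Once registered, the function is called like any built-in function (assuming the employee table from section 1):

```sql
select my_lower(name) from hive.employee;
```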
3. Using HiveServer2
3.1 Start HiveServer2: bin/hiveserver2
3.2 Connect using Beeline:
!connect jdbc:hive2://life-hadoop.life.com:10000 Yanglin Life@one
4. Data compression
4.1 Compressing map output
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4.2 Compressing reduce output
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4.3 Compressing map input
Store the data in a compressed columnar format (for example, ORC or Parquet)
create table if not exists hive.employee_orc_snappy(
    id int, name string, job string, manager_id int, apply_date string,
    salary double, reward double, dept_id int)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="SNAPPY");
Here the table storage format is ORC and the file compression format is Snappy.
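Data is typically loaded into such a table with an insert-select, which writes the rows as Snappy-compressed ORC (assuming the plain hive.employee table from section 1 already holds the data):

```sql
insert into table hive.employee_orc_snappy
select * from hive.employee;
```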
5. Hive tuning
5.1 Set the hive.fetch.task.conversion parameter to use MapReduce as little as possible
<!-- use MapReduce as little as possible -->
<property>
<name>hive.fetch.task.conversion</name>
<value>more</value>
</property>
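The same setting can also be applied per session. With the value more, simple queries (plain selects, filters, limits) run as direct fetch tasks without launching a MapReduce job, for example:

```sql
set hive.fetch.task.conversion=more;
-- runs as a fetch task, no MapReduce job is started:
select id, name from hive.employee limit 10;
```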
5.2 Split large tables into smaller tables and sub-tables
5.3 Use external tables and partitioned tables
5.4 Store table data in ORC or Parquet format, with Snappy compression
5.5 SQL optimization
Common join / shuffle join / reduce join: the join happens in the reduce phase
Used between two large tables; the data of each table is read from files
Map join: the join happens in the map phase
Used between a small table and a large table; the large table's data is read from files, while the small table is loaded into memory via DistributedCache
Note: with hive.auto.convert.join=true, Hive automatically chooses between map join and reduce join.
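As an illustration (the small_dept table and its dept_name column are hypothetical), the legacy hint syntax can force a map join explicitly when auto-conversion is disabled:

```sql
-- small_dept is loaded into memory and joined in the map phase
select /*+ MAPJOIN(d) */ e.id, e.name, d.dept_name
from hive.employee e
join hive.small_dept d on e.dept_id = d.dept_id;
```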
SMB join: sort-merge-bucket join, an optimization of the reduce join
Declare bucketing when creating the tables with [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS], and the bucketing columns of the two tables must match.
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
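A bucketed, sorted table suitable for an SMB join might be declared as follows (table and bucket count are illustrative; both join-side tables need the same bucketing column, and hive.enforce.bucketing=true should be set when populating bucketed tables on older Hive versions):

```sql
create table if not exists hive.employee_bucketed(
    id int, name string, dept_id int)
clustered by (dept_id) sorted by (dept_id asc) into 4 buckets;
```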
5.6 Enable parallel job execution
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8; -- 10 to 20 is recommended, generally not more than 20
5.7 Enable JVM reuse
set mapreduce.job.jvm.numtasks=1; -- generally no more than 9
5.8 Set the number of reducers
set mapreduce.job.reduces=1;
5.9 Enable speculative execution
set hive.mapred.reduce.tasks.speculative.execution=true;
set mapreduce.map.speculative=true;
set mapreduce.reduce.speculative=true;
5.10 Set the number of map tasks
set hive.merge.size.per.task=256000000;