The Hive version used in this project is older than 0.11, so the window/analytic functions added in Hive 0.11 are not available. Since the project needed row_number(), a UDF was written to implement the same functionality.
Create a new Java project and add hadoop-core.jar and hive-exec.jar to the build path.
package myudf;

import java.util.Arrays;

import org.apache.hadoop.hive.ql.exec.UDF;

public final class Rank extends UDF {
    // Per-instance state: the running counter and the key columns of the previous row.
    private int counter;
    private String[] _paras = null;

    public int evaluate(String... paras) {
        // Same key as the previous row: continue numbering; otherwise reset to 1.
        if (_paras != null && Arrays.equals(_paras, paras)) {
            this.counter++;
        } else {
            this.counter = 1;
            _paras = paras;
        }
        return this.counter;
    }
}
Create a data file for an auxiliary table. A numeric auxiliary table like this is handy whenever you need to generate data.
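The generation command itself is not shown; a minimal sketch, assuming nums.txt should hold one integer per line (the row count of 1000 is a guess, not from the original):

seq 1 1000 > /home/hadoop/data/nums.txt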
Create a nums table in Hive and load the data file just created:
Hive-e "CREATE table nums (num int) row format delimited stored as textfile;" HIVE-E "Load data local inpath '/home/hadoop/data/nums.txt ' overwrite into table nums;"
Generate data using the auxiliary table:
Hive-e "CREATE table MyAccount as select ' Account1 ' as account, ' Evan ' as Maker, Rand () *10000 as Trans_amount from Nums a Join Nums B; " Insert Overwrite table MyAccount select A.*from myaccount ajoin (SELECT * from nums where num <10) b;--check file H As how many blocksinsert overwrite table MyAccount Select A.*from myaccount a order by Account,maker;
After this, the table holds roughly 300 MB of data.
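The size can be confirmed against the table's directory, assuming the default warehouse location:

hadoop fs -du /user/hive/warehouse/myaccount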
At this size, Hive by default processes the table with 2 mappers.
Package the Java project and upload the jar to HDFS so Hive can use it.
hadoop fs -mkdir /user/hive/warehouse/udf/
hadoop fs -copyFromLocal /home/hadoop/jar/myhiveudf.jar /user/hive/warehouse/udf/

add jar hdfs:/user/hive/warehouse/udf/myhiveudf.jar;
create temporary function myrank as 'myudf.Rank';
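To confirm the registration, run the following in the same session (the function is temporary, so it disappears when the session ends):

describe function myrank;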
Test the UDF against the data:
create table myaccountrank as select account, maker, trans_amount, myrank(account, maker) as rank1 from myaccount;
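The figures below were presumably obtained with a query along these lines (not shown in the original):

select max(rank1) from myaccountrank;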
Given our data, the expected result is max(rank1) = 1000000, but the actual result is 8348163.
The reason is that the CREATE TABLE statement runs with two mappers, so rows with the same account and maker are split across two map tasks, each keeping its own counter, which corrupts the result.
This raises the question: how does Hive decide whether a UDF executes in the map phase or the reduce phase?
UDFs come in three kinds: UDAF, UDTF, and plain UDF. It appears that a plain UDF like ours is evaluated directly in the map phase, while a UDAF runs in the reduce phase.
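A classic workaround from the pre-0.11 era, not taken in this article, is to push the grouping columns through a reduce stage with DISTRIBUTE BY and SORT BY, so that each (account, maker) group arrives contiguously at a single task before the UDF sees it. A sketch (myaccountrank_fixed is a made-up name):

create table myaccountrank_fixed as
select account, maker, trans_amount, myrank(account, maker) as rank1
from (
    select account, maker, trans_amount
    from myaccount
    distribute by account, maker
    sort by account, maker
) t;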
The same requirement, done properly, can be handled with Hive's own row_number() (available from 0.11), as follows:
create table myaccountrank2 as select account, maker, trans_amount, row_number() over (partition by account, maker order by account, maker) as rank1 from myaccount;
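Note that ordering by the partition columns themselves makes the numbering within each group arbitrary; a more typical variation (myaccountrank3 is a made-up name) orders by a non-key column:

create table myaccountrank3 as select account, maker, trans_amount, row_number() over (partition by account, maker order by trans_amount) as rank1 from myaccount;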
When executed, the plan shows two mappers and one reducer. The first run failed with a Java heap space error.
set mapred.child.java.opts; -- default 200M

Adjust this parameter to 1 GB:

set mapred.child.java.opts=-Xmx1024m;
Rerunning the statement then completes successfully.
Other common parameters:
set mapred.max.split.size; -- default 256M
set hive.merge.mapfiles; -- hive.merge.mapfiles=true: whether to merge map output files, default true
set hive.merge.mapredfiles; -- hive.merge.mapredfiles=false: whether to merge reduce output files, default false
set hive.merge.size.per.task; -- hive.merge.size.per.task=256*1000*1000: the size of the merged files
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.input.format;
-- hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat (default value)
All parameters can be viewed with the command: set -v;
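From the shell, the listing can be filtered for a single parameter, for example (not in the original):

hive -e "set -v" | grep mapred.child.java.opts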
The following Hive tips, found on the web, have not been tested here and need to be verified in practice:
1. Filter data as early as possible to reduce the amount of data at each stage: use partition pruning on partitioned tables, and select only the columns you actually need.
2. Keep statements atomic: avoid packing complex logic into a single SQL statement; use intermediate tables to build up the complex logic.
3. In a join, be careful to put the small table on the left side. If one table is small, you can request a map join (see the sketch after this list).
4. If a UNION ALL has more than 2 branches, or each branch carries a large volume of data, split it into multiple INSERT INTO statements; in actual tests this improved execution time by 50%.
5. Understand the characteristics of the data before writing SQL; with joins or GROUP BY operations, watch out for data skew.
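For tip 3, classic Hive requests a map join with a hint; the table and column names below are made up for illustration:

select /*+ MAPJOIN(s) */ b.id, b.val, s.name from big_table b join small_table s on (b.id = s.id);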