The Hive version used in this project is older than 0.11, so the window/analytic functions added in Hive 0.11 are not available. Since the project needed row_number(), a UDF was written to implement the same functionality.
Create a new Java project and add hadoop-core.jar and hive-exec.jar to the build path.
package myudf;

import java.util.Arrays;

import org.apache.hadoop.hive.ql.exec.UDF;

public final class Rank extends UDF {
    // Per-instance state: the running counter and the key columns of the previous row.
    private int counter;
    private String[] _paras = null;

    public int evaluate(String... paras) {
        // Same key as the previous row: continue numbering; otherwise reset to 1.
        if (_paras != null && Arrays.equals(_paras, paras)) {
            this.counter++;
        } else {
            this.counter = 1;
            _paras = paras;
        }
        return this.counter;
    }
}
Create a data file for an auxiliary table. A numeric auxiliary table like this is handy whenever you need to generate data.
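The generation command itself is not shown; a minimal sketch, assuming nums.txt should hold one integer per line (the row count of 1000 is a guess, not from the original):

seq 1 1000 > /home/hadoop/data/nums.txt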
Create a nums table in Hive and load the data file just created:
Hive-e "CREATE table nums (num int) row format delimited stored as textfile;" HIVE-E "Load data local inpath '/home/hadoop/data/nums.txt ' overwrite into table nums;"
Generate data using the auxiliary table:
Hive-e "CREATE table MyAccount as select ' Account1 ' as account, ' Evan ' as Maker, Rand () *10000 as Trans_amount from Nums a Join Nums B; " Insert Overwrite table MyAccount select A.*from myaccount ajoin (SELECT * from nums where num <10) b;--check file H As how many blocksinsert overwrite table MyAccount Select A.*from myaccount a order by Account,maker;
After this, the table holds roughly 300 MB of data.
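The size can be confirmed against the table's directory, assuming the default warehouse location:

hadoop fs -du /user/hive/warehouse/myaccount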
At this size, Hive by default processes the table with 2 mappers.
Package the Java project and upload the jar to HDFS so Hive can use it.
hadoop fs -mkdir /user/hive/warehouse/udf/
hadoop fs -copyFromLocal /home/hadoop/jar/myhiveudf.jar /user/hive/warehouse/udf/

add jar hdfs:/user/hive/warehouse/udf/myhiveudf.jar;
create temporary function myrank as 'myudf.Rank';
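To confirm the registration, run the following in the same session (the function is temporary, so it disappears when the session ends):

describe function myrank;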
Test the UDF against the data:
create table myaccountrank as select account, maker, trans_amount, myrank(account, maker) as rank1 from myaccount;
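The figures below were presumably obtained with a query along these lines (not shown in the original):

select max(rank1) from myaccountrank;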
Given our data, the expected result is max(rank1) = 1000000, but the actual result is 8348163.
The reason is that the CREATE TABLE statement runs with two mappers, so rows with the same account and maker are split across two map tasks, each keeping its own counter, which corrupts the result.
This raises the question: how does Hive decide whether a UDF executes in the map phase or the reduce phase?
UDFs come in three kinds: UDAF, UDTF, and plain UDF. It appears that a plain UDF like ours is evaluated directly in the map phase, while a UDAF runs in the reduce phase.
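A classic workaround from the pre-0.11 era, not taken in this article, is to push the grouping columns through a reduce stage with DISTRIBUTE BY and SORT BY, so that each (account, maker) group arrives contiguously at a single task before the UDF sees it. A sketch (myaccountrank_fixed is a made-up name):

create table myaccountrank_fixed as
select account, maker, trans_amount, myrank(account, maker) as rank1
from (
    select account, maker, trans_amount
    from myaccount
    distribute by account, maker
    sort by account, maker
) t;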
The same requirement, done properly, can be handled with Hive's own row_number() (available from 0.11), as follows:
create table myaccountrank2 as select account, maker, trans_amount, row_number() over (partition by account, maker order by account, maker) as rank1 from myaccount;
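Note that ordering by the partition columns themselves makes the numbering within each group arbitrary; a more typical variation (myaccountrank3 is a made-up name) orders by a non-key column:

create table myaccountrank3 as select account, maker, trans_amount, row_number() over (partition by account, maker order by trans_amount) as rank1 from myaccount;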
When executed, the plan shows two mappers and one reducer. The first run failed with a Java heap space error.
set mapred.child.java.opts; -- default 200M

Adjust this parameter to 1 GB:

set mapred.child.java.opts=-Xmx1024m;
Rerunning the statement then completes successfully.
Other common parameters:
set mapred.max.split.size; -- default 256M
set hive.merge.mapfiles; -- hive.merge.mapfiles=true: whether to merge map output files, default true
set hive.merge.mapredfiles; -- hive.merge.mapredfiles=false: whether to merge reduce output files, default false
set hive.merge.size.per.task; -- hive.merge.size.per.task=256*1000*1000: the size of the merged files
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.input.format;
-- hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat (default value)
All parameters can be viewed with the command: set -v;
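From the shell, the listing can be filtered for a single parameter, for example (not in the original):

hive -e "set -v" | grep mapred.child.java.opts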
The following Hive tips, found on the web, have not been tested here and need to be verified in practice:
1. Filter data as early as possible to reduce the amount of data at each stage: use partition pruning on partitioned tables, and select only the columns you actually need.
2. Keep statements atomic: avoid packing complex logic into a single SQL statement; use intermediate tables to build up the complex logic.
3. In a join, be careful to put the small table on the left side. If one table is small, you can request a map join (see the sketch after this list).
4. If a UNION ALL has more than 2 branches, or each branch carries a large volume of data, split it into multiple INSERT INTO statements; in actual tests this improved execution time by 50%.
5. Understand the characteristics of the data before writing SQL; with joins or GROUP BY operations, watch out for data skew.
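For tip 3, classic Hive requests a map join with a hint; the table and column names below are made up for illustration:

select /*+ MAPJOIN(s) */ b.id, b.val, s.name from big_table b join small_table s on (b.id = s.id);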