Hive UDF Experiment 1


The Hive version used in the project is older than 0.11, so the window analysis functions added in Hive 0.11 are not available.

Since the project needs the row_number() function, a UDF was written to implement it.

Create a new Java project and add hadoop-core.jar and hive-exec.jar to the build path.

package myudf;

import java.util.Arrays;

import org.apache.hadoop.hive.ql.exec.UDF;

public final class Rank extends UDF {
    private int counter;
    private String[] _paras = null;

    // Returns a running count of consecutive rows with the same key columns:
    // the counter is incremented while the arguments repeat and is reset to 1
    // whenever a new key appears.
    public int evaluate(String... paras) {
        if (_paras != null && Arrays.equals(_paras, paras)) {
            this.counter++;
        } else {
            this.counter = 1;
            _paras = paras;
        }
        return this.counter;
    }
}

Create a data file for a numbers auxiliary table:

A numbers auxiliary table like this is handy whenever you need to generate test data.
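As an assumption (the exact file content is not given here), a plain sequence of integers is enough; the row count below is purely illustrative:

# Assumption: one integer per line; the actual row count used in the experiment is not stated.
seq 1 1000 > /home/hadoop/data/nums.txt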

Create a nums table in Hive and load the data file just created:

Hive-e "CREATE table nums (num int) row format delimited stored as textfile;" HIVE-E "Load data local inpath '/home/hadoop/data/nums.txt ' overwrite into table nums;"

Generate test data using the auxiliary table:

Hive-e "CREATE table MyAccount as select ' Account1 ' as account, ' Evan ' as Maker, Rand () *10000 as Trans_amount from Nums a Join Nums B; " Insert Overwrite  table MyAccount select A.*from myaccount ajoin (SELECT * from nums where num <10) b;--check file H As how many blocksinsert overwrite  table MyAccount Select A.*from myaccount a order by Account,maker;

The table now holds roughly 300 MB of data.

With the default settings, Hive reads this table with 2 mappers (the default CombineHiveInputFormat combines input splits up to mapred.max.split.size, 256 MB by default, so about 300 MB yields two map tasks).

Package the Java project into a jar and upload it to HDFS so that Hive can use it.
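A minimal sketch of the packaging step from the command line, assuming the hadoop-core and hive-exec jars added to the build path earlier sit in the current directory (file names and paths are placeholders):

mkdir -p classes
javac -cp hadoop-core.jar:hive-exec.jar -d classes myudf/Rank.java
jar -cf myhiveudf.jar -C classes .

In practice the jar can also be exported directly from the IDE.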

hadoop fs -mkdir /user/hive/warehouse/udf/
hadoop fs -copyFromLocal /home/hadoop/jar/myhiveudf.jar /user/hive/warehouse/udf/

add jar hdfs:/user/hive/warehouse/udf/myhiveudf.jar;
create temporary function myrank as 'myudf.Rank';

Test the UDF against the data in Hive:

CREATE TABLE myaccountrank AS SELECT account, maker, trans_amount, myrank(account, maker) AS rank1 FROM myaccount;

Given the data, the expected result is max(rank1) = 1000000, but the actual result is 8348163.

The reason is that the CREATE TABLE statement runs with two mappers, so rows with the same account and maker are split across the two map tasks, each of which keeps its own counter, and the result is therefore wrong.

This raises the question: how does Hive decide whether a UDF executes in the map phase or the reduce phase?

User-defined functions come in three flavors: plain UDF, UDAF, and UDTF. A plain UDF appears to be evaluated directly in the map phase, while a UDAF, being an aggregate, is completed in the reduce phase.
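Before Hive 0.11, a commonly cited workaround for this kind of stateful ranking UDF was to force a shuffle first with DISTRIBUTE BY / SORT BY, so that rows sharing a key arrive grouped and sorted where the UDF is evaluated. This is only a sketch of that pattern, not something verified in this experiment, and it still relies on the UDF being evaluated after the shuffle on the Hive version in use:

SELECT account, maker, trans_amount, myrank(account, maker) AS rank1
FROM (
  SELECT account, maker, trans_amount
  FROM myaccount
  DISTRIBUTE BY account, maker
  SORT BY account, maker
) t;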

Alternatively, the same requirement can be handled correctly with Hive's built-in row_number(), as follows:

CREATE TABLE myaccountrank2 AS SELECT account, maker, trans_amount, row_number() OVER (PARTITION BY account, maker ORDER BY account, maker) AS rank1 FROM myaccount;
When executed, the plan shows two mappers and one reducer. The first run failed with a Java heap space error.

set mapred.child.java.opts;  -- default is 200M

Adjust this parameter to 1 GB:

set mapred.child.java.opts=-Xmx1024m;

With the larger heap, the query completes.
Other common parameters:
set mapred.max.split.size;     -- default 256M
set hive.merge.mapfiles;       -- hive.merge.mapfiles = true: whether to merge map output files, default true
set hive.merge.mapredfiles;    -- hive.merge.mapredfiles = false: whether to merge reduce output files, default false
set hive.merge.size.per.task;  -- hive.merge.size.per.task = 256*1000*1000: target size of the merged files
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

set hive.input.format;
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat  -- default value

All parameters can be viewed with the command: set -v;

Other Hive tips found on the web, not tested here and to be verified in practice, are quoted below:

1. Filter data as early as possible: reduce the amount of data at each stage, prune partitions on partitioned tables, and select only the columns that are actually needed.
2. Keep statements simple and atomic: avoid packing complex logic into a single SQL statement; use intermediate tables to break complex logic into steps.
3. In a join, take care to put the small table on the left side; if a small table is involved, a map join can be specified explicitly (see the sketch after this list).
4. If a UNION ALL has more than two branches, or each branch carries a large volume of data, split it into multiple INSERT INTO statements; in actual tests this improved execution time by about 50%.
5. Understand the characteristics of the data before writing SQL; wherever there are join or group operations, watch out for data skew.
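As an illustration of tip 3, this is the classic map-join hint; the table names big_table and small_table are placeholders rather than tables from this experiment:

SELECT /*+ MAPJOIN(s) */ b.id, b.val, s.name
FROM big_table b
JOIN small_table s ON (b.id = s.id);

On later Hive versions, setting hive.auto.convert.join=true lets Hive convert such joins to map joins automatically.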
