How do I customize UDF functions for Apache Pig?


Recently I needed to use Pig to analyze search log data for work. I had originally planned to do the analysis with Hive, but for various reasons that didn't work out, and I had never touched Pig (Pig 0.12 on CDH) before, so I had to cram: I spent two days reading through the documentation on the official Pig website, experimenting as I read. This article is not about how to learn a framework or a language quickly; as the title says, it introduces how to use user-defined functions (UDFs) in Pig. I will write about the learning experience itself in a later article.



Once you have learned how to use UDFs, you can work with Pig far more flexibly, extending it with features tailored to your own business scenarios that stock Pig does not provide. For example:

The data you read from HDFS may use a special encoding or serialization format. The default PigStorage() loader supports only a limited set of encodings and types, so if the data was written with a custom encoding or serializer, loading it with stock Pig will fail. This is exactly where a UDF comes in handy: we just need to write a custom load function (LoadFunc) and store function (StoreFunc) to solve the problem.
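
To give a feel for the load side, here is a minimal LoadFunc sketch. It is my own illustration, not from the original post: the class name MyCustomLoader and the "||" field delimiter are assumptions. It reuses Hadoop's stock TextInputFormat to read lines and only customizes how each line is split into a tuple:

Java code
package com.pigudf;

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

/**
 * Minimal custom LoadFunc sketch: reads plain text lines and splits
 * each one on a custom "||" delimiter instead of a single character.
 */
public class MyCustomLoader extends LoadFunc {

    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // reuse the stock line-based input format
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;    // end of input
            }
            String line = ((Text) reader.getCurrentValue()).toString();
            // one tuple per line, one field per "||"-separated token
            return tupleFactory.newTuple(Arrays.asList(line.split("\\|\\|")));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}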


This article works through the example from the official documentation hands-on; everything below was tested with Pig on a Hadoop cluster.
Defining a UDF extension class takes the following steps:

1. Create a new Java project in Eclipse and import Pig's core jar (the Java project itself).
2. Create a new package and a class that extends the appropriate Pig interface or base class, overriding the parts you need (the core business logic).
3. When the code is finished, package it as a jar with Ant (Pig is a compile-time dependency, but the Pig jar does not need to be bundled into the UDF jar).
4. Upload the packaged jar to HDFS (Pig needs to load it at run time).
5. In the Pig script, register our custom UDF jar (injecting it into the run-time environment).
6. Write the core business logic as a Pig script and run it to test that it succeeds.


The project contains a single class, com.pigudf.MyUDF, with the Pig jar on the build path.

The core code is as follows:

Java code
package com.pigudf;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

/**
 * Custom UDF class: converts a string to uppercase.
 * @author Qindongliang
 */
public class MyUDF extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // skip null or empty input
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            // get the first field
            String str = (String) input.get(0);
            // return it converted to uppercase
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("caught exception processing input row", e);
        }
    }
}
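
Before packaging, it is worth a quick local sanity check. The tiny harness below is my own addition, not part of the original post; it assumes the Pig jar is on the classpath and simply feeds the UDF a hand-built tuple:

Java code
package com.pigudf;

import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MyUDFTest {
    public static void main(String[] args) throws Exception {
        // build a one-field tuple by hand and invoke the UDF directly
        Tuple t = TupleFactory.getInstance().newTuple(1);
        t.set(0, "AbC");
        System.out.println(new MyUDF().exec(t)); // expected output: ABC
    }
}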



As for the Ant script used for packaging, I will attach it at the end of this post. Now let's look at some test data (note that the file must be uploaded to HDFS, unless you are running in local mode):

Grunt output
grunt> cat s.txt
Zhang San,
Song,
Long,
AbC,
grunt>




Let's verify that the data file and the jar package sit in the same HDFS directory:

Grunt output
grunt> ls
hdfs://dnode1:8020/tmp/udf/pudf.jar<r 3>        1295
hdfs://dnode1:8020/tmp/udf/s.txt<r 3>
grunt>



Finally, let's look at the Pig script itself:

Pig code
--register the custom jar package
REGISTER pudf.jar;
--load the test file, using comma as the delimiter
a = LOAD 's.txt' USING PigStorage(',');
--iterate over the rows and convert the name column to uppercase
b = FOREACH a GENERATE com.pigudf.MyUDF((chararray)$0);
--start the MapReduce job for the data analysis
DUMP b;


Finally, let's look at the result. As long as the run produces no exception and the job does not fail, our UDF worked:

Grunt output
Counters:
Total records written : 4
Total bytes written :
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1419419533357_0147

2014-12- :10,394 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-12- :10,395 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-12- :10,396 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2014-12- :10,405 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-12- :10,405 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(ZHANG SAN,)
(SONG,)
(LONG,)
(ABC,)


The results look correct, so our UDF loaded and ran successfully. If we want to write the output directly to HDFS instead, we can remove the dump command at the end of the Pig script and add store b into '/tmp/dongliang/result/'; to store the results on HDFS. Of course, we can also write a custom store function and send the results to a relational database, or to Lucene, HBase, or some other NoSQL store.
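
As a taste of what such a store function involves, here is a minimal StoreFunc sketch. It is my own illustration, not from the original post: the class name MyCustomStorage and the pipe delimiter are assumptions. It writes each tuple as one delimited text line; a database- or HBase-backed version would differ mainly in what putNext() does with each tuple:

Java code
package com.pigudf;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

/**
 * Minimal custom StoreFunc sketch: writes each tuple as one
 * pipe-delimited text line.
 */
public class MyCustomStorage extends StoreFunc {

    private RecordWriter<NullWritable, Text> writer;

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        // reuse the stock text output format
        return new TextOutputFormat<NullWritable, Text>();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        FileOutputFormat.setOutputPath(job, new Path(location));
    }

    @Override
    public void prepareToWrite(RecordWriter writer) throws IOException {
        this.writer = writer;
    }

    @Override
    public void putNext(Tuple t) throws IOException {
        // join the tuple's fields with a pipe; a database-backed
        // version would issue an insert here instead
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < t.size(); i++) {
            if (i > 0) {
                sb.append('|');
            }
            Object field = t.get(i);
            sb.append(field == null ? "" : field.toString());
        }
        try {
            writer.write(NullWritable.get(), new Text(sb.toString()));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

After registering the jar, it would be used just like the built-in storage, e.g. store b into '/tmp/dongliang/result/' using com.pigudf.MyCustomStorage();.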
