How do I customize UDF functions for Apache Pig?


Recently I needed to use Pig to analyze search log data for work. I had originally planned to do the analysis with Hive, but for various reasons that didn't work out, and I had never touched Pig (Pig 0.12 on CDH) before, so I had to cram: I spent two days reading through the documentation on the official Pig website, experimenting as I read. This article is not about how to learn a framework or a language quickly; as the title says, it introduces how to use user-defined functions (UDFs) in Pig. I will write about the learning experience itself in a later article.



Once you have learned how to use UDFs, you can work with Pig far more flexibly, extending it with features tailored to your own business scenarios that stock Pig does not provide. For example:

The data you read from HDFS may use a special encoding or serialization format. The default PigStorage() loader supports only a limited set of encodings and types, so if the data was written with a custom encoding or serializer, loading it with stock Pig will fail. This is exactly where a UDF comes in handy: we just need to write a custom load function (LoadFunc) and store function (StoreFunc) to solve the problem.
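
To give a feel for the load side, here is a minimal LoadFunc sketch. It is my own illustration, not from the original post: the class name MyCustomLoader and the "||" field delimiter are assumptions. It reuses Hadoop's stock TextInputFormat to read lines and only customizes how each line is split into a tuple:

Java code
package com.pigudf;

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

/**
 * Minimal custom LoadFunc sketch: reads plain text lines and splits
 * each one on a custom "||" delimiter instead of a single character.
 */
public class MyCustomLoader extends LoadFunc {

    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // reuse the stock line-based input format
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;    // end of input
            }
            String line = ((Text) reader.getCurrentValue()).toString();
            // one tuple per line, one field per "||"-separated token
            return tupleFactory.newTuple(Arrays.asList(line.split("\\|\\|")));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}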


This article works through the example from the official documentation hands-on; everything below was tested with Pig on a Hadoop cluster.
Defining a UDF extension class takes the following steps:

1. Create a new Java project in Eclipse and import Pig's core jar (the Java project itself).
2. Create a new package and a class that extends the appropriate Pig interface or base class, overriding the parts you need (the core business logic).
3. When the code is finished, package it as a jar with Ant (Pig is a compile-time dependency, but the Pig jar does not need to be bundled into the UDF jar).
4. Upload the packaged jar to HDFS (Pig needs to load it at run time).
5. In the Pig script, register our custom UDF jar (injecting it into the run-time environment).
6. Write the core business logic as a Pig script and run it to test that it succeeds.


The project contains a single class, com.pigudf.MyUDF, with the Pig jar on the build path.

The core code is as follows:

Java code
package com.pigudf;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

/**
 * Custom UDF class: converts a string to uppercase.
 * @author Qindongliang
 */
public class MyUDF extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // skip null or empty input
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            // get the first field
            String str = (String) input.get(0);
            // return it converted to uppercase
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("caught exception processing input row", e);
        }
    }
}
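
Before packaging, it is worth a quick local sanity check. The tiny harness below is my own addition, not part of the original post; it assumes the Pig jar is on the classpath and simply feeds the UDF a hand-built tuple:

Java code
package com.pigudf;

import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MyUDFTest {
    public static void main(String[] args) throws Exception {
        // build a one-field tuple by hand and invoke the UDF directly
        Tuple t = TupleFactory.getInstance().newTuple(1);
        t.set(0, "AbC");
        System.out.println(new MyUDF().exec(t)); // expected output: ABC
    }
}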



As for the Ant script used for packaging, I will attach it at the end of this post. Now let's look at some test data (note that the file must be uploaded to HDFS, unless you are running in local mode):

Grunt output
grunt> cat s.txt
Zhang San,
Song,
Long,
AbC,
grunt>




Let's verify that the data file and the jar package sit in the same HDFS directory:

Grunt output
grunt> ls
hdfs://dnode1:8020/tmp/udf/pudf.jar<r 3>        1295
hdfs://dnode1:8020/tmp/udf/s.txt<r 3>
grunt>



Finally, let's look at the Pig script itself:

Pig code
--register the custom jar package
REGISTER pudf.jar;
--load the test file, using comma as the delimiter
a = LOAD 's.txt' USING PigStorage(',');
--iterate over the rows and convert the name column to uppercase
b = FOREACH a GENERATE com.pigudf.MyUDF((chararray)$0);
--start the MapReduce job for the data analysis
DUMP b;


Finally, let's look at the result. As long as the run produces no exception and the job does not fail, our UDF worked:

Grunt output
Counters:
Total records written : 4
Total bytes written :
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1419419533357_0147

2014-12- :10,394 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-12- :10,395 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-12- :10,396 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2014-12- :10,405 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-12- :10,405 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(ZHANG SAN,)
(SONG,)
(LONG,)
(ABC,)


The results look correct, so our UDF loaded and ran successfully. If we want to write the output directly to HDFS instead, we can remove the dump command at the end of the Pig script and add store b into '/tmp/dongliang/result/'; to store the results on HDFS. Of course, we can also write a custom store function and send the results to a relational database, or to Lucene, HBase, or some other NoSQL store.
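
As a taste of what such a store function involves, here is a minimal StoreFunc sketch. It is my own illustration, not from the original post: the class name MyCustomStorage and the pipe delimiter are assumptions. It writes each tuple as one delimited text line; a database- or HBase-backed version would differ mainly in what putNext() does with each tuple:

Java code
package com.pigudf;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

/**
 * Minimal custom StoreFunc sketch: writes each tuple as one
 * pipe-delimited text line.
 */
public class MyCustomStorage extends StoreFunc {

    private RecordWriter<NullWritable, Text> writer;

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        // reuse the stock text output format
        return new TextOutputFormat<NullWritable, Text>();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        FileOutputFormat.setOutputPath(job, new Path(location));
    }

    @Override
    public void prepareToWrite(RecordWriter writer) throws IOException {
        this.writer = writer;
    }

    @Override
    public void putNext(Tuple t) throws IOException {
        // join the tuple's fields with a pipe; a database-backed
        // version would issue an insert here instead
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < t.size(); i++) {
            if (i > 0) {
                sb.append('|');
            }
            Object field = t.get(i);
            sb.append(field == null ? "" : field.toString());
        }
        try {
            writer.write(NullWritable.get(), new Text(sb.toString()));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

After registering the jar, it would be used just like the built-in storage, e.g. store b into '/tmp/dongliang/result/' using com.pigudf.MyCustomStorage();.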
