How to customize UDF for Apache Pig?

Source: Internet
Author: User

Recently, work required me to use Pig to analyze online search log data. I had originally intended to use Hive for the analysis, but for various reasons that was not an option, so I turned to Pig (pig0.12-cdh), which I had never touched before. It took about two days of reading the documentation on the official Pig website, plus some hands-on practice, to get comfortable with it. Of course, this article is not about how to quickly learn a framework or language; as the title says, it introduces how to write user-defined functions (UDFs) in Pig. I will share the rest of my learning experience in later articles.

Once you have learned how to write UDFs, you can use Pig much more flexibly and extend it with features tailored to your own business scenarios, features that plain Pig does not provide. For example:

Consider the format of the data you read from HDFS. The default PigStorage() loader only supports a limited set of encodings and types; if the data was written with a special encoding or serialization format, loading it with the default loader will fail. This is where a UDF comes in: we only need to write a custom load function (LoadFunc) and, if we also need to write data back, a custom store function (StoreFunc) to solve the problem.
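As a rough sketch (not from the original article), the class below shows what such a custom load function might look like: it extends org.apache.pig.LoadFunc, wraps Hadoop's TextInputFormat, and splits each line on a pipe character. The class name PipeDelimitedLoader and the pipe delimiter are made up for illustration.

Java code

package com.pigudf;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

/**
 * Illustrative sketch of a custom loader: reads plain text lines
 * and splits each line on a pipe character. Not from the original article.
 */
public class PipeDelimitedLoader extends LoadFunc {

    private RecordReader<?, ?> reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // Tell Hadoop where the input lives
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // Reuse the stock line-oriented input format
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null; // end of input
            }
            // Split the current line into fields and wrap them in a tuple
            Text line = (Text) reader.getCurrentValue();
            List<Object> fields = new ArrayList<Object>();
            for (String field : line.toString().split("\\|", -1)) {
                fields.add(field);
            }
            return tupleFactory.newTuple(fields);
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

Such a loader would then be used in a script as a = load '/tmp/udf/data.txt' using com.pigudf.PipeDelimitedLoader(); (the path and file name are again only examples).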

Based on the examples in the official documentation, this article demonstrates how to use a custom UDF with Pig on a Hadoop cluster.

First, let's look at the steps for defining a UDF extension class:

1. Create a Java project in Eclipse and import the Pig core jar. (Java project)
2. Create a new package, extend the appropriate interface or class, and implement the custom logic. (Core business logic)
3. After compiling, package the classes into a jar with Ant. (The Pig jar is needed at compile time, but it does not need to be bundled into the UDF jar.)
4. Upload the packaged jar to HDFS. (Pig loads and uses it at runtime.)
5. Register the UDF jar in the Pig script. (This injects it into the runtime environment.)
6. Write the Pig script that runs our core business logic. (Test whether it runs successfully.)

The core code is as follows:

package com.pigudf;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

/**
 * Custom UDF class that converts a string to uppercase.
 * @author qindongliang
 **/
public class MyUDF extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // Skip null or empty input
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            // Get the first field of the tuple
            String str = (String) input.get(0);
            // Return the result in uppercase
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row", e);
        }
    }
}
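Before packaging, the UDF can be sanity-checked without a cluster by calling it directly from a plain Java main method. This small test class is not part of the original project; it simply builds a tuple with Pig's TupleFactory and invokes exec():

Java code

package com.pigudf;

import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

/**
 * Quick local check of MyUDF without starting a cluster.
 * Purely illustrative; not part of the original project.
 */
public class MyUDFTest {

    public static void main(String[] args) throws Exception {
        // Build a one-field tuple the same way Pig would pass it to exec()
        Tuple t = TupleFactory.getInstance().newTuple(1);
        t.set(0, "zhang san");

        // Expected output: ZHANG SAN
        System.out.println(new MyUDF().exec(t));
    }
}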



The Ant script used for packaging will be attached at the end of this article. Next, let's look at some test data (note that the file must be uploaded to HDFS, unless you are running in local mode):

Java code
grunt> cat s.txt
Zhang san,12
Song,34
Long,34
AbC,12
grunt>




We can see that the data file and the jar package have been uploaded to HDFS:

Java code
grunt> ls
hdfs://dnode1:8020/tmp/udfs/pudf.jar <r 3> 1295
hdfs://dnode1:8020/tmp/udf/s.txt <r 3> 36
grunt>



Finally, let's take a look at the pig script definition:

Pig code
-- Register the custom jar package
REGISTER pudf.jar;
-- Load the test file, using a comma as the separator
a = load 's.txt' using PigStorage(',');
-- Iterate over the rows and convert the name column to uppercase
b = foreach a generate com.pigudf.MyUDF((chararray)$0);
-- Start the MapReduce job to analyze the data
dump b;


Finally, let's look at the results. As long as the run finishes with no exceptions or failed tasks, our UDF has been used successfully:

Java code
Counters:
Total records written: 4
Total bytes written: 64
Spillable Memory Manager spill count: 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1419419533357_0147
18:10:24,394 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
18:10:24,395 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
18:10:24,396 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
18:10:24,405 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process: 1
18:10:24,405 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process: 1
(Zhang san,12)
(SONG,34)
(LONG,34)
(ABC,12)


The result looks right; our UDF was loaded and executed successfully. If we want to write the output directly to HDFS instead of dumping it to the console, we can remove the dump command at the end of the Pig script and add store b into '/tmp/dongliang/result/'; to store the results on HDFS. Of course, we can also write a custom store function and write the results to a relational database, or to Lucene, HBase, or another NoSQL store.
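As a rough sketch (again, not from the original article), a custom store function follows the same pattern as a load function: extend org.apache.pig.StoreFunc and delegate to a Hadoop OutputFormat. The class name CommaTextStorer and the comma delimiter below are made up for illustration; writing to a database or NoSQL store would mean swapping TextOutputFormat for the corresponding OutputFormat.

Java code

package com.pigudf;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

/**
 * Illustrative sketch of a custom storer: writes each tuple as one
 * comma-separated text line. Not from the original article.
 */
public class CommaTextStorer extends StoreFunc {

    private RecordWriter<NullWritable, Text> writer;

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        // Reuse the stock line-oriented output format
        return new TextOutputFormat<NullWritable, Text>();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        // Tell Hadoop where the output should go
        FileOutputFormat.setOutputPath(job, new Path(location));
    }

    @Override
    @SuppressWarnings("unchecked")
    public void prepareToWrite(RecordWriter writer) throws IOException {
        this.writer = writer;
    }

    @Override
    public void putNext(Tuple tuple) throws IOException {
        // Join the tuple fields with commas and write one line per tuple
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < tuple.size(); i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(tuple.get(i) == null ? "" : tuple.get(i).toString());
        }
        try {
            writer.write(NullWritable.get(), new Text(line.toString()));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

Like the EvalFunc jar, such a storer would be registered with REGISTER and then used as store b into '/tmp/dongliang/result/' using com.pigudf.CommaTextStorer();.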
