This article is the last in the Pig System Analysis series. It discusses how to extend Pig's functionality: it introduces the UDF extension mechanism that Pig itself provides, and it also explores the possibilities for extending Pig at the architectural level.
Supplemental note: a few days ago a colleague found Spork, Twitter's Pig-on-Spark project, which I plan to study.
UDFs
With UDFs (user-defined functions), you can customize data processing and extend Pig's functionality. In fact, apart from having to be registered (and optionally aliased with DEFINE) before use, UDFs are no different from the built-in functions.
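For example, a minimal Pig Latin sketch of registering and invoking a UDF (the jar name, class name, and relation here are placeholders for illustration, not from Pig itself):

register myudfs.jar;
define MyABS com.example.pig.ABS();  -- alias the fully qualified class name
result = foreach data generate MyABS(x);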
The basic EvalFunc
Take the built-in ABS function for example:
public class ABS extends EvalFunc<Double> {
    /**
     * Java level API
     * @param input expects a single numeric value
     * @return output returns a single numeric value, the absolute value of the argument
     */
    public Double exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        Double d;
        try {
            d = DataType.toDouble(input.get(0));
        } catch (NumberFormatException nfe) {
            System.err.println("Failed to process input; error - " + nfe.getMessage());
            return null;
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
        return Math.abs(d);
    }
    ......
    public Schema outputSchema(Schema input);
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException;
}
The function extends the EvalFunc abstract class; the generic parameter (Double) specifies the return type.
exec method: takes a Tuple as its parameter, representing one input record.
outputSchema method: handles the input and output schemas.
getArgToFuncMapping method: supports overloading across different input data types.
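To make those two declarations concrete, here is a minimal sketch of a complete custom UDF (the Trim class and its behavior are illustrative inventions of mine, not part of Pig) that fills in the methods ABS only declares above:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.FuncSpec;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class Trim extends EvalFunc<String> {
    // Strip leading/trailing whitespace from a single chararray field.
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        return ((String) input.get(0)).trim();
    }

    // Declare that the output is a single chararray field.
    @Override
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY));
    }

    // Tell the front end which input schema maps to this implementation;
    // with several entries, this is how type-based overloading is resolved.
    @Override
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
        List<FuncSpec> funcList = new ArrayList<FuncSpec>();
        Schema s = new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY));
        funcList.add(new FuncSpec(this.getClass().getName(), s));
        return funcList;
    }
}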
Aggregate functions
An EvalFunc can also implement an aggregate function: since a group operation returns one record per group, each containing a bag, the exec method only needs to iterate over the tuples in that bag.
Take the COUNT function for example:
public Long exec(Tuple input) throws IOException {
    try {
        DataBag bag = (DataBag) input.get(0);
        if (bag == null)
            return null;
        Iterator it = bag.iterator();
        long cnt = 0;
        while (it.hasNext()) {
            Tuple t = (Tuple) it.next();
            if (t != null && t.size() > 0 && t.get(0) != null)
                cnt++;
        }
        return cnt;
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2106;
        String msg = "Error while computing count in " + this.getClass().getSimpleName();
        throw new ExecException(msg, errCode, PigException.BUG, e);
    }
}
Algebraic and Accumulator interfaces
As mentioned earlier, aggregate functions with algebraic properties can be combiner-optimized during the MapReduce process. Intuitively, an algebraic computation can be divided into three stages: initial (initialization, processing part of the input data), intermediate (the middle stage, processing the results of the initial stage), and final (the closing stage, processing the results of the intermediate stage). For the COUNT function, the initial stage performs the counting, while the intermediate and final stages perform summation. Going further, if a function performs the same operation in all three stages, it is distributive, e.g. the SUM function.
Pig provides the Algebraic interface:
public interface Algebraic {
    /**
     * Get the initial function.
     * @return A function name of f_init. f_init should be an eval func.
     * The return type of f_init.exec() has to be Tuple.
     */
    public String getInitial();

    /**
     * Get the intermediate function.
     * @return A function name of f_intermed. f_intermed should be an eval func.
     * The return type of f_intermed.exec() has to be Tuple.
     */
    public String getIntermed();

    /**
     * Get the final function.
     * @return A function name of f_final. f_final should be an eval func
     * parametrized by the same datum as the EvalFunc implementing this interface.
     */
    public String getFinal();
}
Each of these methods returns the name of an EvalFunc implementation class. Continuing with COUNT, which implements the Algebraic interface, consider the following statements:
input = load 'data' as (x, y);
grpd = group input by x;
cnt = foreach grpd generate group, COUNT(input);
store cnt into 'result';
Pig will rewrite the MapReduce execution plan as:
Map
load, foreach(group, COUNT.Initial)
Combine
foreach(group, COUNT.Intermediate)
Reduce
foreach(group, COUNT.Final), store
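Concretely, the three stage functions are nested EvalFunc classes inside the COUNT class, and the interface methods simply return their class names. Below is a simplified sketch along the lines of Pig's builtin COUNT (reconstructed from memory, with the per-tuple null checks trimmed; see org.apache.pig.builtin.COUNT for the authoritative code):

public String getInitial()  { return Initial.class.getName(); }
public String getIntermed() { return Intermediate.class.getName(); }
public String getFinal()    { return Final.class.getName(); }

// Initial: runs map-side; emits a partial count (0 or 1) for one input tuple.
static public class Initial extends EvalFunc<Tuple> {
    public Tuple exec(Tuple input) throws IOException {
        try {
            DataBag bag = (DataBag) input.get(0);
            long cnt = bag.iterator().hasNext() ? 1L : 0L;
            return TupleFactory.getInstance().newTuple(Long.valueOf(cnt));
        } catch (Exception e) {
            throw new IOException("Caught exception in COUNT.Initial", e);
        }
    }
}

// Intermediate: runs in the combiner; sums partial counts into another Tuple.
static public class Intermediate extends EvalFunc<Tuple> {
    public Tuple exec(Tuple input) throws IOException {
        return TupleFactory.getInstance().newTuple(sum(input));
    }
}

// Final: runs reduce-side; sums the partial counts and returns the Long result.
static public class Final extends EvalFunc<Long> {
    public Long exec(Tuple input) throws IOException {
        return sum(input);
    }
}

// Shared helper: adds up the long partial counts contained in the bag.
static protected Long sum(Tuple input) throws ExecException {
    DataBag values = (DataBag) input.get(0);
    long sum = 0;
    for (Tuple t : values) {
        sum += (Long) t.get(0);
    }
    return sum;
}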
While the Algebraic interface reduces the amount of data transferred by means of the combiner, the Accumulator interface focuses on memory usage. Once a UDF implements the Accumulator interface, Pig guarantees that the data for each key (arriving via shuffle) is passed to the UDF incrementally, in batches (pig.accumulative.batchsize, 20000 by default). Similarly, COUNT also implements the Accumulator interface.
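For reference, the Accumulator interface itself is small; this sketch is from memory, so check org.apache.pig.Accumulator for the authoritative definition:

public interface Accumulator<T> {
    // Receives one batch of tuples for the current key.
    public void accumulate(Tuple b) throws IOException;
    // Called once all tuples for the current key have been passed in.
    public T getValue();
    // Called after getValue(); resets state for the next key.
    public void cleanup();
}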
/* Accumulator interface implementation */
private long intermediateCount = 0L;

@Override
public void accumulate(Tuple b) throws IOException {
    try {
        DataBag bag = (DataBag) b.get(0);
        Iterator it = bag.iterator();
        while (it.hasNext()) {
            Tuple t = (Tuple) it.next();
            if (t != null && t.size() > 0 && t.get(0) != null) {
                intermediateCount += 1;
            }
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2106;
        String msg = "Error while computing count in " + this.getClass().getSimpleName();
        throw new ExecException(msg, errCode, PigException.BUG, e);
    }
}

@Override
public void cleanup() {
    intermediateCount = 0L;
}

/* Called after all data for the current key has been processed. */
@Override
public Long getValue() {
    return intermediateCount;
}