Pig System Analysis (8): Pig Extensibility

Source: Internet
Author: User

This article is the last in the Pig System Analysis series. It discusses how to extend Pig's functionality, covering both the UDF extension mechanism that Pig itself provides and the possibilities for extending Pig at the architectural level.

Supplemental note: a few days ago a colleague came across Spork, a Twitter-led project for running Pig on Spark. I plan to study it.

UDFs

With UDFs (user-defined functions), you can customize how data is processed and extend Pig's functionality. In fact, apart from having to be registered/defined before use, UDFs are no different from the built-in functions.

The basic EvalFunc

Take the built-in ABS function for example:

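import java.io.IOException;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.FuncSpec;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class ABS extends EvalFunc<Double> {
    /**
     * Java level API
     * @param input expects a single numeric value
     * @return output returns a single numeric value, absolute value of the argument
     */
    public Double exec(Tuple input) throws IOException {
        // An empty row produces no result
        if (input == null || input.size() == 0)
            return null;

        Double d;
        try {
            // Coerce the first field to a double
            d = DataType.toDouble(input.get(0));
        } catch (NumberFormatException nfe) {
            System.err.println("Failed to process input; error - " + nfe.getMessage());
            return null;
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
        return Math.abs(d);
    }

    // ..... (method bodies elided; signatures only)
    public Schema outputSchema(Schema input);
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException;
}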

The function extends the EvalFunc abstract class, and the generic parameter Double is the return type.

exec method: the input parameter is a Tuple, which represents one row (record).

outputSchema method: describes the output schema produced for a given input schema.

getArgToFuncMapping: used to support overloading for different data types.
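
To see the exec contract on a few sample rows without a Pig installation, here is a minimal sketch in plain Java. It is not Pig code: a List<Object> stands in for Tuple, the class name AbsSketch is made up for illustration, and the coercion step is only an assumed approximation of what DataType.toDouble does. Returning null for a null field is likewise an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-in sketch of the exec contract; List<Object> plays the role of Pig's Tuple. */
final class AbsSketch {

    /** Loosely mirrors ABS: null or empty input yields null, otherwise |x|. */
    static Double exec(List<Object> input) {
        if (input == null || input.isEmpty()) {
            return null;
        }
        Object field = input.get(0);
        if (field == null) {
            return null; // assumption of this sketch: a null field stays null
        }
        try {
            // Hypothetical coercion, standing in for DataType.toDouble
            double d = (field instanceof Number)
                    ? ((Number) field).doubleValue()
                    : Double.parseDouble(field.toString());
            return Math.abs(d);
        } catch (NumberFormatException nfe) {
            return null; // unparseable field: give up on this row
        }
    }

    public static void main(String[] args) {
        System.out.println(exec(Arrays.<Object>asList(-3)));     // numeric field
        System.out.println(exec(Arrays.<Object>asList("-2.5"))); // string that parses
        System.out.println(exec(Arrays.<Object>asList("abc")));  // string that does not
    }
}
```

The point of the sketch is the shape of the contract: one row in, one value (or null) out, with conversion failures handled per row rather than aborting the whole job.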

Aggregate functions

An EvalFunc can also implement an aggregate function. The group operation returns one record per group, and each record contains a bag holding that group's tuples, so the exec method simply iterates over the records in the bag.

Take the COUNT function for example:

public Long exec(Tuple input) throws IOException {
    try {
        // The first field of the grouped record is the bag of tuples
        DataBag bag = (DataBag) input.get(0);
        if (bag == null)
            return null;
        Iterator it = bag.iterator();
        long cnt = 0;
        while (it.hasNext()) {
            Tuple t = (Tuple) it.next();
            // Count only non-empty tuples whose first field is not null
            if (t != null && t.size() > 0 && t.get(0) != null)
                cnt++;
        }
        return cnt;
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2106;
        String msg = "Error while computing count in " + this.getClass().getSimpleName();
        throw new ExecException(msg, errCode, PigException.BUG, e);
    }
}
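
The relationship between group and the bag that exec receives can be sketched in plain Java. This is only an illustration under stated assumptions: Object[] stands in for Tuple, a List of rows stands in for DataBag, and the names GroupCountSketch, group and count are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-in sketch: grouping yields one bag per key, and the aggregate walks that bag. */
final class GroupCountSketch {

    /** Stand-in for "group rows by the first field": one bag of rows per distinct key. */
    static Map<Object, List<Object[]>> group(List<Object[]> rows) {
        Map<Object, List<Object[]>> bags = new LinkedHashMap<>();
        for (Object[] row : rows) {
            bags.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        return bags;
    }

    /** Same rule as the COUNT listing: count non-empty rows whose first field is not null. */
    static Long count(List<Object[]> bag) {
        if (bag == null) {
            return null;
        }
        long cnt = 0;
        for (Object[] t : bag) {
            if (t != null && t.length > 0 && t[0] != null) {
                cnt++;
            }
        }
        return cnt;
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
                new Object[] {"a", 1},
                new Object[] {"b", 2},
                new Object[] {"a", 3});
        // One output record per group: (key, count of the group's bag)
        for (Map.Entry<Object, List<Object[]>> e : group(rows).entrySet()) {
            System.out.println(e.getKey() + " -> " + count(e.getValue()));
        }
    }
}
```

With the three sample rows, the "a" group's bag holds two rows and the "b" group's bag holds one, so the aggregate is called once per group, not once per input row.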

Algebraic and Accumulator interfaces

As mentioned earlier, aggregate functions with algebraic properties can be optimized with a combiner during the map-reduce process. Intuitively, an algebraic function can be computed in three stages: initial (processes part of the input data), intermediate (processes the results of the initial stage), and final (processes the results of the intermediate stage). For the COUNT function, for example, the initial stage is a count operation, while the intermediate and final stages are sum operations. Furthermore, if a function performs the same operation in all three stages, it is distributive; SUM is an example.
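
The three-stage decomposition of COUNT can be sketched in plain Java. This is an illustration only, not Pig's implementation: chunks of strings stand in for the input seen by each map task, the pairing of every two chunks under one combiner is an arbitrary assumption, and all names (AlgebraicCountSketch, initial, intermediate, finalStage) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-in sketch of an algebraic COUNT: count first, then sum partial counts twice. */
final class AlgebraicCountSketch {

    /** Initial stage (map side): count the non-null values in one chunk of raw input. */
    static long initial(List<String> chunk) {
        long n = 0;
        for (String v : chunk) {
            if (v != null) {
                n++;
            }
        }
        return n;
    }

    /** Intermediate stage (combiner): sum partial counts. */
    static long intermediate(List<Long> partials) {
        long sum = 0;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    /** Final stage (reducer): for COUNT this is the same sum as the intermediate stage. */
    static long finalStage(List<Long> partials) {
        return intermediate(partials);
    }

    /** Runs the three stages; assumes every two chunks share one combiner. */
    static long count(List<List<String>> chunks) {
        List<Long> combined = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i += 2) {
            List<Long> partials = new ArrayList<>();
            for (int j = i; j < Math.min(i + 2, chunks.size()); j++) {
                partials.add(initial(chunks.get(j)));
            }
            combined.add(intermediate(partials));
        }
        return finalStage(combined);
    }

    public static void main(String[] args) {
        List<List<String>> chunks = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", null, "e"));
        System.out.println(count(chunks)); // same as counting all non-null values directly
    }
}
```

Because the initial stage (count) differs from the later stages (sum), COUNT is algebraic but not distributive. A SUM written the same way would use the same summing operation in all three stages.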

Pig provides the Algebraic interface:

public interface Algebraic {
    /**
     * Get the initial function.
     * @return A function name of f_init. f_init should be an eval func.
     * The return type of f_init.exec() has to be Tuple
     */
    public String getInitial();

    /**
     * Get the intermediate function.
     * @return A function name of f_intermed. f_intermed should be an eval func.
     * The return type of f_intermed.exec() has to be Tuple
     */
    public String getIntermed();

    /**
     * Get the final function.
     * @return A function name of f_final. f_final should be an eval func parametrized by
     * the same datum as the eval func implementing this interface.
     */
    public String getFinal();
}

Each of these methods returns the class name of an EvalFunc implementation. Continuing with COUNT as the example: COUNT implements the Algebraic interface, so for the following statements:

input = load 'data' as (x, y);
grpd = group input by x;
cnt = foreach grpd generate group, COUNT(input);
store cnt into 'result';

Pig rewrites the MapReduce execution plan as follows:

Map
    load, foreach (group, COUNT.Initial)
Combine
    foreach (group, COUNT.Intermediate)
Reduce
    foreach (group, COUNT.Final), store

The Algebraic interface reduces the amount of data transferred by using the combiner, while the Accumulator interface focuses on memory usage. When a UDF implements the Accumulator interface, Pig guarantees that all data for a given key (after the shuffle) is passed to the UDF incrementally, in batches (by default, pig.accumulative.batchsize=20000). COUNT implements the Accumulator interface as well:

/* Accumulator interface implementation */
private long intermediateCount = 0L;

@Override
public void accumulate(Tuple b) throws IOException {
    try {
        // Each call receives one batch of the current key's tuples
        DataBag bag = (DataBag) b.get(0);
        Iterator it = bag.iterator();
        while (it.hasNext()) {
            Tuple t = (Tuple) it.next();
            if (t != null && t.size() > 0 && t.get(0) != null) {
                intermediateCount += 1;
            }
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2106;
        String msg = "Error while computing min in " + this.getClass().getSimpleName();
        throw new ExecException(msg, errCode, PigException.BUG, e);
    }
}

@Override
public void cleanup() {
    intermediateCount = 0L;
}

@Override
/* Called after all data for the current key has been processed */
public Long getValue() {
    return intermediateCount;
}
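
The calling sequence implied by the Accumulator contract can be sketched in plain Java. This is a simplified model, not Pig's runtime: the Acc interface, the runForKey driver and the class name AccumulatorSketch are hypothetical stand-ins, and the batch size is passed as a parameter instead of being read from pig.accumulative.batchsize.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-in sketch: one key's values are fed in batches, then the result is read and reset. */
final class AccumulatorSketch {

    /** Hypothetical stand-in for the accumulate/getValue/cleanup contract. */
    interface Acc<T> {
        void accumulate(List<Object> batch);
        T getValue();
        void cleanup();
    }

    /** COUNT-like accumulator: keeps only a running total, never the whole bag. */
    static final class CountAcc implements Acc<Long> {
        private long intermediateCount = 0L;

        public void accumulate(List<Object> batch) {
            for (Object v : batch) {
                if (v != null) {
                    intermediateCount += 1;
                }
            }
        }

        public Long getValue() {
            return intermediateCount;
        }

        public void cleanup() {
            intermediateCount = 0L;
        }
    }

    /** Feeds one key's values in batches of batchSize, then reads and resets the state. */
    static <T> T runForKey(Acc<T> acc, List<Object> values, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        for (int i = 0; i < values.size(); i += batchSize) {
            acc.accumulate(values.subList(i, Math.min(i + batchSize, values.size())));
        }
        T result = acc.getValue(); // read once the key's data is exhausted
        acc.cleanup();             // reset state before the next key
        return result;
    }

    public static void main(String[] args) {
        CountAcc acc = new CountAcc();
        System.out.println(runForKey(acc, Arrays.<Object>asList(1, 2, 3, 4, 5), 2));
        System.out.println(runForKey(acc, Arrays.<Object>asList("x", null, "y"), 2));
    }
}
```

Only one batch is in memory at a time, which is the memory benefit the interface is after. The second call in main reuses the same accumulator for another key, which only gives a correct result because cleanup reset the running total.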
