Pig custom filtering UDF and loading UDF


Pig is a data-flow programming language. A Pig script consists of a series of operations and transformations, each of which processes its input and produces output, so the script as a whole describes a data flow. Pig's execution environment translates that data flow into an executable internal representation: inside Pig, the transformations are compiled into a series of MapReduce jobs.

Pig provides many built-in functions, but sometimes we need a custom processing function, i.e. a user-defined function (UDF).

The procedure for writing and using a UDF is as follows:

Step 1: Inherit from the appropriate base class (eval, filter, load, or store), override its implementation method, and package the compiled class into a jar file, for example example.jar.

Step 2: In Pig's grunt shell, use REGISTER to register the packaged jar with Pig. When you enter grunt, the current local path is the directory from which you started Pig, so add the path of the packaged file if it is elsewhere, for example: REGISTER example.jar;

Step 3: Use the UDF directly, qualified with its full package name. If the package structure of example.jar puts the class at com.whut.FilterFunct, the call is com.whut.FilterFunct(parameters). Note that the class name serves as the function name and is case sensitive.

Step 4 (optional): Define an alias for your UDF so that you no longer need the package name when calling it, as shown below:

DEFINE Goog com.whut.FilterFunct(); After this, Goog can be used directly. A complete grunt session covering the four steps is sketched below.
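A minimal sketch of such a session, assuming the jar name example.jar and the class com.whut.FilterFunct from the steps above (the relation and field names are hypothetical):

grunt> REGISTER example.jar;
grunt> DEFINE Goog com.whut.FilterFunct();
grunt> records = LOAD 'input/tempdata' AS (temperature:int, quality:int);
grunt> good_records = FILTER records BY Goog(quality);
grunt> DUMP good_records;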

Custom filtering UDF:

A custom filter UDF must inherit FilterFunc and override the exec method, which returns a Boolean. When collecting temperature statistics, a filter UDF can be used to keep only the records whose temperature reading is valid.


package whut;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

// Pig custom filter function:
// deletes records whose quality code does not meet the requirements
public class IsGoodQuality extends FilterFunc {
    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() == 0)
            return false;
        try {
            Object obj = tuple.get(0);
            if (obj == null)
                return false;
            // convert the quality code to an Integer
            int i = (Integer) obj;
            return i == 0 || i == 1 || i == 2 || i == 3;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}

The parameter here is a tuple that may contain multiple input fields; inside the method each field is retrieved directly with get(index).
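A sketch of calling this filter from grunt, assuming the class above is packaged into example.jar and the quality code is the third field of the input (the relation and field names are hypothetical):

grunt> REGISTER example.jar;
grunt> records = LOAD 'input/tempdata' AS (year:chararray, temperature:int, quality:int);
grunt> good = FILTER records BY whut.IsGoodQuality(quality);
grunt> DUMP good;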

Custom loading UDF:

In Pig, data is usually loaded from external files with LOAD, for example: LOAD 'input/tempdata' AS (a:chararray, b:int); By default, the built-in PigStorage function does the loading, so this is equivalent to:

LOAD 'input/tempdata' USING PigStorage() AS (a:chararray, b:int); Here PigStorage splits each line on its field delimiter, which is a tab character by default; you can also pass a delimiter of your own. Sometimes each line is a single string and you need to write your own function to carve the fields out of it. The following data is an example.
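For instance, to load comma-separated data, the delimiter is passed as a constructor argument to PigStorage (standard Pig syntax; the file name here is hypothetical):

grunt> records = LOAD 'input/data.csv' USING PigStorage(',') AS (a:chararray, b:int);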

aaaaa1990aaaaaa0039a
bbbbb1991bbbbbb0045a
ccccc1992cccccc0011c
ddddd1993dddddd0043d
eeeee1994eeeeee0047e
aaaaa1990aaaaaa0037a
bbbbb1991bbbbbb0027a
ccccc1992cccccc0032c
ddddd1993dddddd0090d
eeeee1994eeeeee0091e
aaaaa1980aaaaaa0041a
bbbbb1981bbbbbb0050a
ccccc1992cccccc0020c
ddddd1993dddddd0033d
eeeee1984eeeeee0061e
aaaaa1980aaaaaa0054a
bbbbb1991bbbbbb0075a
ccccc1982cccccc0011c
ddddd1993dddddd0003d
eeeee1974eeeeee0041e
aaaaa1990aaaaaa0039a
bbbbb1961bbbbbb0041a
ccccc1972cccccc0070c
ddddd1993dddddd0042d
eeeee1974eeeeee0043e
aaaaa1990aaaaaa0034a
bbbbb1971bbbbbb0025a
ccccc1992cccccc0056c
ddddd1993dddddd0037d
eeeee1984eeeeee0038e
aaaaa1990aaaaaa0049a
bbbbb1991bbbbbb0011a
ccccc1962cccccc0012c
ddddd1993dddddd0023d
eeeee1984eeeeee0031e
aaaaa1980aaaaaa0094a
bbbbb1971bbbbbb0045a
ccccc1992cccccc0041c
ddddd1993dddddd0003d
eeeee1984eeeeee0081e
aaaaa1960aaaaaa0099a
bbbbb1971bbbbbb0050a
ccccc1952cccccc0055c
ddddd1963dddddd0043d
eeeee1994eeeeee0041e
aaaaa1990aaaaaa0031a
bbbbb1991bbbbbb0020a
ccccc1952cccccc0030c
ddddd1983dddddd0013d
eeeee1974eeeeee0061e
aaaaa1980aaaaaa0071a
bbbbb1961bbbbbb0060a
ccccc1992cccccc0080c
ddddd1953dddddd0033d
eeeee1964eeeeee0051e
aaaaa1960aaaaaa0024a
bbbbb1951bbbbbb0035a
ccccc1952cccccc0048c
ddddd1953dddddd0053d
eeeee1954eeeeee0048e

To pull the year and the temperature out of such lines, you need to define the load function yourself. Column (character) positions in each line are numbered starting from 0. A custom load function must inherit LoadFunc. The code is as follows.

package whut;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Holds the start and end character positions of one field in a line.
// Column (character) indexes start at 0.
class Range {
    private int start;
    private int end;

    // The pattern string must have the format "2~3,5~6".
    public static List<Range> parse(String cutStr) throws Exception {
        List<Range> rangeList = new ArrayList<Range>();
        // first check that the format is correct
        boolean state = cutStr.matches("\\d+~\\d+(,\\d+~\\d+)*");
        if (!state) {
            throw new Exception("InputFormat Error:\n"
                    + "Usage: number~number,number~number; such as 2~7,10~19");
        }
        // extract the start and end positions of each field, e.g. 2~8
        String[] splits = cutStr.split(",");
        for (int i = 0; i < splits.length; i++) {
            Range range = new Range();
            String sub = splits[i];
            String[] subSplits = sub.split("~");
            int subStart = Integer.parseInt(subSplits[0]);
            int subEnd = Integer.parseInt(subSplits[1]);
            if (subStart > subEnd)
                throw new Exception("InputFormat Error:\n"
                        + "Detail: first number must be less than second number");
            range.setStart(subStart);
            range.setEnd(subEnd);
            rangeList.add(range);
        }
        return rangeList;
    }

    public int getStart() { return start; }
    public void setStart(int start) { this.start = start; }
    public int getEnd() { return end; }
    public void setEnd(int end) { this.end = end; }

    public String getSubString(String inStr) {
        return inStr.substring(start, end);
    }
}

// Custom load function: extracts the year and the temperature
// from each line of the input.
public class LineLoadFunc extends LoadFunc {
    private static final Log LOG = LogFactory.getLog(LineLoadFunc.class);
    // factory used to create the output tuples
    private final TupleFactory tupleFactory = TupleFactory.getInstance();
    // reader that supplies the input records
    private RecordReader reader;
    // the set of field positions to extract
    private List<Range> ranges;

    // the constructor argument describes the column positions
    public LineLoadFunc(String cutPattern) throws Exception {
        ranges = Range.parse(cutPattern);
    }

    // set the location of the input file(s)
    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    // set the InputFormat used to load the input;
    // a RecordReader is created for each split
    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue())
                return null;
            // with TextInputFormat the key is a LongWritable and the value a Text
            Text value = (Text) reader.getCurrentValue();
            String line = value.toString();
            // build one tuple with one field per configured range
            Tuple tuple = tupleFactory.newTuple(ranges.size());
            for (int i = 0; i < ranges.size(); i++) {
                Range range = ranges.get(i);
                if (range.getEnd() > line.length()) {
                    throw new ExecException("InputFormat: Error\n"
                            + "field length more than total length");
                }
                // field values must be constructed as DataByteArray
                tuple.set(i, new DataByteArray(range.getSubString(line)));
            }
            return tuple;
        } catch (InterruptedException e) {
            throw new ExecException(e);
        }
    }
}


Its specific use in the grunt shell is as follows.
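A minimal sketch, assuming the classes above are packaged into example.jar: in the sample data each 20-character line holds the year at character positions 5~9 and the temperature at positions 15~19, so that range string is passed to the constructor.

grunt> REGISTER example.jar;
grunt> records = LOAD 'input/tempdata' USING whut.LineLoadFunc('5~9,15~19') AS (year:chararray, temperature:int);
grunt> DUMP records;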


This article is from the "dream in the cloud" blog; please be sure to keep this source: http://computerdragon.blog.51cto.com/6235984/1288228
