Pig custom filtering UDF and loading UDF


Pig is a data-flow programming language. A Pig script consists of a series of operations and transformations, each of which processes its input and produces output, so the script as a whole describes a data flow. Pig's execution environment translates that data flow into an executable internal representation: inside Pig, the transformations are compiled into a series of MapReduce jobs.

Pig provides many built-in functions, but sometimes we need a custom processing function, i.e. a user-defined function (UDF).

The procedure for writing and using a UDF is as follows:

Step 1: Inherit from the appropriate base class (eval, filter, load, or store), override its implementation method, and package the compiled class into a jar file, for example example.jar.

Step 2: In Pig's grunt shell, use REGISTER to register the packaged jar with Pig. When you enter grunt, the current local path is the directory from which you started Pig, so add the path of the packaged file if it is elsewhere, for example: REGISTER example.jar;

Step 3: Use the UDF directly, qualified with its full package name. If the package structure of example.jar puts the class at com.whut.FilterFunct, the call is com.whut.FilterFunct(parameters). Note that the class name serves as the function name and is case sensitive.

Step 4 (optional): Define an alias for your UDF so that you no longer need the package name when calling it, as shown below:

DEFINE Goog com.whut.FilterFunct(); After this, Goog can be used directly. A complete grunt session covering the four steps is sketched below.
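A minimal sketch of such a session, assuming the jar name example.jar and the class com.whut.FilterFunct from the steps above (the relation and field names are hypothetical):

grunt> REGISTER example.jar;
grunt> DEFINE Goog com.whut.FilterFunct();
grunt> records = LOAD 'input/tempdata' AS (temperature:int, quality:int);
grunt> good_records = FILTER records BY Goog(quality);
grunt> DUMP good_records;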

Custom filtering UDF:

A custom filter UDF must inherit FilterFunc and override the exec method, which returns a Boolean. When collecting temperature statistics, a filter UDF can be used to keep only the records whose temperature reading is valid.


package whut;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

// Pig custom filter function:
// deletes records whose quality code does not meet the requirements
public class IsGoodQuality extends FilterFunc {
    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() == 0)
            return false;
        try {
            Object obj = tuple.get(0);
            if (obj == null)
                return false;
            // convert the quality code to an Integer
            int i = (Integer) obj;
            return i == 0 || i == 1 || i == 2 || i == 3;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}

The parameter here is a tuple that may contain multiple input fields; inside the method each field is retrieved directly with get(index).
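A sketch of calling this filter from grunt, assuming the class above is packaged into example.jar and the quality code is the third field of the input (the relation and field names are hypothetical):

grunt> REGISTER example.jar;
grunt> records = LOAD 'input/tempdata' AS (year:chararray, temperature:int, quality:int);
grunt> good = FILTER records BY whut.IsGoodQuality(quality);
grunt> DUMP good;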

Custom loading UDF:

In Pig, data is usually loaded from external files with LOAD, for example: LOAD 'input/tempdata' AS (a:chararray, b:int); By default, the built-in PigStorage function does the loading, so this is equivalent to:

LOAD 'input/tempdata' USING PigStorage() AS (a:chararray, b:int); Here PigStorage splits each line on its field delimiter, which is a tab character by default; you can also pass a delimiter of your own. Sometimes each line is a single string and you need to write your own function to carve the fields out of it. The following data is an example.
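For instance, to load comma-separated data, the delimiter is passed as a constructor argument to PigStorage (standard Pig syntax; the file name here is hypothetical):

grunt> records = LOAD 'input/data.csv' USING PigStorage(',') AS (a:chararray, b:int);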

aaaaa1990aaaaaa0039a
bbbbb1991bbbbbb0045a
ccccc1992cccccc0011c
ddddd1993dddddd0043d
eeeee1994eeeeee0047e
aaaaa1990aaaaaa0037a
bbbbb1991bbbbbb0027a
ccccc1992cccccc0032c
ddddd1993dddddd0090d
eeeee1994eeeeee0091e
aaaaa1980aaaaaa0041a
bbbbb1981bbbbbb0050a
ccccc1992cccccc0020c
ddddd1993dddddd0033d
eeeee1984eeeeee0061e
aaaaa1980aaaaaa0054a
bbbbb1991bbbbbb0075a
ccccc1982cccccc0011c
ddddd1993dddddd0003d
eeeee1974eeeeee0041e
aaaaa1990aaaaaa0039a
bbbbb1961bbbbbb0041a
ccccc1972cccccc0070c
ddddd1993dddddd0042d
eeeee1974eeeeee0043e
aaaaa1990aaaaaa0034a
bbbbb1971bbbbbb0025a
ccccc1992cccccc0056c
ddddd1993dddddd0037d
eeeee1984eeeeee0038e
aaaaa1990aaaaaa0049a
bbbbb1991bbbbbb0011a
ccccc1962cccccc0012c
ddddd1993dddddd0023d
eeeee1984eeeeee0031e
aaaaa1980aaaaaa0094a
bbbbb1971bbbbbb0045a
ccccc1992cccccc0041c
ddddd1993dddddd0003d
eeeee1984eeeeee0081e
aaaaa1960aaaaaa0099a
bbbbb1971bbbbbb0050a
ccccc1952cccccc0055c
ddddd1963dddddd0043d
eeeee1994eeeeee0041e
aaaaa1990aaaaaa0031a
bbbbb1991bbbbbb0020a
ccccc1952cccccc0030c
ddddd1983dddddd0013d
eeeee1974eeeeee0061e
aaaaa1980aaaaaa0071a
bbbbb1961bbbbbb0060a
ccccc1992cccccc0080c
ddddd1953dddddd0033d
eeeee1964eeeeee0051e
aaaaa1960aaaaaa0024a
bbbbb1951bbbbbb0035a
ccccc1952cccccc0048c
ddddd1953dddddd0053d
eeeee1954eeeeee0048e

To pull the year and the temperature out of such lines, you need to define the load function yourself. Column (character) positions in each line are numbered starting from 0. A custom load function must inherit LoadFunc. The code is as follows.

package whut;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Holds the start and end character positions of one field in a line.
// Column (character) indexes start at 0.
class Range {
    private int start;
    private int end;

    // The pattern string must have the format "2~3,5~6".
    public static List<Range> parse(String cutStr) throws Exception {
        List<Range> rangeList = new ArrayList<Range>();
        // first check that the format is correct
        boolean state = cutStr.matches("\\d+~\\d+(,\\d+~\\d+)*");
        if (!state) {
            throw new Exception("InputFormat Error:\n"
                    + "Usage: number~number,number~number; such as 2~7,10~19");
        }
        // extract the start and end positions of each field, e.g. 2~8
        String[] splits = cutStr.split(",");
        for (int i = 0; i < splits.length; i++) {
            Range range = new Range();
            String sub = splits[i];
            String[] subSplits = sub.split("~");
            int subStart = Integer.parseInt(subSplits[0]);
            int subEnd = Integer.parseInt(subSplits[1]);
            if (subStart > subEnd)
                throw new Exception("InputFormat Error:\n"
                        + "Detail: first number must be less than second number");
            range.setStart(subStart);
            range.setEnd(subEnd);
            rangeList.add(range);
        }
        return rangeList;
    }

    public int getStart() { return start; }
    public void setStart(int start) { this.start = start; }
    public int getEnd() { return end; }
    public void setEnd(int end) { this.end = end; }

    public String getSubString(String inStr) {
        return inStr.substring(start, end);
    }
}

// Custom load function: extracts the year and the temperature
// from each line of the input.
public class LineLoadFunc extends LoadFunc {
    private static final Log LOG = LogFactory.getLog(LineLoadFunc.class);
    // factory used to create the output tuples
    private final TupleFactory tupleFactory = TupleFactory.getInstance();
    // reader that supplies the input records
    private RecordReader reader;
    // the set of field positions to extract
    private List<Range> ranges;

    // the constructor argument describes the column positions
    public LineLoadFunc(String cutPattern) throws Exception {
        ranges = Range.parse(cutPattern);
    }

    // set the location of the input file(s)
    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    // set the InputFormat used to load the input;
    // a RecordReader is created for each split
    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue())
                return null;
            // with TextInputFormat the key is a LongWritable and the value a Text
            Text value = (Text) reader.getCurrentValue();
            String line = value.toString();
            // build one tuple with one field per configured range
            Tuple tuple = tupleFactory.newTuple(ranges.size());
            for (int i = 0; i < ranges.size(); i++) {
                Range range = ranges.get(i);
                if (range.getEnd() > line.length()) {
                    throw new ExecException("InputFormat: Error\n"
                            + "field length more than total length");
                }
                // field values must be constructed as DataByteArray
                tuple.set(i, new DataByteArray(range.getSubString(line)));
            }
            return tuple;
        } catch (InterruptedException e) {
            throw new ExecException(e);
        }
    }
}


Its specific use in the grunt shell is as follows.
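A minimal sketch, assuming the classes above are packaged into example.jar: in the sample data each 20-character line holds the year at character positions 5~9 and the temperature at positions 15~19, so that range string is passed to the constructor.

grunt> REGISTER example.jar;
grunt> records = LOAD 'input/tempdata' USING whut.LineLoadFunc('5~9,15~19') AS (year:chararray, temperature:int);
grunt> DUMP records;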


This article is from the "dream in the cloud" blog; please be sure to keep this source: http://computerdragon.blog.51cto.com/6235984/1288228
