A pitfall encountered when upgrading Pig to 0.13.0


Background: We were running Pig 0.12, and the community's 0.13.0 release had been out for quite a while, with many new patches and features. One of those features is a jar cache, controlled by the pig.user.cache.enabled parameter, which can speed up Pig job startup by avoiding repeated jar uploads. For the details, see:

https://issues.apache.org/jira/browse/PIG-3954

User Jar Cache: Jars required for user defined functions (UDFs) are copied to the distributed cache by Pig to make them available on task nodes. To put these jars in the distributed cache, Pig clients copy the jars to HDFS under a temporary location. For scheduled jobs, these jars do not change frequently. Also, creating a lot of small jar files on HDFS is not HDFS friendly. To avoid copying these small jar files to HDFS again and again, Pig allows users to configure a user-level jar cache (readable only to the user for security reasons). If the pig.user.cache.enabled flag is set to true, UDF jars are copied to the jar cache location (configurable) under a directory named with the hash (SHA) of the jar. The hash of the jar is used to identify the existence of the jar in subsequent uses of the jar by the user. If a jar with the same hash and filename is found in the cache, it is used, avoiding a copy of the jar to HDFS.
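As a rough illustration of the cache-key idea described above (a sketch only; the exact hash algorithm and directory layout are whatever Pig implements internally), a content hash of a jar can be computed like this, so an unchanged jar always maps to the same cache directory and never needs to be re-uploaded:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class JarHashSketch {

    // Compute a SHA-1 hex digest of a jar's contents; identical bytes always
    // yield the same digest, which is what lets a cache reuse an uploaded jar.
    static String sha1Of(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        try (InputStream in = new FileInputStream(path)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                digest.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // "my-udfs.jar" is a hypothetical jar name used only for illustration.
        System.out.println(sha1Of("my-udfs.jar"));
    }
}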

You can set the following properties to configure the jar cache:

· pig.user.cache.enabled - turns the user jar cache feature on/off (false by default).

· pig.user.cache.location - path on HDFS that will be used as a staging directory for the user jar cache (defaults to pig.temp.dir or /tmp).

The user jar cache feature is fail-safe: if jars cannot be copied to the jar cache due to any permission or configuration problems, Pig falls back to the old behavior.
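For illustration, here is a minimal sketch of enabling the cache from an embedded Java client. The property names come from the JIRA above; the cache path /user/pigcache and the jar name are placeholders of mine, and the same properties can just as well go into pig.properties or be passed on the command line.

import java.util.Properties;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class JarCacheConfigSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Turn on the user jar cache (it is off by default).
        props.setProperty("pig.user.cache.enabled", "true");
        // HDFS staging directory for cached UDF jars (placeholder path).
        props.setProperty("pig.user.cache.location", "/user/pigcache");

        // Hand the properties to an embedded Pig client.
        PigServer pig = new PigServer(ExecType.MAPREDUCE, props);
        // Registered jars are hashed and reused from the cache on later runs.
        pig.registerJar("my-udfs.jar");
        pig.registerQuery("raw_log = LOAD 'input' AS (uuid:chararray, g_f:chararray);");
    }
}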

The upgrade process itself is very simple: clone the 0.13 code from git, recompile to produce a new jar, and replace the old Pig jar with the new one.

1. Finding the problem

After the Pig upgrade was complete, re-running the Pig scripts that had worked before quickly exposed a problem. I had written a number of Pig functions, and found that in the new Pig environment, when a column's value is null, execution of the function is skipped. The function is simple, as follows.

package com.sogou.wap.pig.eval;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class GFChannelEvalFunc extends EvalFunc<String> {

    public String exec(Tuple tuple) throws IOException {
        // Before the upgrade, exec() was still invoked when tuple.get(0) == null.
        // In the new Pig environment, when tuple.get(0) == null, Pig skips
        // executing the user-defined function altogether.
        int fieldNum = 1;
        if (tuple == null || tuple.size() != fieldNum) {
            return null;
        }
        String gf = (String) tuple.get(0);
        return GFChannel.getChannel(gf);
    }
}

2. Troubleshooting the problem

2.1 Custom UDF issues

At first, I thought it was a problem with the UDFs I had written, so I decided to test with a built-in UDF to see whether the bug could be reproduced. The function used is IsInt. A simple test case was constructed to check whether the system-provided UDF shows the same problem. The Pig code is as follows:

a = foreach raw_log generate pbstr#'g_f' as g_f, uuid as uuid;

c = foreach a generate org.apache.pig.piggybank.evaluation.IsInt(g_f) as channel, g_f;

If g_f is null, the code above should return FALSE when it executes. But when the code finished executing, the returned value was not false. The preliminary conclusion is that both the built-in UDF and my custom UDFs run into the same problem when handling null values.
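To make the reasoning concrete, here is a small sketch of an IsInt-style check (illustrative only, not the actual piggybank source), written on the assumption stated above that such a UDF returns FALSE for a null input. If the output column is not false, the only remaining explanation is that exec() was never invoked:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Illustrative only: an IsInt-style check, not the piggybank implementation.
public class IsIntSketch extends EvalFunc<Boolean> {

    @Override
    public Boolean exec(Tuple input) throws IOException {
        // A null (or missing) field yields FALSE rather than null.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return Boolean.FALSE;
        }
        try {
            Integer.parseInt(input.get(0).toString());
            return Boolean.TRUE;
        } catch (NumberFormatException e) {
            return Boolean.FALSE;
        }
    }
}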

2.2 A problem in Pig's execution logic

Since the previous step showed that the problem was not in my own UDF, I presumed that the UDF execution logic in the new Pig is not the same as before.

First, we need to find out where Pig calls a custom UDF.

The method used is relatively simple: throw an exception inside the UDF, so that when the error occurs the log prints the stack of the method call. The modified method is shown below.

public class GFChannelEvalFunc extends EvalFunc<String> {

    public String exec(Tuple tuple) throws IOException {
        // test is null, so test.equals(...) throws a NullPointerException,
        // which makes the task log print the full call stack.
        String test = null;
        if (test.equals("tag")) {
            System.out.println("tag");
            throw new IOException("IOException");
        }
        // ... rest of the original exec() body unchanged ...
    }
}

The modified UDF is then re-executed in the Pig environment, and we look for the log printed while the MapReduce job runs on the Hadoop cluster.

The error log is shown below; it prints the method-call stack very clearly. The important part is the chain of POUserFunc frames at the bottom that leads into the UDF's exec method.

2014-12-10 11:25:20,352 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child:
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing
[POUserFunc (Name: POUserFunc(com.sogou.wap.pig.eval.GFChannelEvalFunc)[chararray] - scope-12
Operator Key: scope-12) children: null at []]: java.lang.NullPointerException
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:339)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:282)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:277)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:791)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.NullPointerException
    at com.sogou.wap.pig.eval.GFChannelEvalFunc.exec(GFChannelEvalFunc.java:16)
    at com.sogou.wap.pig.eval.GFChannelEvalFunc.exec(GFChannelEvalFunc.java:11)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:345)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextString(POUserFunc.java:445)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:316)

3. Solving the problem

After the troubleshooting above, the problem was traced to the POUserFunc class. Reading its code suggests that the logic below is the culprit. Pig's code is mirrored on GitHub.

1) Find the Pig project on GitHub: https://github.com/apache/pig

2) Locate the POUserFunc class:

https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java

3) Looking at the commit history of this class shows that the offending code was introduced by PIG-3679, and was later removed by the PIG-4184 commit.

4) Searching for PIG-4184 leads to https://issues.apache.org/jira/browse/PIG-4184. The code that caused this bug was originally committed to fix PIG-3679, but it unintentionally introduced a more serious bug, so PIG-4184 was filed and fixed for 0.14.0; the author of the code had already recognized the seriousness of the problem. (A sketch of the behavioral change is given after this list.)

5) Once the cause is known, the fix is relatively simple. Since upgrading all the way to 0.14.0 seemed a little too radical, we merged the PIG-4184 patch onto our 0.13.0 code base instead.
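To show what the change means for a UDF author, here is a small hypothetical demo (the class and helper names are mine, and the short-circuit is only a sketch of the behavior described in the JIRAs, not the actual POUserFunc code): a UDF that wants to map a null column to a default value works under the old dispatch but is never called under the 0.13.0 one.

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical demo of the behavioral difference; not the actual POUserFunc code.
public class NullShortCircuitDemo {

    // Stand-in for a UDF that maps a null input to a default value.
    static class DefaultingFunc extends EvalFunc<String> {
        @Override
        public String exec(Tuple t) throws IOException {
            Object v = (t == null || t.size() == 0) ? null : t.get(0);
            return v == null ? "unknown" : v.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        Tuple input = TupleFactory.getInstance().newTuple(1);
        input.set(0, null);                       // the column value is null
        DefaultingFunc func = new DefaultingFunc();

        // Old behavior: the UDF is always called, so it can map null to "unknown".
        System.out.println("old: " + func.exec(input));

        // Behavior introduced by PIG-3679 (sketched here): the framework
        // short-circuits on a null input value and never calls exec(), so the
        // result is simply null.
        Object result = (input.get(0) == null) ? null : func.exec(input);
        System.out.println("new: " + result);
    }
}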

After the merge was complete, we rebuilt the jar, re-ran the jobs, and everything tested OK. Problem solved.
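As a follow-up, one way to guard against this kind of regression is a small PigUnit test. This is only a sketch, assuming PigUnit and the piggybank jar are on the classpath; the script mirrors the test from section 2.1, and the expected output encodes the expectation that IsInt is actually called for a null g_f and returns false.

import org.apache.pig.pigunit.PigTest;

// Sketch of a PigUnit regression test for the null-handling behavior.
public class NullHandlingRegressionTest {

    public static void main(String[] args) throws Exception {
        String[] script = {
            "a = LOAD 'input' AS (uuid:chararray, g_f:chararray);",
            "c = FOREACH a GENERATE org.apache.pig.piggybank.evaluation.IsInt(g_f) AS channel, g_f;",
        };
        // Only one field on the line, so g_f comes through as null.
        String[] input = { "u1" };
        // With the patch applied, IsInt(null) is evaluated and yields false;
        // the null g_f prints as an empty field.
        String[] expected = { "(false,)" };

        PigTest test = new PigTest(script);
        test.assertOutput("a", input, "c", expected);
    }
}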
