Spork: Pig on Spark Implementation Analysis


Introduction: Spork is a highly experimental version of Pig on Spark, and the Spark version it depends on is also fairly old. As mentioned in the previous article, I maintain Spork on my github: flare-spork. This article analyzes how Spork is implemented and what it actually covers.
Spark Launcher

A Spark launcher is implemented under the path of the hadoop executionengine package. Like MapReduceLauncher, the Spark launcher's launchPig translates the input physical execution plan; the MR launcher translates it into MR operators and further into an MR JobControl, while the Spark launcher translates part of the physical operators in the physical plan directly into RDD operations. One disadvantage is that after the translation into RDD operators there is no further optimization pass: it is a direct, one-to-one mapping of the physical operators, and the concrete execution logic is handed entirely to Spark's DAGScheduler for stage splitting and to the TaskScheduler for task scheduling. Also, as with Pig in general, the whole translate-and-launch process is only triggered when a Dump/Store is seen, and a single physical execution plan may correspond to multiple Spark jobs.
In the current implementation, the translation of the physical operators is handed over to a number of Converter implementation classes:

    public interface POConverter<IN, OUT, T extends PhysicalOperator> {
        RDD<OUT> convert(List<RDD<IN>> rdd, T physicalOperator) throws IOException;
    }
Each converter provides the convert method. The List<RDD<IN>> in the parameters holds the RDDs produced by the predecessors of this physical operator; they can be regarded as the parent RDDs the result will depend on. The conversion produces the next RDD, and whether that RDD actually triggers computation on Spark is not controlled at this point, which is why, as mentioned above, one Pig physical execution plan may end up as multiple Spark jobs.
When using Spark, you can start the Pig environment with Spark as the backend engine via -x spark.
Next, let's look at which physical operators are converted and how.
Load/Store

They all take the NewHadoopRDD route.

For Load, the file path is obtained from POLoad and the necessary configuration from pigContext; SparkContext's newAPIHadoopFile is then called to obtain a NewHadoopRDD, and finally the Tuple2<Text, Tuple> records are mapped to an RDD that keeps only the values (the Tuples).

For Store, the final RDD is first mapped to Tuple2<Text, Tuple> with an empty Text as the key, then wrapped as PairRDDFunctions; the POStore operation is set up with the help of pigContext, and finally the RDD's saveAsNewAPIHadoopFile is called to save the data to HDFS.
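As a rough illustration, here is a minimal Scala sketch of the Load and Store conversions just described. Spork's own converters are Java code against the same Scala RDD API; PigInputFormat/PigOutputFormat as the formats, and loadPath, storePath, jobConf as variables, are assumptions of this example.

    import org.apache.hadoop.io.Text
    import org.apache.pig.data.Tuple
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.{PigInputFormat, PigOutputFormat}
    import org.apache.spark.SparkContext._   // pair-RDD implicits (older Spark style)

    // Load: newAPIHadoopFile yields (Text, Tuple) records; keep only the value part.
    val loaded = sc.newAPIHadoopFile(
        loadPath,                   // file path taken from POLoad (assumed variable)
        classOf[PigInputFormat],    // assumed InputFormat
        classOf[Text],
        classOf[Tuple],
        jobConf)                    // configuration derived from pigContext (assumed variable)
      .map(_._2)

    // Store: put an empty Text back as the key, then write with the new Hadoop API.
    loaded.map(t => (new Text(), t))
      .saveAsNewAPIHadoopFile(storePath, classOf[Text], classOf[Tuple],
        classOf[PigOutputFormat], jobConf)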


Foreach, Filter, Limit

ForEach is converted by implementing an Iterator[T] => Iterator[T] function and passing it to rdd.mapPartitions().

The Iterator[T] => Iterator[T] implementation relies on the original POForEach: getNextTuple and the related operations are used to build a new Iterator.

For the abstract class PhysicalOperator in the hadoop backend executionengine package, the setInputs() and attachInput() methods feed in the tuple data to be processed, and processTuple() is triggered when getNextTuple() is called, with the attached input Tuple as the object being processed.

So when the ForEach conversion implements its Iterator, the input-setting operations above are placed in the readNext() method, and getNextTuple() is called before returning, so that the processed result is what gets emitted.
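In Scala terms, the wrapping looks roughly like the sketch below; poForEach stands for the POForEach instance held by the converter, and status/error handling is left out (Spork's real code implements this as a Java Iterator).

    import org.apache.pig.data.Tuple

    // Each partition's iterator is wrapped so every input tuple is pushed through POForEach.
    val foreachRDD = rdd.mapPartitions { (it: Iterator[Tuple]) =>
      it.map { t =>
        poForEach.setInputs(null)            // same three-step pattern as the snippet shown later
        poForEach.attachInput(t)             // attach the current tuple as input
        val res = poForEach.getNextTuple()   // triggers processTuple() internally
        res.result.asInstanceOf[Tuple]       // emit the processed tuple (status checks omitted)
      }
    }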

POFilter likewise uses setInputs(), attachInput(), and getNextTuple() to produce its result.

So for the RDD side, these steps are wrapped into a FilterFunction and passed to rdd.filter(function).

POLimit is exactly the same as POFilter.
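A minimal Scala sketch of the Filter wrapping just described (POLimit follows the same pattern); the real FilterFunction is a Java function class, and treating STATUS_OK as "the predicate passed" is an assumption of this sketch.

    import org.apache.pig.backend.hadoop.executionengine.physicalLayer.POStatus

    // Keep a tuple only if POFilter lets it through.
    val filtered = rdd.filter { t =>
      poFilter.setInputs(null)
      poFilter.attachInput(t)
      poFilter.getNextTuple().returnStatus == POStatus.STATUS_OK   // assumed pass/fail check
    }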


Distinct

RDD itself now has a distinct(numPartitions: Int) method.


The distinct implementation here follows exactly the same logic as RDD's own distinct.

Step 1: map the RDD of Tuple to Tuple2<Tuple, Object>, where the value part is null;

Step 2: call rdd.reduceByKey(merge_function, parallelism); merge_function does nothing with the two value objects, that is, duplicates are reduced by key and the value field is left untouched;

Step 3: call rdd.map(function, ClassTag) on the result of step 2, where the function simply takes Tuple2._1, that is, the key, which is the Tuple.
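The three steps amount to roughly the following Scala sketch, which has the same shape as RDD.distinct itself (parallelism is the assumed partition count):

    import org.apache.spark.SparkContext._                       // reduceByKey implicit (older Spark style)

    val keyed   = rdd.map(t => (t, null))                        // step 1: the Tuple becomes the key, value is null
    val reduced = keyed.reduceByKey((a, b) => a, parallelism)    // step 2: duplicates collapse on the key
    val result  = reduced.map(_._1)                              // step 3: keep only the key Tuple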



Union

Union is a merge: a UnionRDD is constructed directly.

UnionRDD takes a Seq<RDD<Tuple>>, so JavaConversions.asScalaBuffer(List<RDD<Tuple>>) is used to convert the Java list of parent RDDs.
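In Scala terms the conversion amounts to the sketch below, where predecessors stands for the List<RDD<Tuple>> handed to the converter:

    import scala.collection.JavaConversions
    import org.apache.pig.data.Tuple
    import org.apache.spark.rdd.{RDD, UnionRDD}

    // Wrap all parent RDDs into one UnionRDD; asScalaBuffer bridges the Java List to a Scala Seq.
    val unioned: RDD[Tuple] = new UnionRDD(sc, JavaConversions.asScalaBuffer(predecessors))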



Sort

Sort process:

Step 1: convert the Tuple-typed RDD to Tuple2<Tuple, Object>, where the Object value is empty (null).

Step 2: from the result of step 1, construct an OrderedRDDFunctions<Tuple, Object, Tuple2<Tuple, Object>> and call its sortByKey method to generate a sorted RDD<Tuple2<Tuple, Object>>. The Key type in OrderedRDDFunctions must be sortable, and the comparator reuses POSort's mComparator. The result returned by sortByKey is a ShuffledRDD whose Partitioner is a RangePartitioner, so after the sort each Partition holds the sorted values of one key range.

Step 3: call rdd.mapPartitions(function, ...); the function turns Iterator<Tuple2<Tuple, Object>> back into Iterator<Tuple>, that is, the key Tuple is extracted again.
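Putting the three steps together gives roughly this Scala sketch (getMComparator is assumed to be the accessor for POSort's mComparator):

    import org.apache.pig.data.Tuple
    import org.apache.spark.SparkContext._             // sortByKey implicit (older Spark style)

    // Reuse POSort's comparator so Pig's sort semantics are preserved.
    implicit val keyOrdering: Ordering[Tuple] =
      Ordering.comparatorToOrdering(poSort.getMComparator)

    val sorted = rdd.map(t => (t, null))                // step 1: the value part is empty
      .sortByKey(true)                                  // step 2: RangePartitioner + sort within each partition
      .mapPartitions(it => it.map(_._1))                // step 3: take the key Tuple back out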


Split

POSplit simply returns its first predecessor RDD.


LocalRearrange

LocalRearrange -> GlobalRearrange -> Package always appear together.



LocalRearrange relies directly on the following calls:

  physicalOperator.setInputs(null);
  physicalOperator.attachInput(t);
  result = physicalOperator.getNextTuple();

The result is obtained in these three steps, and the returned Tuple has the format (index, key, value).

That is, it relies on POLocalRearrange's own processing of the input tuple.


GlobalRearrange

The Tuples to be processed have the format (index, key, value); the final result is (key, {values}).

If there is only one parent RDD:

first perform a groupBy on the key; the result is a Tuple2 of (key, Seq of grouped tuples);

then a map operation turns it into the (key, {values}) form, that is, an RDD of Tuples.

If there are multiple parent RDDs:

first a map operation on each rdd converts the Tuples from (index, key, value) to (key, value) form; then the set of RDDs is used to construct a new CoGroupRDD, which involves one Seq conversion via JavaConversions.asScalaBuffer(rddPairs); finally, the map method of the CoGroupRDD converts each Tuple2 of (key, Seq of per-parent value groups) into a Tuple in (key, {values}) form. In effect, the map over the CoGroupRDD merges the Iterator sets for each key.
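For the single-parent case the logic is roughly the following Scala sketch; the output tuple construction is simplified, and the multi-parent path does the same thing with a CoGroupRDD, one value iterator per parent:

    import org.apache.pig.data.{Tuple, TupleFactory}

    // Group the (index, key, value) tuples by their key field, then rebuild (key, {values}) tuples.
    val grouped = rdd.groupBy((t: Tuple) => t.get(1))        // the key sits at index 1
    val rearranged = grouped.map { case (key, tuples) =>
      val out = TupleFactory.getInstance().newTuple(2)
      out.set(0, key)                                        // the group key
      out.set(1, tuples)                                     // grouped tuples (the real code builds an iterator/bag here)
      out
    }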

Package

Package has to process the key and the grouped Seq produced by GlobalRearrange. The Tuple it receives has the structure (key, Seq: {(index, key, value-with-the-key-stripped)}).

tuple.get(0) is the keyTuple and tuple.get(1) is the Iterator of grouped values; the final return is (key, {values}), that is, a Tuple.


That's the full text :)
