From physical plan to Map-reduce plan
Note: Since our focus is Pig on Spark and its RDD execution plan, the backend stages that follow the physical execution plan are less important here; this section analyzes the overall flow and skips implementation details.
The entry class is MRCompiler. It traverses the nodes of the physical execution plan in topological order and converts them into MapReduceOper operators; each MapReduceOper represents one map-reduce job. The complete plan is stored in the MROperPlan class. Load and store operators receive special handling (see the sketch after this list):
A Store must be a leaf node; otherwise an exception is thrown.
A Load creates a new MapReduceOper and adds it to the MROperPlan.
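To make the traversal concrete, here is a minimal, self-contained sketch of that compilation loop. All types are simplified stand-ins for Pig's real classes, and only the topological walk and the load/store special cases above are modeled; the real MRCompiler also splits jobs at blocking operators such as group, which this ignores.

import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for Pig's physical-plan and MR-plan classes.
class PhysicalOp {
    final String name;                       // e.g. "Load", "Filter", "Store"
    final List<PhysicalOp> successors = new ArrayList<>();
    PhysicalOp(String name) { this.name = name; }
    boolean isLoad()  { return name.equals("Load"); }
    boolean isStore() { return name.equals("Store"); }
    boolean isLeaf()  { return successors.isEmpty(); }
}

class MapReduceOper {                        // one map-reduce job
    final List<String> ops = new ArrayList<>();
}

class MROperPlan {                           // the complete MR plan
    final List<MapReduceOper> jobs = new ArrayList<>();
}

class MRCompilerSketch {
    // Walk the physical operators in topological order, folding each one
    // into the current map-reduce job.
    MROperPlan compile(List<PhysicalOp> topoOrder) {
        MROperPlan plan = new MROperPlan();
        MapReduceOper current = null;
        for (PhysicalOp op : topoOrder) {
            if (op.isStore() && !op.isLeaf()) {
                // Store must terminate the plan.
                throw new IllegalStateException("Store must be a leaf node");
            }
            if (op.isLoad()) {
                // Each Load opens a new map-reduce job in the plan.
                current = new MapReduceOper();
                plan.jobs.add(current);
            }
            if (current == null) {
                throw new IllegalStateException("plan must start with a Load");
            }
            current.ops.add(op.name);
        }
        return plan;
    }
}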
[Figure: MROperPlan schematic]
From Map-reduce plan to Hadoop Job
JobControlCompiler compiles the map-reduce plan into Hadoop jobs.
The entry method is:
public JobControl compile(MROperPlan plan, String grpName) throws JobCreationException
The compile method invokes the getJob method for each MapReduceOper to generate a Hadoop Job:
private Job getJob(MROperPlan plan, MapReduceOper mro, Configuration config, PigContext pigContext) throws JobCreationException
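Reduced to its essentials, this two-level flow might look like the sketch below: compile walks the MR plan, asks getJob for one Hadoop Job per operator, and strings the jobs together in a JobControl. The Hadoop classes (Configuration, Job, JobControl, ControlledJob) are real Hadoop APIs, but MROperPlan and MapReduceOper are simplified stand-ins, and the body of getJob is an assumption that omits everything Pig actually serializes into the job configuration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

class JobControlCompilerSketch {
    // Stand-ins for Pig's MR-plan types.
    static class MapReduceOper { String name; MapReduceOper(String n) { name = n; } }
    static class MROperPlan   { List<MapReduceOper> topoOrder = new ArrayList<>(); }

    private final Configuration conf = new Configuration();

    // compile: one ControlledJob per MapReduceOper, collected in a JobControl.
    public JobControl compile(MROperPlan plan, String grpName) throws IOException {
        JobControl jobCtrl = new JobControl(grpName);
        for (MapReduceOper mro : plan.topoOrder) {
            Job job = getJob(plan, mro, conf);
            // Dependencies between jobs would be wired up here from the plan's edges.
            jobCtrl.addJob(new ControlledJob(job, null));
        }
        return jobCtrl;
    }

    // getJob: turn one MapReduceOper into a configured Hadoop Job.
    private Job getJob(MROperPlan plan, MapReduceOper mro, Configuration config)
            throws IOException {
        Job job = Job.getInstance(config, mro.name);
        // Pig serializes the operator's map and reduce plans into the job
        // configuration here, and picks mapper/reducer classes (next section).
        return job;
    }
}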
Pig's Mapper and Reducer implementations form the inheritance structure shown in the class-hierarchy figure (omitted here). Among the variants:
XXXWithPartitionIndex is used for skewed joins.
XXXWithComparator is used for UDFs that require sorting.
XXXCounter counts records, in support of RANK operations.
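How a job ends up with one of these variants can be pictured as a simple selection over per-job flags. Everything below is an assumed illustration: the flag names and the chooseVariant helper are hypothetical, and the returned names simply mirror the variants listed above rather than quoting Pig's source.

// Assumed flags on each map-reduce operator; names are illustrative.
class MapReduceOper {
    boolean isSkewedJoin;         // skewed join -> XXXWithPartitionIndex
    boolean usesCustomComparator; // sorting UDF -> XXXWithComparator
    boolean isRank;               // RANK -> XXXCounter
}

class MapperChoiceSketch {
    // Pick which Mapper/Reducer variant a job like this would run with.
    static String chooseVariant(MapReduceOper mro) {
        if (mro.isSkewedJoin)         return "XXXWithPartitionIndex";
        if (mro.usesCustomComparator) return "XXXWithComparator";
        if (mro.isRank)               return "XXXCounter";
        return "default Map/Reduce";
    }

    public static void main(String[] args) {
        MapReduceOper mro = new MapReduceOper();
        mro.isSkewedJoin = true;
        System.out.println(chooseVariant(mro)); // prints XXXWithPartitionIndex
    }
}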