Spark Core Source Analysis 8: Seeing Transformations from a Simple Example


We start from one of the simplest examples that ships with Spark, SparkPi, which was already mentioned in the earlier section on SparkContext; the rest of this article walks through the transformations it performs.

object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
The program first calls the parallelize method of SparkContext. This method turns an existing Scala collection into a distributed dataset that can be operated on in parallel, and it returns an RDD.

Rather than describing in words what an RDD is, it is more intuitive to look at the source. As you can see, at its most basic an RDD is just a SparkContext plus its own dependencies.

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
  ...
  ...
}
Now look at the parallelize method itself. seq is a Scala collection, and numSlices is the degree of parallelism; its default value comes from the spark.default.parallelism configuration item or, failing that, from the total number of cores the executors have registered with the driver.

override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}

def parallelize[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  assertNotStopped()
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}

We continue into ParallelCollectionRDD. It overrides several of the RDD methods and sets its dependencies to Nil.

private[spark] class ParallelCollectionRDD[T: ClassTag](
    @transient sc: SparkContext,
    @transient data: Seq[T],
    numSlices: Int,
    locationPrefs: Map[Int, Seq[String]])
  extends RDD[T](sc, Nil) {
  // TODO: Right now, each split sends along its full data, even if later down the RDD chain it gets
  // cached. It might be worthwhile to write the data to a file in the DFS and read it in the split
  // instead.
  // UPDATE: A parallel collection can be checkpointed to HDFS, which achieves this goal.

  override def getPartitions: Array[Partition] = {
    val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
    slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
  }

  override def compute(s: Partition, context: TaskContext): Iterator[T] = {
    new InterruptibleIterator(context, s.asInstanceOf[ParallelCollectionPartition[T]].iterator)
  }

  override def getPreferredLocations(s: Partition): Seq[String] = {
    locationPrefs.getOrElse(s.index, Nil)
  }
}
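To get a feel for what getPartitions produces, here is a small, self-contained sketch of how a local Seq is cut into numSlices contiguous pieces. It is written from scratch for illustration only; the real ParallelCollectionRDD.slice also special-cases Range and numeric ranges.

object SliceSketch {
  // Simplified stand-in for ParallelCollectionRDD.slice: split seq into numSlices
  // contiguous chunks whose sizes differ by at most one element.
  def sliceSketch[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "Positive number of slices required")
    (0 until numSlices).map { i =>
      val start = ((i * seq.length.toLong) / numSlices).toInt
      val end = (((i + 1) * seq.length.toLong) / numSlices).toInt
      seq.slice(start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    // 10 elements into 3 slices: List(1, 2, 3), List(4, 5, 6), List(7, 8, 9, 10)
    println(sliceSketch((1 to 10).toList, 3))
  }
}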
At this point the parallelize call is complete. This is what people usually call a Spark transformation; it does not trigger any task scheduling.
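A quick way to convince yourself of this is the sketch below (the object name and the local master setting are my own, not from the article): creating the RDD only records its lineage and partitions, and nothing is scheduled until an action such as reduce is called.

import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeIsLazy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("parallelize-demo"))
    val rdd = sc.parallelize(1 until 10, 2) // ParallelCollectionRDD with 2 slices; no job is scheduled yet
    println(rdd.partitions.length)          // 2 -- computed on the driver by getPartitions
    println(rdd.reduce(_ + _))              // 45 -- only this action triggers SparkContext.runJob
    sc.stop()
  }
}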

Next, the map operation is performed on this ParallelCollectionRDD. It is simply the map method implemented in the RDD abstract class.

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
The f parameter of map is the function we wrote inside map in our program; the call then produces a MapPartitionsRDD.

The prev parameter is the ParallelCollectionRDD that existed before the map conversion was applied.

private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev) {

  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))
}
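Note that compute does not materialize anything by itself: it just wraps the parent's iterator with the stored function, so work happens only when the resulting iterator is consumed. The Spark-free sketch below (simplified types and names of my own) imitates that chaining:

object ComputeChainSketch {
  def main(args: Array[String]): Unit = {
    // stands in for the parent partition's iterator, as ParallelCollectionRDD.compute would expose it
    val parentIter: Iterator[Int] = Seq(1, 2, 3, 4).iterator
    // stands in for the f stored by map: (context, partition index, iter) => iter.map(cleanF)
    val f: (Int, Iterator[Int]) => Iterator[Int] = (pid, iter) => iter.map(_ * 2)
    // child compute = f(partition index, parent iterator); still lazy at this point
    val childIter = f(0, parentIter)
    println(childIter.toList) // List(2, 4, 6, 8) -- evaluated only now, when consumed
  }
}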

Looking at the RDD constructor used for the parent, it builds a one-to-one dependency by default. OneToOneDependency means the conversion is one-to-one between partitions, i.e. partition numbers correspond one by one. Note that the RDD stored in the dependency is the RDD before the conversion, which here is the ParallelCollectionRDD.

/** Construct an RDD with just a one-to-one dependency on one parent */
def this(@transient oneParent: RDD[_]) =
  this(oneParent.context, List(new OneToOneDependency(oneParent)))
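For reference, OneToOneDependency itself is tiny; as I recall it looks roughly like this in the Spark source (quoted from memory, so treat it as a sketch): getParents of partition i is simply List(i).

class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}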

A Partitioner is essentially a wrapper around the number of partitions plus a function that maps a key to a partition number.
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}
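As a concrete illustration of that contract, here is a simplified hash-based partitioner written from scratch (modelled on the idea behind Spark's HashPartitioner, not copied from it):

import org.apache.spark.Partitioner

class SimpleHashPartitioner(partitions: Int) extends Partitioner {
  require(partitions > 0, s"Number of partitions ($partitions) must be positive.")

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = key match {
    case null => 0
    case _ =>
      // hashCode can be negative, so keep the result in [0, numPartitions)
      val mod = key.hashCode % numPartitions
      if (mod < 0) mod + numPartitions else mod
  }
}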
By this point the map operation is complete. Again, nothing has been dispatched to trigger a task; only an RDD-to-RDD conversion has taken place.

Now look at the final operation: reduce.

/**
 * Reduces the elements of this RDD using the specified commutative and
 * associative binary operator.
 */
def reduce(f: (T, T) => T): T = withScope {
  val cleanF = sc.clean(f)
  // Define a function that folds the values of an iterator from the left
  val reducePartition: Iterator[T] => Option[T] = iter => {
    if (iter.hasNext) {
      Some(iter.reduceLeft(cleanF))
    } else {
      None
    }
  }
  var jobResult: Option[T] = None
  // Define a function that merges the result of each partition into jobResult
  val mergeResult = (index: Int, taskResult: Option[T]) => {
    if (taskResult.isDefined) {
      jobResult = jobResult match {
        case Some(value) => Some(f(value, taskResult.get))
        case None => taskResult
      }
    }
  }
  // Execute runJob
  sc.runJob(this, reducePartition, mergeResult)
  // Get the final result out of our Option, or throw an exception if the RDD was empty
  jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}
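The driver-side logic is easier to see without the cluster in the way. The sketch below (a plain Scala simulation with made-up data, not Spark code) plays the role of runJob: it applies reducePartition to each "partition" and feeds the per-partition results to mergeResult in turn.

object ReduceSketch {
  def main(args: Array[String]): Unit = {
    val f: (Int, Int) => Int = _ + _
    val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5), Seq.empty)

    // per-partition fold, as in RDD.reduce
    val reducePartition: Iterator[Int] => Option[Int] =
      iter => if (iter.hasNext) Some(iter.reduceLeft(f)) else None

    // driver-side merge of partition results, as in RDD.reduce
    var jobResult: Option[Int] = None
    val mergeResult = (index: Int, taskResult: Option[Int]) => {
      if (taskResult.isDefined) {
        jobResult = jobResult match {
          case Some(value) => Some(f(value, taskResult.get))
          case None        => taskResult
        }
      }
    }

    // what sc.runJob does conceptually: run reducePartition per partition, hand each result to mergeResult
    partitions.zipWithIndex.foreach { case (p, i) => mergeResult(i, reducePartition(p.iterator)) }
    println(jobResult.getOrElse(sys.error("empty collection"))) // 15
  }
}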
Finally, reduce calls SparkContext's runJob method with the following arguments:
rdd: the MapPartitionsRDD
func: a wrapper around the reducePartition function
partitions: 0 until rdd.partitions.size. rdd.partitions calls the getPartitions method of MapPartitionsRDD. Every RDD produced by a conversion keeps all of its earlier dependencies, so the chain can be traced back to the very first RDD; here the getPartitions of MapPartitionsRDD ends up returning the partitions of that first RDD.
allowLocal: false
resultHandler: the mergeResult function
/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark. The allowLocal
 * flag specifies whether the scheduler can run the computation on the driver rather than
 * shipping it out to the cluster, for short actions like first().
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit) {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
    resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
Here, finally, DAGScheduler's runJob method is called, which in turn calls submitJob.
submitJob generates a jobId and posts a JobSubmitted message to the DAGScheduler's event loop, which eventually invokes handleJobSubmitted.
So reduce is what actually triggers runJob; this is what Spark calls an action.
The rest of the reduce path is described in the next section.
