The shuffle mechanism in Spark

Source: Internet
Author: User
Tags: shuffle

What does shuffle in Spark do?

A shuffle in Spark produces a new RDD by re-partitioning the key-value pairs of the parent RDD by key. This means that data belonging to one partition of the parent RDD may need to go into different partitions of the child RDD.
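As a minimal illustration (a hypothetical snippet, not taken from the original article), a transformation such as reduceByKey forces records with the same key, which may sit in different partitions of the parent RDD, to be brought together into one partition of the child RDD:

    import org.apache.spark.{SparkConf, SparkContext}

    object ShuffleExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shuffle-example").setMaster("local[2]"))

        // Key-value pairs spread over two partitions; the same key can appear in both.
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("c", 1)), numSlices = 2)

        // reduceByKey must gather all values of a key into one partition of the
        // child RDD, so it introduces a shuffle between the parent and child RDDs.
        val counts = pairs.reduceByKey(_ + _)

        counts.collect().foreach(println)   // e.g. (a,2), (b,1), (c,1)
        sc.stop()
      }
    }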

But that only describes what the shuffle process is, not what makes it necessary. Why do we need shuffle?

Shuffle and stages

In a distributed computing framework such as MapReduce, data locality is a very important consideration: computation should be dispatched to where the data lives, which reduces data movement and improves efficiency.

The input to MapReduce is usually a set of files in HDFS, so data locality requires that map tasks be scheduled onto the nodes where their input blocks are stored. However, some computation logic cannot simply consume local data, and the logic of reduce is an example. The input to a reduce function is all the values of one key, but those values (the output of the map phase) live on different nodes, so the map output has to be reorganized so that all records with the same key reach the same reducer. Shuffle moves a large amount of data and is a heavy drain on CPU, memory, network, and disk, so it should only be done where it is really needed.

Division of stages

For Spark, the computation logic lives in the transformation logic of RDDs. Spark's scheduler also schedules tasks based on data locality, except that "local" here includes not only disk files but also RDD partitions. Spark tries to move as little data as possible, so the DAGScheduler splits a job into multiple stages: within a stage, data does not need to be moved and is processed locally through a series of functions, until a shuffle is really needed.
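A quick way to see where such a boundary falls (a small illustrative snippet, not from the original article) is to print an RDD's lineage with toDebugString; the indentation changes at the ShuffledRDD, which is exactly where the DAGScheduler cuts a new stage:

    // Assumes an existing SparkContext `sc`, e.g. in spark-shell.
    val lineage = sc.parallelize(1 to 100, 4)
      .map(i => (i % 10, i))      // narrow dependency: stays in the same stage
      .filter(_._2 > 5)           // narrow dependency: stays in the same stage
      .reduceByKey(_ + _)         // ShuffleDependency: a new stage starts here
      .toDebugString

    println(lineage)
    // The output shows a ShuffledRDD sitting on top of a MapPartitionsRDD and a
    // ParallelCollectionRDD, with the indentation marking the shuffle (stage) boundary.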

For example, the getParentStages method of DAGScheduler uses the following snippet when looking for a stage's parent stages:

    for (dep <- r.dependencies) {
      dep match {
        case shufDep: ShuffleDependency[_, _, _] =>
          parents += getShuffleMapStage(shufDep, jobId)
        case _ =>
          waitingForVisit.push(dep.rdd)
      }
    }

That is, when a ShuffleDependency is found, a new stage is cut at that point (except for RDDs with no parent RDD, such as HadoopRDD, whose dependencies is Nil).

The code above mentions ShuffleMapStage. In fact, Spark's Stage has only two subclasses: ShuffleMapStage and ResultStage. Correspondingly, Task also has only two subclasses: ResultTask and ShuffleMapTask. The relationship between these classes shows up in DAGScheduler's submitMissingTasks method. Here is a piece of code from that method:

    val tasks: Seq[Task[_]] = try {
      stage match {
        case stage: ShuffleMapStage =>
          partitionsToCompute.map { id =>
            val locs = getPreferredLocs(stage.rdd, id)
            val part = stage.rdd.partitions(id)
            new ShuffleMapTask(stage.id, taskBinary, part, locs)
          }

        case stage: ResultStage =>
          val job = stage.resultOfJob.get
          partitionsToCompute.map { id =>
            val p: Int = job.partitions(id)
            val part = stage.rdd.partitions(p)
            val locs = getPreferredLocs(stage.rdd, p)
            new ResultTask(stage.id, taskBinary, part, locs, id)
          }
      }
    } catch {
      case NonFatal(e) =>
        abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}")
        runningStages -= stage
        return
    }

This code generates the tasks for a stage. As it shows, a ResultStage produces ResultTasks, while a ShuffleMapStage produces ShuffleMapTasks.

What's so special about ShuffleMapTask?

A job with more than one stage necessarily involves a shuffle, which means that a stage's parent stage is a ShuffleMapStage. In a ShuffleMapStage, the data of the last RDD is shuffled, and this is the difference between a ShuffleMapTask and a ResultTask. Below is a piece of code from the runTask method of ShuffleMapTask (the executor indirectly calls runTask):

      // First obtain the ShuffleManager
      val manager = SparkEnv.get.shuffleManager
      // Get a writer; note that ShuffleDependency.shuffleHandle is passed in
      writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
      writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
      return writer.stop(success = true).get

The line writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]]) computes one partition of the last RDD and then writes its data out with the writer; this can be regarded as the map phase of the shuffle.

So how is the reduce phase triggered?

It is actually triggered naturally by Spark's computation logic on RDDs.

Spark's execution is driven by computing partitions of RDDs (as mentioned in the previous article): computing a partition of a child RDD triggers the computation of the corresponding partitions of its parent RDD, and so on back to the first RDD that can actually be computed. So in a stage that begins after a shuffle, the earliest RDD is bound to contain the logic of the shuffle-read side. There are two such RDD classes, ShuffledRDD and CoGroupedRDD (the latter is not necessarily the result of a shuffle); in other words, reduce is triggered by computing one of these special RDDs. The following takes ShuffledRDD as an example; it is the RDD produced when a single RDD is shuffled.
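To make this pull-based triggering concrete, here is a deliberately simplified sketch (not the real Spark classes) of how computing a child partition either pulls on the parent's compute, or, at a shuffle boundary, fetches map output instead:

    // A simplified sketch, NOT the actual Spark API: it only illustrates how
    // computing a partition of a child RDD pulls on its parent, and how a
    // shuffle boundary breaks that chain.
    abstract class TinyRDD[T] {
      def compute(partition: Int): Iterator[T]
    }

    // Narrow (map-like) dependency: computing partition i simply computes
    // partition i of the parent and transforms the records on the fly.
    class MappedTinyRDD[T, U](parent: TinyRDD[T], f: T => U) extends TinyRDD[U] {
      override def compute(partition: Int): Iterator[U] =
        parent.compute(partition).map(f)
    }

    // Shuffle boundary: compute does NOT call the parent; it fetches the map
    // outputs that the ShuffleMapTasks of the previous stage wrote for this
    // reduce partition (the "reduce" side of the shuffle).
    class ShuffledTinyRDD[K, V](fetchMapOutput: Int => Iterator[(K, V)]) extends TinyRDD[(K, V)] {
      override def compute(partition: Int): Iterator[(K, V)] =
        fetchMapOutput(partition)
    }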

ShuffledRDD

The characteristics of ShuffledRDD show up in three places. First, it has several fields related to shuffle:

  private var serializer: Option[Serializer] = None
  private var keyOrdering: Option[Ordering[K]] = None
  private var aggregator: Option[Aggregator[K, V, C]] = None
  private var mapSideCombine: Boolean = false

Among these, aggregator mainly describes how the values of the same key are aggregated, though it is not used only for that. It is an interesting class: its fields are a set of functions.
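For illustration (an assumed construction, not code from the article; check your Spark version for the exact Aggregator signature), reduceByKey(_ + _) would conceptually fill the three Aggregator functions like this:

    import org.apache.spark.Aggregator

    // createCombiner turns the first value of a key into a combiner,
    // mergeValue folds another value into an existing combiner, and
    // mergeCombiners merges combiners coming from different map outputs.
    val sumAggregator = new Aggregator[String, Int, Int](
      createCombiner = (v: Int) => v,
      mergeValue = (c: Int, v: Int) => c + v,
      mergeCombiners = (c1: Int, c2: Int) => c1 + c2
    )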

Second, its dependency is a ShuffleDependency, so the DAGScheduler treats it as the start of a new stage and treats its parent RDD as the end of the previous stage:

  override def getDependencies: Seq[Dependency[_]] = {
    List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
  }

Finally, when a partition of ShuffledRDD is computed, it triggers the fetching of the map output and the aggregation of the values, that is, the reduce phase:

  override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
    val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
    SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
      .read()
      .asInstanceOf[Iterator[(K, C)]]
  }

So how is a ShuffledRDD generated?

Naturally, the transformations that cause a shuffle are the ones that generate a ShuffledRDD; take reduceByKey as an example.

reduceByKey actually has several overloaded methods of the same name; the simplest is:

  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }

  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKey[V]((v: V) => v, func, func, partitioner)
  }

reduceByKey is called on an RDD; call that RDD A, and call the RDD it produces B. The partitioner in the code above is the partitioner used to generate B: it indicates which partition of B each key-value pair of A should go to. This matters because combineByKey decides which RDD to generate based on this partitioner, and in certain cases reduceByKey does not cause a shuffle at all.

Here is the code in combineByKey that decides which RDD to generate:

    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }

It uses

    if (self.partitioner == Some(partitioner))

to decide whether to generate a ShuffledRDD. Here self.partitioner is the partitioner of A, which specifies which partition each key of A is in, while the partitioner on the right of == indicates which partition each key of B should be in. When the two are ==, the result is a MapPartitionsRDD produced by self.mapPartitions, the same kind of RDD a map transformation generates, and in that case reduceByKey does not cause a shuffle.
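As a small illustration (a hypothetical snippet, not from the original article): if A is already hash-partitioned with the partitioner that reduceByKey would pick, the condition above holds and no additional shuffle is introduced; a plain map drops the partitioner, so the same reduceByKey shuffles again:

    import org.apache.spark.HashPartitioner

    // Assumes an existing SparkContext `sc`, e.g. in spark-shell.
    val a = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      .partitionBy(new HashPartitioner(4))     // this partitionBy shuffles once

    // a.partitioner is Some(HashPartitioner with 4 partitions) and reduceByKey's
    // defaultPartitioner reuses it, so combineByKey takes the mapPartitions branch.
    val b = a.reduceByKey(_ + _)               // MapPartitionsRDD, no new shuffle

    // map drops the partitioner (the keys might have changed), so this shuffles.
    val c = a.map(identity).reduceByKey(_ + _) // ShuffledRDD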

Partitioner has several subclasses, some of which override the default equals method (note that == in Scala calls equals, unlike in Java). A typical example is the equals method of HashPartitioner:

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

Two HashPartitioners are considered equal when they have the same number of partitions. However, A and B having the same partitioner only guarantees that the same key lands in the same partition of both RDDs; it does not mean that the values of that key in A have already been aggregated. Therefore, when combineByKey calls mapPartitions, it passes a special iterator-to-iterator conversion:

new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))

In other words, combineValuesByKey is applied to the iterator of each partition of A to aggregate the values. For reduceByKey, the values are aggregated whether or not a shuffle is required. For example, the compute method of ShuffledRDD calls the read method of a ShuffleReader. There is currently only one ShuffleReader, HashShuffleReader; whether sort-based or hash-based shuffle is in use, the reduce side uses this reader, and it aggregates the iterator obtained by fetching data from the map side:

    val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
      if (dep.mapSideCombine) {
        new InterruptibleIterator(context, dep.aggregator.get.combineCombinersByKey(iter, context))
      } else {
        new InterruptibleIterator(context, dep.aggregator.get.combineValuesByKey(iter, context))
      }
    } else {
      require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
      // Convert the Product2s to pairs since this is what downstream RDDs currently expect
      iter.asInstanceOf[Iterator[Product2[K, C]]].map(pair => (pair._1, pair._2))
    }

In the combineByKey code above, you can see that when the ShuffledRDD is generated, the aggregator is set and mapSideCombine takes its default value, true, so combineCombinersByKey is called to merge values that have already been combined on the map side.
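To see what that means for the data (a toy walk-through with plain Scala collections, not Spark code), suppose reduceByKey(_ + _) runs over two map-side partitions:

    // Map side (combineValuesByKey): fold the values of each key within a partition.
    val mapPartition1 = Seq(("a", 1), ("a", 2), ("b", 5))
    val mapPartition2 = Seq(("a", 10), ("b", 1))

    def combineValues(part: Seq[(String, Int)]): Map[String, Int] =
      part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ + _) }

    val combiners1 = combineValues(mapPartition1)   // Map(a -> 3, b -> 5)
    val combiners2 = combineValues(mapPartition2)   // Map(a -> 10, b -> 1)

    // Reduce side (combineCombinersByKey): merge the already-combined values
    // fetched from the different map outputs.
    val reduced = (combiners1.toSeq ++ combiners2.toSeq)
      .groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ + _) }
    // reduced == Map(a -> 13, b -> 6)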

Summary

With the above, the essentials should be clear: how the DAGScheduler divides stages according to shuffles, how it generates the special tasks, and how the map and reduce phases are triggered during Spark execution.

Overall, RDD transformations try to avoid shuffles. When a shuffle is unavoidable, a special RDD is generated whose dependency is a ShuffleDependency. When dividing stages, the DAGScheduler uses the ShuffleDependency to determine the stage boundary and generates ShuffleMapTasks to do the map-side work. The transformation that causes the shuffle produces a special RDD that becomes the starting point of the child stage; calling that RDD's compute method triggers the reduce-side work. There are two kinds of this special RDD:

ShuffledRDD, which has only one parent RDD and is the result of shuffling a single RDD.

CoGroupedRDD, which has multiple parent RDDs and is the result of shuffling multiple RDDs.
