The Shuffle Process in Spark


At the Spark conference, every speaker agreed that shuffle is the biggest performance bottleneck, yet there is no way around it. Before an interview at Baidu I was once asked about Hadoop's shuffle and could only answer that I didn't know.

This article is organized around three questions:

1. How the shuffle process is divided.

2. How the intermediate results of the shuffle are stored.

3. How the shuffle data is fetched.

How the shuffle process is divided

Spark's computation model is built on RDDs. When you call operations such as reduceByKey or groupByKey on an RDD, a shuffle is needed. Let's take reduceByKey as the example.

def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = {
  reduceByKey(new HashPartitioner(numPartitions), func)
}

When calling reduceByKey we can set the number of reduce tasks by hand; if we don't, the default may not be what we expect.
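For illustration, a minimal usage sketch (the SparkContext sc, the input path and the variable names are placeholders of my own):

// assumes a SparkContext named sc; the file path is a placeholder
val words = sc.textFile("hdfs://.../input.txt")
  .flatMap(_.split(" "))
  .map(w => (w, 1))

// let Spark pick the number of reduce tasks via defaultPartitioner
val counts = words.reduceByKey(_ + _)

// or fix the number of reduce tasks (and thus shuffle partitions) explicitly
val counts8 = words.reduceByKey(_ + _, 8)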

def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
  for (r <- bySize if r.partitioner.isDefined) {
    return r.partitioner.get
  }
  if (rdd.context.conf.contains("spark.default.parallelism")) {
    new HashPartitioner(rdd.context.defaultParallelism)
  } else {
    new HashPartitioner(bySize.head.partitions.size)
  }
}
  

If you do not specify the number of reduce tasks, the default logic (sketched below) is:

1. If one of the input RDDs already has a custom Partitioner defined, that partitioner is reused.

2. Otherwise, if spark.default.parallelism is set, hash partitioning is used and the number of reduce tasks is that value.

3. Otherwise, the partition count of the largest input RDD is used. For data read from Hadoop this is the number of input splits, which can be large, so be careful.
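To make the three rules concrete, a small sketch building on the words RDD above (the numbers 16 and 32 are arbitrary):

// (1) an upstream RDD already has a partitioner -- reduceByKey reuses it
val prePartitioned = words.partitionBy(new HashPartitioner(16))
val c1 = prePartitioned.reduceByKey(_ + _)   // 16 reduce tasks

// (2) no partitioner on the parents, but spark.default.parallelism is set,
//     e.g. via new SparkConf().set("spark.default.parallelism", "32")
val c2 = words.reduceByKey(_ + _)            // 32 reduce tasks

// (3) neither: the partition count of the largest parent RDD is used,
//     which for Hadoop input equals the number of input splits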

Once the partitioner is settled, reduceByKey does three things, i.e. three RDD transformations:

// combine by key on the map side first
val combined = self.mapPartitionsWithContext((context, iter) => {
  aggregator.combineValuesByKey(iter, context)
}, preservesPartitioning = true)
// the reduce side fetches the data
val partitioned = new ShuffledRDD[K, C, (K, C)](combined, partitioner)
  .setSerializer(serializer)
// merge the fetched data and run the reduce computation
partitioned.mapPartitionsWithContext((context, iter) => {
  new InterruptibleIterator(context, aggregator.combineCombinersByKey(iter, context))
}, preservesPartitioning = true)

1. The first MapPartitionsRDD performs the map-side aggregation (a simplified sketch of the aggregator it uses follows this list).

2. ShuffledRDD does the work of fetching the data from the map side.

3. The second MapPartitionsRDD aggregates the fetched data once more.

4. Steps 1 and 3 may involve spilling to disk.
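As a rough sketch of the aggregator used in steps 1 and 3 for reduceByKey (simplified; it just instantiates Spark's Aggregator with its three functions createCombiner, mergeValue and mergeCombiners):

val aggregator = new Aggregator[K, V, V](
  (v: V) => v,   // createCombiner: the first value seen for a key becomes the combiner
  func,          // mergeValue: fold another value into the combiner (step 1, map side)
  func           // mergeCombiners: merge combiners from different maps (step 3, reduce side)
)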

For how the aggregation itself is carried out, go back to the RDD chapter.

How the intermediate results of the shuffle are stored

When the job is submitted, DAGScheduler cuts the computation at the shuffle into two stages, a map stage and a reduce stage (that is, before the shuffle and after the shuffle).

The map-side task is submitted as a ShuffleMapTask, and its runTask method is eventually invoked inside TaskRunner.

override def runTask(context: TaskContext): MapStatus = {
  val numOutputSplits = dep.partitioner.numPartitions
  metrics = Some(context.taskMetrics)
  val blockManager = SparkEnv.get.blockManager
  val shuffleBlockManager = blockManager.shuffleBlockManager
  var shuffle: ShuffleWriterGroup = null
  var success = false
  try {
    // if the serializer is empty the default JavaSerializer is used;
    // it can be changed to another one via spark.serializer
    val ser = Serializer.getSerializer(dep.serializer)
    // instantiate the writers; numOutputSplits = the number of reduce tasks mentioned earlier
    shuffle = shuffleBlockManager.forMapTask(dep.shuffleId, partitionId, numOutputSplits, ser)
    // traverse the RDD's elements, compute the bucketId from the key,
    // then use the bucketId to find the corresponding writer and write the pair
    for (elem <- rdd.iterator(split, context)) {
      val pair = elem.asInstanceOf[Product2[Any, Any]]
      val bucketId = dep.partitioner.getPartition(pair._1)
      shuffle.writers(bucketId).write(pair)
    }
    // commit the writes and compute the size of each bucket's block
    var totalBytes = 0L
    var totalTime = 0L
    val compressedSizes: Array[Byte] = shuffle.writers.map { writer: BlockObjectWriter =>
      writer.commit()
      writer.close()
      val size = writer.fileSegment().length
      totalBytes += size
      totalTime += writer.timeWriting()
      MapOutputTracker.compressSize(size)
    }
    // update the shuffle write metrics
    val shuffleMetrics = new ShuffleWriteMetrics
    shuffleMetrics.shuffleBytesWritten = totalBytes
    shuffleMetrics.shuffleWriteTime = totalTime
    metrics.get.shuffleWriteMetrics = Some(shuffleMetrics)
    success = true
    new MapStatus(blockManager.blockManagerId, compressedSizes)
  } catch {
    case e: Exception =>
      // on error, revert the earlier writes and close the writers
      if (shuffle != null && shuffle.writers != null) {
        for (writer <- shuffle.writers) {
          writer.revertPartialWrites()
          writer.close()
        }
      }
      throw e
  } finally {
    // release the writers
    if (shuffle != null && shuffle.writers != null) {
      try {
        shuffle.releaseWriters(success)
      } catch {
        case e: Exception => logError("Failed to release shuffle writers", e)
      }
    }
    // run the registered completion callbacks, generally cleanup work
    context.executeOnCompleteCallbacks()
  }
}

It iterates over each record, determines its bucketId from the key, and then writes the record through that bucket's writer.
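For reference, a simplified sketch of how a HashPartitioner turns a key into a bucketId (the real implementation also treats null keys specially):

def getPartition(key: Any): Int = {
  val rawMod = key.hashCode % numPartitions
  if (rawMod < 0) rawMod + numPartitions else rawMod   // keep the result non-negative
}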

Let's take a look at ShuffleBlockManager's forMapTask method.

def forMapTask(shuffleId: Int, mapId: Int, numBuckets: Int, serializer: Serializer) = {
  new ShuffleWriterGroup {
    shuffleStates.putIfAbsent(shuffleId, new ShuffleState(numBuckets))
    private val shuffleState = shuffleStates(shuffleId)
    private var fileGroup: ShuffleFileGroup = null
    val writers: Array[BlockObjectWriter] = if (consolidateShuffleFiles) {
      fileGroup = getUnusedFileGroup()
      Array.tabulate[BlockObjectWriter](numBuckets) { bucketId =>
        val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
        // pick a file from an existing file group: one file per bucket, i.e. the data
        // destined for the same reduce task is written into the same file
        blockManager.getDiskWriter(blockId, fileGroup(bucketId), serializer, bufferSize)
      }
    } else {
      Array.tabulate[BlockObjectWriter](numBuckets) { bucketId =>
        // generate one file per blockId: number of maps * number of reduces files in total
        val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
        val blockFile = blockManager.diskBlockManager.getFile(blockId)
        if (blockFile.exists) {
          if (blockFile.delete()) {
            logInfo(s"Removed existing shuffle file $blockFile")
          } else {
            logWarning(s"Failed to remove existing shuffle file $blockFile")
          }
        }
        blockManager.getDiskWriter(blockId, blockFile, serializer, bufferSize)
      }
    }
  }
}

1. The map side's intermediate results are written to the local disk, not to memory.

2. By default the map side produces M*R intermediate files (M = number of maps, R = number of reduces). After setting spark.shuffle.consolidateFiles to true, the records destined for the same reduce task are written into the same file according to their bucketId, leaving only R files.

3. With consolidated files there is one file per reduce, and the write start position of each map is also recorded. To locate a map's output you first find the file via the reduceId, then look up the starting offset via the mapId; length = (mapId + 1).offset - (mapId).offset, which determines a FileSegment(file, offset, length). A sketch of this lookup follows the list.

4. Finally, once everything is stored, a new MapStatus(blockManager.blockManagerId, compressedSizes) is returned, i.e. the BlockManagerId and the block sizes are returned together.
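A small sketch of the lookup described in point 3; segmentFor and the offsets array are hypothetical stand-ins for the bookkeeping the shuffle file group actually keeps:

// offsets(i) = write start position of map i inside the per-reduce file,
// with one extra trailing entry marking the end of the last map's data
def segmentFor(file: java.io.File, offsets: Array[Long], mapId: Int): FileSegment = {
  val offset = offsets(mapId)
  val length = offsets(mapId + 1) - offset
  new FileSegment(file, offset, length)
}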

My personal take: this shuffle mechanism is not that different from Hadoop's, and an engine like Tez may well catch up with Spark's speed. Let's wait and see.

How the shuffle data is fetched

After the ShuffleMapTask finishes, we finally arrive at DAGScheduler's handleTaskCompletion method.

case smt: ShuffleMapTask =>
  val status = event.result.asInstanceOf[MapStatus]
  val execId = status.location.executorId
  if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
    logInfo("Ignoring possibly bogus ShuffleMapTask completion from "
