Spark Core Source Analysis: The Shuffle Write Process in Detail



Blog Address: http://blog.csdn.net/yueqian_zhu/


Shuffle is a fairly complicated process, so it is worth analyzing the internal logic of the write path.

ShuffleManager comes in two flavors: SortShuffleManager and HashShuffleManager.

First, SortShuffleManager

Each ShuffleMapTask does not generate a separate file for each reducer; instead, it writes all of its results to a single local file, along with an index file that reducers use to locate the data they need to process. The immediate benefits of avoiding a large number of files are lower memory usage and the low latency of sequential disk I/O.

When writing partition data, it first sorts the data in different ways depending on the situation. At a minimum, the data is sorted by reduce partition, so that all of a map task's data shuffled to the various reduce partitions can be written to one and the same disk file, with a simple offset marking where each reduce partition's data begins within that file. A map task thus only needs to generate a single shuffle file, avoiding the enormous file counts that HashShuffleManager (described below) can run into.
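To put numbers on it: with 1,000 map tasks and 1,000 reduce partitions, a naive one-file-per-reducer scheme produces 1,000 x 1,000 = 1,000,000 shuffle files, whereas the sort-based approach produces one data file plus one index file per map task, i.e. 2 x 1,000 = 2,000 files.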


/** Get a writer for a given partition. Called on executors by map tasks. */
override def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext)
    : ShuffleWriter[K, V] = {
  val baseShuffleHandle = handle.asInstanceOf[BaseShuffleHandle[K, V, _]]
  shuffleMapNumber.putIfAbsent(baseShuffleHandle.shuffleId, baseShuffleHandle.numMaps)
  new SortShuffleWriter(
    shuffleBlockResolver, baseShuffleHandle, mapId, context)
}

shuffleMapNumber is a hash map keyed by shuffleId with numMaps as the value.

SortShuffleWriter provides the write interface for the actual data writes, and that write path uses shuffleBlockResolver to deal with the underlying files.


Turning to SortShuffleWriter, write is invoked as follows:

writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
The argument to write is the partition iterator produced by rdd.iterator, which in turn runs the RDD's compute method.
/** Write a bunch of records to this task's output */
override def write(records: Iterator[Product2[K, V]]): Unit = {
  if (dep.mapSideCombine) {
    require(dep.aggregator.isDefined, "Map-side combine without aggregator specified!")
    sorter = new ExternalSorter[K, V, C](
      dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
    sorter.insertAll(records)
  } else {
    // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
    // care whether the keys get sorted in each partition; that will be done on the reduce side
    // if the operation being run is sortByKey.
    sorter = new ExternalSorter[K, V, V](
      None, Some(dep.partitioner), None, dep.serializer)
    sorter.insertAll(records)
  }

  // Don't bother including the time to open the merged output file in the shuffle write time,
  // because it just opens a single file, so is typically too fast to measure accurately
  // (see SPARK-3570).
  val outputFile = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
  val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
  val partitionLengths = sorter.writePartitionedFile(blockId, context, outputFile)
  shuffleBlockResolver.writeIndexFile(dep.shuffleId, mapId, partitionLengths)

  mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
}

As you can see, when mapSideCombine is required, the aggregator and keyOrdering are passed into the ExternalSorter; otherwise both parameters are set to None. The insertAll method is then called.

def insertAll(records: Iterator[_ <: Product2[K, V]]): Unit = {
  // TODO: stop combining if we find that the reduction factor isn't high
  val shouldCombine = aggregator.isDefined

  if (shouldCombine) {
    // Combine values in-memory first using our AppendOnlyMap
    val mergeValue = aggregator.get.mergeValue
    val createCombiner = aggregator.get.createCombiner
    var kv: Product2[K, V] = null
    val update = (hadValue: Boolean, oldValue: C) => {
      if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
    }
    while (records.hasNext) {
      addElementsRead()
      kv = records.next()
      map.changeValue((getPartition(kv._1), kv._1), update)
      maybeSpillCollection(usingMap = true)
    }
  } else if (bypassMergeSort) {
    // SPARK-4479: Also bypass buffering if merge sort is bypassed to avoid defensive copies
    if (records.hasNext) {
      spillToPartitionFiles(WritablePartitionedIterator.fromIterator(records.map { kv =>
        ((getPartition(kv._1), kv._1), kv._2.asInstanceOf[C])
      }))
    }
  } else {
    // Stick values into our buffer
    while (records.hasNext) {
      addElementsRead()
      val kv = records.next()
      buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
      maybeSpillCollection(usingMap = false)
    }
  }
}
Let's walk through the internal logic:

(1) If shouldCombine, the k-v pairs are recorded in a flat array (default capacity 64 pairs, i.e. an array of length 64 * 2) laid out as key0, value0, key1, value1, key2, value2, ... The map.changeValue method repeatedly applies the mergeValue function to update the value stored at the key's position in the array. Once the number of k-v pairs reaches 0.7 of the array's capacity, the array grows automatically.
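To make the flat key/value layout concrete, here is a minimal hypothetical sketch of such a structure. It is not Spark's actual AppendOnlyMap (whose probing, null-key handling, and sizing logic are more involved), but it shows the same changeValue-driven update and growth at a 0.7 load factor:

// Minimal sketch of an AppendOnlyMap-like structure: keys and values are
// stored alternately in one flat array (key at 2*pos, value at 2*pos+1).
class FlatCombineMap[K, V](initialCapacity: Int = 64) {
  private var capacity = initialCapacity
  private var data = new Array[AnyRef](2 * capacity)
  private var curSize = 0

  // Linear-probe for `key`; apply `updateFunc` to the existing value (or to
  // none), mirroring how map.changeValue drives mergeValue/createCombiner.
  def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
    var pos = (key.hashCode & 0x7fffffff) % capacity
    while (true) {
      val curKey = data(2 * pos)
      if (curKey == null) {                          // empty slot: insert
        val newValue = updateFunc(false, null.asInstanceOf[V])
        data(2 * pos) = key.asInstanceOf[AnyRef]
        data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
        curSize += 1
        if (curSize > 0.7 * capacity) grow()         // grow at the 0.7 load factor
        return newValue
      } else if (curKey == key) {                    // existing key: merge values
        val newValue = updateFunc(true, data(2 * pos + 1).asInstanceOf[V])
        data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
        return newValue
      } else {
        pos = (pos + 1) % capacity                   // collision: probe next slot
      }
    }
    throw new IllegalStateException("unreachable")
  }

  // Double the capacity and re-insert every existing pair (the "rehash").
  private def grow(): Unit = {
    val old = data
    capacity *= 2
    data = new Array[AnyRef](2 * capacity)
    curSize = 0
    var i = 0
    while (2 * i < old.length) {
      if (old(2 * i) != null) {
        val k = old(2 * i).asInstanceOf[K]
        val v = old(2 * i + 1).asInstanceOf[V]
        changeValue(k, (_, _) => v)
      }
      i += 1
    }
  }
}

With a FlatCombineMap[String, Int], counting words would look like m.changeValue(word, (had, old) => if (had) old + 1 else 1).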

After that, maybeSpillCollection is called. It first decides whether a spill is needed at all, based on the spillingEnabled flag (disabling it risks OOM; in fact the array growth described above already carries some OOM risk), and the check only fires when the number of elements read is a multiple of 32 and the collection currently occupies more memory than the threshold (initially 5MB). In that case it goes to ShuffleMemoryManager to request memory (ShuffleMemoryManager enforces an overall cap and records what each shuffle task has been granted); the amount requested is twice the current usage minus the threshold. If the request succeeds, the threshold is raised by the granted amount. If current usage is still greater than the new threshold, a spill is mandatory; otherwise memory is considered sufficient. After an actual spill, the memory just requested from ShuffleMemoryManager is released and the threshold is restored to its initial value (5MB).
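The decision logic can be sketched roughly as follows. This is a simplification of Spark 1.x's Spillable.maybeSpill, with spillingEnabled, elementsRead, shuffleMemoryManager, spill, and releaseMemoryForThisThread assumed to be supplied by the surrounding trait:

// Sketch of the spill decision (simplified; names approximate the real source).
private[this] var myMemoryThreshold = 5L * 1024 * 1024   // initial threshold: 5MB

protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
  if (spillingEnabled && elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
    // Ask the pool for up to double our current usage, minus what we already hold
    val amountToRequest = 2 * currentMemory - myMemoryThreshold
    val granted = shuffleMemoryManager.tryToAcquire(amountToRequest)
    myMemoryThreshold += granted
    if (myMemoryThreshold <= currentMemory) {
      // Granted too little to keep growing: spill the collection, then give the
      // memory back and let the threshold fall back to its initial 5MB
      spill(collection)
      releaseMemoryForThisThread()
      return true
    }
  }
  false
}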

The spill method: if the number of partitions is <= 200 and no map-side combine is set, spillToPartitionFiles is called; otherwise spillToMergeableFile is called. Both are discussed below.
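For reference, the flag driving this dispatch looks roughly like the following in the Spark 1.x ExternalSorter (the 200 default comes from spark.shuffle.sort.bypassMergeThreshold; treat the exact wording as approximate):

// Merge-sort can be bypassed only when the partition count is small and
// there is no map-side aggregation or ordering requirement.
private val bypassMergeThreshold =
  conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
private val bypassMergeSort =
  numPartitions <= bypassMergeThreshold && aggregator.isEmpty && ordering.isEmpty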

So in this branch, where shouldCombine is true, it is spillToMergeableFile that gets called.

It is important to note that before a spill, the data is held in an in-memory structure, and there are two to choose from: a map and a buffer. Since shouldCombine means existing values are likely to be updated by calls to mergeValue and the like, the map is used.

(2) If bypassMergeSort (partition count <= 200, and no map-side combine is set), spillToPartitionFiles is called. This mode writes the per-partition files directly; there is no in-memory cache to speak of.

(3) If neither shouldCombine nor bypassMergeSort applies: since no merge operation is needed, the buffer is used directly as the pre-spill cache structure. maybeSpillCollection is then called just as above.


Take a look at the spillToMergeableFile method:

(1) Create the spill file to write in a subdirectory under localDirs.

(2) Sort the cached data, by partitionId first and then by key within each partition, so the layout looks like ((partitionId_0, key_0), value_0), ((partitionId_0, key_1), value_1) ... ((partitionId_100, key_100), value_100). (A comparator for this ordering is sketched a little further below.)

(3) Write the data to the file incrementally, syncing once every 10,000 records, and keep a SpilledFile structure in memory describing the spill.

That is, each spill of a map task generates one file (a single map task may spill multiple times), and each such file is internally sorted.

In this way, a spill is completed.
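The (partitionId, key) ordering from step (2) can be expressed as a small comparator. This is an illustrative sketch rather than Spark's own implementation (which lives in the sorter's collection classes):

import java.util.Comparator

// Order records by partition id first, then by key within the same partition,
// so that one sequential pass writes each partition's data contiguously.
def partitionKeyComparator[K](keyComparator: Comparator[K]): Comparator[((Int, K), Any)] =
  new Comparator[((Int, K), Any)] {
    override def compare(a: ((Int, K), Any), b: ((Int, K), Any)): Int = {
      val byPartition = Integer.compare(a._1._1, b._1._1)
      if (byPartition != 0) byPartition
      else keyComparator.compare(a._1._2, b._1._2)
    }
  }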


Take a look at the spillToPartitionFiles method:

Each map task opens a separate file for each reduce partition, and nothing needs to be sorted.
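A minimal sketch of that idea, assuming a hypothetical diskWriterFor helper standing in for the DiskBlockManager/DiskBlockObjectWriter machinery:

// Sketch: one writer per reduce partition, each record routed straight to its
// partition's file; nothing is sorted or buffered beyond the writers themselves.
val partitionWriters = Array.tabulate(numPartitions)(p => diskWriterFor(p))
while (records.hasNext) {
  val kv = records.next()
  partitionWriters(getPartition(kv._1)).write(kv._1, kv._2)
}
partitionWriters.foreach(_.close())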


That completes the insertAll method; next comes what happens after it returns inside write.

A data file is created from the shuffleId + mapId information, and the writePartitionedFile method is called:

(1) If bypassMergeSort was in effect, i.e. spillToPartitionFiles was the path taken, the remaining buffered records are written out to each reduce partition's file, and then all the per-partition output files are merged into a single data file.

(2) If there is no SpilledFile information in memory, i.e. all the data is still in memory, it is written directly to the data file.

(3) Otherwise, in the most complex case, all of the map task's output files are consolidated, partition by partition, into one data file, in roughly the format (partition0: all of this map task's data for partition 0), (partition1: all of this map task's data for partition 1), ...


Note that in cases (2) and (3), the size of each partition is recorded as it is written into the data file.

Then an index file corresponding to the data file is created, recording the starting offset of each partition within the data file. As you would expect, once each partition's offset is recorded, a reader knows exactly which part of the data file belongs to each partition.
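The index layout is simple enough to sketch end to end: numPartitions + 1 cumulative offsets written as longs, so partition i occupies the byte range [offset(i), offset(i+1)) of the data file. The following is a simplified model of what IndexShuffleBlockResolver does, not its actual code:

import java.io._

object IndexFileSketch {
  // Write cumulative offsets: offset(0) = 0, offset(i+1) = offset(i) + length(i).
  def writeIndexFile(indexFile: File, partitionLengths: Array[Long]): Unit = {
    val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexFile)))
    try {
      var offset = 0L
      out.writeLong(offset)
      for (length <- partitionLengths) {
        offset += length
        out.writeLong(offset)
      }
    } finally {
      out.close()
    }
  }

  // A reducer fetching partition `reduceId` seeks to entry `reduceId`, reads
  // two longs, and knows exactly which byte range of the data file to read.
  def partitionRange(indexFile: File, reduceId: Int): (Long, Long) = {
    val in = new DataInputStream(new FileInputStream(indexFile))
    try {
      in.skipBytes(reduceId * 8)
      (in.readLong(), in.readLong())
    } finally {
      in.close()
    }
  }
}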

Finally, the shuffleServerId (which records host, port, and executorId) and the per-partition file lengths are packaged into a MapStatus and returned.


Second, HashShuffleManager

Spark creates a bucket for each reducer in each mapper and puts the RDD's computed results into the buckets. Each bucket has a DiskObjectWriter, and each writer has a buffer; the map output is written to file through these writers. This means the map output's key-value pairs are written to disk one by one rather than being held entirely in memory and flushed to disk in one go, which greatly reduces memory pressure. Of course, the number of maps running at the same time is limited by resources, so the memory needed is roughly cores * reducer count * buffer size. When both the reduce count and the map count are large, however, the required memory overhead is staggering.
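As a quick example of that arithmetic: an executor with 16 cores running maps concurrently, 1,000 reduce partitions, and the default 32KB file buffer needs roughly 16 x 1,000 x 32KB = 512MB just for write buffers.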

The write path of HashShuffleManager is comparatively simple.

/** Write a bunch of records to this task's output */
override def write(records: Iterator[Product2[K, V]]): Unit = {
  val iter = if (dep.aggregator.isDefined) {
    if (dep.mapSideCombine) {
      dep.aggregator.get.combineValuesByKey(records, context)
    } else {
      records
    }
  } else {
    require(!dep.mapSideCombine, "Map-side combine without aggregator specified!")
    records
  }

  for (elem <- iter) {
    val bucketId = dep.partitioner.getPartition(elem._1)
    shuffle.writers(bucketId).write(elem._1, elem._2)
  }
}
(1) If mapSideCombine is defined, the k-v pairs are merged, much like the shouldCombine branch of insertAll; otherwise nothing is done to them.

(2) Every k-v pair is then mapped to the partition it should be output to, and written to that partition's file.

This mode naturally requires no sorting, merging, or other complex operations, because in the end each map task outputs one file per reduce partition.

Finally, a MapStatus structure is assembled and returned, just as before.

At this point, shuffle's write process is complete.

The next section describes the reading process for shuffle.

