At a high level, the shuffle in Spark is a process in which the map output is written to local files, the reduce side reads those files, and the reduce operation is then performed.
So, here's the question:
How does the reducer know where its input is?
First of all, after writing its output file, the mapper can certainly provide information about that output. In Spark, this information is represented by MapStatus:
```scala
private[spark] sealed trait MapStatus {
  /** Location where this task ran. */
  def location: BlockManagerId

  /** Estimated size of the block for the given reduce task, in bytes. */
  def getSizeForBlock(reduceId: Int): Long
}
```
When a ShuffleMapTask finishes executing, the MapStatus is passed to the driver as the task's execution result: ShuffleMapTask's runTask method is declared to return a MapStatus.
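To make this contract concrete, here is a minimal self-contained sketch. All types here are simplified stand-ins for illustration, not Spark's real classes: a toy "map task" records where its output lives and how big each per-reduce block is, which are exactly the two facts a real ShuffleMapTask reports back through its MapStatus.

```scala
// Hypothetical stand-ins for Spark's BlockManagerId and MapStatus (illustration only).
case class BlockManagerId(executorId: String, host: String, port: Int)

trait MapStatus {
  def location: BlockManagerId
  def getSizeForBlock(reduceId: Int): Long
}

// A toy "map task": after writing its output it returns a MapStatus describing
// where the output lives and the size of the block destined for each reducer.
class ToyShuffleMapTask(localBmId: BlockManagerId) {
  def runTask(blockSizes: Array[Long]): MapStatus = new MapStatus {
    val location: BlockManagerId = localBmId
    def getSizeForBlock(reduceId: Int): Long = blockSizes(reduceId)
  }
}

val status = new ToyShuffleMapTask(BlockManagerId("exec-1", "host-a", 7337))
  .runTask(Array(100L, 0L, 42L))
```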
If a reducer obtains the MapStatuses relevant to it from the driver, it knows which BlockManagers store the map output it needs.
However, the following questions remain:
1. How does the driver get the MapStatus?
2. How does the reducer get the MapStatus?
3. How does the reducer get the map output based on the MapStatus?
How does the driver get the MapStatus?
First, the executor passes the MapStatus to the driver as the task's execution result, via the statusUpdate method:
```scala
override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer) {
  val msg = StatusUpdate(executorId, taskId, state, data)
  driver match {
    case Some(driverRef) => driverRef.send(msg)
    case None => logWarning(s"Drop $msg because has not yet connected to driver")
  }
}
```
After the DriverEndpoint receives the StatusUpdate message, it calls the TaskScheduler's statusUpdate method:
```scala
case StatusUpdate(executorId, taskId, state, data) =>
  scheduler.statusUpdate(taskId, state, data.value)
```
Then, after a long call chain, DAGScheduler's handleTaskCompletion method is invoked, which matches on the type of the task:
```scala
case smt: ShuffleMapTask =>
```
Many operations are performed inside this match; the shuffle-related ones are as follows:
```scala
val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
updateAccumulators(event)
val status = event.result.asInstanceOf[MapStatus]
val execId = status.location.executorId
if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
  logInfo("Ignoring possibly bogus ShuffleMapTask completion from " + execId)
} else {
  shuffleStage.addOutputLoc(smt.partitionId, status)
}
```
The key point is that the output location is added to the ShuffleMapStage's outputLocs, an array of MapStatus held by the ShuffleMapStage. Once every task of this stage has completed, the MapStatuses of all the stage's tasks are registered with the MapOutputTracker:
```scala
mapOutputTracker.registerMapOutputs(
  shuffleStage.shuffleDep.shuffleId,
  shuffleStage.outputLocs.map(list => if (list.isEmpty) null else list.head).toArray,
  changeEpoch = true)
```
MapOutputTracker, like BlockManager, has a master-worker structure: the workers request information from the master via RPC.
In this way, the MapStatus information travels from the executor to the driver and is eventually registered with the MapOutputTracker.
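The driver-side registry can be sketched with a toy stand-in for MapOutputTrackerMaster (all names here are simplified placeholders; the real class also handles epochs, serialization, and RPC): the driver registers one MapStatus per map task under a shuffleId, and a reducer later asks for the (location, size) of every map's block for its reduceId.

```scala
import scala.collection.mutable

// Hypothetical simplified types, for illustration only.
case class BlockManagerId(executorId: String, host: String, port: Int)
case class MapStatus(location: BlockManagerId, blockSizes: Array[Long])

// Toy stand-in for MapOutputTrackerMaster: shuffleId -> one MapStatus per map task.
class ToyMapOutputTracker {
  private val mapStatuses = mutable.Map.empty[Int, Array[MapStatus]]

  // Called on the driver once every task of the ShuffleMapStage has finished.
  def registerMapOutputs(shuffleId: Int, statuses: Array[MapStatus]): Unit =
    mapStatuses(shuffleId) = statuses

  // What a reducer asks for: (location, size) of every map's block for reduceId.
  def getServerStatuses(shuffleId: Int, reduceId: Int): Array[(BlockManagerId, Long)] =
    mapStatuses(shuffleId).map(s => (s.location, s.blockSizes(reduceId)))
}

val tracker = new ToyMapOutputTracker
val bmA = BlockManagerId("exec-1", "host-a", 7337)
val bmB = BlockManagerId("exec-2", "host-b", 7337)
tracker.registerMapOutputs(0,
  Array(MapStatus(bmA, Array(10L, 20L)), MapStatus(bmB, Array(0L, 5L))))
val statuses = tracker.getServerStatuses(0, reduceId = 1)
```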
How does the reducer get the MapStatus?
First, transformations that cause a shuffle generate special RDDs, namely ShuffledRDD and CoGroupedRDD; the reduce process is triggered when such an RDD's compute method is called.
Let's take ShuffledRDD as an example:
```scala
override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
    .read()
    .asInstanceOf[Iterator[(K, C)]]
}
```
Currently, ShuffleManager's getReader method only returns readers of type HashShuffleReader, which is the only subclass of ShuffleReader.
Its read method calls BlockStoreShuffleFetcher's fetch method to obtain the map output:
```scala
val iter = BlockStoreShuffleFetcher.fetch(handle.shuffleId, startPartition, context, ser)
```
This fetch method asks the MapOutputTracker for the locations and sizes of the map outputs; the MapOutputTracker's getServerStatuses method returns the MapStatus information corresponding to this reducer:
```scala
// statuses: Array[(BlockManagerId, Long)]
// The locations and sizes of the map outputs for this shuffleId and reduceId
val statuses = SparkEnv.get.mapOutputTracker.getServerStatuses(shuffleId, reduceId)
```
How does the reducer get the map output based on the MapStatus?
The type of statuses is Array[(BlockManagerId, Long)]: exactly the two pieces of information that a MapStatus provides.
The fetch method uses the information obtained from the MapStatuses to assemble ShuffleBlockIds:
```scala
val splitsByAddress = new HashMap[BlockManagerId, ArrayBuffer[(Int, Long)]]
for (((address, size), index) <- statuses.zipWithIndex) {
  splitsByAddress.getOrElseUpdate(address, ArrayBuffer()) += ((index, size))
}

val blocksByAddress: Seq[(BlockManagerId, Seq[(BlockId, Long)])] = splitsByAddress.toSeq.map {
  case (address, splits) =>
    (address, splits.map(s => (ShuffleBlockId(shuffleId, s._1, reduceId), s._2)))
}
```
Note that the statuses array contains an entry for every map output, even for maps that produced no data for this reduce. The entry at index i is the output information of the map whose mapId is i. That is why splitsByAddress is built with statuses.zipWithIndex: to recover the mapId. The process of assembling blocksByAddress then generates the ShuffleBlockIds:
```scala
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
  def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
}
```
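Putting the last two snippets together, here is a self-contained sketch of how statuses become blocksByAddress, using toy data and simplified types (BlockManagerId reduced to a host name, ShuffleBlockId without the BlockId parent):

```scala
import scala.collection.mutable.{ArrayBuffer, HashMap}

// Hypothetical simplified types, for illustration only.
case class BlockManagerId(host: String)
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) {
  def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
}

// statuses(i) = (location, size) of map i's output for this reducer.
val statuses: Array[(BlockManagerId, Long)] =
  Array((BlockManagerId("a"), 10L), (BlockManagerId("b"), 0L), (BlockManagerId("a"), 5L))
val shuffleId = 0
val reduceId = 3

// Group by address; zipWithIndex recovers the mapId, which is just the array index.
val splitsByAddress = new HashMap[BlockManagerId, ArrayBuffer[(Int, Long)]]
for (((address, size), mapId) <- statuses.zipWithIndex) {
  splitsByAddress.getOrElseUpdate(address, ArrayBuffer()) += ((mapId, size))
}

val blocksByAddress: Seq[(BlockManagerId, Seq[(ShuffleBlockId, Long)])] =
  splitsByAddress.toSeq.map { case (address, splits) =>
    (address, splits.map(s => (ShuffleBlockId(shuffleId, s._1, reduceId), s._2)).toSeq)
  }
```

Host "a" ends up responsible for the blocks shuffle_0_0_3 (10 bytes) and shuffle_0_2_3 (5 bytes), while host "b" holds the empty shuffle_0_1_3.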
This blocksByAddress is used to construct a ShuffleBlockFetcherIterator, which requests the corresponding shuffle blocks from the BlockManagers. Here is the code in the fetch method that constructs it:
```scala
new ShuffleBlockFetcherIterator(
  context,
  SparkEnv.get.blockManager.shuffleClient,
  blockManager,
  blocksByAddress,
  serializer,
  // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
  SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024)
```
ShuffleBlockFetcherIterator is an iterator whose primary constructor calls the initialize method. The main job of initialize is to generate fetch requests for the shuffle blocks and send them:
```scala
private[this] def initialize(): Unit = {
  // Add a task completion callback (called in both success case and failure case) to cleanup.
  context.addTaskCompletionListener(_ => cleanup())

  // Separate the local and remote blocks.
  val remoteRequests = splitLocalRemoteBlocks()
  // Add the remote requests into our queue in a random order
  fetchRequests ++= Utils.randomize(remoteRequests)

  // Send out requests for remote blocks, up to maxBytesInFlight
  while (fetchRequests.nonEmpty &&
      (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
    sendRequest(fetchRequests.dequeue())
  }

  val numFetches = remoteRequests.size - fetchRequests.size
  logInfo("Started " + numFetches + " remote fetches in " + Utils.getUsedTimeMs(startTime))

  // Get the local blocks
  fetchLocalBlocks()
  logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
}
```
It distinguishes remote blocks from local ones: a local block is one managed by the current executor's own BlockManager, which is determined by whether the block's BlockManagerId equals the local BlockManagerId.
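The local/remote split can be sketched like this (toy types and a hypothetical helper name; the real splitLocalRemoteBlocks additionally caps the size of each remote request so several can be in flight under maxBytesInFlight):

```scala
// Hypothetical simplified type, for illustration only.
case class BlockManagerId(executorId: String, host: String, port: Int)

// A block is "local" iff its address equals this executor's own BlockManagerId;
// everything else must be fetched over the network.
def splitLocalRemote[B](
    localBmId: BlockManagerId,
    blocksByAddress: Seq[(BlockManagerId, Seq[B])]): (Seq[B], Seq[(BlockManagerId, Seq[B])]) = {
  val (local, remote) = blocksByAddress.partition { case (address, _) => address == localBmId }
  (local.flatMap(_._2), remote)
}

val me = BlockManagerId("exec-1", "host-a", 7337)
val other = BlockManagerId("exec-2", "host-b", 7337)
val (localBlocks, remoteBlocks) =
  splitLocalRemote(me, Seq(me -> Seq("b1", "b2"), other -> Seq("b3")))
```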
The fetchLocalBlocks process is simple: just ask the local BlockManager for the data:
```scala
val buf = blockManager.getBlockData(blockId)
```
Getting a block from a remote node is more involved and requires the help of a ShuffleClient:
```scala
shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
  new BlockFetchingListener {
    ...
  })
```
This ShuffleClient is provided by the BlockManager, and it comes in two kinds:
```scala
private[spark] val shuffleClient = if (externalShuffleServiceEnabled) {
  val transConf = SparkTransportConf.fromSparkConf(conf, numUsableCores)
  new ExternalShuffleClient(transConf, securityManager,
    securityManager.isAuthenticationEnabled(), securityManager.isSaslEncryptionEnabled())
} else {
  blockTransferService
}
```
By default, the BlockTransferService is used. It, too, comes in two kinds:
```scala
val blockTransferService =
  conf.get("spark.shuffle.blockTransferService", "netty").toLowerCase match {
    case "netty" =>
      new NettyBlockTransferService(conf, securityManager, numUsableCores)
    case "nio" =>
      new NioBlockTransferService(conf, securityManager)
  }
```
NettyBlockTransferService is used by default. It starts a NettyBlockRpcServer, which provides the transport service for blocks; the ShuffleClient contacts it via its host and port.
After a chain of calls, the server receives a message of type OpenBlocks and handles it:
```scala
message match {
  case openBlocks: OpenBlocks =>
    val blocks: Seq[ManagedBuffer] =
      openBlocks.blockIds.map(BlockId.apply).map(blockManager.getBlockData)
    val streamId = streamManager.registerStream(blocks.iterator)
    logTrace(s"Registered streamId $streamId with ${blocks.size} buffers")
    responseContext.onSuccess(new StreamHandle(streamId, blocks.size).toByteArray)
  ...
}
```
Here it calls BlockDataManager's getBlockData method to fetch the blocks. BlockManager extends BlockDataManager, and it registers itself with the BlockTransferService.
This registration happens in BlockManager's initialize method:
```scala
def initialize(appId: String): Unit = {
  // Register ourselves with the BlockTransferService so that it can access blocks directly
  blockTransferService.init(this)
  ...
}
```
So, in the end, BlockManager's getBlockData method is the one that gets called:
```scala
override def getBlockData(blockId: BlockId): ManagedBuffer = {
  if (blockId.isShuffle) {
    shuffleManager.shuffleBlockResolver.getBlockData(blockId.asInstanceOf[ShuffleBlockId])
  } else {
    val blockBytesOpt = doGetLocal(blockId, asBlockResult = false)
      .asInstanceOf[Option[ByteBuffer]]
    if (blockBytesOpt.isDefined) {
      val buffer = blockBytesOpt.get
      new NioManagedBuffer(buffer)
    } else {
      throw new BlockNotFoundException(blockId.toString)
    }
  }
}
```
So for a ShuffleBlockId, it calls the ShuffleBlockResolver to get the block's data.
This ShuffleBlockResolver is where things get interesting.
Spark has two shuffle implementations, sort and hash, handled by SortShuffleManager and HashShuffleManager respectively. In the hash approach, each map writes one output file per reduce; in the sort approach, each map writes a single file. ShuffleBlockResolver must therefore handle both cases, and indeed there are two kinds of ShuffleBlockResolver: HashShuffleManager uses FileShuffleBlockResolver, and SortShuffleManager uses IndexShuffleBlockResolver.
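A concrete consequence of the two layouts: with M map tasks and R reduce tasks, hash shuffle (without consolidation) writes M × R shuffle files, while sort shuffle writes 2 × M (one data file plus one index file per map). A trivial sketch of the arithmetic:

```scala
// Shuffle file counts for m map tasks and r reduce tasks
// (hash shuffle without consolidation vs. sort shuffle).
def hashShuffleFiles(m: Int, r: Int): Int = m * r // one file per (map, reduce) pair
def sortShuffleFiles(m: Int): Int = 2 * m         // one data + one index file per map
```

For example, 1000 maps and 1000 reduces means a million files under hash shuffle but only 2000 under sort shuffle, which is the main motivation for the sort-based layout.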
The difference between the two ShuffleBlockResolvers reflects the difference between how reducers read the map output files under hash and sort shuffle.
How do hash and sort shuffle differ when reading the map output files?
HashShuffleManager uses FileShuffleBlockResolver, whose getBlockData method takes different paths depending on whether consolidated shuffle files are enabled. Consolidation is disabled by default, in which case it executes:
```scala
override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
  if (consolidateShuffleFiles) {
    ...
  } else {
    val file = blockManager.diskBlockManager.getFile(blockId)
    new FileSegmentManagedBuffer(transportConf, file, 0, file.length)
  }
}
```
It asks the DiskBlockManager directly for the file corresponding to the BlockId, then wraps it in a FileSegmentManagedBuffer whose offset starts at 0 and whose length is file.length, i.e. the entire file.
SortShuffleManager uses IndexShuffleBlockResolver. In sort shuffle, each map writes one data file and one index file, and the data file contains data destined for multiple reducers; therefore the index file must be read first to determine which part of the data file a given reducer should read:
```scala
override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
  // The block is actually going to be a range of a single map output file for this map, so
  // find out the consolidated file, then the offset within that from our index
  val indexFile = getIndexFile(blockId.shuffleId, blockId.mapId)
  val in = new DataInputStream(new FileInputStream(indexFile))
  try {
    ByteStreams.skipFully(in, blockId.reduceId * 8)
    val offset = in.readLong()
    val nextOffset = in.readLong()
    new FileSegmentManagedBuffer(
      transportConf,
      getDataFile(blockId.shuffleId, blockId.mapId),
      offset,
      nextOffset - offset)
  } finally {
    in.close()
  }
}
```
The index file is written as a sequence of Long values, where the i-th value is the offset within the data file at which the data for the i-th reducer begins. Consequently, the FileSegmentManagedBuffer it returns covers not the whole file but a single segment of it, unlike in hash mode.
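The index-file arithmetic can be simulated end to end in memory with plain java.io streams (a sketch under the layout just described, not Spark's actual code): write cumulative offsets as longs, then recover reducer i's segment by skipping i * 8 bytes and reading two consecutive longs.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Build a toy index "file" in memory: a run of longs where the i-th value is the
// offset in the data file at which reducer i's segment starts (cumulative sums).
val segmentLengths = Array(100L, 0L, 42L) // bytes destined for reducers 0, 1, 2
val bytes = new ByteArrayOutputStream()
val out = new DataOutputStream(bytes)
var offset = 0L
out.writeLong(0L)
for (len <- segmentLengths) { offset += len; out.writeLong(offset) }
out.close()

// Recover reducer 2's segment the way getBlockData does:
// skip reduceId * 8 bytes, then read the segment's start and end offsets.
val reduceId = 2
val in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray))
in.skipBytes(reduceId * 8)
val start = in.readLong()       // where reducer 2's data begins
val nextOffset = in.readLong()  // where the next segment would begin
in.close()
val length = nextOffset - start // the FileSegmentManagedBuffer would cover [start, start + length)
```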
Information transfer during the shuffle process