Information transfer in the process of shuffle

Shuffle in Spark is, roughly, a process in which the map output is written to local files, the reduce side reads those files, and then the reduce operation is performed.
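
For instance (a minimal sketch, assuming a running SparkContext named sc), any shuffle-inducing pair-RDD operation goes through exactly this process:

val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
  .reduceByKey(_ + _) // shuffle boundary: map output is written to local files
counts.collect()      // the reduce side fetches those files and reduces them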

So, here's the question:

How does the reducer know where its input is?

First, after writing its output file, the mapper can certainly provide information about that output. In Spark, this information is represented by MapStatus:

private[spark] sealed trait MapStatus {
  def location: BlockManagerId
  def getSizeForBlock(reduceId: Int): Long
}
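
These two pieces of information are all a reducer ultimately needs. As an illustration (a hypothetical helper, not Spark code), one could use them to estimate how many bytes a given reducer will pull from each BlockManager:

// Hypothetical: sum the block sizes this reducer will fetch, per location
def bytesToFetch(statuses: Seq[MapStatus], reduceId: Int): Map[BlockManagerId, Long] =
  statuses
    .map(s => (s.location, s.getSizeForBlock(reduceId)))
    .groupBy(_._1)
    .mapValues(_.map(_._2).sum)
    .toMap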

When a ShuffleMapTask finishes executing, its MapStatus is passed back to the driver as the task's result. The declaration of ShuffleMapTask's runTask method (its body is elided here) looks like this:

override def runTask(context: TaskContext): MapStatus = {
  ...
}

If a reducer obtains the MapStatus entries relevant to it from the driver, it knows which BlockManagers store the map output it needs.

However, the following issues still exist:

1. How does the driver get the MapStatus?

2. How does the reducer get the MapStatus?

3. How does the reducer fetch the map output based on the MapStatus?

How does the driver get the MapStatus?

First, the executor wraps the MapStatus as the task's execution result and passes it to the driver via the statusUpdate method:

override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer) {
  val msg = StatusUpdate(executorId, taskId, state, data)
  driver match {
    case Some(driverRef) => driverRef.send(msg)
    case None => logWarning(s"Drop $msg because has not yet connected to driver")
  }
}

After the DriverEndpoint receives the StatusUpdate message, the statusUpdate method of TaskScheduler is called:

case StatusUpdate(executorId, taskId, state, data) =>
  scheduler.statusUpdate(taskId, state, data.value)

Then, after a long call chain, DAGScheduler's handleTaskCompletion method is invoked, which matches on the type of the task:

case smt: ShuffleMapTask =>

This branch performs a lot of operations; the shuffle-related ones are roughly the following:

val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
updateAccumulators(event)
val status = event.result.asInstanceOf[MapStatus]
val execId = status.location.executorId
if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
  logInfo("Ignoring possibly bogus ShuffleMapTask completion from " + execId)
} else {
  shuffleStage.addOutputLoc(smt.partitionId, status)
}

The point is that the output location is added to the ShuffleMapStage's outputLocs, an array of MapStatus held by the ShuffleMapStage. When all tasks of this stage have completed, the MapStatus of every task in the stage is reported to the MapOutputTracker:

mapOutputTracker.registerMapOutputs(
  shuffleStage.shuffleDep.shuffleId,
  shuffleStage.outputLocs.map(list => if (list.isEmpty) null else list.head).toArray,
  changeEpoch = true)

MapOutputTracker, like BlockManager, is a master-worker structure: the workers request information from the master via RPC.
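
Schematically (a simplified sketch: the real MapOutputTrackerWorker caches results and ships the statuses as serialized bytes rather than objects), the worker side boils down to asking the master endpoint:

// Simplified sketch of the worker-side tracker asking the master over RPC
class MapOutputTrackerWorkerSketch(masterRef: RpcEndpointRef) {
  def getStatuses(shuffleId: Int): Array[MapStatus] =
    masterRef.askWithRetry[Array[MapStatus]](GetMapOutputStatuses(shuffleId))
}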

In this way, the MapStatus information is passed from the executor to the driver and is eventually registered with the MapOutputTracker.

How does the reducer get the MapStatus?

First, a transformation that triggers a shuffle generates special RDDs such as ShuffledRDD and CoGroupedRDD; when the compute method of such an RDD is called, the reduce process is triggered.

Let's take ShuffledRDD as an example.

override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
    .read()
    .asInstanceOf[Iterator[(K, C)]]
}

Currently, the getReader method of ShuffleManager only returns readers of type HashShuffleReader, which is the only subclass of ShuffleReader.

Its read method calls BlockStoreShuffleFetcher's fetch method to obtain the map output:

val iter = BlockStoreShuffleFetcher.fetch(handle.shuffleId, startPartition, context, ser)

This fetch method asks the MapOutputTracker for the locations and sizes of the map outputs; the MapOutputTracker's getServerStatuses method returns the MapStatus information corresponding to this reducer:

// statuses: Array[(BlockManagerId, Long)]
// get the location and size of the map outputs corresponding to this shuffleId and reduceId
val statuses = SparkEnv.get.mapOutputTracker.getServerStatuses(shuffleId, reduceId)

How does the reducer get the map output based on the MapStatus?

The type of statuses is Array[(BlockManagerId, Long)], which is exactly the two pieces of information a MapStatus can provide.

The fetch method assembles ShuffleBlockIds from the information obtained from the MapStatus:

val splitsByAddress = new HashMap[BlockManagerId, ArrayBuffer[(Int, Long)]]
for (((address, size), index) <- statuses.zipWithIndex) {
  splitsByAddress.getOrElseUpdate(address, ArrayBuffer()) += ((index, size))
}

val blocksByAddress: Seq[(BlockManagerId, Seq[(BlockId, Long)])] =
  splitsByAddress.toSeq.map { case (address, splits) =>
    (address, splits.map(s => (ShuffleBlockId(shuffleId, s._1, reduceId), s._2)))
  }

Note that this statuses array contains an entry for every map output, even for maps that produced no output for this reduce. The entry at index i of the array describes the output of the map whose mapId is i. That is why splitsByAddress is built with statuses.zipWithIndex: to recover the mapId. The assembly of blocksByAddress then generates the ShuffleBlockIds:

case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
  def name = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
}
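
For example, the block holding map 5's output for reduce 2 in shuffle 0 would be named like this:

ShuffleBlockId(0, 5, 2).name  // "shuffle_0_5_2"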

This blocksByAddress is used to construct a ShuffleBlockFetcherIterator, which requests the corresponding shuffle blocks from the BlockManagers. Here is the code in the fetch method that constructs the ShuffleBlockFetcherIterator:

new ShuffleBlockFetcherIterator(
  context,
  SparkEnv.get.blockManager.shuffleClient,
  blockManager,
  blocksByAddress,
  serializer,
  // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
  SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024)

ShuffleBlockFetcherIterator is an iterator whose primary constructor invokes the initialize method. The main job of initialize is to generate fetch requests for the shuffle blocks and send them out:

private[this] def initialize(): Unit = {
  // Add a task completion callback (called in both success case and failure case) to cleanup.
  context.addTaskCompletionListener(_ => cleanup())

  // Separate the local and remote blocks.
  val remoteRequests = splitLocalRemoteBlocks()
  // Add the remote requests into our queue in a random order
  fetchRequests ++= Utils.randomize(remoteRequests)

  // Send out requests for remote blocks
  while (fetchRequests.nonEmpty &&
    (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
    sendRequest(fetchRequests.dequeue())
  }

  val numFetches = remoteRequests.size - fetchRequests.size
  logInfo("Started " + numFetches + " remote fetches in " + Utils.getUsedTimeMs(startTime))

  // Get the local blocks
  fetchLocalBlocks()
  logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
}

It distinguishes remote blocks from local ones; a local block is one managed by the current executor's BlockManager, which can be judged by whether the block's BlockManagerId equals the local BlockManagerId.
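
In simplified form (the real splitLocalRemoteBlocks additionally groups remote blocks into size-capped FetchRequests), the distinction amounts to:

// Simplified: a block is local iff it lives in this executor's BlockManager
val (localBlocks, remoteBlocks) = blocksByAddress.partition {
  case (address, _) => address == blockManager.blockManagerId
}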

The fetchLocalBlocks process is simple: it just asks the local BlockManager:

val buf = blockManager.getBlockData(blockId)

Fetching blocks from a remote node is a bit more troublesome and requires the help of a ShuffleClient:

shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
  new BlockFetchingListener {
    ...
  })

This ShuffleClient is provided by the BlockManager. There are two kinds of it:

private[spark] val shuffleClient = if (externalShuffleServiceEnabled) {
  val transConf = SparkTransportConf.fromSparkConf(conf, numUsableCores)
  new ExternalShuffleClient(transConf, securityManager, securityManager.isAuthenticationEnabled(),
    securityManager.isSaslEncryptionEnabled())
} else {
  blockTransferService
}

By default, the BlockTransferService is used. It, in turn, comes in two kinds:

val blockTransferService =
  conf.get("spark.shuffle.blockTransferService", "netty").toLowerCase match {
    case "netty" =>
      new NettyBlockTransferService(conf, securityManager, numUsableCores)
    case "nio" =>
      new NioBlockTransferService(conf, securityManager)
  }

NettyBlockTransferService is used by default. It starts a NettyBlockRpcServer, which provides the transport service for blocks; the ShuffleClient contacts it via host and port.
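
For illustration, both knobs seen in the snippets above are ordinary Spark configuration entries:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.blockTransferService", "netty") // "netty" (default) or "nio"
  .set("spark.reducer.maxSizeInFlight", "48m")        // cap on bytes fetched concurrently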

After a chain of calls, the server receives a message of type OpenBlocks and handles it:

message match {
  case openBlocks: OpenBlocks =>
    val blocks: Seq[ManagedBuffer] =
      openBlocks.blockIds.map(BlockId.apply).map(blockManager.getBlockData)
    val streamId = streamManager.registerStream(blocks.iterator)
    logTrace(s"Registered streamId $streamId with ${blocks.size} buffers")
    responseContext.onSuccess(new StreamHandle(streamId, blocks.size).toByteArray)

Here it calls BlockDataManager's getBlockData method to get the blocks. BlockManager extends BlockDataManager, and it registers itself with the BlockTransferService.

This registration happens in the initialize method of BlockManager:

def initialize(appId: String): Unit = {
  // Register yourself with the BlockTransferService so that it can access blocks by itself
  blockTransferService.init(this)
  ...

So, eventually, BlockManager's getBlockData method gets called:

override def getBlockData(blockId: BlockId): ManagedBuffer = {
  if (blockId.isShuffle) {
    shuffleManager.shuffleBlockResolver.getBlockData(blockId.asInstanceOf[ShuffleBlockId])
  } else {
    val blockBytesOpt = doGetLocal(blockId, asBlockResult = false).asInstanceOf[Option[ByteBuffer]]
    if (blockBytesOpt.isDefined) {
      val buffer = blockBytesOpt.get
      new NioManagedBuffer(buffer)
    } else {
      throw new BlockNotFoundException(blockId.toString)
    }
  }
}

So, for a ShuffleBlockId, it calls the ShuffleBlockResolver to get the block's data.

This ShuffleBlockResolver is a magical thing.

Spark has two shuffle implementations, sort and hash, provided by SortShuffleManager and HashShuffleManager respectively. In hash shuffle, each map writes one file per reduce; in sort shuffle, each map writes a single file. ShuffleBlockResolver must therefore handle both cases, and indeed there are two kinds of ShuffleBlockResolver: HashShuffleManager uses FileShuffleBlockResolver, and SortShuffleManager uses IndexShuffleBlockResolver.
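
A quick back-of-the-envelope comparison (illustrative numbers only) shows what this means for file counts:

// With M map tasks and R reduce tasks:
val (m, r) = (1000, 1000)
val hashShuffleFiles = m * r // one file per (map, reduce) pair: 1,000,000 files
val sortShuffleFiles = 2 * m // one data file + one index file per map: 2,000 files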

The difference between these two ShuffleBlockResolvers reflects how the hash and sort shuffle modes differ when the reducer reads the map output files.

How do hash and sort shuffle differ when reading the map output files?

HashShuffleManager uses FileShuffleBlockResolver, whose getBlockData method takes different paths depending on whether consolidated shuffle files are enabled. Consolidated shuffle is not enabled by default, in which case the following branch executes:

override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
  if (consolidateShuffleFiles) {
    ...
  } else {
    val file = blockManager.diskBlockManager.getFile(blockId)
    new FileSegmentManagedBuffer(transportConf, file, 0, file.length)
  }
}

It directly asks the DiskBlockManager for the file corresponding to the BlockId and wraps it in a FileSegmentManagedBuffer whose offset starts at 0 and whose length is file.length, i.e. the entire file.

SortShuffleManager uses IndexShuffleBlockResolver. In sort shuffle, each map writes one data file and one index file; the data file contains data destined for multiple reducers, so the index file must be read first to determine which segment belongs to a given reducer:

override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
  // The block is actually going to be a range of a single map output file for this map, so
  // find out the consolidated file, then the offset within that from our index
  val indexFile = getIndexFile(blockId.shuffleId, blockId.mapId)
  val in = new DataInputStream(new FileInputStream(indexFile))
  try {
    ByteStreams.skipFully(in, blockId.reduceId * 8)
    val offset = in.readLong()
    val nextOffset = in.readLong()
    new FileSegmentManagedBuffer(
      transportConf,
      getDataFile(blockId.shuffleId, blockId.mapId),
      offset,
      nextOffset - offset)
  } finally {
    in.close()
  }
}

The index file stores a series of Long values; the i-th value is the offset within the data file of the data destined for the i-th reducer. Consequently, the FileSegmentManagedBuffer it returns covers not the entire file, as in hash mode, but a single segment of it.
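
To make the layout concrete, here is a minimal sketch (a hypothetical helper, not Spark's own writer) of producing such an index file: entry i is the offset of reducer i's segment, and a trailing entry marks the end of the last segment, so segment i spans [offset(i), offset(i+1)):

import java.io.{DataOutputStream, FileOutputStream}

// Hypothetical: write N+1 offsets for N reducer segments of the given lengths
def writeIndexFile(path: String, segmentLengths: Array[Long]): Unit = {
  val out = new DataOutputStream(new FileOutputStream(path))
  try {
    var offset = 0L
    out.writeLong(offset)   // offset of reducer 0's segment
    for (len <- segmentLengths) {
      offset += len
      out.writeLong(offset) // start of the next segment / end of the file
    }
  } finally {
    out.close()
  }
}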
