Spark Technology Insider: The Overall Shuffle Read Process

Recall that the upper boundary of each stage either reads data from external storage or reads the output of the previous stage, while the lower boundary either writes data to the local file system (when a shuffle is required) so that child stages can read it, or, for the last stage, outputs the final result. Here, a stage is a group of tasks that can run in pipeline mode; except for the last stage, which corresponds to ResultTask, all other stages correspond to ShuffleMapTask.

Except for tasks that read data from external storage, and RDDs that have already been cached or checkpointed, a task generally starts with the shuffle read of a ShuffledRDD. This section describes the shuffle read process in detail.

Let's take a look at the overall architecture of shuffle read.


org.apache.spark.rdd.ShuffledRDD#compute starts by calling the getReader method of org.apache.spark.shuffle.ShuffleManager to obtain an org.apache.spark.shuffle.ShuffleReader, and then calls its read() method to read the data. In Spark 1.2.0, whether hash-based shuffle or sort-based shuffle is used, the built-in shuffle reader is org.apache.spark.shuffle.hash.HashShuffleReader. Core implementation:

override def read(): Iterator[Product2[K, C]] = {
  val ser = Serializer.getSerializer(dep.serializer)
  // Obtain the results
  val iter = BlockStoreShuffleFetcher.fetch(handle.shuffleId, startPartition, context, ser)

  // Process the results
  val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
    if (dep.mapSideCombine) {
      // Map-side combine was performed; merge the already-combined values
      new InterruptibleIterator(context, dep.aggregator.get.combineCombinersByKey(iter, context))
    } else {
      // Aggregate on the reduce side
      new InterruptibleIterator(context, dep.aggregator.get.combineValuesByKey(iter, context))
    }
  } else {
    // No aggregation needed
    iter.asInstanceOf[Iterator[Product2[K, C]]].map(pair => (pair._1, pair._2))
  }

  // Sort the output if there is a sort ordering defined.
  dep.keyOrdering match { // determine whether sorting is needed
    case Some(keyOrd: Ordering[K]) =>
      // Sort the data using ExternalSorter. Note that if spark.shuffle.spill is false,
      // the data is not spilled to disk.
      val sorter = new ExternalSorter[K, C, C](ordering = Some(keyOrd), serializer = Some(ser))
      sorter.insertAll(aggregatedIter)
      context.taskMetrics.memoryBytesSpilled += sorter.memoryBytesSpilled
      context.taskMetrics.diskBytesSpilled += sorter.diskBytesSpilled
      sorter.iterator
    case None =>
      // No sorting needed
      aggregatedIter
  }
}
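For reference, the call site described above, where ShuffledRDD#compute obtains the reader and invokes read(), looks roughly like the following sketch (simplified from the Spark 1.2.0 code path):

// Simplified sketch of org.apache.spark.rdd.ShuffledRDD#compute
override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
  // The single dependency of a ShuffledRDD is its ShuffleDependency
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  // Ask the ShuffleManager for a reader covering exactly this partition, then read it
  SparkEnv.get.shuffleManager
    .getReader(dep.shuffleHandle, split.index, split.index + 1, context)
    .read()
    .asInstanceOf[Iterator[(K, C)]]
}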

org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher#fetch obtains the data. It first calls org.apache.spark.MapOutputTracker#getServerStatuses to obtain the meta information of the data; this may require sending a read request to org.apache.spark.MapOutputTrackerMasterActor via org.apache.spark.MapOutputTracker#askTracker. After obtaining the meta information, it organizes the data into Seq[(BlockManagerId, Seq[(BlockId, Long)])] and then hands it to org.apache.spark.storage.ShuffleBlockFetcherIterator, which finally issues the requests. ShuffleBlockFetcherIterator fetches data according to data locality: if the data is local, org.apache.spark.storage.BlockManager#getBlockData is called to read the local data block; for shuffle-type blocks, getBlockData delegates to the getBlockData of the ShuffleManager's ShuffleBlockManager.
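As an illustration of this step, the following simplified sketch (modeled on the Spark 1.2.0 BlockStoreShuffleFetcher code path; shuffleId and reduceId stand for the parameters of fetch) shows how the statuses returned by getServerStatuses are turned into the Seq[(BlockManagerId, Seq[(BlockId, Long)])] handed to ShuffleBlockFetcherIterator:

import scala.collection.mutable.{ArrayBuffer, HashMap}
import org.apache.spark.SparkEnv
import org.apache.spark.storage.{BlockId, BlockManagerId, ShuffleBlockId}

// One (location, size) pair per map output; size is the size of the block that
// belongs to this reduce partition.
val statuses: Array[(BlockManagerId, Long)] =
  SparkEnv.get.mapOutputTracker.getServerStatuses(shuffleId, reduceId)

// Group the blocks by the BlockManager (executor) that holds them.
val splitsByAddress = new HashMap[BlockManagerId, ArrayBuffer[(Int, Long)]]
for (((address, size), mapId) <- statuses.zipWithIndex) {
  splitsByAddress.getOrElseUpdate(address, ArrayBuffer()) += ((mapId, size))
}

// Build the per-executor block list that ShuffleBlockFetcherIterator will request.
val blocksByAddress: Seq[(BlockManagerId, Seq[(BlockId, Long)])] =
  splitsByAddress.toSeq.map { case (address, splits) =>
    (address, splits.map(s => (ShuffleBlockId(shuffleId, s._1, reduceId), s._2)))
  }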

If the data is stored on other executors and spark.shuffle.blockTransferService is set to netty, the data is fetched through org.apache.spark.network.netty.NettyBlockTransferService#fetchBlocks; if it is set to nio, the data is fetched through org.apache.spark.network.nio.NioBlockTransferService#fetchBlocks.
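A simplified sketch of how SparkEnv selects between the two services is shown below; the constructor arguments are an approximation and may differ slightly between Spark versions:

// Simplified selection of the block transfer service ("netty" is the default in Spark 1.2.0).
// The constructor arguments shown here are an assumption.
val blockTransferService =
  conf.get("spark.shuffle.blockTransferService", "netty").toLowerCase match {
    case "netty" => new NettyBlockTransferService(conf, securityManager)
    case "nio"   => new NioBlockTransferService(conf, securityManager)
  }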


Division of data read strategies

org.apache.spark.storage.ShuffleBlockFetcherIterator divides the data-reading strategy in splitLocalRemoteBlocks: if the data is available locally, it is fetched directly from the BlockManager; if it must be fetched from other nodes, it goes over the network. Because the shuffle data may be large, the network read strategy is as follows:

1) At most 5 requests are issued at a time, so data is fetched from at most 5 nodes concurrently.

2) The data size of each request does not exceed spark.reducer.maxMbInFlight (default: 48 MB) / 5.

There are several reasons for doing this:

1) It avoids occupying too much of the target machine's bandwidth. Bandwidth still matters while Gigabit NICs are the mainstream; if the machines use 10 GbE NICs, you can increase spark.reducer.maxMbInFlight to make full use of the bandwidth (see the configuration example after this list).

2) The requests can proceed in parallel, which greatly reduces the time needed to fetch the data: the total fetch time is only as long as the slowest request. This also mitigates the impact of network congestion at any single node.
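For example, on a cluster with 10 GbE NICs you might raise the limit like this (a hypothetical tuning sketch; the property name follows Spark 1.2.0 and the value 96 is only illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Default is 48 (MB); with 96, each of the 5 parallel requests targets about 19 MB.
  .set("spark.reducer.maxMbInFlight", "96")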

Main Implementation:

private[this] def splitLocalRemoteBlocks(): ArrayBuffer[FetchRequest] = {
  val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
  val remoteRequests = new ArrayBuffer[FetchRequest]
  for ((address, blockInfos) <- blocksByAddress) {
    if (address.executorId == blockManager.blockManagerId.executorId) {
      // The blocks are local; blocks with a size of 0 need to be filtered out
      localBlocks ++= blockInfos.filter(_._2 != 0).map(_._1)
      numBlocksToFetch += localBlocks.size
    } else {
      // Blocks that must be fetched remotely
      val iterator = blockInfos.iterator
      var curRequestSize = 0L
      var curBlocks = new ArrayBuffer[(BlockId, Long)]
      while (iterator.hasNext) {
        // blockId is an org.apache.spark.storage.ShuffleBlockId,
        // format: "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
        val (blockId, size) = iterator.next()
        // Skip empty blocks
        if (size > 0) {
          curBlocks += ((blockId, size))
          remoteBlocks += blockId
          numBlocksToFetch += 1
          curRequestSize += size
        }
        if (curRequestSize >= targetRequestSize) {
          // The accumulated size is already big enough to form one batched request
          remoteRequests += new FetchRequest(address, curBlocks)
          curBlocks = new ArrayBuffer[(BlockId, Long)]
          curRequestSize = 0
        }
      }
      // The remaining blocks form one last request
      if (curBlocks.nonEmpty) {
        remoteRequests += new FetchRequest(address, curBlocks)
      }
    }
  }
  remoteRequests
}

Local read

fetchLocalBlocks() obtains the local blocks. splitLocalRemoteBlocks has already saved the local block list into localBlocks: private[this] val localBlocks = new ArrayBuffer[BlockId]()

The specific process is as follows:

val iter = localBlocks.iterator
while (iter.hasNext) {
  val blockId = iter.next()
  try {
    val buf = blockManager.getBlockData(blockId)
    shuffleMetrics.localBlocksFetched += 1
    buf.retain()
    results.put(new SuccessFetchResult(blockId, 0, buf))
  } catch {
    // exception handling omitted in this excerpt
  }
}

The implementation of blockManager.getBlockData(blockId) is:

override def getBlockData(blockId: BlockId): ManagedBuffer = {
  if (blockId.isShuffle) {
    shuffleManager.shuffleBlockManager.getBlockData(blockId.asInstanceOf[ShuffleBlockId])
  }
  // (the non-shuffle branch is omitted in this excerpt)
}
This calls the getBlockData method of the ShuffleBlockManager. In the article on the pluggable shuffle framework, we introduced that one of the responsibilities of a shuffle service is to implement a ShuffleBlockManager.
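For reference, the ShuffleBlockManager contract that each shuffle implementation must provide looks roughly like this (abridged from Spark 1.2.0; method names as I recall them):

private[spark] trait ShuffleBlockManager {
  type ShuffleId = Int

  // Return the bytes of a shuffle block managed locally, if it exists.
  def getBytes(blockId: ShuffleBlockId): Option[ByteBuffer]

  // Return the block data as a ManagedBuffer (used by the shuffle read path above).
  def getBlockData(blockId: ShuffleBlockId): ManagedBuffer

  def stop(): Unit
}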

Take hash-based shuffle as an example. Its ShuffleBlockManager is org.apache.spark.shuffle.FileShuffleBlockManager. FileShuffleBlockManager handles two cases: with shuffle file consolidation enabled, it locates the file of the corresponding FileGroup based on the map ID and reduce ID, and then reads the required data from that file according to the offset and size; without file consolidation, it directly reads the whole file identified by the shuffle block ID.

override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
  if (consolidateShuffleFiles) {
    val shuffleState = shuffleStates(blockId.shuffleId)
    val iter = shuffleState.allFileGroups.iterator
    while (iter.hasNext) {
      // Obtain the file segment information based on the map ID and reduce ID
      val segmentOpt = iter.next.getFileSegmentFor(blockId.mapId, blockId.reduceId)
      if (segmentOpt.isDefined) {
        val segment = segmentOpt.get
        // Locate the data by its offset and size within the file
        return new FileSegmentManagedBuffer(transportConf, segment.file, segment.offset, segment.length)
      }
    }
    throw new IllegalStateException("Failed to find shuffle block: " + blockId)
  } else {
    // Directly obtain the file handle and read the whole file
    val file = blockManager.diskBlockManager.getFile(blockId)
    new FileSegmentManagedBuffer(transportConf, file, 0, file.length)
  }
}

For sort-based shuffle, the index file must be consulted to find the exact location of the data block in the data file before the data can be read.

The specific implementation is in org.apache.spark.shuffle.IndexShuffleBlockManager#getBlockData.

override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = {
  // Based on shuffleId and mapId, obtain the index file from
  // org.apache.spark.storage.DiskBlockManager
  val indexFile = getIndexFile(blockId.shuffleId, blockId.mapId)
  val in = new DataInputStream(new FileInputStream(indexFile))
  try {
    ByteStreams.skipFully(in, blockId.reduceId * 8)  // jump to the entry for this block
    val offset = in.readLong()       // start position in the data file
    val nextOffset = in.readLong()   // end position in the data file
    new FileSegmentManagedBuffer(
      transportConf,
      getDataFile(blockId.shuffleId, blockId.mapId),
      offset,
      nextOffset - offset)
  } finally {
    in.close()
  }
}
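To see why the two readLong calls above are enough, here is a simplified sketch of the companion write path (modeled on IndexShuffleBlockManager#writeIndexFile in Spark 1.2.0): the index file stores cumulative offsets as 8-byte longs, so entries reduceId and reduceId + 1 delimit the bytes of reduce partition reduceId in the data file.

def writeIndexFile(shuffleId: Int, mapId: Int, lengths: Array[Long]): Unit = {
  val indexFile = getIndexFile(shuffleId, mapId)
  val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexFile)))
  try {
    // Convert per-partition lengths into cumulative offsets.
    var offset = 0L
    out.writeLong(offset)        // the first partition always starts at offset 0
    for (length <- lengths) {    // one length per reduce partition
      offset += length
      out.writeLong(offset)      // end offset of this partition = start of the next
    }
  } finally {
    out.close()
  }
}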
