Spark Technology Insider: Shuffle Details (II.)

Source: Internet
Author: User

This article focuses on how the shuffle read of a ShuffledRDD fetches data from other nodes.

The strategy for deciding what to fetch, and from where, lives in org.apache.spark.storage.BlockFetcherIterator.BasicBlockFetcherIterator#splitLocalRemoteBlocks. The comments in the code explain the details.

    protected def splitLocalRemoteBlocks(): ArrayBuffer[FetchRequest] = {
      // Make remote requests at most maxBytesInFlight / 5 in length; the reason to keep them
      // smaller than maxBytesInFlight is to allow multiple, parallel fetches from up to 5
      // nodes, rather than blocking on reading output from one node.
      //
      // To fetch data quickly, requests go out in parallel to up to 5 nodes, and each
      // request is closed once it reaches spark.reducer.maxMbInFlight (default 48 MB) / 5.
      // There are several reasons for this:
      // 1. Avoid consuming too much of the target machine's bandwidth. With gigabit NICs
      //    the mainstream today, bandwidth is still a scarce resource. If a single
      //    connection had to carry 48 MB, network I/O could become the bottleneck.
      // 2. Requests can run in parallel, which greatly reduces the time spent fetching.
      //    The total time is that of the longest request; without parallelism it would
      //    be the sum of all request times.
      // Setting spark.reducer.maxMbInFlight also keeps memory usage bounded.
      val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
      logInfo("maxBytesInFlight: " + maxBytesInFlight + ", targetRequestSize: " + targetRequestSize)

      // Split local and remote blocks. Remote blocks are further split into FetchRequests of size
      // at most maxBytesInFlight in order to limit the amount of data in flight.
      val remoteRequests = new ArrayBuffer[FetchRequest]
      var totalBlocks = 0
      for ((address, blockInfos) <- blocksByAddress) { // address effectively identifies the executor
        totalBlocks += blockInfos.size
        if (address == blockManagerId) { // the data is on this node, so read it locally
          // Filter out zero-sized blocks
          localBlocksToFetch ++= blockInfos.filter(_._2 != 0).map(_._1)
          _numBlocksToFetch += localBlocksToFetch.size
        } else {
          val iterator = blockInfos.iterator
          var curRequestSize = 0L
          var curBlocks = new ArrayBuffer[(BlockId, Long)]
          while (iterator.hasNext) {
            // blockId is an org.apache.spark.storage.ShuffleBlockId,
            // format: "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
            val (blockId, size) = iterator.next()
            // Skip empty blocks
            if (size > 0) { // ignore blocks whose size is 0
              curBlocks += ((blockId, size))
              remoteBlocksToFetch += blockId
              _numBlocksToFetch += 1
              curRequestSize += size
            } else if (size < 0) {
              throw new BlockException(blockId, "Negative block size " + size)
            }
            if (curRequestSize >= targetRequestSize) { // avoid too much data in one request
              // Add this FetchRequest
              remoteRequests += new FetchRequest(address, curBlocks)
              curBlocks = new ArrayBuffer[(BlockId, Long)]
              logDebug(s"Creating fetch request of $curRequestSize at $address")
              curRequestSize = 0
            }
          }
          // Add in the final request
          if (!curBlocks.isEmpty) { // put the remaining blocks in one last request
            remoteRequests += new FetchRequest(address, curBlocks)
          }
        }
      }
      logInfo("Getting " + _numBlocksToFetch + " non-empty blocks out of " +
        totalBlocks + " blocks")
      remoteRequests
    }
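The grouping of remote blocks into requests can be tried in isolation. The following is a minimal sketch, not Spark's own code: `SplitSketch`, `Request`, and `splitRemote` are hypothetical names, addresses and block IDs are plain strings, and only the remote branch of the method is modelled.

```scala
import scala.collection.mutable.ArrayBuffer

object SplitSketch {
  // Hypothetical stand-in for Spark's FetchRequest: a target address
  // plus the (blockId, size) pairs to fetch from it.
  final case class Request(address: String, blocks: Seq[(String, Long)]) {
    def size: Long = blocks.map(_._2).sum
  }

  // Group the blocks of one remote node into requests. A request is closed
  // as soon as its accumulated size reaches maxBytesInFlight / 5.
  def splitRemote(
      address: String,
      blocks: Seq[(String, Long)],
      maxBytesInFlight: Long): Seq[Request] = {
    val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
    val requests = new ArrayBuffer[Request]
    var cur = new ArrayBuffer[(String, Long)]
    var curSize = 0L
    for ((id, size) <- blocks) {
      require(size >= 0, s"Negative block size $size for $id")
      if (size > 0) { // zero-sized blocks are skipped
        cur += ((id, size))
        curSize += size
      }
      if (curSize >= targetRequestSize) {
        requests += Request(address, cur.toList)
        cur = new ArrayBuffer[(String, Long)]
        curSize = 0L
      }
    }
    // Whatever is left over goes into one final request.
    if (cur.nonEmpty) requests += Request(address, cur.toList)
    requests.toList
  }
}
```

With `maxBytesInFlight = 50` the target is 10 bytes, so blocks of sizes 4, 0, 7, 3, 12 and 1 produce three requests of 11, 15 and 1 bytes. Because a request is closed only after it reaches the target, it can overshoot the target by up to one block. With the default of 48 MB the target works out to roughly 9.6 MB per request.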

