Spark storage mechanism in detail

We know that Spark can keep a computed RDD in memory and reuse it when it is needed again. How does Spark actually do this? This article explains the process of RDD reuse by walking through the source code.
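Before diving in, here is what the feature looks like from the user side: a minimal, self-contained sketch (the application name and the HDFS path are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-example"))

    // Mark the RDD for in-memory storage; the blocks are materialized
    // the first time an action runs over them.
    val words = sc.textFile("hdfs:///tmp/input.txt")  // hypothetical path
      .flatMap(_.split(" "))
      .persist(StorageLevel.MEMORY_ONLY)

    println(words.count())             // computes and caches the partitions
    println(words.distinct().count())  // reuses the cached blocks

    sc.stop()
  }
}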

In the previous article explaining Spark's execution mechanism, the DAGScheduler was responsible for decomposing an action into stages. In DAGScheduler.getMissingParentStages, Spark first tries to take advantage of RDDs computed in the past, and the function used for that is DAGScheduler.getCacheLocs.

private val cacheLocs = new HashMap[Int, Array[Seq[TaskLocation]]]

private def getCacheLocs(rdd: RDD[_]): Array[Seq[TaskLocation]] = {
  if (!cacheLocs.contains(rdd.id)) {
    val blockIds = rdd.partitions.indices.map(index => RDDBlockId(rdd.id, index)).toArray[BlockId]
    val locs = BlockManager.blockIdsToBlockManagers(blockIds, env, blockManagerMaster)
    cacheLocs(rdd.id) = blockIds.map { id =>
      locs.getOrElse(id, Nil).map(bm => TaskLocation(bm.host, bm.executorId))
    }
  }
  cacheLocs(rdd.id)
}

The DAGScheduler only stores the location information of the partitions in cacheLocs. Let's look at the execution logic of the lookup: it first generates a BlockId for each partition, then calls BlockManager.blockIdsToBlockManagers to convert the BlockIds into Seq[BlockManagerId] values; a BlockManagerId contains the location information of a partition (each partition is stored as one block, and blocks can also hold other data, such as broadcast variables).
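To make the id scheme concrete: each cached partition is identified by an RDDBlockId built from the RDD id and the partition index, whose string name has the form rdd_&lt;rddId&gt;_&lt;splitIndex&gt;. A small sketch, assuming the 1.x storage API and a made-up RDD id of 5:

import org.apache.spark.storage.{BlockId, RDDBlockId}

// One block id per partition of an RDD with (hypothetical) id 5 and 3 partitions.
val blockIds: Array[BlockId] =
  (0 until 3).map(index => RDDBlockId(5, index)).toArray[BlockId]

// Prints: rdd_5_0, rdd_5_1, rdd_5_2
println(blockIds.map(_.name).mkString(", "))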

According to the comments in the source, every node (both the master and the workers) runs a BlockManager to manage all of its storage (including RDDs, broadcast variables, and so on), and the master and workers communicate through the Akka actor system (see my other article for an introduction), namely through BlockManagerMasterActor and BlockManagerSlaveActor. If the Akka ask/reply pattern is unfamiliar, the standalone sketch below illustrates it; after that, keep looking at BlockManager.blockIdsToBlockManagers.
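A minimal ask/reply example (classic Akka 2.3-era actors, matching the Spark 1.x dependency; the ToyMaster actor and its messages are invented for illustration and are not part of Spark):

import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

// A toy actor that answers location queries, standing in for BlockManagerMasterActor.
class ToyMaster extends Actor {
  def receive = {
    case "where" => sender ! Seq("host1", "host2")
  }
}

object AskExample extends App {
  val system = ActorSystem("toy")
  val master = system.actorOf(Props[ToyMaster], "master")

  implicit val timeout = Timeout(5.seconds)
  // ask ("?") returns a Future; Spark's askWithReply blocks on it and retries.
  val answer = Await.result((master ? "where").mapTo[Seq[String]], 5.seconds)
  println(answer)
  system.shutdown()
}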

def blockIdsToBlockManagers(
    blockIds: Array[BlockId],
    env: SparkEnv,
    blockManagerMaster: BlockManagerMaster = null): Map[BlockId, Seq[BlockManagerId]] = {

  // blockManagerMaster != null is used in tests
  assert(env != null || blockManagerMaster != null)
  val blockLocations: Seq[Seq[BlockManagerId]] = if (blockManagerMaster == null) {
    env.blockManager.getLocationBlockIds(blockIds)
  } else {
    blockManagerMaster.getLocations(blockIds)
  }

  val blockManagers = new HashMap[BlockId, Seq[BlockManagerId]]
  for (i <- 0 until blockIds.length) {
    blockManagers(blockIds(i)) = blockLocations(i)
  }
  blockManagers.toMap
}

The BlockManager is created in SparkEnv. SparkEnv also runs on every node; the driver and the executors each create their own instance (the same class, but with different members, among them the BlockManager). When the SparkEnv is created, a BlockManagerMasterActor is created for the BlockManager on the driver, and a ref to that BlockManagerMasterActor is handed to the BlockManager on each executor. The code above uses SparkEnv.blockManager.blockManagerMaster.getLocations to find the BlockManagerIds for each BlockId and returns them organized as a map. The sketch below shows how code can reach this machinery; then we come to BlockManagerMaster.getLocations.
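A tiny aside, assuming the 1.x API (field names may vary slightly across versions):

import org.apache.spark.SparkEnv

// SparkEnv.get returns the environment of the current JVM.
// On the driver, the BlockManager talks to a local BlockManagerMasterActor;
// on an executor, it only holds a remote ref to that actor.
val env = SparkEnv.get
val blockManager = env.blockManager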

def getLocations(blockId: BlockId): Seq[BlockManagerId] = {
  askDriverWithReply[Seq[BlockManagerId]](GetLocations(blockId))
}

private def askDriverWithReply[T](message: Any): T = {
  AkkaUtils.askWithReply(message, driverActor, AKKA_RETRY_ATTEMPTS, AKKA_RETRY_INTERVAL_MS,
    timeout)
}

This code simply sends a GetLocations message to the BlockManagerMasterActor and waits for a reply. The BlockManagerMasterActor holds all of the storage information: blockManagerInfo holds each executor's storage information, blockManagerIdByExecutor maps executor ids to BlockManagerIds, and blockLocations saves the storage locations of all blocks (which includes all partition locations). This is how the BlockManagerMasterActor answers the location query:

override def receiveWithLogging = {
  case GetLocations(blockId) =>
    sender ! getLocations(blockId)
  case ... =>
}

private def getLocations(blockId: BlockId): Seq[BlockManagerId] = {
  if (blockLocations.containsKey(blockId)) blockLocations.get(blockId).toSeq else Seq.empty
}
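For reference, the three bookkeeping structures mentioned above are declared along these lines in the 1.x source (a sketch from memory; the exact types may differ slightly):

import java.util.{HashMap => JHashMap}
import scala.collection.mutable

// Mapping from block manager id to that manager's metadata (memory used, blocks held).
private val blockManagerInfo = new mutable.HashMap[BlockManagerId, BlockManagerInfo]

// Mapping from executor id to the id of its block manager.
private val blockManagerIdByExecutor = new mutable.HashMap[String, BlockManagerId]

// Mapping from block id to the set of block managers that hold the block.
private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]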

Since the BlockManagerMasterActor saves all block locations, it can answer directly. We can now see that all block location information is kept on the master node. These are the complete steps Spark takes to find a persisted RDD, but they do not cover the entire storage mechanism, so next let's analyze some of the other code.

def removeBlock(blockId: BlockId) {
  askDriverWithReply(RemoveBlock(blockId))
}

private def askDriverWithReply[T](message: Any): T = {
  AkkaUtils.askWithReply(message, driverActor, AKKA_RETRY_ATTEMPTS, AKKA_RETRY_INTERVAL_MS,
    timeout)
}

As in the example above, this sends a RemoveBlock message to the BlockManagerMasterActor.
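In practice this removal path is reached from user code through RDD.unpersist, which asks the master to drop all of the RDD's blocks (internally via a related RemoveRdd message rather than one RemoveBlock per partition). A minimal usage sketch, reusing the hypothetical input path from earlier:

val cached = sc.textFile("hdfs:///tmp/input.txt").cache()  // hypothetical path
cached.count()  // materializes the cached blocks

// Drop the blocks on every executor; blocking = true waits for completion.
cached.unpersist(blocking = true)

On the master side, the handler for the removal message looks like this: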

override def receiveWithLogging = {
  case RemoveBlock(blockId) =>
    removeBlockFromWorkers(blockId)
    sender ! true
  case ... =>
}

private def removeBlockFromWorkers(blockId: BlockId) {
  val locations = blockLocations.get(blockId)
  if (locations != null) {
    locations.foreach { blockManagerId: BlockManagerId =>
      val blockManager = blockManagerInfo.get(blockManagerId)
      if (blockManager.isDefined) {
        // Remove the block from the slave's BlockManager.
        // Doesn't actually wait for a confirmation and the message might get lost.
        // If message loss becomes frequent, we should add retry logic here.
        blockManager.get.slaveActor.ask(RemoveBlock(blockId))(akkaTimeout)
      }
    }
  }
}

This code first uses blockLocations to find the BlockManagerIds of all the managers that hold the block, then looks up each BlockManagerInfo by its BlockManagerId to get the ref of the BlockManagerSlaveActor on the executor, and sends it a RemoveBlock message.

override def receiveWithLogging = {
  case RemoveBlock(blockId) =>
    doAsync[Boolean]("removing block " + blockId, sender) {
      blockManager.removeBlock(blockId)
      true
    }
  case ... =>
}

The BlockManagerSlaveActor calls BlockManager.removeBlock when it receives the message.

def removeBlock(blockId: BlockId, tellMaster: Boolean = true): Unit = {
  logInfo(s"Removing block $blockId")
  val info = blockInfo.get(blockId).orNull
  if (info != null) {
    info.synchronized {
      // Removals are idempotent in disk store and memory store. At worst, we get a warning.
      val removedFromMemory = memoryStore.remove(blockId)
      val removedFromDisk = diskStore.remove(blockId)
      val removedFromTachyon = if (tachyonInitialized) tachyonStore.remove(blockId) else false
      if (!removedFromMemory && !removedFromDisk && !removedFromTachyon) {
        logWarning(s"Block $blockId could not be removed as it was not found in either " +
          "the disk, memory, or Tachyon store")
      }
      blockInfo.remove(blockId)
      if (tellMaster && info.tellMaster) {
        val status = getCurrentBlockStatus(blockId, info)
        reportBlockStatus(blockId, info, status)
      }
    }
  } else {
    // The block has already been removed; do nothing.
    logWarning(s"Asked to remove block $blockId, which does not exist")
  }
}

This code calls the remove function of each of the three stores (memory, disk, and Tachyon) and reports the result to the master if required. Note that the MemoryStore keeps its blocks on the JVM heap, so they are subject to Java GC; it is the Tachyon-backed store (storage level OFF_HEAP) that lives outside the heap and is not affected by GC, as the sketch below illustrates. That concludes this look at Spark's storage management mechanism.
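From the user side, the storage level chosen at persist time decides which of the three stores receives the blocks. A short sketch (note that OFF_HEAP in these Spark versions requires a running Tachyon cluster):

import org.apache.spark.storage.StorageLevel

// MemoryStore: deserialized objects on the JVM heap, GC-managed.
val onHeap = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY)

// DiskStore: serialized blocks in the executor's local directories.
val onDisk = sc.parallelize(1 to 1000).persist(StorageLevel.DISK_ONLY)

// TachyonStore: blocks kept off-heap in Tachyon, outside the GC's reach.
val offHeap = sc.parallelize(1 to 1000).persist(StorageLevel.OFF_HEAP)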
