Let's start by introducing the interfaces that need to be implemented. A class diagram of the framework belongs here (CSDN is acting up today and will not accept image uploads). If you want to implement a new shuffle mechanism, these are the interfaces you must implement.
1.1.1 org.apache.spark.shuffle.ShuffleManager
The driver and each executor hold a ShuffleManager, which can be specified through the configuration item spark.shuffle.manager and is created by SparkEnv. The ShuffleManager in the driver is responsible for registering shuffle metadata, such as the shuffle id and the number of map tasks. The ShuffleManager in each executor is responsible for reading and writing shuffle data.
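For example, selecting the shuffle implementation looks roughly like this (a minimal sketch, e.g. for spark-shell or an application's setup code; in Spark 1.1 the short aliases "hash" and "sort" map to the two built-in managers, and a fully qualified class name plugs in a custom one):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-demo")
  // "hash" and "sort" are built-in aliases; a fully qualified class name
  // selects a custom ShuffleManager implementation.
  .set("spark.shuffle.manager", "sort")
val sc = new SparkContext(conf)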
The functions that need to be implemented, and what they do:
1) The driver registers the shuffle metadata:
def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle
In general, if there are no special requirements, you can use the following implementation; in fact, this is what both Hash Based Shuffle and Sort Based Shuffle do.
override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  new BaseShuffleHandle(shuffleId, numMaps, dependency)
}
2) Obtain a shuffle writer: based on the ID of the shuffle map task, create a ShuffleWriter for it.
def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext): ShuffleWriter[K, V]
3) Obtain a shuffle reader: based on the shuffle ID and the range of partition IDs, create a ShuffleReader for it.
def getReader[K, C](
    handle: ShuffleHandle,
    startPartition: Int,
    endPartition: Int,
    context: TaskContext): ShuffleReader[K, C]
4) Assign the data member shuffleBlockManager, which holds the actual ShuffleBlockManager.
5) def unregisterShuffle(shuffleId: Int): Boolean, which deletes the metadata of the local shuffle.
6) def stop(): Unit, which stops the ShuffleManager.
For concrete example implementations of each interface, refer to org.apache.spark.shuffle.sort.SortShuffleManager and org.apache.spark.shuffle.hash.HashShuffleManager.
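Putting the six pieces together, a custom implementation is simply a class with these members. Below is a minimal sketch assuming the Spark 1.1 trait described above; the class name MyShuffleManager and its package are hypothetical, and the unimplemented bodies are left as ???. Note that the ShuffleManager trait is private[spark] as of this writing, so a custom implementation has to live in a subpackage of org.apache.spark.

// Hypothetical skeleton of a custom ShuffleManager (a sketch, not a
// definitive implementation).
package org.apache.spark.shuffle.custom

import org.apache.spark.{ShuffleDependency, SparkConf, TaskContext}
import org.apache.spark.shuffle._

private[spark] class MyShuffleManager(conf: SparkConf) extends ShuffleManager {

  // 1) Driver-side metadata registration; BaseShuffleHandle is usually enough.
  override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle =
    new BaseShuffleHandle(shuffleId, numMaps, dependency)

  // 2) Create a writer for one shuffle map task.
  override def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Int,
      context: TaskContext): ShuffleWriter[K, V] = ???

  // 3) Create a reader for a range of reduce partitions.
  override def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext): ShuffleReader[K, C] = ???

  // 4) The ShuffleBlockManager that serves local shuffle blocks.
  override def shuffleBlockManager: ShuffleBlockManager = ???

  // 5) Drop the metadata of a shuffle that is no longer needed.
  override def unregisterShuffle(shuffleId: Int): Boolean = true

  // 6) Release any resources held by this manager.
  override def stop(): Unit = {}
}

To enable it, set spark.shuffle.manager to the fully qualified class name, as in the configuration example earlier.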
1.1.2 org.apache.spark.shuffle.ShuffleWriter
The shuffle map task writes shuffle data to local storage through the ShuffleWriter. The writer mainly delegates the actual writing to the ShuffleBlockManager, so its own functionality is relatively lightweight.
1) def write(records: Iterator[_ <: Product2[K, V]]): Unit, which writes all the data. Note that if map-side aggregation (combine) is required, the records must be aggregated before they are written; see the sketch after this list.
2) def stop(success: Boolean): Option[MapStatus], which commits the write after all records have been written.
For Hash Based Shuffle, see org.apache.spark.shuffle.hash.HashShuffleWriter; for Sort Based Shuffle, see org.apache.spark.shuffle.sort.SortShuffleWriter.
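The map-side aggregation rule from point 1 looks roughly like this in practice. The sketch below is modeled on HashShuffleWriter's write method but simplified: MyShuffleWriter is a hypothetical name, and the per-partition file write is reduced to comments.

// A simplified writer sketch; the actual file I/O is left abstract.
package org.apache.spark.shuffle.custom

import org.apache.spark.TaskContext
import org.apache.spark.scheduler.MapStatus
import org.apache.spark.shuffle.{BaseShuffleHandle, ShuffleWriter}

private[spark] class MyShuffleWriter[K, V](
    handle: BaseShuffleHandle[K, V, _],
    mapId: Int,
    context: TaskContext) extends ShuffleWriter[K, V] {

  private val dep = handle.dependency

  override def write(records: Iterator[_ <: Product2[K, V]]): Unit = {
    // Aggregate on the map side first if the shuffle dependency requests it.
    val iter = if (dep.aggregator.isDefined && dep.mapSideCombine) {
      dep.aggregator.get.combineValuesByKey(records, context)
    } else {
      records
    }
    for (elem <- iter) {
      // Route the record to its reduce partition, then append it to that
      // partition's output via the shuffle block manager (left abstract).
      val bucketId = dep.partitioner.getPartition(elem._1)
      // writers(bucketId).write(elem)
    }
  }

  override def stop(success: Boolean): Option[MapStatus] = {
    // On success, commit the output and return a MapStatus describing the
    // per-partition sizes; return None on failure. Left abstract here.
    None
  }
}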
1.1.3 org.apache.spark.shuffle.ShuffleBlockManager
Its main role is to provide the ability to read shuffle data from local storage. These interfaces are called through org.apache.spark.storage.BlockManager.
1) def getBytes(blockId: ShuffleBlockId): Option[ByteBuffer], generally implemented by calling the next interface and converting the resulting ManagedBuffer into a ByteBuffer.
2) def getBlockData(blockId: ShuffleBlockId): ManagedBuffer, the core read logic. For example, Hash Based Shuffle reads its local files through this interface. Different implementations may organize their files differently; Sort Based Shuffle, for instance, must first read the index file to obtain the starting offset of each partition before it can read the actual data file. A sketch of that index lookup follows below.
3) def stop(): Unit, which stops the manager.
For Hash Based Shuffle, see org.apache.spark.shuffle.FileShuffleBlockManager; for Sort Based Shuffle, see org.apache.spark.shuffle.IndexShuffleBlockManager.
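To make the index-file step from point 2 concrete, here is a hedged sketch of the offset lookup that IndexShuffleBlockManager performs (the object and method names here are illustrative, not Spark's). The index file stores one 8-byte offset per partition plus a final end offset, so partition reduceId occupies the byte range [offset, nextOffset) of the data file.

import java.io.{DataInputStream, File, FileInputStream}

object IndexLookup {
  // Returns the [start, end) byte range of partition `reduceId` in the data file.
  def readPartitionOffsets(indexFile: File, reduceId: Int): (Long, Long) = {
    val in = new DataInputStream(new FileInputStream(indexFile))
    try {
      // Each offset is an 8-byte long; skip straight to this partition's entry.
      in.skip(reduceId * 8L)
      val offset = in.readLong()      // where this partition's bytes begin
      val nextOffset = in.readLong()  // where the next partition's bytes begin
      (offset, nextOffset)
    } finally {
      in.close()
    }
  }
}

getBlockData then wraps that byte range of the data file in a ManagedBuffer and returns it.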
1.1.4 org.apache.spark.shuffle.ShuffleReader
ShuffleReader implements the logic by which a downstream task reads the shuffle output of the upstream ShuffleMapTasks. This logic is fairly complex. In simple terms, the reader obtains the location of the data through org.apache.spark.MapOutputTracker; if the data is local, it calls the getBlockData method of org.apache.spark.storage.BlockManager to read it (getBlockData ultimately calls the getBlockData of org.apache.spark.shuffle.ShuffleBlockManager). The detailed shuffle read logic is covered in the following section.
1) def read(): Iterator[Product2[K, C]]
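As a rough illustration of that flow, here is a minimal reader skeleton. MyShuffleReader is a hypothetical name, and the fetch, aggregate, and sort steps are left as comments because their details are implementation specific.

// Hypothetical reader skeleton (a sketch, not Spark's implementation).
package org.apache.spark.shuffle.custom

import org.apache.spark.TaskContext
import org.apache.spark.shuffle.{BaseShuffleHandle, ShuffleReader}

private[spark] class MyShuffleReader[K, C](
    handle: BaseShuffleHandle[K, _, C],
    startPartition: Int,
    endPartition: Int,
    context: TaskContext) extends ShuffleReader[K, C] {

  override def read(): Iterator[Product2[K, C]] = {
    // 1. Ask MapOutputTracker for the locations and sizes of the map outputs
    //    belonging to partitions [startPartition, endPartition).
    // 2. Read local blocks through BlockManager.getBlockData and fetch remote
    //    blocks over the network.
    // 3. Deserialize, then aggregate and/or sort the records if the
    //    ShuffleDependency asks for it.
    ???
  }

  // Early Spark 1.x versions of the trait also declared a stop() method.
  def stop(): Unit = {}
}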
So, how do you develop your own shuffle mechanism? By now you should know what needs to be done. Not yet? Read through it once more.