Spark Tech Insider: Spark's Pluggable Framework, or How to Develop Your Own Shuffle Service

Source: Internet
Author: User
Tags: shuffle

Let's start by introducing the interfaces that need to be implemented. (I intended to include a class diagram of the framework here, but CSDN is acting up today and won't let me upload pictures.) If you want to implement a new shuffle mechanism, you need to implement these interfaces.


1.1.1 org.apache.spark.shuffle.ShuffleManager

The driver and each executor hold a ShuffleManager, which can be specified via the configuration item spark.shuffle.manager and is created by SparkEnv. The ShuffleManager in the driver is responsible for registering shuffle metadata, such as the shuffle ID and the number of map tasks. The ShuffleManager in each executor is responsible for reading and writing shuffle data.
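For example, here is a minimal sketch of selecting a custom implementation through that configuration item (com.example.MyShuffleManager is a hypothetical class name; in Spark 1.x this item also accepts the short names "hash" and "sort"):

import org.apache.spark.{SparkConf, SparkContext}

// Point spark.shuffle.manager at a fully qualified ShuffleManager class.
val conf = new SparkConf()
  .setAppName("custom-shuffle-demo")
  .set("spark.shuffle.manager", "com.example.MyShuffleManager")
val sc = new SparkContext(conf)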

The methods that need to be implemented, and what each one does:

1) Registration of metadata by the driver:

def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle

In general, if there are no special requirements, you can use the following implementation; in fact, this is what both hash-based shuffle and sort-based shuffle do.

override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  new BaseShuffleHandle(shuffleId, numMaps, dependency)
}

2) Obtain a shuffle writer: creates a ShuffleWriter for a given partition based on the ID of the shuffle map task.

def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext): ShuffleWriter[K, V]

3) Obtain a shuffle reader: creates a ShuffleReader based on the shuffle ID and a range of partition IDs.

def getReader[K, C](
    handle: ShuffleHandle,
    startPartition: Int,
    endPartition: Int,
    context: TaskContext): ShuffleReader[K, C]

4) Assign a value to the data member shuffleBlockManager, which holds the actual ShuffleBlockManager.

5) def unregisterShuffle(shuffleId: Int): Boolean, which deletes the local metadata for the shuffle.

6) def stop(): Unit, which stops the ShuffleManager.

For concrete implementations of each of these interfaces, see org.apache.spark.shuffle.sort.SortShuffleManager and org.apache.spark.shuffle.hash.HashShuffleManager.
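To make the overall shape concrete, here is a minimal skeleton wired from the methods above. This is only a sketch against the Spark 1.x trait as described in this section; MyShuffleManager, MyShuffleWriter, MyShuffleReader, and MyShuffleBlockManager are hypothetical names (the last three are sketched in the sections below).

import org.apache.spark.{ShuffleDependency, SparkConf, TaskContext}
import org.apache.spark.shuffle._

// SparkEnv instantiates this class by reflection, passing in the SparkConf.
class MyShuffleManager(conf: SparkConf) extends ShuffleManager {

  // 4) The data member holding the actual ShuffleBlockManager.
  override val shuffleBlockManager = new MyShuffleBlockManager

  // 1) Driver side: register metadata and return a handle for later use.
  override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    new BaseShuffleHandle(shuffleId, numMaps, dependency)
  }

  // 2) Executor side: one writer per shuffle map task.
  override def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Int,
      context: TaskContext): ShuffleWriter[K, V] = {
    new MyShuffleWriter(handle.asInstanceOf[BaseShuffleHandle[K, V, _]], mapId, context)
  }

  // 3) Executor side: one reader per range of reduce partitions.
  override def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext): ShuffleReader[K, C] = {
    new MyShuffleReader(handle.asInstanceOf[BaseShuffleHandle[K, _, C]],
      startPartition, endPartition, context)
  }

  // 5) Drop local metadata for one shuffle; nothing is tracked in this sketch.
  override def unregisterShuffle(shuffleId: Int): Boolean = true

  // 6) Release resources on shutdown.
  override def stop(): Unit = shuffleBlockManager.stop()
}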

1.1.2 org.apache.spark.shuffle.ShuffleWriter

A shuffle map task writes its shuffle data locally through a ShuffleWriter. The writer delegates most of the actual writing to the ShuffleBlockManager, so its own responsibilities are relatively lightweight.

1) def write(records: Iterator[_ <: Product2[K, V]]): Unit, which writes all the data. Note that if map-side aggregation (combine) is required, the records must be aggregated before they are written; see the sketch after this list.

2) def stop(success: Boolean): Option[MapStatus], which commits the write after writing is complete.

For hash-based shuffle, see org.apache.spark.shuffle.hash.HashShuffleWriter; for sort-based shuffle, see org.apache.spark.shuffle.sort.SortShuffleWriter.
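Here is a minimal sketch of such a writer, again against the Spark 1.x interface described above. MyShuffleWriter is a hypothetical class, and the actual file I/O and MapStatus construction are left as ??? placeholders.

import org.apache.spark.TaskContext
import org.apache.spark.scheduler.MapStatus
import org.apache.spark.shuffle.{BaseShuffleHandle, ShuffleWriter}

class MyShuffleWriter[K, V](
    handle: BaseShuffleHandle[K, V, _],
    mapId: Int,
    context: TaskContext) extends ShuffleWriter[K, V] {

  private val dep = handle.dependency
  private var stopping = false

  // 1) Aggregate on the map side first if the dependency requires it, then
  //    route each record to the bucket chosen by the partitioner.
  override def write(records: Iterator[_ <: Product2[K, V]]): Unit = {
    val iter = if (dep.aggregator.isDefined && dep.mapSideCombine) {
      dep.aggregator.get.combineValuesByKey(records, context)
    } else {
      records
    }
    for (elem <- iter) {
      val bucketId = dep.partitioner.getPartition(elem._1)
      writeRecord(bucketId, elem)
    }
  }

  // 2) Commit the write on success and report the output via MapStatus.
  override def stop(success: Boolean): Option[MapStatus] = {
    if (stopping) return None
    stopping = true
    if (success) Some(commitAndBuildMapStatus()) else None
  }

  // Placeholders: a real writer goes through the ShuffleBlockManager.
  private def writeRecord(bucketId: Int, record: Product2[K, _]): Unit = ???
  private def commitAndBuildMapStatus(): MapStatus = ???
}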

1.1.3 org.apache.spark.shuffle.ShuffleBlockManager

Its main role is to read shuffle data from the local node. These interfaces are invoked through org.apache.spark.storage.BlockManager.

1) def getBytes(blockId: ShuffleBlockId): Option[ByteBuffer], generally implemented by calling the next interface and converting the resulting ManagedBuffer to a ByteBuffer; see the sketch after this list.

2) def getBlockData(blockId: ShuffleBlockId): ManagedBuffer, the core read logic. For example, hash-based shuffle reads local files through this interface. Different implementations may organize their files differently; sort-based shuffle, for instance, must first read the index file to obtain the starting offset of each partition before it can read the actual data file.

3) def stop(): Unit, which stops the manager.

For hash-based shuffle, see org.apache.spark.shuffle.FileShuffleBlockManager; for sort-based shuffle, see org.apache.spark.shuffle.IndexShuffleBlockManager.
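As promised above, here is a minimal sketch showing the usual relationship between the two read methods. MyShuffleBlockManager is a hypothetical class against the interface described in this section, and the real file-lookup logic is reduced to a placeholder.

import java.nio.ByteBuffer
import org.apache.spark.network.buffer.ManagedBuffer
import org.apache.spark.shuffle.ShuffleBlockManager
import org.apache.spark.storage.ShuffleBlockId

class MyShuffleBlockManager extends ShuffleBlockManager {

  // 1) Usually just a ByteBuffer view over the result of getBlockData.
  override def getBytes(blockId: ShuffleBlockId): Option[ByteBuffer] = {
    Some(getBlockData(blockId).nioByteBuffer())
  }

  // 2) The core read path: locate the bytes for (shuffleId, mapId, reduceId)
  //    in whatever on-disk layout this implementation uses.
  override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = ???

  // 3) Release any resources held by the manager.
  override def stop(): Unit = {}
}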

1.1.4 org.apache.spark.shuffle.ShuffleReader

ShuffleReader implements the logic by which a downstream task reads the shuffle output of the upstream shuffle map tasks. This logic is fairly complex. In simple terms, the reader obtains the location of the data through org.apache.spark.MapOutputTracker; if the data is local, it calls getBlockData on org.apache.spark.storage.BlockManager to read it (getBlockData ultimately calls getBlockData on org.apache.spark.shuffle.ShuffleBlockManager). For the detailed shuffle read logic, see the following section.

1) def read(): Iterator[Product2[K, C]]
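A minimal sketch of such a reader, assuming the interface above; MyShuffleReader is a hypothetical class, and the fetch-and-combine logic is reduced to a placeholder.

import org.apache.spark.{InterruptibleIterator, TaskContext}
import org.apache.spark.shuffle.{BaseShuffleHandle, ShuffleReader}

class MyShuffleReader[K, C](
    handle: BaseShuffleHandle[K, _, C],
    startPartition: Int,
    endPartition: Int,
    context: TaskContext) extends ShuffleReader[K, C] {

  override def read(): Iterator[Product2[K, C]] = {
    // 1) Ask MapOutputTracker where the map outputs for these partitions live.
    // 2) Read local blocks via BlockManager.getBlockData (which ends up in
    //    ShuffleBlockManager.getBlockData); fetch remote blocks over the network.
    // 3) Apply reduce-side aggregation and ordering if the dependency asks for it.
    new InterruptibleIterator(context, fetchAndCombine())
  }

  // Placeholder for steps 1-3 above.
  private def fetchAndCombine(): Iterator[Product2[K, C]] = ???
}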


How do you develop your own shuffle mechanism? By now you should know what to do. Still not sure? Read through it again.


