Spark Tech Insider: Spark's Pluggable Framework, or How to Develop Your Own Shuffle Service

Source: Internet
Author: User
Tags: shuffle

Let's start by introducing the interfaces that need to be implemented. (I intended to include a class diagram of the framework here, but CSDN is acting up today and won't let me upload pictures.) If you want to implement a new shuffle mechanism, you need to implement these interfaces.


1.1.1 org.apache.spark.shuffle.ShuffleManager

The driver and each executor hold a ShuffleManager, which can be specified via the configuration item spark.shuffle.manager and is created by SparkEnv. The ShuffleManager in the driver is responsible for registering shuffle metadata, such as the shuffle ID and the number of map tasks. The ShuffleManager in each executor is responsible for reading and writing shuffle data.
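For example, here is a minimal sketch of selecting a custom implementation through that configuration item (com.example.MyShuffleManager is a hypothetical class name; in Spark 1.x this item also accepts the short names "hash" and "sort"):

import org.apache.spark.{SparkConf, SparkContext}

// Point spark.shuffle.manager at a fully qualified ShuffleManager class.
val conf = new SparkConf()
  .setAppName("custom-shuffle-demo")
  .set("spark.shuffle.manager", "com.example.MyShuffleManager")
val sc = new SparkContext(conf)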

The methods that need to be implemented, and what each one does:

1) Registration of metadata by the driver:

def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle

In general, if there are no special requirements, you can use the following implementation; in fact, this is what both hash-based shuffle and sort-based shuffle do.

override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  new BaseShuffleHandle(shuffleId, numMaps, dependency)
}

2) Obtain a shuffle writer: creates a ShuffleWriter for a given partition based on the ID of the shuffle map task.

def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext): ShuffleWriter[K, V]

3) Obtain a shuffle reader: creates a ShuffleReader based on the shuffle ID and a range of partition IDs.

def getReader[K, C](
    handle: ShuffleHandle,
    startPartition: Int,
    endPartition: Int,
    context: TaskContext): ShuffleReader[K, C]

4) Assign a value to the data member shuffleBlockManager, which holds the actual ShuffleBlockManager.

5) def unregisterShuffle(shuffleId: Int): Boolean, which deletes the local metadata for the shuffle.

6) def stop(): Unit, which stops the ShuffleManager.

For concrete implementations of each of these interfaces, see org.apache.spark.shuffle.sort.SortShuffleManager and org.apache.spark.shuffle.hash.HashShuffleManager.
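To make the overall shape concrete, here is a minimal skeleton wired from the methods above. This is only a sketch against the Spark 1.x trait as described in this section; MyShuffleManager, MyShuffleWriter, MyShuffleReader, and MyShuffleBlockManager are hypothetical names (the last three are sketched in the sections below).

import org.apache.spark.{ShuffleDependency, SparkConf, TaskContext}
import org.apache.spark.shuffle._

// SparkEnv instantiates this class by reflection, passing in the SparkConf.
class MyShuffleManager(conf: SparkConf) extends ShuffleManager {

  // 4) The data member holding the actual ShuffleBlockManager.
  override val shuffleBlockManager = new MyShuffleBlockManager

  // 1) Driver side: register metadata and return a handle for later use.
  override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    new BaseShuffleHandle(shuffleId, numMaps, dependency)
  }

  // 2) Executor side: one writer per shuffle map task.
  override def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Int,
      context: TaskContext): ShuffleWriter[K, V] = {
    new MyShuffleWriter(handle.asInstanceOf[BaseShuffleHandle[K, V, _]], mapId, context)
  }

  // 3) Executor side: one reader per range of reduce partitions.
  override def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext): ShuffleReader[K, C] = {
    new MyShuffleReader(handle.asInstanceOf[BaseShuffleHandle[K, _, C]],
      startPartition, endPartition, context)
  }

  // 5) Drop local metadata for one shuffle; nothing is tracked in this sketch.
  override def unregisterShuffle(shuffleId: Int): Boolean = true

  // 6) Release resources on shutdown.
  override def stop(): Unit = shuffleBlockManager.stop()
}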

1.1.2 org.apache.spark.shuffle.ShuffleWriter

A shuffle map task writes its shuffle data locally through a ShuffleWriter. The writer delegates most of the actual writing to the ShuffleBlockManager, so its own responsibilities are relatively lightweight.

1) def write(records: Iterator[_ <: Product2[K, V]]): Unit, which writes all the data. Note that if map-side aggregation (combine) is required, the records must be aggregated before they are written; see the sketch after this list.

2) def stop(success: Boolean): Option[MapStatus], which commits the write after writing is complete.

For hash-based shuffle, see org.apache.spark.shuffle.hash.HashShuffleWriter; for sort-based shuffle, see org.apache.spark.shuffle.sort.SortShuffleWriter.
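Here is a minimal sketch of such a writer, again against the Spark 1.x interface described above. MyShuffleWriter is a hypothetical class, and the actual file I/O and MapStatus construction are left as ??? placeholders.

import org.apache.spark.TaskContext
import org.apache.spark.scheduler.MapStatus
import org.apache.spark.shuffle.{BaseShuffleHandle, ShuffleWriter}

class MyShuffleWriter[K, V](
    handle: BaseShuffleHandle[K, V, _],
    mapId: Int,
    context: TaskContext) extends ShuffleWriter[K, V] {

  private val dep = handle.dependency
  private var stopping = false

  // 1) Aggregate on the map side first if the dependency requires it, then
  //    route each record to the bucket chosen by the partitioner.
  override def write(records: Iterator[_ <: Product2[K, V]]): Unit = {
    val iter = if (dep.aggregator.isDefined && dep.mapSideCombine) {
      dep.aggregator.get.combineValuesByKey(records, context)
    } else {
      records
    }
    for (elem <- iter) {
      val bucketId = dep.partitioner.getPartition(elem._1)
      writeRecord(bucketId, elem)
    }
  }

  // 2) Commit the write on success and report the output via MapStatus.
  override def stop(success: Boolean): Option[MapStatus] = {
    if (stopping) return None
    stopping = true
    if (success) Some(commitAndBuildMapStatus()) else None
  }

  // Placeholders: a real writer goes through the ShuffleBlockManager.
  private def writeRecord(bucketId: Int, record: Product2[K, _]): Unit = ???
  private def commitAndBuildMapStatus(): MapStatus = ???
}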

1.1.3 org.apache.spark.shuffle.ShuffleBlockManager

Its main role is to read shuffle data from the local node. These interfaces are invoked through org.apache.spark.storage.BlockManager.

1) def getBytes(blockId: ShuffleBlockId): Option[ByteBuffer], generally implemented by calling the next interface and converting the resulting ManagedBuffer to a ByteBuffer; see the sketch after this list.

2) def getBlockData(blockId: ShuffleBlockId): ManagedBuffer, the core read logic. For example, hash-based shuffle reads local files through this interface. Different implementations may organize their files differently; sort-based shuffle, for instance, must first read the index file to obtain the starting offset of each partition before it can read the actual data file.

3) def stop(): Unit, which stops the manager.

For hash-based shuffle, see org.apache.spark.shuffle.FileShuffleBlockManager; for sort-based shuffle, see org.apache.spark.shuffle.IndexShuffleBlockManager.
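As promised above, here is a minimal sketch showing the usual relationship between the two read methods. MyShuffleBlockManager is a hypothetical class against the interface described in this section, and the real file-lookup logic is reduced to a placeholder.

import java.nio.ByteBuffer
import org.apache.spark.network.buffer.ManagedBuffer
import org.apache.spark.shuffle.ShuffleBlockManager
import org.apache.spark.storage.ShuffleBlockId

class MyShuffleBlockManager extends ShuffleBlockManager {

  // 1) Usually just a ByteBuffer view over the result of getBlockData.
  override def getBytes(blockId: ShuffleBlockId): Option[ByteBuffer] = {
    Some(getBlockData(blockId).nioByteBuffer())
  }

  // 2) The core read path: locate the bytes for (shuffleId, mapId, reduceId)
  //    in whatever on-disk layout this implementation uses.
  override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = ???

  // 3) Release any resources held by the manager.
  override def stop(): Unit = {}
}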

1.1.4 org.apache.spark.shuffle.ShuffleReader

ShuffleReader implements the logic by which a downstream task reads the shuffle output of the upstream shuffle map tasks. This logic is fairly complex. In simple terms, the reader obtains the location of the data through org.apache.spark.MapOutputTracker; if the data is local, it calls getBlockData on org.apache.spark.storage.BlockManager to read it (getBlockData ultimately calls getBlockData on org.apache.spark.shuffle.ShuffleBlockManager). For the detailed shuffle read logic, see the following section.

1) def read(): Iterator[Product2[K, C]]
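A minimal sketch of such a reader, assuming the interface above; MyShuffleReader is a hypothetical class, and the fetch-and-combine logic is reduced to a placeholder.

import org.apache.spark.{InterruptibleIterator, TaskContext}
import org.apache.spark.shuffle.{BaseShuffleHandle, ShuffleReader}

class MyShuffleReader[K, C](
    handle: BaseShuffleHandle[K, _, C],
    startPartition: Int,
    endPartition: Int,
    context: TaskContext) extends ShuffleReader[K, C] {

  override def read(): Iterator[Product2[K, C]] = {
    // 1) Ask MapOutputTracker where the map outputs for these partitions live.
    // 2) Read local blocks via BlockManager.getBlockData (which ends up in
    //    ShuffleBlockManager.getBlockData); fetch remote blocks over the network.
    // 3) Apply reduce-side aggregation and ordering if the dependency asks for it.
    new InterruptibleIterator(context, fetchAndCombine())
  }

  // Placeholder for steps 1-3 above.
  private def fetchAndCombine(): Iterator[Product2[K, C]] = ???
}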


How do you develop your own shuffle mechanism? By now you should know what to do. Still not sure? Read through it again.


