Spark core source code analysis: the Spark task model

Overview

A Spark job is divided into multiple stages. The last stage contains one or more ResultTasks; the preceding stages contain one or more ShuffleMapTasks.

A ResultTask runs and returns its result to the driver application.

A ShuffleMapTask divides its output into multiple buckets according to the task's partitioner. Each bucket corresponds to one partition of the ShuffleDependency, and the total number of partitions equals the parallelism, that is, the number of reduce tasks.


Task

The Task code lives in the scheduler package.

The abstract class Task has the following constructor:

private[spark] abstract class Task[T](val stageId: Int, var partitionId: Int) extends Serializable

Each Task corresponds to a stageId and a partitionId.

It provides the runTask() and kill() interfaces.

It provides the killed flag, a taskMetrics variable, and a taskContext variable.
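Putting these pieces together, a minimal, self-contained skeleton of the class looks roughly like this (a sketch only: TaskContext and TaskMetrics are simplified stand-ins here, not Spark's real classes, and the private[spark] modifier is dropped so the snippet compiles on its own):

class TaskContext(val stageId: Int, val partitionId: Int, val attemptId: Long) {
  @volatile var interrupted = false  // simplified stand-in for Spark's TaskContext
}
class TaskMetrics  // simplified stand-in

abstract class Task[T](val stageId: Int, var partitionId: Int) extends Serializable {

  @transient protected var context: TaskContext = _
  var metrics: Option[TaskMetrics] = None
  @transient @volatile private var _killed = false

  // Entry point invoked by the executor: build the TaskContext,
  // then delegate to the concrete runTask() implementation.
  final def run(attemptId: Long): T = {
    context = new TaskContext(stageId, partitionId, attemptId)
    runTask(context)
  }

  // Implemented by subclasses such as ShuffleMapTask and ResultTask.
  def runTask(context: TaskContext): T

  def killed: Boolean = _killed

  // Mark the task as killed; a running task checks this flag cooperatively.
  def kill() {
    _killed = true
    if (context != null) context.interrupted = true
  }
}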

In addition to the basic interfaces and variables above, the companion object of Task provides methods for serializing and deserializing the task together with the JAR packages the application depends on. The reason is that the worker node must have every dependency this task requires. Note that dependencies are registered with the SparkContext; the companion object therefore provides the methods that convert these dependencies into a byte stream.
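A hedged sketch of that pattern follows. The real methods live in the Task companion object; the helper name writeTimestampedMap and the use of plain java.io streams are illustrative here, but the idea matches: write the (path, timestamp) maps of the files and jars registered with the SparkContext, then append the serialized task bytes.

import java.io.{ByteArrayOutputStream, DataOutputStream}
import java.nio.ByteBuffer
import scala.collection.mutable.HashMap

object TaskSerialization {
  // Illustrative helper: write one (name -> timestamp) dependency map.
  private def writeTimestampedMap(dataOut: DataOutputStream, map: HashMap[String, Long]) {
    dataOut.writeInt(map.size)
    for ((name, timestamp) <- map) {
      dataOut.writeUTF(name)
      dataOut.writeLong(timestamp)
    }
  }

  // Bundle the file and jar dependency maps with the already-serialized
  // task bytes into one buffer that is shipped to the worker.
  def serializeWithDependencies(
      taskBytes: Array[Byte],
      currentFiles: HashMap[String, Long],
      currentJars: HashMap[String, Long]): ByteBuffer = {
    val out = new ByteArrayOutputStream(4096)
    val dataOut = new DataOutputStream(out)
    writeTimestampedMap(dataOut, currentFiles)
    writeTimestampedMap(dataOut, currentJars)
    dataOut.flush()
    out.write(taskBytes)
    ByteBuffer.wrap(out.toByteArray)
  }
}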


Task implementation


ShuffleMapTask

ShuffleMapTask has the following constructor:

private[spark] class ShuffleMapTask(
    stageId: Int,
    var rdd: RDD[_],
    var dep: ShuffleDependency[_,_],
    _partitionId: Int,
    @transient private var locs: Seq[TaskLocation])
  extends Task[MapStatus](stageId, _partitionId)

The partitioner applied to the RDD's output comes from the ShuffleDependency (dep.partitioner).

 

ShuffleMapTask overrides the Externalizable read and write methods (writeExternal and readExternal). The content written and read includes stageId, rdd, dep, partitionId, epoch, and split. stageId, rdd, and dep are serialized and deserialized as a unit: the result is cached in memory and then written to the ObjectOutput. Serialization uses GZIP, and the serialized bytes are maintained in serializedInfoCache = new HashMap[Int, Array[Byte]]. The reason for serializing and caching: stageId, rdd, and dep together describe this shuffle stage and are shared by all of its tasks, so caching the serialized result reduces the burden on the master node.
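A hedged sketch of that caching pattern (condensed from the description above; the RDD and dependency are stood in by arbitrary serializable objects, and a plain Java object stream replaces Spark's closure serializer):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.zip.GZIPOutputStream
import scala.collection.mutable.HashMap

object ShuffleTaskInfoCache {
  // stageId -> gzipped, serialized (rdd, dep) bytes. All tasks of one stage
  // share the same rdd and dep, so the bytes are computed once and reused.
  private val serializedInfoCache = new HashMap[Int, Array[Byte]]

  def serializeInfo(stageId: Int, rdd: AnyRef, dep: AnyRef): Array[Byte] = synchronized {
    serializedInfoCache.getOrElseUpdate(stageId, {
      val out = new ByteArrayOutputStream
      // GZIP the object stream so the cached bytes stay small.
      val objOut = new ObjectOutputStream(new GZIPOutputStream(out))
      objOut.writeObject(rdd)
      objOut.writeObject(dep)
      objOut.close()
      out.toByteArray
    })
  }
}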


runTask logic

The main process is as follows:

val ser = Serializer.getSerializer(dep.serializer)
shuffle = shuffleBlockManager.forMapTask(dep.shuffleId, partitionId, numOutputSplits, ser)

This step initializes a ShuffleWriterGroup, which contains an array of BlockObjectWriters.


for (elem <- rdd.iterator(split, context)) {
  val pair = elem.asInstanceOf[Product2[Any, Any]]
  val bucketId = dep.partitioner.getPartition(pair._1)
  shuffle.writers(bucketId).write(pair)
}

This step assigns one writer per bucket: for each record, the partitioner picks a bucket by key, and the corresponding BlockObjectWriter's write() method writes the pair.
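The partitioner here is typically a hash partitioner. A minimal sketch of how getPartition maps a key to a bucket (mirroring hash partitioning in spirit, with the non-negative modulo spelled out; not Spark's actual class):

class SimpleHashPartitioner(val numPartitions: Int) {
  // Map a key to a bucket in [0, numPartitions). Java's % can return a
  // negative value for a negative hashCode, so shift such results back up.
  def getPartition(key: Any): Int = {
    if (key == null) return 0
    val rawMod = key.hashCode % numPartitions
    rawMod + (if (rawMod < 0) numPartitions else 0)
  }
}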


var totalBytes = 0L
var totalTime = 0L
val compressedSizes: Array[Byte] = shuffle.writers.map { writer: BlockObjectWriter =>
  writer.commit()
  writer.close()
  val size = writer.fileSegment().length
  totalBytes += size
  totalTime += writer.timeWriting()
  MapOutputTracker.compressSize(size)
}

This step runs writer.commit() and writer.close() on each writer, reads each file segment's length, accumulates the total bytes and write time, and compresses each segment size with MapOutputTracker.compressSize().
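compressSize squeezes each size into a single byte using a logarithmic encoding, trading precision for compactness, since these sizes are shipped to every reducer. A self-contained sketch of such an encoding (the base-1.1 scheme follows the 0.9-era source as I read it; treat the constants as an assumption):

object SizeCompression {
  private val LOG_BASE = 1.1

  // Encode a byte count in one byte: 0 stays 0, sizes up to 1 become 1,
  // larger sizes are stored as ceil(log_1.1(size)), capped at 255.
  def compressSize(size: Long): Byte = {
    if (size == 0) {
      0
    } else if (size <= 1L) {
      1
    } else {
      math.min(255, math.ceil(math.log(size) / math.log(LOG_BASE)).toInt).toByte
    }
  }

  // Decode back to an approximate size, within roughly 10% of the original.
  def decompressSize(compressedSize: Byte): Long = {
    if (compressedSize == 0) 0L else math.pow(LOG_BASE, compressedSize & 0xFF).toLong
  }
}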


val shuffleMetrics = new ShuffleWriteMetrics
shuffleMetrics.shuffleBytesWritten = totalBytes
shuffleMetrics.shuffleWriteTime = totalTime
metrics.get.shuffleWriteMetrics = Some(shuffleMetrics)
success = true
new MapStatus(blockManager.blockManagerId, compressedSizes)

This step records the metrics information and returns a MapStatus object, which contains the result information of this local ShuffleMapTask.

 

Finally, the writers are released so that the shuffle files can be recorded and reused (ShuffleBlockManager manages these files, which are written by groups of writers across shuffle tasks).

The focus of this article is on understanding this flow.


Important external classes

This section describes the important classes outside the task code, to aid understanding.


ShuffleBlockManager

Overall walkthrough:

ShuffleState maintains two ConcurrentLinkedQueues of ShuffleFileGroups (all file groups, and the currently unused ones) to record the state of the current shuffle.

A ShuffleState records the status of the file groups in one shuffle operation. Inside ShuffleBlockManager, a map maintains one ShuffleState per shuffleId.

For each shuffleId, a map task obtains a set of writers, a ShuffleWriterGroup, through the forMapTask() method. The writers in one group share the shuffleId and mapId, but each has a different bucketId and file. When file groups are allocated to the writers, an unused file group is taken from the ShuffleState for that shuffleId; if none is available, a new group of files is created on local disk.

A writer appends to its target file on local disk. When a new file is created, its name is derived from the shuffleId, the bucket number, and an incrementing fileId.
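A hedged sketch of this allocation and recycling (condensed: a file group is modeled as a bare Array[File]; the merged_shuffle naming follows the 0.9-era source as I read it):

import java.io.File
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger

// Per-shuffle state: two queues of file groups, as described above.
class ShuffleState(val numBuckets: Int) {
  val nextFileId = new AtomicInteger(0)
  val allFileGroups = new ConcurrentLinkedQueue[Array[File]]
  val unusedFileGroups = new ConcurrentLinkedQueue[Array[File]]
}

object FileGroupAllocation {
  // Reuse a group recycled by a finished map task if one exists; otherwise
  // create a fresh group with one file per bucket (that is, per reducer).
  def getFileGroup(shuffleId: Int, state: ShuffleState, dir: File): Array[File] = {
    val reused = state.unusedFileGroups.poll()
    if (reused != null) return reused
    val fileId = state.nextFileId.getAndIncrement()
    val group = Array.tabulate(state.numBuckets) { bucketId =>
      // Name derived from shuffleId, bucket number, and the incrementing fileId.
      new File(dir, "merged_shuffle_%d_%d_%d".format(shuffleId, bucketId, fileId))
    }
    state.allFileGroups.add(group)
    group
  }

  // When a map task finishes, its group is recycled for the next mapper to append to.
  def recycleFileGroup(state: ShuffleState, group: Array[File]) {
    state.unusedFileGroups.add(group)
  }
}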


ShuffleFileGroup's reuse of files, and its bookkeeping of mapId, index, and offset, can seem obscure; a simplified model is sketched under ShuffleFileGroup below.

 

Important methods:

def forMapTask(shuffleId: Int, mapId: Int, numBuckets: Int, serializer: Serializer) = {
  new ShuffleWriterGroup {}
}

This method is called by a ShuffleMapTask, which passes in the id of the shuffle operation; mapId is the partitionId, and the number of buckets equals the number of partitions. The ShuffleWriterGroup returned by this method is a group of DiskBlockObjectWriters. Each writer belongs to this shuffle operation, so they share the same shuffleId and mapId, but they correspond to different buckets, and each has its own file.

 

It is called in ShuffleMapTask's run logic, with the following arguments:

val ser = Serializer.getSerializer(dep.serializer)
shuffle = shuffleBlockManager.forMapTask(dep.shuffleId, partitionId, numOutputSplits, ser)

shuffleId is a globally unique id obtained from the ShuffleDependency, identifying this shuffle operation.

mapId equals the partitionId.

The number of buckets equals the number of partitions.

 

Generating the writers:

The writers are of type DiskBlockObjectWriter, and their number equals the number of buckets. The buffer size is set as follows:

conf.getInt("spark.shuffle.file.buffer.kb", 100) * 1024

The blockId is generated as follows:

blockId = ShuffleBlockId(shuffleId, mapId, bucketId)

When generating a writer, the getDiskWriter method of BlockManager is called; ShuffleBlockManager is bound to a BlockManager when it is initialized.

private[spark] class DiskBlockObjectWriter(
    blockId: BlockId,
    file: File,
    serializer: Serializer,
    bufferSize: Int,
    compressStream: OutputStream => OutputStream,
    syncWrites: Boolean)
  extends BlockObjectWriter(blockId)
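Inside forMapTask, the writer array is created roughly as follows (a sketch of the consolidation path, not verbatim source; fileGroup, bufferSize, and the other names come from the enclosing method):

val writers = Array.tabulate[BlockObjectWriter](numBuckets) { bucketId =>
  val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
  // Each writer appends its bucket's records to the group's file for that bucket.
  blockManager.getDiskWriter(blockId, fileGroup(bucketId), serializer, bufferSize)
}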

ShuffleFileGroup: a private internal class corresponding to a group of shuffle files, where each file corresponds to one reducer. A mapper is assigned a ShuffleFileGroup, and the mapper's results are written into this group of files.
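To make the file reuse less obscure, here is a simplified, self-contained model of the bookkeeping (plain collections stand in for Spark's primitive-specialized ones, and segment lengths are omitted): each reducer has one file, and for every map task that wrote into the group, the group records the offset where that mapper's segment begins in each reducer file.

import scala.collection.mutable.{ArrayBuffer, HashMap}

// One file per reducer; many map tasks append to the same group in turn.
class SimpleShuffleFileGroup(val numReducers: Int) {
  // mapId -> index of that mapper within this group.
  private val mapIdToIndex = new HashMap[Int, Int]
  // For each reducer file, the start offset of every mapper's segment.
  private val blockOffsetsByReducer = Array.fill(numReducers)(new ArrayBuffer[Long])

  // Record where the given map task's output begins in each reducer file.
  def recordMapOutput(mapId: Int, offsets: Array[Long]) {
    mapIdToIndex(mapId) = mapIdToIndex.size
    for (r <- 0 until numReducers) {
      blockOffsetsByReducer(r) += offsets(r)
    }
  }

  // Locate a (mapId, reducerId) block's offset, for a later fetch.
  def getBlockOffset(mapId: Int, reducerId: Int): Option[Long] =
    mapIdToIndex.get(mapId).map(index => blockOffsetsByReducer(reducerId)(index))
}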

MapStatus

Note that ShuffleMapTask's result type is MapStatus. The MapStatus class is the running result that a ShuffleMapTask returns to the scheduler. It contains two items:

class MapStatus(var location: BlockManagerId, var compressedSizes: Array[Byte])

The former is the address of the block manager where the task ran (BlockManagerId is a class that stores executorId, host, port, and nettyPort), and the latter is the output size, which will be passed to the next reduce task. The sizes are compressed by MapOutputTracker.


Like ShuffleMapTask, the MapStatus class provides the two Externalizable methods:

def writeExternal(out: ObjectOutput) {
  location.writeExternal(out)
  out.writeInt(compressedSizes.length)
  out.write(compressedSizes)
}

def readExternal(in: ObjectInput) {
  location = BlockManagerId(in)
  compressedSizes = new Array[Byte](in.readInt())
  in.readFully(compressedSizes)
}

BlockManagerId

The BlockManagerId class is constructed from executorId, host, port, and nettyPort. Its companion object maintains a blockManagerIdCache, implemented as a ConcurrentHashMap[BlockManagerId, BlockManagerId]().

For example, when the readExternal method of MapStatus passes the ObjectInput to the BlockManagerId constructor, the BlockManagerId apply() method extracts the executorId, host, port, and nettyPort information from the ObjectInput and interns the resulting BlockManagerId object in blockManagerIdCache.
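A self-contained sketch of that interning pattern (a simplified class, not Spark's; the putIfAbsent-then-get idiom keeps exactly one canonical instance per distinct id):

import java.util.concurrent.ConcurrentHashMap

case class SimpleBlockManagerId(executorId: String, host: String, port: Int, nettyPort: Int)

object SimpleBlockManagerId {
  // Deserialization can produce many equal copies of the same id, so intern
  // them and always hand back a single shared object per distinct id.
  private val blockManagerIdCache =
    new ConcurrentHashMap[SimpleBlockManagerId, SimpleBlockManagerId]()

  def getCachedBlockManagerId(id: SimpleBlockManagerId): SimpleBlockManagerId = {
    blockManagerIdCache.putIfAbsent(id, id)
    blockManagerIdCache.get(id)
  }
}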


ResultTask

Its constructor is as follows:

private[spark] class ResultTask[T, U](
    stageId: Int,
    var rdd: RDD[T],
    var func: (TaskContext, Iterator[T]) => U,
    _partitionId: Int,
    @transient locs: Seq[TaskLocation],
    var outputId: Int)
  extends Task[U](stageId, _partitionId) with Externalizable {

ResultTask is much simpler by comparison. Its runTask method calls the RDD's iterator:

override def runTask(context: TaskContext): U = {
  metrics = Some(context.taskMetrics)
  try {
    func(context, rdd.iterator(split, context))
  } finally {
    context.executeOnCompleteCallbacks()
  }
}

Process Model vs. Thread Model

In Spark, tasks on the same node execute as multiple threads inside a single JVM process.

 

Strengths:

Fast task startup

Shared memory, suitable for memory-intensive tasks

Resources occupied by an executor can be reused

 

Disadvantages:

Because all tasks on the same node execute in one process, serious resource contention can occur, and it is difficult to control the resources occupied by each task at a fine granularity. By contrast, MapReduce configures different resources for map tasks and reduce tasks, controlling per-task resource usage at a fine granularity.

 

Every task in MapReduce, on the other hand, is its own JVM process and must go through the cycle of applying for resources, executing, and releasing resources.

 

Each node can run one or more executors, each with a certain number of slots; an executor can run multiple ResultTasks and ShuffleMapTasks.

As for shared memory: broadcast variables are stored once in each executor, and all tasks in that executor can share them.
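A standard usage sketch (assuming an existing SparkContext named sc; the lookup table is illustrative):

// Ship a read-only lookup table to each executor once, instead of once per task.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Every task scheduled into the same executor reads the same in-memory copy.
val counts = sc.parallelize(Seq("a", "b", "a")).map(word => lookup.value(word))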






