Spark Source code reading


RDD stands for Resilient Distributed Dataset and is the core abstraction in Spark.

An RDD is a read-only, immutable dataset with a solid fault-tolerance mechanism. It has five main properties:

- A list of partitions: the data is split into shards that can be computed in parallel.

- A function for computing each split: one function computes one shard.

- A list of dependencies on other RDDs.

- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned).

- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file): the optimal compute location for each shard.

RDD is the foundation on which all Spark components run. It is a fault-tolerant, parallel data structure that provides a rich set of data operations and APIs.

The RDD API in Spark

 

An RDD can contain multiple partitions, each of which is a segment of the dataset. RDDs can depend on one another.

Narrow dependency: a one-to-one relationship; each parent RDD partition is used by at most one child RDD partition.

Wide dependency: a one-to-many relationship; multiple child RDD partitions depend on the same parent RDD partition.
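As a quick illustration (a minimal sketch, assuming a spark-shell session with a SparkContext named sc), a map keeps a narrow dependency on its parent, while reduceByKey introduces a wide dependency across a shuffle; toDebugString prints the resulting lineage:

// Minimal sketch, assuming a SparkContext named `sc` (e.g. in spark-shell).
val pairs  = sc.parallelize(1 to 10, 2).map(i => (i % 3, i)) // map: narrow dependency on the parent
val summed = pairs.reduceByKey(_ + _)                        // reduceByKey: wide (shuffle) dependency
println(summed.toDebugString)                                // the lineage shows a ShuffledRDD above the mapped RDD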

See the RDD dependency diagram (figure omitted).

 

Several important methods are defined in the source class org.apache.spark.rdd.RDD:

1)

/**
 * Implemented by subclasses to return the set of partitions in this RDD. This method will only
 * be called once, so it is safe to implement a time-consuming computation in it.
 */
protected def getPartitions: Array[Partition]

This method returns the RDD's partitions, stored in an array.

 

2)

/**
 * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
 * be called once, so it is safe to implement a time-consuming computation in it.
 */
protected def getDependencies: Seq[Dependency[_]] = deps

It returns the dependencies as a Seq.

 

3)

/**
 * :: DeveloperApi ::
 * Implemented by subclasses to compute a given partition.
 */
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]

Each concrete RDD implements its own compute function for its partitions.

 

4)

/**
 * Optionally overridden by subclasses to specify placement preferences.
 */
protected def getPreferredLocations(split: Partition): Seq[String] = Nil

This returns the preferred locations for a partition; it is the placement policy used for locality-aware scheduling.
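To see how a concrete RDD plugs into these hooks, here is a minimal sketch of a hypothetical custom RDD over a local sequence; the class names are invented for illustration and follow the Spark 1.x API shown above:

import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical single-partition RDD over a local Seq, for illustration only.
private class LocalSeqPartition(override val index: Int) extends Partition

private class LocalSeqRDD[T: ClassTag](sc: SparkContext, data: Seq[T])
  extends RDD[T](sc, Nil) {                                        // Nil: no parent RDDs, hence no dependencies

  // A single partition that covers the whole sequence.
  override protected def getPartitions: Array[Partition] = Array(new LocalSeqPartition(0))

  // Computing a partition just iterates over the local data.
  override def compute(split: Partition, context: TaskContext): Iterator[T] = data.iterator

  // A purely local collection has no placement preference.
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}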

 

RDD Transformations and Actions

 

RDD data operations fall into two main categories:

Transformations

map(f: T => U) : RDD[T] => RDD[U]
filter(f: T => Bool) : RDD[T] => RDD[T]
flatMap(f: T => Seq[U]) : RDD[T] => RDD[U]
sample(fraction: Float) : RDD[T] => RDD[T] (deterministic sampling)
groupByKey() : RDD[(K, V)] => RDD[(K, Seq[V])]
reduceByKey(f: (V, V) => V) : RDD[(K, V)] => RDD[(K, V)]
union() : (RDD[T], RDD[T]) => RDD[T]
join() : (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (V, W))]
cogroup() : (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (Seq[V], Seq[W]))]
crossProduct() : (RDD[T], RDD[U]) => RDD[(T, U)]
mapValues(f: V => W) : RDD[(K, V)] => RDD[(K, W)] (preserves partitioning)
sort(c: Comparator[K]) : RDD[(K, V)] => RDD[(K, V)]
partitionBy(p: Partitioner[K]) : RDD[(K, V)] => RDD[(K, V)]

Actions

count() : RDD[T] => Long
collect() : RDD[T] => Seq[T]
reduce(f: (T, T) => T) : RDD[T] => T
lookup(k: K) : RDD[(K, V)] => Seq[V] (on hash/range-partitioned RDDs)
save(path: String) : outputs the RDD to a storage system, e.g. HDFS
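Transformations are lazy: they only record lineage, and nothing runs until an action is invoked. A minimal sketch (assuming a SparkContext named sc):

val words   = sc.parallelize(Seq("a", "bb", "ccc"))
val lengths = words.map(_.length)    // transformation: builds a new RDD, no job runs yet
val total   = lengths.reduce(_ + _)  // action: submits a job and returns 6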

 

 

Let's take a look at the Transformations section.

// Transformations (return a new RDD)

 

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))

/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] =
  new FlatMappedRDD(this, sc.clean(f))

/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = new FilteredRDD(this, sc.clean(f))

......

 

Map

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))

map returns a MappedRDD, which extends RDD and overrides two methods: getPartitions and compute.

 

The first, getPartitions, obtains the partition array of the first parent RDD:

override def getPartitions: Array[Partition] = firstParent[T].partitions

 

The second, compute, iterates over the parent partition's elements and applies the map function f to each of them:

override def compute(split: Partition, context: TaskContext) =
  firstParent[T].iterator(split, context).map(f)
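Putting the two overrides together, MappedRDD in the Spark 1.x source is roughly the following (a simplified sketch, not a verbatim copy of the class):

package org.apache.spark.rdd

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}

private[spark] class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {                                  // one-to-one (narrow) dependency on `prev`

  // Reuse the parent's partitioning unchanged.
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // Lazily apply f to every element of the parent partition's iterator.
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    firstParent[T].iterator(split, context).map(f)
}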

 

 

Filter

/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = new FilteredRDD(this, sc.clean(f))

 

filter is a filtering operation, for example mapRDD.filter(_ > 1).

 

Union

/**
 * Return the union of this RDD and another one. Any identical elements will appear multiple
 * times (use `.distinct()` to eliminate them).
 */
def union(other: RDD[T]): RDD[T] = new UnionRDD(sc, Array(this, other))

UnionRDD combines multiple RDDs into a new RDD and overrides five RDD methods: getPartitions, getDependencies, compute, getPreferredLocations, and clearDependencies.

From getPartitions and getDependencies we can see that it builds one RangeDependency per parent, which is a narrow dependency: each output partition comes from exactly one parent partition.

override def getDependencies: Seq[Dependency[_]] = {
  val deps = new ArrayBuffer[Dependency[_]]
  var pos = 0
  for (rdd <- rdds) {
    deps += new RangeDependency(rdd, 0, pos, rdd.partitions.size)
    pos += rdd.partitions.size
  }
  deps
}
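As a quick sanity check of this logic (a minimal sketch, assuming a SparkContext named sc), the union of a 2-partition RDD and a 3-partition RDD has 5 partitions and one RangeDependency per parent:

val a = sc.parallelize(1 to 4, 2)    // occupies partitions 0-1 of the union
val b = sc.parallelize(5 to 10, 3)   // occupies partitions 2-4 of the union
val u = a.union(b)
println(u.partitions.length)         // 5
println(u.dependencies.length)       // 2: one RangeDependency per parent RDD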

 

 

GroupBy

/**
 * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
 * mapping to that key.
 *
 * Note: This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
 * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
 */
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] =
  groupBy[K](f, defaultPartitioner(this))

Grouping by the given function produces a new RDD.
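To illustrate the performance note in the comment (a minimal sketch, assuming a SparkContext named sc), a per-key sum is better written with reduceByKey, which combines values on the map side instead of shipping every element through the shuffle:

val nums = sc.parallelize(1 to 100)
// groupBy then sum: every value crosses the shuffle before being aggregated.
val viaGroupBy = nums.groupBy(_ % 2).mapValues(_.sum)
// reduceByKey: values are partially summed locally before the shuffle.
val viaReduce  = nums.map(n => (n % 2, n)).reduceByKey(_ + _)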

 

Actions

Count

/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

Tracing the code, this runJob call in SparkContext invokes dagScheduler.runJob. Inside the DAGScheduler the job is submitted and a JobWaiter object is returned; the JobWaiter blocks until the job finishes and can also be used to cancel the job.
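To make the runJob call concrete, here is a minimal sketch (assuming a SparkContext named sc) that does by hand what count does: run a function over each partition's iterator and sum the per-partition results on the driver:

val rdd = sc.parallelize(1 to 10, 4)
// One Long per partition, computed by a job on the executors.
val perPartition: Array[Long] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.size.toLong)
println(perPartition.sum)  // 10, the same as rdd.count()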

 

 

RDD Task Scheduling

 

From the scheduling diagram (figure omitted):

 

The RDD objects form a DAG, which is then handed to the DAGScheduler:

1. The DAGScheduler is a stage-oriented, high-level scheduler. It splits the DAG into groups of tasks, and each group of tasks is one stage in the figure.

2. Each shuffle produces a new stage. The DAGScheduler records which RDDs are materialized to disk and, to place tasks optimally, prefers to run each task where its data is local.

3. The DAGScheduler also monitors task failures caused by lost shuffle output and resubmits the affected stages when they have to be rerun.

 

After the DAGScheduler has divided the job into stages, tasks are submitted to the TaskScheduler in units of TaskSets:

1. One TaskScheduler serves exactly one SparkContext.

2. After receiving a TaskSet, the TaskScheduler submits its tasks to the Executors on the worker nodes to run. Failed tasks are monitored and retried by the TaskScheduler.

 

Executors run tasks in multiple threads, with each thread responsible for one task.

 

Next, let's trace the source code of an example that ships with Spark:

Source: org.apache.spark.examples.SparkPi

def main(args: Array[String]) {
  // Set an application name (for display in the Web UI)
  val conf = new SparkConf().setAppName("Spark Pi")
  // Instantiate a SparkContext
  val spark = new SparkContext(conf)
  // Number of slices (partitions), taken from the arguments
  val slices = if (args.length > 0) args(0).toInt else 2
  val n = 100000 * slices
  val count = spark.parallelize(1 to n, slices).map { i =>
    val x = random * 2 - 1
    val y = random * 2 - 1
    if (x * x + y * y < 1) 1 else 0
  }.reduce(_ + _)
  println("Pi is roughly " + 4.0 * count / n)
  spark.stop()
}
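The estimate works because each point (x, y) is uniform over the square [-1, 1] x [-1, 1], whose area is 4, while the unit circle inside it has area pi. The fraction count / n therefore approximates pi / 4, so 4.0 * count / n approximates pi.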

 

In this code, parallelize loads the collection lazily. Tracing its source code:

/** Distribute a local Scala collection to form an RDD.
 *
 * @note Parallelize acts lazily. If `seq` is a mutable collection and is
 * altered after the call to parallelize and before the first action on the
 * RDD, the resultant RDD will reflect the modified collection. Pass a copy of
 * the argument to avoid this.
 */
def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] = {
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}

The example then calls map on this RDD; as described earlier, map is a transformation that produces a new RDD, and reduce is the action that finally triggers the job.

 

Here is a word-count example in the Spark shell:

scala> val rdd = sc.textFile("hdfs://192.168.0.245:8020/test/README.md")
14/12/18 01:12:26 INFO storage.MemoryStore: ensureFreeSpace(82180) called with curMem=331133, maxMem=280248975
14/12/18 01:12:26 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 80.3 KB, free 266.9 MB)
rdd: org.apache.spark.rdd.RDD[String] = hdfs://192.168.0.245:8020/test/README.md MappedRDD[7] at textFile at <console>:12

scala> rdd.toDebugString
14/12/18 01:12:29 INFO mapred.FileInputFormat: Total input paths to process: 1
res3: String =
(1) hdfs://192.168.0.245:8020/test/README.md MappedRDD[7] at textFile at <console>:12
 |  hdfs://192.168.0.245:8020/test/README.md HadoopRDD[6] at textFile at <console>:12

 

sc reads the file from HDFS; toDebugString shows that the MappedRDD is built on top of a HadoopRDD.

scala> val result = rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect
14/12/18 01:14:51 INFO spark.SparkContext: Starting job: collect at <console>:14
14/12/18 01:14:51 INFO scheduler.DAGScheduler: Registering RDD 9 (map at <console>:14)
14/12/18 01:14:51 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:14) with 1 output partitions (allowLocal=false)
14/12/18 01:14:51 INFO scheduler.DAGScheduler: Final stage: Stage 0 (collect at <console>:14)
14/12/18 01:14:51 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 1)
14/12/18 01:14:51 INFO scheduler.DAGScheduler: Missing parents: List(Stage 1)
14/12/18 01:14:51 INFO scheduler.DAGScheduler: Submitting Stage 1 (MappedRDD[9] at map at <console>:14), which has no missing parents
14/12/18 01:14:51 INFO storage.MemoryStore: ensureFreeSpace(3440) called with curMem=413313, maxMem=280248975
14/12/18 01:14:51 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 3.4 KB, free 266.9 MB)
14/12/18 01:14:51 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 1 (MappedRDD[9] at map at <console>:14)
14/12/18 01:14:51 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
14/12/18 01:14:51 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, localhost, ANY, 1185 bytes)
14/12/18 01:14:51 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 0)
14/12/18 01:14:51 INFO rdd.HadoopRDD: Input split: hdfs://192.168.0.245:8020/test/README.md:0+4811
14/12/18 01:14:51 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/12/18 01:14:51 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/12/18 01:14:51 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/12/18 01:14:51 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
14/12/18 01:14:51 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/12/18 01:14:52 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 0). 1860 bytes result sent to driver
14/12/18 01:14:53 INFO scheduler.DAGScheduler: Stage 1 (map at <console>:14) finished in 1.450 s
14/12/18 01:14:53 INFO scheduler.DAGScheduler: looking for newly runnable stages
14/12/18 01:14:53 INFO scheduler.DAGScheduler: running: Set()
14/12/18 01:14:53 INFO scheduler.DAGScheduler: waiting: Set(Stage 0)
14/12/18 01:14:53 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) in 1419 ms on localhost (1/1)
14/12/18 01:14:53 INFO scheduler.DAGScheduler: failed: Set()
14/12/18 01:14:53 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
14/12/18 01:14:53 INFO scheduler.DAGScheduler: Missing parents for Stage 0: List()
14/12/18 01:14:53 INFO scheduler.DAGScheduler: Submitting Stage 0 (ShuffledRDD[10] at reduceByKey at <console>:14), which is now runnable
14/12/18 01:14:53 INFO storage.MemoryStore: ensureFreeSpace(2112) called with curMem=416753, maxMem=280248975
14/12/18 01:14:53 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 2.1 KB, free 266.9 MB)
14/12/18 01:14:53 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 0 (ShuffledRDD[10] at reduceByKey at <console>:14)
14/12/18 01:14:53 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/12/18 01:14:53 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 948 bytes)
14/12/18 01:14:53 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 1)
14/12/18 01:14:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/12/18 01:14:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
14/12/18 01:14:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 5 ms
14/12/18 01:14:53 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 1). 8680 bytes result sent to driver
14/12/18 01:14:53 INFO scheduler.DAGScheduler: Stage 0 (collect at <console>:14) finished in 0.108 s
14/12/18 01:14:53 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 1) in 99 ms on localhost (1/1)
14/12/18 01:14:53 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/12/18 01:14:53 INFO spark.SparkContext: Job finished: collect at <console>:14, took 1.884598939 s
result: Array[(String, Int)] = Array((For,5), (Programs,1), (gladly,1), (Because,1), (,1), (agree,1), (cluster.,1), (webpage,1), (its,1), (-Pyarn,3), (under,2), (legal,1), (APIs,1), (1.x,,1), (computation,1), (Try,1), (MRv1,1), (have,2), (Thrift,2), (add,2), (through,1), (several,1), (This,2), (Whether,1), ("yarn-cluster",1), (%,2), (graph,1), (storage,1), (To,2), (setting,2), (any,2), (Once,1), (application,1), (JDBC,3), (use:,1), (prefer,1), (SparkPi,2), (engine,1), (version,3), (file,1), (documentation,1), (processing,2), (Along,1), (the,28), (explicitly,,1), (entry,1), (author.,1), (are,2), (systems.,1), (params,1), (not,2), (different,1), (refer,1), (Interactive,2), (given.,1), (if,5), ('-pyarn':,1), (build,3), (when,3), (be,2), (Tests,1), (file's,1), (Apache,6), (./bin/run-e...

These are the counts for each word after splitting the text on spaces.

