Spark Programming Model (II): RDD in Detail

RDD in Detail

This article is a summary of the Spark RDD paper, interspersed with notes on Spark's internal implementation, corresponding to Spark 2.0.
Motivation

Traditional distributed computing frameworks (such as MapReduce) usually store intermediate results on disk, which incurs very heavy IO cost. This is especially painful for machine learning algorithms that iterate over the result of the previous computation: if every intermediate result has to be written to disk and read back, a great deal of time is wasted. The Spark paper therefore proposes a new distributed data abstraction: the RDD.
Design Ideas and Features

The Resilient Distributed Dataset (RDD) is the core data abstraction in Apache Spark: a read-only, partitioned collection of records. Its main features are:
  • Lazy evaluation: an RDD is computed only when its result is needed
  • Partitioned data: the records inside an RDD are split into partitions, which may be distributed across different nodes of the cluster
  • Parallel computation: partitions can be processed in parallel
  • Resilience: with the RDD lineage graph, Spark can re-execute only the failed compute tasks instead of recomputing everything, which provides fault tolerance at low cost

So how do we manipulate this data? Spark provides a set of functional-programming-style APIs that let you transform and compute over an RDD as easily as over an ordinary collection. For example:

val rdd = sc.parallelize(1 to 10)
val result = rdd.map(_ + 1)   // the constants in map/filter were lost in the original text; 1 and 5 are illustrative
  .filter(_ > 5)
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .collect()

Developers can compose RDDs and the operations between them as needed, which is very convenient. As you can see, an RDD is essentially an abstract representation of the data, while the operations form a small DSL used to transform or evaluate the RDD.
The Representation of the RDD

An RDD in Spark consists of five main pieces (a minimal sketch follows the list):
  • partitions(): the set of partitions
  • dependencies(): the dependencies of the current RDD on its parents
  • iterator(split, context): the function that computes or reads the data of each partition
  • partitioner(): the partitioning scheme, such as HashPartitioner or RangePartitioner
  • preferredLocations(split): the nodes from which a given partition can be accessed fastest
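As a rough illustration only (not Spark source code), a hypothetical custom RDD might wire these pieces together as follows; RangeRDD, RangePartition, n, and numSlices are made-up names for this sketch:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical RDD producing the numbers [0, n), split across numSlices partitions.
class RangeRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {

  private case class RangePartition(index: Int) extends Partition

  // partitions(): the partition collection, built via the protected getPartitions hook
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map(i => RangePartition(i): Partition).toArray

  // iterator(split, context) eventually calls compute(): how one partition's data is produced
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val i = split.index
    (i * n / numSlices until (i + 1) * n / numSlices).iterator
  }

  // dependencies(): Nil was passed to the constructor because this RDD has no parent;
  // partitioner and preferredLocations are left at their defaults (None / Nil) in this sketch
}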

All RDDs inherit from the abstract class RDD. Several common operations and the RDD subclasses they produce:
  • sc#textFile: produces a HadoopRDD, an RDD that reads data from HDFS
  • sc#parallelize: produces a ParallelCollectionRDD, an RDD built from a Scala collection
  • map, flatMap, filter: produce a MapPartitionsRDD, whose partitions are consistent with the parent RDD and which (lazily) applies the operation to the data returned by the parent RDD's iterator function
  • union: produces a UnionRDD or a PartitionerAwareUnionRDD
  • reduceByKey, groupByKey: produce a ShuffledRDD, which requires a shuffle
  • cogroup, join: produce a CoGroupedRDD
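These subclasses can be seen in the lineage of a concrete RDD via toDebugString; the exact output below is only indicative:

val words = sc.textFile("data.txt")   // HadoopRDD wrapped in a MapPartitionsRDD
  .flatMap(_.split(" "))              // MapPartitionsRDD
  .map(w => (w, 1))                   // MapPartitionsRDD
  .reduceByKey(_ + _)                 // ShuffledRDD
println(words.toDebugString)          // prints the lineage with the RDD class names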

Operations

RDD operations in Spark fall into two categories: transformations and actions. A transformation is lazy: it simply records the computation step and returns a new RDD without performing any work. An action actually executes the accumulated computation steps and returns the result.

These transformations and actions should look familiar from functional programming: map, flatMap, filter, reduce, count, sum, and so on.
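A quick sketch of the laziness, assuming a SparkContext sc is available:

val squares = sc.parallelize(1 to 1000000).map(x => x * x)  // transformation: returns immediately, nothing computed yet
val total = squares.count()                                 // action: triggers the actual computation and returns 1000000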

The transformations that operate on single records live in the RDD abstract class, while the transformations that operate on key-value tuples live in the PairRDDFunctions wrapper class. When its element type fits, an RDD is automatically converted to PairRDDFunctions by an implicit conversion, which enables operations such as reduceByKey. The corresponding implicit function:

implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
  (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
  new PairRDDFunctions(rdd)
}
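Thanks to this implicit, a plain RDD of pairs gains the PairRDDFunctions methods directly; a small example (with made-up data):

import org.apache.spark.rdd.RDD

val counts: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val reduced = counts.reduceByKey(_ + _)   // resolved via rddToPairRDDFunctions
// reduced contains ("a", 4) and ("b", 2)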
Dependency

As mentioned above, an RDD computes its result only when needed: calling a transformation merely records the transformation information, and the real computation happens only when an action is invoked. The RDDs in a Spark program are therefore connected, and together they form the lineage graph (dependency graph). Dependencies are broadly divided into two kinds (see the sketch after this list):
  • Narrow dependency (NarrowDependency): each partition of the parent RDD is used by at most one partition of the child RDD, i.e. a one-to-one relationship. Transformations such as map, flatMap, and filter create narrow dependencies
  • Wide dependency (ShuffleDependency): each partition of the parent RDD is used by multiple partitions of the child RDD, i.e. a one-to-many relationship. For example, the RDD produced by join is generally a wide dependency (when the parents have different partitioners)
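The dependency type of an RDD can be inspected directly through the dependencies method; the classes noted in the comments are what Spark reports for this particular pipeline, though the exact wrapper RDDs may vary:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
val mapped = pairs.mapValues(_ + 1)
val reduced = pairs.reduceByKey(_ + _)

println(mapped.dependencies)   // narrow: OneToOneDependency on the parent
println(reduced.dependencies)  // wide: ShuffleDependency (unless the parent is already suitably partitioned)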

The figure in the paper gives a visual representation of the dependencies between RDDs:

The reasons for distinguishing the two kinds of dependency:
  • Narrow dependencies can easily be executed as a pipeline, chaining the operations from start to finish, whereas a wide dependency must wait until all partitions of the parent RDD are ready before the computation can run
  • When a narrow dependency fails, Spark only needs to recompute the failed parent partitions, whereas for a wide dependency a failure can lose several partitions and force a much larger recomputation
Shuffle

The shuffle operation in Spark is similar to the one in MapReduce; it is triggered when the RDD corresponding to a wide dependency is computed (that is, inside a ShuffleMapStage).

First, let's review why a shuffle is needed at all. Take reduceByKey as an example: Spark must gather the tuples with the same key together before it can apply the reduce function to them. However, those tuples may live in different partitions, and even on different cluster nodes, so they have to be brought together first. Spark therefore uses a set of map tasks to write each partition's output to temporary files; the tasks of the next stage (the reduce tasks) then fetch those files by number, aggregate the tuples in each partition by key, and apply the requested operation. Sorting may also be involved (done either on the map side or on the reduce side).
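As an illustration of where the shuffle boundary sits (not of Spark's internals), the number of reduce-side partitions can be set explicitly on reduceByKey; the map tasks then partition their output for that many reduce tasks:

val pairs = sc.parallelize(1 to 100).map(x => (x % 10, x))
val summed = pairs.reduceByKey(_ + _, numPartitions = 4)  // map output is partitioned for 4 reduce tasks
println(summed.getNumPartitions)                          // 4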

The shuffle is one of Spark's main performance bottlenecks (it involves disk IO, data serialization, and network IO), and optimizing it has always been a challenge. The relevant entry points in the code are:
  • Shuffle write (map task): SortShuffleWriter#write
  • Shuffle read (reduce task): ShuffledRDD#compute
Persistence

The purpose of checkpointing is to save RDD data that is expensive to recompute (long lineage chains). Executing a checkpoint submits a new job, so it is best to persist the RDD before checkpointing it.
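A minimal usage sketch, assuming a SparkContext sc and an example checkpoint directory path:

sc.setCheckpointDir("/tmp/spark-checkpoints")   // example path where checkpointed RDD data is written

val expensive = sc.textFile("data.txt").flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
expensive.persist()      // cache first, so the checkpoint job reuses the computed data
expensive.checkpoint()   // marks the RDD; the data is actually saved when an action runs
expensive.count()        // triggers the computation and the checkpoint job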

cache and persist are used to cache RDD results that are reused frequently (but should not be too large). persist works mainly by setting the storageLevel, so that compute performs the corresponding persistence through the BlockManager; cache is simply persist with the storage level set to MEMORY_ONLY.
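For example (a sketch; which storage level to pick depends on the data size and memory budget):

import org.apache.spark.storage.StorageLevel

val links = sc.textFile("data.txt").map(_.split("\\s+")).map(a => (a(0), a(1)))
links.persist(StorageLevel.MEMORY_AND_DISK)  // keep in memory, spill to disk if it does not fit
// links.cache() would be equivalent to links.persist(StorageLevel.MEMORY_ONLY)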

Job Scheduling

Simply put, Spark divides a submitted computation into different stages, which form a directed acyclic graph (DAG). Spark's scheduler computes each stage in turn, following the order of the DAG, and produces the final result. The important classes and interfaces involved in executing a computation are DAGScheduler, ActiveJob, Stage, Task, TaskScheduler, and SchedulerBackend.

The most important of these is DAGScheduler, which translates the logical execution plan (the RDD lineage) into a physical execution plan (stages and tasks). As mentioned before, Spark performs the real computation only when the developer invokes an action on an RDD. At that point SparkContext hands the current logical execution plan to DAGScheduler, which generates a job (represented by the ActiveJob class) from it and submits it. Each execution of an action generates one ActiveJob.

During job submission, DAGScheduler divides the work into stages. Spark cuts stages at shuffle operations: stages are separated by wide dependencies, and all dependencies inside a single stage are narrow. The benefit of this partitioning is that as many narrow dependencies as possible end up in the same stage, which makes pipelined computation easy, while the child RDD of a wide dependency must wait for all of its parent RDDs to finish before the shuffle can happen, so the shuffle boundary is the natural place to cut.

The resulting stages form a DAG. DAGScheduler submits a stage's parent stages first (if any), then the stage itself, and so on, starting from the stages that have no parents and following the order of the DAG. From the execution point of view, a stage can run only after all of its parent stages have finished. The last stage, which produces the final result, is a ResultStage; all other stages are ShuffleMapStages. Here is the stage-division figure from the paper, which is very intuitive:

When a stage is submitted, Spark generates a set of tasks of the corresponding type (ResultTask or ShuffleMapTask) based on the type of the stage, wraps them into a TaskSet, and submits it to the TaskScheduler. Each task corresponds to one partition of an RDD, and a task is responsible for computing only that partition:

val tasks: Seq[Task[_]] = try {
  stage match {
    case stage: ShuffleMapStage =>
      partitionsToCompute.map { id =>
        val locs = taskIdToLocations(id)
        val part = stage.rdd.partitions(id)
        new ShuffleMapTask(stage.id, stage.latestInfo.attemptId, taskBinary, part,
          locs, stage.latestInfo.taskMetrics, properties)
      }
    case stage: ResultStage =>
      val job = stage.activeJob.get
      partitionsToCompute.map { id =>
        val p: Int = stage.partitions(id)
        val part = stage.rdd.partitions(p)
        val locs = taskIdToLocations(id)
        new ResultTask(stage.id, stage.latestInfo.attemptId, taskBinary, part,
          locs, id, properties, stage.latestInfo.taskMetrics)
      }
  }
} catch {
  // code omitted here ...
}

TaskScheduler then sends ReviveOffers messages to the scheduler backend (SchedulerBackend, which can be local, Mesos, Hadoop YARN, or another cluster manager). The corresponding execution backend receives the tasks, wraps each one into a TaskRunner (an implementation of the Runnable interface), and submits it to the underlying Executor, which runs the compute tasks in parallel.

The thread pool in Executor is defined as follows:

private val threadPool = ThreadUtils.newDaemonCachedThreadPool("Executor task launch worker")

def newDaemonCachedThreadPool(prefix: String): ThreadPoolExecutor = {
  val threadFactory = namedThreadFactory(prefix)
  Executors.newCachedThreadPool(threadFactory).asInstanceOf[ThreadPoolExecutor]
}

As you can see, the thread pool that actually executes tasks is a CachedThreadPool from java.util.concurrent: it creates new threads on demand and reuses the threads that already exist in the pool.

Finally, a picture summarizing the relationship between job, stage, and task (taken from Mastering Apache Spark 2.0):

A step-by-step diagram of how the entire SparkContext executes a task:
Memory Management

An RDD in Spark can be stored in two ways, in memory or on disk, with memory being the default. Distributed computation often reads large amounts of data, much of which is reused, and simply handing memory management over to the GC easily leads to reclamation pressure and full GCs that hurt performance.

Starting with Spark 1.5, memory is no longer managed purely through the GC. Spark 1.5 introduced a memory manager that manages memory manually (Project Tungsten), allocating and reclaiming memory directly via the Unsafe class.
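As a rough sketch of the idea only (this is not Spark's actual memory manager), off-heap memory can be allocated and freed explicitly via sun.misc.Unsafe, out of reach of the garbage collector:

import sun.misc.Unsafe

// Obtain the Unsafe instance reflectively (the usual trick outside the JDK itself)
val unsafe: Unsafe = {
  val field = classOf[Unsafe].getDeclaredField("theUnsafe")
  field.setAccessible(true)
  field.get(null).asInstanceOf[Unsafe]
}

val size = 1024L * 1024L                   // a 1 MiB raw off-heap buffer
val address = unsafe.allocateMemory(size)  // allocation the GC never sees
try {
  unsafe.putLong(address, 42L)             // read/write by absolute address
  println(unsafe.getLong(address))         // 42
} finally {
  unsafe.freeMemory(address)               // must be released manually
}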

On the GC side of distributed computing systems, an OSDI 2016 paper is also worth reading: Yak: A High-Performance Big-Data-Friendly Garbage Collector.
PageRank Example

Below we run PageRank in Spark and look at the resulting stage DAG. The PageRank formula is fairly simple:
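Matching the code below, with damping factor d, the per-page update rule is:

rank(p) = (1 - d) + d * Σ_{q links to p} rank(q) / outDegree(q)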

Here we choose a damping factor of 0.85 and an initial rank of 1.0. The PageRank algorithm can be optimized using the Markov matrix formulation, but since the number of iterations here is small we just iterate directly. The corresponding code:

val iters = 10
val data = sc.textFile("data.txt")
val links = data.map { s =>
  val parts = s.split("\\s+")
  (parts(0), parts(1))
}.distinct().groupByKey().cache()

var ranks = links.mapValues(v => 1.0)

for (i <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    val size = urls.size
    urls.map(url => (url, rank / size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

val output = ranks.collect()

The corresponding stage DAG:

The RDD dependencies within Stage 3 are as follows:
