What exactly is a Spark RDD?


Objective

I have been working with Spark for a while, but I still feel I am only scratching the surface. My understanding of Spark's RDD is still at the conceptual level: I know it is an "elastic (resilient) distributed dataset" and not much more.

That is a little embarrassing. Below are my notes on taking a fresh look at the RDD.

Official introduction

Resilient distributed dataset: an RDD is a read-only, partitioned collection of records. An RDD can only be created through deterministic operations on data in stable physical storage or on other existing RDDs.

Problem

Search for "what is an RDD" and you will find page after page of the same answer, all repeating this kind of conceptual definition, which is not very useful.

What I really want to know is why the RDD is called elastic rather than inelastic, how the RDD actually holds its data, and at which stage of a task that data is read.

What makes an RDD elastic

My understanding is as follows (corrections welcome if it is wrong or incomplete); a short sketch illustrating the three points follows the list:

1. An RDD can be swapped between memory and disk, either manually or automatically

2. An RDD can be transformed into other RDDs; this chain of transformations is its lineage

3. An RDD can hold data of any type
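A minimal sketch of the three points above (it assumes an already-created SparkContext named sc, as in the snippets later in this article; everything else is standard Spark API):

import org.apache.spark.storage.StorageLevel

val nums  = sc.parallelize(1 to 1000000)          // point 3: the element type is arbitrary (Int here,
val words = sc.parallelize(Seq("a", "bb", "ccc")) //          but String, case classes, etc. work too)

val squared = nums.map(n => n * n)                // point 2: a new RDD derived from nums; the lineage
                                                  //          nums -> squared is recorded, so squared can be
                                                  //          recomputed from nums if a partition is lost

squared.persist(StorageLevel.MEMORY_AND_DISK)     // point 1: partitions that do not fit in memory are
                                                  //          spilled to disk automatically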

What does an RDD store

From the code we write for Spark jobs, the intuitive impression is that an RDD is a read-only collection of data, e.g. rdd.foreach(println).

But that is not the case: an RDD does not actually store the real data. It only stores how to acquire the data, how the data is partitioned, and the type of the data.

Seeing is believing, so let's look at the RDD source code:

Everything else has been removed; only the two key abstract methods are kept.
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) {

  // Reads the data of one partition into an Iterator
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // Returns the partition information; it will only be called once
  protected def getPartitions: Array[Partition]
}

From these two abstract methods of RDD we can see:

An RDD does not actually store the real data; it only stores the partition information of the real data (getPartitions) and the method for reading a single partition (compute).

This may be a little confusing: if an RDD only stores partition information and a read method, where is the RDD's dependency information stored?

In fact the RDD does store it; I have only pasted the top-level RDD abstract class here. Also note that an RDD that is only depended on (and depends on no other RDD), and that genuinely implements these two methods itself, is the input of the whole job, i.e. the first-generation RDD at the top of the lineage.

For example: val rdd = sc.textFile(...); val rdd1 = rdd.map(f). Here rdd is a first-generation RDD: it depends on no other RDD, so it records no dependency information. rdd1 is a descendant of rdd, so it must record where it came from, i.e. its lineage.
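A quick way to see this lineage for yourself (a sketch: it assumes an existing SparkContext sc, and the path is just a placeholder):

val rdd  = sc.textFile("/path/to/input")        // placeholder path
val rdd1 = rdd.map(line => line.length)

println(rdd1.toDebugString)                     // prints the whole chain, down to the HadoopRDD at the top
println(rdd1.dependencies.head.rdd eq rdd)      // true: rdd1 keeps a reference to its parent rdd

toDebugString and dependencies are both part of the public RDD API, so this is an easy way to confirm that rdd1 only records how it was derived from rdd.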

The following shows HadoopRDD and MapPartitionsRDD.

HadoopRDD is responsible for recording the partition information of the data and the method for reading it:

class HadoopRDD[K, V](
    @transient sc: SparkContext,
    broadcastedConf: Broadcast[SerializableConfiguration],
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int)
  extends RDD[(K, V)](sc, Nil) with Logging {

  override def getPartitions: Array[Partition] = { /* ... */ }

  override def compute(theSplit: Partition, context: TaskContext): InterruptibleIterator[(K, V)] = { /* ... */ }
}
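As a side note, this HadoopRDD is what sc.textFile is built on. Roughly (a simplified sketch of the Spark 1.x implementation, omitting minPartitions and naming), textFile creates a HadoopRDD over the input splits and then maps each (key, value) record to the line text, which already puts a child RDD (a MapPartitionsRDD) on top of it:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Roughly what sc.textFile(path) does internally (simplified sketch)
def textFileSketch(sc: SparkContext, path: String): RDD[String] =
  sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    .map(pair => pair._2.toString)   // keep only the line text, dropping the byte-offset key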

A child RDD, on the other hand, simply records what operation was applied to the previous-generation RDD to produce itself:

private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    prev: RDD[T],                                        // the previous-generation (parent) RDD
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],   // (TaskContext, partition index, iterator):
                                                         // how this RDD is produced from its parent
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev) {

  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))
}
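A quick check of the same thing from the user side (a sketch; rdd here is any existing RDD, e.g. the one from the textFile example above): map copies no data at all, it just wraps the parent in a MapPartitionsRDD that remembers which function to apply.

val mapped = rdd.map(_.toString)

println(mapped.getClass.getSimpleName)                       // MapPartitionsRDD
println(mapped.partitions.length == rdd.partitions.length)   // true: getPartitions simply delegates to the parent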

At this point we have a rough idea of what an RDD stores:

First-generation RDD: sits at the top of the lineage. It stores the partition information of the data the job needs and the method for reading a single partition. It has no RDD dependencies, because it is where the lineage begins.

Child RDD: sits lower in the lineage. It stores what was done to the previous-generation RDD to produce itself, and it holds a reference to its parent RDD.
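To make the "first-generation RDD" idea concrete, here is a tiny hand-written RDD (purely hypothetical, for illustration): it has no parent (its dependency list is Nil), so it must answer the two questions itself, i.e. what its partitions are (getPartitions) and how to read one of them (compute). It generates numbers instead of reading storage, but a real input RDD such as HadoopRDD has exactly the same shape.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A hypothetical first-generation RDD: no parent, so deps = Nil
class NumbersRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {

  private case class NumbersPartition(index: Int) extends Partition

  // Partition information: numSlices partitions, identified only by their index
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map(i => NumbersPartition(i): Partition).toArray

  // How to read one partition: generate the numbers that belong to this slice
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    (split.index until n by numSlices).iterator
}

Calling new NumbersRDD(sc, 100, 4).collect() would ship compute to the executors, which is exactly the process described in the next section.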

Now that we have a basic idea of what an RDD stores, the next question is: when is the data actually read?

When is the data read

To get straight to the point: the data is read while a task is running, i.e. after the task has been distributed to an executor. On to the source:

private[spark] class ResultTask[T, U](
    stageId: Int,
    stageAttemptId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient locs: Seq[TaskLocation],
    val outputId: Int,
    internalAccumulators: Seq[Accumulator[Long]])
  extends Task[U](stageId, stageAttemptId, partition.index, internalAccumulators)
  with Serializable {

  @transient private[this] val preferredLocs: Seq[TaskLocation] = {
    if (locs == null) Nil else locs.toSet.toSeq
  }

  override def runTask(context: TaskContext): U = {
    // Deserialize the RDD and the func using the broadcast variables.
    val deserializeStartTime = System.currentTimeMillis()
    val ser = SparkEnv.get.closureSerializer.newInstance()
    val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
      ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
    _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime

    metrics = Some(context.taskMetrics)
    func(context, rdd.iterator(partition, context))  // here rdd.iterator is called -- this RDD method is shown below
  }

  // This is only callable on the driver side.
  override def preferredLocations: Seq[TaskLocation] = preferredLocs

  override def toString: String = "ResultTask(" + stageId + ", " + partitionId + ")"
}


final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    // Cached: take the partition from the cache if it is there; otherwise
    // compute it (read it from disk) and then cache it
    SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
  } else {
    // Not cached: read directly from disk, or from a checkpoint if one exists
    computeOrReadCheckpoint(split, context)
  }
}
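From the user side, the storageLevel branch above is exactly what cache()/persist() controls (a sketch, again assuming an existing SparkContext sc and a placeholder path):

val logs = sc.textFile("/path/to/logs")   // placeholder path
logs.cache()        // sets storageLevel to MEMORY_ONLY, so iterator() goes through the CacheManager
logs.count()        // 1st action: getOrCompute() finds nothing cached, calls compute() to read the
                    //             data, then caches the partitions it produced
logs.count()        // 2nd action: getOrCompute() finds the cached partitions, nothing is re-read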

A Spark job is ultimately broken down into multiple TaskSets that are sent to executors, and the TaskSet boundaries are determined by where shuffles are needed.

There are only two kinds of tasks in Spark, ResultTask and ShuffleMapTask, and both read RDD data in the same way.
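Putting it all together, a sketch of the timeline (assuming an existing SparkContext sc; the path is a placeholder):

val lines   = sc.textFile("/path/to/input")  // driver side: only lineage and partition info, nothing is read
val lengths = lines.map(_.length)            // driver side: one more RDD added to the lineage, still nothing read
val total   = lengths.reduce(_ + _)          // action: stages/TaskSets are built, tasks are shipped to executors,
                                             // and runTask -> rdd.iterator -> compute finally reads the data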
