Objective
I have been using Spark for a while, but my understanding of Spark's RDD is still shallow and stuck at the conceptual level: I only knew that it is a resilient distributed dataset and nothing more.
That is a little embarrassing. Below are my notes from taking a fresh look at the RDD.
Official introduction
Resilient Distributed Dataset: an RDD is a read-only, partitioned collection of records. An RDD can only be created through deterministic operations on data in stable physical storage or on other existing RDDs.
Problem
Search for "what is an RDD" and you will find page after page of the same answer, all repeating this kind of conceptual definition, which is not very helpful.
What I really want to know is: why is the RDD "resilient" rather than not, how does an RDD actually hold its data, and at which point in a job's execution is the data read?
What does "resilient" mean
My understanding is as follows (corrections welcome if anything is wrong or incomplete); a small sketch follows the list:
1. An RDD can be switched between memory and disk, either manually or automatically.
2. An RDD can be transformed into other RDDs, forming a lineage.
3. An RDD can store data of any type.
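To make the three points concrete, here is a minimal sketch (the application name and the HDFS path are placeholders of my own, not from the article):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ResilienceNotes {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-notes").setMaster("local[*]"))

    // 1. Memory/disk switching: ask for MEMORY_AND_DISK explicitly,
    //    or let Spark evict and later recompute partitions on its own.
    val lines = sc.textFile("hdfs:///some/input/path")
    lines.persist(StorageLevel.MEMORY_AND_DISK)

    // 2. Lineage: each transformation produces a new RDD that remembers its parent,
    //    so lost partitions can be rebuilt by replaying the chain.
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(w => (w, 1))

    // 3. Any element type: strings, tuples, case classes, and so on.
    println(pairs.toDebugString) // prints the lineage described in point 2
    sc.stop()
  }
}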
What does an RDD actually store
From the code we write for a Spark job, an RDD intuitively looks like a piece of read-only data, e.g. rdd.foreach(println).
But that is not the case: an RDD does not hold the real data at all; it only holds the way to obtain the data, the partitioning information, and the data type.
Seeing is believing, so let's look at the RDD source.
The rest of the code is omitted; only the two key abstract methods are kept.
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {

  // Computes the data of one partition, i.e. reads that partition's data into an Iterator
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // Computes the partition information; it is only called once
  protected def getPartitions: Array[Partition]
}
These two abstract methods of RDD show that:
an RDD does not actually store the real data; it only stores the partition information of that data (getPartitions) and the method for reading a single partition (compute).
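As an illustration, the smallest possible RDD only has to answer those two questions. This is a sketch of my own, not from the Spark source; the names NumbersRDD and NumbersPartition are made up:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Partition metadata is nothing more than an index here
class NumbersPartition(override val index: Int) extends Partition

// An RDD that "contains" the numbers 0 until upTo, split into `slices` partitions,
// yet stores no numbers at all: only how to partition and how to read one partition.
class NumbersRDD(sc: SparkContext, upTo: Int, slices: Int) extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](slices)(i => new NumbersPartition(i))

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val start = split.index * upTo / slices
    val end   = (split.index + 1) * upTo / slices
    (start until end).iterator // the data only comes into existence when the partition is read
  }
}

// Usage: new NumbersRDD(sc, 100, 4).collect()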
You may be a bit confused at this point: if an RDD only stores partition information and a read method, then where is the dependency information between RDDs stored?
It is stored too; what I pasted above is only the top-level RDD abstract class. Also note that the RDDs that truly implement these two methods themselves are the ones that form the input of the whole job, i.e. the RDDs at the top of the lineage, the first-generation RDDs.
For example: val rdd = sc.textFile(...); val rdd1 = rdd.map(f). Here rdd is a first-generation RDD; it depends on no other RDD, so it carries no dependency information. rdd1 is a descendant of rdd, so it must record which RDD it was derived from, and that record is the lineage.
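You can inspect the lineage directly from the driver; continuing the example above (the path and function are placeholders):

val rdd  = sc.textFile("hdfs:///input/data.txt") // first-generation: knows the partitions and how to read them
val rdd1 = rdd.map(line => line.length)          // descendant: only records "map over rdd"

println(rdd1.dependencies)  // e.g. a OneToOneDependency pointing back at rdd
println(rdd1.toDebugString) // prints the whole lineage chain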
The following shows HadoopRDD and MapPartitionsRDD.
HadoopRDD is the one responsible for recording the data's partition information and the method for reading it:
class HadoopRDD[K, V](
    @transient sc: SparkContext,
    broadcastedConf: Broadcast[SerializableConfiguration],
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int)
  extends RDD[(K, V)](sc, Nil) with Logging {

  override def getPartitions: Array[Partition] = { /* ... */ }

  override def compute(theSplit: Partition, context: TaskContext): InterruptibleIterator[(K, V)] = { /* ... see the source for details ... */ }
}
A descendant RDD, by contrast, simply records what was done to its parent RDD to produce itself:
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    prev: RDD[T],                                      // the parent (previous-generation) RDD
    f: (TaskContext, Int, Iterator[T]) => Iterator[U], // (TaskContext, partition index, iterator): how this RDD is derived from its parent
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev) {

  override val partitioner =
    if (preservesPartitioning) firstParent[T].partitioner else None

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))
}
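To mirror that pattern outside the Spark source, a descendant only needs a reference to its parent and the deriving function. This is a sketch; MyMappedRDD is a made-up name, and the real map implementation additionally cleans the closure and does scope bookkeeping:

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class MyMappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) { // this constructor registers a one-to-one dependency on prev

  // No partition information of its own: reuse the parent's partitions
  override protected def getPartitions: Array[Partition] = prev.partitions

  // Reading a partition means reading the parent's partition and applying f on the fly
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    prev.iterator(split, context).map(f)
}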
At this point we have a rough idea of what an RDD stores:
First-generation RDD: sits at the top of the lineage; it stores the partition information of the data the job needs and the method for reading a single partition. It has no dependency on another RDD, because it is where the lineage begins.
Descendant RDD: sits lower in the lineage; it stores what was done to its parent RDD to produce itself, together with a reference to that parent RDD.
Now that we have a basic idea of what an RDD stores, the next question is: when is the data actually read?
When is the data read
To get straight to the point: the data is read while the task runs, i.e. only after the task has been shipped to an executor. On to the source:
private[spark] class ResultTask[T, U](
    stageId: Int,
    stageAttemptId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient locs: Seq[TaskLocation],
    val outputId: Int,
    internalAccumulators: Seq[Accumulator[Long]])
  extends Task[U](stageId, stageAttemptId, partition.index, internalAccumulators)
  with Serializable {

  @transient private[this] val preferredLocs: Seq[TaskLocation] = {
    if (locs == null) Nil else locs.toSet.toSeq
  }

  override def runTask(context: TaskContext): U = {
    // Deserialize the RDD and the func using the broadcast variables.
    val deserializeStartTime = System.currentTimeMillis()
    val ser = SparkEnv.get.closureSerializer.newInstance()
    val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
      ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
    _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime

    metrics = Some(context.taskMetrics)
    func(context, rdd.iterator(partition, context)) // rdd.iterator is called here; let's look at that method next
  }

  // This is only callable on the driver side.
  override def preferredLocations: Seq[TaskLocation] = preferredLocs

  override def toString: String = "ResultTask(" + stageId + ", " + partitionId + ")"
}
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    // A storage level is set: try the cache first; if the partition is not cached yet,
    // compute it (e.g. read it from disk) and then cache it
    SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
  } else {
    // No caching: read directly from the source, or from a checkpoint if one exists
    computeOrReadCheckpoint(split, context)
  }
}
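The storageLevel branch above is what makes persist/cache pay off on the second action. A small sketch (the path is a placeholder):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///logs/app.log")
logs.persist(StorageLevel.MEMORY_AND_DISK) // storageLevel is no longer NONE

val errors = logs.filter(_.contains("ERROR"))
println(errors.count()) // 1st action: logs' partitions are computed from the source, then cached
println(errors.count()) // 2nd action: logs' partitions are served from the cache via getOrCompute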
A Spark job is eventually broken down into multiple TaskSets that are sent to executors, and the TaskSets are split at shuffle boundaries.
There are only two kinds of tasks in Spark, ResultTask and ShuffleMapTask, and both read RDD data in the same way, through rdd.iterator.
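Putting it all together (a sketch; the path and field positions are placeholders): building the lineage reads nothing, and only an action ships ResultTask/ShuffleMapTask instances to executors, where compute/iterator finally touch the data.

val raw    = sc.textFile("hdfs:///events/2016-05-01")       // no data is read here
val parsed = raw.map(_.split(","))                          // still nothing is read
val counts = parsed.map(a => (a(0), 1)).reduceByKey(_ + _)  // shuffle boundary: a ShuffleMapTask stage plus a ResultTask stage

counts.collect() // the action: tasks run on the executors and the data is finally read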