Reference articles
coalesce() method and repartition() method
Transformations
Contents: repartitionAndSortWithinPartitions, coalesce and repartition, pipe, cartesian, cogroup, join, sortByKey, aggregateByKey, reduceByKey, groupByKey, distinct, intersection, union, sample, map, mapPartitions, mapPartitionsWithIndex, flatMap, filter, and the core function combineByKeyWithClassTag (each with an explanation and the relevant source code).
When I first started writing Spark code I only knew a little about transformations; see the RDD operations article for details.
Today I am using some free time to go through these RDD transformation operations and deepen my understanding.
repartitionAndSortWithinPartitions
Explanation
As the name suggests, when the data is repartitioned, the data inside each partition is also sorted. The parameter is a Partitioner (I'll cover partitioners in the next section). The official documentation says this method is more efficient than calling repartition and then sorting, because the sorting is pushed down into the shuffle machinery.
ShuffledRDD source
OrderedRDDFunctions.scala
def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
  new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering)
}
The logic is straightforward: create a ShuffledRDD and set the key ordering.
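A minimal usage sketch, assuming a live SparkContext named sc and made-up key/value data:

import org.apache.spark.HashPartitioner

// Hypothetical pairs: keys get hash-partitioned and each partition is sorted by key.
val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"), (1, "d")), 2)
val repartitionedAndSorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))
repartitionedAndSorted.glom().collect()   // each partition's array comes back sorted by key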
coalesce and repartition
Explanation
Why put these two together? Because the source code shows that repartition actually calls coalesce, just with the shuffle parameter set to true.
That makes things easy: we only need to understand the coalesce method. Its job is to change the number of partitions, and the second parameter controls whether a shuffle is performed during the repartitioning.
CoalescedRDD source
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // Include a shuffle step so that our upstream tasks are still distributed.
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
      numPartitions,
      partitionCoalescer).values
  } else {
    new CoalescedRDD(this, numPartitions, partitionCoalescer)
  }
}
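A small sketch of how the two calls differ, assuming a SparkContext named sc (the numbers are arbitrary):

val rdd = sc.parallelize(1 to 100, 10)
val fewer = rdd.coalesce(2)       // narrows to 2 partitions, no shuffle
val more = rdd.repartition(20)    // equivalent to coalesce(20, shuffle = true)
(fewer.getNumPartitions, more.getNumPartitions)   // (2, 20)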
pipe
Explanation
Simply put, it executes a command, captures the command's output, and turns it into an RDD[String]. This feature can be used to call scripting languages such as PHP or Python from Scala, crossing language boundaries.
PipedRDD source
/**
 * Return an RDD created by piping elements to a forked external process.
 */
def pipe(command: String): RDD[String] = withScope {
  // Similar to Runtime.exec(), if we are given a single string, split it into words
  // using a standard StringTokenizer (i.e. by spaces)
  pipe(PipedRDD.tokenize(command))
}

/**
 * Return an RDD created by piping elements to a forked external process.
 */
def pipe(command: String, env: Map[String, String]): RDD[String] = withScope {
  // Similar to Runtime.exec(), if we are given a single string, split it into words
  // using a standard StringTokenizer (i.e. by spaces)
  pipe(PipedRDD.tokenize(command), env)
}

def pipe(
    command: Seq[String],
    env: Map[String, String] = Map(),
    printPipeContext: (String => Unit) => Unit = null,
    printRDDElement: (T, String => Unit) => Unit = null,
    separateWorkingDir: Boolean = false,
    bufferSize: Int = 8192,
    encoding: String = Codec.defaultCharsetCodec.name): RDD[String] = withScope {
  new PipedRDD(this, command, env,
    if (printPipeContext ne null) sc.clean(printPipeContext) else null,
    if (printRDDElement ne null) sc.clean(printRDDElement) else null,
    separateWorkingDir,
    bufferSize,
    encoding)
}
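A minimal pipe sketch, assuming a SparkContext named sc and that the tr command is available on every executor:

val lines = sc.parallelize(Seq("hello", "world"), 2)
// Each element is written to the forked process's stdin, one per line;
// each line of the process's stdout becomes an element of the result.
val upper = lines.pipe("tr a-z A-Z")
upper.collect()   // Array(HELLO, WORLD)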
cartesian
Explanation
Computes the Cartesian product with the data of another RDD. This scenario rarely comes up in practice, so I will just mention it in passing.
CartesianRDD source
def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
  new CartesianRDD(sc, this, other)
}
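For completeness, a tiny sketch (assuming a SparkContext named sc):

val nums = sc.parallelize(Seq(1, 2))
val letters = sc.parallelize(Seq("x", "y"))
nums.cartesian(letters).collect()   // Array((1,x), (1,y), (2,x), (2,y))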
cogroup
Explanation
For pair RDDs sharing the same key type, the values with the same key in each RDD are gathered into an Iterable, and each key maps to a tuple containing one such Iterable per input RDD; however many RDDs are cogrouped, that is how many Iterables the tuple holds.
For example, (a,1), (a,2), (a,3) cogrouped with another RDD produce a single entry for key a whose first Iterable is (1, 2, 3).
Source code
There are nine overloads of cogroup; only one of them is listed below:
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  val cg = new CoGroupedRDD[K](Seq(self, other1, other2), partitioner)
  cg.mapValues { case Array(vs, w1s, w2s) =>
    (vs.asInstanceOf[Iterable[V]],
      w1s.asInstanceOf[Iterable[W1]],
      w2s.asInstanceOf[Iterable[W2]])
  }
}
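A sketch of the grouping behaviour, assuming a SparkContext named sc and made-up data:

val left = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val right = sc.parallelize(Seq(("a", "x"), ("c", "y")))
left.cogroup(right).collect()
// every key from either side appears once, with one Iterable per input RDD, e.g.
// (a,(CompactBuffer(1, 2),CompactBuffer(x))), (b,(CompactBuffer(3),CompactBuffer())), ...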
join
Explanation
Similar to a table join statement in MySQL.
CoGroupedRDD source
Since we are likening it to MySQL joins, join naturally also comes in inner, left outer, right outer and other variants. The inner join method in the source code is shown below:
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}
As the source shows, join actually calls the cogroup method.
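A sketch of the inner join, assuming a SparkContext named sc; only keys present in both RDDs survive:

val left = sc.parallelize(Seq(("a", 1), ("b", 2)))
val right = sc.parallelize(Seq(("a", "x"), ("a", "y")))
left.join(right).collect()   // Array((a,(1,x)), (a,(1,y))) -- "b" is dropped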
sortByKey
Explanation
For an RDD in (K, V) format, sorts by K; the parameter controls ascending or descending order.
ShuffledRDD source
In OrderedRDDFunctions:
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope
{
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
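A small sketch (assuming a SparkContext named sc):

val scores = sc.parallelize(Seq(("b", 2), ("c", 3), ("a", 1)))
scores.sortByKey().collect()                   // ascending: Array((a,1), (b,2), (c,3))
scores.sortByKey(ascending = false).collect()  // descending: Array((c,3), (b,2), (a,1))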
aggregateByKey
Explanation
Performs an aggregation by key.
ShuffledRDD source
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)

  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

  // We will clean the combiner closure later in `combineByKey`
  val cleanedSeqOp = self.context.clean(seqOp)
  combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
    cleanedSeqOp, combOp, partitioner)
}
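A sketch that computes a per-key (sum, count) pair, assuming a SparkContext named sc: the zero value (0, 0) is the starting accumulator, seqOp folds a value in within a partition, and combOp merges accumulators across partitions.

val nums = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 5)), 2)
val sumCount = nums.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),    // seqOp
  (a, b) => (a._1 + b._1, a._2 + b._2))    // combOp
sumCount.collect()   // Array((a,(4,2)), (b,(5,1)))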
reduceByKey
Explanation
Aggregates by key: values that share a key are merged using the function supplied by the caller.
ShuffledRDD source
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
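The classic word-count sketch (assuming a SparkContext named sc):

val words = sc.parallelize(Seq("spark", "rdd", "spark"))
words.map(w => (w, 1)).reduceByKey(_ + _).collect()   // Array((spark,2), (rdd,1))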
groupByKey
Explanation
Operates on an RDD of (K, V): groups the data by key and repartitions it.
ShuffledRDD source
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
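A small sketch (assuming a SparkContext named sc); prefer reduceByKey or aggregateByKey when a reduction is possible, since groupByKey does no map-side combine:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
pairs.groupByKey().mapValues(_.toList).collect()   // Array((a,List(1, 2)), (b,List(3)))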
distinct
Explanation
Deduplication: removes duplicate elements.
Source (the resulting RDD has the same type as the parent)
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}
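A one-line sketch (assuming a SparkContext named sc):

sc.parallelize(Seq(1, 2, 2, 3, 3, 3)).distinct().collect()   // Array(1, 2, 3), order not guaranteed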
intersection
Explanation
Returns the intersection of the two RDDs, with duplicates removed.
Source (the resulting RDD has the same type as the parent)
def intersection(other: RDD[T]): RDD[T] = withScope {
  this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
      .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
      .keys
}

/**
 * Return the intersection of this RDD and another one. The output will not contain any
 * duplicate elements, even if the input RDDs did.
 *
 * @note This method performs a shuffle internally.
 *
 * @param partitioner Partitioner to use for the resulting RDD
 */
def intersection(
    other: RDD[T],
    partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  this.map(v => (v, null)).cogroup(other.map(v => (v, null)), partitioner)
      .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
      .keys
}

/**
 * Return the intersection of this RDD and another one. The output will not contain any
 * duplicate elements, even if the input RDDs did. Performs a hash partition across the cluster.
 *
 * @note This method performs a shuffle internally.
 *
 * @param numPartitions How many partitions to use in the resulting RDD
 */
def intersection(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
  intersection(other, new HashPartitioner(numPartitions))
}
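A small sketch (assuming a SparkContext named sc):

val a = sc.parallelize(Seq(1, 2, 2, 3))
val b = sc.parallelize(Seq(2, 3, 3, 4))
a.intersection(b).collect()   // Array(2, 3) -- duplicates removed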
union
Explanation
Merges two RDDs without removing duplicates.
UnionRDD / PartitionerAwareUnionRDD source
def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = withScope {
  val partitioners = rdds.flatMap(_.partitioner).toSet
  if (rdds.forall(_.partitioner.isDefined) && partitioners.size == 1) {
    new PartitionerAwareUnionRDD(this, rdds)
  } else {
    new UnionRDD(this, rdds)
  }
}
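A small sketch (assuming a SparkContext named sc); note that duplicates are kept:

val a = sc.parallelize(Seq(1, 2))
val b = sc.parallelize(Seq(2, 3))
a.union(b).collect()   // Array(1, 2, 2, 3)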
sample
Explanation
Takes a sample of the data.
Parent RDD source
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = {
  require(fraction >= 0,
    s"Fraction must be nonnegative, but got ${fraction}")

  withScope {
    require(fraction >= 0.0, "Negative fraction value: " + fraction)
    if (withReplacement) {
      new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
    } else {
      new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
    }
  }
}
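A small sketch (assuming a SparkContext named sc); fraction is an expected proportion, not an exact count:

val rdd = sc.parallelize(1 to 100)
rdd.sample(withReplacement = false, fraction = 0.1, seed = 42L).count()   // roughly 10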
map
Explanation
The simplest transformation: the passed-in function is applied to each element of the parent RDD one by one, so the child RDD has exactly as many elements as the parent RDD.
MapPartitionsRDD source
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
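A one-line sketch (assuming a SparkContext named sc):

sc.parallelize(Seq(1, 2, 3)).map(_ * 10).collect()   // Array(10, 20, 30)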
mapPartitions
Explanation
Performs the map operation per partition: the function receives an iterator over a whole partition.
MapPartitionsRDD source
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U] = withScope {
  val cleanedF = sc.clean(f)
  new MapPartitionsRDD(
    this,
    (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
    preservesPartitioning)
}
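A sketch (assuming a SparkContext named sc); the function runs once per partition, which is handy for amortizing per-partition setup such as opening a connection:

sc.parallelize(1 to 4, 2).mapPartitions(iter => Iterator(iter.sum)).collect()   // Array(3, 7)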
mapPartitionsWithIndex
Explanation
Like mapPartitions, but the partition index is also available for use.
MapPartitionsRDD source
def mapPartitionsWithIndex[U: ClassTag](
    f: (Int, Iterator[T]) => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U] = withScope {
  val cleanedF = sc.clean(f)
  new MapPartitionsRDD(
    this,
    (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
    preservesPartitioning)
}
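A sketch (assuming a SparkContext named sc) that tags each element with its partition index:

sc.parallelize(1 to 4, 2)
  .mapPartitionsWithIndex((idx, iter) => iter.map(v => s"partition $idx -> $v"))
  .collect()   // Array(partition 0 -> 1, partition 0 -> 2, partition 1 -> 3, partition 1 -> 4)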
flatMap
Explanation
Each element is transformed into multiple elements by the passed-in function, and the results are then flattened.
MapPartitionsRDD source
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
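A one-line sketch (assuming a SparkContext named sc):

sc.parallelize(Seq("a b", "c")).flatMap(_.split(" ")).collect()   // Array(a, b, c)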
filter
Explanation
Filters the parent RDD with the given condition; only elements that satisfy it are passed on to the child RDD.
MapPartitionsRDD source
def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}
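A one-line sketch (assuming a SparkContext named sc):

sc.parallelize(1 to 10).filter(_ % 2 == 0).collect()   // Array(2, 4, 6, 8, 10)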
Core function combineByKeyWithClassTag
In the explanations above, the source code for groupByKey, aggregateByKey, reduceByKey and the other operations on (K, V)-form RDDs all uses the combineByKeyWithClassTag method, so it is worth understanding this method.
Reference article: combineByKey
Source code
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
The core is three functions. createCombiner: initializes from the first value. mergeValue: folds the remaining values into the result of that initialization, iterating over them. mergeCombiners: merges the partial results when the data for a key lives in different partitions.
This function converts an RDD[(K, V)] into an RDD[(K, C)]. V is the value type of the parent RDD and K is its key type; the operation is keyed on K and converts the V values into a C, where C can be any type (including the value type V itself).
All of this is grouped by key; different keys do not affect one another. The explanation below therefore describes how each key's group of values is processed.
The first function, createCombiner, abstractly defines the shape of C. Its form is V => C: the input is a V and the return is a C. It is an initialization function: within a partition, the V of the first record for a key is passed to it and turned into a C. The second function, mergeValue, has the abstract form (C, V) => C; it takes the C produced by the initialization and merges the remaining values of that key into it one by one, finally yielding a single C. The third function, mergeCombiners, is used only when the data for a key is scattered across different partitions: it merges the per-partition C results into one.
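A sketch that computes a per-key average via the public combineByKey (which delegates to combineByKeyWithClassTag); the data is made up and a SparkContext named sc is assumed:

val scores = sc.parallelize(Seq(("a", 90.0), ("a", 70.0), ("b", 80.0)), 2)
val sumCount = scores.combineByKey(
  (v: Double) => (v, 1),                                           // createCombiner: first value of a key
  (c: (Double, Int), v: Double) => (c._1 + v, c._2 + 1),           // mergeValue: fold in later values
  (c1: (Double, Int), c2: (Double, Int)) => (c1._1 + c2._1, c1._2 + c2._2))  // mergeCombiners: merge partitions
sumCount.mapValues { case (sum, n) => sum / n }.collect()   // Array((a,80.0), (b,80.0))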