Transformations that process key-value data can be broadly divided into three categories: operations whose input and output partitions map one-to-one, aggregations, and join operations.
Input partition and output partition one-to-one: mapValues
mapValues: applies a map function to the Value of each (Key, Value) element without processing the Key.
In the figure, each box represents an RDD partition. The function a => a + 2 adds 2 only to the value 1 of the element (V1, 1), so the result is 3.
Source:
/**
 * Pass each value in the key-value pair RDD through a map function without changing the keys;
 * this also retains the original RDD's partitioning.
 */
def mapValues[U](f: V => U): RDD[(K, U)] = {
  val cleanF = self.context.clean(f)
  new MapPartitionsRDD[(K, U), (K, V)](self,
    (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
    preservesPartitioning = true)
}
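For illustration, a minimal spark-shell sketch of mapValues; the sample pair RDD is invented for this example:

// Hypothetical sample data; only the values are transformed, the keys stay untouched.
val pairs = sc.parallelize(Seq(("V1", 1), ("V2", 2)))
val result = pairs.mapValues(a => a + 2)
result.collect() // Array((V1,3), (V2,4)) -- e.g. (V1, 1) becomes (V1, 3)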
Aggregation over a single RDD or two RDDs (1) combineByKey
combineByKey aggregates a single RDD. It is equivalent, for example, to converting an RDD whose elements are of type (Int, Int) into one whose elements are of type (Int, Seq[Int]).
The parameters defining the combineByKey operator are described as follows:
- createCombiner: V => C, used when no C exists yet for a key, e.g. creating a one-element Seq C from V.
- mergeValue: (C, V) => C, used when a C already exists and a new V must be merged into it, e.g. appending item V to Seq C, or accumulating it into a running total.
- mergeCombiners: (C, C) => C, merges two Cs into one.
- partitioner: Partitioner, the shuffle is partitioned according to this Partitioner's partitioning policy.
- mapSideCombine: Boolean = true, to reduce the volume of data transferred, much of the combining can be performed on the map side first. For an additive aggregation, for example, the values of all identical keys within a partition can be summed before the shuffle.
- serializer: Serializer = null, data must be serialized for transport, and the user can supply a custom serializer.
In the figure, each box represents an RDD partition. Through combineByKey, the elements (V1, 2) and (V1, 1) are merged into (V1, Seq(2, 1)).
Source:
/**
 * Generic function to combine the elements for each key using a custom set of aggregation
 * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C.
 * Note that V and C can be different -- for example, one might group an RDD of type
 * (Int, Int) into an RDD of type (Int, Seq[Int]). Users provide three functions:
 *
 * - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
 * - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
 * - `mergeCombiners`, to combine two C's into a single one.
 *
 * In addition, users can control the partitioning of the output RDD, and whether to perform
 * map-side aggregation (if a mapper can produce multiple items with the same key).
 */
def combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null): RDD[(K, C)] = {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("Default partitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}

/**
 * Simplified version of combineByKey that hash-partitions the output RDD.
 */
def combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    numPartitions: Int): RDD[(K, C)] = {
  combineByKey(createCombiner, mergeValue, mergeCombiners, new HashPartitioner(numPartitions))
}
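For illustration, a minimal sketch of how the three functions fit together, turning an invented RDD[(String, Int)] into an RDD[(String, Seq[Int])] as in the figure:

// Hypothetical sample data mirroring the figure's (V1, 2), (V1, 1) example.
val pairs = sc.parallelize(Seq(("V1", 2), ("V1", 1), ("V2", 5)))
val grouped = pairs.combineByKey(
  (v: Int) => Seq(v),                        // createCombiner: build a Seq from the first value
  (c: Seq[Int], v: Int) => c :+ v,           // mergeValue: append a further value to the Seq
  (c1: Seq[Int], c2: Seq[Int]) => c1 ++ c2)  // mergeCombiners: concatenate partial Seqs
grouped.collect() // e.g. Array((V1,List(2, 1)), (V2,List(5)))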
(2) reduceByKey
reduceByKey covers the simpler case in which two values are always combined into one value of the same type, so createCombiner is trivial (it simply returns V), and mergeValue and mergeCombiners share the same logic.
In the figure, each box represents an RDD partition. With the user-defined function (a, b) => a + b, the values of the elements sharing the same key, (V1, 2) and (V1, 1), are added, and the result is (V1, 3).
Source:
/**
 * Merge the values for each key using an associative reduce function. This will also perform
 * the merging locally on each mapper before sending results to a reducer, similarly to a
 * "combiner" in MapReduce.
 */
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = {
  combineByKey[V]((v: V) => v, func, func, partitioner)
}

/**
 * Merge the values for each key using an associative reduce function. This will also perform
 * the merging locally on each mapper before sending results to a reducer, similarly to a
 * "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
 */
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = {
  reduceByKey(new HashPartitioner(numPartitions), func)
}

/**
 * Merge the values for each key using an associative reduce function. This will also perform
 * the merging locally on each mapper before sending results to a reducer, similarly to a
 * "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
 * parallelism level.
 */
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = {
  reduceByKey(defaultPartitioner(self), func)
}
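For illustration, a minimal sketch matching the figure's example, where the values of (V1, 2) and (V1, 1) are added (sample data invented):

val pairs = sc.parallelize(Seq(("V1", 2), ("V1", 1), ("V2", 5)))
pairs.reduceByKey((a, b) => a + b).collect() // Array((V1,3), (V2,5))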
(3) partitionBy
The partitionBy function repartitions an RDD.
If the RDD's existing partitioner is consistent with the given Partitioner, no repartitioning takes place; otherwise, it is equivalent to generating a new ShuffledRDD according to the given partitioner.
In the figure, each box represents an RDD partition. Under the new partitioning policy, the V1 and V2 data from different partitions are merged into one partition.
Source:
/**
 * Return a copy of the RDD partitioned using the specified partitioner.
 */
def partitionBy(partitioner: Partitioner): RDD[(K, V)] = {
  if (keyClass.isArray && partitioner.isInstanceOf[HashPartitioner]) {
    throw new SparkException("Default partitioner cannot partition array keys.")
  }
  if (self.partitioner == Some(partitioner)) {
    self
  } else {
    new ShuffledRDD[K, V, V](self, partitioner)
  }
}
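For illustration, a minimal sketch of partitionBy; the data and partition counts here are arbitrary:

import org.apache.spark.HashPartitioner
// Hypothetical data spread over 4 partitions, then re-hashed into 2.
val pairs = sc.parallelize(Seq(("V1", 1), ("V2", 2), ("V1", 3)), 4)
val repartitioned = pairs.partitionBy(new HashPartitioner(2))
repartitioned.partitioner // Some(HashPartitioner); all "V1" records now share one partition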
(4) cogroup
The cogroup function co-partitions two RDDs. For the key-value elements of the two RDDs, the elements sharing a given key in each RDD are aggregated into a collection, and the result pairs each key with iterators over its element collections from both RDDs: (K, (Iterable[V], Iterable[W])). In each result element, the Key is the shared key and the Value is a tuple of the two data collections gathered for that key from the two RDDs.
In the figure, each large box represents an RDD and the small boxes inside it represent the RDD's partitions. The elements (U1, 1) and (U1, 2) from RDD1 and (U1, 2) from RDD2 are merged into (U1, ((1, 2), (2))).
Source:
/**
 * For each key k in `this` or `other1` or `other2` or `other3`,
 * return a resulting RDD that contains a tuple with the list of values
 * for that key in `this`, `other1`, `other2` and `other3`.
 */
def cogroup[W1, W2, W3](other1: RDD[(K, W1)],
    other2: RDD[(K, W2)],
    other3: RDD[(K, W3)],
    partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("Default partitioner cannot partition array keys.")
  }
  val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), partitioner)
  cg.mapValues { case Array(vs, w1s, w2s, w3s) =>
    (vs.asInstanceOf[Iterable[V]],
      w1s.asInstanceOf[Iterable[W1]],
      w2s.asInstanceOf[Iterable[W2]],
      w3s.asInstanceOf[Iterable[W3]])
  }
}

/**
 * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
 * list of values for that key in `this` as well as `other`.
 */
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("Default partitioner cannot partition array keys.")
  }
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}

/**
 * For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a
 * tuple with the list of values for that key in `this`, `other1` and `other2`.
 */
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("Default partitioner cannot partition array keys.")
  }
  val cg = new CoGroupedRDD[K](Seq(self, other1, other2), partitioner)
  cg.mapValues { case Array(vs, w1s, w2s) =>
    (vs.asInstanceOf[Iterable[V]],
      w1s.asInstanceOf[Iterable[W1]],
      w2s.asInstanceOf[Iterable[W2]])
  }
}

/**
 * For each key k in `this` or `other1` or `other2` or `other3`,
 * return a resulting RDD that contains a tuple with the list of values
 * for that key in `this`, `other1`, `other2` and `other3`.
 */
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)])
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = {
  cogroup(other1, other2, other3, defaultPartitioner(self, other1, other2, other3))
}

/**
 * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
 * list of values for that key in `this` as well as `other`.
 */
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))] = {
  cogroup(other, defaultPartitioner(self, other))
}

/**
 * For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a
 * tuple with the list of values for that key in `this`, `other1` and `other2`.
 */
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)])
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = {
  cogroup(other1, other2, defaultPartitioner(self, other1, other2))
}

/**
 * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
 * list of values for that key in `this` as well as `other`.
 */
def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))] = {
  cogroup(other, new HashPartitioner(numPartitions))
}

/**
 * For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a
 * tuple with the list of values for that key in `this`, `other1` and `other2`.
 */
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int)
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = {
  cogroup(other1, other2, new HashPartitioner(numPartitions))
}

/**
 * For each key k in `this` or `other1` or `other2` or `other3`,
 * return a resulting RDD that contains a tuple with the list of values
 * for that key in `this`, `other1`, `other2` and `other3`.
 */
def cogroup[W1, W2, W3](other1: RDD[(K, W1)],
    other2: RDD[(K, W2)],
    other3: RDD[(K, W3)],
    numPartitions: Int)
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = {
  cogroup(other1, other2, other3, new HashPartitioner(numPartitions))
}
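For illustration, a minimal sketch reproducing the figure's example, where (U1, 1) and (U1, 2) from RDD1 are co-grouped with (U1, 2) from RDD2 (sample data invented; the exact printed collection type may vary by Spark version):

val rdd1 = sc.parallelize(Seq(("U1", 1), ("U1", 2), ("U2", 3)))
val rdd2 = sc.parallelize(Seq(("U1", 2)))
rdd1.cogroup(rdd2).collect()
// e.g. Array((U1,(CompactBuffer(1, 2),CompactBuffer(2))), (U2,(CompactBuffer(3),CompactBuffer())))
// Every key present in either RDD appears once, paired with both value collections.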
Join operations (1) join
join applies the cogroup function to the two RDDs being connected. On the new RDD formed by the cogroup operation, a Cartesian product is taken over the elements under each key, the returned results are flattened, all the tuples under a given key form one collection, and the final result is an RDD[(K, (V, W))].
In essence, join first co-partitions the data with the cogroup operator and then scatters the merged data with flatMapValues.
The figure shows a join operation on two RDDs. Each large box represents an RDD and each small box a partition. The function connects elements that share the same key (for example, V1), so the joined data consists of (V1, (V, W)) tuples pairing that key's values from the two RDDs.
Source:
/**
 * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
 * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
 * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
 */
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}
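For illustration, a minimal sketch of join showing the per-key Cartesian product and the dropping of unmatched keys (sample data invented):

val left  = sc.parallelize(Seq(("V1", 1), ("V1", 2), ("V2", 3)))
val right = sc.parallelize(Seq(("V1", "a"), ("V3", "b")))
left.join(right).collect()
// Array((V1,(1,a)), (V1,(2,a))) -- "V2" and "V3" have no match and are dropped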
(2) leftOuterJoin and rightOuterJoin
leftOuterJoin (left outer join) and rightOuterJoin (right outer join) build on join by additionally checking whether the elements on one side are empty for a given key: if they are empty, the missing side is filled with None; if not, the data are joined and the result is returned.
Source:
/**
 * Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the
 * resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the
 * pair (k, (v, None)) if no elements in `other` have key k. Uses the given Partitioner to
 * partition the output RDD.
 */
def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (V, Option[W]))] = {
  this.cogroup(other, partitioner).flatMapValues { pair =>
    if (pair._2.isEmpty) {
      pair._1.iterator.map(v => (v, None))
    } else {
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
    }
  }
}

/**
 * Perform a right outer join of `this` and `other`. For each element (k, w) in `other`, the
 * resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`, or the
 * pair (k, (None, w)) if no elements in `this` have key k. Uses the given Partitioner to
 * partition the output RDD.
 */
def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Option[V], W))] = {
  this.cogroup(other, partitioner).flatMapValues { pair =>
    if (pair._1.isEmpty) {
      pair._2.iterator.map(w => (None, w))
    } else {
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (Some(v), w)
    }
  }
}
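For illustration, a minimal sketch contrasting the two outer joins on invented data:

val left  = sc.parallelize(Seq(("V1", 1), ("V2", 2)))
val right = sc.parallelize(Seq(("V1", "a")))
left.leftOuterJoin(right).collect()  // Array((V1,(1,Some(a))), (V2,(2,None)))
left.rightOuterJoin(right).collect() // Array((V1,(Some(1),a))) -- unmatched "V2" is dropped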
Please credit the author, Jason Ding, and the source when reprinting.
GitCafe blog homepage (http://jasonding1354.gitcafe.io/)
GitHub blog homepage (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search "jasonding1354" on Google to reach my blog homepage
Copyright notice: this is an original article by the author and may not be reproduced without permission.
"Spark" Rdd operation detailed 3--key-value type transformation operator