The implementation of Spark operators explained in detail (Part 8)

36.zip

Pairs the elements at the same position in two RDDs into key-value pairs.

/**
 * Zips this RDD with another one, returning key-value pairs with the first element in each RDD,
 * second element in each RDD, etc. Assumes that the two RDDs have the *same number of
 * partitions* and the *same number of elements in each partition* (e.g. one was made through
 * a map on the other).
 */
def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
  zipPartitions(other, preservesPartitioning = false) { (thisIter, otherIter) =>
    new Iterator[(T, U)] {
      // Output a pair only when both iterators still have values; if neither has a value,
      // stop; if only one of them has a value left, the lengths differ and an exception is thrown
      def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
        case (true, true) => true
        case (false, false) => false
        case _ => throw new SparkException("Can only zip RDDs with " +
          "same number of elements in each partition")
      }
      def next(): (T, U) = (thisIter.next(), otherIter.next())
    }
  }
}
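For reference, a minimal usage sketch (the values are illustrative; both RDDs must have the same number of partitions and the same number of elements per partition):

val a = sc.makeRDD(1 to 4, 2)
val b = sc.makeRDD(Seq("a", "b", "c", "d"), 2)
a.zip(b).collect()   // Array((1,a), (2,b), (3,c), (4,d))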

Next, look at the concrete implementation of zipPartitions:

def zipPartitions[B: ClassTag, V: ClassTag]
    (rdd2: RDD[B], preservesPartitioning: Boolean)
    (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V] = withScope {
  new ZippedPartitionsRDD2(sc, sc.clean(f), this, rdd2, preservesPartitioning)
}

private[spark] class ZippedPartitionsRDD2[A: ClassTag, B: ClassTag, V: ClassTag](
    sc: SparkContext,
    var f: (Iterator[A], Iterator[B]) => Iterator[V],
    var rdd1: RDD[A],
    var rdd2: RDD[B],
    preservesPartitioning: Boolean = false)
  extends ZippedPartitionsBaseRDD[V](sc, List(rdd1, rdd2), preservesPartitioning) {

  // compute calls f with the two parent partitions' iterators; for zip, f builds the
  // iterator that overrides hasNext and next to return the zipped data
  override def compute(s: Partition, context: TaskContext): Iterator[V] = {
    val partitions = s.asInstanceOf[ZippedPartitionsPartition].partitions
    f(rdd1.iterator(partitions(0), context), rdd2.iterator(partitions(1), context))
  }

  override def clearDependencies() {
    super.clearDependencies()
    rdd1 = null
    rdd2 = null
    f = null
  }
}

So how are the partitioner and the data locality computed? We need to look at the base class ZippedPartitionsBaseRDD.

private[spark] abstract class ZippedPartitionsBaseRDD[V: ClassTag](
    sc: SparkContext,
    var rdds: Seq[RDD[_]],
    preservesPartitioning: Boolean = false)
  extends RDD[V](sc, rdds.map(x => new OneToOneDependency(x))) {

  // Because zip's default preservesPartitioning is false, the partitioner of the zipped RDD
  // is None
  override val partitioner =
    if (preservesPartitioning) firstParent[Any].partitioner else None

  override def getPartitions: Array[Partition] = {
    val numParts = rdds.head.partitions.length
    // The number of partitions of the RDDs being zipped must be the same
    if (!rdds.forall(rdd => rdd.partitions.length == numParts)) {
      throw new IllegalArgumentException("Can't zip RDDs with unequal numbers of partitions")
    }
    Array.tabulate[Partition](numParts) { i =>
      // Get the preferred locations of partition i in every RDD
      val prefs = rdds.map(rdd => rdd.preferredLocations(rdd.partitions(i)))
      // Check whether there exist some hosts that match all RDDs; otherwise return the union,
      // i.e. first take the intersection of the preferred locations
      val exactMatchLocations = prefs.reduce((x, y) => x.intersect(y))
      // If the intersection is non-empty, it becomes the zipped partition's locations;
      // if it is empty, take the distinct union of all RDDs' preferred locations
      val locs = if (!exactMatchLocations.isEmpty) exactMatchLocations else prefs.flatten.distinct
      new ZippedPartitionsPartition(i, rdds, locs)
    }
  }

  override def getPreferredLocations(s: Partition): Seq[String] = {
    s.asInstanceOf[ZippedPartitionsPartition].preferredLocations
  }

  override def clearDependencies() {
    super.clearDependencies()
    rdds = null
  }
}
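To make the location-selection logic above concrete, here is a small standalone sketch in plain Scala (the host names are made up; this is not Spark code):

val prefs = Seq(Seq("host1", "host2"), Seq("host2", "host3"))
// Intersection of all RDDs' preferred hosts for one partition index
val exactMatchLocations = prefs.reduce((x, y) => x.intersect(y))   // Seq(host2)
// A non-empty intersection wins; otherwise fall back to the distinct union
val locs = if (exactMatchLocations.nonEmpty) exactMatchLocations else prefs.flatten.distinct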
Its specific implementation process is as follows:

37.zipPartitions

zipPartitions has several variants, listed below:

def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag]
    (rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)
    (f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V] = withScope {
  new ZippedPartitionsRDD3(sc, sc.clean(f), this, rdd2, rdd3, preservesPartitioning)
}

def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag]
    (rdd2: RDD[B], rdd3: RDD[C])
    (f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V] = withScope {
  zipPartitions(rdd2, rdd3, preservesPartitioning = false)(f)
}

def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag]
    (rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D], preservesPartitioning: Boolean)
    (f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V] = withScope {
  new ZippedPartitionsRDD4(sc, sc.clean(f), this, rdd2, rdd3, rdd4, preservesPartitioning)
}

def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag]
    (rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])
    (f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V] = withScope {
  zipPartitions(rdd2, rdd3, rdd4, preservesPartitioning = false)(f)
}
It supports zipping multiple RDDs and lets you supply a custom iterator function; the internal implementation principle is the same as zip.
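For reference, a minimal usage sketch (illustrative values) that merges two RDDs partition by partition with a custom iterator function:

val nums  = sc.makeRDD(1 to 4, 2)
val chars = sc.makeRDD(Seq("a", "b", "c", "d"), 2)
val zipped = nums.zipPartitions(chars) { (it1, it2) =>
  // Combine the two partition iterators element by element
  it1.zip(it2).map { case (n, c) => s"$c$n" }
}
zipped.collect()   // Array(a1, b2, c3, d4)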

38.zipWithIndex

This function pairs each element of the RDD with its index (ID) in the RDD, producing key/value pairs.

def zipWithIndex(): RDD[(T, Long)] = withScope {
  new ZippedWithIndexRDD(this)
}

The details of ZippedWithIndexRDD are as follows:

private[spark] class ZippedWithIndexRDD[T: ClassTag](@transient prev: RDD[T])
  extends RDD[(T, Long)](prev) {

  /** The start index of each partition. */
  @transient private val startIndices: Array[Long] = {
    val n = prev.partitions.length
    if (n == 0) {
      Array[Long]()
    } else if (n == 1) {
      Array(0L)
    } else {
      // First compute the start offset of each partition. For example, with 3 partitions
      // containing 3, 4 and 3 elements respectively, startIndices is [0, 3, 7]
      prev.context.runJob(
        prev,
        Utils.getIteratorSize _,
        0 until n - 1, // do not need to count the last partition
        allowLocal = false
      ).scanLeft(0L)(_ + _)
    }
  }

  override def getPartitions: Array[Partition] = {
    // Assemble the partition metadata: the start offset for partition x is startIndices(x.index)
    firstParent[T].partitions.map(x => new ZippedWithIndexRDDPartition(x, startIndices(x.index)))
  }

  override def getPreferredLocations(split: Partition): Seq[String] =
    firstParent[T].preferredLocations(split.asInstanceOf[ZippedWithIndexRDDPartition].prev)

  override def compute(splitIn: Partition, context: TaskContext): Iterator[(T, Long)] = {
    val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
    // zipWithIndex computes each element's relative offset within its own partition, then adds
    // split.startIndex to obtain the absolute offset within the whole RDD
    firstParent[T].iterator(split.prev, context).zipWithIndex.map { x =>
      (x._1, split.startIndex + x._2)
    }
  }
}
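As a quick check of the startIndices computation, the scanLeft step can be reproduced in plain Scala with made-up partition sizes:

// Sizes of the first n-1 partitions, e.g. 3 and 4 elements
val partitionSizes = Seq(3L, 4L)
partitionSizes.scanLeft(0L)(_ + _)   // Seq(0, 3, 7): the start offset of each partition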

For example:

scala> var rdd2 = sc.makeRDD(Seq("a", "b", "r", "d", "f"), 2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[34] at makeRDD at <console>:21

scala> rdd2.zipWithIndex().collect
res27: Array[(String, Long)] = Array((a,0), (b,1), (r,2), (d,3), (f,4))

Its specific execution flow is as follows:

39.zipWithUniqueId

This function pairs each element of the RDD with a unique ID, producing key/value pairs. The unique ID is generated as follows:

The unique ID of the first element in each partition is the partition index; each subsequent element's ID grows by the total number of partitions. In other words, element i of partition k gets ID i*n + k, where n is the number of partitions.

/**
 * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
 * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
 * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
 *
 * Note that some RDDs, such as those returned by groupBy(), do not guarantee order of
 * elements in a partition. The unique ID assigned to each element is therefore not guaranteed,
 * and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
 * the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
 */
def zipWithUniqueId(): RDD[(T, Long)] = withScope {
  val n = this.partitions.length.toLong
  // k is the partition index
  this.mapPartitionsWithIndex { case (k, iter) =>
    // i is the element's index within its own partition
    iter.zipWithIndex.map { case (item, i) =>
      // i * n + k: the base value is the partition index k, then the ID grows by the number
      // of partitions n for each further element in the partition
      (item, i * n + k)
    }
  }
}

Examples are as follows:

scala> var rdd1 = sc.makeRDD(Seq("a", "b", "c", "d", "e", "f"), 2)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[44] at makeRDD at <console>:21

rdd1 has two partitions, so n = 2: elements in partition 0 get IDs 0, 2, 4 and elements in partition 1 get IDs 1, 3, 5.

scala> rdd1.zipWithUniqueId().collect
res32: Array[(String, Long)] = Array((a,0), (b,2), (c,4), (d,1), (e,3), (f,5))

The execution process is as follows:

40.foreach

Applies the function f to each element of the RDD.

/**
 * Applies a function f to all elements of this RDD.
 */
def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

Note that this is an action, so it triggers the execution of f.

Also note that foreach runs only on the executors, not on the driver.

For example, rdd.foreach(println) only prints to each executor's stdout; the output is not visible on the driver.

By combining an accumulator shared variable with foreach, you can sum the values in the RDD:

scala> var cnt = sc.accumulator(0)
cnt: org.apache.spark.Accumulator[Int] = 0

scala> var rdd1 = sc.makeRDD(1 to 10, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at makeRDD at <console>:21

scala> rdd1.foreach(x => cnt += x)

scala> cnt.value
res51: Int = 55

41.foreachPartition

foreachPartition is like foreach but operates on each partition; the difference between them is similar to that between map and mapPartitions.

/**
 * Applies a function f to each partition of this RDD.
 */
def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
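A common usage sketch: open one expensive resource per partition rather than per element. createConnection() and conn.write() below are hypothetical placeholders, not Spark APIs:

val rdd = sc.makeRDD(1 to 100, 4)
rdd.foreachPartition { iter =>
  val conn = createConnection()          // hypothetical: one connection per partition
  try {
    iter.foreach(record => conn.write(record))
  } finally {
    conn.close()
  }
}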

42.subtract

Removes from this RDD the records that also appear in the other RDD and returns the remaining records of this RDD.
/**
 * Return an RDD with the elements from `this` that are not in `other`.
 *
 * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
 * RDD will be <= us.
 */
def subtract(other: RDD[T]): RDD[T] = withScope {
  // To subtract, we must know how this RDD is distributed, i.e. its partitioner
  subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))
}

/**
 * Return an RDD with the elements from `this` that are not in `other`.
 */
def subtract(
    other: RDD[T],
    p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  if (partitioner == Some(p)) {
    // If the resulting RDD's partitioner is the same as this RDD's, regenerate a p2.
    // Because p2 does not override equals, any comparison against it is effectively a
    // reference comparison, which causes both of the following RDDs to be shuffled.
    // Why it was designed this way is not entirely clear.
    // Our partitioner knows how to handle T (which, since we have a partitioner, is
    // really (K, V)) so make a new Partitioner that will de-tuple our fake tuples
    val p2 = new Partitioner() {
      override def numPartitions: Int = p.numPartitions
      override def getPartition(k: Any): Int = p.getPartition(k.asInstanceOf[(Any, _)]._1)
    }
    // Unfortunately, since we're making a new p2, we'll get ShuffleDependencies
    // anyway, and when calling .keys, will not have a partitioner set, even though
    // the SubtractedRDD will, thanks to p2's de-tupled partitioning, already be
    // partitioned by the right/real keys (e.g. p).
    this.map(x => (x, null)).subtractByKey(other.map((_, null)), p2).keys
  } else {
    // If the partitioners are not equal, use the given partitioner (by default a HashPartitioner)
    this.map(x => (x, null)).subtractByKey(other.map((_, null)), p).keys
  }
}

/** Return an RDD with the pairs from `this` whose keys are not in `other`. */
def subtractByKey[W: ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K, V)] = self.withScope {
  new SubtractedRDD[K, V, W](self, other, p)
}
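For reference, a minimal usage sketch (illustrative values; the result order may differ because subtract shuffles the data):

val rdd1 = sc.makeRDD(Seq("A", "B", "C", "D"), 2)
val rdd2 = sc.makeRDD(Seq("C", "D", "E"), 2)
rdd1.subtract(rdd2).collect()   // only A and B remain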
