The implementation of Spark operators explained in detail (Part 8)

36.zip

Pairs the elements at the same position in two RDDs into key-value pairs.

/**
 * Zips this RDD with another one, returning key-value pairs with the first element in each RDD,
 * second element in each RDD, etc. Assumes that the two RDDs have the *same number of
 * partitions* and the *same number of elements in each partition* (e.g. one was made through
 * a map on the other).
 */
def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
  zipPartitions(other, preservesPartitioning = false) { (thisIter, otherIter) =>
    new Iterator[(T, U)] {
      // Output a pair only when both iterators still have values; if neither has a value,
      // stop; if only one of them has a value left, the lengths differ and an exception is thrown
      def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
        case (true, true) => true
        case (false, false) => false
        case _ => throw new SparkException("Can only zip RDDs with " +
          "same number of elements in each partition")
      }
      def next(): (T, U) = (thisIter.next(), otherIter.next())
    }
  }
}
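For reference, a minimal usage sketch (the values are illustrative; both RDDs must have the same number of partitions and the same number of elements per partition):

val a = sc.makeRDD(1 to 4, 2)
val b = sc.makeRDD(Seq("a", "b", "c", "d"), 2)
a.zip(b).collect()   // Array((1,a), (2,b), (3,c), (4,d))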

Next, look at the concrete implementation of zipPartitions:

def zipPartitions[B: ClassTag, V: ClassTag]
    (rdd2: RDD[B], preservesPartitioning: Boolean)
    (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V] = withScope {
  new ZippedPartitionsRDD2(sc, sc.clean(f), this, rdd2, preservesPartitioning)
}

private[spark] class ZippedPartitionsRDD2[A: ClassTag, B: ClassTag, V: ClassTag](
    sc: SparkContext,
    var f: (Iterator[A], Iterator[B]) => Iterator[V],
    var rdd1: RDD[A],
    var rdd2: RDD[B],
    preservesPartitioning: Boolean = false)
  extends ZippedPartitionsBaseRDD[V](sc, List(rdd1, rdd2), preservesPartitioning) {

  // compute calls f with the two parent partitions' iterators; for zip, f builds the
  // iterator that overrides hasNext and next to return the zipped data
  override def compute(s: Partition, context: TaskContext): Iterator[V] = {
    val partitions = s.asInstanceOf[ZippedPartitionsPartition].partitions
    f(rdd1.iterator(partitions(0), context), rdd2.iterator(partitions(1), context))
  }

  override def clearDependencies() {
    super.clearDependencies()
    rdd1 = null
    rdd2 = null
    f = null
  }
}

So how are the partitioner and the data locality computed? We need to look at the base class ZippedPartitionsBaseRDD.

private[spark] abstract class ZippedPartitionsBaseRDD[V: ClassTag](
    sc: SparkContext,
    var rdds: Seq[RDD[_]],
    preservesPartitioning: Boolean = false)
  extends RDD[V](sc, rdds.map(x => new OneToOneDependency(x))) {

  // Because zip's default preservesPartitioning is false, the partitioner of the zipped RDD
  // is None
  override val partitioner =
    if (preservesPartitioning) firstParent[Any].partitioner else None

  override def getPartitions: Array[Partition] = {
    val numParts = rdds.head.partitions.length
    // The number of partitions of the RDDs being zipped must be the same
    if (!rdds.forall(rdd => rdd.partitions.length == numParts)) {
      throw new IllegalArgumentException("Can't zip RDDs with unequal numbers of partitions")
    }
    Array.tabulate[Partition](numParts) { i =>
      // Get the preferred locations of partition i in every RDD
      val prefs = rdds.map(rdd => rdd.preferredLocations(rdd.partitions(i)))
      // Check whether there exist some hosts that match all RDDs; otherwise return the union,
      // i.e. first take the intersection of the preferred locations
      val exactMatchLocations = prefs.reduce((x, y) => x.intersect(y))
      // If the intersection is non-empty, it becomes the zipped partition's locations;
      // if it is empty, take the distinct union of all RDDs' preferred locations
      val locs = if (!exactMatchLocations.isEmpty) exactMatchLocations else prefs.flatten.distinct
      new ZippedPartitionsPartition(i, rdds, locs)
    }
  }

  override def getPreferredLocations(s: Partition): Seq[String] = {
    s.asInstanceOf[ZippedPartitionsPartition].preferredLocations
  }

  override def clearDependencies() {
    super.clearDependencies()
    rdds = null
  }
}
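To make the location-selection logic above concrete, here is a small standalone sketch in plain Scala (the host names are made up; this is not Spark code):

val prefs = Seq(Seq("host1", "host2"), Seq("host2", "host3"))
// Intersection of all RDDs' preferred hosts for one partition index
val exactMatchLocations = prefs.reduce((x, y) => x.intersect(y))   // Seq(host2)
// A non-empty intersection wins; otherwise fall back to the distinct union
val locs = if (exactMatchLocations.nonEmpty) exactMatchLocations else prefs.flatten.distinct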
Its specific implementation process is as follows:

37.zipPartitions

zipPartitions has several variants, listed below:

def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag]
    (rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)
    (f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V] = withScope {
  new ZippedPartitionsRDD3(sc, sc.clean(f), this, rdd2, rdd3, preservesPartitioning)
}

def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag]
    (rdd2: RDD[B], rdd3: RDD[C])
    (f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V] = withScope {
  zipPartitions(rdd2, rdd3, preservesPartitioning = false)(f)
}

def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag]
    (rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D], preservesPartitioning: Boolean)
    (f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V] = withScope {
  new ZippedPartitionsRDD4(sc, sc.clean(f), this, rdd2, rdd3, rdd4, preservesPartitioning)
}

def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag]
    (rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])
    (f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V] = withScope {
  zipPartitions(rdd2, rdd3, rdd4, preservesPartitioning = false)(f)
}
It supports zipping multiple RDDs and lets you supply a custom iterator function; the internal implementation principle is the same as zip.
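For reference, a minimal usage sketch (illustrative values) that merges two RDDs partition by partition with a custom iterator function:

val nums  = sc.makeRDD(1 to 4, 2)
val chars = sc.makeRDD(Seq("a", "b", "c", "d"), 2)
val zipped = nums.zipPartitions(chars) { (it1, it2) =>
  // Combine the two partition iterators element by element
  it1.zip(it2).map { case (n, c) => s"$c$n" }
}
zipped.collect()   // Array(a1, b2, c3, d4)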

38.zipWithIndex

This function pairs each element of the RDD with its index (ID) in the RDD, producing key/value pairs.

def zipWithIndex(): RDD[(T, Long)] = withScope {
  new ZippedWithIndexRDD(this)
}

The details of ZippedWithIndexRDD are as follows:

private[spark] class ZippedWithIndexRDD[T: ClassTag](@transient prev: RDD[T])
  extends RDD[(T, Long)](prev) {

  /** The start index of each partition. */
  @transient private val startIndices: Array[Long] = {
    val n = prev.partitions.length
    if (n == 0) {
      Array[Long]()
    } else if (n == 1) {
      Array(0L)
    } else {
      // First compute the start offset of each partition. For example, with 3 partitions
      // containing 3, 4 and 3 elements respectively, startIndices is [0, 3, 7]
      prev.context.runJob(
        prev,
        Utils.getIteratorSize _,
        0 until n - 1, // do not need to count the last partition
        allowLocal = false
      ).scanLeft(0L)(_ + _)
    }
  }

  override def getPartitions: Array[Partition] = {
    // Assemble the partition metadata: the start offset for partition x is startIndices(x.index)
    firstParent[T].partitions.map(x => new ZippedWithIndexRDDPartition(x, startIndices(x.index)))
  }

  override def getPreferredLocations(split: Partition): Seq[String] =
    firstParent[T].preferredLocations(split.asInstanceOf[ZippedWithIndexRDDPartition].prev)

  override def compute(splitIn: Partition, context: TaskContext): Iterator[(T, Long)] = {
    val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
    // zipWithIndex computes each element's relative offset within its own partition, then adds
    // split.startIndex to obtain the absolute offset within the whole RDD
    firstParent[T].iterator(split.prev, context).zipWithIndex.map { x =>
      (x._1, split.startIndex + x._2)
    }
  }
}
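As a quick check of the startIndices computation, the scanLeft step can be reproduced in plain Scala with made-up partition sizes:

// Sizes of the first n-1 partitions, e.g. 3 and 4 elements
val partitionSizes = Seq(3L, 4L)
partitionSizes.scanLeft(0L)(_ + _)   // Seq(0, 3, 7): the start offset of each partition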

For example:

scala> var rdd2 = sc.makeRDD(Seq("a", "b", "r", "d", "f"), 2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[34] at makeRDD at <console>:21

scala> rdd2.zipWithIndex().collect
res27: Array[(String, Long)] = Array((a,0), (b,1), (r,2), (d,3), (f,4))

Its specific execution flow is as follows:

39.zipWithUniqueId

This function pairs each element of the RDD with a unique ID, producing key/value pairs. The unique ID is generated as follows:

The unique ID of the first element in each partition is the partition index; each subsequent element's ID grows by the total number of partitions. In other words, element i of partition k gets ID i*n + k, where n is the number of partitions.

/**
 * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
 * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
 * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
 *
 * Note that some RDDs, such as those returned by groupBy(), do not guarantee order of
 * elements in a partition. The unique ID assigned to each element is therefore not guaranteed,
 * and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
 * the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
 */
def zipWithUniqueId(): RDD[(T, Long)] = withScope {
  val n = this.partitions.length.toLong
  // k is the partition index
  this.mapPartitionsWithIndex { case (k, iter) =>
    // i is the element's index within its own partition
    iter.zipWithIndex.map { case (item, i) =>
      // i * n + k: the base value is the partition index k, then the ID grows by the number
      // of partitions n for each further element in the partition
      (item, i * n + k)
    }
  }
}

Examples are as follows:

scala> var rdd1 = sc.makeRDD(Seq("a", "b", "c", "d", "e", "f"), 2)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[44] at makeRDD at <console>:21

rdd1 has two partitions, so n = 2: elements in partition 0 get IDs 0, 2, 4 and elements in partition 1 get IDs 1, 3, 5.

scala> rdd1.zipWithUniqueId().collect
res32: Array[(String, Long)] = Array((a,0), (b,2), (c,4), (d,1), (e,3), (f,5))

The execution process is as follows:

40.foreach

Applies the function f to each element of the RDD.

/**
 * Applies a function f to all elements of this RDD.
 */
def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

Note that this is an action, so it triggers the execution of f.

Also note that foreach runs only on the executors, not on the driver.

For example, rdd.foreach(println) only prints to each executor's stdout; the output is not visible on the driver.

By combining an accumulator shared variable with foreach, you can sum the values in the RDD:

scala> var cnt = sc.accumulator(0)
cnt: org.apache.spark.Accumulator[Int] = 0

scala> var rdd1 = sc.makeRDD(1 to 10, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at makeRDD at <console>:21

scala> rdd1.foreach(x => cnt += x)

scala> cnt.value
res51: Int = 55

41.foreachPartition

foreachPartition is like foreach but operates on each partition; the difference between them is similar to that between map and mapPartitions.

/**
 * Applies a function f to each partition of this RDD.
 */
def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
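A common usage sketch: open one expensive resource per partition rather than per element. createConnection() and conn.write() below are hypothetical placeholders, not Spark APIs:

val rdd = sc.makeRDD(1 to 100, 4)
rdd.foreachPartition { iter =>
  val conn = createConnection()          // hypothetical: one connection per partition
  try {
    iter.foreach(record => conn.write(record))
  } finally {
    conn.close()
  }
}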

42.subtract

Removes from this RDD the records that also appear in the other RDD and returns the remaining records of this RDD.
/**
 * Return an RDD with the elements from `this` that are not in `other`.
 *
 * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
 * RDD will be <= us.
 */
def subtract(other: RDD[T]): RDD[T] = withScope {
  // To subtract, we must know how this RDD is distributed, i.e. its partitioner
  subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))
}

/**
 * Return an RDD with the elements from `this` that are not in `other`.
 */
def subtract(
    other: RDD[T],
    p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  if (partitioner == Some(p)) {
    // If the resulting RDD's partitioner is the same as this RDD's, regenerate a p2.
    // Because p2 does not override equals, any comparison against it is effectively a
    // reference comparison, which causes both of the following RDDs to be shuffled.
    // Why it was designed this way is not entirely clear.
    // Our partitioner knows how to handle T (which, since we have a partitioner, is
    // really (K, V)) so make a new Partitioner that will de-tuple our fake tuples
    val p2 = new Partitioner() {
      override def numPartitions: Int = p.numPartitions
      override def getPartition(k: Any): Int = p.getPartition(k.asInstanceOf[(Any, _)]._1)
    }
    // Unfortunately, since we're making a new p2, we'll get ShuffleDependencies
    // anyway, and when calling .keys, will not have a partitioner set, even though
    // the SubtractedRDD will, thanks to p2's de-tupled partitioning, already be
    // partitioned by the right/real keys (e.g. p).
    this.map(x => (x, null)).subtractByKey(other.map((_, null)), p2).keys
  } else {
    // If the partitioners are not equal, use the given partitioner (by default a HashPartitioner)
    this.map(x => (x, null)).subtractByKey(other.map((_, null)), p).keys
  }
}

/** Return an RDD with the pairs from `this` whose keys are not in `other`. */
def subtractByKey[W: ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K, V)] = self.withScope {
  new SubtractedRDD[K, V, W](self, other, p)
}
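For reference, a minimal usage sketch (illustrative values; the result order may differ because subtract shuffles the data):

val rdd1 = sc.makeRDD(Seq("A", "B", "C", "D"), 2)
val rdd2 = sc.makeRDD(Seq("C", "D", "E"), 2)
rdd1.subtract(rdd2).collect()   // only A and B remain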
