Spark Growth Path (3): Talking about RDD transformations

Source: Internet
Author: User
Tags: abstract, hash, iterable, join, ord, require, shuffle

Reference article: the coalesce() method and the repartition() method
Transformations

Contents: repartitionAndSortWithinPartitions · coalesce and repartition · pipe · cartesian · cogroup · join · sortByKey · aggregateByKey · reduceByKey · groupByKey · distinct · intersection · union · sample · map · mapPartitions · mapPartitionsWithIndex · flatMap · filter · core function combineByKeyWithClassTag

When I first started writing Spark I only understood transformations superficially; see the earlier article on RDD operations for details.
Today I am using some free time to go over these RDD transformations one by one and deepen my understanding.

repartitionAndSortWithinPartitions explained

It does exactly what the name says: while the data is being repartitioned, it is also sorted within each partition. The parameter is a partitioner (I will talk about partitioners in the next section). The official documentation says this method is more efficient than calling repartition and then sorting within each partition, because the sorting is pushed down into the shuffle machinery.
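
A quick usage sketch; this and the later sketches in this article assume an existing SparkContext `sc`, and the data is purely illustrative:

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"), (1, "d")))
    // repartition into 2 partitions and sort by key inside each partition, in a single shuffle
    val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))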

ShuffledRDD Source

OrderedRDDFunctions.scala

def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
  new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering)
}

The code logic is relatively simple: it creates a ShuffledRDD and sets the key ordering.

coalesce and repartition explained

Why put these two together? Because the source shows that repartition actually just calls coalesce, passing shuffle = true.
That makes it easy: we only have to understand the coalesce method. Its job is to change the number of partitions, and the second parameter controls whether a shuffle is performed while repartitioning.
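
A hedged usage sketch with the same assumptions (an existing `sc`, illustrative numbers):

    val rdd = sc.parallelize(1 to 100, 10)
    // shrink to 2 partitions without a shuffle (default shuffle = false)
    val narrow = rdd.coalesce(2)
    // grow (or rebalance) to 20 partitions; repartition always shuffles
    val wide = rdd.repartition(20)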

CoalescedRDD Source

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // Include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions,
      partitionCoalescer).values
  } else {
    new CoalescedRDD(this, numPartitions, partitionCoalescer)
  }
}
pipe explained

Put simply, pipe executes an external command, feeds the RDD's elements to it, and turns the command's output into an RDD[String]. With this feature you can call scripting languages such as PHP or Python from Scala, mixing languages across the boundary.
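
A small illustrative sketch; it assumes `sc` exists and that the `tr` command is available on every worker node:

    // each element is written to the command's stdin, one per line; each output line becomes an element
    val lines = sc.parallelize(Seq("spark", "hadoop", "flink"))
    val upper = lines.pipe("tr a-z A-Z")
    upper.collect()   // Array(SPARK, HADOOP, FLINK)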

PipedRDD Source

/**
 * Return an RDD created by piping elements to a forked external process.
 */
def pipe(command: String): RDD[String] = withScope {
  // Similar to Runtime.exec(), if we are given a single string, split it into words
  // using a standard StringTokenizer (i.e. by spaces)
  pipe(PipedRDD.tokenize(command))
}

/**
 * Return an RDD created by piping elements to a forked external process.
 */
def pipe(command: String, env: Map[String, String]): RDD[String] = withScope {
  // Similar to Runtime.exec(), if we are given a single string, split it into words
  // using a standard StringTokenizer (i.e. by spaces)
  pipe(PipedRDD.tokenize(command), env)
}

def pipe(
    command: Seq[String],
    env: Map[String, String] = Map(),
    printPipeContext: (String => Unit) => Unit = null,
    printRDDElement: (T, String => Unit) => Unit = null,
    separateWorkingDir: Boolean = false,
    bufferSize: Int = 8192,
    encoding: String = Codec.defaultCharsetCodec.name): RDD[String] = withScope {
  new PipedRDD(this, command, env,
    if (printPipeContext ne null) sc.clean(printPipeContext) else null,
    if (printRDDElement ne null) sc.clean(printRDDElement) else null,
    separateWorkingDir,
    bufferSize,
    encoding)
}
cartesian explained

Computes the Cartesian product with the data of another RDD. In practice this scenario is rarely needed, so I will only touch on it briefly.
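
A tiny illustration, again assuming an existing `sc`:

    val a = sc.parallelize(Seq(1, 2))
    val b = sc.parallelize(Seq("x", "y"))
    a.cartesian(b).collect()   // Array((1,x), (1,y), (2,x), (2,y))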

CartesianRDD Source

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
  new CartesianRDD(sc, this, other)
}
cogroup explained

For pair RDDs with the same key type, cogroup collects, for each key K, the different values from each input RDD: the result holds one tuple per key, containing one iterable per input RDD with however many values that RDD had for the key.

For example, starting from (a,1), (a,2), (a,3) in one RDD, after a cogroup with another RDD you get (a, (Iterable(1, 2, 3), Iterable(...))).
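
A minimal sketch of the behaviour (assuming `sc`; the keys and values are made up):

    val left  = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    val right = sc.parallelize(Seq(("a", 9)))
    left.cogroup(right).collect()
    // roughly: Array((a, ([1, 2], [9])), (b, ([3], [])))  -- one Iterable per input RDD, per key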

There are nine overloads of cogroup in the source; I will list just one of them here:

def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  val cg = new CoGroupedRDD[K](Seq(self, other1, other2), partitioner)
  cg.mapValues { case Array(vs, w1s, w2s) =>
    (vs.asInstanceOf[Iterable[V]],
      w1s.asInstanceOf[Iterable[W1]],
      w2s.asInstanceOf[Iterable[W2]])
  }
}
  
join explained

Similar to an inner join statement in MySQL.
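
An illustrative sketch, assuming `sc` and made-up data:

    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    val orders = sc.parallelize(Seq((1, "book"), (1, "pen"), (3, "cup")))
    users.join(orders).collect()
    // Array((1, (alice, book)), (1, (alice, pen)))  -- only keys present on both sides, like an inner join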

CoGroupedRDD Source

Since we are comparing with MySQL joins, there are naturally inner, left outer, right outer and full outer variants as well. The plain join method looks like this in the source:

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}

As the source shows, join actually just calls the cogroup method.

sortByKey explained

For an RDD in (K, V) format, sorts by K; the parameter selects ascending or descending order.
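
A short usage sketch (same assumptions: an existing `sc`, toy data):

    val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
    pairs.sortByKey().collect()                    // ascending: Array((1,a), (2,b), (3,c))
    pairs.sortByKey(ascending = false).collect()   // descending: Array((3,c), (2,b), (1,a))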

ShuffledRDD Source

In OrderedRDDFunctions

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope
{
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
aggregateByKey explained

Aggregates the values of each key, starting from a given zero value.
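
For instance, a sketch that keeps the per-key maximum (assuming `sc`; the zero value and data are illustrative):

    val scores = sc.parallelize(Seq(("a", 3), ("a", 5), ("b", 7)))
    // zero value 0; take the max within a partition, then the max across partitions
    val maxPerKey = scores.aggregateByKey(0)(math.max(_, _), math.max(_, _))
    maxPerKey.collect()   // Array((a,5), (b,7))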

ShuffledRDD Source

def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)

  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

  // We will clean the combiner closure later in `combineByKey`
  val cleanedSeqOp = self.context.clean(seqOp)
  combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
    cleanedSeqOp, combOp, partitioner)
}
reduceByKey explained

Aggregates by key: values with the same key are merged together, and the merge function is supplied by the caller.
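
The classic word-count sketch, assuming `sc`:

    val words = sc.parallelize(Seq("a", "b", "a"))
    words.map(word => (word, 1)).reduceByKey(_ + _).collect()   // Array((a,2), (b,1)), order may vary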

ShuffledRDD Source

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
groupByKey explained

An operation on RDDs of (K, V) type: groups the data by key and repartitions it.
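
A small sketch (assuming `sc`):

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    pairs.groupByKey().collect()   // roughly (a, [1, 2]), (b, [3]); note that every value is shuffled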

ShuffledRDD Source

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
distinct explained

Deduplication: removes duplicate elements from the RDD.
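
For example (assuming `sc`):

    sc.parallelize(Seq(1, 2, 2, 3, 3, 3)).distinct().collect()   // Array(1, 2, 3), order not guaranteed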

Source (the result type is consistent with the parent RDD)

def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}
intersection explained

Returns the intersection of two RDDs and deduplicates the result.
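
A quick sketch (assuming `sc`):

    val a = sc.parallelize(Seq(1, 2, 2, 3))
    val b = sc.parallelize(Seq(2, 3, 3, 4))
    a.intersection(b).collect()   // Array(2, 3) -- duplicates removed, order not guaranteed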

Source (consistent with the parent RDD)

def intersection(other: RDD[T]): RDD[T] = withScope {
  this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
      .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
      .keys
}

/**
 * Return the intersection of this RDD and another one. The output will not contain any
 * duplicate elements, even if the input RDDs did.
 *
 * @note This method performs a shuffle internally.
 *
 * @param partitioner Partitioner to use for the resulting RDD
 */
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  this.map(v => (v, null)).cogroup(other.map(v => (v, null)), partitioner)
      .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
      .keys
}

/**
 * Return the intersection of this RDD and another one. The output will not contain any
 * duplicate elements, even if the input RDDs did. Performs a hash partition across the cluster.
 *
 * @note This method performs a shuffle internally.
 *
 * @param numPartitions How many partitions to use in the resulting RDD
 */
def intersection(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
  intersection(other, new HashPartitioner(numPartitions))
}
union explained

Merges two RDDs without deduplicating the result.
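
A toy example, assuming `sc`:

    val a = sc.parallelize(Seq(1, 2))
    val b = sc.parallelize(Seq(2, 3))
    a.union(b).collect()   // Array(1, 2, 2, 3) -- no deduplication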

UnionRDD / PartitionerAwareUnionRDD Source

def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = withScope {
  val partitioners = rdds.flatMap(_.partitioner).toSet
  if (rdds.forall(_.partitioner.isDefined) && partitioners.size == 1) {
    new PartitionerAwareUnionRDD(this, rdds)
  } else {
    new UnionRDD(this, rdds)
  }
}
sample explained

Takes a random sample of the RDD.
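
An illustrative sketch (assuming `sc`; the fraction and seed are arbitrary):

    val rdd = sc.parallelize(1 to 100)
    // roughly 10% of the elements, without replacement, fixed seed for repeatability
    val sampled = rdd.sample(withReplacement = false, fraction = 0.1, seed = 42L)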

Parent RDD source code

def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = {
  require(fraction >= 0,
    s"Fraction must be nonnegative, but got ${fraction}")

  withScope {
    require(fraction >= 0.0, "Negative fraction value: " + fraction)
    if (withReplacement) {
      new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
    } else {
      new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
    }
  }
}
map explained

The simplest transformation: the passed-in function is applied to every element of the parent RDD, one by one, so the child RDD has exactly the same number of elements as the parent.
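
For example (assuming `sc`):

    sc.parallelize(Seq(1, 2, 3)).map(_ * 2).collect()   // Array(2, 4, 6)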

MapPartitionsRDD Source

def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
mapPartitions explained

The map operation is performed per partition: the function receives an iterator over one whole partition.
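
A sketch that sums each partition (assuming `sc`):

    val rdd = sc.parallelize(1 to 10, 2)
    // the function sees a whole partition as an Iterator, which is handy for per-partition setup
    rdd.mapPartitions(iter => Iterator(iter.sum)).collect()   // one sum per partition, e.g. Array(15, 40)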

MapPartitionsRDD Source

def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U] = withScope {
  val cleanedF = sc.clean(f)
  new MapPartitionsRDD(
    this,
    (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
    preservesPartitioning)
}
mapPartitionsWithIndex explained

The same as mapPartitions, except that the partition index is also made available to the function.
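
A sketch that tags every element with its partition index (assuming `sc`):

    val rdd = sc.parallelize(1 to 6, 3)
    rdd.mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x))).collect()
    // each element paired with the index of the partition it lives in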

MapPartitionsRDD Source

def mapPartitionsWithIndex[U: ClassTag](
    f: (Int, Iterator[T]) => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U] = withScope {
  val cleanedF = sc.clean(f)
  new MapPartitionsRDD(
    this,
    (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
    preservesPartitioning)
}
flatMap explained

Each element is transformed into zero or more elements by the passed-in function, and the results are then flattened.
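
For example, splitting lines into words (assuming `sc`):

    sc.parallelize(Seq("a b", "c")).flatMap(_.split(" ")).collect()   // Array(a, b, c)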

MapPartitionsRDD Source

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
filter explained

Filters the parent RDD with the given predicate; elements that satisfy the condition are passed on to the child RDD.
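
For example, keeping the even numbers (assuming `sc`):

    sc.parallelize(1 to 10).filter(_ % 2 == 0).collect()   // Array(2, 4, 6, 8, 10)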

MapPartitionsRDD Source

def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}
Core function: combineByKeyWithClassTag

In the explanations of groupByKey, aggregateByKey, reduceByKey and the other operations on (K, V) RDDs, the source code always ends up calling the combineByKeyWithClassTag method, so it is worth understanding it.

Reference article: combineByKey

combineByKeyWithClassTag source

def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}

The core is three functions. createCombiner initializes C from the first value of a key. mergeValue folds the remaining values of that key into the C produced so far, iterating over them. mergeCombiners merges the partial C results when data for a key sits in different partitions.

This function converts an RDD[(K, V)] into an RDD[(K, C)]. V is the parent RDD's value type and K is the parent RDD's key type; the whole operation is keyed on K and converts V into C, where C can be understood as any type, including V's own type.

Everything is done per key: different keys never interact with each other. The explanations below describe, for the values grouped under one key, how each of the three functions processes them.

The first function, createCombiner, abstractly defines the C format. Its signature is V => C: it takes a V and returns a C. It is an initialization function; within a partition, the V of the first record of each key is passed to it and turned into a C. The second function, mergeValue, has the form (C, V) => C; it takes the C initialized above and merges the remaining values of that key into it, ultimately producing one C per key per partition. The third function, mergeCombiners, is needed only when data for the same key is spread across different partitions: it merges the partial C results produced by each partition into the final C.
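
To make the three functions concrete, here is a hedged sketch that computes a per-key average with combineByKey (the public wrapper around combineByKeyWithClassTag); `sc` and the scores are illustrative:

    // per-key average: C is a (sum, count) pair
    val scores = sc.parallelize(Seq(("a", 90), ("a", 70), ("b", 80)))
    val avg = scores.combineByKey(
        (v: Int) => (v, 1),                                                 // createCombiner: first value of a key
        (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1),                    // mergeValue: fold in the remaining values
        (c1: (Int, Int), c2: (Int, Int)) => (c1._1 + c2._1, c1._2 + c2._2)  // mergeCombiners: merge across partitions
      ).mapValues { case (sum, count) => sum.toDouble / count }
    avg.collect()   // Array((a,80.0), (b,80.0))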
