"Spark" Rdd operation detailed 4--action operator

Essentially, an action operator triggers execution of the RDD DAG by calling runJob on the SparkContext, which submits the job.
Action operators can be grouped by where their output goes: no output, HDFS, or Scala collections and data types.

No output: foreach

Applies the function f to each element of the RDD. Unlike transformation operators it does not return an RDD or an array; it returns Unit.

In the diagram, the foreach operator applies a user-defined function to each data item. In this example, the custom function is println, so the console prints all the data items.

Source:

  /**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: T => Unit) {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }
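
For illustration, a minimal usage sketch (assuming an existing SparkContext named sc, e.g. in spark-shell; the sample data is made up):

  val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
  // In local mode the output appears on the driver console; on a cluster,
  // println runs on the executors and writes to their stdout instead.
  nums.foreach(println)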
HDFS (1) saveAsTextFile

This function stores the RDD's data in the specified HDFS directory. Each element x of the RDD is mapped to the pair (null, x.toString) and then written to HDFS.

In the diagram, the boxes on the left represent RDD partitions and the boxes on the right represent HDFS blocks. Through this function, each partition of the RDD is stored as a block in HDFS.

Source:

  /**
   * Save this RDD as a text file, using string representations of elements.
   */
  def saveAsTextFile(path: String) {
    // https://issues.apache.org/jira/browse/SPARK-2075
    //
    // NullWritable is a `Comparable` in Hadoop 1.+, so the compiler cannot find an implicit
    // Ordering for it and will use the default `null`. However, it's a `Comparable[NullWritable]`
    // in Hadoop 2.+, so the compiler will call the implicit `Ordering.ordered` method to create an
    // Ordering for `NullWritable`. That's why the compiler will generate different anonymous
    // classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+.
    //
    // Therefore, here we provide an explicit Ordering `null` to make sure the compiler generate
    // same bytecodes for `saveAsTextFile`.
    val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
    val textClassTag = implicitly[ClassTag[Text]]
    val r = this.mapPartitions { iter =>
      val text = new Text()
      iter.map { x =>
        text.set(x.toString)
        (NullWritable.get(), text)
      }
    }
    RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
      .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
  }

  /**
   * Save this RDD as a compressed text file, using string representations of elements.
   */
  def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]) {
    // https://issues.apache.org/jira/browse/SPARK-2075
    val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
    val textClassTag = implicitly[ClassTag[Text]]
    val r = this.mapPartitions { iter =>
      val text = new Text()
      iter.map { x =>
        text.set(x.toString)
        (NullWritable.get(), text)
      }
    }
    RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
      .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path, codec)
  }
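
A minimal usage sketch (the output path and the SparkContext named sc are assumptions for illustration):

  val words = sc.parallelize(Seq("spark", "rdd", "action"))
  // Each partition is written as one part-NNNNN file under the output directory.
  words.saveAsTextFile("hdfs:///tmp/words_out")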
(2) saveAsObjectFile

saveAsObjectFile groups every 10 elements of each partition into an array, serializes each array into a (NullWritable, BytesWritable(Y)) pair, and writes it to HDFS in SequenceFile format.

In the diagram, the boxes on the left represent RDD partitions and the boxes on the right represent HDFS blocks. Through this function, each partition of the RDD is stored as a block in HDFS.

Source:

  /**
   * Save this RDD as a SequenceFile of serialized objects.
   */
  def saveAsObjectFile(path: String) {
    this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
      .saveAsSequenceFile(path)
  }
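
A minimal usage sketch (the path and SparkContext sc are assumptions; reading back with sc.objectFile requires supplying the element type):

  val nums = sc.parallelize(1 to 100)
  // Written as a SequenceFile of serialized batches of up to 10 elements each.
  nums.saveAsObjectFile("hdfs:///tmp/nums_obj")
  val restored = sc.objectFile[Int]("hdfs:///tmp/nums_obj")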
Scala collections and data types (1) collect

collect is the equivalent of toArray (toArray is now deprecated); it returns the distributed RDD as a single stand-alone Scala array, on which Scala's functional operations can then be used.

In the diagram, the boxes on the left represent RDD partitions and the box on the right represents an array in single-machine memory. Through this function, the result is returned to the node where the driver program runs and is stored there as an array.

Source:

  /**
   * Return an array that contains all of the elements in this RDD.
   */
  def collect(): Array[T] = {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }
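
A minimal usage sketch (assuming a SparkContext named sc); once collected, ordinary Scala collection operations apply on the driver:

  val rdd = sc.parallelize(Seq(3, 1, 2))
  val arr: Array[Int] = rdd.collect()
  val doubled = arr.map(_ * 2)   // Array(6, 2, 4)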
(2) collectAsMap

collectAsMap returns the data of a (K, V) RDD as a single stand-alone HashMap. For RDD elements with duplicate keys, later elements overwrite earlier ones.

In the diagram, the boxes on the left represent RDD partitions and the box on the right represents a stand-alone map. The data is returned to the driver program by the collectAsMap function and the result is stored as a HashMap.

Source:

  /**
   * Return the key-value pairs in this RDD to the master as a Map.
   *
   * Warning: this doesn't return a multimap (so if you have multiple values to the same key, only
   *          one value per key is preserved in the map returned)
   */
  def collectAsMap(): Map[K, V] = {
    val data = self.collect()
    val map = new mutable.HashMap[K, V]
    map.sizeHint(data.length)
    data.foreach { pair => map.put(pair._1, pair._2) }
    map
  }
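
A minimal usage sketch (assuming a SparkContext named sc) showing how duplicate keys are handled:

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  // Only one value per key survives; later pairs overwrite earlier ones, so "a" maps to 3 here.
  val m: scala.collection.Map[String, Int] = pairs.collectAsMap()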
(3) reduceByKeyLocally

The implementation is effectively reduce followed by collectAsMap: it first performs a reduce operation over the whole RDD, then collects all the results and returns them to the driver as a HashMap.

Source:

  /**
   * Merge the values for each key using an associative reduce function, but return the results
   * immediately to the master as a Map. This will also perform the merging locally on each mapper
   * before sending results to a reducer, similarly to a "combiner" in MapReduce.
   */
  def reduceByKeyLocally(func: (V, V) => V): Map[K, V] = {
    if (keyClass.isArray) {
      throw new SparkException("reduceByKeyLocally() does not support array keys")
    }
    val reducePartition = (iter: Iterator[(K, V)]) => {
      val map = new JHashMap[K, V]
      iter.foreach { pair =>
        val old = map.get(pair._1)
        map.put(pair._1, if (old == null) pair._2 else func(old, pair._2))
      }
      Iterator(map)
    } : Iterator[JHashMap[K, V]]
    val mergeMaps = (m1: JHashMap[K, V], m2: JHashMap[K, V]) => {
      m2.foreach { pair =>
        val old = m1.get(pair._1)
        m1.put(pair._1, if (old == null) pair._2 else func(old, pair._2))
      }
      m1
    } : JHashMap[K, V]
    self.mapPartitions(reducePartition).reduce(mergeMaps)
  }
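
A minimal usage sketch (assuming a SparkContext named sc):

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  // Values are merged per key within each partition first, then the per-partition
  // maps are combined on the driver.
  val totals = pairs.reduceByKeyLocally(_ + _)   // Map("a" -> 4, "b" -> 2)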
(4) lookup

The lookup function operates on a (key, value) RDD and returns a Seq of the values corresponding to the specified key. An optimization in this function is that if the RDD has a partitioner, only the partition that the key maps to is scanned, and the Seq of matching values is returned. If the RDD has no partitioner, a brute-force scan of all RDD elements is required to find those corresponding to the specified key.

In the diagram, the boxes on the left represent RDD partitions, the box on the right represents the Seq, and the final result is returned to the application on the node where the driver runs.

Source:

  /**
   * Return the list of values in the RDD for key `key`. This operation is done efficiently if the
   * RDD has a known partitioner by only searching the partition that the key maps to.
   */
  def lookup(key: K): Seq[V] = {
    self.partitioner match {
      case Some(p) =>
        val index = p.getPartition(key)
        val process = (it: Iterator[(K, V)]) => {
          val buf = new ArrayBuffer[V]
          for (pair <- it if pair._1 == key) {
            buf += pair._2
          }
          buf
        } : Seq[V]
        val res = self.context.runJob(self, process, Array(index), false)
        res(0)
      case None =>
        self.filter(_._1 == key).map(_._2).collect()
    }
  }
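
A minimal usage sketch (assuming a SparkContext named sc); this RDD has no partitioner, so the whole RDD is filtered, whereas after e.g. partitionBy only the matching partition would be scanned:

  val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (1, "c")))
  val values: Seq[String] = pairs.lookup(1)   // Seq("a", "c")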
(5) count

count returns the number of elements in the entire RDD.

In the diagram, the returned count is 5; each box represents an RDD partition.

Source:

  /**
   * Return the number of elements in the RDD.
   */
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
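
A minimal usage sketch (assuming a SparkContext named sc), matching the count of 5 mentioned above:

  val rdd = sc.parallelize(Seq("a", "b", "c", "d", "e"))
  val n: Long = rdd.count()   // 5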
(6) top

top returns the largest k elements. Descriptions of the related functions:

  • top returns the largest k elements; the ordering can be customized via the implicit Ordering[T], and the result is returned as an array.
  • takeOrdered returns the smallest k elements and preserves the order of the elements in the returned array.
  • take returns the first k elements of the RDD.
  • first returns the first element of the RDD, equivalent to take(1).

Source:

  /**
   * Returns the top k (largest) elements from this RDD as defined by the specified
   * implicit Ordering[T]. This does the opposite of [[takeOrdered]]. For example:
   * {{{
   *   sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)
   *   // returns Array(12)
   *
   *   sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)
   *   // returns Array(6, 5)
   * }}}
   *
   * @param num k, the number of top elements to return
   * @param ord the implicit ordering for T
   * @return an array of top elements
   */
  def top(num: Int)(implicit ord: Ordering[T]): Array[T] = takeOrdered(num)(ord.reverse)
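
A minimal sketch contrasting the related operators (assuming a SparkContext named sc; take and first depend on partition order rather than value order):

  val rdd = sc.parallelize(Seq(10, 4, 2, 12, 3))
  rdd.top(2)          // Array(12, 10) -- largest elements, descending
  rdd.takeOrdered(2)  // Array(2, 3)   -- smallest elements, ascending
  rdd.take(2)         // the first two elements in partition order
  rdd.first()         // the first element, equivalent to take(1)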
(7) reduce

The reduce function applies a reduceLeft-style operation to the elements of the RDD. Within each partition, reduceLeft first applies the reduce function to the first two elements, then applies it to that result and the next element taken from the iterator, and so on until the partition is exhausted; the per-partition results are then merged on the driver.

Source:

  /**
   * Reduces the elements of this RDD using the specified commutative and
   * associative binary operator.
   */
  def reduce(f: (T, T) => T): T = {
    val cleanF = sc.clean(f)
    val reducePartition: Iterator[T] => Option[T] = iter => {
      if (iter.hasNext) {
        Some(iter.reduceLeft(cleanF))
      } else {
        None
      }
    }
    var jobResult: Option[T] = None
    val mergeResult = (index: Int, taskResult: Option[T]) => {
      if (taskResult.isDefined) {
        jobResult = jobResult match {
          case Some(value) => Some(f(value, taskResult.get))
          case None => taskResult
        }
      }
    }
    sc.runJob(this, reducePartition, mergeResult)
    // Get the final result out of our Option, or throw an exception if the RDD was empty
    jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
  }
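
A minimal usage sketch (assuming a SparkContext named sc); the binary operator must be commutative and associative because partitions are reduced independently:

  val rdd = sc.parallelize(1 to 10)
  val sum = rdd.reduce(_ + _)   // 55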
(8) fold

fold works on the same principle as reduce, except that when each partition is reduced, the first element taken from the iterator is effectively zeroValue.

In the diagram, a fold operation is performed through a user-defined function; each box represents an RDD partition.

Source:

  /**
   * Aggregate the elements of each partition, and then the results for all the partitions, using a
   * given associative function and a neutral "zero value". The function op(t1, t2) is allowed to
   * modify t1 and return it as its result value to avoid object allocation; however, it should not
   * modify t2.
   */
  def fold(zeroValue: T)(op: (T, T) => T): T = {
    // Clone the zero value since we will also be serializing it as part of tasks
    var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())
    val cleanOp = sc.clean(op)
    val foldPartition = (iter: Iterator[T]) => iter.fold(zeroValue)(cleanOp)
    val mergeResult = (index: Int, taskResult: T) => jobResult = op(jobResult, taskResult)
    sc.runJob(this, foldPartition, mergeResult)
    jobResult
  }
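
A minimal usage sketch (assuming a SparkContext named sc); note that zeroValue is applied once per partition and again when merging the partition results, so it should be a neutral element:

  val rdd = sc.parallelize(1 to 4, 2)
  val sum = rdd.fold(0)(_ + _)   // 10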
(9) aggregate

aggregate first aggregates all the elements of each partition, and then applies a fold-like operation to the per-partition results.
The difference between aggregate and fold/reduce is that aggregate combines the data in a merge-like, parallelized way, and its result type U can differ from the RDD's element type T: seqOp folds an element into a U, and combOp merges two U values. With fold and reduce, the elements within each partition are processed serially, the per-partition results are then merged in the way described above, and the final aggregated result is returned.

In the diagram, the RDD is aggregated through a user-defined function; each box represents an RDD partition.

Source:

  /**
   * Aggregate the elements of each partition, and then the results for all the partitions, using
   * given combine functions and a neutral "zero value". This function can return a different result
   * type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into a U
   * and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
   * allowed to modify and return their first argument instead of creating a new U to avoid memory
   * allocation.
   */
  def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    // Clone the zero value since we will also be serializing it as part of tasks
    var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())
    val cleanSeqOp = sc.clean(seqOp)
    val cleanCombOp = sc.clean(combOp)
    val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
    val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)
    sc.runJob(this, aggregatePartition, mergeResult)
    jobResult
  }
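
A minimal usage sketch computing an average (assuming a SparkContext named sc); seqOp folds each element into a (sum, count) pair and combOp merges the per-partition pairs:

  val rdd = sc.parallelize(1 to 4, 2)
  val (sum, count) = rdd.aggregate((0, 0))(
    (acc, x) => (acc._1 + x, acc._2 + 1),
    (a, b) => (a._1 + b._1, a._2 + b._2))
  val avg = sum.toDouble / count   // 2.5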

When reprinting, please credit the author, Jason Ding, and the source:
GitCafe blog home page (http://jasonding1354.gitcafe.io/)
GitHub blog home page (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu home page (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search "jasonding1354" on Google to find my blog home page

Copyright notice: This is an original article by the author and may not be reproduced without permission.

"Spark" Rdd operation detailed 4--action operator

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.