RDD API Summary

Source: Internet
Author: User

RDD[T] Transformations
API | Notes
persist()/cache()
map(f: T => U)
keyBy(f: T => K): a special map that pairs each element with its key
flatMap(f: T => Iterable[U]): a kind of map, similar to a UDTF
filter(f: T => Boolean): a kind of map
distinct(numPartitions): implemented as map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1); reduceByKey is a special combineByKey whose mergeValue and mergeCombiners functions are identical, both being (x, y) => x
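The distinct() strategy described above can be sketched with local Scala collections (this is illustrative code, not Spark's source): pair each element with null, reduce by key keeping either value, then drop the null. Here a local groupBy stands in for the shuffle, and the names are made up for the sketch.

```scala
// Local-collection sketch of distinct() as map + reduceByKey + map.
object DistinctSketch {
  def distinctViaReduceByKey[T](xs: Seq[T]): Seq[T] =
    xs.map(x => (x, null))                                  // map(x => (x, null))
      .groupBy(_._1)                                        // stands in for the shuffle by key
      .map { case (_, pairs) => pairs.reduce((a, b) => a) } // reduceByKey((x, y) => x)
      .map(_._1)                                            // map(_._1)
      .toSeq

  def main(args: Array[String]): Unit = {
    val out = distinctViaReduceByKey(Seq(1, 2, 2, 3, 1)).sorted
    assert(out == Seq(1, 2, 3))
    println(out.mkString(","))
  }
}
```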
repartition(numPartitions)/coalesce(numPartitions): repartition can increase or decrease the number of partitions; coalesce specifically reduces partitions and can avoid a shuffle by using narrow dependencies
sample()/randomSplit()/takeSample(): sampling
union(RDD[T]): does not deduplicate; apply distinct() afterwards to remove duplicates
sortBy[K](f: (T) => K): f is the key function; implemented as keyBy(f).sortByKey().values, which sets a RangePartitioner on the RDD
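The sortBy decomposition just described (keyBy, sort by the key, drop the key) can be mimicked on a local Seq; in Spark the middle step is a range-partitioned sortByKey, which a plain sortBy stands in for here. The names are illustrative.

```scala
// Local sketch of sortBy(f) as keyBy(f).sortByKey().values.
object SortBySketch {
  def sortByViaKeyBy[T, K: Ordering](xs: Seq[T])(f: T => K): Seq[T] =
    xs.map(x => (f(x), x)) // keyBy(f): pair each element with its key
      .sortBy(_._1)        // sortByKey() (range-partitioned sort in Spark)
      .map(_._2)           // values: drop the key again

  def main(args: Array[String]): Unit = {
    val out = sortByViaKeyBy(Seq("spark", "rdd", "api"))(_.length)
    assert(out == Seq("rdd", "api", "spark"))
    println(out)
  }
}
```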
intersection(RDD[T]): intersection of two sets, with deduplication; implemented as map(v => (v, null)).cogroup(other.map(v => (v, null))).filter(both sides non-empty).keys; cogroup yields (K, (Iterable[V], Iterable[W])) and may involve a shuffle to align the partitions of the two RDDs
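The intersection strategy above can be sketched locally: cogroup both sides keyed on the element itself, then keep only the keys whose groups are non-empty on both sides. groupBy stands in for cogroup's shuffle; all names are illustrative.

```scala
// Local sketch of intersection() via a cogroup-style grouping.
object IntersectionSketch {
  def intersectionViaCogroup[T](left: Seq[T], right: Seq[T]): Set[T] = {
    val l = left.map(v => (v, null)).groupBy(_._1)  // left side keyed on the element
    val r = right.map(v => (v, null)).groupBy(_._1) // right side keyed on the element
    // keep keys present (non-empty group) on BOTH sides, deduplicated by the grouping
    (l.keySet ++ r.keySet).filter(k => l.contains(k) && r.contains(k))
  }

  def main(args: Array[String]): Unit = {
    val out = intersectionViaCogroup(Seq(1, 2, 2, 3), Seq(2, 3, 3, 4))
    assert(out == Set(2, 3))
    println(out)
  }
}
```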
glom(): RDD[Array[T]]: merges the data of each partition into an array; each partition is originally an Iterator[T]
cartesian(RDD[U]): RDD[(T, U)]: Cartesian product of two sets; implemented as a nested loop over the two RDDs, yielding each pair (x, y)
groupBy[K](f: T => K): RDD[(K, Iterable[T])]: if an aggregation follows, it is usually more efficient to use aggregateByKey or reduceByKey directly (both are essentially combineByKey)
pipe(command: String): pipes the RDD's data through an external process created with ProcessBuilder and reads its output
mapPartitions(f: Iterator[T] => Iterator[U])/mapPartitionsWithIndex(f: (Int, Iterator[T]) => Iterator[U]): applies a map transform to each partition of the RDD
zip(RDD[U]): RDD[(T, U)]: the two RDDs must have the same number of partitions and the same number of records in each partition
Actions
API | Notes
foreach(f: T => Unit): calls sc.runJob(); f is applied to every record in every partition
foreachPartition(f: Iterator[T] => Unit): calls sc.runJob(); f is applied to each partition
collect(): Array[T]: calls sc.runJob(), gathers the results, and concatenates the per-partition arrays into a single array
toLocalIterator(): returns all data as an iterator; calls sc.runJob() per partition, turning each partition's iterator into an array, collects at the driver, then flattens into one large iterator; can be understood as a rather special driver-side cache
collect[U](f: PartialFunction[T, U]): RDD[U]: implemented as filter(f.isDefinedAt).map(f); the filter finds the matching data and the map applies the partial function
subtract(RDD[T]): similar to intersection; implemented as map(x => (x, null)).subtractByKey(other.map((_, null)), p2).keys
reduce(f: (T, T) => T): calls sc.runJob(); f is applied within each partition, then once more at the driver when the partition results are merged
treeReduce(f: (T, T) => T, depth = 2): see treeAggregate
fold(zeroValue: T)(op: (T, T) => T): a special reduce with an initial value; fold in the functional-programming sense
aggregate(zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): aggregation with all three pieces: an initial value, a within-partition seqOp, and a cross-partition combOp; the functions are shipped into the partitions, and the per-partition results are combined once more with combOp at the driver
treeAggregate(zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U)(depth = 2): merges the partition results in two or more levels, so the merge phase may itself involve a shuffle; otherwise the same as aggregate; can be understood as a multi-level aggregate
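The aggregate contract (seqOp inside each partition, combOp across partition results) can be sketched with a Seq of Seqs standing in for partitions. The classic sum-and-count-in-one-pass usage follows; treeAggregate only changes the shape of the combOp phase, not this contract. Names are illustrative.

```scala
// Local sketch of aggregate(zeroValue)(seqOp, combOp).
object AggregateSketch {
  def aggregateLocal[T, U](partitions: Seq[Seq[T]], zero: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U =
    partitions
      .map(part => part.foldLeft(zero)(seqOp)) // seqOp runs inside each partition
      .foldLeft(zero)(combOp)                  // combOp merges the partition results

  def main(args: Array[String]): Unit = {
    // compute (sum, count) in one pass, a common use of aggregate
    val (sum, count) = aggregateLocal(Seq(Seq(1, 2), Seq(3, 4, 5)), (0, 0))(
      (acc, x) => (acc._1 + x, acc._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2))
    assert(sum == 15 && count == 5)
    println(s"mean = ${sum.toDouble / count}")
  }
}
```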
count(): calls sc.runJob() and sums the sizes of the partitions at the driver
countApprox(timeout, confidence): submits a special DAGScheduler job with a special task listener; returns an approximate result within the timeout; the computation logic is in the ApproximateEvaluator subclasses
countByValue(): Map[T, Long]: implemented as map(value => (value, null)).countByKey(); essentially a simple combineByKey; returns a Map loaded into driver memory, so the result set must be small
countByValueApprox(): same idea as countApprox()
countApproxDistinct(): experimental; uses the stream-lib library's HyperLogLog implementation
zipWithIndex(): RDD[(T, Long)] / zipWithUniqueId(): RDD[(T, Long)]: zip with a generated index
take(num): Array[T]: scans partitions one at a time until num elements are found
first(): equivalent to take(1)
top(n)(ordering): each partition applies the top handler to build a bounded per-partition heap; rdd.reduce() merges the per-partition heaps, and the merged result is sorted and the first n taken
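The top(n) scheme above can be mimicked locally; where Spark uses a bounded priority queue per partition, this sketch just sorts and takes n, which has the same result. Names are illustrative.

```scala
// Local sketch of top(n): bound each partition to its n largest, then merge.
object TopSketch {
  def topLocal[T](partitions: Seq[Seq[T]], n: Int)(implicit ord: Ordering[T]): Seq[T] =
    partitions
      .map(part => part.sorted(ord.reverse).take(n))            // per-partition bounded "heap"
      .reduce((a, b) => (a ++ b).sorted(ord.reverse).take(n))   // merge, like rdd.reduce()

  def main(args: Array[String]): Unit = {
    val out = topLocal(Seq(Seq(5, 1, 9), Seq(7, 3), Seq(8, 2)), 3)
    assert(out == Seq(9, 8, 7))
    println(out)
  }
}
```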
max()/min(): a special reduce with a max/min comparison function
saveAsXxx: writes to the output storage medium
checkpoint: explicit checkpoint declaration
PairRDDFunctions (special operations for RDD[(K, V)])
API | Notes
combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]: the classic MapReduce-style split of an aggregation; an important base API
aggregateByKey[U](zeroValue: U, seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]: turns zeroValue into a createCombiner function and calls combineByKey(); essentially the two are the same
foldByKey(zeroValue: V, func: (V, V) => V): RDD[(K, V)]: func serves as both mergeValue and mergeCombiners; calls combineByKey()
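The relationship just described, where combineByKey is the base API and aggregateByKey derives its createCombiner from the zero value, can be sketched locally. This single-partition sketch omits mergeCombiners (which would merge the per-partition maps); names are illustrative.

```scala
// Local sketch of combineByKey, with aggregateByKey expressed on top of it.
object CombineByKeySketch {
  def combineByKeyLocal[K, V, C](pairs: Seq[(K, V)])(
      createCombiner: V => C, mergeValue: (C, V) => C): Map[K, C] =
    pairs.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      // first value for a key goes through createCombiner, later ones through mergeValue
      acc.updated(k, acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v)))
    }

  // aggregateByKey: the zero value becomes the createCombiner, seqOp is mergeValue
  def aggregateByKeyLocal[K, V, U](pairs: Seq[(K, V)], zero: U)(seqOp: (U, V) => U): Map[K, U] =
    combineByKeyLocal(pairs)(v => seqOp(zero, v), seqOp)

  def main(args: Array[String]): Unit = {
    val sums = aggregateByKeyLocal(Seq(("a", 1), ("b", 2), ("a", 3)), 0)(_ + _)
    assert(sums == Map("a" -> 4, "b" -> 2))
    println(sums)
  }
}
```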
sampleByKey(): generates a key-dependent sample function and calls rdd.mapPartitionsWithIndex(sampleFunc)
reduceByKey(): calls combineByKey
reduceByKeyLocally(func: (V, V) => V): Map[K, V]: implemented as self.mapPartitions(reducePartition).reduce(mergeMaps); reducePartition builds a HashMap in each partition, and mergeMaps merges the HashMaps
countByKey(): implemented as mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
countByKeyApprox(): implemented as map(_._1).countByValueApprox
countApproxDistinctByKey(): similar to the RDD's countApproxDistinct, except the logic runs inside combineByKey
groupByKey(): a simple combineByKey implementation
partitionBy(partitioner): sets up a new partition structure for the RDD
join(RDD[(K, W)]): RDD[(K, (V, W))]: implemented as cogroup(other, partitioner).flatMapValues(...)
leftOuterJoin(...): same implementation, but the yield logic over the two RDDs inside flatMapValues changes
rightOuterJoin(...): ditto
fullOuterJoin(...): ditto
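The join family described above all shares one cogroup followed by different pairing logic, which can be sketched locally. groupBy stands in for cogroup's shuffle, and only the yield logic differs between the inner and left outer variants; names are illustrative.

```scala
// Local sketch of join/leftOuterJoin as cogroup + flatMapValues-style pairing.
object JoinSketch {
  def cogroupLocal[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val lm = l.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
    val rm = r.groupBy(_._1).map { case (k, ws) => k -> ws.map(_._2) }
    (lm.keySet ++ rm.keySet)
      .map(k => k -> (lm.getOrElse(k, Seq.empty[V]), rm.getOrElse(k, Seq.empty[W]))).toMap
  }

  def joinLocal[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, W))] =
    cogroupLocal(l, r).toSeq.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))      // inner join: both sides must match
    }

  def leftOuterJoinLocal[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    cogroupLocal(l, r).toSeq.flatMap { case (k, (vs, ws)) =>
      if (ws.isEmpty) vs.map(v => (k, (v, None: Option[W])))   // keep unmatched left rows
      else for (v <- vs; w <- ws) yield (k, (v, Some(w): Option[W]))
    }

  def main(args: Array[String]): Unit = {
    val l = Seq((1, "a"), (2, "b")); val r = Seq((1, "x"))
    assert(joinLocal(l, r).toSet == Set((1, ("a", "x"))))
    assert(leftOuterJoinLocal(l, r).toSet == Set((1, ("a", Some("x"))), (2, ("b", None))))
    println("joins ok")
  }
}
```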
collectAsMap(): implemented as collect() followed by inserting each pair into a Map
mapValues(f: V => U): a simple map() operation
flatMapValues(f: V => Iterable[U]): a simple flatMap-style operation on the values
cogroup(RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]: the basic API underlying aggregation operations, including the various joins, intersection, etc.
subtractByKey(RDD[(K, W)]): RDD[(K, V)]: removes from the original RDD the keys that appear in the other RDD
lookup(key: K): Seq[V]: if the RDD is partitioned by key, only the matching partition is scanned, which is more efficient; otherwise a full scan, implemented as filter(_._1 == key).map(_._2).collect()
saveAsXxx: writes to external storage
keys(): a simple map() operation
values(): a simple map() operation
AsyncRDDActions

countAsync, collectAsync, takeAsync, foreachAsync, foreachPartitionAsync

OrderedRDDFunctions

For RDD[(K: Ordering, V)]

API | Notes
sortByKey(): see the notes under sortBy()
filterByRange(lower: K, upper: K): when the RDD has a RangePartitioner, the filter can skip the partitions outside the range
DoubleRDDFunctions

For RDD[Double]

API | Notes
sum(): implemented as reduce(_ + _)
stats(): implemented with StatCounter as mapPartitions(nums => Iterator(StatCounter(nums))).reduce((a, b) => a.merge(b)); computes count, mean, and variance in a single traversal; merge() is a StatCounter method
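The single-pass stats() design above depends on per-partition summaries that can be merged exactly. A minimal sketch of that idea, assuming a StatCounter-like record of count, mean, and M2 (sum of squared deviations) merged with the standard parallel-variance formula; this is not Spark's StatCounter, just the same technique.

```scala
// Local sketch of StatCounter-style mergeable statistics.
object StatsSketch {
  final case class Stat(n: Long, mean: Double, m2: Double) {
    // merge two partial summaries (parallel variance formula)
    def merge(o: Stat): Stat =
      if (n == 0) o
      else if (o.n == 0) this
      else {
        val delta = o.mean - mean
        val total = n + o.n
        Stat(total,
             mean + delta * o.n / total,
             m2 + o.m2 + delta * delta * n * o.n / total)
      }
    def variance: Double = if (n == 0) Double.NaN else m2 / n
  }

  // per-partition summary, playing the role of StatCounter(nums)
  def stat(nums: Seq[Double]): Stat = {
    val n = nums.size.toLong
    val mean = if (n == 0) 0.0 else nums.sum / n
    Stat(n, mean, nums.map(x => (x - mean) * (x - mean)).sum)
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1.0, 2.0), Seq(3.0, 4.0, 5.0))
    val merged = parts.map(stat).reduce(_ merge _) // mapPartitions(...).reduce(merge)
    assert(merged.n == 5 && merged.mean == 3.0)
    assert(math.abs(merged.variance - 2.0) < 1e-9)
    println(s"n=${merged.n} mean=${merged.mean} var=${merged.variance}")
  }
}
```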
mean(): implemented as stats().mean
variance()/sampleVariance(): implemented as stats().variance / stats().sampleVariance
stdev()/sampleStdev(): implemented as stats().stdev / stats().sampleStdev, the standard deviation
meanApprox()/sumApprox(): call runApproximateJob
histogram(): a more complex calculation; implemented as mapPartitions followed by reduce, including some recursion
