| foreach(f: T => Unit) |
Implemented in RDD by calling sc.runJob(); f is applied to every record in every partition.
| foreachPartition(f: Iterator[T] => Unit) |
Implemented by calling sc.runJob(); f is applied once to each partition's iterator.
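A minimal sketch in plain Scala (no Spark) of why the two calls differ. The names `foreachSim`, `foreachPartitionSim`, and `partitions` are hypothetical stand-ins that model an RDD as a Seq of partitions: f runs once per record in the first case, once per partition in the second.

```scala
// Model an RDD of 5 records split across 2 partitions.
val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3, 4, 5))

// foreach: f is applied to every record of every partition.
def foreachSim[T](parts: Seq[Seq[T]])(f: T => Unit): Unit =
  parts.foreach(_.foreach(f))

// foreachPartition: f is handed each partition's iterator exactly once,
// which is useful when f needs per-partition setup (e.g. one DB connection).
def foreachPartitionSim[T](parts: Seq[Seq[T]])(f: Iterator[T] => Unit): Unit =
  parts.foreach(p => f(p.iterator))

var recordCalls = 0
foreachSim(partitions)(_ => recordCalls += 1)             // called 5 times

var partitionCalls = 0
foreachPartitionSim(partitions)(_ => partitionCalls += 1) // called 2 times
```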
| collect(): Array[T] |
Implemented by calling sc.runJob(), gathering the per-partition result arrays, and concatenating them into a single array on the driver.
| toLocalIterator() |
Returns all of the data as an iterator. Implemented by calling sc.runJob(): each partition's iterator is materialized into an array, the arrays are collected at the driver, and they are then flatMapped back into one large iterator. Can be understood as a rather special driver-side cache.
| collect[U](f: PartialFunction[T, U]): RDD[U] |
Implemented in RDD as filter(f.isDefinedAt).map(f): the filter keeps the records the partial function is defined on, and the map then applies it.
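The same rewrite can be checked on a plain Scala collection, whose own `collect` has identical semantics. A minimal sketch (the names `pf`, `data`, and the two result vals are illustrative, not from Spark):

```scala
// A partial function defined only on even numbers.
val pf: PartialFunction[Int, String] = { case n if n % 2 == 0 => s"even:$n" }

val data = Seq(1, 2, 3, 4)

// The rewrite used by RDD.collect(pf): filter on isDefinedAt, then map.
val viaFilterMap = data.filter(pf.isDefinedAt).map(pf)

// Scala collections' collect does the same thing in one step.
val viaCollect = data.collect(pf)
// both are Seq("even:2", "even:4")
```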
| subtract(other: RDD[T]) |
Implemented similarly to intersection: `map(x => (x, null)).subtractByKey(other.map((_, null)), p2).keys`.
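The rewrite above can be sketched in plain Scala. `subtractSim` is a hypothetical helper that mirrors the pair-up / subtractByKey / keys pipeline on a Seq, ignoring the partitioner argument:

```scala
// subtract(other) via the key-based rewrite, on plain Scala collections.
def subtractSim[T](self: Seq[T], other: Seq[T]): Seq[T] = {
  // other.map((_, null)) pairs each element with null; collect its keys.
  val otherKeys = other.map(x => (x, null)).map(_._1).toSet
  self.map(x => (x, null))                        // map(x => (x, null))
      .filterNot { case (k, _) => otherKeys(k) }  // subtractByKey
      .map(_._1)                                  // .keys
}
// Every occurrence of a key present in `other` is removed;
// duplicates of surviving keys are kept.
```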
| reduce(f: (T, T) => T) |
Implemented by calling sc.runJob(): f reduces each partition of the RDD locally, and is then applied once more on the driver to merge the per-partition results.
| treeReduce(f: (T, T) => T, depth = 2) |
See treeAggregate.
| fold(zeroValue: T)(op: (T, T) => T) |
A special reduce with an initial value; a fold in the functional-programming sense.
| aggregate(zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U) |
Aggregation with all three ingredients: an initial value, a within-partition (seq) aggregation, and a merge (comb) aggregation. The RDD ships seqOp into each partition to do the local computation, then combines the per-partition results once more on the driver with combOp.
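The two phases can be sketched in plain Scala over simulated partitions. `aggregateSim` is a hypothetical helper, and the (sum, count) accumulator is the classic example of why seqOp and combOp have different types:

```scala
// aggregate's two phases: seqOp folds records inside each partition
// starting from zeroValue; combOp then merges the per-partition results.
def aggregateSim[T, U](parts: Seq[Seq[T]])(zero: U)(
    seqOp: (U, T) => U, combOp: (U, U) => U): U =
  parts.map(p => p.foldLeft(zero)(seqOp)) // local phase, one result per partition
       .foldLeft(zero)(combOp)            // merge phase, on the "driver"

// Compute (sum, count) in a single pass — note T = Int but U = (Int, Int).
val (sum, cnt) = aggregateSim(Seq(Seq(1, 2), Seq(3, 4, 5)))((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),
  (a, b)   => (a._1 + b._1, a._2 + b._2)
)
// sum == 15, cnt == 5
```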
| treeAggregate(zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U, depth = 2) |
Performs the merge aggregation above the partitions in two or more levels, so combining the per-partition results may itself involve a shuffle. The rest is the same as aggregate; understood as a more complex, multi-level aggregate.
| count() |
Implemented by calling sc.runJob() and summing the per-partition sizes once more on the driver.
| countApprox(timeout, confidence) |
Submits a special task to the DAGScheduler together with a special job listener, and returns an approximate result within the timeout. The calculation logic behind the returned value can be seen in the ApproximateEvaluator subclasses.
| countByValue(): Map[T, Long] |
Implemented as map(value => (value, null)).countByKey(), which is essentially a simple combineByKey. The returned map is loaded into the driver's memory, so the result set must be small.
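The map-to-pair-then-count rewrite can be sketched on a plain Scala Seq. `countByValueSim` is a hypothetical helper; the groupBy stands in for countByKey's shuffle-free counting:

```scala
// countByValue as map(value => (value, null)) followed by a per-key count.
def countByValueSim[T](data: Seq[T]): Map[T, Long] =
  data.map(v => (v, null))                         // map(value => (value, null))
      .groupBy(_._1)                               // countByKey: group by key...
      .map { case (k, vs) => (k, vs.size.toLong) } // ...and count each group
// The whole Map lives in memory, hence the small-result-set requirement.
```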
| countByValueApprox() |
Same mechanism as countApprox().
| countApproxDistinct() |
Experimental method, implemented with the HyperLogLog algorithm from the stream-lib library.
| zipWithIndex(): RDD[(T, Long)] / zipWithUniqueId(): RDD[(T, Long)] |
Zips each element with a generated index (consecutive for zipWithIndex, merely unique for zipWithUniqueId).
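The two index schemes can be sketched over simulated partitions. The vals here are illustrative; the id formula `pIdx + k * numPartitions` for zipWithUniqueId matches Spark's documented scheme, while zipWithIndex needs a pass over the partition sizes to make the indices consecutive:

```scala
// Two partitions, three elements total.
val parts: Seq[Seq[String]] = Seq(Seq("a", "b"), Seq("c"))
val n = parts.size // number of partitions

// zipWithIndex: consecutive global indices (Spark first runs a job
// to learn each partition's size so later partitions know their offset).
val withIndex: Seq[(String, Long)] =
  parts.flatten.zipWithIndex.map { case (x, i) => (x, i.toLong) }

// zipWithUniqueId: element k of partition pIdx gets id pIdx + k * n —
// unique but not consecutive, and no extra counting job is needed.
val withUniqueId: Seq[(String, Long)] =
  parts.zipWithIndex.flatMap { case (p, pIdx) =>
    p.zipWithIndex.map { case (x, k) => (x, pIdx + k.toLong * n) }
  }
// withIndex:    (a,0) (b,1) (c,2)
// withUniqueId: (a,0) (b,2) (c,1)
```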
| take(num): Array[T] |
Scans partitions one at a time, starting from the first, until num elements have been gathered.
| first() |
Equivalent to take(1).
| top(n)(ordering) |
Each partition runs a top-n routine to build a per-partition heap; rdd.reduce() then merges the partitions' heaps together, sorts, and takes the first n.
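A sketch of that two-step shape in plain Scala, with a sorted take-n standing in for the per-partition heap. `topSim` and `localTop` are hypothetical names, not Spark internals:

```scala
// top(n): keep only the n largest per partition, then reduce-merge the
// per-partition results, trimming back to n at each merge.
def topSim[T](parts: Seq[Seq[T]], n: Int)(implicit ord: Ordering[T]): Seq[T] = {
  // Stand-in for the per-partition bounded heap: sort descending, keep n.
  def localTop(p: Seq[T]): Seq[T] = p.sorted(ord.reverse).take(n)
  parts.map(localTop)                      // one small result per partition
       .reduce((a, b) => localTop(a ++ b)) // rdd.reduce()-style merge
}
```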
| max()/min() |
A special reduce with a max/min comparison function passed in.
| saveAsXxx |
Writes the RDD out to storage media (e.g. saveAsTextFile, saveAsObjectFile).
| checkpoint |
Explicitly checkpoints the RDD.