Transformation |
Meaning |
map(func) |
Return a new distributed dataset formed by passing each element of the source through a function func. |
filter(func) |
Return a new dataset formed by selecting those elements of the source on which func returns true. |
flatMap(func) |
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item). |
mapPartitions(func) |
Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T. |
mapPartitionsWithIndex(func) |
Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T. |
sample(withReplacement, fraction, seed) |
Sample a fraction (given by fraction) of the data, with or without replacement, using a given random number generator seed. |
union(otherDataset) |
Return a new dataset that contains the union of the elements in the source dataset and the argument. |
intersection(otherDataset) |
Return a new RDD that contains the intersection of elements in the source dataset and the argument. |
distinct([numTasks]) |
Return a new dataset that contains the distinct elements of the source dataset. |
groupByKey([numTasks]) |
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or combineByKey will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks. |
reduceByKey(func, [numTasks]) |
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. As with groupByKey, the number of reduce tasks is configurable through an optional second argument. |
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) |
When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. This allows an aggregated value type that differs from the input value type, while avoiding unnecessary allocations. As with groupByKey, the number of reduce tasks is configurable through an optional second argument. |
sortByKey([ascending], [numTasks]) |
When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument. |
join(otherDataset, [numTasks]) |
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are also supported through leftOuterJoin and rightOuterJoin. |
cogroup(otherDataset, [numTasks]) |
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Iterable<V>, Iterable<W>) tuples. This operation is also called groupWith. |
cartesian(otherDataset) |
When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements). |
pipe(command, [envVars]) |
Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings. |
coalesce(numPartitions) |
Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset. |
repartition(numPartitions) |
Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network. |
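
The behavior of a few of the transformations above can be modeled with plain Python lists, without a Spark installation. This is only a sketch under stated assumptions: an RDD is represented as a list of partitions (each a list of elements), and the helper names map_partitions, flat_map, group_by_key, reduce_by_key, and coalesce are hypothetical stand-ins that mimic the described semantics. They are not Spark or PySpark functions, and they ignore distribution, laziness, and shuffling.

```python
from collections import defaultdict

# Toy model: an "RDD" is a list of partitions; each partition is a list.
# Every helper below is a hypothetical stand-in, not part of any Spark API.


def map_partitions(rdd, func):
    """Call func once per partition, passing an iterator over its elements."""
    return [list(func(iter(part))) for part in rdd]


def flat_map(rdd, func):
    """Each input element may produce zero or more output elements."""
    return [[out for x in part for out in func(x)] for part in rdd]


def group_by_key(rdd):
    """Collect every value for each key: (K, V) pairs -> {K: [V, ...]}."""
    groups = defaultdict(list)
    for part in rdd:
        for k, v in part:
            groups[k].append(v)
    return dict(groups)


def reduce_by_key(rdd, func):
    """Merge the values for each key with func, which must be (V, V) => V.

    Values are combined inside each partition first and across partitions
    afterwards, so at most one value per key leaves each partition.
    """
    partials = []
    for part in rdd:
        local = {}
        for k, v in part:
            local[k] = func(local[k], v) if k in local else v
        partials.append(local)

    merged = {}
    for local in partials:
        for k, v in local.items():
            merged[k] = func(merged[k], v) if k in merged else v
    return merged


def coalesce(rdd, num_partitions):
    """Lower the partition count by concatenating neighbouring partitions."""
    num_partitions = min(num_partitions, len(rdd))  # this model never grows
    out = [[] for _ in range(num_partitions)]
    for i, part in enumerate(rdd):
        out[i * num_partitions // len(rdd)].extend(part)
    return out


# Three partitions of (key, value) pairs.
pairs = [[("a", 1), ("b", 2)], [("a", 3)], [("b", 4), ("a", 5)]]

grouped = group_by_key(pairs)
summed = reduce_by_key(pairs, lambda x, y: x + y)
sizes = map_partitions(pairs, lambda it: [sum(1 for _ in it)])
words = flat_map([["a b", ""], ["c"]], lambda line: line.split())
merged = coalesce(pairs, 2)
```

Summing each list in grouped gives the same totals as summed, but reduce_by_key never materializes the full list of values per key. That is the intuition behind the note on groupByKey: when the goal is an aggregation, combining within each partition first leaves far less data to move between partitions.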