1) union(otherRDD)
RDD --> UnionRDD
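A minimal plain-Python sketch of the idea, not Spark code: it assumes an "RDD" can be modeled as a list of partitions, and the helper name union is hypothetical. It shows that the union simply carries the partitions of both parents over, with no shuffle.

```python
# Plain-Python model: an "RDD" is a list of partitions, each a list of records.
def union(rdd_a, rdd_b):
    # Partitions are carried over unchanged; no record moves between them.
    return rdd_a + rdd_b

a = [[1, 2], [3]]   # 2 partitions
b = [[3, 4]]        # 1 partition
u = union(a, b)
print(len(u), u)
```

In this model the result has as many partitions as both inputs combined, and duplicates (the two 3s) are kept.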
2) groupByKey(numPartitions)
RDD --> ShuffledRDD --> MapPartitionsRDD
groupByKey() only needs to bring records with the same key together, which a single simple shuffle accomplishes.
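The shuffle-then-group flow can be sketched in plain Python. This is a model under stated assumptions, not Spark's implementation: partitions are lists, hash partitioning is assumed, and the helper names shuffle and group_by_key are hypothetical.

```python
from collections import defaultdict

def shuffle(partitions, num_partitions):
    # Send every (key, value) record to the partition chosen by its key's hash.
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            out[hash(key) % num_partitions].append((key, value))
    return out

def group_by_key(partitions, num_partitions):
    result = []
    for part in shuffle(partitions, num_partitions):
        # After the shuffle, all records of one key sit in the same partition.
        groups = defaultdict(list)
        for key, value in part:
            groups[key].append(value)
        result.append(dict(groups))
    return result

rdd = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
grouped = group_by_key(rdd, 2)
print(grouped)
```

Integer keys are used here so that the partition assignment is the same on every run.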
3) reduceByKey(func, numPartitions)
reduceByKey() is equivalent to traditional MapReduce: records are combined on the map side, shuffled, and then reduced.
RDD --> MapPartitionsRDD --> ShuffledRDD --> MapPartitionsRDD
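A plain-Python sketch of the three stages in this chain, assuming the first stage is a map-side combine and the last a reduce-side merge. The helper names combine and reduce_by_key are hypothetical, and hash partitioning is an assumption of the model.

```python
def combine(records, func):
    # Merge the values of each key with func, keeping one value per key.
    acc = {}
    for key, value in records:
        acc[key] = func(acc[key], value) if key in acc else value
    return list(acc.items())

def reduce_by_key(partitions, func, num_partitions):
    # Stage 1: combine inside each input partition (before the shuffle).
    combined = [combine(part, func) for part in partitions]
    # Stage 2: shuffle the partial results by key.
    shuffled = [[] for _ in range(num_partitions)]
    for part in combined:
        for key, value in part:
            shuffled[hash(key) % num_partitions].append((key, value))
    # Stage 3: merge the partial results of each key after the shuffle.
    return [dict(combine(part, func)) for part in shuffled]

rdd = [[(1, 1), (1, 2), (2, 5)], [(1, 10), (2, 1)]]
totals = reduce_by_key(rdd, lambda x, y: x + y, 2)
print(totals)
```

Combining before the shuffle means only one record per key per input partition has to be moved, which is the main difference from the groupByKey() flow.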
4) distinct(numPartitions)
RDD --> MappedRDD --> MapPartitionsRDD --> ShuffledRDD --> MapPartitionsRDD
distinct() removes duplicate records from an RDD.
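The chain above can be read as "turn each record into a key, then reduce by key". The following plain-Python sketch models that reading; it is not Spark code, the helper name distinct is hypothetical, and hash partitioning is assumed.

```python
def distinct(partitions, num_partitions):
    # Step 1: map each record x to the pair (x, None) so x becomes a key.
    paired = [[(x, None) for x in part] for part in partitions]
    # Step 2: keep one pair per key in each partition, then shuffle by key.
    shuffled = [dict() for _ in range(num_partitions)]
    for part in paired:
        for key, value in dict(part).items():
            shuffled[hash(key) % num_partitions][key] = value
    # Step 3: drop the placeholder values and keep only the keys.
    return [list(part.keys()) for part in shuffled]

rdd = [[1, 2, 2, 3], [3, 4, 1]]
unique = distinct(rdd, 2)
print(unique)
```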
5) cogroup(otherRDD, numPartitions)
RDD --> CoGroupedRDD --> MapPartitionsRDD
Unlike groupByKey(), cogroup() operates on two or more RDDs and groups the values of each key from all of them.
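A plain-Python sketch of cogrouping two inputs, assuming hash partitioning and a list-of-partitions model; the helper name cogroup is hypothetical. Each key maps to a pair of lists, one list per input RDD.

```python
def cogroup(rdd_a, rdd_b, num_partitions):
    out = [dict() for _ in range(num_partitions)]
    for index, rdd in enumerate((rdd_a, rdd_b)):
        for part in rdd:
            for key, value in part:
                slot = out[hash(key) % num_partitions]
                # index 0 collects values from rdd_a, index 1 from rdd_b.
                slot.setdefault(key, ([], []))[index].append(value)
    return out

a = [[(1, "a"), (2, "b")], [(1, "c")]]
b = [[(1, "x")], [(3, "y")]]
grouped = cogroup(a, b, 2)
print(grouped)
```

A key that appears in only one input still shows up in the result, with an empty list on the other side.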
6) intersection(otherRDD)
RDD --> MappedRDD --> CoGroupedRDD --> MappedValuesRDD --> FilteredRDD --> MappedRDD
intersection() extracts the elements common to RDD a and RDD b.
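The chain suggests a map, a cogroup, a filter, and a final map. A plain-Python sketch of those steps follows; it collapses everything into a single output partition for brevity, and the helper name intersection is hypothetical.

```python
def intersection(rdd_a, rdd_b):
    # Map each record x to (x, None) and cogroup: key -> (hits in a, hits in b).
    groups = {}
    for index, rdd in enumerate((rdd_a, rdd_b)):
        for part in rdd:
            for x in part:
                groups.setdefault(x, ([], []))[index].append(None)
    # Filter the keys present on both sides, then map back to the bare key.
    return [key for key, (left, right) in groups.items() if left and right]

a = [[1, 2, 2], [3]]
b = [[2, 3], [4]]
common = intersection(a, b)
print(common)
```

Because the records become keys of the cogroup, the result contains each common element once, even if it was duplicated in an input.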
7) join(otherRDD, numPartitions)
RDD --> CoGroupedRDD --> MappedValuesRDD --> FlatMappedValuesRDD
join() combines two RDD[(K, V)]s in the same way as a join in SQL.
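A plain-Python sketch of an inner join built on a cogroup, matching the chain above: cogroup by key, then emit the cross product of the two value lists per key. The helper name join is hypothetical and a single output partition is assumed for brevity.

```python
from itertools import product

def join(rdd_a, rdd_b):
    # Cogroup: key -> (values from rdd_a, values from rdd_b).
    groups = {}
    for index, rdd in enumerate((rdd_a, rdd_b)):
        for part in rdd:
            for key, value in part:
                groups.setdefault(key, ([], []))[index].append(value)
    # Flat-map the values: every left value is paired with every right value.
    return [(key, pair)
            for key, (left, right) in groups.items()
            for pair in product(left, right)]

a = [[(1, "a"), (2, "b")], [(1, "c")]]
b = [[(1, "x"), (3, "y")]]
joined = join(a, b)
print(joined)
```

Keys 2 and 3 appear on only one side, so their cross product is empty and they drop out, as in an inner join.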
8) sortByKey(ascending, numPartitions)
RDD --> ShuffledRDD --> MapPartitionsRDD
sortByKey() sorts the records in an RDD[(K, V)] by key. ascending = true sorts in ascending order, and false sorts in descending order.
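A plain-Python sketch of a shuffle-then-sort flow, assuming range partitioning: each output partition receives a contiguous key range and is then sorted locally. Picking the boundaries from the full sorted key set is a simplification made for this model, and the helper name sort_by_key is hypothetical.

```python
from bisect import bisect_left

def sort_by_key(partitions, ascending=True, num_partitions=2):
    records = [r for part in partitions for r in part]
    keys = sorted({key for key, _ in records})
    # Choose up to num_partitions - 1 boundary keys that split the key range.
    step = max(1, len(keys) // num_partitions)
    bounds = keys[step::step][:num_partitions - 1]
    # Shuffle: each record goes to the range its key falls into.
    ranges = [[] for _ in range(num_partitions)]
    for key, value in records:
        ranges[bisect_left(bounds, key)].append((key, value))
    # Sort inside each partition; partition order gives the global order.
    ordered = [sorted(p, key=lambda r: r[0], reverse=not ascending)
               for p in ranges]
    return ordered if ascending else ordered[::-1]

rdd = [[(3, "c"), (1, "a")], [(5, "e"), (2, "b")]]
asc = sort_by_key(rdd, ascending=True)
desc = sort_by_key(rdd, ascending=False)
print(asc)
print(desc)
```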
9) cartesian(otherRDD)
RDD --> CartesianRDD
cartesian() computes the Cartesian product of two RDDs. The number of partitions in the resulting CartesianRDD is partitionNum(RDD a) * partitionNum(RDD b).
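The partition count can be checked with a small plain-Python model in which each output partition pairs one partition of a with one partition of b. This is a sketch of the idea, not Spark code, and the helper name cartesian is hypothetical.

```python
from itertools import product

def cartesian(rdd_a, rdd_b):
    # One output partition per (partition of a, partition of b) combination.
    return [list(product(pa, pb)) for pa in rdd_a for pb in rdd_b]

a = [[1, 2], [3]]            # 2 partitions, 3 records
b = [["x"], ["y"], ["z"]]    # 3 partitions, 3 records
c = cartesian(a, b)
print(len(c))                       # number of partitions
print(sum(len(part) for part in c)) # number of records
```

With 2 and 3 input partitions the model yields 6 output partitions holding 9 pairs in total.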
RDD transformation operations: the RDD transformation process