Spark Pair RDD Operations
1. Create a pair RDD
val pairs = lines.map(x => (x.split(" ")(0), x))
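The Scala line above keys each input line by its first word. As a hedged illustration of the same keying logic in plain Python (no Spark involved; `lines` is just an in-memory list here, not an RDD):

```python
# Plain-Python sketch of the keying step: pair each line with its first word.
# A real Spark job would apply this function via lines.map(...) instead.
lines = ["pandas are cute", "spark is fast", "pandas eat bamboo"]
pairs = [(line.split(" ")[0], line) for line in lines]
print(pairs)
# [('pandas', 'pandas are cute'), ('spark', 'spark is fast'), ('pandas', 'pandas eat bamboo')]
```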
2. Transformations on a pair RDD
Table 1 Transformations on one pair RDD (example RDD: {(1, 2), (3, 4), (3, 6)})

| Function | Purpose | Example | Result |
| --- | --- | --- | --- |
| reduceByKey(func) | Merge values with the same key | rdd.reduceByKey((x, y) => x + y) | {(1, 2), (3, 10)} |
| groupByKey() | Group values with the same key | rdd.groupByKey() | {(1, [2]), (3, [4, 6])} |
| mapValues(func) | Apply a function to each value without changing the key | rdd.mapValues(x => x + 1) | {(1, 3), (3, 5), (3, 7)} |
| keys() | Return an RDD containing only the keys | rdd.keys | {1, 3, 3} |
| values() | Return an RDD containing only the values | rdd.values | {2, 4, 6} |
| sortByKey() | Return an RDD sorted by key | rdd.sortByKey() | {(1, 2), (3, 4), (3, 6)} |
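The semantics in Table 1 can be checked without a cluster. Below is a hedged plain-Python sketch (not Spark's implementation) of reduceByKey and groupByKey over the example pairs {(1, 2), (3, 4), (3, 6)}:

```python
from collections import defaultdict

rdd = [(1, 2), (3, 4), (3, 6)]

def reduce_by_key(pairs, func):
    # Fold each value into a per-key accumulator, like rdd.reduceByKey(func).
    acc = {}
    for k, v in pairs:
        acc[k] = func(acc[k], v) if k in acc else v
    return sorted(acc.items())

def group_by_key(pairs):
    # Collect all values per key, like rdd.groupByKey().
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

print(reduce_by_key(rdd, lambda x, y: x + y))  # [(1, 2), (3, 10)]
print(group_by_key(rdd))                       # [(1, [2]), (3, [4, 6])]
```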
Table 2 Transformations on two pair RDDs (rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)})

| Function | Purpose | Example | Result |
| --- | --- | --- | --- |
| subtractByKey | Remove elements whose key also appears in the other RDD | rdd.subtractByKey(other) | {(1, 2)} |
| join | Inner join between the two RDDs | rdd.join(other) | {(3, (4, 9)), (3, (6, 9))} |
| leftOuterJoin | Left outer join; every key of the first RDD is kept | rdd.leftOuterJoin(other) | {(1, (2, None)), (3, (4, Some(9))), (3, (6, Some(9)))} |
| rightOuterJoin | Right outer join; every key of the second RDD is kept | rdd.rightOuterJoin(other) | {(3, (Some(4), 9)), (3, (Some(6), 9))} |
| cogroup | Group data sharing the same key from both RDDs | rdd.cogroup(other) | {(1, ([2], [])), (3, ([4, 6], [9]))} |
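The two-RDD operations above can also be sketched locally. Here is a hedged plain-Python illustration (not Spark) of cogroup and of an inner join built on top of it, reproducing the Table 2 results:

```python
from collections import defaultdict

rdd = [(1, 2), (3, 4), (3, 6)]
other = [(3, 9)]

def cogroup(a, b):
    # For every key, collect the values from each dataset side by side,
    # like rdd.cogroup(other).
    keys = {k for k, _ in a} | {k for k, _ in b}
    left, right = defaultdict(list), defaultdict(list)
    for k, v in a:
        left[k].append(v)
    for k, v in b:
        right[k].append(v)
    return {k: (left[k], right[k]) for k in keys}

def inner_join(a, b):
    # Emit one pair per matching (left value, right value) combination,
    # like rdd.join(other); keys present on only one side are dropped.
    grouped = cogroup(a, b)
    return sorted((k, (lv, rv)) for k, (ls, rs) in grouped.items()
                  for lv in ls for rv in rs)

print(cogroup(rdd, other))    # {1: ([2], []), 3: ([4, 6], [9])}
print(inner_join(rdd, other)) # [(3, (4, 9)), (3, (6, 9))]
```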
2.1 Aggregation Operations
Data flow for computing the per-key mean with combineByKey():
partition 1

| key | value |
| --- | --- |
| coffee | 1 |
| coffee | 2 |
| panda | 3 |

partition 2

| key | value |
| --- | --- |
| coffee | 9 |
Processing partition 1:
(coffee, 1), new key:
accumulators[coffee] = createCombiner(1)
(coffee, 2), existing key:
accumulators[coffee] = mergeValue(accumulators[coffee], 2)
(panda, 3), new key:
accumulators[panda] = createCombiner(3)
Processing partition 2:
(coffee, 9), new key:
accumulators[coffee] = createCombiner(9)
Merging partitions:
mergeCombiners(partition1.accumulators[coffee], partition2.accumulators[coffee])
The functions used above are as follows:

def createCombiner(value): return (value, 1)
def mergeValue(accumulator, value): return (accumulator[0] + value, accumulator[1] + 1)
def mergeCombiners(accumulator1, accumulator2): return (accumulator1[0] + accumulator2[0], accumulator1[1] + accumulator2[1])
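Putting the pieces together, here is a hedged plain-Python simulation (not PySpark itself) of how combineByKey would run the three functions above over the two partitions and produce the per-key means:

```python
def create_combiner(value):
    # Start a (sum, count) accumulator for a key's first value in a partition.
    return (value, 1)

def merge_value(acc, value):
    # Fold another value from the same partition into the accumulator.
    return (acc[0] + value, acc[1] + 1)

def merge_combiners(acc1, acc2):
    # Combine accumulators for the same key from different partitions.
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])

def combine_partition(partition):
    # Per-partition pass: new keys go through create_combiner,
    # existing keys through merge_value.
    accumulators = {}
    for key, value in partition:
        if key in accumulators:
            accumulators[key] = merge_value(accumulators[key], value)
        else:
            accumulators[key] = create_combiner(value)
    return accumulators

partition1 = [("coffee", 1), ("coffee", 2), ("panda", 3)]
partition2 = [("coffee", 9)]

# Cross-partition merge, as in the "Merging partitions" step above.
merged = combine_partition(partition1)
for key, acc in combine_partition(partition2).items():
    merged[key] = merge_combiners(merged[key], acc) if key in merged else acc

means = {key: total / count for key, (total, count) in merged.items()}
print(means)  # {'coffee': 4.0, 'panda': 3.0}
```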
2.2 Data Grouping
groupByKey() groups the data in an RDD by key. For an RDD with keys of type K and values of type V, groupByKey() returns an RDD of type [K, Iterable[V]].
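As a hedged plain-Python sketch (not Spark) of this grouping, together with a check of the equivalence between reduceByKey and groupByKey followed by a per-key reduce:

```python
from collections import defaultdict
from functools import reduce

rdd = [(1, 2), (3, 4), (3, 6)]
func = lambda x, y: x + y

# groupByKey analogue: (K, V) pairs -> (K, list of V).
groups = defaultdict(list)
for k, v in rdd:
    groups[k].append(v)

# mapValues(values => values.reduce(func)) over the groups...
via_group = {k: reduce(func, vs) for k, vs in groups.items()}

# ...matches a direct reduceByKey analogue.
via_reduce = {}
for k, v in rdd:
    via_reduce[k] = func(via_reduce[k], v) if k in via_reduce else v

print(dict(groups))             # {1: [2], 3: [4, 6]}
print(via_group == via_reduce)  # True
```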
Attention:
rdd.reduceByKey(func) is equivalent to rdd.groupByKey().mapValues(values => values.reduce(func)), but the former is more efficient because reduceByKey combines values locally within each partition before shuffling.

3. Actions on a pair RDD
Table 3 Actions on a pair RDD (example RDD: {(1, 2), (3, 4), (3, 6)})

| Function | Description | Example | Result |
| --- | --- | --- | --- |
| countByKey() | Count the elements for each key | rdd.countByKey() | {(1, 1), (3, 2)} |
| collectAsMap() | Collect the result as a map | rdd.collectAsMap() | Map{(1, 2), (3, 6)} |
| lookup(key) | Return all values associated with the given key | rdd.lookup(3) | [4, 6] |
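A hedged plain-Python sketch of the three actions (not Spark; note in particular how collecting duplicate keys into a map keeps only one value per key, which is why collectAsMap drops (3, 4)):

```python
from collections import Counter

rdd = [(1, 2), (3, 4), (3, 6)]

# countByKey analogue: number of pairs per key.
count_by_key = Counter(k for k, _ in rdd)

# collectAsMap analogue: later pairs overwrite earlier ones on duplicate keys.
as_map = dict(rdd)

# lookup(3) analogue: all values for one key.
lookup_3 = [v for k, v in rdd if k == 3]

print(dict(count_by_key))  # {1: 1, 3: 2}
print(as_map)              # {1: 2, 3: 6}
print(lookup_3)            # [4, 6]
```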
3.1 Data Partitioning
A Spark program can reduce network traffic by controlling how its data is partitioned. Partitioning does not help in every scenario: if an RDD is scanned only once, there is no point in partitioning it in advance. Partitioning pays off only when a dataset is reused multiple times in key-based operations such as joins.
Suppose we have a large, rarely changing userData table and a small events dataset produced every five minutes; every five minutes, after the new events data arrives, userData must be joined with events.
Diagram of data flow when partitionBy() is not used on userData:
Diagram of data flow when partitionBy() is used on userData:

3.2 Determining how an RDD is partitioned

3.3 Operations that benefit from partitioning

3.4 Operations that affect partitioning

3.5 Custom partitioners
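To make the partitioning idea concrete, here is a hedged plain-Python sketch of hash partitioning, the scheme behind Spark's default HashPartitioner (each key is routed to partition hash(key) mod numPartitions; this is an illustration, not Spark code):

```python
def hash_partition(pairs, num_partitions):
    # Route each (key, value) pair to partition hash(key) % num_partitions,
    # mirroring the idea behind Spark's HashPartitioner.
    partitions = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        partitions[hash(k) % num_partitions].append((k, v))
    return partitions

# Small integer keys hash to themselves in CPython, so routing is predictable.
user_data = [(1, "user1"), (2, "user2"), (3, "user3")]
events = [(3, "click")]

user_parts = hash_partition(user_data, 2)
event_parts = hash_partition(events, 2)

# Because both datasets use the same partitioner, matching keys land in the
# same partition index, so a join needs no cross-partition shuffle of userData.
print(user_parts)   # [[(2, 'user2')], [(1, 'user1'), (3, 'user3')]]
print(event_parts)  # [[], [(3, 'click')]]
```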