Spark Pair RDD Operations

1. Create a Pair RDD
val pairs = lines.map(x => (x.split(" ")(0), x))
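For context, here is a minimal spark-shell sketch of the same idea; the input path is an assumption for illustration only:

```scala
// spark-shell style: `sc` is the SparkContext that spark-shell provides.
// "lines.txt" is a hypothetical input path.
val lines = sc.textFile("lines.txt")

// Key each line by its first word, keeping the whole line as the value.
val pairs = lines.map(x => (x.split(" ")(0), x))

pairs.take(5).foreach(println)
```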
2. Transformations on Pair RDDs

Table 1 Transformations on one pair RDD (example RDD: {(1, 2), (3, 4), (3, 6)})

| Function | Purpose | Example | Result |
| -- | -- | -- | -- |
| reduceByKey(func) | Combine values with the same key | rdd.reduceByKey((x, y) => x + y) | {(1, 2), (3, 10)} |
| groupByKey() | Group values with the same key | rdd.groupByKey() | {(1, [2]), (3, [4, 6])} |
| mapValues(func) | Apply a function to each value without changing the key | rdd.mapValues(x => x + 1) | {(1, 3), (3, 5), (3, 7)} |
| keys() | Return an RDD of just the keys | rdd.keys | {1, 3, 3} |
| values() | Return an RDD of just the values | rdd.values | {2, 4, 6} |
| sortByKey() | Return an RDD sorted by key | rdd.sortByKey() | {(1, 2), (3, 4), (3, 6)} |
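The rows of Table 1 can be reproduced with a short spark-shell sketch (output ordering may vary between runs):

```scala
// Example pair RDD from Table 1; `sc` is the spark-shell SparkContext.
val rdd = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))

rdd.reduceByKey((x, y) => x + y).collect()      // Array((1,2), (3,10))
rdd.groupByKey().mapValues(_.toList).collect()  // Array((1,List(2)), (3,List(4, 6)))
rdd.mapValues(x => x + 1).collect()             // Array((1,3), (3,5), (3,7))
rdd.keys.collect()                              // Array(1, 3, 3)
rdd.values.collect()                            // Array(2, 4, 6)
rdd.sortByKey().collect()                       // Array((1,2), (3,4), (3,6))
```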

Table 2 Transformations on two pair RDDs (rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)})

| Function | Purpose | Example | Result |
| -- | -- | -- | -- |
| subtractByKey | Remove elements whose key is present in the other RDD | rdd.subtractByKey(other) | {(1, 2)} |
| join | Inner join of the two RDDs | rdd.join(other) | {(3, (4, 9)), (3, (6, 9))} |
| leftOuterJoin | Outer join keeping every key of the first RDD | rdd.leftOuterJoin(other) | {(1, (2, None)), (3, (4, Some(9))), (3, (6, Some(9)))} |
| rightOuterJoin | Outer join keeping every key of the second RDD | rdd.rightOuterJoin(other) | {(3, (Some(4), 9)), (3, (Some(6), 9))} |
| cogroup | Group the data sharing the same key from both RDDs | rdd.cogroup(other) | {(1, ([2], [])), (3, ([4, 6], [9]))} |
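Likewise, a sketch of the two-RDD operations from Table 2 (in practice cogroup prints its groups as CompactBuffers, and ordering may vary):

```scala
val rdd   = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))
val other = sc.parallelize(Seq((3, 9)))

rdd.subtractByKey(other).collect()  // Array((1,2))
rdd.join(other).collect()           // Array((3,(4,9)), (3,(6,9)))
rdd.leftOuterJoin(other).collect()  // Array((1,(2,None)), (3,(4,Some(9))), (3,(6,Some(9))))
rdd.rightOuterJoin(other).collect() // Array((3,(Some(4),9)), (3,(Some(6),9)))
rdd.cogroup(other).collect()        // keys 1 and 3 with their grouped values, as in Table 2
```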
2.1 Aggregation Operations

Data flow when using combineByKey() to compute the mean value for each key:

Partition 1

| Key | Value |
| -- | -- |
| coffee | 1 |
| coffee | 2 |
| panda | 3 |

Partition 2

| Key | Value |
| -- | -- |
| coffee | 9 |

Processing partition 1:

(coffee, 1) - new key:
accumulators[coffee] = createCombiner(1)

(coffee, 2) - existing key:
accumulators[coffee] = mergeValue(accumulators[coffee], 2)

(panda, 3) - new key:
accumulators[panda] = createCombiner(3)

Processing partition 2:

(coffee, 9) - new key:
accumulators[coffee] = createCombiner(9)

Merging the partitions:

mergeCombiners(partition1.accumulators[coffee], partition2.accumulators[coffee])

The functions used above are as follows (pseudocode):

def createCombiner(value): (value, 1)

def mergeValue(accumulator, value): (accumulator[0] + value, accumulator[1] + 1)

def mergeCombiners(accumulator1, accumulator2): (accumulator1[0] + accumulator2[0], accumulator1[1] + accumulator2[1])
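The same logic as runnable Scala, a minimal sketch that computes the per-key mean with combineByKey() over the coffee/panda data above:

```scala
// Two partitions matching the walkthrough above; `sc` is the spark-shell SparkContext.
val nums = sc.parallelize(Seq(("coffee", 1), ("coffee", 2), ("panda", 3), ("coffee", 9)), 2)

// The accumulator is a (runningSum, count) pair.
val sumCount = nums.combineByKey(
  (v: Int) => (v, 1),                                          // createCombiner
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners
)

val means = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
means.collect()  // Array((coffee,4.0), (panda,3.0)); ordering may vary
```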
2.2 Data Grouping

groupByKey() groups the data in an RDD by key. For an RDD with keys of type K and values of type V, groupByKey() returns an RDD of type [K, Iterable[V]].
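For example (a spark-shell sketch with the same RDD as in the tables):

```scala
val rdd = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))

// groupByKey() yields an RDD[(Int, Iterable[Int])].
val grouped: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = rdd.groupByKey()
grouped.mapValues(_.toList).collect()  // Array((1,List(2)), (3,List(4, 6)))
```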

Attention:

rdd.reduceByKey(func) is equivalent to rdd.groupByKey().mapValues(values => values.reduce(func)), but the former is more efficient because it combines values for each key locally before shuffling.

3. Actions on Pair RDDs

Table 3 Actions on one pair RDD (example RDD: {(1, 2), (3, 4), (3, 6)})

| Function | Purpose | Example | Result |
| -- | -- | -- | -- |
| countByKey() | Count the number of elements for each key | rdd.countByKey() | {(1, 1), (3, 2)} |
| collectAsMap() | Collect the result as a map for easy lookup | rdd.collectAsMap() | Map{(1, 2), (3, 6)} |
| lookup(key) | Return all values associated with the given key | rdd.lookup(3) | [4, 6] |
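A sketch of the same actions in spark-shell form (note that collectAsMap() keeps only one value per duplicate key):

```scala
val rdd = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))

rdd.countByKey()    // Map(1 -> 1, 3 -> 2)
rdd.collectAsMap()  // Map(1 -> 2, 3 -> 6): one value per key is kept
rdd.lookup(3)       // Seq(4, 6)
```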
3.1 Data Partitioning

A Spark program can reduce network communication overhead by controlling how an RDD is partitioned. Partitioning is not useful in every scenario: if an RDD is scanned only once, there is no point in partitioning it in advance; partitioning pays off only when a dataset is reused multiple times in key-oriented operations such as joins.

Suppose we have a large userData table that rarely changes and a small events dataset that is produced every five minutes; each time a new events batch arrives, userData must be joined with events.
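A hedged sketch of the partitioned approach (the one shown by the second diagram below); the path, record format, and partition count are illustrative assumptions:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Load the large, mostly static table as (userId, userInfo) pairs.
// The path and tab-separated format are placeholders.
val userData: RDD[(String, String)] =
  sc.textFile("hdfs:///path/to/userData")
    .map(line => (line.split("\t")(0), line))

// Hash-partition once and cache, so the big table is shuffled a single time
// instead of on every five-minute join.
val partitioned = userData.partitionBy(new HashPartitioner(100)).persist()

// Called for each new batch of events (an RDD of (userId, event) pairs).
def processNewEvents(events: RDD[(String, String)]): Unit = {
  // Spark reuses userData's partitioning; only the small events RDD is shuffled.
  val joined = partitioned.join(events)
  println("joined records: " + joined.count())
}
```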

Diagram of the data flow when partitionBy() is not used on userData:

Diagram of the data flow when partitionBy() is used on userData:

3.2 Determining How an RDD Is Partitioned

3.3 Operations That Benefit from Partitioning

3.4 Operations That Affect Partitioning

3.5 Custom Partitioners
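For subsection 3.5, a minimal custom-partitioner sketch; the class name and the domain-name routing logic are illustrative assumptions (string keys that are URLs):

```scala
import org.apache.spark.Partitioner

// Illustrative custom partitioner: routes URL keys by the hash of their host name,
// so all pages from one domain land in the same partition.
class DomainNamePartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    val domain = new java.net.URL(key.toString).getHost
    val code = domain.hashCode % numPartitions
    if (code < 0) code + numPartitions else code  // keep the index non-negative
  }

  // Lets Spark detect that two RDDs were partitioned the same way.
  override def equals(other: Any): Boolean = other match {
    case p: DomainNamePartitioner => p.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}

// Usage: pairs.partitionBy(new DomainNamePartitioner(20))
```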
