Spark Pair RDD Operations
1. Create a pair RDD
val pairs = lines.map(x => (x.split(" ")(0), x))
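The Scala line above keys each input line by its first word. As a hedged illustration of the same keying logic in plain Python (no Spark involved; `lines` is just an in-memory list here, not an RDD):

```python
# Plain-Python sketch of the keying step: pair each line with its first word.
# A real Spark job would apply this function via lines.map(...) instead.
lines = ["pandas are cute", "spark is fast", "pandas eat bamboo"]
pairs = [(line.split(" ")[0], line) for line in lines]
print(pairs)
# [('pandas', 'pandas are cute'), ('spark', 'spark is fast'), ('pandas', 'pandas eat bamboo')]
```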
2. Transformations on a pair RDD
Table 1 Transformations on one pair RDD (example RDD: {(1, 2), (3, 4), (3, 6)})

| Function | Purpose | Example | Result |
| --- | --- | --- | --- |
| reduceByKey(func) | Merge values with the same key | rdd.reduceByKey((x, y) => x + y) | {(1, 2), (3, 10)} |
| groupByKey() | Group values with the same key | rdd.groupByKey() | {(1, [2]), (3, [4, 6])} |
| mapValues(func) | Apply a function to each value without changing the key | rdd.mapValues(x => x + 1) | {(1, 3), (3, 5), (3, 7)} |
| keys() | Return an RDD containing only the keys | rdd.keys | {1, 3, 3} |
| values() | Return an RDD containing only the values | rdd.values | {2, 4, 6} |
| sortByKey() | Return an RDD sorted by key | rdd.sortByKey() | {(1, 2), (3, 4), (3, 6)} |
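The semantics in Table 1 can be checked without a cluster. Below is a hedged plain-Python sketch (not Spark's implementation) of reduceByKey and groupByKey over the example pairs {(1, 2), (3, 4), (3, 6)}:

```python
from collections import defaultdict

rdd = [(1, 2), (3, 4), (3, 6)]

def reduce_by_key(pairs, func):
    # Fold each value into a per-key accumulator, like rdd.reduceByKey(func).
    acc = {}
    for k, v in pairs:
        acc[k] = func(acc[k], v) if k in acc else v
    return sorted(acc.items())

def group_by_key(pairs):
    # Collect all values per key, like rdd.groupByKey().
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

print(reduce_by_key(rdd, lambda x, y: x + y))  # [(1, 2), (3, 10)]
print(group_by_key(rdd))                       # [(1, [2]), (3, [4, 6])]
```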
Table 2 Transformations on two pair RDDs (rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)})

| Function | Purpose | Example | Result |
| --- | --- | --- | --- |
| subtractByKey | Remove elements whose key also appears in the other RDD | rdd.subtractByKey(other) | {(1, 2)} |
| join | Inner join between the two RDDs | rdd.join(other) | {(3, (4, 9)), (3, (6, 9))} |
| leftOuterJoin | Left outer join; every key of the first RDD is kept | rdd.leftOuterJoin(other) | {(1, (2, None)), (3, (4, Some(9))), (3, (6, Some(9)))} |
| rightOuterJoin | Right outer join; every key of the second RDD is kept | rdd.rightOuterJoin(other) | {(3, (Some(4), 9)), (3, (Some(6), 9))} |
| cogroup | Group data sharing the same key from both RDDs | rdd.cogroup(other) | {(1, ([2], [])), (3, ([4, 6], [9]))} |
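The two-RDD operations above can also be sketched locally. Here is a hedged plain-Python illustration (not Spark) of cogroup and of an inner join built on top of it, reproducing the Table 2 results:

```python
from collections import defaultdict

rdd = [(1, 2), (3, 4), (3, 6)]
other = [(3, 9)]

def cogroup(a, b):
    # For every key, collect the values from each dataset side by side,
    # like rdd.cogroup(other).
    keys = {k for k, _ in a} | {k for k, _ in b}
    left, right = defaultdict(list), defaultdict(list)
    for k, v in a:
        left[k].append(v)
    for k, v in b:
        right[k].append(v)
    return {k: (left[k], right[k]) for k in keys}

def inner_join(a, b):
    # Emit one pair per matching (left value, right value) combination,
    # like rdd.join(other); keys present on only one side are dropped.
    grouped = cogroup(a, b)
    return sorted((k, (lv, rv)) for k, (ls, rs) in grouped.items()
                  for lv in ls for rv in rs)

print(cogroup(rdd, other))    # {1: ([2], []), 3: ([4, 6], [9])}
print(inner_join(rdd, other)) # [(3, (4, 9)), (3, (6, 9))]
```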
2.1 Aggregation Operations
Data flow for computing the per-key mean with combineByKey():
partition 1

| key | value |
| --- | --- |
| coffee | 1 |
| coffee | 2 |
| panda | 3 |

partition 2

| key | value |
| --- | --- |
| coffee | 9 |
Processing partition 1:
(coffee, 1), new key:
accumulators[coffee] = createCombiner(1)
(coffee, 2), existing key:
accumulators[coffee] = mergeValue(accumulators[coffee], 2)
(panda, 3), new key:
accumulators[panda] = createCombiner(3)
Processing partition 2:
(coffee, 9), new key:
accumulators[coffee] = createCombiner(9)
Merging partitions:
mergeCombiners(partition1.accumulators[coffee], partition2.accumulators[coffee])
The functions used above are as follows:

def createCombiner(value): return (value, 1)
def mergeValue(accumulator, value): return (accumulator[0] + value, accumulator[1] + 1)
def mergeCombiners(accumulator1, accumulator2): return (accumulator1[0] + accumulator2[0], accumulator1[1] + accumulator2[1])
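Putting the pieces together, here is a hedged plain-Python simulation (not PySpark itself) of how combineByKey would run the three functions above over the two partitions and produce the per-key means:

```python
def create_combiner(value):
    # Start a (sum, count) accumulator for a key's first value in a partition.
    return (value, 1)

def merge_value(acc, value):
    # Fold another value from the same partition into the accumulator.
    return (acc[0] + value, acc[1] + 1)

def merge_combiners(acc1, acc2):
    # Combine accumulators for the same key from different partitions.
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])

def combine_partition(partition):
    # Per-partition pass: new keys go through create_combiner,
    # existing keys through merge_value.
    accumulators = {}
    for key, value in partition:
        if key in accumulators:
            accumulators[key] = merge_value(accumulators[key], value)
        else:
            accumulators[key] = create_combiner(value)
    return accumulators

partition1 = [("coffee", 1), ("coffee", 2), ("panda", 3)]
partition2 = [("coffee", 9)]

# Cross-partition merge, as in the "Merging partitions" step above.
merged = combine_partition(partition1)
for key, acc in combine_partition(partition2).items():
    merged[key] = merge_combiners(merged[key], acc) if key in merged else acc

means = {key: total / count for key, (total, count) in merged.items()}
print(means)  # {'coffee': 4.0, 'panda': 3.0}
```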
2.2 Data Grouping
groupByKey() groups the data in an RDD by key. For an RDD with keys of type K and values of type V, groupByKey() returns an RDD of type [K, Iterable[V]].
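As a hedged plain-Python sketch (not Spark) of this grouping, together with a check of the equivalence between reduceByKey and groupByKey followed by a per-key reduce:

```python
from collections import defaultdict
from functools import reduce

rdd = [(1, 2), (3, 4), (3, 6)]
func = lambda x, y: x + y

# groupByKey analogue: (K, V) pairs -> (K, list of V).
groups = defaultdict(list)
for k, v in rdd:
    groups[k].append(v)

# mapValues(values => values.reduce(func)) over the groups...
via_group = {k: reduce(func, vs) for k, vs in groups.items()}

# ...matches a direct reduceByKey analogue.
via_reduce = {}
for k, v in rdd:
    via_reduce[k] = func(via_reduce[k], v) if k in via_reduce else v

print(dict(groups))             # {1: [2], 3: [4, 6]}
print(via_group == via_reduce)  # True
```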
Attention:
rdd.reduceByKey(func) is equivalent to rdd.groupByKey().mapValues(values => values.reduce(func)), but the former is more efficient because reduceByKey combines values locally within each partition before shuffling.

3. Actions on a pair RDD
Table 3 Actions on a pair RDD (example RDD: {(1, 2), (3, 4), (3, 6)})

| Function | Description | Example | Result |
| --- | --- | --- | --- |
| countByKey() | Count the elements for each key | rdd.countByKey() | {(1, 1), (3, 2)} |
| collectAsMap() | Collect the result as a map | rdd.collectAsMap() | Map{(1, 2), (3, 6)} |
| lookup(key) | Return all values associated with the given key | rdd.lookup(3) | [4, 6] |
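A hedged plain-Python sketch of the three actions (not Spark; note in particular how collecting duplicate keys into a map keeps only one value per key, which is why collectAsMap drops (3, 4)):

```python
from collections import Counter

rdd = [(1, 2), (3, 4), (3, 6)]

# countByKey analogue: number of pairs per key.
count_by_key = Counter(k for k, _ in rdd)

# collectAsMap analogue: later pairs overwrite earlier ones on duplicate keys.
as_map = dict(rdd)

# lookup(3) analogue: all values for one key.
lookup_3 = [v for k, v in rdd if k == 3]

print(dict(count_by_key))  # {1: 1, 3: 2}
print(as_map)              # {1: 2, 3: 6}
print(lookup_3)            # [4, 6]
```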
3.1 Data Partitioning
A Spark program can reduce network traffic by controlling how its data is partitioned. Partitioning does not help in every scenario: if an RDD is scanned only once, there is no point in partitioning it in advance. Partitioning pays off only when a dataset is reused multiple times in key-based operations such as joins.
Suppose we have a large, rarely changing userData table and a small events dataset produced every five minutes; every five minutes, after the new events data arrives, userData must be joined with events.
Diagram of data flow when partitionBy() is not used on userData:
Diagram of data flow when partitionBy() is used on userData:

3.2 Determining how an RDD is partitioned

3.3 Operations that benefit from partitioning

3.4 Operations that affect partitioning

3.5 Custom partitioners
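To make the partitioning idea concrete, here is a hedged plain-Python sketch of hash partitioning, the scheme behind Spark's default HashPartitioner (each key is routed to partition hash(key) mod numPartitions; this is an illustration, not Spark code):

```python
def hash_partition(pairs, num_partitions):
    # Route each (key, value) pair to partition hash(key) % num_partitions,
    # mirroring the idea behind Spark's HashPartitioner.
    partitions = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        partitions[hash(k) % num_partitions].append((k, v))
    return partitions

# Small integer keys hash to themselves in CPython, so routing is predictable.
user_data = [(1, "user1"), (2, "user2"), (3, "user3")]
events = [(3, "click")]

user_parts = hash_partition(user_data, 2)
event_parts = hash_partition(events, 2)

# Because both datasets use the same partitioner, matching keys land in the
# same partition index, so a join needs no cross-partition shuffle of userData.
print(user_parts)   # [[(2, 'user2')], [(1, 'user1'), (3, 'user3')]]
print(event_parts)  # [[], [(3, 'click')]]
```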