Handling Key/Value Pairs in RDDs

Last Update:2016-09-06 Source: Internet

Author: User


An RDD that holds key/value pairs is called a pair RDD.

1. Creating a Pair RDD:

1.1 How to create a pair RDD:

Many data formats produce a pair RDD directly when the data is loaded. We can also use map() to convert an ordinary RDD into a pair RDD.

1.2 Pair RDD Conversion Example:

In the following example, the original RDD is converted into a pair RDD in which the first word of each line is the key and the whole line is the value.

Java has no built-in tuple type, so Spark's Java API uses Scala's scala.Tuple2 class to represent a tuple. Create a tuple with new Tuple2(elem1, elem2); access its elements with the ._1() and ._2() methods.

Also, Python and Scala do this conversion with the basic map() function, whereas Java needs the mapToPair() function:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

/**
 * Converts a basic RDD into a pair RDD.
 * Business logic: the first word of each line is the key, and the entire line is the value.
 *
 * @param input JavaRDD<String> of text lines
 * @return JavaPairRDD<String, String> of (first word, whole line) pairs
 */
public JavaPairRDD<String, String> firstWordKeyRdd(JavaRDD<String> input) {
    JavaPairRDD<String, String> pair_rdd = input.mapToPair(
            new PairFunction<String, String, String>() {
                @Override
                public Tuple2<String, String> call(String line) throws Exception {
                    // Key: the first space-delimited word; value: the whole line.
                    return new Tuple2<String, String>(line.split(" ")[0], line);
                }
            });
    return pair_rdd;
}
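The key-extraction step inside call() can be tried on its own, without Spark. The following is a minimal sketch in plain Java; the class name FirstWordKey and its helper methods are hypothetical, and java.util.Map.Entry merely stands in for Tuple2. Unlike the example above, it trims the line first, on the assumption that leading whitespace should not produce an empty key.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical helper, not part of Spark: models the "first word as key" logic.
final class FirstWordKey {

    // Returns the first whitespace-delimited word of a line, or "" for a blank line.
    static String firstWord(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        return trimmed.split("\\s+")[0];
    }

    // Pairs the first word (key) with the whole, unmodified line (value).
    static Map.Entry<String, String> toPair(String line) {
        return new SimpleImmutableEntry<>(firstWord(line), line);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> pair = toPair("hello pair rdd");
        System.out.println(pair.getKey() + " -> " + pair.getValue());
    }
}
```

In a Spark job, the body of call() could delegate to a helper like firstWord() and wrap the result in a Tuple2.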

When creating a pair RDD from an in-memory collection, Python and Scala call SparkContext.parallelize() on a collection of tuples, while Java uses JavaSparkContext.parallelizePairs().

2. Pair RDD Transformations:

2.1 Common pair RDD transformations:

The transformations available on a basic RDD can also be used on a pair RDD. Because a pair RDD contains tuples, the functions passed to it must operate on tuples rather than on individual elements.

The following table lists the transformations commonly used on a pair RDD (example RDD contents: {(1, 2), (3, 4), (3, 6)}):

Function	Purpose	Example	Result
reduceByKey(func)	Combine values with the same key.	rdd.reduceByKey((x, y) => x + y)	{(1,2), (3,10)}
groupByKey()	Group values with the same key.	rdd.groupByKey()	{(1,[2]), (3,[4,6])}
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)	Combine values with the same key using a different result type.
mapValues(func)	Apply a function to each value of a pair RDD without changing the key.	rdd.mapValues(x => x + 1)	{(1,3), (3,5), (3,7)}
flatMapValues(func)	Apply a function that returns an iterator to each value of a pair RDD, and for each element returned, produce a key/value entry with the old key. Often used for tokenization.	rdd.flatMapValues(x => (x to 5))	{(1,2), (1,3), (1,4), (1,5), (3,4), (3,5)}
keys()	Return an RDD of just the keys.	rdd.keys()	{1, 3, 3}
values()	Return an RDD of just the values.	rdd.values()	{2, 4, 6}
sortByKey()	Return an RDD sorted by the key.	rdd.sortByKey()	{(1,2), (3,4), (3,6)}
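To make the per-key semantics in the table concrete, here is a minimal sketch that models reduceByKey, groupByKey, and mapValues with plain Java collections on the same sample data. It is not Spark code: the class PairOpsModel and its methods are hypothetical stand-ins, each pair is stored as an int[] {key, value}, and a TreeMap is used only so the printed order is stable (Spark itself does not guarantee output order for these operations).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.IntUnaryOperator;

// Hypothetical model of pair RDD semantics; none of these methods are Spark APIs.
final class PairOpsModel {

    // Sample data from the table: {(1, 2), (3, 4), (3, 6)}, each pair as {key, value}.
    static final List<int[]> RDD = Arrays.asList(
            new int[] {1, 2}, new int[] {3, 4}, new int[] {3, 6});

    // Models reduceByKey(func): fold all values that share a key into one value.
    static Map<Integer, Integer> reduceByKey(List<int[]> pairs, BinaryOperator<Integer> func) {
        Map<Integer, Integer> out = new TreeMap<>();
        for (int[] p : pairs) {
            out.merge(p[0], p[1], func);
        }
        return out;
    }

    // Models groupByKey(): collect all values that share a key into a list.
    static Map<Integer, List<Integer>> groupByKey(List<int[]> pairs) {
        Map<Integer, List<Integer>> out = new TreeMap<>();
        for (int[] p : pairs) {
            out.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
        }
        return out;
    }

    // Models mapValues(func): keys are untouched, one output pair per input pair.
    static List<int[]> mapValues(List<int[]> pairs, IntUnaryOperator func) {
        List<int[]> out = new ArrayList<>();
        for (int[] p : pairs) {
            out.add(new int[] {p[0], func.applyAsInt(p[1])});
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(reduceByKey(RDD, Integer::sum));
        System.out.println(groupByKey(RDD));
        for (int[] p : mapValues(RDD, x -> x + 1)) {
            System.out.println(Arrays.toString(p));
        }
    }
}
```

Note that mapValues keeps duplicate keys as separate pairs, while reduceByKey and groupByKey produce exactly one entry per distinct key.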

The following table lists transformations on two pair RDDs (rdd = {(1, 2), (3, 4), (3, 6)} and other = {(3, 9)}):

Function	Purpose	Example	Result
subtractByKey	Remove elements with a key present in the other RDD.	rdd.subtractByKey(other)	{(1, 2)}
join	Perform an inner join between the two RDDs.	rdd.join(other)	{(3, (4, 9)), (3, (6, 9))}
rightOuterJoin	Perform a join between the two RDDs where the key must be present in the other RDD.	rdd.rightOuterJoin(other)	{(3, (Some(4), 9)), (3, (Some(6), 9))}
leftOuterJoin	Perform a join between the two RDDs where the key must be present in the first RDD.	rdd.leftOuterJoin(other)	{(1, (2, None)), (3, (4, Some(9))), (3, (6, Some(9)))}
cogroup	Group data from both RDDs sharing the same key.	rdd.cogroup(other)	{(1, ([2], [])), (3, ([4, 6], [9]))}
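The difference between the join variants is easiest to see by running them on the sample data. The sketch below models them with plain Java lists rather than Spark; the class PairJoinModel and its methods are hypothetical, pairs are stored as int[] {key, value}, and results are rendered as strings in the table's Some/None notation purely for readability. Spark does not guarantee the order of join output, so the ordering here is an artifact of the model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of two-RDD operations; none of these methods are Spark APIs.
final class PairJoinModel {

    static final List<int[]> RDD = Arrays.asList(
            new int[] {1, 2}, new int[] {3, 4}, new int[] {3, 6});
    static final List<int[]> OTHER = Collections.singletonList(new int[] {3, 9});

    // Models join: one output per matching (left, right) combination.
    static List<String> join(List<int[]> left, List<int[]> right) {
        List<String> out = new ArrayList<>();
        for (int[] l : left) {
            for (int[] r : right) {
                if (l[0] == r[0]) {
                    out.add("(" + l[0] + ",(" + l[1] + "," + r[1] + "))");
                }
            }
        }
        return out;
    }

    // Models leftOuterJoin: every left pair survives; a missing right value is None.
    static List<String> leftOuterJoin(List<int[]> left, List<int[]> right) {
        List<String> out = new ArrayList<>();
        for (int[] l : left) {
            boolean matched = false;
            for (int[] r : right) {
                if (l[0] == r[0]) {
                    out.add("(" + l[0] + ",(" + l[1] + ",Some(" + r[1] + ")))");
                    matched = true;
                }
            }
            if (!matched) {
                out.add("(" + l[0] + ",(" + l[1] + ",None))");
            }
        }
        return out;
    }

    // Models rightOuterJoin: every right pair survives; a missing left value is None.
    static List<String> rightOuterJoin(List<int[]> left, List<int[]> right) {
        List<String> out = new ArrayList<>();
        for (int[] r : right) {
            boolean matched = false;
            for (int[] l : left) {
                if (l[0] == r[0]) {
                    out.add("(" + r[0] + ",(Some(" + l[1] + ")," + r[1] + "))");
                    matched = true;
                }
            }
            if (!matched) {
                out.add("(" + r[0] + ",(None," + r[1] + "))");
            }
        }
        return out;
    }

    // Models subtractByKey: keep left pairs whose key never appears on the right.
    static List<String> subtractByKey(List<int[]> left, List<int[]> right) {
        List<String> out = new ArrayList<>();
        for (int[] l : left) {
            boolean present = false;
            for (int[] r : right) {
                if (l[0] == r[0]) {
                    present = true;
                }
            }
            if (!present) {
                out.add("(" + l[0] + "," + l[1] + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(join(RDD, OTHER));
        System.out.println(leftOuterJoin(RDD, OTHER));
        System.out.println(rightOuterJoin(RDD, OTHER));
        System.out.println(subtractByKey(RDD, OTHER));
    }
}
```

Key 1 exists only in rdd, so it appears in the leftOuterJoin and subtractByKey results but not in join or rightOuterJoin.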

2.2 Pair RDD Filter Operation:

A pair RDD is still an RDD, so the operations described previously (such as filter) also apply to it. The following program keeps only the pairs whose value (the line) is longer than 20 characters:

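import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

/**
 * Filters a pair RDD, keeping the pairs whose value is longer than 20 characters.
 *
 * @param input JavaPairRDD<String, String> of (key, line) pairs
 * @return JavaPairRDD<String, String> containing only the pairs with long values
 */
public JavaPairRDD<String, String> filterMoreThanTwentyLines(JavaPairRDD<String, String> input) {
    JavaPairRDD<String, String> filter_rdd = input.filter(
            new Function<Tuple2<String, String>, Boolean>() {
                @Override
                public Boolean call(Tuple2<String, String> pair) throws Exception {
                    // _2() is the value of the pair, i.e. the whole line.
                    return pair._2().length() > 20;
                }
            });
    return filter_rdd;
}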

2.3 Aggregation Operations:

This article is from the "Snowflake" blog; please retain this source attribution: http://6216083.blog.51cto.com/6206083/1846757
