Handling Key/Value Pairs in RDDs

Source: Internet
Author: User

An RDD whose elements are key/value pairs is called a pair RDD.


1. Create the pair RDD:

1.1 How to create a pair RDD:

Many data formats produce a pair RDD directly when they are loaded. We can also use map() to convert a regular RDD, as described earlier, into a pair RDD.

1.2 Pair RDD Conversion Example:

The following example converts an RDD of lines into a pair RDD in which the key is the first word of each line and the value is the entire line.

Java has no built-in tuple type, so Spark's Java API uses Scala's scala.Tuple2 class. Create a tuple with new Tuple2(elem1, elem2) and access its elements with the ._1() and ._2() methods.

Python and Scala use the basic map() function for this conversion, whereas Java needs the function mapToPair():

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    /**
     * Converts a basic RDD into a pair RDD. The first word of each line
     * becomes the key and the entire line becomes the value.
     *
     * @param input JavaRDD<String> of text lines
     * @return JavaPairRDD<String, String> of (first word, line) pairs
     */
    public JavaPairRDD<String, String> firstWordKeyRdd(JavaRDD<String> input) {
        JavaPairRDD<String, String> pairRdd = input.mapToPair(
            new PairFunction<String, String, String>() {
                @Override
                public Tuple2<String, String> call(String line) throws Exception {
                    // Split on a single space; the first token is the key.
                    return new Tuple2<String, String>(line.split(" ")[0], line);
                }
            });
        return pairRdd;
    }

When creating a pair RDD from an in-memory collection, Python and Scala call SparkContext.parallelize() on a collection of tuples, while Java uses JavaSparkContext.parallelizePairs().


2. Pair RDD Transformations:

2.1 Common Pair RDD Transformations:

The transformations available on basic RDDs can also be used on pair RDDs. Because the elements of a pair RDD are tuples, the functions passed to these transformations must operate on tuples rather than on individual elements.

The following table lists the transformations commonly used on a pair RDD (example RDD contents: {(1, 2), (3, 4), (3, 6)}):

Function | Role | Example call | Result
reduceByKey(func) | Combine values with the same key. | rdd.reduceByKey((x, y) => x + y) | {(1,2), (3,10)}
groupByKey() | Group values with the same key. | rdd.groupByKey() | {(1,[2]), (3,[4,6])}
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner) | Combine values with the same key using a different result type. | |
mapValues(func) | Apply a function to each value of a pair RDD without changing the key. | rdd.mapValues(x => x + 1) | {(1,3), (3,5), (3,7)}
flatMapValues(func) | Apply a function that returns an iterator to each value of a pair RDD, and for each element returned, produce a key/value entry with the old key. Often used for tokenization. | rdd.flatMapValues(x => (x to 5)) | {(1,2), (1,3), (1,4), (1,5), (3,4), (3,5)}
keys() | Return an RDD of just the keys. | rdd.keys() | {1, 3, 3}
values() | Return an RDD of just the values. | rdd.values() | {2, 4, 6}
sortByKey() | Return an RDD sorted by the key. | rdd.sortByKey() | {(1,2), (3,4), (3,6)}
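
To make the per-key semantics in the table concrete, here is a minimal sketch in plain Java that mimics reduceByKey, groupByKey, and flatMapValues on the sample data. It uses only java.util collections and is not the Spark API; the class name PairSemanticsSketch and its helper methods are hypothetical names chosen for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Plain-Java illustration of per-key semantics; this is NOT the Spark API.
class PairSemanticsSketch {

    // Helper that builds one (key, value) pair.
    static Map.Entry<Integer, Integer> pair(int key, int value) {
        return new SimpleEntry<Integer, Integer>(key, value);
    }

    // The sample data from the table: {(1, 2), (3, 4), (3, 6)}.
    static List<Map.Entry<Integer, Integer>> sample() {
        return Arrays.asList(pair(1, 2), pair(3, 4), pair(3, 6));
    }

    // Like reduceByKey(func): fold all values that share a key into one value.
    static Map<Integer, Integer> reduceByKey(
            List<Map.Entry<Integer, Integer>> pairs, BinaryOperator<Integer> func) {
        Map<Integer, Integer> result = new TreeMap<>();
        for (Map.Entry<Integer, Integer> p : pairs) {
            result.merge(p.getKey(), p.getValue(), func);
        }
        return result;
    }

    // Like groupByKey(): collect all values that share a key into a list.
    static Map<Integer, List<Integer>> groupByKey(
            List<Map.Entry<Integer, Integer>> pairs) {
        Map<Integer, List<Integer>> result = new TreeMap<>();
        for (Map.Entry<Integer, Integer> p : pairs) {
            result.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return result;
    }

    // Like flatMapValues(x => (x to 5)): one (key, v) pair per v in [value, 5].
    static List<Map.Entry<Integer, Integer>> flatMapValuesUpTo5(
            List<Map.Entry<Integer, Integer>> pairs) {
        List<Map.Entry<Integer, Integer>> result = new ArrayList<>();
        for (Map.Entry<Integer, Integer> p : pairs) {
            for (int v = p.getValue(); v <= 5; v++) {
                result.add(pair(p.getKey(), v));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(reduceByKey(sample(), Integer::sum)); // {1=2, 3=10}
        System.out.println(groupByKey(sample()));                // {1=[2], 3=[4, 6]}
        System.out.println(flatMapValuesUpTo5(sample()));        // [1=2, 1=3, 1=4, 1=5, 3=4, 3=5]
    }
}
```

Note that the value 6 produces no output in the flatMapValues case because the range from 6 to 5 is empty, which is why key 3 contributes only two pairs.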

The following table lists transformations on two pair RDDs (rdd = {(1, 2), (3, 4), (3, 6)} and other = {(3, 9)}):

Function | Role | Example call | Result
subtractByKey | Remove elements with a key present in the other RDD. | rdd.subtractByKey(other) | {(1, 2)}
join | Perform an inner join between the two RDDs. | rdd.join(other) | {(3, (4, 9)), (3, (6, 9))}
rightOuterJoin | Perform a join between the two RDDs where the key must be present in the other RDD. | rdd.rightOuterJoin(other) | {(3, (Some(4), 9)), (3, (Some(6), 9))}
leftOuterJoin | Perform a join between the two RDDs where the key must be present in the first RDD. | rdd.leftOuterJoin(other) | {(1, (2, None)), (3, (4, Some(9))), (3, (6, Some(9)))}
cogroup | Group data from both RDDs sharing the same key. | rdd.cogroup(other) | {(1, ([2], [])), (3, ([4, 6], [9]))}
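
The difference between join, leftOuterJoin, and subtractByKey can be sketched in plain Java as well. The following is an illustration of the semantics only, not the Spark API: the class TwoPairSetsSketch, its methods, and the encoding of a pair as a two-element int array are assumptions made for brevity, and java.util.Optional stands in for the Some/None values shown in the table. The nested loops are quadratic and are meant for reading, not for real data sizes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Plain-Java illustration of two-RDD semantics; this is NOT the Spark API.
class TwoPairSetsSketch {

    // Each pair is encoded as {key, value}.
    static final List<int[]> RDD = Arrays.asList(
        new int[] {1, 2}, new int[] {3, 4}, new int[] {3, 6});
    static final List<int[]> OTHER = Arrays.asList(new int[] {3, 9});

    // Like join: emit (k, (v, w)) for every pair of entries with equal keys.
    static List<String> join(List<int[]> left, List<int[]> right) {
        List<String> out = new ArrayList<>();
        for (int[] l : left) {
            for (int[] r : right) {
                if (l[0] == r[0]) {
                    out.add("(" + l[0] + ", (" + l[1] + ", " + r[1] + "))");
                }
            }
        }
        return out;
    }

    // Like leftOuterJoin: every left entry appears; the right value is optional.
    static List<String> leftOuterJoin(List<int[]> left, List<int[]> right) {
        List<String> out = new ArrayList<>();
        for (int[] l : left) {
            boolean matched = false;
            for (int[] r : right) {
                if (l[0] == r[0]) {
                    matched = true;
                    out.add("(" + l[0] + ", (" + l[1] + ", " + Optional.of(r[1]) + "))");
                }
            }
            if (!matched) {
                out.add("(" + l[0] + ", (" + l[1] + ", " + Optional.empty() + "))");
            }
        }
        return out;
    }

    // Like subtractByKey: keep left entries whose key is absent from the right.
    static List<String> subtractByKey(List<int[]> left, List<int[]> right) {
        List<String> out = new ArrayList<>();
        for (int[] l : left) {
            boolean present = false;
            for (int[] r : right) {
                if (l[0] == r[0]) {
                    present = true;
                }
            }
            if (!present) {
                out.add("(" + l[0] + ", " + l[1] + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(join(RDD, OTHER));          // [(3, (4, 9)), (3, (6, 9))]
        System.out.println(leftOuterJoin(RDD, OTHER)); // [(1, (2, Optional.empty)), (3, (4, Optional[9])), (3, (6, Optional[9]))]
        System.out.println(subtractByKey(RDD, OTHER)); // [(1, 2)]
    }
}
```

A right outer join is the mirror image: swapping the roles of the two inputs keeps every entry of the other RDD and makes the value from the first RDD optional.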

2.2 Pair RDD Filter Operation:

A pair RDD is still an RDD, so the operations described earlier (such as filter) also apply to it. The following program keeps only the pairs whose value (the line) is longer than 20 characters:

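    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function;
    import scala.Tuple2;

    /**
     * Filters a pair RDD, keeping the pairs whose value is longer than
     * 20 characters.
     *
     * @param input JavaPairRDD<String, String> of (key, line) pairs
     * @return JavaPairRDD<String, String> containing only the long lines
     */
    public JavaPairRDD<String, String> filterMoreThanTwentyLines(
            JavaPairRDD<String, String> input) {
        JavaPairRDD<String, String> filterRdd = input.filter(
            new Function<Tuple2<String, String>, Boolean>() {
                @Override
                public Boolean call(Tuple2<String, String> pair) throws Exception {
                    // _2() is the value of the pair, i.e. the whole line.
                    return pair._2().length() > 20;
                }
            });
        return filterRdd;
    }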


2.3 Aggregation Operations:



This article is from the "Snowflake" blog; please retain this source link: http://6216083.blog.51cto.com/6206083/1846757
