An RDD whose elements are key/value pairs is called a pair RDD.
1. Create the pair RDD:
1.1 How to create a pair RDD:
Many data formats produce a pair RDD directly when they are loaded. We can also use map() to convert the ordinary RDDs described earlier into a pair RDD.
1.2 Pair RDD Conversion Example:
In the following example, the original RDD is converted into a pair RDD whose key is the first word of each line and whose value is the whole line.
Java has no built-in tuple type, so Spark's Java API uses Scala's scala.Tuple2 class to create tuples. Create a tuple with new Tuple2(elem1, elem2), and access its elements with the ._1() and ._2() methods.
Also, Python and Scala use the basic map() function for this conversion, whereas Java needs to use the function mapToPair():
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

/**
 * Converts a basic RDD into a pair RDD. Business logic: the first word of
 * each line is the key, and the entire line is the value.
 * @param input JavaRDD<String> of text lines
 * @return JavaPairRDD<String, String> keyed by the first word of each line
 */
public JavaPairRDD<String, String> firstWordKeyRdd(JavaRDD<String> input) {
    JavaPairRDD<String, String> pairRdd = input.mapToPair(
        new PairFunction<String, String, String>() {
            @Override
            public Tuple2<String, String> call(String line) throws Exception {
                // Split on a single space; the first token becomes the key.
                return new Tuple2<String, String>(line.split(" ")[0], line);
            }
        });
    return pairRdd;
}
To create a pair RDD from an in-memory collection, Python and Scala use the function SparkContext.parallelize(), while Java uses the function JavaSparkContext.parallelizePairs().
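The keying rule used in the example above (first word as key, whole line as value) can be checked without a Spark cluster. The following is a minimal sketch in plain Java, assuming java.util collections stand in for the RDD. The class FirstWordKey and its methods are hypothetical names for illustration, not part of the Spark API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper class: plain Java collections stand in for an RDD here.
final class FirstWordKey {
    // Builds one (key, value) pair: key = first word, value = whole line.
    static Map.Entry<String, String> toPair(String line) {
        return new SimpleEntry<String, String>(line.split(" ")[0], line);
    }

    // Applies toPair to every line, mirroring what mapToPair does per element.
    static List<Map.Entry<String, String>> toPairs(List<String> lines) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(toPair(line));
        }
        return pairs;
    }
}
```

For example, FirstWordKey.toPair("hello spark world") yields a pair with key "hello" and value "hello spark world". A line that starts with a space gets an empty string as its key, because split(" ") then returns an empty first token.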
2. Pair RDD transformations:
2.1 Common pair RDD transformations:
The transformations available on a basic RDD can also be used on a pair RDD. Because the elements of a pair RDD are tuples, the functions passed to these transformations must operate on tuples rather than on individual elements.
The following table lists the transformations commonly used with a pair RDD (example RDD content: {(1, 2), (3, 4), (3, 6)}):
Function name | Purpose | Invocation example | Result |
reduceByKey(func) | Combine values with the same key. | rdd.reduceByKey((x, y) => x + y) | {(1, 2), (3, 10)} |
groupByKey() | Group values with the same key. | rdd.groupByKey() | {(1, [2]), (3, [4, 6])} |
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner) | Combine values with the same key using a different result type. | | |
mapValues(func) | Apply a function to each value of a pair RDD without changing the key. | rdd.mapValues(x => x + 1) | {(1, 3), (3, 5), (3, 7)} |
flatMapValues(func) | Apply a function that returns an iterator to each value of a pair RDD, and for each element returned, produce a key/value entry with the old key. Often used for tokenization. | rdd.flatMapValues(x => (x to 5)) | {(1, 2), (1, 3), (1, 4), (1, 5), (3, 4), (3, 5)} |
keys() | Return an RDD of just the keys. | rdd.keys() | {1, 3, 3} |
values() | Return an RDD of just the values. | rdd.values() | {2, 4, 6} |
sortByKey() | Return an RDD sorted by the key. | rdd.sortByKey() | {(1, 2), (3, 4), (3, 6)} |
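The semantics of three of the functions in the table above (reduceByKey, groupByKey and mapValues) can be sketched on a single machine. This is a hedged local model in plain Java, assuming java.util collections in place of an RDD. PairOpsModel and its method names are hypothetical, and the model says nothing about how Spark partitions or shuffles the data.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical local model of three pair RDD transformations (not Spark code).
final class PairOpsModel {
    static Map.Entry<Integer, Integer> pair(int k, int v) {
        return new SimpleEntry<Integer, Integer>(k, v);
    }

    // The sample data used in the table: {(1, 2), (3, 4), (3, 6)}.
    static List<Map.Entry<Integer, Integer>> sample() {
        return Arrays.asList(pair(1, 2), pair(3, 4), pair(3, 6));
    }

    // Models reduceByKey: values sharing a key are folded together with func.
    static Map<Integer, Integer> reduceByKey(
            List<Map.Entry<Integer, Integer>> pairs, BinaryOperator<Integer> func) {
        Map<Integer, Integer> out = new TreeMap<>();
        for (Map.Entry<Integer, Integer> p : pairs) {
            out.merge(p.getKey(), p.getValue(), func);
        }
        return out;
    }

    // Models groupByKey: values sharing a key are collected into one list.
    static Map<Integer, List<Integer>> groupByKey(List<Map.Entry<Integer, Integer>> pairs) {
        Map<Integer, List<Integer>> out = new TreeMap<>();
        for (Map.Entry<Integer, Integer> p : pairs) {
            out.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return out;
    }

    // Models mapValues: func is applied to each value, and the key is kept as is.
    static List<Map.Entry<Integer, Integer>> mapValues(
            List<Map.Entry<Integer, Integer>> pairs, Function<Integer, Integer> func) {
        List<Map.Entry<Integer, Integer>> out = new ArrayList<>();
        for (Map.Entry<Integer, Integer> p : pairs) {
            out.add(pair(p.getKey(), func.apply(p.getValue())));
        }
        return out;
    }
}
```

With the sample data, PairOpsModel.reduceByKey(PairOpsModel.sample(), (x, y) -> x + y) gives a map that holds 2 for key 1 and 10 for key 3, which matches the reduceByKey row of the table.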
The following table lists transformations on two pair RDDs (rdd = {(1, 2), (3, 4), (3, 6)} and other = {(3, 9)}):
Function name | Purpose | Invocation example | Result |
subtractByKey | Remove elements with a key present in the other RDD. | rdd.subtractByKey(other) | {(1, 2)} |
join | Perform an inner join between the RDDs. | rdd.join(other) | {(3, (4, 9)), (3, (6, 9))} |
rightOuterJoin | Perform a join between the RDDs where the key must be present in the other RDD. | rdd.rightOuterJoin(other) | {(3, (Some(4), 9)), (3, (Some(6), 9))} |
leftOuterJoin | Perform a join between the RDDs where the key must be present in the first RDD. | rdd.leftOuterJoin(other) | {(1, (2, None)), (3, (4, Some(9))), (3, (6, Some(9)))} |
cogroup | Group data from both RDDs sharing the same key. | rdd.cogroup(other) | {(1, ([2], [])), (3, ([4, 6], [9]))} |
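The difference between subtractByKey, join and leftOuterJoin in the table above is easiest to see on the sample data. The following is a rough local model in plain Java, assuming java.util lists in place of RDDs. PairJoinModel and its methods are hypothetical names, Java's Optional stands in for Scala's Some/None, and results are rendered as text in the table's notation. The model preserves input order, which Spark itself does not guarantee.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical local model of two-RDD transformations (not Spark code).
final class PairJoinModel {
    static Map.Entry<Integer, Integer> pair(int k, int v) {
        return new SimpleEntry<Integer, Integer>(k, v);
    }

    // Sample inputs from the table: rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)}.
    static List<Map.Entry<Integer, Integer>> rdd() {
        return Arrays.asList(pair(1, 2), pair(3, 4), pair(3, 6));
    }

    static List<Map.Entry<Integer, Integer>> other() {
        return Arrays.asList(pair(3, 9));
    }

    // Renders one joined row as "(key, (left, right))".
    static String row(int key, Object left, Object right) {
        return "(" + key + ", (" + left + ", " + right + "))";
    }

    // Models subtractByKey: keeps pairs of left whose key does not occur in right.
    static List<Map.Entry<Integer, Integer>> subtractByKey(
            List<Map.Entry<Integer, Integer>> left, List<Map.Entry<Integer, Integer>> right) {
        List<Map.Entry<Integer, Integer>> out = new ArrayList<>();
        for (Map.Entry<Integer, Integer> l : left) {
            boolean found = false;
            for (Map.Entry<Integer, Integer> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    found = true;
                }
            }
            if (!found) {
                out.add(l);
            }
        }
        return out;
    }

    // Models join (inner): one row per matching (left, right) combination.
    static List<String> join(
            List<Map.Entry<Integer, Integer>> left, List<Map.Entry<Integer, Integer>> right) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, Integer> l : left) {
            for (Map.Entry<Integer, Integer> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    out.add(row(l.getKey(), l.getValue(), r.getValue()));
                }
            }
        }
        return out;
    }

    // Models leftOuterJoin: every left pair appears; a missing match is Optional.empty().
    static List<String> leftOuterJoin(
            List<Map.Entry<Integer, Integer>> left, List<Map.Entry<Integer, Integer>> right) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, Integer> l : left) {
            boolean matched = false;
            for (Map.Entry<Integer, Integer> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    matched = true;
                    out.add(row(l.getKey(), l.getValue(), Optional.of(r.getValue())));
                }
            }
            if (!matched) {
                out.add(row(l.getKey(), l.getValue(), Optional.empty()));
            }
        }
        return out;
    }
}
```

Calling PairJoinModel.join(PairJoinModel.rdd(), PairJoinModel.other()) returns two rows, both for key 3, while leftOuterJoin additionally keeps the unmatched pair with key 1 and an empty right-hand side.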
2.2 Pair RDD Filter Operation:
A pair RDD is still an RDD, so the operations described earlier (such as filter) also apply to it. The following program keeps only the pairs whose value (the line) is longer than 20 characters:
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

/**
 * Filters a pair RDD, keeping the rows whose value is longer than 20 characters.
 * @param input JavaPairRDD<String, String>
 * @return JavaPairRDD<String, String> containing only the long lines
 */
public JavaPairRDD<String, String> filterMoreThanTwentyLines(JavaPairRDD<String, String> input) {
    JavaPairRDD<String, String> filterRdd = input.filter(
        new Function<Tuple2<String, String>, Boolean>() {
            @Override
            public Boolean call(Tuple2<String, String> pair) throws Exception {
                // _2() is the value of the pair, i.e. the whole line.
                return pair._2().length() > 20;
            }
        });
    return filterRdd;
}
2.3 Aggregation Operations:
This article is from the "Snowflake" blog; please keep this source link when reposting: http://6216083.blog.51cto.com/6206083/1846757