Before studying any Spark topic, please first make sure you understand Spark correctly; see: Understanding Spark Correctly.
This article details the Java API for key-value (pair) RDDs in Spark.
I. How to create a key-value RDD
1. SparkContext.parallelizePairs
JavaPairRDD<String, Integer> javaPairRDD = sc.parallelizePairs(
        Arrays.asList(new Tuple2<>("test", 3), new Tuple2<>("kkk", 3)));
// Result: [(test,3), (kkk,3)]
System.out.println("javaPairRDD = " + javaPairRDD.collect());
2. keyBy
public class User implements Serializable {
    private String userId;
    private Integer amount;

    public User(String userId, Integer amount) {
        this.userId = userId;
        this.amount = amount;
    }

    public String getUserId() {
        return userId;
    }

    @Override
    public String toString() {
        return "User{" + "userId='" + userId + '\'' + ", amount=" + amount + '}';
    }
}

JavaRDD<User> userJavaRDD = sc.parallelize(Arrays.asList(new User("U1", 20)));
JavaPairRDD<String, User> userJavaPairRDD = userJavaRDD.keyBy(new Function<User, String>() {
    @Override
    public String call(User user) throws Exception {
        return user.getUserId();
    }
});
// Result: [(U1,User{userId='U1', amount=20})]
System.out.println("userJavaPairRDD = " + userJavaPairRDD.collect());
3. zip
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 1, 2, 3, 5, 8, 13));
// Zipping two RDDs is another way to create a key-value RDD
JavaPairRDD<Integer, Integer> zipPairRDD = rdd.zip(rdd);
// Result: [(1,1), (1,1), (2,2), (3,3), (5,5), (8,8), (13,13)]
System.out.println("zipPairRDD = " + zipPairRDD.collect());
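Spark aside, zip's positional pairing can be sketched in plain Java to show what it produces (class and method names here are illustrative, not part of the Spark API; like RDD.zip, the sketch assumes both sides have the same length):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ZipSketch {
    // Pair up two lists element by element, the way RDD.zip pairs partitions.
    static <K, V> List<Map.Entry<K, V>> zip(List<K> keys, List<V> values) {
        if (keys.size() != values.size()) {
            throw new IllegalArgumentException("zip requires equal sizes");
        }
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            out.add(new SimpleEntry<>(keys.get(i), values.get(i)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 1, 2, 3, 5, 8, 13);
        // Zipping the list with itself mirrors rdd.zip(rdd) above.
        System.out.println(zip(xs, xs));
    }
}
```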
4. groupBy
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 1, 2, 3, 5, 8, 13));
Function<Integer, Boolean> isEven = new Function<Integer, Boolean>() {
    @Override
    public Boolean call(Integer x) throws Exception {
        return x % 2 == 0;
    }
};
// Group the odd and even numbers, producing a key-value RDD
JavaPairRDD<Boolean, Iterable<Integer>> oddsAndEvens = rdd.groupBy(isEven);
// Result: [(false,[1, 1, 3, 5, 13]), (true,[2, 8])]
System.out.println("oddsAndEvens = " + oddsAndEvens.collect());
// Result: 1
System.out.println("oddsAndEvens.partitions.size = " + oddsAndEvens.partitions().size());

oddsAndEvens = rdd.groupBy(isEven, 2);
// Result: [(false,[1, 1, 3, 5, 13]), (true,[2, 8])]
System.out.println("oddsAndEvens = " + oddsAndEvens.collect());
// Result: 2
System.out.println("oddsAndEvens.partitions.size = " + oddsAndEvens.partitions().size());
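The same odd/even grouping can be reproduced on the local JVM with the standard library, which may help separate the grouping semantics from Spark's distribution (this is plain `java.util.stream`, not Spark):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OddsAndEvensSketch {
    // Local-JVM mirror of rdd.groupBy(isEven): partitioningBy splits the
    // stream into a two-entry map keyed by the predicate's boolean result.
    static Map<Boolean, List<Integer>> oddsAndEvens(List<Integer> xs) {
        return xs.stream().collect(Collectors.partitioningBy(x -> x % 2 == 0));
    }

    public static void main(String[] args) {
        // Same data as the Spark example above.
        System.out.println(oddsAndEvens(
                Stream.of(1, 1, 2, 3, 5, 8, 13).collect(Collectors.toList())));
    }
}
```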
II. combineByKey
JavaPairRDD<String, Integer> javaPairRDD = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("coffee", 1), new Tuple2<>("coffee", 2),
        new Tuple2<>("panda", 3), new Tuple2<>("coffee", 9)), 2);

// When a new key is first seen in a partition, apply this function to the key's value
Function<Integer, Tuple2<Integer, Integer>> createCombiner =
        new Function<Integer, Tuple2<Integer, Integer>>() {
    @Override
    public Tuple2<Integer, Integer> call(Integer value) throws Exception {
        return new Tuple2<>(value, 1);
    }
};

// When a key that createCombiner has already been applied to is seen again in the
// same partition, apply this function to the key's value
Function2<Tuple2<Integer, Integer>, Integer, Tuple2<Integer, Integer>> mergeValue =
        new Function2<Tuple2<Integer, Integer>, Integer, Tuple2<Integer, Integer>>() {
    @Override
    public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> acc, Integer value) throws Exception {
        return new Tuple2<>(acc._1() + value, acc._2() + 1);
    }
};

// Apply this function when aggregating data across partitions
Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>> mergeCombiners =
        new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
    @Override
    public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> acc1, Tuple2<Integer, Integer> acc2) throws Exception {
        return new Tuple2<>(acc1._1() + acc2._1(), acc1._2() + acc2._2());
    }
};

JavaPairRDD<String, Tuple2<Integer, Integer>> combineByKeyRDD =
        javaPairRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
// Result: [(coffee,(12,3)), (panda,(3,1))]
System.out.println("combineByKeyRDD = " + combineByKeyRDD.collect());
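The two-phase behaviour of the three functions can be exercised without a Spark cluster. The sketch below is plain Java (class and method names are illustrative, not Spark API): it folds the same coffee/panda data through createCombiner / mergeValue within each simulated "partition", then merges the partial results with mergeCombiners, yielding (sum, count) per key:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Function;

public class CombineByKeySketch {
    // (sum, count) accumulator, same idea as the Tuple2<Integer, Integer> above.
    static final Function<Integer, int[]> CREATE_COMBINER = v -> new int[]{v, 1};
    static final BiFunction<int[], Integer, int[]> MERGE_VALUE =
            (acc, v) -> new int[]{acc[0] + v, acc[1] + 1};
    static final BinaryOperator<int[]> MERGE_COMBINERS =
            (a, b) -> new int[]{a[0] + b[0], a[1] + b[1]};

    static Map<String, int[]> combineByKey(List<List<Map.Entry<String, Integer>>> partitions) {
        // Phase 1: fold each partition on its own, as each Spark task would.
        List<Map<String, int[]>> perPartition = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> part : partitions) {
            Map<String, int[]> acc = new HashMap<>();
            for (Map.Entry<String, Integer> e : part) {
                int[] cur = acc.get(e.getKey());
                acc.put(e.getKey(), cur == null
                        ? CREATE_COMBINER.apply(e.getValue())    // key seen for the first time
                        : MERGE_VALUE.apply(cur, e.getValue())); // key seen again
            }
            perPartition.add(acc);
        }
        // Phase 2: merge per-partition results, as the shuffle/reduce side would.
        Map<String, int[]> result = new HashMap<>();
        for (Map<String, int[]> partResult : perPartition) {
            for (Map.Entry<String, int[]> e : partResult.entrySet()) {
                result.merge(e.getKey(), e.getValue(), MERGE_COMBINERS);
            }
        }
        return result;
    }

    static Map.Entry<String, Integer> pair(String k, Integer v) {
        return new SimpleEntry<>(k, v);
    }

    public static void main(String[] args) {
        // Same data as the Spark example, split into two "partitions".
        Map<String, int[]> result = combineByKey(Arrays.asList(
                Arrays.asList(pair("coffee", 1), pair("coffee", 2)),
                Arrays.asList(pair("panda", 3), pair("coffee", 9))));
        // coffee -> (12, 3), panda -> (3, 1)
        System.out.println("coffee = " + Arrays.toString(result.get("coffee")));
        System.out.println("panda  = " + Arrays.toString(result.get("panda")));
    }
}
```

Tracing it by hand: partition 1 produces coffee→(3,2); partition 2 produces panda→(3,1) and coffee→(9,1); mergeCombiners then yields coffee→(12,3), matching the Spark output above.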
The combineByKey data flow is as follows:
[Figure: combineByKey data flow (combinebykey.png)]
For a detailed explanation of how combineByKey works internally, see: Spark core RDD API Rationale.
III. aggregateByKey
JavaPairRDD<String, Tuple2<Integer, Integer>> aggregateByKeyRDD =
        javaPairRDD.aggregateByKey(new Tuple2<>(0, 0), mergeValue, mergeCombiners);
// Result: [(coffee,(12,3)), (panda,(3,1))]
System.out.println("aggregateByKeyRDD = " + aggregateByKeyRDD.collect());

// aggregateByKey is implemented on top of combineByKey; the aggregateByKey call
// above is equivalent to the combineByKey call below
Function<Integer, Tuple2<Integer, Integer>> createCombinerAggregateByKey =
        new Function<Integer, Tuple2<Integer, Integer>>() {
    @Override
    public Tuple2<Integer, Integer> call(Integer value) throws Exception {
        return mergeValue.call(new Tuple2<>(0, 0), value);
    }
};
// Result: [(coffee,(12,3)), (panda,(3,1))]
System.out.println(javaPairRDD.combineByKey(createCombinerAggregateByKey, mergeValue, mergeCombiners).collect());
IV. reduceByKey
JavaPairRDD<String, Integer> reduceByKeyRDD = javaPairRDD.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer value1, Integer value2) throws Exception {
        return value1 + value2;
    }
});
// Result: [(coffee,12), (panda,3)]
System.out.println("reduceByKeyRDD = " + reduceByKeyRDD.collect());

// reduceByKey is also implemented on top of combineByKey; the reduceByKey call
// above is equivalent to the combineByKey call below
Function<Integer, Integer> createCombinerReduce = new Function<Integer, Integer>() {
    @Override
    public Integer call(Integer integer) throws Exception {
        return integer;
    }
};
Function2<Integer, Integer, Integer> mergeValueReduce = new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer integer, Integer integer2) throws Exception {
        return integer + integer2;
    }
};
// Result: [(coffee,12), (panda,3)]
System.out.println(javaPairRDD.combineByKey(createCombinerReduce, mergeValueReduce, mergeValueReduce).collect());
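Per-key reduction maps naturally onto `Map.merge` in plain Java; this Spark-free sketch (illustrative names, not Spark API) shows the same fold the reduceByKey example performs:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceByKeySketch {
    static Map<String, Integer> reduceByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new HashMap<>();
        for (Map.Entry<String, Integer> e : pairs) {
            // merge inserts the value for a new key and applies Integer::sum for
            // an existing one, mirroring reduceByKey's value1 + value2 function.
            sums.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Same data as the Spark example above.
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
                new SimpleEntry<>("coffee", 1),
                new SimpleEntry<>("coffee", 2),
                new SimpleEntry<>("panda", 3),
                new SimpleEntry<>("coffee", 9));
        // coffee -> 12, panda -> 3
        System.out.println(reduceByKey(pairs));
    }
}
```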
V. foldByKey
JavaPairRDD<String, Integer> foldByKeyRDD = javaPairRDD.foldByKey(0,
        new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer integer, Integer integer2) throws Exception {
        return integer + integer2;
    }
});
// Result: [(coffee,12), (panda,3)]
System.out.println("foldByKeyRDD = " + foldByKeyRDD.collect());

// foldByKey is also implemented on top of combineByKey; the foldByKey call
// above is equivalent to the combineByKey call below
Function2<Integer, Integer, Integer> mergeValueFold = new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer integer, Integer integer2) throws Exception {
        return integer + integer2;
    }
};
Function<Integer, Integer> createCombinerFold = new Function<Integer, Integer>() {
    @Override
    public Integer call(Integer integer) throws Exception {
        return mergeValueFold.call(0, integer);
    }
};
// Result: [(coffee,12), (panda,3)]
System.out.println(javaPairRDD.combineByKey(createCombinerFold, mergeValueFold, mergeValueFold).collect());
VI. groupByKey
JavaPairRDD<String, Iterable<Integer>> groupByKeyRDD = javaPairRDD.groupByKey();
// Result: [(coffee,[1, 2, 9]), (panda,[3])]
System.out.println("groupByKeyRDD = " + groupByKeyRDD.collect());

// groupByKey is also implemented on top of combineByKey; the groupByKey call
// above is equivalent to the combineByKey call below
Function<Integer, List<Integer>> createCombinerGroup = new Function<Integer, List<Integer>>() {
    @Override
    public List<Integer> call(Integer integer) throws Exception {
        List<Integer> list = new ArrayList<>();
        list.add(integer);
        return list;
    }
};
Function2<List<Integer>, Integer, List<Integer>> mergeValueGroup =
        new Function2<List<Integer>, Integer, List<Integer>>() {
    @Override
    public List<Integer> call(List<Integer> integers, Integer integer) throws Exception {
        integers.add(integer);
        return integers;
    }
};
Function2<List<Integer>, List<Integer>, List<Integer>> mergeCombinersGroup =
        new Function2<List<Integer>, List<Integer>, List<Integer>>() {
    @Override
    public List<Integer> call(List<Integer> integers, List<Integer> integers2) throws Exception {
        integers.addAll(integers2);
        return integers;
    }
};
// Result: [(coffee,[1, 2, 9]), (panda,[3])]
System.out.println(javaPairRDD.combineByKey(createCombinerGroup, mergeValueGroup, mergeCombinersGroup).collect());
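The grouping semantics (collect every value per key, preserving arrival order) can be mirrored in plain Java with `computeIfAbsent`; again the names are illustrative, not Spark API:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupByKeySketch {
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (Map.Entry<String, Integer> e : pairs) {
            // computeIfAbsent plays the role of createCombinerGroup above;
            // List.add plays the role of mergeValueGroup.
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Same data as the Spark example above.
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
                new SimpleEntry<>("coffee", 1),
                new SimpleEntry<>("coffee", 2),
                new SimpleEntry<>("panda", 3),
                new SimpleEntry<>("coffee", 9));
        // coffee -> [1, 2, 9], panda -> [3]
        System.out.println(groupByKey(pairs));
    }
}
```

Note that, unlike reduceByKey, this keeps every value per key in memory at once, which is why groupByKey is the more expensive operation when a reduction would suffice.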
It is hard to convey the underlying principles of these APIs in writing alone; if you want to go deeper into how they work, see: Spark core RDD API Rationale.
Spark 2.x deep-dive series, part 6: the RDD Java API in detail (3)