Spark 2.x in-depth series, part six: the RDD Java API in detail (3)


Before studying any individual Spark topic, make sure you first have a correct overall understanding of Spark; see: Understanding Spark Correctly.


This article covers the Java API for key-value (pair) RDDs in Spark in detail.


I. Creating a key-value RDD

1. JavaSparkContext.parallelizePairs

JavaPairRDD<String, Integer> javaPairRDD =
        sc.parallelizePairs(Arrays.asList(new Tuple2<>("test", 3), new Tuple2<>("kkk", 3)));
// Result: [(test,3), (kkk,3)]
System.out.println("javaPairRDD = " + javaPairRDD.collect());

2. keyBy

public class User implements Serializable {
    private String userId;
    private Integer amount;

    public User(String userId, Integer amount) {
        this.userId = userId;
        this.amount = amount;
    }

    public String getUserId() {
        return userId;
    }

    @Override
    public String toString() {
        return "User{" + "userId='" + userId + '\'' + ", amount=" + amount + '}';
    }
}

JavaRDD<User> userJavaRDD = sc.parallelize(Arrays.asList(new User("u1", 20)));
JavaPairRDD<String, User> userJavaPairRDD = userJavaRDD.keyBy(new Function<User, String>() {
    @Override
    public String call(User user) throws Exception {
        return user.getUserId();
    }
});
// Result: [(u1,User{userId='u1', amount=20})]
System.out.println("userJavaPairRDD = " + userJavaPairRDD.collect());

3. zip

JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 1, 2, 3, 5, 8, 13));
// Zipping two RDDs is another way to create a key-value RDD
JavaPairRDD<Integer, Integer> zipPairRDD = rdd.zip(rdd);
// Result: [(1,1), (1,1), (2,2), (3,3), (5,5), (8,8), (13,13)]
System.out.println("zipPairRDD = " + zipPairRDD.collect());
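To see what zip does positionally, here is a minimal plain-Java sketch of the same pairing (no Spark involved; the class name `ZipSim` is mine, for illustration only). Note that Spark's `zip` additionally requires both RDDs to have the same number of partitions and the same number of elements per partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ZipSim {
    /** Pairs elements of two equal-length lists by position, like RDD.zip. */
    public static <A, B> List<Map.Entry<A, B>> zip(List<A> left, List<B> right) {
        if (left.size() != right.size())
            throw new IllegalArgumentException("zip requires equal lengths");
        List<Map.Entry<A, B>> out = new ArrayList<>();
        for (int i = 0; i < left.size(); i++) {
            out.add(Map.entry(left.get(i), right.get(i)));  // (left[i], right[i])
        }
        return out;
    }
}
```

Zipping a list with itself, as in the RDD example above, simply pairs each element with itself.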

4. groupBy

JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 1, 2, 3, 5, 8, 13));
Function<Integer, Boolean> isEven = new Function<Integer, Boolean>() {
    @Override
    public Boolean call(Integer x) throws Exception {
        return x % 2 == 0;
    }
};
// Group the odd and even numbers, producing a key-value RDD
JavaPairRDD<Boolean, Iterable<Integer>> oddsAndEvens = rdd.groupBy(isEven);
// Result: [(false,[1, 1, 3, 5, 13]), (true,[2, 8])]
System.out.println("oddsAndEvens = " + oddsAndEvens.collect());
// Result: 1
System.out.println("oddsAndEvens.partitions.size = " + oddsAndEvens.partitions().size());

oddsAndEvens = rdd.groupBy(isEven, 2);
// Result: [(false,[1, 1, 3, 5, 13]), (true,[2, 8])]
System.out.println("oddsAndEvens = " + oddsAndEvens.collect());
// Result: 2
System.out.println("oddsAndEvens.partitions.size = " + oddsAndEvens.partitions().size());
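The same odd/even grouping can be expressed outside Spark with the JDK's `Collectors.partitioningBy`, which is handy for checking the expected result shape. A minimal sketch (the class name `GroupBySim` is mine, for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupBySim {
    /** Splits numbers by a boolean predicate, like rdd.groupBy(isEven). */
    public static Map<Boolean, List<Integer>> byParity(List<Integer> nums) {
        // partitioningBy always yields exactly the keys false and true
        return nums.stream().collect(Collectors.partitioningBy(x -> x % 2 == 0));
    }
}
```

Unlike Spark's `groupBy`, this runs on a single JVM and preserves encounter order within each group list.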

II. combineByKey

JavaPairRDD<String, Integer> javaPairRDD =
        sc.parallelizePairs(Arrays.asList(new Tuple2<>("coffee", 1), new Tuple2<>("coffee", 2),
                new Tuple2<>("panda", 3), new Tuple2<>("coffee", 9)), 2);

// Applied to a key's value the first time that key is seen within a partition
Function<Integer, Tuple2<Integer, Integer>> createCombiner =
        new Function<Integer, Tuple2<Integer, Integer>>() {
    @Override
    public Tuple2<Integer, Integer> call(Integer value) throws Exception {
        return new Tuple2<>(value, 1);
    }
};

// Applied when a key seen again within a partition already has a combiner from createCombiner
Function2<Tuple2<Integer, Integer>, Integer, Tuple2<Integer, Integer>> mergeValue =
        new Function2<Tuple2<Integer, Integer>, Integer, Tuple2<Integer, Integer>>() {
    @Override
    public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> acc, Integer value) throws Exception {
        return new Tuple2<>(acc._1() + value, acc._2() + 1);
    }
};

// Applied when combiners for the same key from different partitions are merged
Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>> mergeCombiners =
        new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
    @Override
    public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> acc1, Tuple2<Integer, Integer> acc2) throws Exception {
        return new Tuple2<>(acc1._1() + acc2._1(), acc1._2() + acc2._2());
    }
};

JavaPairRDD<String, Tuple2<Integer, Integer>> combineByKeyRDD =
        javaPairRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
// Result: [(coffee,(12,3)), (panda,(3,1))]
System.out.println("combineByKeyRDD = " + combineByKeyRDD.collect());

The combineByKey data flow is as follows:

(Figure: combineByKey.png — the combineByKey data flow)
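The three-function contract can be simulated in plain Java, without Spark, to make the data flow concrete: createCombiner fires on a key's first value within a partition, mergeValue on each subsequent value in that partition, and mergeCombiners when per-partition partial results are combined. This is an illustrative sketch only (the class name `CombineByKeySim` and method names are mine), computing the same per-key (sum, count) as the coffee/panda example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Function;

public class CombineByKeySim {

    /** Simulates combineByKey over pre-partitioned (key, value) data. */
    public static <K, V, C> Map<K, C> combineByKey(
            List<List<Map.Entry<K, V>>> partitions,
            Function<V, C> createCombiner,
            BiFunction<C, V, C> mergeValue,
            BinaryOperator<C> mergeCombiners) {

        List<Map<K, C>> partials = new ArrayList<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            Map<K, C> acc = new HashMap<>();  // per-partition combiners
            for (Map.Entry<K, V> e : partition) {
                if (!acc.containsKey(e.getKey())) {
                    // first time this key is seen in this partition
                    acc.put(e.getKey(), createCombiner.apply(e.getValue()));
                } else {
                    // key already has a combiner in this partition
                    acc.put(e.getKey(), mergeValue.apply(acc.get(e.getKey()), e.getValue()));
                }
            }
            partials.add(acc);
        }
        // "shuffle" phase: merge the per-partition combiners across partitions
        Map<K, C> result = new HashMap<>();
        for (Map<K, C> partial : partials) {
            partial.forEach((k, c) -> result.merge(k, c, mergeCombiners));
        }
        return result;
    }

    /** Per-key (sum, count) as int[2], mirroring the coffee/panda example. */
    public static Map<String, int[]> sumCount(List<List<Map.Entry<String, Integer>>> partitions) {
        return combineByKey(partitions,
                v -> new int[]{v, 1},                             // createCombiner
                (acc, v) -> new int[]{acc[0] + v, acc[1] + 1},    // mergeValue
                (a, b) -> new int[]{a[0] + b[0], a[1] + b[1]});   // mergeCombiners
    }
}
```

With partition 0 holding (coffee,1), (coffee,2) and partition 1 holding (panda,3), (coffee,9), `sumCount` yields coffee→(12,3) and panda→(3,1), matching the RDD result above.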

For a detailed explanation of how combineByKey works internally, see: Spark core RDD API Rationale.

III. aggregateByKey

JavaPairRDD<String, Tuple2<Integer, Integer>> aggregateByKeyRDD =
        javaPairRDD.aggregateByKey(new Tuple2<>(0, 0), mergeValue, mergeCombiners);
// Result: [(coffee,(12,3)), (panda,(3,1))]
System.out.println("aggregateByKeyRDD = " + aggregateByKeyRDD.collect());

// aggregateByKey is implemented with combineByKey; the call above is equivalent to the following
Function<Integer, Tuple2<Integer, Integer>> createCombinerAggregateByKey =
        new Function<Integer, Tuple2<Integer, Integer>>() {
    @Override
    public Tuple2<Integer, Integer> call(Integer value) throws Exception {
        return mergeValue.call(new Tuple2<>(0, 0), value);
    }
};
// Result: [(coffee,(12,3)), (panda,(3,1))]
System.out.println(javaPairRDD.combineByKey(createCombinerAggregateByKey, mergeValue, mergeCombiners).collect());

IV. reduceByKey

JavaPairRDD<String, Integer> reduceByKeyRDD = javaPairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer value1, Integer value2) throws Exception {
        return value1 + value2;
    }
});
// Result: [(coffee,12), (panda,3)]
System.out.println("reduceByKeyRDD = " + reduceByKeyRDD.collect());

// reduceByKey is also implemented with combineByKey; the call above is equivalent to the following
Function<Integer, Integer> createCombinerReduce = new Function<Integer, Integer>() {
    @Override
    public Integer call(Integer integer) throws Exception {
        return integer;
    }
};
Function2<Integer, Integer, Integer> mergeValueReduce =
        new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer integer, Integer integer2) throws Exception {
        return integer + integer2;
    }
};
// Result: [(coffee,12), (panda,3)]
System.out.println(javaPairRDD.combineByKey(createCombinerReduce, mergeValueReduce, mergeValueReduce).collect());
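The reduceByKey contract — combine two values that share a key — is the same contract as the JDK's `Map.merge`, which makes for a compact plain-Java illustration of the result (no Spark; the class name `ReduceByKeySim` is mine):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

public class ReduceByKeySim {
    /** Folds (key, value) pairs into a map, combining values for duplicate keys. */
    public static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs,
                                               BinaryOperator<V> reducer) {
        Map<K, V> result = new HashMap<>();
        for (Map.Entry<K, V> e : pairs) {
            // reducer only runs when the key already has a value (a "collision")
            result.merge(e.getKey(), e.getValue(), reducer);
        }
        return result;
    }
}
```

Running it over (coffee,1), (coffee,2), (panda,3), (coffee,9) with `Integer::sum` reproduces coffee→12, panda→3.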

V. foldByKey

JavaPairRDD<String, Integer> foldByKeyRDD = javaPairRDD.foldByKey(0, new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer integer, Integer integer2) throws Exception {
        return integer + integer2;
    }
});
// Result: [(coffee,12), (panda,3)]
System.out.println("foldByKeyRDD = " + foldByKeyRDD.collect());

// foldByKey is also implemented with combineByKey; the call above is equivalent to the following
Function2<Integer, Integer, Integer> mergeValueFold =
        new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer integer, Integer integer2) throws Exception {
        return integer + integer2;
    }
};
Function<Integer, Integer> createCombinerFold = new Function<Integer, Integer>() {
    @Override
    public Integer call(Integer integer) throws Exception {
        return mergeValueFold.call(0, integer);
    }
};
// Result: [(coffee,12), (panda,3)]
System.out.println(javaPairRDD.combineByKey(createCombinerFold, mergeValueFold, mergeValueFold).collect());
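foldByKey differs from reduceByKey only in starting each key's accumulator from a zero value. A plain-Java sketch of that behavior over a single partition (illustrative only; `FoldByKeySim` is my name for it):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

public class FoldByKeySim {
    /** Folds values per key starting from a zero value, like foldByKey. */
    public static <K, V> Map<K, V> foldByKey(List<Map.Entry<K, V>> pairs,
                                             V zero, BinaryOperator<V> folder) {
        Map<K, V> result = new HashMap<>();
        for (Map.Entry<K, V> e : pairs) {
            // fresh keys start from zero; existing keys fold onto their accumulator
            result.put(e.getKey(), folder.apply(result.getOrDefault(e.getKey(), zero), e.getValue()));
        }
        return result;
    }
}
```

With zero = 0 and addition, the result matches reduceByKey; a non-identity zero would shift each key's total, which is why the zero must be the identity of the fold function for partition-count-independent results.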

VI. groupByKey

JavaPairRDD<String, Iterable<Integer>> groupByKeyRDD = javaPairRDD.groupByKey();
// Result: [(coffee,[1, 2, 9]), (panda,[3])]
System.out.println("groupByKeyRDD = " + groupByKeyRDD.collect());

// groupByKey is also implemented with combineByKey; the call above is equivalent to the following
Function<Integer, List<Integer>> createCombinerGroup = new Function<Integer, List<Integer>>() {
    @Override
    public List<Integer> call(Integer integer) throws Exception {
        List<Integer> list = new ArrayList<>();
        list.add(integer);
        return list;
    }
};
Function2<List<Integer>, Integer, List<Integer>> mergeValueGroup =
        new Function2<List<Integer>, Integer, List<Integer>>() {
    @Override
    public List<Integer> call(List<Integer> integers, Integer integer) throws Exception {
        integers.add(integer);
        return integers;
    }
};
Function2<List<Integer>, List<Integer>, List<Integer>> mergeCombinersGroup =
        new Function2<List<Integer>, List<Integer>, List<Integer>>() {
    @Override
    public List<Integer> call(List<Integer> integers, List<Integer> integers2) throws Exception {
        integers.addAll(integers2);
        return integers;
    }
};
// Result: [(coffee,[1, 2, 9]), (panda,[3])]
System.out.println(javaPairRDD.combineByKey(createCombinerGroup, mergeValueGroup, mergeCombinersGroup).collect());
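Outside Spark, the groupByKey result shape can be reproduced with the JDK's `Collectors.groupingBy`, which is a convenient way to sanity-check expected output (illustrative sketch; `GroupByKeySim` is my name):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByKeySim {
    /** Collects all values for each key into a list, like groupByKey. */
    public static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream()
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,                        // group by the key
                        Collectors.mapping(Map.Entry::getValue,  // keep only the values
                                Collectors.toList())));
    }
}
```

Over (coffee,1), (coffee,2), (panda,3), (coffee,9) this yields coffee→[1, 2, 9], panda→[3], the same grouping as the RDD result above.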


It is difficult for an article like this to capture the underlying rationale of these APIs; if you want to dig deeper into how they work, refer to: Spark core RDD API Rationale.

