Before studying any specific piece of Spark, it is worth first building a correct overall understanding of Spark; for that, see: Understanding Spark Correctly.
This article explains the join-related APIs of JavaPairRDD.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

SparkConf conf = new SparkConf().setAppName("AppName").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);

JavaPairRDD<Integer, Integer> javaPairRDD = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(1, 2), new Tuple2<>(3, 4), new Tuple2<>(3, 6), new Tuple2<>(5, 6)));
JavaPairRDD<Integer, Integer> otherJavaPairRDD = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(3, 9), new Tuple2<>(4, 5)));

// Result: [(4,([],[5])), (1,([2],[])), (3,([4, 6],[9])), (5,([6],[]))]
System.out.println(javaPairRDD.cogroup(otherJavaPairRDD).collect());

// groupWith is identical to cogroup; it produces the same result as above
System.out.println(javaPairRDD.groupWith(otherJavaPairRDD).collect());

// Built on cogroup: keeps only the keys that have values in both RDDs
// Result: [(3,(4,9)), (3,(6,9))]
System.out.println(javaPairRDD.join(otherJavaPairRDD).collect());

// Built on cogroup: every key of the left RDD appears in the result
// Result: [(1,(2,Optional.empty)), (3,(4,Optional[9])), (3,(6,Optional[9])), (5,(6,Optional.empty))]
System.out.println(javaPairRDD.leftOuterJoin(otherJavaPairRDD).collect());

// Built on cogroup: every key of the right RDD appears in the result
// Result: [(4,(Optional.empty,5)), (3,(Optional[4],9)), (3,(Optional[6],9))]
System.out.println(javaPairRDD.rightOuterJoin(otherJavaPairRDD).collect());

// Built on cogroup: every key of both RDDs appears in the result
// Result: [(4,(Optional.empty,Optional[5])), (1,(Optional[2],Optional.empty)),
//          (3,(Optional[4],Optional[9])), (3,(Optional[6],Optional[9])), (5,(Optional[6],Optional.empty))]
System.out.println(javaPairRDD.fullOuterJoin(otherJavaPairRDD).collect());
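Note that the outer joins wrap the side that may be missing in org.apache.spark.api.java.Optional (Spark 2.x's Guava-style Optional). Below is a minimal sketch of how such a result is typically consumed; the -1 default is just an illustrative placeholder, not anything prescribed by the API:

// requires: import org.apache.spark.api.java.Optional;
// Unwrapping the Optional produced by leftOuterJoin: for keys that exist
// only in the left RDD, the right-side value is absent.
for (Tuple2<Integer, Tuple2<Integer, Optional<Integer>>> t :
        javaPairRDD.leftOuterJoin(otherJavaPairRDD).collect()) {
    Integer left = t._2()._1();
    // Fall back to an illustrative default (-1) when the key has no match on the right
    Integer right = t._2()._2().isPresent() ? t._2()._2().get() : -1;
    System.out.println(t._1() + " -> (" + left + ", " + right + ")");
}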
As the results above show, cogroup is the most fundamental of these operations; the others are built on top of it. The following is a schematic diagram of cogroup:
[Figure: cogroup schematic (Cogroup.png)]
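To make the relationship in the diagram concrete, the sketch below reimplements join on top of cogroup: keep only the keys whose cogroup entry has values on both sides, and emit the cross product of the two value lists per key. This is an illustration of the idea, not Spark's actual implementation, and it assumes the Spark 2.x Java API, where flatMapValues takes a Function returning an Iterable:

// requires: import java.util.ArrayList; import java.util.List;
JavaPairRDD<Integer, Tuple2<Integer, Integer>> joinViaCogroup =
        javaPairRDD.cogroup(otherJavaPairRDD)
                .flatMapValues(pair -> {
                    List<Tuple2<Integer, Integer>> out = new ArrayList<>();
                    // Cross product of the left and right values for this key;
                    // keys with an empty side contribute nothing, as in join
                    for (Integer left : pair._1()) {
                        for (Integer right : pair._2()) {
                            out.add(new Tuple2<>(left, right));
                        }
                    }
                    return out;
                });
// Prints the same pairs as javaPairRDD.join(otherJavaPairRDD): [(3,(4,9)), (3,(6,9))]
System.out.println(joinViaCogroup.collect());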
For a more thorough understanding of how cogroup works, see: Spark core RDD API Rationale.
From the series: spark2.x deep-dive, part six: the RDD Java API in detail (4)