Spark Advanced Sorting and the TopN Problem Explained

Source: Internet
Author: User
Tags: comparable, iterable, new, set, shuffle

[TOC]

Introduction

The key question is how to sort the words by their counts in the earlier WordCount example.

As follows:

scala> val retRDD = sc.textFile("hdfs://ns1/hello").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_)

scala> val retSortRDD = retRDD.map(pair => (pair._2, pair._1)).sortByKey(false).map(pair => (pair._2, pair._1))

scala> retSortRDD.collect().foreach(println)
...
(hello,3)
(me,1)
(you,1)
(he,1)
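For comparison, roughly the same result can be obtained without the swap-map trick by using sortBy, which takes a key-extraction function. This is only a sketch, not part of the original post; it assumes the retRDD computed above, the variable name retSortRDD2 is made up, and words with equal counts may come out in a different order:

scala> val retSortRDD2 = retRDD.sortBy(_._2, false)   // sort (word, count) pairs by the count, descending
scala> retSortRDD2.collect().foreach(println)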

The following tests all require these Maven dependencies:

<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.5</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.2</version>
</dependency>
Spark secondary sort: test data and description

The data that needs to be secondary-sorted has the following format:

field_1 field_2 (separated by a space)

20 21
50 51
50 52
50 53
50 54
60 51
60 53
60 52
60 56
60 57
70 58
60 61
70 54

The code comments below explain the details. Note that in the following sort procedure the operations are implemented separately in Java and in Scala:

    • Java version
      • Method 1: make the elements comparable ---> requires the SecondarySort object
      • Method 2: provide a comparator ---> requires the SecondarySort object
      • Whichever way is used, a custom SecondarySort object is needed
    • Scala version
      • Method 1: make the elements comparable; this is the Scala implementation of Java Method 1 ---> requires the SecondarySort object
      • Method 2: use the first form of sortBy and sort on the original data ---> does not require the SecondarySort object
      • Method 3: use the second form of sortBy and convert the raw data ---> requires the SecondarySort object

So this secondary-sort example contains five versions in total across its Java and Scala implementations, which makes it quite valuable!

Common object

This is the SecondarySort object mentioned above, defined as follows:

package cn.xpleaf.bigdata.spark.java.core.domain;

import scala.Serializable;

public class SecondarySort implements Comparable<SecondarySort>, Serializable {
    private int first;
    private int second;

    public SecondarySort(int first, int second) {
        this.first = first;
        this.second = second;
    }

    public int getFirst() {
        return first;
    }

    public void setFirst(int first) {
        this.first = first;
    }

    public int getSecond() {
        return second;
    }

    public void setSecond(int second) {
        this.second = second;
    }

    @Override
    public int compareTo(SecondarySort that) {
        int ret = this.getFirst() - that.getFirst();
        if (ret == 0) {
            ret = that.getSecond() - this.getSecond();
        }
        return ret;
    }

    @Override
    public String toString() {
        return this.first + " " + this.second;
    }
}
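As an aside, a roughly equivalent key type could also be written directly in Scala as a case class extending Ordered. This is only a sketch, not part of the original code; the name SecondarySortKey is made up, and it encodes the same rule as the Java class (first column ascending, second column descending when the first is equal):

// Hypothetical Scala counterpart of the Java SecondarySort class (illustration only)
case class SecondarySortKey(first: Int, second: Int) extends Ordered[SecondarySortKey] with Serializable {

  override def compare(that: SecondarySortKey): Int = {
    val ret = this.first - that.first           // first column ascending
    if (ret == 0) that.second - this.second     // second column descending on ties
    else ret
  }

  override def toString: String = first + " " + second
}

Because it extends Ordered, sortByKey can pick up its ordering through the implicit Ordering that Scala derives for Ordered types.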
Java version

The test code is as follows:

package cn.xpleaf.bigdata.spark.java.core.p3;

import cn.xpleaf.bigdata.spark.java.core.domain.SecondarySort;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Serializable;
import scala.Tuple2;

import java.util.Comparator;

/**
 * Secondary sort, Java version.
 * Data format: field_1 field_2 (separated by a space), e.g. 20 21, 50 51, 50 52, ...
 * Requirement: sort the first column in ascending order; when the first column is equal,
 * sort the second column in descending order.
 *
 * Analysis: to sort, use sortByKey or sortBy. sortByKey can only sort by key, so should the first
 * column or the second column be the key? Given the requirement, neither alone is enough; only a
 * composite key (containing both the first and the second column) will do. The composite key must be
 * comparable, or the operation must be given a comparator. Since no comparator is provided at first,
 * there is no choice but to make the element itself comparable, using a custom object that implements
 * the Comparable interface.
 */
public class _01SparkSecondarySortOps {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]")
                .setAppName(_01SparkSecondarySortOps.class.getSimpleName());
        JavaSparkContext jsc = new JavaSparkContext(conf);

        JavaRDD<String> linesRDD = jsc.textFile("D:/data/spark/secondsort.csv");
        JavaPairRDD<SecondarySort, String> ssRDD = linesRDD.mapToPair(new PairFunction<String, SecondarySort, String>() {
            @Override
            public Tuple2<SecondarySort, String> call(String line) throws Exception {
                String[] fields = line.split(" ");
                int first = Integer.valueOf(fields[0].trim());
                int second = Integer.valueOf(fields[1].trim());
                SecondarySort ss = new SecondarySort(first, second);
                return new Tuple2<SecondarySort, String>(ss, "");
            }
        });

        /*
        // Method 1: make the element comparable.
        // numPartitions is set to 1 so the data is ordered globally; otherwise each partition is only ordered internally.
        JavaPairRDD<SecondarySort, String> sbkRDD = ssRDD.sortByKey(true, 1);
        */

        /**
         * Method 2: provide a comparator.
         * In contrast to the previous method, this time the order is: first column descending, second column ascending.
         */
        JavaPairRDD<SecondarySort, String> sbkRDD = ssRDD.sortByKey(new MyComparator<SecondarySort>() {
            @Override
            public int compare(SecondarySort o1, SecondarySort o2) {
                int ret = o2.getFirst() - o1.getFirst();
                if (ret == 0) {
                    ret = o1.getSecond() - o2.getSecond();
                }
                return ret;
            }
        }, true, 1);

        sbkRDD.foreach(new VoidFunction<Tuple2<SecondarySort, String>>() {
            @Override
            public void call(Tuple2<SecondarySort, String> tuple2) throws Exception {
                System.out.println(tuple2._1);
            }
        });

        jsc.close();
    }
}

/**
 * An intermediate adapter interface.
 * The comparator must also implement Serializable, otherwise an exception is thrown.
 * This is the Adapter pattern: an adapter acts as a bridge between two incompatible interfaces,
 * which is illustrated nicely here.
 */
interface MyComparator<T> extends Comparator<T>, Serializable {
}

The output results are as follows:

740 58
730 54
530 54
203 21
74 58
73 57
71 55
71 56
70 54
70 55
70 56
70 57
70 58
70 58
63 61
60 51
60 52
60 53
60 56
60 56
60 57
60 57
60 61
50 51
50 52
50 53
50 53
50 54
50 62
50 512
50 522
40 511
31 42
20 21
20 53
20 522
12 211
7 8
7 82
5 6
3 4
1 2
Scala version

The test code is as follows:

package cn.xpleaf.bigdata.spark.scala.core.p3

import cn.xpleaf.bigdata.spark.java.core.domain.SecondarySort
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.reflect.ClassTag

object _05SparkSecondarySortOps {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName(_05SparkSecondarySortOps.getClass.getSimpleName)
        val sc = new SparkContext(conf)

        val linesRDD = sc.textFile("D:/data/spark/secondsort.csv")

        /*
        // Method 1: make the element comparable
        val ssRDD: RDD[(SecondarySort, String)] = linesRDD.map(line => {
            val fields = line.split(" ")
            val first = Integer.valueOf(fields(0).trim())
            val second = Integer.valueOf(fields(1).trim())
            val ss = new SecondarySort(first, second)
            (ss, "")
        })
        val sbkRDD: RDD[(SecondarySort, String)] = ssRDD.sortByKey(true, 1)
        sbkRDD.foreach { case (ss: SecondarySort, str: String) =>   // pattern-matching style
            println(ss)
        }
        */

        /*
        // Method 2: the first use of sortBy, sorting based on the original data
        val retRDD = linesRDD.sortBy(line => line, numPartitions = 1)(new Ordering[String] {
            override def compare(x: String, y: String): Int = {
                val xFields = x.split(" ")
                val yFields = y.split(" ")
                var ret = xFields(0).toInt - yFields(0).toInt
                if (ret == 0) {
                    ret = yFields(1).toInt - xFields(1).toInt
                }
                ret
            }
        }, ClassTag.Object.asInstanceOf[ClassTag[String]])
        */

        // Method 3: the second use of sortBy, converting the original data --->
        // the first parameter of sortBy() is the function that converts the data
        val retRDD: RDD[String] = linesRDD.sortBy(line => {
            // f: (T) => K, here T is String and K is SecondarySort
            val fields = line.split(" ")
            val first = Integer.valueOf(fields(0).trim())
            val second = Integer.valueOf(fields(1).trim())
            val ss = new SecondarySort(first, second)
            ss
        }, true, 1)(new Ordering[SecondarySort] {
            override def compare(x: SecondarySort, y: SecondarySort): Int = {
                var ret = x.getFirst - y.getFirst
                if (ret == 0) {
                    ret = y.getSecond - x.getSecond
                }
                ret
            }
        }, ClassTag.Object.asInstanceOf[ClassTag[SecondarySort]])

        retRDD.foreach(println)

        sc.stop()
    }
}

The output results are as follows:

1 2
3 4
5 6
7 82
7 8
12 211
20 522
20 53
20 21
31 42
40 511
50 522
50 512
50 62
50 54
50 53
50 53
50 52
50 51
60 61
60 57
60 57
60 56
60 56
60 53
60 52
60 51
63 61
70 58
70 58
70 57
70 56
70 55
70 54
71 56
71 55
73 57
74 58
203 21
530 54
730 54
740 58
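As an aside (not part of the original post): since Scala already provides an implicit Ordering for tuples, the SecondarySort object can be avoided entirely by sorting on a tuple key and negating the second field. A minimal sketch, assuming the same linesRDD as above; the variable name tupleSortedRDD is made up:

// tuple ordering gives: first column ascending; negating the second field makes it descending on ties
val tupleSortedRDD = linesRDD.sortBy(line => {
  val fields = line.split(" ")
  (fields(0).trim.toInt, -fields(1).trim.toInt)
}, ascending = true, numPartitions = 1)
tupleSortedRDD.foreach(println)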
The TopN problem: requirements and description

Requirements and data are described below:

  * Notes on the TopN problem:
  *     The TopN problem can obviously be solved with the action operator take, but because take has to pull all
  *     of the data to the Driver before it can finish, the memory pressure on the Driver is very high, so take is not recommended.
  *
  * The data and the requirement for the TopN analysis are as follows:
  * chinese ls 91
  * english ww 56
  * chinese zs 90
  * chinese zl 76
  * english zq 88
  * chinese wb 95
  * chinese sj 74
  * english ts 87
  * english ys 67
  * english mz 77
  * chinese yj 98
  * english gk 96
  *
  * Requirement: list the top three in each subject

Below, the problem is solved first with the low-performance groupByKey and then with the better-performing combineByKey. Detailed explanations are given in the code. Pay attention to the underlying idea, especially how combineByKey addresses the performance problems of groupByKey; if you are interested, read the code and the thinking it embodies, because it is closely tied to how Spark itself works.

Solution using groupByKey

The test code is as follows:

package cn.xpleaf.bigdata.spark.scala.core.p3

import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable

/**
 * Notes on the TopN problem:
 *     The TopN problem can obviously be solved with the action operator take, but because take has to pull
 *     all of the data to the Driver before it can finish, the memory pressure on the Driver is very high,
 *     so take is not recommended.
 *
 * Data: one line per record, "subject name score" (see the data listed above).
 * Requirement: list the top three in each subject.
 *
 * Idea: first map each line into a tuple of (subject, name + " " + score), then groupByKey on the subject
 * to obtain gbkRDD; then, inside a map operation, use a TreeSet to keep the top three
 * (it both bounds the size and keeps the entries sorted).
 *
 * Problem:
 *     Use this scheme with caution in production, because groupByKey pulls all the data of the same key into
 *     the same partition before operating on it. That pulling is a shuffle, the distributed performance killer!
 *     Moreover, if one key has too much data, it is likely to cause data skew or an OOM, so this kind of
 *     operation needs to be avoided.
 * How?
 *     Refer to the idea of the TopN problem in MapReduce: filter the data inside each map task first, so that
 *     even though a shuffle to one node is still needed, the amount of data is greatly reduced.
 *     The idea in Spark is the same: filter the data inside each partition, then merge the filtered data of
 *     the partitions and sort again to get the final result. This clearly avoids pulling too much data into a
 *     single partition, because the per-partition filtering already reduces the data volume a lot.
 *     Which Spark operators can do this? combineByKey or aggregateByKey; their usage is covered in my earlier
 *     posts. Here combineByKey is used.
 */
object _06SparkTopNOps {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName(_06SparkTopNOps.getClass.getSimpleName())
        val sc = new SparkContext(conf)
        Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
        Logger.getLogger("org.apache.hadoop").setLevel(Level.OFF)
        Logger.getLogger("org.spark_project").setLevel(Level.OFF)

        // 1. build linesRDD
        val linesRDD: RDD[String] = sc.textFile("D:/data/spark/topn.txt")

        // 2. convert to pairsRDD
        val pairsRDD: RDD[(String, String)] = linesRDD.map(line => {
            val fields = line.split(" ")
            val subject = fields(0).trim()
            val name = fields(1).trim()
            val score = fields(2).trim()
            (subject, name + " " + score)               // ("chinese", "zs 90")
        })

        // 3. convert to gbkRDD
        val gbkRDD: RDD[(String, Iterable[String])] = pairsRDD.groupByKey()
        println("==========before TopN==========")
        gbkRDD.foreach(println)
        // (english,CompactBuffer(ww 56, zq 88, ts 87, ys 67, mz 77, gk 96))
        // (chinese,CompactBuffer(ls 91, zs 90, zl 76, wb 95, sj 74, yj 98))

        // 4. convert to retRDD
        val retRDD: RDD[(String, Iterable[String])] = gbkRDD.map(tuple => {
            var ts = new mutable.TreeSet[String]()(new MyOrdering())
            val subject = tuple._1                      // e.g. chinese
            val nameScores = tuple._2                   // e.g. ("ls 91", "zs 90", ...)
            for (nameScore <- nameScores) {             // iterate over every "name score" entry, e.g. "ls 91"
                ts.add(nameScore)                       // add it to the TreeSet
                if (ts.size > 3) {                      // if the size exceeds 3, drop the last (lowest) entry
                    ts = ts.dropRight(1)
                }
            }
            (subject, ts)
        })

        println("==========after TopN==========")
        retRDD.foreach(println)

        sc.stop()
    }
}

// Ordering used by the TreeSet in gbkRDD.map; per the requirement, scores are sorted in descending order
class MyOrdering extends Ordering[String] {
    override def compare(x: String, y: String): Int = {
        // x and y have the format "zs 90"
        val xFields = x.split(" ")
        val yFields = y.split(" ")
        val xScore = xFields(1).toInt
        val yScore = yFields(1).toInt
        val ret = yScore - xScore
        ret
    }
}

The output results are as follows:

==========before TopN==========
(chinese,CompactBuffer(ls 91, zs 90, zl 76, wb 95, sj 74, yj 98))
(english,CompactBuffer(ww 56, zq 88, ts 87, ys 67, mz 77, gk 96))
==========after TopN==========
(chinese,TreeSet(yj 98, wb 95, ls 91))
(english,TreeSet(gk 96, zq 88, ts 87))
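To make the shuffle that the comments warn about visible, the RDD lineage can be printed. A small sketch, not part of the original post, assuming the gbkRDD built above:

// toDebugString prints the lineage of gbkRDD; the ShuffledRDD stage introduced by groupByKey shows up in it
println(gbkRDD.toDebugString)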
Solution using combineByKey

The test code is as follows:

package cn.xpleaf.bigdata.spark.scala.core.p3

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

import scala.collection.mutable

/**
 * Use the combineByKey operator to optimize the previous TopN solution.
 * For how to use the combineByKey operator, see my earlier blog posts, which contain very detailed examples.
 * It must be mastered, because it is very important.
 */
object _07SparkTopNOps {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName(_07SparkTopNOps.getClass().getSimpleName())
        val sc = new SparkContext(conf)
        Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
        Logger.getLogger("org.apache.hadoop").setLevel(Level.OFF)
        Logger.getLogger("org.spark_project").setLevel(Level.OFF)

        // 1. build linesRDD
        val linesRDD: RDD[String] = sc.textFile("D:/data/spark/topn.txt")

        // 2. convert to pairsRDD
        val pairsRDD: RDD[(String, String)] = linesRDD.map(line => {
            val fields = line.split(" ")
            val subject = fields(0).trim()
            val name = fields(1).trim()
            val score = fields(2).trim()
            (subject, name + " " + score)               // ("chinese", "zs 90")
        })
        println("==========before TopN==========")
        pairsRDD.foreach(println)
        // (chinese,sj 74)
        // (chinese,ls 91)
        // (english,ts 87)
        // (english,ww 56)
        // (english,ys 67)
        // (chinese,zs 90)
        // (english,mz 77)
        // (chinese,zl 76)
        // (chinese,yj 98)
        // (english,zq 88)
        // (english,gk 96)
        // (chinese,wb 95)

        // 3. convert to cbkRDD
        val cbkRDD: RDD[(String, mutable.TreeSet[String])] = pairsRDD.combineByKey(createCombiner, mergeValue, mergeCombiners)

        println("==========after TopN==========")
        cbkRDD.foreach(println)
        // (chinese,TreeSet(yj 98, wb 95, ls 91))
        // (english,TreeSet(gk 96, zq 88, ts 87))

        sc.stop()
    }

    // Create the container: a TreeSet is returned as the container for the values of the same key within a partition.
    // MyOrdering is the descending-score Ordering defined alongside the groupByKey version (same package).
    def createCombiner(nameScore: String): mutable.TreeSet[String] = {
        // nameScore has the format "zs 90"; the custom ordering MyOrdering sorts in descending order of score
        val ts = new mutable.TreeSet[String]()(new MyOrdering())
        ts.add(nameScore)
        ts
    }

    // Merge a value into the set of the same key within a partition; the TreeSet keeps the entries sorted
    def mergeValue(ts: mutable.TreeSet[String], nameScore: String): mutable.TreeSet[String] = {
        ts.add(nameScore)
        // if there are more than 3 entries, drop the last one and return the result
        // (dropRight does not change the set in place, it returns a new set)
        if (ts.size > 3) ts.dropRight(1) else ts
    }

    // Merge the value sets of the same key coming from different partitions, again keeping them sorted with a TreeSet
    def mergeCombiners(ts1: mutable.TreeSet[String], ts2: mutable.TreeSet[String]): mutable.TreeSet[String] = {
        var newTS = new mutable.TreeSet[String]()(new MyOrdering())
        // add the values of the set from partition 1 to the new TreeSet, sorting and bounding the size
        for (nameScore <- ts1) {
            newTS.add(nameScore)
            if (newTS.size > 3) {                       // if the size exceeds 3, drop one and reassign
                newTS = newTS.dropRight(1)
            }
        }
        // add the values of the set from partition 2 to the new TreeSet, sorting and bounding the size
        for (nameScore <- ts2) {
            newTS.add(nameScore)
            if (newTS.size > 3) {                       // if the size exceeds 3, drop one and reassign
                newTS = newTS.dropRight(1)
            }
        }
        newTS
    }
}

The output results are as follows:

==========before TopN==========
(chinese,ls 91)
(chinese,sj 74)
(english,ww 56)
(english,ts 87)
(chinese,zs 90)
(english,ys 67)
(chinese,zl 76)
(english,mz 77)
(english,zq 88)
(chinese,yj 98)
(chinese,wb 95)
(english,gk 96)
==========after TopN==========
(english,TreeSet(gk 96, zq 88, ts 87))
(chinese,TreeSet(yj 98, wb 95, ls 91))
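The comments in the groupByKey version mention that aggregateByKey can play the same role as combineByKey. The following is only a minimal sketch of that variant, not part of the original post; it assumes the pairsRDD and the MyOrdering class defined above, and the helper name addBounded is made up here:

// keep at most the top three entries; dropRight returns a new set rather than mutating in place
def addBounded(ts: mutable.TreeSet[String], nameScore: String): mutable.TreeSet[String] = {
  ts.add(nameScore)
  if (ts.size > 3) ts.dropRight(1) else ts
}

val topnRDD = pairsRDD.aggregateByKey(
  new mutable.TreeSet[String]()(new MyOrdering())          // zero value: an empty, descending-ordered set per key
)(
  addBounded,                                              // merge one value into a partition-local set
  (ts1, ts2) => ts2.foldLeft(ts1)(addBounded)              // merge the sets produced by two partitions
)
topnRDD.foreach(println)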
