"Spark Java API" Action (4): sortBy, takeOrdered, takeSample

Source: Internet
Author: User
sortBy official documentation description:
Return this RDD sorted by the given key function.
Function prototype:
def sortBy[S](f: JFunction[T, S], ascending: Boolean, numPartitions: Int): JavaRDD[T]

sortBy sorts the elements of the RDD by the key that the given function f computes for each element. Source analysis:

def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)
    .sortByKey(ascending, numPartitions)
    .values
}

/**
 * Creates tuples of the elements in this RDD by applying `f`.
 */
def keyBy[K](f: T => K): RDD[(K, T)] = withScope {
  val cleanedF = sc.clean(f)
  map(x => (cleanedF(x), x))
}

As the source shows, sortBy is implemented on top of sortByKey. The function accepts three parameters. The first is a function f that takes an element of type T and returns its sort key of type K; keyBy applies f through map, turning each element into a (key, element) tuple, and after sorting, values drops the keys again, so the result has the same element type as the original RDD. The second parameter, ascending, sets the sort direction; in the Scala API it is optional and defaults to true (ascending). The third parameter, numPartitions, sets the number of partitions of the sorted RDD; in the Scala API it is also optional and defaults to the number of partitions before sorting, this.partitions.length. The Java prototype shown above requires all three arguments. Example:

import java.util.Arrays;
import java.util.List;
import java.util.Random;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
// The seed and bound were garbled in the source; 100 is an assumed value
final Random random = new Random(100);
// Transform the RDD so that each element consists of two parts
JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {
  @Override
  public String call(Integer v1) throws Exception {
    return v1.toString() + "_" + random.nextInt(100);
  }
});
System.out.println(javaRDD1.collect());
// Sort by the second part of each element in the RDD.
// The key is a String, so keys are compared lexicographically, not numerically.
JavaRDD<String> resultRDD = javaRDD1.sortBy(new Function<String, Object>() {
  @Override
  public Object call(String v1) throws Exception {
    return v1.split("_")[1];
  }
}, false, 3);
System.out.println("result--------------" + resultRDD.collect());
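
The keyBy, sortByKey, values chain described above can be sketched on an ordinary local list. This is only a plain-Java illustration of the pattern, not Spark code: the class SortBySketch and its sortBy helper are hypothetical names, and the sample strings are made up. Unlike the example above, the key is parsed to an Integer here so that the ordering is numeric.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class SortBySketch {
    // Hypothetical helper (not part of Spark): mirrors keyBy -> sortByKey -> values on a local list.
    static <T, K extends Comparable<K>> List<T> sortBy(List<T> input, Function<T, K> f, boolean ascending) {
        // keyBy: pair every element with the key computed by f
        List<Map.Entry<K, T>> keyed = new ArrayList<>();
        for (T x : input) {
            keyed.add(new SimpleEntry<>(f.apply(x), x));
        }
        // sortByKey: order the pairs by key, ascending or descending
        Comparator<Map.Entry<K, T>> byKey = Map.Entry.comparingByKey();
        keyed.sort(ascending ? byKey : byKey.reversed());
        // values: drop the keys, keeping the original elements in sorted order
        List<T> out = new ArrayList<>();
        for (Map.Entry<K, T> e : keyed) {
            out.add(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("5_12", "1_40", "4_7", "2_93");
        // Sort descending by the numeric value after the underscore
        List<String> sorted = sortBy(data, s -> Integer.parseInt(s.split("_")[1]), false);
        System.out.println(sorted); // prints [2_93, 1_40, 5_12, 4_7]
    }
}
```

The element type of the result is the same as that of the input, because the key only exists between the keyBy step and the values step.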
takeOrdered official documentation description:
Returns the first k (smallest) elements from this RDD using the natural ordering for T while maintaining the order.
Function prototypes:
def takeOrdered(num: Int): JList[T]
def takeOrdered(num: Int, comp: Comparator[T]): JList[T]

The takeOrdered function returns the first num elements of the RDD, ordered either by the default natural (ascending) ordering or by a comparator that you specify. Source analysis:

def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  if (num == 0) {
    Array.empty
  } else {
    val mapRDDs = mapPartitions { items =>
      // Priority keeps the largest elements, so let's reverse the ordering.
      val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
      queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
      Iterator.single(queue)
    }
    if (mapRDDs.partitions.length == 0) {
      Array.empty
    } else {
      mapRDDs.reduce { (queue1, queue2) =>
        queue1 ++= queue2
        queue1
      }.toArray.sorted(ord)
    }
  }
}

As the source shows, mapPartitions sorts the elements inside each partition, and the local sort of each partition keeps only num elements. Note that each element of the resulting mapRDDs is a BoundedPriorityQueue. The reduce function then merges these queues, and the merged queue is converted to an array and sorted globally. Example:

import java.io.Serializable;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

// Note that the Comparator must be serializable
public static class TakeOrderedComparator implements Serializable, Comparator<Integer> {
    @Override
    public int compare(Integer o1, Integer o2) {
        // Reverse the natural ordering, so the largest elements come first
        return -o1.compareTo(o2);
    }
}
List<Integer> data = Arrays.asList(5, 1, 0, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
System.out.println("takeOrdered-----1-------------" + javaRDD.takeOrdered(2));
List<Integer> list = javaRDD.takeOrdered(2, new TakeOrderedComparator());
System.out.println("takeOrdered----2--------------" + list);
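
The per-partition strategy in the source analysis (keep at most num elements in each partition, merge the partial results, then sort) can be sketched in plain Java. This is a rough local illustration, not Spark's implementation: java.util.PriorityQueue with manual eviction stands in for BoundedPriorityQueue, partitions are modeled as nested lists, and the names TakeOrderedSketch and topOfPartition are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TakeOrderedSketch {
    // Hypothetical helper: keep only the num smallest elements (according to cmp) of one partition.
    static <T> PriorityQueue<T> topOfPartition(List<T> partition, int num, Comparator<T> cmp) {
        // With the reversed comparator the largest kept element sits at the head,
        // so it is the one evicted when the queue grows beyond num.
        PriorityQueue<T> queue = new PriorityQueue<>(cmp.reversed());
        for (T x : partition) {
            queue.add(x);
            if (queue.size() > num) {
                queue.poll();
            }
        }
        return queue;
    }

    static <T> List<T> takeOrdered(List<List<T>> partitions, int num, Comparator<T> cmp) {
        if (num == 0) {
            return new ArrayList<>();
        }
        // Merge the per-partition queues, again keeping at most num elements
        PriorityQueue<T> merged = new PriorityQueue<>(cmp.reversed());
        for (List<T> partition : partitions) {
            for (T x : topOfPartition(partition, num, cmp)) {
                merged.add(x);
                if (merged.size() > num) {
                    merged.poll();
                }
            }
        }
        // Final global sort of the surviving elements
        List<T> result = new ArrayList<>(merged);
        result.sort(cmp);
        return result;
    }

    public static void main(String[] args) {
        // The data from the example above, split by hand into three assumed partitions
        List<List<Integer>> partitions = Arrays.asList(
            Arrays.asList(5, 1), Arrays.asList(0, 4), Arrays.asList(4, 2, 2));
        System.out.println(takeOrdered(partitions, 2, Comparator.<Integer>naturalOrder())); // prints [0, 1]
        System.out.println(takeOrdered(partitions, 2, Comparator.<Integer>reverseOrder())); // prints [5, 4]
    }
}
```

Because every partition contributes at most num elements, the merge step only ever handles a small amount of data, regardless of how large each partition is.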
takeSample official documentation description:
Return a fixed-size sampled subset of this RDD in an array
Function prototypes:
def takeSample(withReplacement: Boolean, num: Int): JList[T]
def takeSample(withReplacement: Boolean, num: Int, seed: Long): JList[T]

The takeSample function returns an array of num elements sampled at random from the dataset. Source analysis:

def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T] = {
  val numStDev = 10.0
  if (num < 0) {
    throw new IllegalArgumentException("Negative number of elements requested")
  } else if (num == 0) {
    return new Array[T](0)
  }
  val initialCount = this.count()
  if (initialCount == 0) {
    return new Array[T](0)
  }
  val maxSampleSize = Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt
  if (num > maxSampleSize) {
    throw new IllegalArgumentException("Cannot support a sample size > Int.MaxValue - " +
      s"$numStDev * math.sqrt(Int.MaxValue)")
  }
  val rand = new Random(seed)
  if (!withReplacement && num >= initialCount) {
    return Utils.randomizeInPlace(this.collect(), rand)
  }
  val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
    withReplacement)
  var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
  // If the first sample didn't turn out large enough, keep trying to take samples;
  // this shouldn't happen often because we have a big multiplier for the initial size
  var numIters = 0
  while (samples.length < num) {
    logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    numIters += 1
  }
  Utils.randomizeInPlace(samples, rand).take(num)
}

As the source shows, the takeSample function is similar to the sample function. It accepts three parameters. The first, withReplacement, indicates whether sampling is done with replacement: true means sampling with replacement, false means sampling without replacement. The second, num, is the number of sampled elements to return, which is also the difference between takeSample and sample (sample takes a fraction instead). The third, seed, is the seed for the random number generator. Internally, takeSample first computes fraction, the sampling ratio, then calls sample and collect() to gather the sampled data, repeats the sampling if too few elements came back, and finally calls take to return num elements. Note that if num is greater than or equal to the number of elements in the RDD and sampling is without replacement, all elements of the RDD are returned in random order. Example:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

List<Integer> data = Arrays.asList(5, 1, 0, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
System.out.println("takeSample-----1-------------" + javaRDD.takeSample(true, 2));
System.out.println("takeSample-----2-------------" + javaRDD.takeSample(true, 2, 100));
// Returns 20 elements
System.out.println("takeSample-----3-------------" + javaRDD.takeSample(true, 20, 100));
// Returns 7 elements
System.out.println("takeSample-----4-------------" + javaRDD.takeSample(false, 20, 100));
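
The control flow described above (oversample by a fraction, re-sample if the result is too small, shuffle, then take num) can be sketched on a local list. This is a simplified plain-Java illustration that covers sampling without replacement only. The names TakeSampleSketch and bernoulliSample are hypothetical, and the oversampling factor of 2 is an assumption chosen for brevity; it is not the formula used by SamplingUtils.computeFractionForSampleSize.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class TakeSampleSketch {
    // Hypothetical local analogue of takeSample(false, num, seed).
    static <T> List<T> takeSample(List<T> data, int num, long seed) {
        if (num < 0) {
            throw new IllegalArgumentException("Negative number of elements requested");
        }
        if (num == 0 || data.isEmpty()) {
            return new ArrayList<>();
        }
        Random rand = new Random(seed);
        // Asking for at least as many elements as exist: return everything, shuffled
        if (num >= data.size()) {
            List<T> all = new ArrayList<>(data);
            Collections.shuffle(all, rand);
            return all;
        }
        // Assumed oversampling factor, so that one pass usually yields enough elements
        double fraction = Math.min(1.0, 2.0 * num / data.size());
        List<T> samples = bernoulliSample(data, fraction, rand);
        // Re-sample while the sample is too small
        while (samples.size() < num) {
            samples = bernoulliSample(data, fraction, rand);
        }
        Collections.shuffle(samples, rand);
        return new ArrayList<>(samples.subList(0, num));
    }

    // Keep each element independently with probability fraction
    static <T> List<T> bernoulliSample(List<T> data, double fraction, Random rand) {
        List<T> out = new ArrayList<>();
        for (T x : data) {
            if (rand.nextDouble() < fraction) {
                out.add(x);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(5, 1, 0, 4, 4, 2, 2);
        System.out.println(takeSample(data, 2, 100L).size());  // prints 2
        System.out.println(takeSample(data, 20, 100L).size()); // prints 7
    }
}
```

With a fixed seed the same elements come back on every run, which is why passing a seed to takeSample is useful for reproducible tests.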
