Spark Growth Path (4): The Partitioning System

Spark Partitioners: HashPartitioner and RangePartitioner Explained
Partitioner Overview

The partitioners are classified as follows:

- org.apache.spark: HashPartitioner and RangePartitioner
- org.apache.spark.scheduler: CoalescedPartitioner
- org.apache.spark.sql.execution: CoalescedPartitioner and PartitionIdPassthrough
- org.apache.spark.mllib.linalg.distributed: GridPartitioner
- org.apache.spark.api.python: PythonPartitioner

That is seven partitioners in total; this article focuses on HashPartitioner and RangePartitioner under org.apache.spark.

A partitioner operates only on RDDs of (k, v) form.

Partitioner

Partitioner is an abstract class that defines the members every partitioner must have:

abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}
numPartitions returns the number of partitions; getPartition returns the partition ID for a given key.
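As a concrete illustration, here is a self-contained sketch of a custom partitioner. The Partitioner trait is re-declared locally so the snippet runs without a Spark dependency, and the route-by-first-character scheme is purely hypothetical:

```scala
// Minimal local stand-in for org.apache.spark.Partitioner.
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Illustrative partitioner: route keys to partitions by the
// first character of their string form.
class FirstLetterPartitioner(val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val s = key.toString
    if (s.isEmpty) 0 else s.charAt(0).toInt % numPartitions
  }
}

val p = new FirstLetterPartitioner(4)
// 'a' has code 97, so "apple" lands in partition 97 % 4 = 1.
```

The only contract a subclass must honor is that getPartition always returns a value in [0, numPartitions).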

The Partitioner class also has a companion object, which provides a default partitioner:

object Partitioner {
  /**
   * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
   *
   * If any of the RDDs already have a partitioner, choose that one.
   *
   * Otherwise, we use a default HashPartitioner. For the number of partitions, if
   * spark.default.parallelism is set, then we'll use the value from SparkContext
   * defaultParallelism, otherwise we'll use the max number of upstream partitions.
   *
   * Unless spark.default.parallelism is set, the number of partitions will be the
   * same as the number of partitions in the largest upstream RDD, as this should
   * be least likely to cause out-of-memory errors.
   *
   * We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
   */
  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val rdds = (Seq(rdd) ++ others)
    val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
    if (hasPartitioner.nonEmpty) {
      hasPartitioner.maxBy(_.partitions.length).partitioner.get
    } else {
      if (rdd.context.conf.contains("spark.default.parallelism")) {
        new HashPartitioner(rdd.context.defaultParallelism)
      } else {
        new HashPartitioner(rdds.map(_.partitions.length).max)
      }
    }
  }
}

The defaultPartitioner method spells out the default partitioner selection strategy: if any of the parent RDDs already has a partitioner, the one from the RDD with the most partitions is reused; if none of them has a partitioner (i.e. none is a pair RDD with a partitioner), a HashPartitioner is returned. Its number of partitions is determined in one of two ways: if spark.default.parallelism is set, that value is used; otherwise the largest partition count among the parent RDDs is used.

HashPartitioner
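That decision order can be paraphrased with plain collections. FakeRdd below is a hypothetical stand-in capturing just the two facts defaultPartitioner inspects; this is a sketch of the rule, not Spark code:

```scala
// Hypothetical stand-in: an RDD reduced to the two facts the rule needs.
case class FakeRdd(numPartitions: Int, hasPartitioner: Boolean)

// Mirrors the decision order: an existing partitioner wins (largest one);
// otherwise spark.default.parallelism if set; otherwise the largest
// upstream partition count.
def defaultNumPartitions(rdds: Seq[FakeRdd], defaultParallelism: Option[Int]): Int = {
  val withPartitioner = rdds.filter(r => r.hasPartitioner && r.numPartitions > 0)
  if (withPartitioner.nonEmpty) withPartitioner.maxBy(_.numPartitions).numPartitions
  else defaultParallelism.getOrElse(rdds.map(_.numPartitions).max)
}
```

Note that an existing partitioner takes precedence even over an explicitly configured spark.default.parallelism.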

class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}
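The key helper is Utils.nonNegativeMod. A standalone re-implementation, assuming the standard "shift negative remainders" behavior, would look like:

```scala
// Scala's % can return a negative remainder for negative hash codes,
// so shift the result back into the range [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

val id = nonNegativeMod("spark".hashCode, 4) // always a valid partition id in 0..3
// -7 % 3 == -1 in Scala, so nonNegativeMod(-7, 3) shifts it to 2
```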

This is relatively simple: the partition ID is the key's hashCode modulo the number of partitions; if the remainder is negative, the number of partitions is added to it so the result is non-negative. The returned value is the partition ID the key belongs to.

RangePartitioner

class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true)
  extends Partitioner {

  // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
  require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")

  private var ordering = implicitly[Ordering[K]]

  // An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
      val sampleSize = math.min(20.0 * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        val candidates = ArrayBuffer.empty[(K, Float)]
        val imbalancedPartitions = mutable.Set.empty[Int]
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
            imbalancedPartitions += idx
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.length).toFloat
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          val resampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= resampled.map(x => (x, weight))
        }
        RangePartitioner.determineBounds(candidates, partitions)
      }
    }
  }

  // ... (getPartition, equals, etc. omitted)
}

There is not much more to explain line by line; the referenced article already covers it. The important part is the bounds computation, which relies on reservoir sampling, a sampling algorithm for inputs whose total size is not known in advance.
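For reference, the core idea of reservoir sampling can be sketched standalone (Spark's actual implementation lives in SamplingUtils.reservoirSampleAndCount, used by RangePartitioner.sketch; the version below is a simplified illustration):

```scala
import scala.util.Random

// Keep a uniform random sample of size k from a stream of unknown length:
// fill the reservoir with the first k items, then replace existing items
// with decreasing probability k/i for the i-th item seen.
def reservoirSample[T](items: Iterator[T], k: Int, rng: Random): Vector[T] = {
  val reservoir = scala.collection.mutable.ArrayBuffer.empty[T]
  var i = 0
  for (item <- items) {
    i += 1
    if (reservoir.length < k) {
      reservoir += item
    } else {
      val j = rng.nextInt(i)            // uniform in [0, i)
      if (j < k) reservoir(j) = item    // replace with probability k/i
    }
  }
  reservoir.toVector
}
```

The invariant is that after seeing i items, each of them sits in the reservoir with probability k/i, which is exactly what makes a single pass sufficient.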

If you just want to understand the partitioning strategy, you can analyze the Spark 1.1 code directly; the code after 1.1 was optimized for performance (using reservoir sampling to reduce the number of full passes over the data), but the strategy is basically the same: sampling is used to obtain boundary values between partitions, and each incoming key is matched against those bounds to determine which partition it belongs to. It follows that the partitions are ordered relative to each other: all keys in partition A are smaller than all keys in partition B (for A before B), but the data within a partition is not sorted.
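That ordering guarantee comes from how keys are matched against the computed bounds. A simplified, hypothetical version of the lookup over Int keys (the real getPartition binary-searches rangeBounds when there are many partitions; a linear scan is shown here for clarity):

```scala
// bounds holds upper bounds for the first (numPartitions - 1) partitions;
// a key goes to the first partition whose bound it does not exceed,
// and keys above every bound go to the last partition.
def rangePartition(key: Int, bounds: Array[Int]): Int = {
  var p = 0
  while (p < bounds.length && key > bounds(p)) p += 1
  p
}

// With bounds Array(10, 20): keys <= 10 -> 0, keys in (10, 20] -> 1, keys > 20 -> 2.
```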
