Difference between repartition and partitionBy in Spark

Source: Internet
Author: User
Tags: shuffle

Both repartition and partitionBy redistribute data across partitions, and both ultimately go through a HashPartitioner. The difference is that partitionBy is only available on a pair RDD, and even when both are applied to a pair RDD, the results differ:
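A minimal sketch of such a comparison (local mode; the SparkContext setup and the sample pairs are assumptions made for illustration, not taken from the original post):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object RepartitionVsPartitionBy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("demo"))

    // A small pair RDD with duplicate keys, spread over 2 input partitions
    val pairs = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e")), 2)

    // repartition(3): records are spread evenly, with no regard for the key
    pairs.repartition(3).glom().collect()
      .zipWithIndex.foreach { case (p, i) => println(s"repartition  part $i: ${p.mkString(", ")}") }

    // partitionBy(HashPartitioner(3)): every record with the same key lands in partition key.hashCode % 3
    pairs.partitionBy(new HashPartitioner(3)).glom().collect()
      .zipWithIndex.foreach { case (p, i) => println(s"partitionBy  part $i: ${p.mkString(", ")}") }

    sc.stop()
  }
}

With partitionBy, duplicate keys such as 1 and 3 always end up together in one partition; with repartition they usually do not.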

It is not hard to see that only the result of partitionBy is what we expected. Let's open the source code of repartition to see why:

/**
 * Return a new RDD that has exactly numPartitions partitions.
 *
 * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
 * a shuffle to redistribute data.
 *
 * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
 * which can avoid performing a shuffle.
 *
 * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
 */
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

/**
 * Return a new RDD that is reduced into `numPartitions` partitions.
 *
 * This results in a narrow dependency, e.g. if you go from 1000 partitions
 * to 100 partitions, there will not be a shuffle, instead each of the 100
 * new partitions will claim 10 of the current partitions. If a larger number
 * of partitions is requested, it will stay at the current number of partitions.
 *
 * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
 * this may result in your computation taking place on fewer nodes than
 * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
 * you can pass shuffle = true. This will add a shuffle step, but means the
 * current upstream partitions will be executed in parallel (per whatever
 * the current partitioning is).
 *
 * @note With shuffle = true, you can actually coalesce to a larger number
 * of partitions. This is useful if you have a small number of partitions,
 * say 100, potentially with a few partitions being abnormally large. Calling
 * coalesce(1000, shuffle = true) will result in 1000 partitions with the
 * data distributed using a hash partitioner. The optional partition coalescer
 * passed in must be serializable.
 */
def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]
    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions,
      partitionCoalescer).values
  } else {
    new CoalescedRDD(this, numPartitions, partitionCoalescer)
  }
}
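The key-independence is easy to see in isolation. Below is a minimal standalone sketch (not Spark code; the partition index, partition count, and sample records are made up) that mimics what distributePartition does: pick a random starting position per input partition, then hand records out round-robin, so the key value never enters the calculation.

import scala.util.Random
import scala.util.hashing

object DistributeSketch {
  def main(args: Array[String]): Unit = {
    val numPartitions = 3
    // Pretend these (key, value) records all sit in input partition index 0
    val index = 0
    val items = Iterator((1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e"))

    // Same idea as distributePartition: a random start per input partition, then round-robin
    var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
    val assigned = items.map { t =>
      position += 1
      // HashPartitioner(n) on a non-negative Int key reduces to key % n
      (position % numPartitions, t)
    }

    // Records sharing a key (e.g. key 1) can land in different target partitions
    assigned.foreach { case (target, record) => println(s"partition $target <- $record") }
  }
}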

Even for a pair RDD, repartition does not use the record's own key: it tags each element with a position derived from a random per-partition starting offset and distributes by that, rather than hashing the original key!
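One practical consequence, shown as a sketch (it assumes an existing SparkContext named sc, e.g. in spark-shell, and made-up sample pairs): partitionBy records its partitioner on the resulting RDD, so later key-based operations that use the same partitioner can reuse that layout, while the RDD returned by repartition carries no partitioner at all.

import org.apache.spark.HashPartitioner

// assumes an existing SparkContext `sc` (e.g. in spark-shell)
val pairs = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c")), 2)

// repartition discards any key-based layout: the result has no partitioner
println(pairs.repartition(3).partitioner)                       // None

// partitionBy keeps the HashPartitioner on the result, so same-key records are co-located
println(pairs.partitionBy(new HashPartitioner(3)).partitioner)  // Some(org.apache.spark.HashPartitioner@...)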
