Spark RDD coalesce() and repartition() methods


In Spark, an RDD is divided into partitions.

Sometimes you need to reset the number of partitions of an RDD. For example, an RDD may have many partitions that each hold very little data, in which case you should set a more reasonable partition count; in other cases you may need to increase the number of partitions. In addition, because each partition is written out as one file, setting the partition count also controls the number of generated output files, as in the sketch below.
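
A minimal sketch of the output-file use case (the SparkContext `sc`, the data, and the path are illustrative assumptions, not from the original article):

    // Each RDD partition is saved as one file, so coalescing to 5 partitions
    // before saving yields 5 output files.
    val rdd = sc.parallelize(1 to 100000, 100)
    rdd.coalesce(5).saveAsTextFile("/tmp/coalesce-demo")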

You can reset the number of partitions of an RDD with the coalesce() method and the repartition() method.

What are the differences between the two methods? Let's look at the source code:

    def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
        : RDD[T] = withScope {
      if (shuffle) {
        /** Distributes elements evenly across output partitions, starting from a random partition. */
        val distributePartition = (index: Int, items: Iterator[T]) => {
          var position = (new Random(index)).nextInt(numPartitions)
          items.map { t =>
            // Note that the hash code of the key will just be the key itself. The HashPartitioner
            // will mod it with the number of total partitions.
            position = position + 1
            (position, t)
          }
        } : Iterator[(Int, T)]

        // include a shuffle step so that our upstream tasks are still distributed
        new CoalescedRDD(
          new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
          new HashPartitioner(numPartitions)),
          numPartitions).values
      } else {
        new CoalescedRDD(this, numPartitions)
      }
    }

The coalesce() method returns a new RDD with the specified number of partitions.

When shuffle is false (the default) and the partition count is reduced, coalesce() creates a narrow dependency, so no shuffle occurs. For example, resetting 1000 partitions to 10 does not trigger a shuffle.
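
A minimal sketch of such a shuffle-free reduction (`sc`, the data, and the partition counts are illustrative):

    // 1000 small partitions coalesced down to 10 without a shuffle:
    // each output partition simply reads ~100 parent partitions (narrow dependency).
    val rdd = sc.parallelize(1 to 1000000, 1000)
    val coalesced = rdd.coalesce(10)          // shuffle defaults to false
    println(coalesced.partitions.length)      // 10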

As background on RDD dependencies: there are two types, narrow dependencies and wide dependencies.

A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD: either one parent partition maps to one child partition, or several parent partitions map to a single child partition.

A wide dependency means that a child partition depends on multiple (or all) partitions of the parent RDD; equivalently, one parent partition feeds multiple child partitions. This comes in two forms: one parent partition corresponds to all child partitions (for example, a join without co-partitioning), or one parent partition corresponds to several but not all child partitions (for example, groupByKey).

For example, map produces a narrow dependency, while join (without co-partitioning) produces a wide dependency.
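
A rough illustration of the two dependency types (the data here is made up for demonstration):

    // map: one parent partition -> one child partition (narrow dependency).
    val pairs = sc.parallelize(Seq("a", "b", "a", "c"), 4).map(w => (w, 1))

    // groupByKey: a child partition may pull records from every parent
    // partition (wide dependency), which forces a shuffle.
    val grouped = pairs.groupByKey()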

Back to partitioning: if the number of partitions drops drastically, for example numPartitions = 1, the computation may run on fewer nodes than you expect (a single node in the numPartitions = 1 case). To avoid this, set shuffle = true. This adds a shuffle step, but it keeps the upstream computation distributed.
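
A hedged sketch of that workaround (assuming some existing `rdd` with many partitions):

    // coalesce(1) with shuffle = false would collapse all upstream work into a
    // single task; shuffle = true keeps the upstream stage parallel, and only
    // the final stage runs with one partition.
    val single = rdd.coalesce(1, shuffle = true)   // equivalent to rdd.repartition(1)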

An error like the following may occur when the partition count drops from thousands of partitions in the parent RDD to just a few:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 77.0 failed 4 times, most recent failure: Lost task 1.3 in stage 77.0 (TID 6334, 192.168.8.61): java.io.IOException: Unable to acquire 16777216 bytes of memory
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:332)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:461)
        at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:139)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:489)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
        at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

This error can be solved by setting shuffle to true.

When you want to increase the number of partitions of the parent RDD, for example from 100 to 1000, setting shuffle to false does nothing: coalesce() cannot increase the partition count without a shuffle.

In that case, set shuffle to true; the call then returns an RDD with 1000 partitions after a shuffle. By default the data is redistributed with a HashPartitioner.
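
A sketch of the difference (assuming some existing `rdd` with 100 partitions; the numbers are illustrative):

    val noOp  = rdd.coalesce(1000)                  // shuffle = false: still 100 partitions
    val wider = rdd.coalesce(1000, shuffle = true)  // 1000 partitions via HashPartitioner
    println(noOp.partitions.length)                 // 100
    println(wider.partitions.length)                // 1000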

Finally, let's take a look at the source code of the repartition() method:

    def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
      coalesce(numPartitions, shuffle = true)
    }

From the source code, we can see that repartition() is simply coalesce() with shuffle set to true. So if you only want to reduce the number of partitions of the parent RDD, and the reduction is not drastic, call coalesce() directly to avoid the shuffle and improve efficiency.
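
As a rule of thumb (again assuming some existing `rdd`; the counts are illustrative):

    // Moderate decrease: coalesce avoids the shuffle entirely.
    val fewer = rdd.coalesce(50)
    // Increase, or drastic decrease: repartition (i.e. coalesce with shuffle = true).
    val more  = rdd.repartition(1000)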

If there are any mistakes or omissions, please point them out and I will correct them.

 
