Spark Rdd coalesce () method and repartition () method, rddcoalesce

Last Update:2016-04-15 Source: Internet

Author: User

Tags spark rdd

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Spark Rdd coalesce () method and repartition () method, rddcoalesce

In Rdd of Spark, Rdd is partitioned.

Sometimes you need to reset the number of Rdd partitions. For example, in Rdd partitions, there are many Rdd partitions, but the data volume of each Rdd is small. You need to set a reasonable partition. Or you need to increase the number of Rdd partitions. In addition, you can set the number of generated files by setting an Rdd partition.

You can reset the Rdd partition by using the coalesce () method and repartition () method ().

What are the differences between the two methods? Let's look at the source code:

  def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)      : RDD[T] = withScope {    if (shuffle) {      /** Distributes elements evenly across output partitions, starting from a random partition. */      val distributePartition = (index: Int, items: Iterator[T]) => {        var position = (new Random(index)).nextInt(numPartitions)        items.map { t =>          // Note that the hash code of the key will just be the key itself. The HashPartitioner          // will mod it with the number of total partitions.          position = position + 1          (position, t)        }      } : Iterator[(Int, T)]      // include a shuffle step so that our upstream tasks are still distributed      new CoalescedRDD(        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),        new HashPartitioner(numPartitions)),        numPartitions).values    } else {      new CoalescedRDD(this, numPartitions)    }  }

The coalesce () method returns the Rdd of a specified partition.

If a narrow dependency is generated, no shuffle will occur. For example, if 1000 partitions are reset to 10, no shuffle will occur.

For more information about Rdd dependencies, see here. There are two types of Rdd dependencies: narrow dependency and wide dependency.

Narrow dependency means that a partition of the parent Rdd can only be referenced by a partition of the Child Rdd, that is, a partition of the parent Rdd corresponds to a partition of the Child Rdd, or multiple parent Rdd partitions correspond to one sub-Rdd partition.

The wide dependency means that the sub-RDD partition depends on multiple or all partitions of the parent RDD, that is, one partition of a parent RDD corresponds to multiple partitions of a sub-RDD. One parent RDD partition corresponds to multiple sub-RDD partitions, which are divided into two situations: one parent RDD corresponds to all sub-RDD partitions (Join without collaborative Division) or one parent RDD corresponds to not all multiple RDD partitions (such as groupByKey ).

As shown in: map is a narrow dependency, while join leads to wide dependency.

Back to the partition, if the number of partitions changes dramatically, such as setting numPartitions = 1, this may cause less nodes to run computing than you think. To avoid this situation, you can set shuffle = true,

This will increase the shuffle operation.

This error may occur when the number of partitions changes from thousands of partitions in the parent Rdd to several.

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 77.0 failed 4 times, most recent failure: Lost task 1.3 in stage 77.0 (TID 6334, 192.168.8.61): java.io.IOException: Unable to acquire 16777216 bytes of memory        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:332)        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:461)        at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:139)        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:489)        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)        at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)        at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)        at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)        at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)        at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)        at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)        at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)        at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)        at org.apache.spark.scheduler.Task.run(Task.scala:88)        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)        at java.lang.Thread.run(Thread.java:744)

This error can be solved by setting shuffle to true.

When the number of partitions of the parent Rdd increases, for example, the Rdd partition is 100 and is set to 1000. If shuffle is set to false, this does not work.

At this time, you need to set shuffle to true, then Rdd will return the Rdd of a 1000 partition after shuffle. The data partition method uses hash partitioner by default.

Finally, let's take a look at the source code of the repartition () method:

  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {    coalesce(numPartitions, shuffle = true)  }

From the source code, we can see that the repartition () method is the case where the coalesce () method shuffle is true. If you only want to reduce the number of partitions of the parent Rdd, and the number of partitions to be set is not very intense, you can directly use the coalesce method to avoid shuffle operations and improve efficiency.

If there are any mistakes or omissions, please kindly advise and I will correct them.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More