RDD Partition 2GB Limit

Source: Internet
Author: User
Tags: spark, rdd

Purpose of this article

While recently using Spark to process large amounts of data, I ran into the 2GB-per-partition limit pitfall. I found a workaround and collected some related information from around the internet, and I'm recording it here as a memo.

Symptoms

When this problem occurs, the Spark log reports errors like the following.

Log snippet 1

15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID, 10.215.149.47): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
    at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)

Log snippet 2

15/04/16 14:19:45 INFO scheduler.TaskSetManager: Starting task 20.2 in stage 6.0 (TID 146, 10.196.151.213, PROCESS_LOCAL, 1666 bytes)
15/04/16 14:19:45 INFO scheduler.TaskSetManager: Lost task 20.2 in stage 6.0 (TID 146) on executor 10.196.151.213: java.lang.IllegalArgumentException (Size exceeds Integer.MAX_VALUE) [duplicate 1]
15/04/16 14:19:45 INFO scheduler.TaskSetManager: Starting task 20.3 in stage 6.0 (TID 147, 10.196.151.213, PROCESS_LOCAL, 1666 bytes)
15/04/16 14:19:45 INFO scheduler.TaskSetManager: Lost task 20.3 in stage 6.0 (TID 147) on executor 10.196.151.213: java.lang.IllegalArgumentException (Size exceeds Integer.MAX_VALUE) [duplicate 2]
15/04/16 14:19:45 ERROR scheduler.TaskSetManager: Task 20 in stage 6.0 failed 4 times; aborting job
15/04/16 14:19:45 INFO cluster.YarnClusterScheduler: Cancelling stage 6
15/04/16 14:19:45 INFO cluster.YarnClusterScheduler: Stage 6 was cancelled
15/04/16 14:19:45 INFO scheduler.DAGScheduler: Job 6 failed: collectAsMap at DecisionTree.scala:653, took 239.760845 s
15/04/16 14:19:45 ERROR yarn.ApplicationMaster: User class threw exception: Job aborted due to stage failure: Task 20 in stage 6.0 failed 4 times, most recent failure: Lost task 20.3 in stage 6.0 (TID 147, 10.196.151.213): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)

Note the exception in the logs above: the amount of data in a single partition exceeds Integer.MAX_VALUE (2147483647 bytes, i.e. 2GB).

Workaround

Manually set the number of partitions for the RDD. In my case, Spark defaulted to 18 partitions for the RDD; manually raising this to 1000 solved the problem. You can call rdd.repartition(numPartitions: Int) to reset the number of partitions after the RDD is loaded.
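
Below is a minimal sketch of this workaround in Scala (Spark 1.x-era RDD API); the application name, input path, and partition counts are illustrative placeholders, not values from the original job:

    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("RepartitionExample"))

        // Load the data; with the default partitioning, a single partition
        // may end up holding more than 2GB once cached or shuffled.
        val raw = sc.textFile("hdfs:///data/large-input") // placeholder path
        println(s"partitions before: ${raw.partitions.length}") // e.g. 18

        // Spread the same data across many more, smaller partitions.
        val repartitioned = raw.repartition(1000)
        println(s"partitions after: ${repartitioned.partitions.length}") // 1000

        repartitioned.count() // force evaluation
        sc.stop()
      }
    }

Note that repartition triggers a full shuffle; for text input you can also pass a minimum partition count when loading, e.g. sc.textFile(path, 1000).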

Why the 2GB Limit

There has been plenty of complaining about this limitation in the Spark community, and the official Spark team has long been aware of the problem, but as of version 1.2 it remains unresolved. Because the limit is baked into the implementation of the entire RDD framework, the cost of fixing it is quite high.
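
The error itself comes from java.nio: as the stack trace shows, Spark 1.x reads a cached block back as a single ByteBuffer via FileChannel.map, and a ByteBuffer is indexed by a Java int, so one block can never exceed Integer.MAX_VALUE bytes. Here is a small standalone sketch (plain Scala, no Spark; the temp-file path and 3GB size are arbitrary) that reproduces the same exception:

    import java.io.RandomAccessFile
    import java.nio.channels.FileChannel

    object MapLimitDemo {
      def main(args: Array[String]): Unit = {
        val threeGB = 3L * 1024 * 1024 * 1024
        val file = new RandomAccessFile("/tmp/big-block", "rw") // placeholder path
        try {
          file.setLength(threeGB) // usually a sparse file, so no real 3GB write
          // Throws java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE,
          // the same exception seen in the Spark logs above.
          file.getChannel.map(FileChannel.MapMode.READ_ONLY, 0, threeGB)
        } finally {
          file.close()
        }
      }
    }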

Here is some related material for interested readers:

    • 2GB limit in Spark for blocks
    • Create LargeByteBuffer abstraction for eliminating 2GB limit on blocks
    • Why does Spark RDD partition have 2GB limit for HDFS
    • Java code that throws the exception: FileChannelImpl.java

Personal Thoughts

This restriction has a certain rationality. The partitions of an RDD are processed concurrently, so if there are too few partitions (each one very large), the degree of parallelism is too low and computational efficiency suffers. Faced with this limit, Spark application developers are pushed to proactively increase the number of partitions, which raises concurrency and ultimately improves computing performance.
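
As a rough illustration of that trade-off (my own rule of thumb, not an official guideline), you can derive the partition count from the total data size so that each partition stays far below the 2GB ceiling, for example around 128MB each:

    // Rule-of-thumb sketch: round up so no partition exceeds the target size.
    def suggestedPartitions(totalBytes: Long,
                            targetBytesPerPartition: Long = 128L * 1024 * 1024): Int =
      math.max(1, math.ceil(totalBytes.toDouble / targetBytesPerPartition).toInt)

    // Example: a 36GB dataset => 288 partitions of ~128MB each,
    // comfortably below the 2GB limit and with plenty of parallelism.
    // rdd.repartition(suggestedPartitions(36L * 1024 * 1024 * 1024))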

These are just my personal thoughts; if anything above is incorrect, criticism and corrections are welcome.
