Purpose of this article

While using Spark recently to process a large dataset, I ran into the 2 GB limit on a single partition. I found a workaround and collected some related information from around the internet, which I record here as a memo.

Problem phenomenon

When this problem occurs, Spark reports errors like the following in its logs.

Log snippet 1

    15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID, 10.215.149.47): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
        at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
        at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
        at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
        at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)

Log snippet 2

    15/04/16 14:19:45 INFO scheduler.TaskSetManager: Starting task 20.2 in stage 6.0 (TID 146, 10.196.151.213, PROCESS_LOCAL, 1666 bytes)
    15/04/16 14:19:45 INFO scheduler.TaskSetManager: Lost task 20.2 in stage 6.0 (TID 146) on executor 10.196.151.213: java.lang.IllegalArgumentException (Size exceeds Integer.MAX_VALUE) [duplicate 1]
    15/04/16 14:19:45 INFO scheduler.TaskSetManager: Starting task 20.3 in stage 6.0 (TID 147, 10.196.151.213, PROCESS_LOCAL, 1666 bytes)
    15/04/16 14:19:45 INFO scheduler.TaskSetManager: Lost task 20.3 in stage 6.0 (TID 147) on executor 10.196.151.213: java.lang.IllegalArgumentException (Size exceeds Integer.MAX_VALUE) [duplicate 2]
    15/04/16 14:19:45 ERROR scheduler.TaskSetManager: Task in stage 6.0 failed 4 times; aborting job
    15/04/16 14:19:45 INFO cluster.YarnClusterScheduler: Cancelling stage 6
    15/04/16 14:19:45 INFO cluster.YarnClusterScheduler: Stage 6 was cancelled
    15/04/16 14:19:45 INFO scheduler.DAGScheduler: Job 6 failed: collectAsMap at DecisionTree.scala:653, took 239.760845 s
    15/04/16 14:19:45 ERROR yarn.ApplicationMaster: User class threw exception: Job aborted due to stage failure: Task in stage 6.0 failed 4 times, most recent failure: Lost task 20.3 in stage 6.0 (TID 147, 10.196.151.213): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
As the exception in the logs shows, the amount of data in a single partition exceeded Integer.MAX_VALUE (2147483647 bytes, i.e. 2 GB).
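The stack trace points at where this ceiling comes from: DiskStore.getBytes memory-maps a block (one cached partition) as a single buffer, and java.nio's FileChannel.map() rejects any mapping larger than Integer.MAX_VALUE because the resulting MappedByteBuffer is indexed by an Int. Below is a minimal Scala sketch, outside of Spark, that triggers the same exception; it assumes a local file larger than 2 GB at a made-up path.

    // Minimal sketch of the underlying JDK limit (not Spark code).
    // Assumes a file larger than 2 GB exists at the hypothetical path below.
    import java.io.RandomAccessFile
    import java.nio.channels.FileChannel

    object MapLimitDemo {
      def main(args: Array[String]): Unit = {
        val channel = new RandomAccessFile("/tmp/bigger-than-2g.bin", "r").getChannel
        try {
          // FileChannel.map takes the size as a Long, but the MappedByteBuffer
          // it returns is indexed by Int, so any size above Integer.MAX_VALUE
          // fails with "java.lang.IllegalArgumentException: Size exceeds
          // Integer.MAX_VALUE" -- the same exception surfaced by
          // DiskStore.getBytes in the logs above.
          channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
        } finally {
          channel.close()
        }
      }
    }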
Workaround

Manually increase the number of partitions of the RDD. In my case the RDD was loaded with the Spark default of 18 partitions; raising it to 1000 by hand solved the problem. After the RDD is loaded, you can call rdd.repartition(numPartitions: Int) to reset the number of partitions, as in the sketch below.
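A minimal sketch of this workaround follows; the application name, input path, and the target of 1000 partitions are illustrative assumptions, so adjust them so that each partition stays well below 2 GB.

    // Sketch of the repartition workaround; paths and numbers are examples.
    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionWorkaround {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("repartition-workaround"))

        // Loaded with the default number of partitions (18 in the scenario
        // above), some of which can hold more than 2 GB once cached or shuffled.
        val raw = sc.textFile("hdfs:///path/to/large/input")
        println(s"partitions before: ${raw.partitions.length}")

        // Spread the same data over many more, smaller partitions so that no
        // single block comes close to Integer.MAX_VALUE bytes.
        val repartitioned = raw.repartition(1000)
        println(s"partitions after: ${repartitioned.partitions.length}")

        repartitioned.cache()
        // ... the rest of the job runs on the repartitioned RDD ...

        sc.stop()
      }
    }

Note that repartition performs a full shuffle; coalesce is the cheaper choice when you only want fewer partitions, but here the goal is more, smaller ones.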
Why the 2G limit

There has been a lot of discussion of this limitation in the Spark community, and the official Spark team is aware of the problem, but as of version 1.2 it remains unresolved: fixing it touches the block and partition handling of the entire RDD framework, so the cost of the improvement is quite large.

Here are some related references that interested readers can explore further:
- 2GB limit in spark for blocks
- Create LargeByteBuffer abstraction for eliminating 2GB limit on blocks
- Why does Spark RDD partition have 2GB limit for HDFS
- The Java code that throws the exception: FileChannelImpl.java
Personal Thoughts

This restriction has a certain rationality. Operations on an RDD's partitions are executed concurrently, so if there are too few partitions the degree of parallelism is low and computational efficiency suffers. Faced with this limitation, Spark application developers are pushed to proactively increase the number of partitions, which raises parallelism and ultimately improves computing performance; see the sketch below for the usual knobs.
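For completeness, here is a hedged sketch of choosing the partition count up front instead of repairing it afterwards; the input path, the minPartitions value, and the parallelism setting are illustrative assumptions.

    // Illustrative ways to start with more partitions; the numbers are examples.
    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionCountSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("partition-count-sketch")
          // Default parallelism used by shuffle operations (e.g. reduceByKey)
          // when no explicit partition count is given.
          .set("spark.default.parallelism", "1000")
        val sc = new SparkContext(conf)

        // Ask for a minimum number of partitions at load time instead of
        // accepting the number of input splits.
        val data = sc.textFile("hdfs:///path/to/large/input", minPartitions = 1000)
        println(s"loaded with ${data.partitions.length} partitions")

        sc.stop()
      }
    }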
These are just my own thoughts; if anything above is incorrect, please feel free to point it out.