Purpose of this article

While using Spark recently to process a large dataset, I ran into the 2 GB limit on a single partition. I found a workaround and collected some related information from around the internet, which I record here as a memo.

Problem phenomenon

When this problem occurs, Spark reports errors like the following in its logs.

Log snippet 1

    15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID, 10.215.149.47): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
        at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
        at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
        at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
        at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)

Log snippet 2

    15/04/16 14:19:45 INFO scheduler.TaskSetManager: Starting task 20.2 in stage 6.0 (TID 146, 10.196.151.213, PROCESS_LOCAL, 1666 bytes)
    15/04/16 14:19:45 INFO scheduler.TaskSetManager: Lost task 20.2 in stage 6.0 (TID 146) on executor 10.196.151.213: java.lang.IllegalArgumentException (Size exceeds Integer.MAX_VALUE) [duplicate 1]
    15/04/16 14:19:45 INFO scheduler.TaskSetManager: Starting task 20.3 in stage 6.0 (TID 147, 10.196.151.213, PROCESS_LOCAL, 1666 bytes)
    15/04/16 14:19:45 INFO scheduler.TaskSetManager: Lost task 20.3 in stage 6.0 (TID 147) on executor 10.196.151.213: java.lang.IllegalArgumentException (Size exceeds Integer.MAX_VALUE) [duplicate 2]
    15/04/16 14:19:45 ERROR scheduler.TaskSetManager: Task in stage 6.0 failed 4 times; aborting job
    15/04/16 14:19:45 INFO cluster.YarnClusterScheduler: Cancelling stage 6
    15/04/16 14:19:45 INFO cluster.YarnClusterScheduler: Stage 6 was cancelled
    15/04/16 14:19:45 INFO scheduler.DAGScheduler: Job 6 failed: collectAsMap at DecisionTree.scala:653, took 239.760845 s
    15/04/16 14:19:45 ERROR yarn.ApplicationMaster: User class threw exception: Job aborted due to stage failure: Task in stage 6.0 failed 4 times, most recent failure: Lost task 20.3 in stage 6.0 (TID 147, 10.196.151.213): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
As the exception in the logs shows, the amount of data in a single partition exceeded Integer.MAX_VALUE (2147483647 bytes, i.e. 2 GB).
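The stack trace points at where this ceiling comes from: DiskStore.getBytes memory-maps a block (one cached partition) as a single buffer, and java.nio's FileChannel.map() rejects any mapping larger than Integer.MAX_VALUE because the resulting MappedByteBuffer is indexed by an Int. Below is a minimal Scala sketch, outside of Spark, that triggers the same exception; it assumes a local file larger than 2 GB at a made-up path.

    // Minimal sketch of the underlying JDK limit (not Spark code).
    // Assumes a file larger than 2 GB exists at the hypothetical path below.
    import java.io.RandomAccessFile
    import java.nio.channels.FileChannel

    object MapLimitDemo {
      def main(args: Array[String]): Unit = {
        val channel = new RandomAccessFile("/tmp/bigger-than-2g.bin", "r").getChannel
        try {
          // FileChannel.map takes the size as a Long, but the MappedByteBuffer
          // it returns is indexed by Int, so any size above Integer.MAX_VALUE
          // fails with "java.lang.IllegalArgumentException: Size exceeds
          // Integer.MAX_VALUE" -- the same exception surfaced by
          // DiskStore.getBytes in the logs above.
          channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
        } finally {
          channel.close()
        }
      }
    }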
Workaround

Manually increase the number of partitions of the RDD. In my case the RDD was loaded with the Spark default of 18 partitions; raising it to 1000 by hand solved the problem. After the RDD is loaded, you can call rdd.repartition(numPartitions: Int) to reset the number of partitions, as in the sketch below.
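A minimal sketch of this workaround follows; the application name, input path, and the target of 1000 partitions are illustrative assumptions, so adjust them so that each partition stays well below 2 GB.

    // Sketch of the repartition workaround; paths and numbers are examples.
    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionWorkaround {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("repartition-workaround"))

        // Loaded with the default number of partitions (18 in the scenario
        // above), some of which can hold more than 2 GB once cached or shuffled.
        val raw = sc.textFile("hdfs:///path/to/large/input")
        println(s"partitions before: ${raw.partitions.length}")

        // Spread the same data over many more, smaller partitions so that no
        // single block comes close to Integer.MAX_VALUE bytes.
        val repartitioned = raw.repartition(1000)
        println(s"partitions after: ${repartitioned.partitions.length}")

        repartitioned.cache()
        // ... the rest of the job runs on the repartitioned RDD ...

        sc.stop()
      }
    }

Note that repartition performs a full shuffle; coalesce is the cheaper choice when you only want fewer partitions, but here the goal is more, smaller ones.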
Why the 2G limit

There has been a lot of discussion of this limitation in the Spark community, and the official Spark team is aware of the problem, but as of version 1.2 it remains unresolved: fixing it touches the block and partition handling of the entire RDD framework, so the cost of the improvement is quite large.

Here are some related references that interested readers can explore further:
- 2GB limit in spark for blocks
- Create LargeByteBuffer abstraction for eliminating 2GB limit on blocks
- Why does Spark RDD partition have 2GB limit for HDFS
- The Java code that throws the exception: FileChannelImpl.java
Personal Thoughts

This restriction has a certain rationality. Operations on an RDD's partitions are executed concurrently, so if there are too few partitions the degree of parallelism is low and computational efficiency suffers. Faced with this limitation, Spark application developers are pushed to proactively increase the number of partitions, which raises parallelism and ultimately improves computing performance; see the sketch below for the usual knobs.
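For completeness, here is a hedged sketch of choosing the partition count up front instead of repairing it afterwards; the input path, the minPartitions value, and the parallelism setting are illustrative assumptions.

    // Illustrative ways to start with more partitions; the numbers are examples.
    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionCountSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("partition-count-sketch")
          // Default parallelism used by shuffle operations (e.g. reduceByKey)
          // when no explicit partition count is given.
          .set("spark.default.parallelism", "1000")
        val sc = new SparkContext(conf)

        // Ask for a minimum number of partitions at load time instead of
        // accepting the number of input splits.
        val data = sc.textFile("hdfs:///path/to/large/input", minPartitions = 1000)
        println(s"loaded with ${data.partitions.length} partitions")

        sc.stop()
      }
    }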
These are just my own thoughts; if anything above is incorrect, please feel free to point it out.