Objective
The new memory model is introduced in the JIRA issue SPARK-10000; the corresponding design document is titled "Unified Memory Management".
Memory Manager
In the Spark 1.6 release, the MemoryManager implementation is selected by the setting:
spark.memory.useLegacyMode=false (the default, selecting the new model)
spark.memory.useLegacyMode=true (selecting the pre-1.6 model)
If you use the pre-1.6 model, memory is managed by StaticMemoryManager; otherwise the new UnifiedMemoryManager is used.
Let's first look at how an executor's memory was divided before 1.6:
1. ExecutionMemory. This region buffers intermediate data for shuffles, joins, sorts, and aggregations, so that frequent IO can be avoided. It is configured via spark.shuffle.memoryFraction (default 0.2).
2. StorageMemory. This region holds the block cache (the memory backing calls such as rdd.cache() and rdd.persist()), as well as broadcasts and stored task results. It is configured via spark.storage.memoryFraction (default 0.6).
3. OtherMemory. Reserved for the system, since the program itself needs memory to run (default 0.2).
In addition, to guard against OOM, each region is scaled by a safety fraction. For example, the memory actually available to ExecutionMemory is spark.shuffle.memoryFraction * spark.shuffle.safetyFraction = 0.2 * 0.8, i.e. only 16% of the heap. The biggest problem with this allocation scheme is that no region can exceed its own limit: each gets a fixed share, even while another region sits idle. This hurts most for StorageMemory and ExecutionMemory, the two largest consumers of memory.
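As a concrete illustration, the pre-1.6 static split can be written out as follows. This is only a sketch: the function names are invented, and it assumes the documented defaults, including spark.storage.safetyFraction = 0.9, which the text above does not mention.

```scala
// Pre-1.6 static split: each region is a fixed fraction of the heap,
// scaled by a safety fraction to guard against OOM.
val shuffleMemoryFraction = 0.2 // spark.shuffle.memoryFraction default
val shuffleSafetyFraction = 0.8 // spark.shuffle.safetyFraction default
val storageMemoryFraction = 0.6 // spark.storage.memoryFraction default
val storageSafetyFraction = 0.9 // spark.storage.safetyFraction default (assumed)

// Memory actually usable for shuffles, joins, sorts, aggregations: 16% of the heap.
def staticExecutionMemory(systemMemory: Long): Long =
  (systemMemory * shuffleMemoryFraction * shuffleSafetyFraction).toLong

// Memory actually usable for caching blocks: 54% of the heap.
def staticStorageMemory(systemMemory: Long): Long =
  (systemMemory * storageMemoryFraction * storageSafetyFraction).toLong
```

For a 1 GB heap this gives roughly 160 MB for execution and 540 MB for storage, and neither side can ever use the other's idle share.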
This problem motivated the unified memory management model, whose core idea is to break down the rigid boundary between ExecutionMemory and StorageMemory.
OtherMemory
Other memory was also adjusted in 1.6 to guarantee that at least 300 MB is reserved. You can also set it manually via spark.testing.reservedMemory. The actual usable memory is then the system memory minus this reservedMemory. ExecutionMemory and StorageMemory share usableMemory * 0.75; the 0.75 can be changed with the new parameter spark.memory.fraction. The current default of spark.memory.storageFraction is 0.5, so by default ExecutionMemory and StorageMemory split the available memory mentioned above evenly.
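Putting those numbers together, here is a minimal sketch of how the 1.6 layout carves up an executor's heap. The function and value names are invented for illustration; the constants are the defaults quoted above.

```scala
val reservedMemory = 300L * 1024 * 1024 // reserved for the system; spark.testing.reservedMemory
val memoryFraction = 0.75               // spark.memory.fraction default
val storageFraction = 0.5               // spark.memory.storageFraction default

// Returns (executionRegionSize, storageRegionSize) for a given heap size.
def unifiedSplit(systemMemory: Long): (Long, Long) = {
  val usableMemory = systemMemory - reservedMemory
  val maxMemory = (usableMemory * memoryFraction).toLong // shared by execution and storage
  val storageRegionSize = (maxMemory * storageFraction).toLong
  (maxMemory - storageRegionSize, storageRegionSize)
}
```

Unlike the pre-1.6 fractions, these two region sizes are only starting points: either side may grow past its region by borrowing, as the following sections describe.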
UnifiedMemoryManager
This class provides two core methods:
acquireExecutionMemory
acquireStorageMemory
acquireExecutionMemory
Every request for execution memory calls the maybeGrowExecutionPool method, from which we can draw several useful conclusions:
If ExecutionMemory has enough memory, no request to borrow memory from storage is triggered.
Each task is limited to between poolSize / (2 * numActiveTasks) and maxPoolSize / numActiveTasks, where:
maxPoolSize = maxMemory - math.min(storageMemoryUsed, storageRegionSize)
poolSize = executionMemoryPool.poolSize (the memory currently held by the ExecutionMemoryPool)
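The per-task bounds above can be sketched as follows (the function name is invented; the formulas mirror what ExecutionMemoryPool.acquireMemory computes in Spark 1.6):

```scala
// Each active task may hold between 1/(2N) of the current pool size and
// 1/N of the maximum pool size, where N is the number of active tasks.
def taskMemoryBounds(poolSize: Long,
                     maxPoolSize: Long,
                     numActiveTasks: Int): (Long, Long) = {
  val minMemoryPerTask = poolSize / (2L * numActiveTasks)
  val maxMemoryPerTask = maxPoolSize / numActiveTasks
  (minMemoryPerTask, maxMemoryPerTask)
}
```

For example, with a 400 MB pool, a 600 MB maximum pool, and 2 active tasks, each task is kept between 100 MB and 300 MB.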
If ExecutionMemory runs short, an operation to reclaim memory from StorageMemory is triggered. How exactly? The following code makes it clear:
val memoryReclaimableFromStorage = math.max(storageMemoryPool.memoryFree,
  storageMemoryPool.poolSize - storageRegionSize)
It compares StorageMemoryPool's remaining free memory with the memory StorageMemoryPool has borrowed from ExecutionMemory, and takes the larger of the two as the maximum memory that can be reclaimed. Expressed as a formula:
maximum memory ExecutionMemory can reclaim = max(memory StorageMemory borrowed from execution, StorageMemory free memory)
Of course, if the actual need is less than this maximum, only the amount actually needed is reclaimed. The following code shows this logic:
val spaceReclaimed = storageMemoryPool.shrinkPoolToFreeSpace(
  math.min(extraMemoryNeeded, memoryReclaimableFromStorage))
onHeapExecutionMemoryPool.incrementPoolSize(spaceReclaimed)
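The two snippets above can be stitched into a small, self-contained simulation. The Pool class below is a simplified stand-in for Spark's pool classes, and the block eviction that shrinkPoolToFreeSpace may perform is glossed over:

```scala
// Simplified stand-in for a memory pool: a size and a used counter.
case class Pool(var poolSize: Long, var memoryUsed: Long) {
  def memoryFree: Long = poolSize - memoryUsed
}

// Reclaim up to extraMemoryNeeded bytes from storage for execution.
def maybeGrowExecutionPoolSketch(execution: Pool,
                                 storage: Pool,
                                 storageRegionSize: Long,
                                 extraMemoryNeeded: Long): Long = {
  // The larger of: storage's free memory, or what storage borrowed from execution.
  val memoryReclaimableFromStorage =
    math.max(storage.memoryFree, storage.poolSize - storageRegionSize)
  if (memoryReclaimableFromStorage <= 0) 0L
  else {
    val spaceReclaimed = math.min(extraMemoryNeeded, memoryReclaimableFromStorage)
    storage.poolSize -= spaceReclaimed   // shrinkPoolToFreeSpace (eviction elided)
    execution.poolSize += spaceReclaimed // incrementPoolSize
    spaceReclaimed
  }
}
```

For example, with storageRegionSize = 500 and a storage pool that has grown to 700 (650 used), an execution request for 120 more bytes reclaims 120 and shrinks the storage pool to 580.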
acquireStorageMemory
The process is similar to acquireExecutionMemory, with one difference: StorageMemory can borrow only memory that ExecutionMemory currently has free. This logic is embodied in this line of code:
val memoryBorrowedFromExecution = Math.min(onHeapExecutionMemoryPool.memoryFree, numBytes)
So the memory StorageMemory can borrow from ExecutionMemory depends entirely on whether ExecutionMemory has free memory at that moment.
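Using the same kind of simplified stand-in for a pool, the storage-side borrow can be sketched as follows; only execution's free memory moves across, and nothing is ever evicted from execution (the class and function names are invented, while the pool-size adjustments mirror Spark's decrementPoolSize/incrementPoolSize calls):

```scala
// Simplified stand-in for a memory pool: a size and a used counter.
case class SimplePool(var poolSize: Long, var memoryUsed: Long) {
  def memoryFree: Long = poolSize - memoryUsed
}

// Storage asks for numBytes; it may take at most what execution has free.
def growStoragePoolSketch(execution: SimplePool,
                          storage: SimplePool,
                          numBytes: Long): Long = {
  val memoryBorrowedFromExecution = math.min(execution.memoryFree, numBytes)
  execution.poolSize -= memoryBorrowedFromExecution // decrementPoolSize
  storage.poolSize += memoryBorrowedFromExecution   // incrementPoolSize
  memoryBorrowedFromExecution
}
```

If execution holds a 500-byte pool with only 80 bytes free and storage asks for 200, only 80 bytes move across.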
MemoryPool
The interaction between StorageMemory and ExecutionMemory was described above. The concrete bookkeeping of memory is done by MemoryPool.
UnifiedMemoryManager maintains three such objects:
@GuardedBy("this")
protected val storageMemoryPool = new StorageMemoryPool(this)
@GuardedBy("this")
protected val onHeapExecutionMemoryPool = new ExecutionMemoryPool(this, "on-heap execution")
@GuardedBy("this")
protected val offHeapExecutionMemoryPool = new ExecutionMemoryPool(this, "off-heap execution")
The actual memory accounting is done by these objects, for example:
borrowing and lending memory between pools
tracking each task's current memory usage
It is worth noting that, as we already knew, a shuffle may use either on-heap or off-heap memory. In UnifiedMemoryManager these are tracked by different objects. If you use the offHeapExecutionMemoryPool, there is no interaction with StorageMemory, and thus no dynamic borrowing of memory.
Summary
In theory, the number of shuffle spills can be reduced; in the extreme case there may be no spill at all, greatly reducing the amount of IO.
If memory is tight overall, the new model cannot alleviate that.
If your workload is skewed, leaning heavily on either ExecutionMemory or StorageMemory, the new model should help more.