Spark 1.6 Memory Management
Spark's memory management module changed in version 1.6.0. The old implementation, the StaticMemoryManager class, is now called "legacy" mode. Legacy mode is disabled by default, which means that running the same code on Spark 1.5.x and Spark 1.6.x can produce different behavior; pay attention to this. For compatibility, you can set spark.memory.useLegacyMode to true; its default value is false.
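If you need the old behavior, here is a minimal sketch of turning legacy mode back on when building the context (the app name and master are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical setup: re-enable the pre-1.6 StaticMemoryManager.
    val conf = new SparkConf()
      .setAppName("legacy-memory-demo") // placeholder app name
      .setMaster("local[2]")            // placeholder master
      .set("spark.memory.useLegacyMode", "true") // default is "false"

    val sc = new SparkContext(conf)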
This article introduces the new memory management model used since Spark 1.6.0, which is implemented by the UnifiedMemoryManager class.
Under this model, the JVM heap is divided into three main memory regions:
1. Reserved Memory. This memory is reserved by the system, and its size is hard-coded. In Spark 1.6.0 it is 300 MB, which means this 300 MB does not participate in Spark's memory calculations. Its size cannot be changed unless you recompile the source code or set spark.testing.reservedMemory, and since that is a testing-only parameter, changing it is not recommended in production. Note that although this memory is called "reserved", Spark does not actually use it for anything in particular; rather, it bounds how much memory you can allocate to Spark. Even if you want to give the entire JVM heap to Spark for caching data, this region stays out of reach (it is not exactly wasted: it stores some of Spark's internal objects). For reference, if you do not give an executor at least 1.5 * Reserved Memory = 450 MB of heap, the task will fail with a "please use larger heap size" error message.
2. User Memory. This is the memory left over after Spark Memory has been allocated, and what you use it for is entirely up to you. You can store there the data structures your RDD transformations need. For example, you could rewrite a Spark aggregation using the mapPartitions transformation, maintaining a hash table inside it to run the aggregation; that hash table lives in User Memory (a sketch of this pattern appears after the region descriptions below). To repeat: this is User Memory, and it is up to you what to store there and how to use it. Spark does not track what you do with this region, and it does not check whether your code causes this region to overflow.
3. Spark Memory. This memory pool is managed by Spark. Its size is calculated as ("Java Heap" - "Reserved Memory") * spark.memory.fraction, which with Spark 1.6.0 defaults is ("Java Heap" - 300 MB) * 0.75. For example, with a 4 GB heap there will be 2847 MB of Spark Memory: (4 * 1024 MB - 300 MB) * 0.75 = 2847 MB. This pool is split into two regions, Storage Memory and Execution Memory, whose boundary is set by the spark.memory.storageFraction parameter; the default value is 0.5, i.e. 50%. The advantage of the new memory management model is that this boundary is not fixed: under memory pressure it can move, so if one region runs short of memory it can borrow space from the other. We will discuss how the boundary moves shortly; first, the two regions:
1. Storage Memory. This region is used both for caching Spark data and as temporary space for "unrolling" serialized data. Broadcast variables are also stored here as cached blocks. The unroll part may surprise you, since you might expect that unrolling a block to make it usable should not require that much space: when there is not enough memory to unroll the whole block, Spark will put the data directly on disk instead, if the persistence level allows it. As for broadcast variables, they are cached here with the MEMORY_AND_DISK persistence level.
2. Execution Memory. This region stores objects needed during task execution. For example, it holds the intermediate buffers on the map side of a shuffle, and the hash table used during hash aggregation. When there is not enough memory, this region supports spilling to disk, but blocks in this region cannot be forcibly evicted by tasks of other threads.
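To make the User Memory discussion above concrete, here is a minimal sketch of the mapPartitions aggregation pattern; the data set and key/value types are invented for illustration, and sc is assumed to be an existing SparkContext:

    import scala.collection.mutable

    // A made-up (key, value) data set; in a real job this would be your RDD.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Sum values per key with a hand-rolled hash table. The mutable.HashMap
    // built here is an ordinary JVM object, so it occupies User Memory:
    // Spark neither tracks nor limits it.
    val sums = pairs.mapPartitions { iter =>
      val table = mutable.HashMap.empty[String, Int]
      for ((k, v) <- iter)
        table(k) = table.getOrElse(k, 0) + v
      table.iterator
    }.reduceByKey(_ + _) // merge the per-partition tables

    sums.collect().foreach(println)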
Now let's talk about how the boundary between Storage Memory and Execution Memory moves. Because of the nature of Execution Memory, you cannot forcibly evict blocks from it: its data is being used by running computations, and a task that cannot find its block mid-computation will fail. Storage Memory is different: it only caches blocks in memory. If a block is evicted from it, the block's metadata mapping is updated to record the removal, and the data can be read back from disk (or recomputed, if the persistence level does not allow spilling to disk).
Therefore, Execution Memory can forcibly take space from Storage Memory, but not the other way around. So when does Execution Memory take space from Storage Memory? There are two cases:
• Whenever Storage Memory has free space, Execution Memory can borrow it, growing at the expense of Storage Memory.
• When Storage Memory has grown beyond its initial size and is fully occupied, Spark can forcibly evict blocks from it, shrinking it back down to its initial size.
In the other direction, Storage Memory can borrow space from Execution Memory only while Execution Memory has free space. That is, whenever Execution Memory runs short, it can evict blocks from Storage Memory regardless of how much free space Storage Memory has, whereas Storage Memory can only use space that Execution Memory is not occupying and can never forcibly reclaim it.
The initial Storage Memory size is "Spark Memory" * spark.memory.storageFraction = ("Java Heap" - "Reserved Memory") * spark.memory.fraction * spark.memory.storageFraction. With the default values this is ("Java Heap" - 300 MB) * 0.75 * 0.5 = ("Java Heap" - 300 MB) * 0.375. With a 4 GB heap, that gives 1423.5 MB of Storage Memory.
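As a sanity check, the arithmetic above can be reproduced directly; the constants below mirror this article's formulas and Spark 1.6.0 defaults, not any Spark API:

    // Reproduces the sizing formulas from this article (Spark 1.6.0 defaults).
    val javaHeapMb      = 4 * 1024  // example: 4 GB heap
    val reservedMb      = 300       // hard-coded Reserved Memory
    val memoryFraction  = 0.75      // spark.memory.fraction default
    val storageFraction = 0.5       // spark.memory.storageFraction default

    val sparkMemoryMb   = (javaHeapMb - reservedMb) * memoryFraction // 2847.0
    val storageInitMb   = sparkMemoryMb * storageFraction            // 1423.5
    val minUsableHeapMb = 1.5 * reservedMb                           // 450.0 minimum executor heap

    println(s"Spark Memory: $sparkMemoryMb MB, initial Storage Memory: $storageInitMb MB")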
This means that if we use Spark's cache and load enough data onto the executor to fill the Storage Memory region, that region is guaranteed to be at least its default initial size, because the cached data cannot be forcibly evicted to grow Execution Memory further. However, if the Execution Memory region has already expanded beyond its initial size before the Storage Memory region fills up, that borrowed space cannot be forcibly reclaimed, so the Storage Memory region will end up smaller than its initial size.
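As a practical aside, the persistence level chosen when caching determines what happens to a block evicted from Storage Memory, as discussed above; a brief sketch (the input path is a placeholder, and sc is again an assumed SparkContext):

    import org.apache.spark.storage.StorageLevel

    // Hypothetical cached data set; the path is a placeholder.
    val data = sc.textFile("hdfs:///path/to/input")

    // MEMORY_AND_DISK: a block evicted from Storage Memory is written to
    // disk and read back later, instead of being recomputed.
    data.persist(StorageLevel.MEMORY_AND_DISK)

    // MEMORY_ONLY: an evicted block is dropped and recomputed from lineage
    // the next time it is needed.
    // data.persist(StorageLevel.MEMORY_ONLY)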
I hope this article helps you better understand Spark's new memory management mechanism and put it to good use in your applications.
Translated from: spark-memory-management