Spark executor memory is managed in two parts. The first is on-heap memory, which lives inside the JVM heap and is sized by the --executor-memory option of spark-submit or by the spark.executor.memory property. The second is off-heap memory, which is disabled by default; it has to be enabled with spark.memory.offHeap.enabled and sized with spark.memory.offHeap.size. Off-heap memory extends the memory Spark can control beyond the JVM heap, and it can be released promptly, whereas on-heap memory is only reclaimed when the garbage collector runs.

Memory is used for two purposes, execution and storage. Execution memory is consumed by computation such as shuffle; storage memory holds cached RDDs.

There are two memory management modes, static and unified, and they differ in how the boundary between the storage and execution regions is drawn. Under static memory management both regions have fixed sizes, computed from fixed configuration fractions of the heap. Under unified memory management the boundary is not fixed and each region can grow as needed: if one side runs short while the other has spare memory, it borrows from the other side, and if both run short, data is spilled to disk.

Storage memory mainly serves the RDD cache, and a storage level can be specified when an RDD is cached. When an RDD is cached, its data is converted from non-contiguous storage into a contiguous block; this process is called Unroll. Cached blocks are tracked in a LinkedHashMap. When storage memory runs out, blocks are evicted according to an LRU policy. If the storage level of an evicted block includes disk, the block is written to disk, a process called drop; otherwise it is simply removed.

Execution memory is mainly used by shuffle. With n tasks running, each task is granted between 1/2n and 1/n of the execution memory currently available; "currently" matters because under unified memory management the size of the execution region changes dynamically. Shuffle has two phases, write and read. The write side is more complex: an ordinary sort mainly uses on-heap memory, while a Tungsten sort uses off-heap memory, combined with on-heap memory when off-heap memory is not enough. Spark itself decides whether an ordinary or a Tungsten sort is used. Shuffle read mainly uses on-heap memory.

Reference: Apache Spark Memory Management in Detail, https://www.ibm.com/developerworks/cn/analytics/library/ba-cn-apache-spark-memory-management/index.html
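
To make the on-heap and off-heap settings mentioned above concrete, here is a small sketch that turns them into spark-submit arguments. The property names are Spark's; the sizes (4g, 1g) and the helper name to_submit_args are assumptions chosen only for illustration.

```python
# Illustrative values only: the sizes are assumptions, not recommendations.
executor_conf = {
    "spark.executor.memory": "4g",           # on-heap size of each executor
    "spark.memory.offHeap.enabled": "true",  # off-heap is disabled by default
    "spark.memory.offHeap.size": "1g",       # must be set when off-heap is enabled
}


def to_submit_args(conf):
    """Render a dict of Spark properties as repeated --conf key=value arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(["spark-submit"] + to_submit_args(executor_conf)))
```

The same properties could equally be placed in spark-defaults.conf or set on a SparkConf object; the command-line form is used here only because it is the shortest to show.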
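
The text above says that static regions are computed by a formula but does not give it. The sketch below is an assumption based on Spark's documented defaults rather than on the text: the legacy static manager multiplies the heap by a memory fraction and a safety fraction (spark.storage.memoryFraction, spark.storage.safetyFraction, spark.shuffle.memoryFraction, spark.shuffle.safetyFraction), while the unified manager subtracts 300 MB of reserved memory and applies spark.memory.fraction and spark.memory.storageFraction. Function names are hypothetical.

```python
RESERVED_BYTES = 300 * 1024 * 1024  # reserved system memory under the unified manager


def static_regions(heap_bytes, storage_fraction=0.6, storage_safety=0.9,
                   shuffle_fraction=0.2, shuffle_safety=0.8):
    """Fixed region sizes under the legacy static manager (assumed defaults)."""
    return {
        "storage": int(heap_bytes * storage_fraction * storage_safety),
        "execution": int(heap_bytes * shuffle_fraction * shuffle_safety),
    }


def unified_regions(heap_bytes, memory_fraction=0.6, storage_fraction=0.5):
    """Initial region sizes under the unified manager; the boundary can move later."""
    usable = int((heap_bytes - RESERVED_BYTES) * memory_fraction)
    storage = int(usable * storage_fraction)
    return {"usable": usable, "storage": storage, "execution": usable - storage}


heap = 4 * 1024 ** 3  # a 4 GiB executor heap, chosen only as an example
print(static_regions(heap))
print(unified_regions(heap))
```

Under the static scheme the two numbers never change for the life of the executor; under the unified scheme they are only the starting position of the boundary.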
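
The borrowing behaviour of unified memory management can be sketched as a toy model. This is a simplification under stated assumptions, not Spark's implementation: the class name UnifiedPool is hypothetical, sizes are plain integers, and the rule that execution may reclaim only what storage holds beyond its own region (while storage can never evict execution) is taken from Spark's documented behaviour rather than from the text above.

```python
class UnifiedPool:
    """Toy model of one shared pool with a soft storage/execution boundary."""

    def __init__(self, total, storage_region):
        self.total = total                    # size of the shared pool
        self.storage_region = storage_region  # storage's guaranteed share
        self.storage_used = 0
        self.execution_used = 0

    def _free(self):
        return self.total - self.storage_used - self.execution_used

    def acquire_storage(self, n):
        """Storage may use any free memory but cannot evict execution."""
        if n > self._free():
            return False  # the caller would have to evict blocks or go to disk
        self.storage_used += n
        return True

    def acquire_execution(self, n):
        """Execution may reclaim memory storage holds beyond its own region."""
        free = self._free()
        if n > free:
            reclaimable = max(0, self.storage_used - self.storage_region)
            evicted = min(reclaimable, n - free)
            self.storage_used -= evicted  # stands in for evicting cached blocks
            free += evicted
        granted = min(n, free)
        self.execution_used += granted
        return granted  # may be less than n; the remainder would spill to disk


pool = UnifiedPool(total=100, storage_region=50)
pool.acquire_storage(80)           # storage borrows 30 from the execution side
print(pool.acquire_execution(40))  # execution takes back 20 of the borrowed 30
```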
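
The LRU eviction and drop steps described for storage memory can be illustrated with a minimal sketch. Python's OrderedDict stands in for the LinkedHashMap named in the text; the class StorageMemory, its fields, and the boolean use_disk flag (a stand-in for a storage level that includes disk) are hypothetical. Real Spark applies further rules that are left out here.

```python
from collections import OrderedDict


class StorageMemory:
    """Toy block store: LRU eviction, with evicted blocks optionally dropped to disk."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.blocks = OrderedDict()  # block_id -> (size, use_disk), oldest first
        self.disk = {}               # block_id -> size, for dropped blocks

    def get(self, block_id):
        """Reading a block marks it as most recently used."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return True
        return False

    def put(self, block_id, size, use_disk):
        """Cache a block, evicting least recently used blocks if needed."""
        if size > self.capacity:
            return False
        if block_id in self.blocks:
            self.used -= self.blocks.pop(block_id)[0]
        while self.used + size > self.capacity:
            victim, (vsize, vdisk) = self.blocks.popitem(last=False)
            self.used -= vsize
            if vdisk:
                self.disk[victim] = vsize  # "drop": keep the block on disk
        self.blocks[block_id] = (size, use_disk)
        self.used += size
        return True


store = StorageMemory(capacity=100)
store.put("rdd_0_0", 40, use_disk=True)
store.put("rdd_0_1", 40, use_disk=False)
store.get("rdd_0_0")                     # rdd_0_1 is now the oldest block
store.put("rdd_0_2", 40, use_disk=True)  # evicts rdd_0_1, which is not kept
print(list(store.blocks), store.disk)
```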
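
The 1/2n to 1/n rule for execution memory is simple arithmetic, sketched below. The function names and the integer byte counts are assumptions for illustration; the sketch only computes the bounds and caps a request at the upper bound, whereas a real task would wait if it could not yet reach the lower bound.

```python
def task_memory_bounds(pool_bytes, active_tasks):
    """Return the (min, max) execution memory one task may hold: pool/2n and pool/n."""
    if active_tasks < 1:
        raise ValueError("active_tasks must be at least 1")
    return pool_bytes // (2 * active_tasks), pool_bytes // active_tasks


def grant(requested, held, pool_free, pool_bytes, active_tasks):
    """Cap a request so the task never holds more than 1/n of the pool."""
    _, upper = task_memory_bounds(pool_bytes, active_tasks)
    allowed = max(0, upper - held)
    return min(requested, allowed, pool_free)


print(task_memory_bounds(1200, 4))
print(grant(requested=500, held=100, pool_free=1000, pool_bytes=1200, active_tasks=4))
```

Because pool_bytes is the current size of the execution region, the bounds shift whenever the region grows or shrinks under unified memory management, and also whenever the number of running tasks changes.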