Apache Spark Memory Management Explained
As a memory-based distributed computing engine, Spark's memory management module plays a very important role in the whole system. Understanding the fundamentals of Spark memory management helps you develop better Spark applications and tune their performance. The purpose of this article is to lay out the main threads of Spark memory management and invite deeper discussion of the topic. The principles described here are based on the Spark 2.1 release; readers should have some Spark and Java background and be familiar with concepts such as RDD, Shuffle, and the JVM.
When a Spark application executes, the Spark cluster launches two kinds of JVM processes, the Driver and the Executors. The Driver is the main control process, responsible for creating the SparkContext, submitting Spark jobs, translating jobs into compute tasks (Tasks), and coordinating the scheduling of tasks across the Executor processes. The Executors run on worker nodes, execute the specific compute tasks and return results to the Driver, and also provide storage for RDDs that need to be persisted [1]. Because the Driver's memory management is relatively simple, this article mainly analyzes the memory management of the Executor; "Spark memory" below refers to Executor memory.
1. In-heap and out-of-heap memory planning
As a JVM process, the Executor's memory management is built on the JVM's memory management: Spark plans the JVM's on-heap space in finer detail to make the most of it. At the same time, Spark introduces off-heap memory, which lets it allocate space directly in the system memory of the worker node, further optimizing memory use.
Figure 1. In-heap and out-of-heap memory
1.1 In-heap memory
The size of in-heap memory is configured by the –executor-memory or spark.executor.memory parameter when the Spark application starts. Concurrent tasks running within an Executor share the JVM heap. The memory used to cache RDD data and broadcast data is planned as storage memory; the memory those tasks occupy while executing Shuffle is planned as execution memory; the remainder is not specially planned, and object instances inside Spark itself, as well as object instances in user-defined Spark applications, occupy that remaining space. Under different management modes, these three parts occupy different amounts of space (described in section 2 below).
Spark's management of in-heap memory is only a logical, "planning-style" management, because the allocation and release of an object instance's memory are done by the JVM; Spark can only record the memory after allocation and before release. The specific process is as follows (a minimal sketch appears after the list):
- Spark creates a new object instance in its code
- The JVM allocates space in the heap, creates the object, and returns an object reference
- Spark saves the object reference and records the memory occupied by the object
- Spark records the memory released by the object and removes the reference to it
- Spark waits for the JVM's garbage collector to actually free the heap memory occupied by the object
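To make the "planning-style" nature of this bookkeeping concrete, here is a minimal, purely illustrative Scala sketch (not Spark's actual API): the tracker only records estimated sizes, while the real allocation and reclamation stay with the JVM.

import java.util.concurrent.atomic.AtomicLong

// Purely illustrative: records estimated sizes; the JVM still owns the memory.
class LogicalMemoryTracker(capacity: Long) {
  private val used = new AtomicLong(0L)

  // Called after the JVM has already created the object: record its estimated size.
  def record(estimatedSize: Long): Boolean = {
    if (used.addAndGet(estimatedSize) <= capacity) true
    else { used.addAndGet(-estimatedSize); false } // over budget: undo the record
  }

  // Called when Spark drops its reference: the record is cleared immediately,
  // but the heap memory is only reclaimed later by garbage collection.
  def release(estimatedSize: Long): Unit = used.addAndGet(-estimatedSize)

  def usedBytes: Long = used.get()
}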
As we know, JVM objects can be stored in serialized form. Serialization converts an object into a binary byte stream; in essence it converts chained storage in non-contiguous space into contiguous space or block storage. Accessing the data then requires the reverse process, deserialization, which turns the byte stream back into an object. Serialization saves storage space but adds computational overhead when storing and reading.
For objects serialized by Spark, the memory they consume can be computed directly from the size of the byte stream. For non-serialized objects, the memory used is estimated by periodic sampling, that is, the occupied memory is not recomputed every time a new data item is added [2]. This method reduces time overhead but may produce large errors, so the actual memory at a given moment may far exceed expectations. In addition, object instances that Spark has marked as released may well not actually have been reclaimed by the JVM yet, so the actual available memory can be less than what Spark records as available. Spark therefore cannot precisely track the actual available heap memory, and thus cannot completely avoid out-of-memory (OOM) exceptions.
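As a rough illustration of the sampling approach (Spark's actual implementation lives in SizeTracker/SizeEstimator and uses exponentially spaced samples; the fixed interval and per-record extrapolation below are simplifications):

// Simplified sampling size tracker: estimate a collection's footprint without
// measuring every inserted record.
class SampledSizeTracker(sampleEvery: Int, measure: AnyRef => Long) {
  private var count = 0L
  private var bytesPerRecord = 0.0

  def addRecord(record: AnyRef): Unit = {
    count += 1
    if (count % sampleEvery == 0) // measure only occasionally
      bytesPerRecord = measure(record).toDouble
  }

  // Extrapolated estimate; can drift far from the true size between samples.
  def estimatedSize: Long = (bytesPerRecord * count).toLong
}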
Although it cannot precisely control the allocation and release of in-heap memory, Spark can decide whether to cache a new RDD in storage memory and whether to allocate memory for a new task through its independent planning of storage memory and execution memory, which to some extent improves memory utilization and reduces the occurrence of exceptions.
1.2 Out-of-heap memory
To further optimize memory usage and improve the efficiency of in-Shuffle sorting, Spark introduces off-heap memory, which lets it allocate space directly in the system memory of the worker node and store serialized binary data there. Using the JDK Unsafe API (starting with Spark 2.0, off-heap storage memory is no longer managed through Tachyon but, like off-heap execution memory, is implemented on top of the JDK Unsafe API [3]), Spark can directly operate on off-heap system memory, reducing unnecessary memory overhead as well as frequent GC scans and collections, and improving processing performance. Off-heap memory can be allocated and released exactly, and the space occupied by serialized data can be computed precisely, so compared with in-heap memory it is easier to manage and less error-prone.
Off-heap memory is not enabled by default; it can be enabled with the spark.memory.offHeap.enabled parameter, and the size of the off-heap space is set by the spark.memory.offHeap.size parameter. Except that there is no "other" space, off-heap memory is partitioned the same way as in-heap memory, and all running concurrent tasks share its storage memory and execution memory.
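For example, off-heap memory could be enabled through SparkConf as follows (the parameter names are the ones above; the application name and the 2g size are just illustrative values):

import org.apache.spark.SparkConf

// Enable off-heap memory and give it a 2 GB region (illustrative value).
val conf = new SparkConf()
  .setAppName("offheap-example")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")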
1.3 Memory Management Interface
Spark provides a unified interface for managing storage memory and execution memory, the MemoryManager, and the tasks within the same Executor all call this interface's methods to request or release memory:
Listing 1. Main methods of memory management interface
//Request storage memory
def acquireStorageMemory(blockId: BlockId, numBytes: Long, memoryMode: MemoryMode): Boolean
//Request unroll memory
def acquireUnrollMemory(blockId: BlockId, numBytes: Long, memoryMode: MemoryMode): Boolean
//Request execution memory
def acquireExecutionMemory(numBytes: Long, taskAttemptId: Long, memoryMode: MemoryMode): Long
//Release storage memory
def releaseStorageMemory(numBytes: Long, memoryMode: MemoryMode): Unit
//Release execution memory
def releaseExecutionMemory(numBytes: Long, taskAttemptId: Long, memoryMode: MemoryMode): Unit
//Release unroll memory
def releaseUnrollMemory(numBytes: Long, memoryMode: MemoryMode): Unit
As you can see, calling these methods requires specifying a memory mode (MemoryMode), which determines whether the operation is performed in-heap or off-heap.
As for the implementation of MemoryManager, the default since Spark 1.6 is the unified memory management (UnifiedMemoryManager) mode; the static memory management mode used before 1.6 is still retained and can be enabled with the spark.memory.useLegacyMode parameter. The two modes differ in how they allocate space, and section 2 below describes each of them.
2. Memory space allocation
2.1 Static memory management
Under the static memory management mechanism, the sizes of storage memory, execution memory, and other memory are fixed while the Spark application runs, but the user can configure them before the application starts. The allocation of in-heap memory is shown in Figure 2:
Figure 2. Static memory management diagram (in-heap)
As shown, the available in-heap memory is calculated as follows:
Listing 2. Available in-heap memory space
Available storage memory = systemMaxMemory * spark.storage.memoryFraction * spark.storage.safetyFraction
Available execution memory = systemMaxMemory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction
Here systemMaxMemory depends on the size of the current JVM heap, and the finally available execution memory or storage memory is obtained by multiplying it by the respective memoryFraction and safetyFraction parameters. The point of the two safetyFraction parameters in the formulas above is to logically reserve a safety region of 1-safetyFraction, reducing the risk of OOM caused by actual memory use exceeding the preset range (as mentioned above, the sampling-based memory estimate for non-serialized objects carries error). Note that this reserved region is purely a logical plan; when memory is actually used Spark does not treat it differently, and like "other" memory it is handed to the JVM to manage.
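As a worked example, assuming the legacy defaults (spark.storage.memoryFraction = 0.6, spark.storage.safetyFraction = 0.9, spark.shuffle.memoryFraction = 0.2, spark.shuffle.safetyFraction = 0.8) and a 10 GB heap:

val systemMaxMemory = 10L * 1024 * 1024 * 1024               // 10 GB heap
val availableStorage   = (systemMaxMemory * 0.6 * 0.9).toLong // = 5.4 GB
val availableExecution = (systemMaxMemory * 0.2 * 0.8).toLong // = 1.6 GB
// The remaining 10 - 6 - 2 = 2 GB is "other" memory, and the two safety
// margins (0.6 GB and 0.4 GB) are reserved only logically.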
The space allocation outside the heap is simpler: there are only storage memory and execution memory, as shown in Figure 3. The sizes of the space occupied by execution memory and storage memory are determined directly by the parameter spark.memory.storageFraction; because the space occupied by off-heap memory can be computed precisely, there is no need to set a safety region.
Figure 3. Static memory management diagram (out-of-heap)
The static memory management mechanism is simple to implement, but if the user is not familiar with Spark's storage mechanisms, or does not configure the two regions according to the specific data volumes and compute tasks, it is easy to end up with a "half sea water, half flames" situation: one of the two regions (storage memory or execution memory) has a large amount of space left over while the other fills up early and has to evict or move out old content to store new content. Because of the newer memory management mechanism, this approach is now rarely used by developers; Spark retains its implementation for compatibility with applications built on older versions.
2.2 Unified Memory Management
The unified memory management mechanism introduced in Spark 1.6 differs from static memory management in that storage memory and execution memory share the same block of space and can dynamically occupy each other's free areas, as shown in Figures 4 and 5.
Figure 4. Unified memory management diagram (in-heap)
Figure 5. Unified memory management diagram (out-of-heap)
The most important optimization is the dynamic occupancy mechanism, whose rules are as follows (a simplified sketch appears after the list):
- The basic storage memory and execution memory regions are set by the spark.memory.storageFraction parameter, which determines the range of space each side owns
- When both sides' space is insufficient, data is stored to disk; if one side's space is insufficient while the other side has spare space, it can borrow the other's space ("insufficient storage space" means not enough to hold a complete Block)
- After execution memory space has been occupied by the other side, the occupier can be made to dump the occupied portion to disk and "return" the borrowed space
- After storage memory space has been occupied by the other side, the borrower cannot be made to "return" it, because many factors in the Shuffle process must be taken into account and the implementation would be quite complex [4]
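These rules can be made concrete with a heavily simplified sketch (illustrative only; Spark's real logic lives in UnifiedMemoryManager with its storage and execution pools, and evicting a Block involves the Storage module, not a simple counter):

// Sketch of the dynamic occupancy rules: execution may reclaim storage that
// grew beyond its base region, but storage can never force execution out.
class UnifiedPool(total: Long, storageFraction: Double) {
  private val storageBase = (total * storageFraction).toLong
  var storageUsed = 0L
  var executionUsed = 0L

  def acquireExecution(bytes: Long): Long = {
    var free = total - storageUsed - executionUsed
    if (free < bytes) {
      // Reclaim storage that has borrowed beyond its base region (evict/spill Blocks).
      val reclaimable = math.min(bytes - free, math.max(0L, storageUsed - storageBase))
      storageUsed -= reclaimable
      free += reclaimable
    }
    val granted = math.min(bytes, free) // execution requests may be partially granted
    executionUsed += granted
    granted
  }

  def acquireStorage(bytes: Long): Boolean = {
    // Storage may only use what execution has not claimed; it cannot force eviction.
    if (total - executionUsed - storageUsed >= bytes) { storageUsed += bytes; true }
    else false // caller must evict its own old Blocks or fail to cache
  }
}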
Figure 6. Dynamic occupancy mechanism diagram
With the unified memory management mechanism, Spark improves the utilization of in-heap and off-heap memory resources to a certain extent and reduces the developer's difficulty in maintaining Spark memory, but that does not mean developers can sit back and relax. For example, if storage memory is too large, or too much data is cached, it can lead to frequent full garbage collections and reduce task execution performance, because cached RDD data usually resides in memory long-term [5]. So to make full use of Spark's performance, developers need a deeper understanding of how storage memory and execution memory are managed and implemented.
3. Storage memory management
3.1 RDD persistence mechanism
The Resilient Distributed Dataset (RDD), Spark's most fundamental data abstraction, is a read-only collection of partitioned records (Partitions). An RDD can be created only from a dataset in stable physical storage, or by performing a transformation on another existing RDD. The dependencies between a transformed RDD and the original RDD constitute its lineage. Through lineage, Spark guarantees that every RDD can be recovered. However, all RDD transformations are lazy: only when an action that returns a result to the Driver occurs does Spark create a task to read the RDD and actually trigger execution of the transformations.
When a task reads a partition for the first time, it first determines whether the partition has been persisted; if not, it needs to check the Checkpoint or recompute the partition by lineage. So if multiple actions will be performed on one RDD, you can use the persist or cache method in the first action to persist or cache the RDD in memory or on disk, increasing the computation speed of subsequent actions. In fact, the cache method persists the RDD to memory with the default MEMORY_ONLY storage level, so caching is a special case of persistence. The design of in-heap and off-heap storage memory allows unified planning and management of the memory used when caching RDDs (other uses of storage memory, such as caching broadcast data, are outside the scope of this article).
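For example, using the public RDD API (an existing SparkContext sc and the input path are assumed here):

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs://namenode:8020/path/data.txt") // hypothetical path
rdd.persist(StorageLevel.MEMORY_AND_DISK) // cache() is shorthand for persist(MEMORY_ONLY)
rdd.count()                // the first action materializes and caches the RDD
rdd.map(_.length).sum()    // later actions read the cached partitions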
The persistence of RDDs is the responsibility of Spark's Storage module [7], which decouples RDDs from physical storage. The Storage module manages the data Spark generates during computation, encapsulating the functions that access data in memory or on disk, locally or remotely. In its implementation, the Storage module on the Driver side and on the Executor side forms a master-slave architecture: the BlockManager on the Driver side is the Master, and the BlockManagers on the Executor side are the Slaves. The Storage module logically uses the Block as its basic storage unit, and each Partition of an RDD corresponds to exactly one Block (the BlockId format is rdd_RDD-ID_PARTITION-ID). The Master is responsible for managing and maintaining the metadata of all Blocks in the whole Spark application, while the Slaves report the status of Block updates to the Master and receive commands from the Master, such as adding or removing an RDD.
Figure 7. Storage Module
For RDD persistence, Spark defines 7 different storage levels, such as MEMORY_ONLY and MEMORY_AND_DISK, and each storage level is a combination of the following 5 variables:
Listing 3. Storage level
class StorageLevel private(
    private var _useDisk: Boolean,      //disk
    private var _useMemory: Boolean,    //this actually means on-heap memory
    private var _useOffHeap: Boolean,   //off-heap memory
    private var _deserialized: Boolean, //whether non-serialized
    private var _replication: Int = 1   //number of replicas
)
Analyzing this data structure shows that a storage level defines the storage of an RDD's Partitions (which also become Blocks) along three dimensions (the named levels are sketched after the list):
- Storage location: disk / in-heap memory / off-heap memory. MEMORY_AND_DISK, for example, stores on both disk and in-heap memory, implementing redundant backup. OFF_HEAP stores only in off-heap memory; currently, when off-heap memory is selected, the data cannot be stored in any other location at the same time.
- Storage form: whether the Block is in non-serialized form after being cached to storage memory. MEMORY_ONLY, for example, stores in non-serialized form, while OFF_HEAP stores in serialized form.
- Number of replicas: when greater than 1, remote redundant backup to other nodes is required. DISK_ONLY_2, for example, requires one remote backup copy.
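The predefined levels are just combinations of these flags. The following sketch mirrors the definitions in object StorageLevel, using its public factory method with the flag order from Listing 3 (_useDisk, _useMemory, _useOffHeap, _deserialized, _replication):

import org.apache.spark.storage.StorageLevel

// (useDisk, useMemory, useOffHeap, deserialized, replication)
val MEMORY_ONLY     = StorageLevel(false, true,  false, true,  1) // non-serialized, heap only
val MEMORY_AND_DISK = StorageLevel(true,  true,  false, true,  1) // drops to disk when evicted
val MEMORY_ONLY_SER = StorageLevel(false, true,  false, false, 1) // serialized, heap only
val DISK_ONLY_2     = StorageLevel(true,  false, false, false, 2) // one remote backup copy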
3.2 The RDD caching process
Before an RDD is cached to storage memory, the data in a Partition is generally accessed through an iterator (Iterator), the standard way to traverse a data collection in Scala. Through the Iterator one can obtain each serialized or non-serialized data record (Record) in the partition; logically these Records occupy the "other" part of the in-heap memory, and the spaces of different Records of the same Partition are not contiguous.
After the RDD is cached to storage memory, the Partition is converted into a Block, and the Records come to occupy a contiguous piece of in-heap or off-heap storage memory. Spark calls this process of converting a Partition from non-contiguous storage space to contiguous storage space "unrolling" (Unroll). A Block has a serialized or a non-serialized storage format, depending on the RDD's storage level. A non-serialized Block is defined by the DeserializedMemoryEntry data structure, which uses an array to store all its object instances; a serialized Block is defined by the SerializedMemoryEntry data structure, which uses a byte buffer (ByteBuffer) to store the binary data. Each Executor's Storage module uses a linked map structure (LinkedHashMap) to manage all the Block instances in in-heap and off-heap storage memory [6]; additions to and removals from this LinkedHashMap indirectly record the allocation and release of memory.
Because there is no guarantee that the storage space can hold all the data in the Iterator at once, the current compute task must request enough Unroll space from the MemoryManager to occupy temporarily when unrolling; if the space is insufficient the Unroll fails, and it can proceed only when there is enough space. For a serialized Partition, the required Unroll space can be computed directly and requested in one step. A non-serialized Partition instead requests space while traversing the Records: each time a Record is read, a sample estimate of the Unroll space it requires is made and requested; when space is insufficient the process can be interrupted and the occupied Unroll space released. If the Unroll finally succeeds, the Unroll space occupied by the current Partition is converted into normal storage space for the cached RDD, as shown in Figure 8.
Figure 8. Spark Unroll
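A hedged sketch of the non-serialized Unroll flow (the helper names, the initial reservation, and the growth policy here are illustrative; Spark's real logic lives in the MemoryStore):

import scala.collection.mutable.ArrayBuffer

// Illustrative Unroll loop for a non-serialized partition: reserve unroll space
// incrementally while traversing records, and fail fast when no more is granted.
def unrollPartition(records: Iterator[AnyRef],
                    estimate: AnyRef => Long,       // sampled size estimate
                    reserveUnroll: Long => Boolean  // ask MemoryManager for space
                   ): Option[ArrayBuffer[AnyRef]] = {
  val buffer = new ArrayBuffer[AnyRef]
  var reserved = 1024L * 1024                       // initial reservation: 1 MB
  if (!reserveUnroll(reserved)) return None
  var used = 0L
  while (records.hasNext) {
    val r = records.next()
    buffer += r
    used += estimate(r)
    if (used > reserved) {                          // need more unroll space
      val extra = used * 2 - reserved               // grow geometrically
      if (!reserveUnroll(extra)) return None        // unroll fails; caller releases space
      reserved += extra
    }
  }
  Some(buffer) // success: unroll space converts to normal storage space
}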
As Figures 3 and 5 show, under static memory management Spark specially divides out a block of Unroll space within storage memory, and its size is fixed; under unified memory management there is no special distinction for Unroll space, and when storage space is insufficient it is handled according to the dynamic occupancy mechanism.
3.3 Eviction and dropping to disk
Because all the compute tasks of the same Executor share a limited storage memory space, when a new Block needs to be cached but the remaining space is insufficient and cannot be borrowed dynamically, old Blocks in the LinkedHashMap must be evicted (Eviction); if an evicted Block's storage level also includes the requirement to store on disk, it is dropped (Drop) to disk, otherwise the Block is deleted directly.
The eviction rules for storage memory are (sketched in code after the list):
- The evicted old Block must have the same MemoryMode as the new Block, that is, both belong to off-heap or both to in-heap memory
- The old and new Blocks cannot belong to the same RDD, to avoid circular eviction
- The RDD the old Block belongs to must not be in a reading state, to avoid consistency problems
- Traverse the Blocks in the LinkedHashMap and evict them in least-recently-used (LRU) order until the space required by the new Block is satisfied; LRU ordering is a built-in characteristic of LinkedHashMap
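A minimal sketch of LRU eviction over an access-ordered LinkedHashMap (illustrative only; Spark's MemoryStore additionally enforces the MemoryMode and reading-state rules above, and drops evicted Blocks to disk when their storage level requires it):

import java.util.{LinkedHashMap => JLinkedHashMap}

class BlockStore(capacity: Long) {
  // accessOrder = true gives least-recently-used iteration order
  private val blocks = new JLinkedHashMap[String, Long](16, 0.75f, true)
  private var used = 0L

  // sameRdd(blockId) should return true if that Block belongs to the new Block's RDD.
  def put(blockId: String, size: Long, sameRdd: String => Boolean): Boolean = {
    val it = blocks.entrySet().iterator()
    while (used + size > capacity && it.hasNext) {
      val e = it.next()
      if (!sameRdd(e.getKey)) {       // never evict Blocks of the RDD being cached
        used -= e.getValue
        it.remove()                   // drop to disk or delete, per storage level
      }
    }
    if (used + size <= capacity) { blocks.put(blockId, size); used += size; true }
    else false
  }
}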
The process of dropping to disk is relatively simple: if the storage level satisfies the condition that _useDisk is true, then _deserialized is checked to determine whether the data is in non-serialized form; if so, it is serialized first, and finally the data is stored to disk and its information updated in the Storage module.
4. Execution memory management
4.1 Memory allocation across multiple tasks
The tasks running within an Executor likewise share execution memory, and Spark uses a HashMap structure to save the task-to-memory-consumption mapping. The execution memory each task can occupy ranges from 1/2N to 1/N of the pool, where N is the number of tasks currently running within the Executor. When each task starts, it must request at least 1/2N of the execution memory from the MemoryManager; if the request cannot be satisfied, the task is blocked until another task releases enough execution memory and it can be woken up.
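A simplified sketch of this 1/2N–1/N arbitration (modeled loosely on Spark's execution memory pool; error handling and per-mode pools are omitted):

// Each of the N active tasks may hold between poolSize/(2N) and poolSize/N bytes.
class TaskMemoryArbiter(poolSize: Long) {
  private val perTask = scala.collection.mutable.Map[Long, Long]()

  def acquire(taskId: Long, bytes: Long): Long = synchronized {
    perTask.getOrElseUpdate(taskId, 0L)
    var granted = -1L
    while (granted < 0L) {
      val n = perTask.size
      val cur = perTask(taskId)
      val maxGrant = math.min(bytes, math.max(0L, poolSize / n - cur)) // cap at 1/N
      val free = poolSize - perTask.values.sum
      val toGrant = math.min(maxGrant, free)
      // Block until this task can hold at least 1/2N of the pool.
      if (toGrant < bytes && cur + toGrant < poolSize / (2 * n)) wait()
      else { perTask(taskId) = cur + toGrant; granted = toGrant }
    }
    granted
  }

  def release(taskId: Long, bytes: Long): Unit = synchronized {
    perTask(taskId) = math.max(0L, perTask.getOrElse(taskId, 0L) - bytes)
    notifyAll() // wake blocked tasks so they can retry
  }
}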
4.2 Shuffle Memory Footprint
Execution memory is mainly used to store the memory tasks occupy while executing Shuffle. Shuffle is the process of repartitioning RDD data according to certain rules. Let's look at the use of execution memory in Shuffle's two stages, Write and Read:
- If the ordinary sort method is chosen on the map side, an ExternalSorter is used for the external sort, which mainly occupies in-heap execution space when storing data in memory.
- If the Tungsten sort method is chosen on the map side, a ShuffleExternalSorter is used to sort the data, stored in serialized form, directly; storing this data in memory can occupy off-heap or in-heap execution space, depending on whether the user has enabled off-heap memory and whether the off-heap execution memory is sufficient.
- When data is aggregated on the reduce side, the data is handed to an Aggregator for processing, which occupies in-heap execution space when storing data in memory.
- If the final result needs to be sorted, the data is handed again to an ExternalSorter for processing, occupying in-heap execution space.
In ExternalSorter and Aggregator, Spark uses a hash table called AppendOnlyMap to store data in in-heap execution memory, but not all the data in the Shuffle process can be kept in that hash table. The memory this hash table occupies is estimated by periodic sampling; when it grows so large that no new execution memory can be obtained from the MemoryManager, Spark stores its entire contents in a disk file, a process called spilling (Spill), and the files spilled to disk are eventually merged (Merge).
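A hedged sketch of the spill decision (patterned after the spill logic in Spark's spillable collections, with the details simplified and the helper parameters illustrative):

// Decide whether an in-memory collection must be spilled: try to grow the
// memory threshold first; spill only if not enough execution memory is granted.
def maybeSpill(currentMemory: Long,            // sampled estimate of the hash table
               myMemoryThreshold: Long,        // memory currently reserved
               acquireExecution: Long => Long, // returns bytes actually granted
               spillToDisk: () => Unit): Long = {
  if (currentMemory >= myMemoryThreshold) {
    // Ask for enough to double our footprint.
    val granted = acquireExecution(2 * currentMemory - myMemoryThreshold)
    val newThreshold = myMemoryThreshold + granted
    if (currentMemory >= newThreshold) {
      spillToDisk() // write the whole collection out; spill files are merged later
      0L            // reset: reserved memory is released after the spill
    } else newThreshold
  } else myMemoryThreshold
}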
The Tungsten used in the Shuffle Write stage is Databricks's plan for optimizing Spark's memory and CPU usage [9], and it addresses some of the JVM's performance limitations and drawbacks. Spark automatically chooses whether to use Tungsten sorting based on the Shuffle situation. Tungsten's page-based memory management mechanism is built on top of the MemoryManager; that is, Tungsten abstracts the use of execution memory one step further, so that during Shuffle there is no need to care whether the data is stored in-heap or off-heap. Each memory page is defined by a MemoryBlock, and the two variables Object obj and long offset uniformly identify the address of a memory page in system memory. An in-heap MemoryBlock is memory allocated in the form of a long array: its obj value is the object reference of the array, and offset is the initial offset of the long array inside the JVM; together they can locate the array's absolute address within the heap. An off-heap MemoryBlock is a directly requested block of memory: its obj is null, and offset is the 64-bit absolute address of this memory block in system memory. With MemoryBlock, Spark cleverly provides a unified abstraction over in-heap and off-heap memory pages, and uses a page table (pageTable) to manage the memory pages each Task has requested.
All memory under Tungsten's page management is represented by a 64-bit logical address composed of a page number and an in-page offset (an encoding sketch follows the list):
- Page number: occupies 13 bits and uniquely identifies a memory page; Spark must request a free page number before requesting a memory page.
- In-page offset: occupies 51 bits and is the offset address of the data within the page when the data is stored using a memory page.
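The encoding and decoding of such addresses can be sketched as follows (mirroring the 13-bit/51-bit layout described above; the function names are illustrative):

val OFFSET_BITS = 51                 // low 51 bits: in-page offset
val MASK = (1L << OFFSET_BITS) - 1

def encode(pageNumber: Int, offsetInPage: Long): Long =
  (pageNumber.toLong << OFFSET_BITS) | (offsetInPage & MASK)

def decodePageNumber(address: Long): Int = (address >>> OFFSET_BITS).toInt
def decodeOffset(address: Long): Long = address & MASK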
With this unified addressing scheme, Spark can locate memory in-heap or off-heap using a pointer that is a 64-bit logical address, and the whole Shuffle Write sorting process can operate directly on the pointers with no deserialization required, making the entire process very efficient and bringing a noticeable improvement in memory access efficiency and CPU usage efficiency [10].
Spark's storage memory and execution memory are managed in distinctly different ways: for storage memory, Spark uses a LinkedHashMap to centrally manage all Blocks, which are converted from the Partitions of the RDDs to be cached; for execution memory, Spark uses an AppendOnlyMap to store data during Shuffle, and in Tungsten sorting even abstracts this into page-based memory management, opening up a brand-new JVM memory management mechanism.
Conclusion
Spark's memory management is a complex set of mechanisms, and Spark versions are updated quickly; the author's knowledge is limited, so unclear or incorrect statements are unavoidable. If readers have good suggestions or a deeper understanding of the topic, they are welcome to share them.
Reference Resources
- Spark Cluster Mode Overview
- Spark Sort Based Shuffle Memory Analysis
- Spark OFF_HEAP
- Unified Memory Management in Spark 1.6
- Tuning Spark: Garbage Collection Tuning
- Spark Architecture
- Spark Technology Insider: In-depth Analysis of the Design and Implementation of the Spark Core Architecture, Chapter 8: The Storage Module in Detail
- Spark Sort Based Shuffle Memory Analysis
- Project Tungsten: Bringing Apache Spark Closer to Bare Metal
- Spark Tungsten-sort Based Shuffle Analysis
- Discover the secrets of Spark tungsten
- Spark Task Memory Management (ON-HEAP&OFF-HEAP)