YARN memory allocation management mechanism and related parameter configuration
Understanding YARN's memory management and allocation mechanism is particularly important when building and deploying clusters and when developing and maintaining applications. I have done some research on it, summarized here for your reference.
I. Related configurations
YARN memory allocation and management mainly involves the ResourceManager, ApplicationMaster, and NodeManager, so tuning should revolve around these components. There is also the concept of a Container; for now, think of it as the container in which map/reduce tasks run. It is described in detail later.
1.1 RM memory resource configuration (related to resource scheduling)
RM1: yarn.scheduler.minimum-allocation-mb — the minimum memory a single Container can request (through the AM)
RM2: yarn.scheduler.maximum-allocation-mb — the maximum memory a single Container can request (through the AM)
Note:
- The minimum value determines the maximum number of Containers a node can run.
- Once set, these values cannot be changed dynamically.
1.2 NM memory resource configuration (related to hardware resources)
NM1: yarn.nodemanager.resource.memory-mb — the maximum memory available to YARN on the node
NM2: yarn.nodemanager.vmem-pmem-ratio — virtual-to-physical memory ratio, 2.1 by default
Note:
- RM1 and RM2 must not be greater than NM1.
- NM1 determines the maximum number of Containers a node can run: max(Containers) = NM1 / RM1.
- Once set, these values cannot be changed dynamically.
1.3 AM memory configuration parameters (task-related)
AM1: mapreduce.map.memory.mb — the memory allocated to each map Container
AM2: mapreduce.reduce.memory.mb — the memory allocated to each reduce Container
- Both values should lie between RM1 and RM2.
- AM2 is usually set to twice AM1.
- These two values can be changed when a job is submitted.
AM3: mapreduce.map.java.opts — JVM parameters for running the map task, such as -Xmx and -Xms.
AM4: mapreduce.reduce.java.opts — JVM parameters for running the reduce task, such as -Xmx and -Xms.
Note:
- The heap sizes in AM3 and AM4 should be somewhat smaller than AM1 and AM2 respectively, so that the JVM fits inside the Container.
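To tie sections 1.1-1.3 together, here is a small illustrative Python sketch that checks the relationships described above; the values and the shorthand names RM1/RM2/NM1/AM1/AM2 are taken from this article, not from any real configuration file.

# Illustrative check of the relationships described above (example values only).
RM1 = 1024     # yarn.scheduler.minimum-allocation-mb
RM2 = 8192     # yarn.scheduler.maximum-allocation-mb
NM1 = 24576    # yarn.nodemanager.resource.memory-mb
AM1 = 1536     # mapreduce.map.memory.mb
AM2 = 3072     # mapreduce.reduce.memory.mb
map_xmx = 1024     # -Xmx in mapreduce.map.java.opts
reduce_xmx = 2560  # -Xmx in mapreduce.reduce.java.opts

assert RM1 <= AM1 <= RM2 and RM1 <= AM2 <= RM2  # AM values lie between RM1 and RM2
assert RM1 <= NM1 and RM2 <= NM1                # RM1 and RM2 must not exceed NM1
assert map_xmx < AM1 and reduce_xmx < AM2       # the JVM heap must fit inside the Container

# Maximum number of Containers a node can host, using the minimum allocation:
print("max containers per node:", NM1 // RM1)   # 24576 / 1024 = 24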
II. Understanding these configuration concepts
Knowing the parameters is not enough; we also need to understand how they interact during allocation. The figure below illustrates the meaning of each parameter.
Looking first at the brown part at the bottom of the figure:
The AM parameter mapreduce.map.memory.mb = 1536 MB means the AM requests 1536 MB for each map Container, but the memory actually allocated by the RM is 2048 MB: yarn.scheduler.minimum-allocation-mb = 1024 MB defines the minimum the RM will allocate, 1536 MB is not an integer multiple of it, so the request is rounded up to 2048 MB (this involves the normalization factor, which is introduced at the end of this article).
The AM parameter mapreduce.map.java.opts = -Xmx1024m means the JVM heap used to run the map task is 1024 MB. Because the map task runs inside the Container, this value should be somewhat smaller than mapreduce.map.memory.mb = 1536 MB.
The NM parameter yarn.nodemanager.vmem-pmem-ratio = 2.1 means the NodeManager allows each map/reduce Container 2.1 times its memory in virtual memory; following the figure, the virtual-memory limit for the map Container is 1536 * 2.1 = 3225.6 MB. If the task's actual virtual-memory usage exceeds this value, the NM kills the Container and the task fails with an exception.
The AM parameter mapreduce.reduce.memory.mb = 3072 MB means each reduce Container is allocated 3072 MB, while each map Container gets 1536 MB; the reduce Container is preferably twice the size of the map Container.
The NM parameter yarn.nodemanager.resource.memory-mb = 24576 MB is the memory made available to the NodeManager, that is, the memory of the node that can be used to run YARN tasks. This value should be set according to the actual server memory; for example, if the machines in the Hadoop cluster have 128 GB, we can give about 80% of it to YARN, that is, roughly 102 GB.
The two RM parameters are 8192 MB and 1024 MB respectively, i.e. the maximum and minimum memory that can be allocated to a map/reduce Container.
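The numbers in the figure can be reproduced with a short calculation. This is only a sketch of the rounding behaviour, assuming the normalization factor equals yarn.scheduler.minimum-allocation-mb (which, as the appendix notes, is the case for the FIFO and Capacity schedulers).

import math

minimum_allocation_mb = 1024   # yarn.scheduler.minimum-allocation-mb
vmem_pmem_ratio = 2.1          # yarn.nodemanager.vmem-pmem-ratio
requested_mb = 1536            # mapreduce.map.memory.mb

# Requests are rounded up to a multiple of the normalization factor,
# assumed here to equal the minimum allocation.
factor_mb = minimum_allocation_mb
allocated_mb = math.ceil(requested_mb / factor_mb) * factor_mb
print(allocated_mb)                                  # 2048

# Virtual-memory limit for the map Container (the figure applies the
# ratio to mapreduce.map.memory.mb):
print(round(requested_mb * vmem_pmem_ratio, 1))      # 3225.6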
III. Task submission process
3.1 Task submission process
Step 1: submit the application to ResourceManager;
Step 2: ResourceManager allocates resources for the application's ApplicationMaster and communicates with a NodeManager to start the ApplicationMaster;
Step 3: ApplicationMaster communicates with ResourceManager to request resources for its internal tasks. Once resources are obtained, it communicates with the NodeManager to start the corresponding tasks.
Step 4: After all tasks are completed, the ApplicationMaster deregisters from the ResourceManager and the whole application finishes.
3.2 about Container
(1) Container is YARN's abstraction of resources. It encapsulates a certain amount of resources on a node (CPU and memory). It has nothing to do with Linux Containers; it is simply a concept introduced by YARN (in the implementation it can be seen as a serializable/deserializable Java class).
(2) Containers are requested by the ApplicationMaster from the ResourceManager; the resource scheduler inside the ResourceManager allocates them to the ApplicationMaster asynchronously;
(3) Containers are launched by the ApplicationMaster on the NodeManager where the resources reside. When launched, a Container must be given the command that runs the task (any command will do, such as a java, Python, or C++ process start command) along with the environment variables and external resources the command needs (such as dictionary files, executables, and jar packages).
In addition, the Containers an application needs fall into two categories:
(1) The Container that runs the ApplicationMaster: it is requested and started by the ResourceManager (through its internal resource scheduler). When submitting the application, the user can specify the resources required by this single ApplicationMaster;
(2) Containers that run the application's tasks: these are requested by the ApplicationMaster from the ResourceManager and are started by the ApplicationMaster, which communicates with the NodeManager.
Both kinds of Container can be placed on any node, and their locations are generally random, which means the ApplicationMaster may end up on the same node as the tasks it manages.
Container is one of the most important concepts in YARN. It is important to understand the resource model of YARN.
Note: map/reduce tasks run inside Containers, which is why the mapreduce.map(reduce).memory.mb discussed above must be larger than the heap size given in mapreduce.map(reduce).java.opts.
IV. HDP platform parameter optimization suggestions
Based on the knowledge above, we can set relevant parameters according to our actual situation. Of course, we also need to continuously check and adjust the parameters during the testing process.
The following are the configuration suggestions provided by Hortonworks:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.1/bk_installing_manually_book/content/rpm-chap1-11.html
4.1 Memory Allocation
Reserved Memory = Reserved for stack memory + Reserved for HBase Memory (If HBase is on the same node)
In this example the total system memory is 126 GB: 24 GB is reserved for the operating system and, since HBase is present, another 24 GB is reserved for HBase, leaving 126 - 24 - 24 = 78 GB available for YARN.
The calculation below assumes HBase is deployed on the DataNode nodes.
4.2 containers calculation:
MIN_CONTAINER_SIZE = 2048 MB
# of containers = min(2 * CORES, 1.8 * DISKS, (Total available RAM) / MIN_CONTAINER_SIZE)
# of containers = min(2 * 12, 1.8 * 12, (78 * 1024) / 2048)
# of containers = min(24, 21.6, 39)
# of containers = 22
Container memory calculation:
RAM-per-container = max(MIN_CONTAINER_SIZE, (Total available RAM) / containers)
RAM-per-container = max(2048, (78 * 1024) / 22)
RAM-per-container = 3630 MB
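The calculation in 4.1 and 4.2 can be written out as a short Python sketch; the memory, core, and disk figures are the example values used above, and the rounding of 21.6 up to 22 follows the example.

total_ram_gb = 126         # total system memory
reserved_os_gb = 24        # reserved for the operating system
reserved_hbase_gb = 24     # reserved for HBase on the same node
cores = 12
disks = 12
MIN_CONTAINER_SIZE = 2048  # MB

available_ram_mb = (total_ram_gb - reserved_os_gb - reserved_hbase_gb) * 1024  # 78 GB

containers = round(min(2 * cores, 1.8 * disks, available_ram_mb / MIN_CONTAINER_SIZE))
ram_per_container = max(MIN_CONTAINER_SIZE, available_ram_mb // containers)

print(containers)          # min(24, 21.6, 39) -> 22
print(ram_per_container)   # 3630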
4.3 YARN and MapReduce parameter configuration:
yarn.nodemanager.resource.memory-mb = containers * RAM-per-container
yarn.scheduler.minimum-allocation-mb = RAM-per-container
yarn.scheduler.maximum-allocation-mb = containers * RAM-per-container
mapreduce.map.memory.mb = RAM-per-container
mapreduce.reduce.memory.mb = 2 * RAM-per-container
mapreduce.map.java.opts = 0.8 * RAM-per-container
mapreduce.reduce.java.opts = 0.8 * 2 * RAM-per-container
yarn.nodemanager.resource.memory-mb = 22 * 3630 MB
yarn.scheduler.minimum-allocation-mb = 3630 MB
yarn.scheduler.maximum-allocation-mb = 22 * 3630 MB
mapreduce.map.memory.mb = 3630 MB
mapreduce.reduce.memory.mb = 2 * 3630 MB
mapreduce.map.java.opts = 0.8 * 3630 MB
mapreduce.reduce.java.opts = 0.8 * 2 * 3630 MB
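Putting the two numbers together, the values in 4.3 follow mechanically. The sketch below simply evaluates the formulas above; the -Xmx strings are only an illustration of how the 0.8 factor is usually expressed in the java.opts settings.

containers = 22
ram_per_container = 3630   # MB

config = {
    "yarn.nodemanager.resource.memory-mb":  containers * ram_per_container,  # 79860
    "yarn.scheduler.minimum-allocation-mb": ram_per_container,               # 3630
    "yarn.scheduler.maximum-allocation-mb": containers * ram_per_container,  # 79860
    "mapreduce.map.memory.mb":              ram_per_container,               # 3630
    "mapreduce.reduce.memory.mb":           2 * ram_per_container,           # 7260
    "mapreduce.map.java.opts":    "-Xmx%dm" % int(0.8 * ram_per_container),      # -Xmx2904m
    "mapreduce.reduce.java.opts": "-Xmx%dm" % int(0.8 * 2 * ram_per_container),  # -Xmx5808m
}

for key, value in config.items():
    print(key, "=", value)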
Appendix: Introduction to the normalization factor
To simplify resource management and scheduling, Hadoop YARN has a built-in resource normalization algorithm. It specifies a minimum requestable amount of resources, a maximum requestable amount, and a resource normalization factor. If the amount requested by an application is less than the minimum, YARN rounds it up to the minimum; in other words, an application never receives less than it asked for, although it may receive more. If the amount requested is greater than the maximum, an exception is thrown and the request fails. The normalization factor is used to round requests: if a request is not an integer multiple of the factor, it is rounded up to the smallest integer multiple, i.e. ceil(a / b) * b, where a is the requested amount and b is the normalization factor.
For example, in the yarn-site.xml, the related parameters are as follows:
yarn.scheduler.minimum-allocation-mb: minimum memory that can be requested. The default is 1024.
yarn.scheduler.minimum-allocation-vcores: minimum number of CPUs that can be requested. The default is 1.
yarn.scheduler.maximum-allocation-mb: maximum memory that can be requested. The default is 8192.
yarn.scheduler.maximum-allocation-vcores: maximum number of CPUs that can be requested. The default is 4.
The normalization factor differs between schedulers:
- FIFO and Capacity Scheduler: the normalization factor equals the minimum amount of resources that can be requested and cannot be configured separately.
- Fair Scheduler: the normalization factor is set by the parameters yarn.scheduler.increment-allocation-mb and yarn.scheduler.increment-allocation-vcores; the defaults are 1024 and 1.
Based on the above, the amount of resources an application receives may be larger than what it requested. For example, if YARN's minimum requestable memory is 1024 MB and the normalization factor is 1024, an application that requests 1500 MB of memory will receive 2048 MB; if the normalization factor is 512, it will receive 1536 MB.
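The rounding rule ceil(a/b)*b can be checked directly; this minimal sketch uses the request sizes from the example above, and the normalize helper is just an illustration, not a YARN API.

import math

def normalize(requested_mb, factor_mb, minimum_mb=1024, maximum_mb=8192):
    # Clamp to the minimum, then round up to a multiple of the factor: ceil(a / b) * b.
    if requested_mb > maximum_mb:
        raise ValueError("request exceeds yarn.scheduler.maximum-allocation-mb")
    requested_mb = max(requested_mb, minimum_mb)
    return math.ceil(requested_mb / factor_mb) * factor_mb

print(normalize(1500, factor_mb=1024))  # 2048
print(normalize(1500, factor_mb=512))   # 1536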