Summary one:
Memory configuration involves the following aspects; the sample values below come from the GDC configuration.
(1) Memory available on each node for containers, and the virtual memory limit
The NodeManager (NM) memory configuration is controlled mainly by the following two parameters (both are YARN platform properties and should be set in yarn-site.xml):
yarn.nodemanager.resource.memory-mb    94208
yarn.nodemanager.vmem-pmem-ratio       2.1
Description: the first parameter is the maximum memory available to containers on each node; neither of the two RM values below should exceed it. It can also be used to calculate the maximum number of containers a node can host: divide this value by the minimum container memory configured in the RM. The second parameter is the virtual memory ratio, i.e. how much virtual memory a task may use relative to its physical memory allocation; the default is 2.1, meaning each task may use at most 2.1 times its physical memory as virtual memory. Note: the first parameter cannot be modified dynamically; once set it stays fixed for the life of the NodeManager. Its default is 8 GB, and that default is used even if the machine has less than 8 GB of physical memory.
(2) Minimum and maximum physical memory limits for each map and reduce task
The RM memory configuration is controlled mainly by the following two parameters (both are YARN platform properties and should be set in yarn-site.xml):
yarn.scheduler.minimum-allocation-mb    2048
yarn.scheduler.maximum-allocation-mb    8192
Description: these are the minimum and maximum amounts of memory a single container can request. An application cannot exceed the maximum while it is running, and a request below the minimum is raised to the minimum; in that respect the minimum behaves a bit like a page size in an operating system. The minimum has a second use: it is used to calculate the maximum number of containers a node can host. Note: these two values cannot be changed dynamically once set (dynamically here means while applications are running).
(3) Physical memory requested by each task
The AM memory configuration parameters, using MapReduce as the example (these two values are AM properties and should be set in mapred-site.xml), are as follows:
mapreduce.map.memory.mb       4096
mapreduce.reduce.memory.mb    8192
Description: these two parameters specify the memory sizes used by the two kinds of MapReduce tasks (map tasks and reduce tasks); their values should lie between the RM's minimum and maximum container sizes. If they are not configured, they are derived from the following simple formula:
max(MIN_CONTAINER_SIZE, (Total Available RAM) / containers)
As a rule of thumb, the reduce value should be about twice the map value. Note: both values can be overridden with job parameters when the application is submitted.
(4) Memory used by the JVM in each task
The remaining memory-related parameters in the AM, namely the JVM options, can be configured as follows:
mapreduce.map.java.opts       -Xmx3072m
mapreduce.reduce.java.opts    -Xmx6144m
Note: these two parameters exist mainly for programs that run in a JVM (written in Java, Scala, etc.); they pass options such as -Xmx and -Xms to the JVM. Each value should be smaller than the corresponding mapreduce.map.memory.mb or mapreduce.reduce.memory.mb setting above.
Besides the -Xmx heap, the JVM also needs memory for the permanent generation and for stacks. If the stack grows too deep and the memory used by the task plus its JVM exceeds the defined maximum, the task is killed outright.
Therefore:
(1) The maximum number of tasks a node can theoretically run is:
yarn.nodemanager.resource.memory-mb / yarn.scheduler.minimum-allocation-mb
(2) In practice, if the node runs only map tasks, the number of map tasks it can run is:
yarn.nodemanager.resource.memory-mb / mapreduce.map.memory.mb
Of course, mapreduce.map.memory.mb can also be specified when the job is submitted; a small sketch of both calculations follows.
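As a rough illustration of the two formulas above, the following sketch computes both counts from a Hadoop Configuration, using the GDC sample values quoted earlier. It is only a sketch: the property keys are the real YARN/MapReduce names, but the surrounding class and the hard-coded values are for illustration, not a prescribed tool.

import org.apache.hadoop.conf.Configuration;

public class ContainerCapacityEstimate {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // GDC sample values quoted above; normally these come from yarn-site.xml / mapred-site.xml.
        conf.setInt("yarn.nodemanager.resource.memory-mb", 94208);
        conf.setInt("yarn.scheduler.minimum-allocation-mb", 2048);
        conf.setInt("mapreduce.map.memory.mb", 4096);

        int nodeMemMb  = conf.getInt("yarn.nodemanager.resource.memory-mb", 8192);
        int minAllocMb = conf.getInt("yarn.scheduler.minimum-allocation-mb", 1024);
        int mapMemMb   = conf.getInt("mapreduce.map.memory.mb", 1024);

        // (1) Theoretical maximum number of containers per node: 94208 / 2048 = 46
        System.out.println("Max containers per node: " + (nodeMemMb / minAllocMb));
        // (2) If the node runs only map tasks: 94208 / 4096 = 23
        System.out.println("Max map tasks per node:  " + (nodeMemMb / mapMemMb));
    }
}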
Summary two:
As a general recommendation, we've found that allowing 1-2 Containers per disk and per core gives the best balance for cluster utilization. For the example cluster node described below, this works out to a maximum of 20 Containers allocated to each node.
==================================================
Introduction: Hadoop YARN supports scheduling of two kinds of resources, memory and CPU (only memory is supported by default; to schedule CPU as well you need to do some configuration yourself). This article describes how YARN schedules and isolates these resources.
In YARN, resource management is done jointly by the ResourceManager and the NodeManagers: the scheduler in the ResourceManager is responsible for allocating resources, while the NodeManagers are responsible for supplying and isolating them. After the ResourceManager assigns resources on a NodeManager to a task (this is called "resource scheduling"), the NodeManager must provide the corresponding resources to the task as required, and even guarantee that those resources are exclusive to it, providing the basis for the task to run. This is known as resource isolation.
Before introducing resource scheduling and isolation in detail, consider the characteristics of memory and CPU, which are two quite different kinds of resources. The amount of memory determines whether a task lives or dies: if there is not enough memory, the task may fail. CPU, by contrast, only determines how fast a task runs; it does not affect whether the task survives.
Link: http://blog.chinaunix.net/uid-28311809-id-4383551.html
This blog post mainly introduces YARN's improvements over MRv1, YARN's basic memory configuration, and YARN's resource abstraction, the Container.
The main problem with MRv1 is that, at runtime, the JobTracker is responsible for both resource management and task scheduling, which limits its scalability and results in low resource utilization. This problem is rooted in MRv1's original design.
MRv1 was designed entirely around MapReduce, with little consideration for the other data processing frameworks that appeared later. Under that design, every new data processing framework (Spark, for example) would have to implement its own cluster resource management in addition to its data processing. YARN grew naturally out of this situation.
YARN's biggest improvement over MRv1 is separating resource management from task scheduling, so that different data processing frameworks can share the same resource management layer.
YARN is thus a unified resource management layer, split out of the JobTracker of MRv1. The benefits are obvious: resource sharing, better scalability, and so on.
The main difference between MRv1 and YARN: in MRv1, the JobTracker is responsible for both resource management and job control; in YARN, the JobTracker's responsibilities are divided between two components, the ResourceManager (RM) and the ApplicationMaster (AM).
In MRv1, both resource management and task scheduling are handled by the JobTracker, which makes the JobTracker's load too heavy to manage and scale. In YARN, resource management and task scheduling are clearly divided between the RM and the AM.
Effect of YARN versus MRv1 on programming: MRv1 consists of three parts, the programming model (API), the data processing engine (MapTask and ReduceTask), and the runtime environment (JobTracker and TaskTracker). YARN inherits MRv1's programming model and data processing engine and changes only the runtime environment, so it has no effect on programming.
To better explain YARN's resource management, first consider YARN's overall framework.
When a client submits a job to the RM, the AM is responsible for requesting resources from the RM and for asking the NodeManager (NM) to start tasks. In this process, the RM handles resource scheduling and the AM handles task scheduling. Important notes: the RM is responsible for resource management and scheduling of the whole cluster; the NM is responsible for resource management and scheduling on a single node; the NM communicates with the RM periodically through heartbeats, reporting the node's health and memory usage; the AM obtains resources by interacting with the RM and then starts compute tasks by interacting with the NM, as the sketch below illustrates.
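To make this interaction concrete, here is a minimal sketch of the AM side written against the standard Hadoop 2.x client library (AMRMClient). The host name, tracking URL, and resource sizes are placeholders chosen for illustration; a real AM would also handle heartbeats, completed containers, and unregistration.

import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // The AM registers itself with the RM ...
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();
        rmClient.registerApplicationMaster("am-host", 0, "");   // placeholder host/port/tracking URL

        // ... asks the RM for containers (resource scheduling): 2048 MB, 1 vcore each ...
        Resource capability = Resource.newInstance(2048, 1);
        rmClient.addContainerRequest(
                new ContainerRequest(capability, null, null, Priority.newInstance(0)));

        // ... and receives allocated containers on a later allocate (heartbeat) call.
        AllocateResponse response = rmClient.allocate(0.0f);
        List<Container> allocated = response.getAllocatedContainers();

        // For each allocated container the AM then talks to the owning NM to launch a task
        // (task scheduling); see the ContainerLaunchContext example later in this post.
    }
}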
The above is now explained in detail through the memory resource configuration:
The RM memory configuration is controlled mainly by the following two parameters (both are YARN platform properties and should be set in yarn-site.xml):
yarn.scheduler.minimum-allocation-mb
yarn.scheduler.maximum-allocation-mb
Description: these are the minimum and maximum amounts of memory a single container can request. An application cannot exceed the maximum while it is running, and a request below the minimum is raised to the minimum; in that respect the minimum behaves a bit like a page size in an operating system. The minimum has a second use: it is used to calculate the maximum number of containers a node can host. Note: these two values cannot be changed dynamically once set (dynamically here means while applications are running).
The NM memory configuration is controlled mainly by the following two parameters (both are YARN platform properties and should be set in yarn-site.xml):
yarn.nodemanager.resource.memory-mb
yarn.nodemanager.vmem-pmem-ratio
Description: the first parameter is the maximum memory available to containers on each node; neither of the two RM values should exceed it. It can be used to calculate the maximum number of containers a node can host: divide this value by the minimum container memory configured in the RM. The second parameter is the virtual memory ratio, i.e. how much virtual memory a task may use relative to its physical memory; the default is 2.1. Note: the first parameter cannot be modified dynamically; once set it stays fixed. Its default is 8 GB, and that default is used even if the machine has less than 8 GB of physical memory.
The AM memory configuration parameters, using MapReduce as the example (these two values are AM properties and should be set in mapred-site.xml), are as follows:
mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
Description: these two parameters specify the memory sizes used by the two kinds of MapReduce tasks (map tasks and reduce tasks); their values should lie between the RM's minimum and maximum container sizes. If they are not configured, they are derived from the following simple formula:
max(MIN_CONTAINER_SIZE, (Total Available RAM) / containers)
As a rule of thumb, the reduce value should be about twice the map value. Note: both values can be overridden with job parameters when the application is submitted.
The remaining memory-related parameters in the AM, namely the JVM options, can be configured as follows:
mapreduce.map.java.opts
mapreduce.reduce.java.opts
Note: these two parameters exist mainly for programs that run in a JVM (written in Java, Scala, etc.); they pass options such as -Xmx and -Xms to the JVM. Each value should be smaller than the corresponding mapreduce.map.memory.mb or mapreduce.reduce.memory.mb setting.
To summarize, configuring YARN memory mainly involves three aspects: the physical memory limits available to each map and reduce task, the JVM heap size of each task, and the virtual memory limit.
The following is a concrete error example that illustrates these memory settings; the error is:
Container [pid=41884, containerID=container_1405950053048_0016_01_000284] is running beyond virtual memory limits. Current usage: 314.6 MB of 2.9 GB physical memory used; 8.7 GB of 6.2 GB virtual memory used. Killing container.
The configuration is as follows:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>100000</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>10000</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>3000</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>2000</value>
</property>
From the configuration we can see that the container's minimum and maximum memory are 3000 MB and 10000 MB. mapreduce.reduce.memory.mb is set to 2000 MB, and mapreduce.map.memory.mb keeps its default, which is below the minimum, so both tasks are allocated 3000 MB; this is the "2.9 GB physical memory used" in the log. Because the default virtual memory ratio (2.1) is used, the total virtual memory allowed for the map and reduce tasks is 3000 * 2.1, roughly 6.2 GB. The application's virtual memory usage exceeded this value, hence the error. Solution: increase the container memory settings, raise the virtual memory ratio, or reduce the memory the application uses at runtime; a sketch of overriding these values per job follows.
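Because the task-level parameters can be overridden per job, one way to apply this fix without touching the cluster-wide configuration is to raise the container memory and the JVM heap when the job is submitted. A minimal sketch, assuming an ordinary MapReduce 2 job; the job name and the chosen values are placeholders (they must stay within the container range configured above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitWithMoreMemory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Request larger containers for this job only.
        conf.setInt("mapreduce.map.memory.mb", 4096);
        conf.setInt("mapreduce.reduce.memory.mb", 8192);
        // Keep the JVM heap below the container size so heap plus off-heap memory stays within limits.
        conf.set("mapreduce.map.java.opts", "-Xmx3072m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx6144m");

        Job job = Job.getInstance(conf, "memory-tuned-job");   // placeholder job name
        // ... set mapper, reducer, input and output paths as usual, then:
        // job.waitForCompletion(true);
    }
}

The same overrides can also be passed on the command line as -D options when the job's driver uses ToolRunner/GenericOptionsParser.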
In YARN's framework, whether the AM is requesting resources from the RM or the NM is managing the resources of its own node, everything is done in terms of containers. The Container is YARN's resource abstraction, and the resources it covers include memory and CPU. The Container is introduced in more detail below.
The AM first requests resources from the RM using a ResourceRequest. Once resources have been acquired, the AM encapsulates what is needed to run a task into a ContainerLaunchContext object and uses it to communicate with the NM and start the task. The following is a closer look at the protocol buffer definitions of ResourceRequest, Container, and ContainerLaunchContext.
The ResourceRequest structure is as follows:
message ResourceRequestProto {
  optional PriorityProto priority = 1;       // resource priority
  optional string resource_name = 2;         // host on which the desired resource resides
  optional ResourceProto capability = 3;     // amount of resources (memory, CPU)
  optional int32 num_containers = 4;         // number of containers satisfying the request
  optional bool relax_locality = 5 [default = true];
}
The fields of this structure, described by number:
Field 2: the host from which the application would like to receive the resource; the final placement is decided through negotiation between the AM and the RM.
Field 3: contains only two resources, memory and CPU, requested as <memory_num, cpu_num>.
Notes: (1) because fields 2 and 4 do not place a limit on the total amount of resources requested, the AM's resource requests are effectively unbounded; (2) YARN uses an overwrite model for resource requests: each newly sent request overwrites the previous request for the same node at the same priority. In other words, there can be only one resource request of a given priority for a given node. A short sketch of building such a request follows.
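A minimal sketch of how these fields map onto the Java records API (assuming the standard org.apache.hadoop.yarn.api.records classes from Hadoop 2.x; the priority, sizes, and container count are arbitrary illustrative values):

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class ResourceRequestSketch {
    static ResourceRequest buildRequest() {
        Priority priority = Priority.newInstance(1);            // field 1: resource priority
        Resource capability = Resource.newInstance(2048, 1);    // field 3: <memory_num, cpu_num>
        return ResourceRequest.newInstance(
                priority,
                ResourceRequest.ANY,   // field 2: "*" means any host; a host or rack name can be given instead
                capability,
                4);                    // field 4: number of containers requested
    }
}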
Container structure:
message ContainerProto {
  optional ContainerIdProto id = 1;                        // container ID
  optional NodeIdProto nodeId = 2;                         // node on which the container (resource) resides
  optional string node_http_address = 3;
  optional ResourceProto resource = 4;                     // amount of resources allocated to the container
  optional PriorityProto priority = 5;                     // priority of the container
  optional hadoop.common.TokenProto container_token = 6;   // container token, used for security authentication
}
Note: each Container typically runs one task; when the AM receives multiple Containers, it further assigns them to tasks, for example map or reduce tasks in MapReduce.
ContainerLaunchContext structure:
message ContainerLaunchContextProto {
  repeated StringLocalResourceMapProto localResources = 1;   // resources needed by the program the container runs, e.g. jar files
  optional bytes tokens = 2;                                 // security tokens used in security mode
  repeated StringBytesMapProto service_data = 3;
  repeated StringStringMapProto environment = 4;             // environment variables needed to start the container
  repeated string command = 5;                               // command to run the container's program, e.g. for a Java program: $JAVA_HOME/bin/java org.OurClass
  repeated ApplicationACLMapProto application_ACLs = 6;      // access control list of the application the container belongs to
}
The following code snippet illustrates this, taking ContainerLaunchContext as an example (a simple finite state machine would have been easier to follow, but there was not enough time to write one):
// Create a new ContainerLaunchContext:
ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
// Fill in the necessary information:
ctx.setEnvironment(...);
childRsrc.setResource(...);
ctx.setLocalResources(...);
ctx.setCommands(...);
// Start the task:
startReq.setContainerLaunchContext(ctx);
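For completeness, here is a slightly fuller sketch of the same steps, filling in the elided calls with illustrative values and asking the NM to start the container via NMClient (assuming the standard Hadoop 2.x client API; the environment variable, command, and class name are placeholders):

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class LaunchSketch {
    // 'container' is one of the containers the RM allocated to the AM.
    static void launch(Container container) throws Exception {
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);

        // Environment variables needed by the task (field 4 of the proto above).
        ctx.setEnvironment(Collections.singletonMap("TASK_ENV", "demo"));        // placeholder
        // Command the NM runs inside the container (field 5 of the proto above).
        ctx.setCommands(Collections.singletonList(
                "$JAVA_HOME/bin/java -Xmx512m org.example.OurClass"              // placeholder class
                        + " 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr"));

        // The AM contacts the NM that owns the container and asks it to start the task.
        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(new YarnConfiguration());
        nmClient.start();
        nmClient.startContainer(container, ctx);
    }
}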
Finally, to summarize the Container: the Container is YARN's resource abstraction, encapsulating resources on a node, mainly CPU and memory. The AM requests Containers from the RM, but their launch is initiated by the AM toward the NM that owns the resources, and the task ultimately runs there. There are two kinds of Containers: the Container in which the AM itself runs, and the Containers that the AM requests from the RM in order to execute tasks.
This article originates from: http://blog.chinaunix.net/uid/28311809/abstract/1.html
Also refer to:
by Rohit Bakhshi, September 10th, 2013
As part of HDP 2.0 Beta, YARN takes the resource management capabilities that were in MapReduce and packages them so they can be used by new engines. This also streamlines MapReduce to do what it does best: process data. With YARN, you can now run multiple applications in Hadoop, all sharing a common resource management layer.
In this blog post we'll walk through how to plan for and configure processing capacity in your Enterprise HDP 2.0 cluster deployment. This will cover YARN and MapReduce 2. We will use an example physical cluster of slave nodes, each with 48 GB of RAM, a number of disks, and 2 hex-core CPUs (12 cores total).
YARN takes into account all of the available compute resources on each machine in the cluster. Based on the available resources, YARN negotiates resource requests from applications (such as MapReduce) running in the cluster. YARN then provides processing capacity to each application by allocating Containers. A Container is the basic unit of processing capacity in YARN, and is an encapsulation of resource elements (memory, CPU, etc.).
Configuring YARN
In a Hadoop cluster, it is vital to balance the usage of RAM, CPU, and disk so that processing is not constrained by any one of these cluster resources. As a general recommendation, we've found that allowing 1-2 Containers per disk and per core gives the best balance for cluster utilization. For our example cluster node, we will allow a maximum of 20 Containers to be allocated to each node.
Each machine in our cluster has 48 GB of RAM. Some of this RAM should be reserved for operating system usage. On each node, we will assign 40 GB of RAM for YARN to use and keep 8 GB for the operating system. The following property sets the maximum memory YARN can utilize on the node:
In yarn-site.xml:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>40960</value>
</property>
The next step is to give YARN guidance on how to break up the total resources available into Containers, by specifying the minimum unit of RAM to allocate for a Container. We want to allow for a maximum of 20 Containers, so we need (40 GB total RAM) / (20 Containers) = 2 GB minimum per Container.
In yarn-site.xml:
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property>
YARN will allocate Containers with RAM amounts greater than or equal to yarn.scheduler.minimum-allocation-mb.
Configuring MapReduce 2
MapReduce 2 runs on top of YARN and uses YARN Containers to schedule and execute its Map and Reduce tasks.
When configuring MapReduce 2 resource utilization on YARN, there are three aspects to consider:
Physical RAM limit for each Map and Reduce task
The JVM heap size limit for each task
The amount of virtual memory each task will get
You can define how much maximum memory each Map and Reduce task will take. Since each Map and each Reduce task runs in a separate Container, these maximum memory settings should be at least equal to or more than the YARN minimum Container allocation.
For our example cluster, we have the minimum RAM for a Container (yarn.scheduler.minimum-allocation-mb) = 2 GB. We will therefore assign 4 GB for Map task Containers, and 8 GB for Reduce task Containers.
In mapred-site.xml:
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>8192</value>
</property>
Each Container will run JVMs for the Map and Reduce tasks. The JVM heap size should be set lower than the Map and Reduce memory defined above, so that it stays within the bounds of the Container memory allocated by YARN.
In mapred-site.xml:
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx3072m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx6144m</value>
</property>
The above settings configure the upper limit of the physical RAM that Map and Reduce tasks will use. The virtual memory (physical + paged memory) upper limit for each Map and Reduce task is determined by the virtual memory ratio each YARN Container is allowed. This is set by the following configuration, and the default value is 2.1:
In yarn-site.xml:
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
Thus, with the above settings in our example cluster, each Map task will get the following memory allocations:
Total physical RAM allocated = 4 GB
JVM heap space upper limit within the Map task Container = 3 GB
Virtual memory upper limit = 4 * 2.1 = 8.4 GB
With YARN and MapReduce 2, there are no longer pre-configured static slots for Map and Reduce tasks. The entire cluster is available for dynamic resource allocation of Maps and Reduces as needed by the job. In our example cluster, with the above configurations, YARN will be able to allocate on each node up to 10 mappers (40/4) or 5 reducers (40/8), or any permutation within those limits.
Next Steps
With HDP 2.0 Beta, you can use Apache Ambari to configure YARN and MapReduce 2. Download HDP 2.0 Beta and deploy today!