This article summarizes our experience with cluster resource management, gained while building the real-time computing platform of the Weibo Operation and Data Platform (DIP). It focuses on the following questions:
- How can heterogeneous resources be integrated?
- How can physical resources be isolated between real-time computing applications?
- How can cluster resource utilization be improved?
- How can cluster operation and maintenance costs be reduced?
1. Background

This was our initial real-time computing architecture, roughly divided into three parts:

(1) Log collection: Rsyslog, Flume, and Scribe gather the log data sent by each business party's machines; if conditions permit, a business party can also write data directly to Kafka.
(2) Log transfer: Kafka serves as a high-speed transmission channel between the log collection components and the real-time applications; it is effectively a log buffer.
(3) Real-time computing platform: the platform hosts two types of applications, depending on the usage scenario:

a. Self-service real-time applications: general-purpose real-time processing modules built on Spark Streaming and Spark SQL, designed to simplify the business party's work of developing, deploying, operating, and maintaining real-time applications; most of the time users create real-time applications through our web pages.
b. Third-party hosted applications: applications whose computation logic is tightly integrated with the business itself and whose reusable modules are hard to abstract; these are usually developed by the business party with Spark Streaming and then deployed uniformly through the DIP platform.

Both types of real-time applications run as Spark Streaming jobs on top of a Hadoop YARN cluster that uses the FairScheduler as its resource manager.

This article covers only the cluster resource management scheme of the real-time computing platform; for log collection and log transfer, refer to these two articles:

- Kafka Topic Partition Replica Assignment implementation principle and resource isolation scheme: http://www.cnblogs.com/yurunmiao/p/5550906.html
- Flume FileChannel optimization (extension) practice guide: http://www.cnblogs.com/yurunmiao/p/5603097.html
2. Hadoop YARN cluster resource management

In our Hadoop YARN cluster, the resource manager uses the FairScheduler, which divides resources by business party and assigns each business party a separate queue associated with a certain amount of resources (CPU, MEM). To guarantee each business party's resources, we set each queue's resource floor (minResources) and resource ceiling (maxResources) to the same value, and disable resource preemption between queues (business parties). The self-service real-time applications and third-party hosted applications mentioned above must be submitted to the respective business party's queue to run.

Once queues are allocated and resource amounts specified, each business party effectively owns a certain number of servers in our Hadoop YARN cluster. The cluster at runtime looks as shown in the figure (for simplicity, we use the queue name in place of the business party's name, and each container uses 1 core, 1g mem), from which the following can be read:

(1) Total Hadoop YARN cluster resources: 42 cores, 42g mem;
(2) Three business parties: QueueA, QueueB, QueueC;
(3) QueueA occupies the resources of 6 servers: 18 cores, 18g mem; QueueB and QueueC each occupy the resources of 4 servers: 12 cores, 12g mem.

Note that the Hadoop YARN FairScheduler only controls each queue's resource usage. This is a logical resource control: it does not control which servers these resources (containers) are actually assigned to run on. Take QueueA as an example: from the perspective of resource allocation, QueueA has 6 servers' worth of resources, but that does not mean it monopolizes some 6 servers in the cluster. At runtime the cluster actually looks like this: QueueA, QueueB, and QueueC "logically share" the cluster's resources within their respective resource limits.
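As a sketch of this setup, a fair-scheduler.xml along the following lines (queue names and sizes are illustrative, not our actual configuration) pins each queue's floor and ceiling to the same value:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Illustrative queue for one business party: minResources and maxResources
       are identical, so the queue can neither shrink below nor grow beyond
       its quota. -->
  <queue name="queueA">
    <minResources>18432 mb, 18 vcores</minResources>
    <maxResources>18432 mb, 18 vcores</maxResources>
  </queue>
  <queue name="queueB">
    <minResources>12288 mb, 12 vcores</minResources>
    <maxResources>12288 mb, 12 vcores</maxResources>
  </queue>
</allocations>
```

Preemption between queues is controlled separately: the FairScheduler only preempts when yarn.scheduler.fair.preemption is enabled in yarn-site.xml, so leaving it at its default (false) disables preemption cluster-wide.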
The representation of these resources can be understood as containers (each container needs a certain amount of resources at runtime, e.g., 1 core, 1g mem), and the containers of all business parties run intermixed on the servers of the cluster.

The Hadoop YARN FairScheduler's approach to resource management works well in our offline computing scenarios, but causes many problems in real-time computing scenarios.

(1) How can heterogeneous resources be integrated?

Heterogeneity has two aspects: inconsistent server models and inconsistent server roles.

a. Inconsistent server models

Many business parties cannot provide servers with the same configuration as our existing cluster when onboarding real-time applications; most of their servers have lower performance. These business parties are often migrating from earlier computing engines (e.g., Node.js + StatsD) to Spark Streaming for reasons of development, computing, or operational efficiency, bringing whatever machines they happen to have.

Hadoop already accounts for inconsistent models: the resources usable on each server (compute node, Hadoop NodeManager) can be set through the configuration file (yarn-site.xml: yarn.nodemanager.resource.cpu-vcores, yarn.nodemanager.resource.memory-mb). Based on the actual situation of a business party's servers, we can set each server's usable resources appropriately, and then size the corresponding queue according to the sum of that business party's server resources. This solves that business party's problem, but other business parties then ask: "Our real-time business is very important, and all our applications run on high-performance servers. If our applications are scheduled onto these poorly performing servers, will computing performance suffer?"
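Per-node capacity is set with the two properties mentioned above; a yarn-site.xml fragment for a lower-spec NodeManager might look like this (the numbers are illustrative):

```xml
<!-- yarn-site.xml on one NodeManager: advertise only what this box can bear. -->
<configuration>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>
  </property>
</configuration>
```

The queue for that business party is then sized to the sum of these per-node values across its servers.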
This concern is indeed worth considering. A Spark Streaming application runs 7 × 24 hours without interruption; if its containers are scheduled onto Hadoop NodeManager servers of differing performance, there is no guarantee that containers on poorly performing servers will not slow down the execution progress of the entire real-time application.

b. Inconsistent server roles

New business parties need to expand not only the cluster's compute nodes (Hadoop NodeManager) but also the Kafka broker nodes. In most cases a business party cannot provide that many machines, so we envision a "hybrid deployment" scheme in which compute nodes are deployed on the same servers as Kafka broker nodes; but this can make the "computing performance loss" problem in (a) worse.

Note: in real-time computing scenarios, taking Spark Streaming applications as an example, Hadoop NodeManager containers are CPU-intensive and Kafka brokers are IO-intensive; for a single business, deploying the two together works well, and the DIP platform has solid practical experience in this area.

(2) How can physical resources be isolated between real-time computing applications?

Physical resources include four aspects: CPU, MEM, DISK, NETWORK. If a Hadoop NodeManager server runs Spark Streaming application containers from different business parties, these containers may interfere with each other. Currently Hadoop YARN only manages CPU and MEM resources, providing a thread-based memory-usage monitoring mechanism and a cgroup-based resource isolation mechanism; it lacks corresponding isolation mechanisms for disk and network. Resource contention between containers can make container execution inefficient, which in turn affects the entire Spark Streaming application.

(3) How can cluster resource utilization be improved?
When submitting a Spark Streaming application, a business party can specify the application's runtime resource quota through three parameters:

- num-executors: how many containers the Spark Streaming application needs to request;
- executor-cores: how many cores each container needs to request;
- executor-memory: how much memory each container needs to request.

Choosing these three values requires considering two factors:

- Business party queue resource redundancy: whether the business party's queue on the DIP platform has enough spare resources for the new Spark Streaming application;
- Spark Streaming application resource requirements: allocating too many resources wastes them; allocating too few leaves the computation short of resources and affects the business.

As managers of the cluster resources, we must also consider further issues:

a. Imbalanced CPU/MEM allocation across the cluster: cluster CPU resources are exhausted while MEM resources are redundant, or MEM is exhausted while CPU is redundant;
b. Cluster resource fragmentation: a Hadoop NodeManager has spare CPU/MEM resources, but they cannot be allocated to any Spark Streaming application, because the remaining CPU/MEM on that NodeManager is insufficient to satisfy the container resource requirements of any business party's Spark Streaming application. When many NodeManagers are in this state, a large amount of cluster resources becomes unusable (wasted), and a Spark Streaming application may fail to acquire containers, or fail to acquire enough of them.

Both problems lead to the same symptom: the business party's queue still has spare resources, yet submitted applications cannot be allocated resources.

(4) How can cluster operation and maintenance costs be reduced?
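The fragmentation problem in (b) is easy to reproduce on paper. The following minimal Java sketch (the numbers are illustrative, not our production figures) checks whether a NodeManager's leftover resources can still host a given container request:

```java
public class Fragmentation {
    // Can a NodeManager with the given leftover resources host one container?
    static boolean canHost(int freeCores, int freeMemGb,
                           int containerCores, int containerMemGb) {
        return freeCores >= containerCores && freeMemGb >= containerMemGb;
    }

    public static void main(String[] args) {
        // Illustrative: every application requests 3-core / 4g containers
        // (executor-cores=3, executor-memory=4g).
        int reqCores = 3, reqMemGb = 4;

        // A NodeManager with 2 cores and 10g left: its memory is redundant,
        // but no container fits, so all 10g are effectively wasted.
        System.out.println(canHost(2, 10, reqCores, reqMemGb)); // false
        System.out.println(canHost(3, 4, reqCores, reqMemGb));  // true
    }
}
```

Scale this across many NodeManagers and the queue's nominal headroom can be large while no single node can place a container, which is exactly the symptom described above.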
The cluster's operation and maintenance costs mainly come from third-party hosted applications. Different applications have different dependencies on the system environment. For example, a Spark Streaming application developed in Python needs many libraries at runtime, all of which must be installed on every Hadoop NodeManager machine in the cluster; the resulting version conflicts put a lot of operational pressure on us and also "pollute" our cluster environment.

3. Elastic cluster resource management

Heterogeneous resources, physical isolation, resource utilization, and operation and maintenance costs led us to rethink Hadoop YARN cluster resource management. The most direct idea is to build a separate cluster for each business party, as shown in the figure. Based on our previous experience, however, the cost of operations and monitoring with this approach is extremely high, so we quickly rejected it. If multiple clusters cannot be built, can multiple "resource pools" be partitioned within one cluster? Each resource pool is associated with several machines; each business party is assigned one or more resource pools; and the applications submitted by a business party run only within its allocated resource pools, i.e., only on the servers associated with those pools.

Let us discuss whether the "resource pool" approach can solve the four problems mentioned above.

(1) Heterogeneous resources. We group servers of the same or similar model into one or more resource pools; the size of a pool depends on the specific business requirements. After such grouping, we can assume that the computing performance of any server within a resource pool is equivalent. An application submitted by a business party runs only within one resource pool, all of whose compute nodes have the same performance, so the performance loss caused by inconsistent server models is solved.
If a business party needs "hybrid deployment," a dedicated resource pool can be set up, associating the servers that host multiple services (specifically Kafka Broker and Hadoop NodeManager) with that pool. Typically this "hybrid deployment" style of resource pool is application-specific; the server roles within the pool are consistent, so the performance-loss problems that might result from inconsistent server roles are addressed.

(2) Physical resource isolation. Each business party can only use the physical resources within its own resource pools, so physical resource contention between business parties is resolved. For a particular business party, multiple real-time applications may run within its pools, and physical resource contention may still exist among them. Our view is:

a. Because resource pools are exclusive to business parties, the business party should take full account of physical resource contention between its applications when developing and deploying them; in other words, contention among a business party's own applications is delegated to the business party itself;
b. In special cases we can divide multiple resource pools for the same business party, even one pool per application.

With this, the physical resource isolation problem is resolved.

(3) Resource utilization. In the "one cluster, multiple queues" mode, the business party need not consider resource utilization; it only needs to deploy applications based on the queue's allocated resources and the applications' requirements.
From the business party's perspective, staying within the queue's allocated resources is all that matters. This perspective is rather "selfish": it does not consider the effect of one's own behavior on other business parties, and the problems of imbalanced resource allocation and resource fragmentation essentially stem from this. In the "one cluster, multiple resource pools" mode, the resources within a pool are completely exclusive, so the business party must proactively and fully consider allocation imbalance and fragmentation within the scope of its own pool. This proactive engagement by each business party can significantly increase overall cluster resource utilization.

Suppose we have a server with 128g mem. Used as a Hadoop NodeManager, we would normally configure it to advertise slightly less than its physical capacity, e.g., 120g mem, so that the server's physical resources can be fully utilized; this is also the common industry practice. During cluster operation, we observe this phenomenon (taking CPU as an example): "all the cores have been allocated, yet the server's CPU utilization is only 60%, and this is not an isolated case," which is a huge waste of cluster resources. In the "one cluster, multiple queues" mode, our cluster configuration must accommodate the needs of all business parties, so we can only use the common practice above. In the "one cluster, multiple resource pools" mode, we can configure the Hadoop NodeManagers in a pool according to the actual behavior of the applications running there, over-committing resources (e.g., advertising more cores, or more than 120g mem), so the pool can offer "more" resources and the overall utilization of the cluster is improved.
(4) Operation and maintenance costs. These mainly come from the special needs of business parties, whose implementation involves the entire cluster, including nodes added later, so the labor and maintenance costs are high. We therefore implement a "containerized" (Docker) deployment of Hadoop NodeManager: if a business party has special needs, it can make changes based on the Docker base image we provide, without directly "polluting" the system environment, making rollback easy; moreover, the changed image is deployed only on the servers in that business party's resource pools, without affecting other business parties. Operational costs are significantly lower this way.

In summary, the "one cluster, multiple resource pools" model can solve the problems of heterogeneous resources, physical resource isolation, resource utilization, and operation and maintenance costs.

4. Implementation

Hadoop YARN's FairScheduler only supports fair scheduling between applications; we need to extend it to support the "resource pool" mode:

(1) Resource pools are configurable: a resource pool name is associated with several Hadoop NodeManager nodes (denoted by hostname). Resource pools are configured in the Fair Scheduler's configuration file fair-scheduler.xml, as shown in the figure. The Hadoop NodeManager nodes of different resource pools must not overlap; that is, each Hadoop NodeManager node can belong to only one resource pool.
(2) The correspondence between resource pools and queues is identified by "name prefix." Taking "pool1" and "pool2" as examples: if a queue name has the prefix "pool1", all applications in that queue run in resource pool pool1; if the prefix is "pool2", all applications in that queue run in resource pool pool2.
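The original figure of the pool configuration is not reproduced here; a fair-scheduler.xml fragment along these lines conveys the idea (the `<group>` element name and the hostnames are our assumptions for illustration; the actual schema is defined by the extension's own parser):

```xml
<allocations>
  <!-- Hypothetical pool definitions: pool name mapped to NodeManager
       hostnames; a hostname must not appear in more than one pool. -->
  <group name="pool1">hadoop-nm-01,hadoop-nm-02,hadoop-nm-03</group>
  <group name="pool2">hadoop-nm-04,hadoop-nm-05</group>

  <!-- Queue names carry the pool name as a prefix, binding queue to pool. -->
  <queue name="pool1_queueA">
    <minResources>18432 mb, 18 vcores</minResources>
    <maxResources>18432 mb, 18 vcores</maxResources>
  </queue>
</allocations>
```

With this convention, mapping a queue to its pool at scheduling time is a simple prefix match on the queue name.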
Core idea

Each time the Hadoop YARN FairScheduler receives a "NODE_UPDATE" event from a Hadoop NodeManager, it uses the fair scheduling algorithm to assign containers to each running application. We need to add resource pool logic on top of fair scheduling. When assigning containers to an application:

(1) Get the hostname of the Hadoop NodeManager that triggered the "NODE_UPDATE" event;
(2) Get the application's submission queue name and look up the corresponding resource pool from the queue name; if a pool is found, continue with (3); if not, end this allocation and continue with the next application;
(3) If the pool's node list contains the hostname from (1), continue with the fair scheduling algorithm to complete the allocation; otherwise, end this allocation and continue with the next application.
Implementation
On top of the Hadoop YARN FairScheduler's legacy processing logic (Hadoop 2.5.0-cdh5.3.2, org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer), we add the resource pool logic:

(1) Get the hostname nodeHostname of the "NODE_UPDATE" Hadoop NodeManager;
(2) Get the resource pool information groups, where the "key" is the resource pool name and the "value" is the set of Hadoop NodeManager hostnames associated with that pool;
(3) Get the queue name queueName of the application awaiting allocation;
(4) Look up the hostname list nodes of the resource pool corresponding to queueName;
(5) If nodes contains nodeHostname, continue the allocation process; otherwise, end the allocation and continue with the next application.

In the original Hadoop YARN FairScheduler logic, a "NODE_UPDATE" event from any Hadoop NodeManager in the cluster triggers the container allocation process according to the fair scheduling algorithm. After adding the resource pool logic, applications submitted to queue queueName complete the container allocation process (still according to the fair scheduling algorithm) only when a "NODE_UPDATE" event is received from a Hadoop NodeManager in the corresponding resource pool.
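Stripped of the surrounding scheduler machinery, the pool check in steps (1)-(5) boils down to the following self-contained Java sketch (class and method names are illustrative, not the actual FSLeafQueue code):

```java
import java.util.*;

public class PoolCheck {
    // (2) Resource pool info: pool name -> NodeManager hostnames.
    // Populated from fair-scheduler.xml in the real extension.
    static final Map<String, Set<String>> groups = new HashMap<>();

    /** Returns true if allocation may proceed for this queue on this node. */
    static boolean mayAssign(String queueName, String nodeHostname) {
        // (4) Find the pool whose name is a prefix of the queue name.
        for (Map.Entry<String, Set<String>> e : groups.entrySet()) {
            if (queueName.startsWith(e.getKey())) {
                // (5) Proceed only if the NODE_UPDATE came from a pool member.
                return e.getValue().contains(nodeHostname);
            }
        }
        return false; // no pool configured for this queue: skip allocation
    }

    public static void main(String[] args) {
        groups.put("pool1", new HashSet<>(Arrays.asList("nm-01", "nm-02")));
        groups.put("pool2", new HashSet<>(Arrays.asList("nm-03")));

        System.out.println(mayAssign("pool1_queueA", "nm-01")); // true
        System.out.println(mayAssign("pool1_queueA", "nm-03")); // false
        System.out.println(mayAssign("orphanQueue", "nm-01"));  // false
    }
}
```

In the real scheduler this check runs before the fair scheduling algorithm: a negative result ends that application's allocation for the current NODE_UPDATE and the scheduler moves on to the next application.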
We also need to handle two exceptional cases: (1) the application is submitted to a queue for which no resource pool is configured; (2) the application is submitted to a queue whose resource pool has an empty hostname list. We handle both the same way, terminating the application's execution, in org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication. The extension also involves the parsing of the Fair Scheduler configuration file fair-scheduler.xml, in org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations, which parses the resource pool information out of fair-scheduler.xml and stores it in the variable groups; the details are not repeated here. Source address: https://github.com/Leaderman/hadoop-yarn-scheduler.
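The submission-time guard can be sketched in the same simplified style (hypothetical names; the real check lives inside FairScheduler.addApplication and rejects the application through YARN's normal application-rejection path):

```java
import java.util.*;

public class SubmissionGuard {
    /**
     * Returns true if an application submitted to queueName should be
     * rejected: either no pool matches the queue, or the pool has no nodes.
     */
    static boolean shouldReject(String queueName, Map<String, Set<String>> groups) {
        for (Map.Entry<String, Set<String>> e : groups.entrySet()) {
            if (queueName.startsWith(e.getKey())) {
                return e.getValue().isEmpty(); // case (2): empty host list
            }
        }
        return true; // case (1): no resource pool configured for this queue
    }

    public static void main(String[] args) {
        Map<String, Set<String>> groups = new HashMap<>();
        groups.put("pool1", new HashSet<>(Collections.singletonList("nm-01")));
        groups.put("pool2", new HashSet<>()); // pool with empty node list

        System.out.println(shouldReject("pool1_queueA", groups)); // false
        System.out.println(shouldReject("pool2_queueB", groups)); // true
        System.out.println(shouldReject("queueC", groups));       // true
    }
}
```

Rejecting at submission time keeps unschedulable applications from sitting in a queue forever waiting for NODE_UPDATE events that will never match.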