Large-scale cluster management tool Borg

Overview

Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters, each comprising up to tens of thousands of machines.

Borg achieves high utilization by combining admission control, efficient task packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. To simplify life for its users, Borg offers a declarative job specification language, name service integration, real-time job monitoring, and a range of tools to analyze and simulate system behavior.

We then present a summary of Borg, including its system architecture and features, important design decisions, a quantitative analysis of some of its policies, and lessons learned from more than ten years of operating it.

1. Introduction

Google's internal cluster management system, called Borg, admits, schedules, starts, restarts, and monitors the full range of applications that Google runs. This paper mainly illustrates three benefits Borg provides: (1) it hides the details of resource management and failure handling, so users can focus on application development; (2) it operates with very high reliability and availability, and supports applications that need the same; and (3) it lets us run workloads effectively across tens of thousands of machines. Borg is not the first system to address these problems, but it is one of the few that does so at this scale with this degree of resilience and completeness. The paper is organized around these topics and concludes with a set of qualitative observations from more than ten years of experience with Borg.

2. User Perspective

Borg's main users are Google developers and the system administrators (site reliability engineers, or SREs) who run Google's applications and services. Users submit work to Borg in the form of jobs, each of which consists of one or more tasks that all run the same program. Each job runs in one Borg cell, a set of machines managed as a unit. The remainder of this section describes the main features of Borg from a user's perspective.

2.1. Workload

Borg runs a variety of workloads, which fall into two main categories. The first is long-running services that should never go down and that handle short-lived, latency-sensitive requests (latency requirements ranging from a few microseconds to a few hundred milliseconds). Such services are used for end-user-facing products such as Gmail, Google Docs, and web search, and for internal infrastructure services such as Bigtable. The second category is batch jobs that take from a few seconds to a few days to complete; these are much less sensitive to short-term performance fluctuations. Workloads within a cell are typically mixed, and the mix varies with the cell's major tenants and over time: batch jobs come and go, while many end-user-facing service jobs exhibit a long-lived, diurnal usage pattern. Borg is required to handle all these cases well. A publicly available, representative month-long Borg workload trace from May 2011 has been extensively analyzed.

Many application frameworks have been built on top of Borg over the last few years, including our internal MapReduce system, FlumeJava, Millwheel, and Pregel. Most of these have a controller that submits a master job and one or more worker jobs; the first two of these play a role similar to YARN's application manager. Our distributed storage systems such as GFS and its successor CFS, Bigtable, and Megastore all run on Borg.

In this paper, we call higher-priority jobs "prod" and the rest "non-prod". Most long-running service jobs are prod; most batch jobs are non-prod. In a representative cell, prod jobs are allocated about 70% of the total CPU resources and represent about 60% of CPU usage; they are allocated about 55% of the total memory and represent about 85% of memory usage. The distinction between allocation and usage is explained in Section 5.5.

2.2. Cluster and cell

The machines in a cell typically belong to a single cluster, connected by a high-performance datacenter-scale network fabric. A cluster usually lives in a single datacenter building, and a collection of such buildings makes up a site. A cluster usually hosts one large cell, and may have a few smaller cells used for testing or other special purposes. We assiduously avoid any single point of failure.

Excluding test cells, our median cell size is about 10k machines, and some cells are much larger. The machines in a cell are heterogeneous in many dimensions: size (CPU, RAM, disk, network), processor type, performance, and capabilities such as an external IP address or flash storage. Borg shields users from most of these differences by deciding which cell to run tasks in, allocating their resources, installing their programs and other dependencies, monitoring their health, and restarting them if they fail.

2.3. Job and task

A job's properties include its name, owner, and the number of tasks it has. A job can have constraints that force its tasks to run on machines with particular attributes, such as processor architecture, OS version, or an external IP address. Constraints can be hard or soft; soft constraints act more like preferences than requirements. The start of a job can be deferred until a prior one finishes, and a job can run in only one cell.

Each task maps to a set of Linux processes running in a container on a machine. The vast majority of the Borg workload does not run inside virtual machines, because we do not want to pay the cost of virtualization; the system was also designed at a time when we had a considerable investment in processors without hardware virtualization support.

Each task also has its own properties, such as its resource requirements and its index within the job. Most task properties are the same across a job, but they can be overridden, for example to provide task-specific command-line flags. Each resource dimension (CPU cores, RAM, disk space, disk access rate, TCP ports, and so on) is specified independently at fine granularity; we do not impose fixed-size buckets or slots. Borg programs are usually statically linked to reduce dependencies on their runtime environment, and are structured as packages of binaries and data files whose installation is orchestrated by Borg.
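To see why the absence of fixed-size slots matters, here is a toy sketch (not Borg code) of the capacity lost when a fine-grained request is rounded up to fixed slots:

```python
# Illustrative only: rounding a fine-grained request up to fixed-size slots
# wastes capacity, which is why each Borg resource dimension can be
# requested at fine granularity instead.
import math

def slot_waste(request, slot_size):
    """Resources wasted if `request` were rounded up to fixed-size slots."""
    slots = math.ceil(request / slot_size)
    return slots * slot_size - request

print(slot_waste(0.3, 1.0))   # a 0.3-core task in 1-core slots wastes 0.7 cores
print(slot_waste(0.3, 0.25))  # smaller slots waste less, but rarely zero
```

Across thousands of machines, avoiding this rounding waste adds up to a significant utilization gain.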

Users operate on jobs by issuing remote procedure calls (RPCs) to Borg, most commonly from a command-line tool or from our monitoring systems. Most job descriptions are written in the declarative configuration language BCL. BCL is a variant of GCL, which generates protobuf files; BCL extends GCL with some Borg-specific keywords. GCL provides lambda expressions for calculations, and applications use these to adjust their configurations to their environment. Many BCL files are over 1k lines long, and we have accumulated tens of millions of lines of BCL. Borg job configurations have similarities to Aurora configuration files. A figure in the original paper illustrates the states that jobs and tasks go through during their lifetimes.
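The actual BCL syntax is not reproduced in this summary, so as a rough sketch of the kind of information a job description carries, here is a hypothetical job modeled as plain Python data (all field names are assumptions, not Borg's real schema):

```python
# Hypothetical job description, mirroring the job/task properties described
# in sections 2.3 and 2.5. Field names are illustrative, not Borg's API.
job = {
    "name": "hello_world",
    "owner": "ubar",              # user name borrowed from the BNS example later
    "cell": "cc",
    "priority": 100,              # a small positive integer (see section 2.5)
    "task_count": 5,
    "task_template": {
        "package": "hello_world_webserver",
        "requirements": {"cpu_cores": 0.1, "ram_mib": 100, "disk_mib": 100},
        "constraints": [("os_version", "hard"), ("external_ip", "soft")],
    },
}

# Per-task overrides, e.g. a task-specific command-line flag for task 0:
overrides = {0: {"args": ["--debug=true"]}}
```

In real BCL, lambda expressions would let such a description compute values from its environment rather than hard-coding them.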

A user can change the properties of some or all of the tasks in a running job by pushing a new job configuration to Borg and then instructing Borg to update the tasks to the new specification. This acts as a lightweight, non-atomic transaction that can easily be undone until it is closed (committed). Updates are generally done in a rolling fashion, and a limit is imposed on the number of task disruptions (reschedules or preemptions) an update causes; changes that would cause more disruptions than this are skipped.
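The disruption-budgeted rolling update described above can be sketched as follows (a minimal toy model, not Borg's actual update logic):

```python
# Sketch of a rolling update with a disruption budget: disruptive changes
# (restarts/reschedules) are applied until the budget runs out, after which
# any remaining disruptive changes are skipped.

def rolling_update(tasks, needs_disruption, apply_change, max_disruptions):
    disruptions = 0
    skipped = []
    for t in tasks:
        if needs_disruption(t):
            if disruptions >= max_disruptions:
                skipped.append(t)    # over budget: skip this change
                continue
            disruptions += 1
        apply_change(t)
    return skipped

# Toy usage: tasks 0-4, even-indexed tasks need a restart, budget of 2.
skipped = rolling_update(range(5), lambda t: t % 2 == 0, lambda t: None, 2)
print(skipped)  # -> [4]
```

Non-disruptive changes (like the priority change mentioned below) never count against the budget.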

Some task updates (e.g., pushing a new binary) always require the task to be restarted; some (e.g., increasing resource requests or changing constraints) might make the task no longer fit on its machine, so it must be stopped and rescheduled; and some (e.g., changing priority) can be done without any restart or movement of the task.

Tasks can ask to be notified via a Unix SIGTERM signal before they are preempted with a SIGKILL, so they have time to clean up, save state, finish any currently executing requests, and decline new ones. The actual notice may be shorter if the preemptor sets a delay bound. In practice, a notice is delivered about 80% of the time.
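A task's side of this protocol looks roughly like the following (a minimal sketch; a real server would drain in-flight requests and checkpoint state in the handler):

```python
# Sketch of the preemption notice: the task installs a SIGTERM handler so it
# can drain gracefully before Borg follows up with an unmaskable SIGKILL.
import os
import signal

shutting_down = False

def on_sigterm(signum, frame):
    # Stop accepting new requests, finish in-flight work, save state.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, on_sigterm)

# Simulate the preemption notice by signalling ourselves:
os.kill(os.getpid(), signal.SIGTERM)
print(shutting_down)  # -> True
```

Because the notice arrives only about 80% of the time, tasks must still be able to tolerate an abrupt SIGKILL.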

2.4. Allocs

A Borg alloc (short for allocation) is a reserved set of resources on a single machine in which one or more tasks can run; the resources remain assigned whether or not they are used. Allocs can be used to set resources aside for future tasks, to retain resources between stopping a task and starting it again, and to gather tasks from different jobs onto the same machine: for example, a web server instance and an associated task that copies the server's URL logs from the local disk to a distributed file system. The resources of an alloc are treated in a similar way to the resources of a machine, and multiple tasks running inside one alloc share its resources. If an alloc must be relocated to another machine, its tasks are rescheduled with it.

An alloc set is like a job: it is a group of allocs that reserve resources on multiple machines. Once an alloc set has been created, one or more jobs can be submitted to run in it. For brevity, we will generally use "task" to refer to an alloc or a top-level task (one outside any alloc) and "job" to refer to a job or an alloc set.

2.5. Priority, quota, and admission control

What happens when more work shows up than can be accommodated? Our solutions for this are priority and quota.

Every job has a priority, a small positive integer. A high-priority task can obtain resources at the expense of a lower-priority one, even if that means preempting (killing) the lower-priority task. Borg defines non-overlapping priority bands for different uses, including (in decreasing-priority order): monitoring, production, batch, and best effort (also known as testing or free). In this paper, prod jobs are the ones in the monitoring and production bands.

A preempted task is usually rescheduled elsewhere in the cell, and preemptions can cascade: a high-priority task bumps a slightly lower-priority one, which bumps a still lower-priority one, and so on. To prevent such cascades, we disallow tasks in the production priority band from preempting one another. Fine-grained priorities are still useful in other circumstances: for example, MapReduce master tasks run at a slightly higher priority than the workers they control, which improves the overall system's reliability.
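The anti-cascade rule can be captured in a few lines (band boundaries are a made-up assumption here; the paper does not give the numeric ranges):

```python
# Sketch of the preemption rule described above: higher priority may preempt
# lower priority, except that two tasks in the production band never preempt
# each other. The numeric band range is hypothetical.

PRODUCTION_BAND = range(100, 200)   # assumed numeric range for the band

def may_preempt(attacker_prio, victim_prio):
    both_production = (attacker_prio in PRODUCTION_BAND
                       and victim_prio in PRODUCTION_BAND)
    return attacker_prio > victim_prio and not both_production

print(may_preempt(150, 50))    # production over batch: allowed -> True
print(may_preempt(150, 120))   # production over production: disallowed -> False
```

Keeping preemption out of the production band caps the length of any preemption chain at one production victim.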

Priority expresses the relative importance of jobs that are running or waiting to run in a cell. Quota is used to decide which jobs to admit for scheduling. Quota is expressed as a vector of resource quantities (CPU, RAM, disk, and so on) at a given priority, for a period of time (typically months). The quantities specify the maximum amount of resources that a user's jobs can request at one time (e.g., "20 TiB of RAM at prod priority from now until the end of July in cell xx"). Quota checking is part of admission control, not scheduling: jobs with insufficient quota are rejected immediately upon submission.
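The admission decision is essentially a per-dimension vector comparison, sketched here (dimension names and shapes are assumptions, not Borg's API):

```python
# Sketch of quota checking as admission control: a submitted job is rejected
# outright if any dimension of its request exceeds the user's remaining quota
# at that priority. Not scheduling: no machines are considered here.

def admit(job_request, remaining_quota):
    """Return True iff the request fits the remaining quota in every dimension."""
    return all(remaining_quota.get(dim, 0) >= amount
               for dim, amount in job_request.items())

quota = {"cpu_cores": 1000, "ram_tib": 20, "disk_tib": 400}

print(admit({"cpu_cores": 200, "ram_tib": 5}, quota))   # -> True
print(admit({"cpu_cores": 200, "ram_tib": 25}, quota))  # -> False: rejected
```

An admitted job may still wait to be scheduled; admission only guarantees that the request is within the user's entitlement.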

Quota at higher priority typically costs more than quota at lower priority. Production-priority quota is limited to the amount of resources actually available in the cell, so a user who submits a production-priority job that fits within their quota can expect it to run. Although we encourage users not to purchase more quota than they need, many users overbuy, because it insulates them against future shortages as their application's user base grows. We respond to this by over-selling quota at lower priority levels: every user has infinite quota at priority zero, although this is hard to exercise in practice because resources are over-subscribed. A low-priority job may be admitted but remain pending (unscheduled) because the requested resources are unavailable.

Quota allocation is handled outside of Borg and is intimately tied to our physical capacity planning, whose results are reflected in the price and availability of quota in different datacenters. A user's job is admitted only if it has sufficient quota at the required priority. The use of quota reduces the need for policies like Dominant Resource Fairness (DRF).

Borg also has a capability system that gives special privileges to some users: for example, allowing administrators to delete or modify any job in the cell, or allowing a user to access restricted kernel features or Borg behaviors, such as disabling resource estimation on their jobs.

2.6. Naming and monitoring

Creating and deploying tasks is not enough by itself: a service's clients and other systems need to be able to find them, even after they are relocated to a new machine. To enable this, Borg creates a stable "Borg name service" (BNS) name for each task that includes the cell name, job name, and task number. Borg writes the task's hostname and port, under this BNS name, into a consistent, highly available file in Chubby, which is used by our RPC systems to find task endpoints. The BNS name also forms the basis of the task's DNS name: the fiftieth task in a job called jfoo owned by user ubar in cell cc is reachable via 50.jfoo.ubar.cc.borg.google.com. Borg also writes job size and task health information into Chubby whenever they change, so load balancers can see where to route requests.
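The DNS name construction from the example above is straightforwardly mechanical (the helper function here is illustrative, not a Borg API):

```python
# Build the BNS-derived DNS name described above: task number, job name,
# user, and cell, joined under borg.google.com.

def bns_dns_name(task_index, job, user, cell, domain="borg.google.com"):
    return f"{task_index}.{job}.{user}.{cell}.{domain}"

print(bns_dns_name(50, "jfoo", "ubar", "cc"))
# -> 50.jfoo.ubar.cc.borg.google.com
```

Because the name encodes only stable identifiers (not a hostname), it survives the task being rescheduled onto a different machine.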

Almost every task running under Borg contains a built-in HTTP server that publishes the task's health and many performance metrics (RPC latencies and so on). Borg monitors the health-check URL and restarts tasks that do not respond promptly or that return an HTTP error code. Other data is tracked by additional monitoring tools, which raise alerts on service level objective (SLO) violations.
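The monitor's per-task decision can be sketched as follows (the policy shown, restart on any error or timeout, is a simplifying assumption):

```python
# Sketch of the health-check decision described above: probe the task's
# built-in HTTP health URL and restart tasks that error out or time out.

def check_task(fetch_health):
    """fetch_health() returns an HTTP status code, or raises on timeout."""
    try:
        status = fetch_health()
    except Exception:
        return "restart"            # unresponsive task
    return "ok" if status == 200 else "restart"

print(check_task(lambda: 200))  # healthy task
print(check_task(lambda: 500))  # HTTP error: restart
```

A production monitor would of course add retries and backoff before concluding a task is dead.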

A service called Sigma provides a web-based user interface through which a user can examine the state of all their jobs or of a particular cell, and drill down into individual jobs and tasks to examine their resource behavior, detailed logs, execution history, and eventual fate. Our applications generate voluminous logs: these are automatically rotated to avoid running out of disk space, and are preserved for a while after a task exits to aid debugging. If a job is not running, Borg provides a "why pending?" annotation, together with guidance on how to modify the job's resource requests to better fit the cell. We publish guidelines for "conforming" resource requests that are likely to schedule smoothly.

Borg records all job submissions and task events, as well as detailed per-task resource usage information, in Infrastore, a scalable read-only data store with a SQL-like interface. This data is used for usage-based charging, debugging job and system failures, and long-term capacity planning. It also provided the data for Google's cluster workload trace.

All of these features help users understand and debug the behavior of Borg and their jobs, and help our SREs manage a few tens of thousands of machines per person.

Note: some parts of this translation may read awkwardly or be unclear; corrections are welcome.

Original address: Http://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/43438.pdf
