3. Borg Architecture
A Borg cell consists of a set of machines, typically with a logically centralized controller called the Borgmaster, and an agent process called the Borglet that runs on every machine in the cell. All components of Borg are written in C++.
3.1. Borgmaster
The Borgmaster of each cell consists of two processes: the main Borgmaster process and a separate scheduler. The main Borgmaster process handles client RPCs that either mutate state (e.g., creating a job) or provide read-only access to data (e.g., looking up a job). It also manages the state machines for the various objects in the system (machines, tasks, allocs, etc.), communicates with the Borglets, and offers a web UI as a backup to Sigma.
Although the Borgmaster is logically a single process, it is in fact replicated five times. Each replica maintains an in-memory copy of most of the state of the cell, and this state is also recorded on the replicas' local disks in a highly available, distributed, Paxos-based store. A single elected master per cell serves both as the Paxos leader and as the state mutator, handling all operations that change the cell's state, such as submitting a job or terminating a task on a machine. When a cell is brought up, or when the current master fails, a new master is elected using Paxos; this involves acquiring a Chubby lock so that other systems can find it. Electing a master typically takes about 10 seconds, but can take up to a minute in a large cell because the in-memory state has to be reconstructed. When a replica recovers from an outage, it dynamically re-synchronizes its state from the other replicas to catch up to the latest state.
The Borgmaster's state at a point in time is called a checkpoint, which takes the form of a periodic snapshot plus a change log kept in the Paxos store. Checkpoints have many uses, including: restoring the Borgmaster's state to an arbitrary point in the past (for example, just before the request that triggered a defect in Borg, so that it can be debugged); fixing state by hand in extreme cases; building a persistent log of events for future queries; and offline simulations.
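The snapshot-plus-change-log design described above can be sketched as follows. This is a minimal illustration, not Borg's actual checkpoint format; the function name, the tuple layouts, and the flat key/value state are all invented for clarity.

```python
# Hypothetical sketch of restoring state from a periodic snapshot plus a
# change log: start from the latest snapshot at or before the target
# time, then replay only the log entries up to that time.

def restore_state(snapshots, change_log, target_time):
    """Rebuild cell state as of target_time.

    snapshots:  list of (timestamp, state_dict), sorted by timestamp.
    change_log: list of (timestamp, key, value), sorted by timestamp.
    """
    # Start from the latest snapshot taken at or before target_time.
    base_time, state = max(
        (s for s in snapshots if s[0] <= target_time), key=lambda s: s[0]
    )
    state = dict(state)  # copy so the stored snapshot stays untouched
    # Replay only the log entries between the snapshot and target_time.
    for ts, key, value in change_log:
        if base_time < ts <= target_time:
            state[key] = value
    return state
```

Restoring to the moment just before a bad request is what makes the debugging use case in the text possible: the reconstructed state can then be fed to a simulator instead of the live master.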
Fauxmaster, a high-fidelity Borgmaster simulator, can read checkpoint files. It contains a complete copy of the production Borgmaster code, but with stubbed-out interfaces to the Borglets. It accepts RPCs to make state-machine changes and perform operations such as "schedule all pending tasks," and we can use it to debug failures by interacting with it as if it were a live Borgmaster, with simulated Borglets replaying the real interactions recorded in the checkpoint. This lets a user step through and observe the changes to the system state that actually occurred in the past. Fauxmaster is also useful for capacity planning (e.g., "how many new jobs of this type would fit?") and for sanity checks before changing a cell's configuration (e.g., "will this change evict any important jobs?").
3.2. Scheduling
When a job is submitted, the Borgmaster records it persistently in the Paxos store and adds the job's tasks to the pending queue. The scheduler scans this queue asynchronously and assigns tasks to machines that have sufficient available resources and meet the job's constraints. (The scheduler primarily operates on tasks, not jobs.) The scan proceeds from high priority to low, with round-robin ordering within each priority level, to ensure fairness across users and to avoid head-of-line blocking behind large jobs. The scheduling algorithm has two parts: feasibility checking, which finds the machines on which the task could run, and scoring, which picks one of those feasible machines.
In feasibility checking, the scheduler finds the set of machines that meet the task's constraints and have enough "available" resources (which includes resources assigned to lower-priority tasks that could be evicted). In scoring, the scheduler then evaluates each of the feasible machines. The score takes user-specified preferences into account, but is mostly driven by built-in criteria, such as: minimizing the number and priority of preempted tasks; preferring machines that already have a copy of the task's packages; spreading tasks across power and failure domains; and packing quality, which includes mixing high- and low-priority tasks on a single machine so that the high-priority ones can expand during load spikes.
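The two-phase structure above can be sketched as a tiny placement pass. Everything here is illustrative: the machine/task dictionaries, the single-resource CPU model, and the scoring weights are invented, and Borg's real criteria are far richer.

```python
# Toy two-phase placement: feasibility filtering, then scoring.
# Resources are reduced to CPU only; weights are arbitrary.

def feasible(machine, task):
    # Feasible if the machine satisfies the task's constraints and has
    # enough resources, counting resources currently held by
    # lower-priority (evictable) tasks.
    free = machine["free_cpu"] + sum(
        t["cpu"] for t in machine["tasks"] if t["priority"] < task["priority"]
    )
    return task["constraints"] <= machine["attributes"] and free >= task["cpu"]

def score(machine, task):
    # Higher is better: reward a locally cached package, penalize each
    # preemption this placement would force.
    preemptions = sum(
        1 for t in machine["tasks"]
        if t["priority"] < task["priority"] and machine["free_cpu"] < task["cpu"]
    )
    return (10 if task["package"] in machine["packages"] else 0) - 5 * preemptions

def place(task, machines):
    candidates = [m for m in machines if feasible(m, task)]
    return max(candidates, key=lambda m: score(m, task), default=None)
```

The split keeps the expensive work (scoring) confined to the machines that passed the cheap filter, which is also what makes the caching and sampling optimizations described later in this section effective.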
Borg originally used a variant of E-PVM for scoring, which generates a single cost value across heterogeneous resources and minimizes the change in cost when placing a task. In practice, E-PVM ends up spreading the load across all machines, leaving the remaining headroom free for load spikes, at the expense of increased fragmentation, especially for large tasks that need most of a machine's resources; this approach is often called "worst fit."
The opposite of worst fit is, naturally, "best fit": it tries to fill machines as tightly as possible. This leaves some machines empty of user jobs (though they still run storage servers), so placing large tasks is straightforward, but the tight packing penalizes any misestimation of resource requirements by users or by Borg. That hurts applications with bursty loads, and it is particularly bad for batch jobs, which specify low CPU requirements so that they can be scheduled easily and run opportunistically in otherwise-unused resources: 20% of non-prod tasks request less than 0.1 CPU cores.
The scoring model we use now is a hybrid. It tries to reduce the amount of stranded resources, i.e., resources that cannot be used because another resource on the machine is already fully allocated. It provides packing efficiency that is about 3-5% better than best fit.
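The worst-fit versus best-fit contrast above can be made concrete with a minimal sketch over free CPU alone. The function names and the single-resource model are simplifications for illustration; the real policies score across all resource dimensions.

```python
# Worst fit spreads load, keeping headroom for spikes; best fit packs
# tightly, leaving whole machines free but punishing misestimates.

def worst_fit(task_cpu, machines):
    # Pick the machine that would be left with the MOST free CPU.
    fits = [m for m in machines if m["free_cpu"] >= task_cpu]
    return max(fits, key=lambda m: m["free_cpu"] - task_cpu, default=None)

def best_fit(task_cpu, machines):
    # Pick the machine that would be left with the LEAST free CPU.
    fits = [m for m in machines if m["free_cpu"] >= task_cpu]
    return min(fits, key=lambda m: m["free_cpu"] - task_cpu, default=None)
```

The hybrid policy in the text sits between these extremes: it packs well enough to avoid stranding one resource behind another, without squeezing out all the headroom that bursty tasks rely on.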
If the machine selected by the scoring phase does not have enough available resources to run the new task, Borg preempts (kills) lower-priority tasks, from lowest to highest priority, until it does. The preempted tasks are added back to the scheduler's pending queue rather than being migrated or hibernated.
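The lowest-priority-first eviction order can be sketched directly. This is an illustrative single-resource toy, not Borg's implementation; the dictionary shapes are invented.

```python
# Sketch of the preemption step: evict strictly lower-priority tasks,
# lowest priority first, just until the new task fits. The victims are
# returned so they can rejoin the pending queue (no migration).

def preempt_for(task, machine):
    victims = []
    free = machine["free_cpu"]
    for t in sorted(machine["tasks"], key=lambda t: t["priority"]):
        if free >= task["cpu"]:
            break
        if t["priority"] < task["priority"]:
            victims.append(t)
            free += t["cpu"]
    if free < task["cpu"]:
        return None  # cannot make room even after all possible evictions
    return victims
```

Evicting lowest-priority tasks first minimizes the disruption cost, which is one of the built-in scoring criteria mentioned earlier.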
Task startup latency (the time from job submission to a task running) is an area that has received, and continues to receive, significant attention. It is highly variable; the median is about 25 seconds. Package installation accounts for about 80% of that time: a known bottleneck is contention for the local disk where packages are written. To reduce startup time, the scheduler prefers machines that already have the necessary packages (programs and data) installed. Most packages are immutable, so they can be shared and cached; this is the only form of data locality supported by the Borg scheduler. In addition, Borg distributes packages to machines in parallel using tree- and torrent-like protocols.
Finally, the scheduler uses several other techniques to let it scale to cells with tens of thousands of machines.
3.3. Borglet
The Borglet is a local Borg agent that runs on every machine in a cell. It starts and stops tasks, restarts them if they fail, manages local resources by manipulating operating-system kernel settings, and reports the state of the machine to the Borgmaster and other monitoring systems.
The Borgmaster polls each Borglet every few seconds to retrieve the machine's current state and to send it any outstanding requests. This gives the Borgmaster control over the rate of communication, avoids the need for explicit flow control, and prevents recovery storms.
The elected master prepares the messages sent to the Borglets and updates the cell's state with their responses. For performance scalability, each Borgmaster replica runs a stateless "link shard" that handles the communication with a subset of the Borglets; the partitioning is recalculated whenever a Borgmaster election occurs. For resiliency, the Borglet always reports its full state, but the link shards aggregate and compress this information, reporting only the differences to the state machines, which reduces the update load on the elected master.
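The link shard's aggregation step amounts to diffing each full report against the previous one. The sketch below is hypothetical; the real shards work against Borg's state machines, not flat dictionaries.

```python
# Sketch of a link shard compressing a Borglet's full-state report:
# the Borglet resends everything for resiliency, and the shard forwards
# only what changed since the last report.

def diff_report(previous, current):
    """Return only the entries that differ between two full reports."""
    changes = {}
    for key, value in current.items():
        if previous.get(key) != value:
            changes[key] = value
    for key in previous:
        if key not in current:
            changes[key] = None  # tombstone: the entry disappeared
    return changes
```

In the steady state most entries are unchanged, so the diff is tiny, which is exactly why this relieves the elected master of most of the update load.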
If a Borglet does not respond to several poll messages in a row, its machine is marked as down and any tasks it was running are rescheduled on other machines. If communication is later restored, the Borgmaster tells that Borglet to kill the tasks that have already been rescheduled, to avoid duplicates. A Borglet continues normal operation even if it loses contact with the Borgmaster, so currently-running tasks and services stay up even if all Borgmaster replicas fail.
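The mark-down-then-reconcile behavior can be sketched as two event handlers. The threshold, the dictionary shape, and the callback are all invented for illustration.

```python
# Sketch of down-machine handling: after several missed polls the
# machine is marked down and its tasks are rescheduled; on reconnect,
# the stale local copies are killed so each task runs in one place.

MAX_MISSED_POLLS = 3  # illustrative threshold, not Borg's actual value

def on_poll_timeout(machine, reschedule):
    machine["missed"] += 1
    if machine["missed"] >= MAX_MISSED_POLLS and not machine["down"]:
        machine["down"] = True
        for task in machine["tasks"]:
            reschedule(task)  # placed again on some other machine

def on_reconnect(machine, rescheduled_elsewhere):
    machine["missed"] = 0
    machine["down"] = False
    # Kill local copies of tasks that were already rescheduled.
    to_kill = [t for t in machine["tasks"] if t in rescheduled_elsewhere]
    machine["tasks"] = [t for t in machine["tasks"] if t not in to_kill]
    return to_kill
```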
3.4. Scalability
We are not sure where the ultimate scalability limit of Borg's centralized architecture will come from; so far, every time we have approached a limit, we have managed to eliminate it. A single Borgmaster can manage many thousands of machines in a cell, and several cells admit more than 10,000 tasks per minute. A busy Borgmaster uses 10-14 CPU cores and up to 50 GiB of RAM. Several techniques are used to achieve this scale.
Early versions of the Borgmaster had a single, synchronous loop that accepted requests, scheduled tasks, and communicated with the Borglets. To handle larger cells, we split the scheduler into a separate process so it could operate in parallel with the other Borgmaster functions, which are replicated for fault tolerance. A scheduler replica operates on a cached copy of the cell state. It repeatedly retrieves state changes from the elected master (including both assigned and pending work), updates its local copy, does a scheduling pass to assign the pending tasks, and informs the elected master of those assignments. The master accepts and applies these assignments unless they are inappropriate (e.g., based on out-of-date state), in which case they are reconsidered in the scheduler's next pass. This is quite similar in spirit to the optimistic concurrency control in Omega, and indeed Borg can now use different schedulers for different workload types.
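The optimistic handoff described above can be sketched with version numbers standing in for the real staleness check. The class, field names, and proposal format are hypothetical.

```python
# Toy optimistic concurrency: the scheduler proposes assignments computed
# against a cached copy of cell state; the master rejects any proposal
# whose machine state has moved on, returning the task for rescheduling.

class Master:
    def __init__(self):
        self.machine_version = {}  # machine -> version of its state
        self.placements = {}       # task -> machine

    def apply(self, proposals):
        """proposals: list of (task, machine, version_the_scheduler_saw)."""
        rejected = []
        for task, machine, seen_version in proposals:
            if self.machine_version.get(machine, 0) != seen_version:
                rejected.append(task)  # based on out-of-date state
            else:
                self.placements[task] = machine
                self.machine_version[machine] = seen_version + 1
        return rejected  # reconsidered in the next scheduling pass
```

Note how a second proposal against the same machine in one batch is rejected: the first placement bumped the version, so the second was computed on stale state, which mirrors the conflict-and-retry loop in the text.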
To improve response times, we added separate threads to talk to the Borglets and to respond to read-only RPCs. For greater performance, we sharded (partitioned) these functions across the five Borgmaster replicas. Together, these changes keep the 99th-percentile UI response time below 1 second and the 95th-percentile Borglet polling interval below 10 seconds. Several techniques make the Borg scheduler more scalable:
Score caching: Evaluating the feasibility of a machine and scoring it is expensive, so Borg caches the scores until the properties of the machine or the task change, for example when a task on the machine terminates, an attribute is altered, or a task's requirements change. Ignoring small changes in resource quantities reduces cache invalidations.
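A minimal memoization sketch of score caching follows. The class and its invalidation policy are invented for illustration; as the text notes, the real cache can also ignore small resource-quantity changes, which this toy version does not do.

```python
# Sketch of score caching: scores are memoized per (machine, task class)
# and dropped only when that machine changes.

class ScoreCache:
    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.cache = {}
        self.computed = 0  # number of real scoring calls, for inspection

    def score(self, machine_id, machine, task_key):
        key = (machine_id, task_key)
        if key not in self.cache:
            self.cache[key] = self.score_fn(machine, task_key)
            self.computed += 1
        return self.cache[key]

    def invalidate_machine(self, machine_id):
        # Called when a task on the machine ends or an attribute changes.
        self.cache = {k: v for k, v in self.cache.items() if k[0] != machine_id}
```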
Equivalence classes: Tasks in a Borg job usually have identical requirements and constraints. So, rather than determining feasibility for every pending task on every machine and scoring all of the feasible machines, Borg performs the feasibility and scoring work for only one task per equivalence class, where an equivalence class is a group of tasks with identical requirements.
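Grouping pending tasks by their requirements can be sketched in a few lines. The class key below is invented: in reality it would cover everything the feasibility and scoring steps examine.

```python
# Sketch of equivalence classes: tasks with identical requirements and
# constraints share a single feasibility-and-scoring pass.

from collections import defaultdict

def equivalence_classes(tasks):
    groups = defaultdict(list)
    for task in tasks:
        key = (task["cpu"], task["ram"], frozenset(task["constraints"]))
        groups[key].append(task)
    return groups
```

With this grouping, a job of 1,000 identical tasks costs one machine evaluation per scheduling pass instead of 1,000.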
Relaxed randomization: It is wasteful to compute feasibility and scores for every machine in a large cell, so the scheduler examines machines in random order until it has found enough feasible machines to score, and then picks the best of that set. This reduces the amount of scoring and cache invalidation needed as tasks enter and leave the system, and speeds up the assignment of tasks to machines. Relaxed randomization is somewhat akin to the batch sampling of Sparrow, while additionally handling priorities, preemptions, heterogeneity, and the cost of package installation.
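The early-stopping sample can be sketched as below; the function name, the `want` parameter, and the feasibility predicate are illustrative.

```python
# Sketch of relaxed randomization: visit machines in random order and
# stop once enough feasible candidates are found, scoring only those.

import random

def sample_feasible(machines, is_feasible, want, rng=random):
    order = list(machines)
    rng.shuffle(order)
    found = []
    for m in order:
        if is_feasible(m):
            found.append(m)
            if len(found) >= want:
                break  # the remaining machines are never even checked
    return found
```

In a cell of tens of thousands of machines, scoring a small random feasible sample instead of the whole fleet is what keeps each scheduling pass fast.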
In our experiments, scheduling a cell's entire workload from scratch took a few hundred seconds, but would not have finished even after three days with the techniques above disabled. Normally, though, an online scheduling pass over the pending queue completes in less than half a second.
Original paper: http://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/43438.pdf
Google's large-scale cluster management tool Borg (II): Borg Architecture