How to design a stable Cron service across the globe
This article describes how Google implements a reliable distributed Cron service for the internal teams that need to schedule recurring computational work. Building and running this system taught us a lot about how to design a service so that it behaves like dependable infrastructure. Here we discuss the problems a distributed Cron can run into and how to solve them.
Cron is a common UNIX tool for running user-specified tasks on a regular schedule. We start by analyzing the basic principles of cron and its most common implementation, then review how a Cron-like service should work in a large, distributed environment so that even single-machine failures do not affect availability. We introduce a cron system built on a small number of machines, and then combine it with the datacenter scheduling system to run cron jobs across an entire datacenter.
Before describing how to run a reliable distributed cron service, let's revisit cron from an SRE perspective.
Cron is a common tool that administrators and ordinary users alike can use to run specified commands at specified times; the commands can be anything from periodic garbage collection to periodic data analysis. The most common time-specification format is the crontab, which supports not only simple periods (once a day, once an hour) but also more complex schedules, such as every Saturday or the 30th day of every month.
Cron usually consists of a single component, crond, a daemon that loads all scheduled cron jobs, sorts them by their next run time, and then waits until the first job is due. At that point crond launches the job, computes its next run time, and puts it back in the queue to wait for the next run.
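To make this concrete, here is a minimal sketch of such a scheduling loop in Go; it assumes fixed intervals rather than full crontab parsing and prints a line instead of actually executing a command.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// job is a simplified cron entry: a name and a fixed interval stand in for a
// full crontab schedule.
type job struct {
	name     string
	interval time.Duration
	nextRun  time.Time
}

func main() {
	now := time.Now()
	jobs := []*job{
		{name: "garbage-collect", interval: 5 * time.Minute, nextRun: now.Add(5 * time.Minute)},
		{name: "daily-report", interval: 24 * time.Hour, nextRun: now.Add(24 * time.Hour)},
	}

	for {
		// Sort by next run time and sleep until the earliest job is due.
		sort.Slice(jobs, func(i, j int) bool { return jobs[i].nextRun.Before(jobs[j].nextRun) })
		next := jobs[0]
		time.Sleep(time.Until(next.nextRun))

		// Run the job (a real crond would fork and exec the configured
		// command), then compute its next run time and requeue it.
		fmt.Printf("%s running %s\n", time.Now().Format(time.RFC3339), next.name)
		next.nextRun = next.nextRun.Add(next.interval)
	}
}
```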
Reliability
Looking at this service from a reliability standpoint, a few things stand out.
First, crond's failure domain is essentially a single machine: if that machine is down, neither cron scheduling nor job launching can happen. Now consider a very simple distributed example: we use two machines, and cron dispatches jobs to run on the other machine (for example, via SSH). We now have two failure domains: either the scheduling machine or the target machine can fail.
Another thing to note is that the crontab configuration must survive a crond restart (including a server reboot). crond launches a job and then "forgets" about it; it makes no attempt to track the job's status, including whether it actually ran.
anacron is an exception: it is a complement to crontab that tries to run jobs that should have run but were missed because the machine was down. This is limited to jobs that run daily or less frequently, but it is very useful for maintenance work on workstations and laptops. anacron makes running these special jobs easier by keeping a configuration file that records each job's last execution time.
Cron Jobs and Idempotency
Cron jobs are meant to perform recurring work, but beyond that it is hard to know much about what any given job does. Let's set the main topic aside for a moment and start with the cron jobs themselves, because only by understanding their varied requirements can we know how they affect the reliability properties we need; that discussion runs through the rest of this article.
Some cron jobs are idempotent, so in the event of certain system failures it is safe to run them more than once; garbage collection is one example. Other cron jobs should not be run twice, for example a job that sends out a message.
In more complicated cases, some cron jobs can tolerate a "forgotten" (skipped) run while others cannot. For instance, a garbage-collection job scheduled every five minutes can afford to skip a run without much harm, but a monthly payroll job absolutely cannot.
The variety of cron jobs makes a one-size-fits-all answer to failure impossible. So, in the cases described above, we generally prefer skipping a run over running a job two or more times. Cron job owners should (and must) monitor their jobs, for example by having the service return each invocation's result or by sending run logs to the owners, so that appropriate remedial action can be taken even when a run is skipped. When a run fails, we prefer that the job "fail closed" to avoid putting the system into a bad state.
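To illustrate one common way of making a non-idempotent action safe to retry, the small Go sketch below guards a "send message" step with a deduplication key; the in-memory store and the key format are assumptions made for the example, and a real job would keep this record in durable storage.

```go
package main

import "fmt"

// sentStore records which launches have already sent their message. An
// in-memory map is enough for a sketch; a real system needs durable storage.
type sentStore map[string]bool

// sendOnce makes a non-idempotent action (sending a message) safe to repeat
// by checking a deduplication key before acting.
func sendOnce(store sentStore, dedupKey, message string) {
	if store[dedupKey] {
		fmt.Printf("skip %q: already sent\n", dedupKey)
		return
	}
	fmt.Printf("send %q: %s\n", dedupKey, message)
	store[dedupKey] = true
}

func main() {
	store := sentStore{}
	// Retrying the same launch (same key) does not send the message twice.
	sendOnce(store, "payroll-2016-05", "payroll processed")
	sendOnce(store, "payroll-2016-05", "payroll processed")
}
```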
Cron at Large Scale
Moving cron from a single machine to a whole cluster requires rethinking how to make cron work well in that environment. Before presenting Google's Cron, let's discuss the differences between single-machine and multi-machine deployments, and how that shift changes the design.
Extending Infrastructure
Conventional cron is limited to a single machine, but a massively deployed cron solution cannot be tied to one machine. Suppose we have a datacenter with 1,000 servers: if even a 1-in-1,000 chance of server unavailability could destroy our entire Cron service, that is clearly not what we want.
So, to solve this problem, we have to decouple the service from the machine. To run a service, you simply specify which datacenter it should run in; the rest is up to the datacenter scheduling system (which we assume is itself reliable). The scheduler decides which machine or machines the service runs on and handles machines dying. Running a job in a datacenter then becomes just one or more RPCs from the Cron service to the datacenter scheduler.
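As a rough sketch of what this looks like from the Cron service's side, the Go example below defines a hypothetical scheduler interface and sends a single launch request to it; none of the type or method names come from a real scheduler API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// LaunchRequest is a hypothetical request to a datacenter scheduler: the Cron
// service names the job and the datacenter; the scheduler picks the machines.
type LaunchRequest struct {
	Datacenter string
	JobName    string
	Command    string
}

// Scheduler stands in for the datacenter scheduling system's RPC interface.
type Scheduler interface {
	Launch(ctx context.Context, req LaunchRequest) error
}

// launchCronJob shows the shape of the interaction: running a cron job in the
// datacenter is just an RPC; placement and machine failures are handled by
// the scheduler, not by the Cron service.
func launchCronJob(ctx context.Context, s Scheduler, job, command string) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	return s.Launch(ctx, LaunchRequest{Datacenter: "dc-east", JobName: job, Command: command})
}

// fakeScheduler is a local stand-in so the sketch runs on its own.
type fakeScheduler struct{}

func (fakeScheduler) Launch(ctx context.Context, req LaunchRequest) error {
	fmt.Printf("launching %s in %s: %s\n", req.JobName, req.Datacenter, req.Command)
	return nil
}

func main() {
	if err := launchCronJob(context.Background(), fakeScheduler{}, "gc-hourly", "/bin/gc --full"); err != nil {
		fmt.Println("launch failed:", err)
	}
}
```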
However, this process is not instantaneous. It takes time to detect which machines have died (and what happens if the machine-health checker itself dies?), and rescheduling the task onto another machine also takes time.
Moving a program to another machine may mean losing any state stored on the old machine (unless live migration is used), and the rescheduling interval may exceed the smallest schedulable unit of one minute, so we must account for both cases. A straightforward approach is to put the state into a distributed file system such as GFS, recording it while the job runs and consulting it when the job is rescheduled. However, this does not meet our timeliness expectations: for a cron job that runs every five minutes, a rescheduling delay of one to two minutes is a considerable fraction of its period.
The need for timeliness pushes toward hot-standby techniques that can record state quickly and recover from it quickly.
Extended Requirements
Another substantial difference between deploying a service in a datacenter and running it on a single server is planning the computing resources a job needs, such as CPU and memory.
Single-machine services typically isolate resources by process; although Docker is now becoming common, using it to isolate everything, including crond itself and the jobs it runs, is not yet common practice.
Large-scale datacenter deployments commonly use containers for resource isolation. Isolation is necessary because we want a program running in the datacenter not to adversely affect any other program. For isolation to work, the resources a program will need must be declared before it runs, for the Cron system itself as well as the jobs it launches. This in turn means that if the datacenter is temporarily short on resources, a job may start late. That requires us to monitor not only the cron job's load but its full state, from launch through termination.
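For illustration only, a declared resource reservation might look something like the sketch below; the field names and numbers are assumptions, not any real scheduler's API.

```go
package main

import "fmt"

// resourceRequest is an illustrative up-front reservation a containerized job
// declares before the datacenter scheduler will place it.
type resourceRequest struct {
	CPUMillicores int
	MemoryMiB     int
}

func main() {
	// Both the Cron service itself and every job it launches need declared
	// resources; if the datacenter is short on capacity, launches run late.
	requests := map[string]resourceRequest{
		"cron-service":    {CPUMillicores: 500, MemoryMiB: 1024},
		"gc-every-5m":     {CPUMillicores: 250, MemoryMiB: 512},
		"monthly-payroll": {CPUMillicores: 2000, MemoryMiB: 8192},
	}
	for name, r := range requests {
		fmt.Printf("%s: %dm CPU, %d MiB RAM\n", name, r.CPUMillicores, r.MemoryMiB)
	}
}
```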
Now that the Cron system is decoupled from a single machine, as described earlier, we may see launches that partially run or partially fail. Thanks to the generality of job configuration, launching a new cron job in the datacenter can be done with a single RPC call; unfortunately, that only tells us whether the RPC itself succeeded, not where the job failed, for example when it dies partway through its run, so the recovery procedure must also handle these intermediate states correctly.
In terms of failure modes, a datacenter is far more complex than a single server. Cron has grown from a single binary on one machine to a service for an entire datacenter, and it has picked up many dependencies along the way, some obvious and some not. As a basic service, Cron should keep working even when parts of the datacenter fail (for example, partial machine outages or storage unavailability). To improve reliability, we deploy the datacenter scheduling system across different physical locations, so that even if one location, or part of its power, goes down, the Cron service is not completely unavailable.
How Google's Cron Is Built
Now let's address these issues so that we can deploy a reliable Cron service in a large distributed cluster, and then look at some of Google's experience with distributed Cron.
Tracking the State of a Cron Job
As described above, we should track each cron job's state in real time so that even if something fails, recovery is easier. Moreover, the consistency of that state is critical: we would rather skip a run than run the same cron job ten times instead of once. Recall that many cron jobs, such as sending a notification message, are not idempotent.
We had two options: store the cron job data in a reliable external distributed store, or keep the small amount of job state inside the Cron service itself. When designing the distributed Cron service, we chose the second, for several reasons:
Distributed file systems such as GFS or HDFS are designed for very large files (such as the output of web crawlers), whereas the cron state we need to store is very small. Storing such tiny files on a large distributed file system is expensive, and its latency is also a poor fit.
A basic service like Cron should have as few dependencies as possible. That way, even if parts of the datacenter go down, the Cron service can keep functioning for at least a while. This does not mean the storage has to be a direct part of the Cron program (that is essentially an implementation detail), but Cron should be able to operate as an independent system for its users.
Using Paxos
We deploy multiple replicas of the Cron service and use the Paxos algorithm to keep their state in sync.
Paxos and alternative algorithms such as Zab and Raft are very common in distributed systems. Describing Paxos in detail is beyond the scope of this article; its basic role is to bring multiple unreliable nodes to agreement on shared state: as long as a majority of the Paxos group members are available, the distributed system as a whole can process new state changes.
The distributed Cron uses a single master: only the master can change the shared state, and only the master can launch cron jobs. We use a variant of Paxos, Fast Paxos, and the Fast Paxos primary node also acts as the master of the Cron service.
If the master dies, Paxos's health-check mechanism discovers this within seconds and elects a new master. Once the new Paxos primary is elected, the Cron service effectively elects a new Cron master along with it, and the new Cron master takes over all the unfinished work left by its predecessor. Here the Cron master is the same node as the Paxos primary, but the Cron master has extra work to do. Being able to elect a new master quickly lets us tolerate roughly a minute of downtime.
One of the most important pieces of state we maintain with Paxos is which cron jobs are currently running. For every launch of every cron job, we synchronize its start and its end to a quorum of nodes.
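The sketch below shows the kind of record this implies, with the Paxos machinery reduced to a single commit call; the names and fields are illustrative rather than Google's actual schema.

```go
package main

import (
	"fmt"
	"time"
)

// launchState marks the two synchronization points of a launch.
type launchState string

const (
	launchStarted  launchState = "started"
	launchFinished launchState = "finished"
)

// launchRecord identifies one launch of one cron job. Binding the scheduled
// time into the record disambiguates launches of high-frequency jobs.
type launchRecord struct {
	JobName       string
	ScheduledTime time.Time
	State         launchState
}

// replicatedLog stands in for the Paxos-replicated log: in the real service
// an entry only counts once a quorum of replicas has accepted it.
type replicatedLog struct {
	entries []launchRecord
}

func (l *replicatedLog) commit(r launchRecord) {
	// A real implementation would run a Paxos round here and block until a
	// quorum acknowledges the entry.
	l.entries = append(l.entries, r)
}

func main() {
	log := &replicatedLog{}
	sched := time.Date(2016, 5, 1, 0, 0, 0, 0, time.UTC)
	log.commit(launchRecord{"monthly-payroll", sched, launchStarted})
	log.commit(launchRecord{"monthly-payroll", sched, launchFinished})
	for _, e := range log.entries {
		fmt.Printf("%s @ %s: %s\n", e.JobName, e.ScheduledTime.Format(time.RFC3339), e.State)
	}
}
```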
Master and Slave Node Roles
As described above, the Cron service uses Paxos and deploys its nodes in two different roles, master and slave. Let's describe each role in turn.
Master Node
The master node is the one that launches cron jobs. It has an internal scheduler, similar to single-machine crond, that maintains a list of jobs to launch and launches each at its scheduled time.
At launch time, the master "announces" that it is going to launch the specified job and computes the job's next launch time, just as crond would. Of course, as with crond, a job's next launch time may be changed manually after a launch, and such changes must also be propagated to the slave nodes. Simply identifying the cron job is not enough: we should also bind the launch to its scheduled start time, to avoid ambiguity about which launch is which (especially for high-frequency jobs, such as those that run once a minute). This "announcement" is carried out through Paxos.
It is important that the Paxos communication stay synchronous: a job may not actually be launched until a Paxos quorum has acknowledged the launch announcement. The Cron service needs to know whether each job has been started so that it can decide what to do next even if the master dies. Without this synchronization, the entire launch would effectively happen only on the master, with the slave nodes unaware of it; if a failure occurred, the job would likely be launched again, because no surviving node would know it had already run.
The completion status of a cron launch is likewise communicated to the other nodes via Paxos and kept in sync; note that "done" here does not indicate whether the job succeeded or failed. We also have to handle the case where the Cron service fails partway through launching a job at its scheduled time; we discuss that in the next section.
Another important property of the master node is that as soon as it loses mastership for any reason, it must immediately stop interacting with the datacenter scheduling system. Holding mastership must imply mutually exclusive access to the datacenter scheduler; otherwise, the old and new masters might issue conflicting requests to it.
Slave Node
The slave nodes track the state reported by the master node in real time so that they can take over at a moment's notice. All state changes on the master are propagated to every slave through Paxos. Like the master, each slave maintains a list of all cron jobs, and this list must be consistent across all nodes (again, via Paxos).
When a slave receives notification of a launch, it records the job's next launch time in its local job list. This state change (which is done synchronously) keeps the cron schedule consistent across the whole system. We keep track of all open launches, that is, launches for which we have recorded a start but not yet an end.
If the master dies or becomes unreachable for some other reason (for example, a network failure), one of the slave nodes is elected as the new master. The election must complete within a minute to avoid missing cron launches. Once elected, the new master must re-validate all launches that were in flight (or partially failed). This can be a complicated process that imposes additional requirements on both the Cron service and the datacenter scheduling system; we explain it in detail below.
Failure Recovery
As mentioned above, the master launches a logical cron job by talking to the datacenter's scheduling system via RPC, and this sequence of RPC calls can fail, so we have to take that into account and handle it well.
Recall that every cron launch has two synchronization points: when we start the launch and when it completes. These two points let us delimit each launch. Even if the launch requires only a single RPC, how do we know whether that RPC actually went through? We know the launch started, but if the master dies we will not learn when, or whether, it finished.
To solve this problem, every operation we perform on external systems must either be idempotent (so we can safely repeat it) or have observable state, so that we can clearly tell when it has completed.
These constraints are significant and hard to satisfy, but in a distributed environment they are fundamental to running cron accurately and handling the failures that will inevitably occur. Failing to handle them properly leads to missed launches or duplicate launches of the same cron job.
Most infrastructure that launches logical jobs in a datacenter (such as Mesos) gives those jobs names, which makes it possible to look up a job's state, terminate it, or perform other maintenance. A reasonable solution to the idempotency problem is to include the scheduled execution time in the job name, which avoids mutating operations against the datacenter scheduler, and to distribute that name to all Cron service nodes. If the Cron master dies, the new master only needs to compute the expected name for each launch, look up its state, and launch whatever is missing.
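A minimal sketch of such a naming scheme, assuming a made-up name format rather than the one Google actually uses:

```go
package main

import (
	"fmt"
	"time"
)

// launchName builds a deterministic datacenter job name from the cron job's
// name and its scheduled start time. Because the name is a pure function of
// (job, scheduled time), resending the launch request is harmless, and a
// newly elected master can recompute the name and look up its state to see
// whether the launch already happened.
func launchName(job string, scheduled time.Time) string {
	return fmt.Sprintf("%s.%d", job, scheduled.Unix())
}

func main() {
	scheduled := time.Date(2016, 5, 1, 0, 0, 0, 0, time.UTC)
	fmt.Println(launchName("monthly-payroll", scheduled))
	// Recomputing the name for the same scheduled launch gives the same value.
	fmt.Println(launchName("monthly-payroll", scheduled))
}
```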
Note that besides keeping internal state consistent between the nodes, we also track scheduled launch times. We need to eliminate any inconsistency that could arise when interacting with the datacenter, so we use the scheduled launch time as the reference. For example, consider a short but frequently run cron job that has already executed, but before the master could announce this to the other nodes, the master died, and the failure lasted unusually long, past the point at which the job finished successfully. The new master looks up the job's state, finds no record that the launch happened, and tries to launch it again. Because the scheduled launch time is part of the launch's identity in the datacenter, the new master can see that this launch has already run and does not repeat it.
In practice, monitoring this external state is more complicated, and the implementation details depend on the underlying infrastructure in use, which is beyond the scope of this description. Depending on the infrastructure you have available, you may need to trade off the risk of running a job twice against the risk of skipping it.
Saving the State
Using Paxos for agreement is only part of the problem of handling state. Paxos is essentially a continuous log of state changes, appended to synchronously as the state changes. This has two implications: first, the log needs to be compacted to keep it from growing indefinitely; second, the log itself must be stored somewhere.
To keep the log from growing without bound, we take a snapshot of the current state, so that we can rebuild the state quickly without replaying every earlier log entry. For example, if the log records a "counter plus 1" change and this happens 1,000 times, we have 1,000 log entries, but we could just as well record a single entry, "set counter to 1000," instead.
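The toy Go example below shows the idea: a snapshot stands in for a long prefix of the log, and recovery only replays the entries written after it.

```go
package main

import "fmt"

// op is one replicated log entry; in this toy example every entry increments
// a counter by some amount.
type op struct{ delta int }

// snapshot captures the state at a point in the log, so recovery only has to
// replay the entries that came after it.
type snapshot struct {
	counter    int
	appliedOps int // number of log entries the snapshot already covers
}

// rebuild restores state from a snapshot plus the tail of the log.
func rebuild(s snapshot, tail []op) int {
	counter := s.counter
	for _, o := range tail {
		counter += o.delta
	}
	return counter
}

func main() {
	// 1000 "counter plus 1" entries...
	log := make([]op, 1000)
	for i := range log {
		log[i] = op{delta: 1}
	}

	// ...compact into a single snapshot equivalent to "set counter to 1000".
	snap := snapshot{counter: 1000, appliedOps: len(log)}

	// Entries written after the snapshot are replayed on top of it.
	tail := []op{{delta: 1}, {delta: 1}}
	fmt.Println("state after recovery:", rebuild(snap, tail)) // prints 1002
}
```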
If the log is lost, we lose only the state changes since the last snapshot. Snapshots are actually the more critical state: if we lose a snapshot, we essentially have to start over, because we have lost all internal state between the previous snapshot and the lost one. Losing the log, on the other hand, just pulls the Cron service back to the point of the last recorded snapshot.
We had two main choices for storing the data: in an externally available distributed storage service, or inside the Cron service itself. We considered both when designing the system.
We store the Paxos log on the local disk of each server hosting a Cron node. The default of three nodes means we have three copies of the log. We store the snapshot on the servers as well, but because it is so important, we also back it up to a distributed storage service, so that even in the unlikely event that all three node machines fail, the service can still recover.
We do not store the log itself in distributed storage, because losing the log means losing only the most recent state changes, which we can accept. Keeping it in distributed storage would also carry a performance penalty: the log is written as a continuous stream of small appends, a pattern that distributed storage handles poorly. Meanwhile, the probability of all three servers failing at once is very small, and if it does happen, we recover automatically from the snapshot and lose only the changes between the last snapshot and the failure. Of course, as with the rest of the Cron service, this is a trade-off you should weigh against your own infrastructure.
With logs and snapshots stored locally and snapshots additionally backed up in distributed storage, even a freshly started node can fetch everything it needs over the network from nodes that are already running. This means that bringing up a node does not depend on any particular server, and reassigning a new server (for example, after a restart) to take on the role of a node is essentially a non-issue for the reliability of the service.
Running a Large Cron
There are other smaller, but equally interesting, considerations when running a large Cron deployment. A traditional cron is small: it contains at most a few dozen cron jobs. But if you run a Cron service over thousands of machines in a datacenter, you will encounter all sorts of problems.
A big one is a classic problem of distributed systems: the thundering herd, which in a Cron service produces large spikes in load. When asked to configure a cron job that runs every day, most people think of midnight and configure it there. That is fine if the job runs on one machine, but not if the job kicks off a MapReduce involving thousands of workers, or if 30 different teams in the datacenter all want to configure a daily job like that. To address this, we extended the crontab format.
In traditional crontab, the user specifies when a cron job should run by giving values for the minute, hour, day of month, month, and day of week, or an asterisk (*) to mean every value. For example, a job that runs every day at midnight has the crontab schedule 0 0 * * *, meaning it runs daily at 0:00. On top of this, we introduced the question mark (?), which indicates that any value on that time axis is acceptable; the Cron service is free to pick a value at random within the corresponding time period, spreading the jobs out more evenly. For example, 0 ? * * * means the job runs once a day, at minute 0 of an hour chosen by the Cron service between 0 and 23.
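The sketch below shows one way a scheduler could expand the question mark. The article describes the value as chosen randomly within the allowed range; this example instead hashes the job name into the range so the choice stays stable across restarts, which is one plausible implementation of the same load-spreading idea.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// resolveQuestionMark expands a "?" field by hashing the job name into the
// field's range, so each job gets a stable slot and the population of jobs
// spreads out evenly. (An illustration, not Google's actual algorithm.)
func resolveQuestionMark(jobName string, rangeSize uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(jobName))
	return h.Sum32() % rangeSize
}

func main() {
	// Schedule "0 ? * * *": minute 0, hour chosen by the Cron service.
	for _, job := range []string{"team-a-daily-report", "team-b-cleanup", "team-c-billing"} {
		hour := resolveQuestionMark(job, 24)
		fmt.Printf("%-22s runs daily at %02d:00\n", job, hour)
	}
}
```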
Even with this change, the load caused by cron jobs still shows sharp spikes, as a graph of cron job launches at Google makes clear. The spikes often correspond to jobs that genuinely need to run at a fixed time at a fixed frequency.
Summary
Cron has been a basic UNIX service for decades. The whole industry is evolving toward large distributed systems in which the smallest unit of hardware is the datacenter, so much of the technology stack has to change, and Cron is no exception. A careful look at the service properties a Cron service requires, and at the requirements of cron jobs themselves, drives us toward a new design.
Based on Google's solution, we have discussed the constraints on a Cron service in a distributed system and a possible design. The solution requires strong consistency guarantees in a distributed environment, and its implementation rests on a proven algorithm such as Paxos to reach consensus in an unreliable environment. Using Paxos, analyzing the failure modes of cron jobs in large-scale environments correctly, and making good use of the distributed environment together produced a robust cron service used inside Google.
This article was reproduced from: http://www.linuxprobe.com/stable-cron-service.html