Cloud Design Patterns (20): Scheduler Agent Supervisor Pattern
Coordinate a set of actions across a distributed set of services and other remote resources, attempt to transparently handle faults if any of these actions fail, or undo the effects of the work performed if the system cannot recover from a fault. This pattern can add resiliency to a distributed system by enabling it to recover and retry actions that fail due to transient exceptions, long-lasting faults, and process failures.
Background and issues
An application executes a task that includes several steps, some of which can invoke a remote service or access a remote resource. Individual steps can be independent of each other, but they are orchestrated by the application logic that implements the task.
Whenever possible, the application should ensure that the task runs to completion and resolve any failures that occur when it accesses remote services or resources. These failures can occur for a variety of reasons. For example, the network might be down, communications might be interrupted, a remote service might stop responding or be in an unstable state, or a remote resource might be temporarily inaccessible due to resource constraints. In many cases, these failures are transient and can be handled by using the Retry pattern.
If the application detects a more permanent failure from which it can be difficult to recover, it must be able to restore the system to a consistent state and ensure the integrity of the entire end-to-end operation.
Solution
The Scheduler Agent Supervisor pattern defines the following actors. These actors orchestrate the steps (individual items of work) to be performed as part of the task (the overall process):
• The Scheduler arranges for the individual steps that make up the overall task to be performed and orchestrates their operation. These steps can be combined into a pipeline or workflow, and the Scheduler is responsible for ensuring that the steps in this workflow are executed in the proper order. As each step is performed, the Scheduler records its state (such as "step not yet started", "step running", or "step completed").
Note:
The Scheduler performs a similar function to the Process Manager in the Process Manager pattern. The actual workflow is typically defined and implemented by a workflow engine that is controlled by the Scheduler. This approach decouples the business logic in the workflow from the Scheduler.
• The Agent contains logic that encapsulates a call to a remote service, or access to a remote resource, referenced by a step in a task. Each Agent typically wraps calls to a single service or resource, implementing the appropriate error handling and retry logic (subject to a timeout constraint, described later). If the steps in the workflow run by the Scheduler use several services and resources across different steps, each step might reference a different Agent (this is an implementation detail of the pattern).
• The Supervisor monitors the status of the steps in the task being performed by the Scheduler. It runs periodically (the frequency is system-specific) and examines the status of the steps as maintained by the Scheduler. If it detects any that have timed out or failed, it arranges for the appropriate Agent to recover the step or perform the appropriate remedial action (this might involve modifying the status of a step). Note that the recovery or remedial actions are typically implemented by the Scheduler and Agents. The Supervisor should simply request that these actions be performed.
The Scheduler, Agent, and Supervisor are logical components, and their physical implementation depends on the technology being used. For example, several logical Agents could be implemented as part of a single web service.
The Scheduler maintains information about the progress of the task and the state of each step in a durable data store, referred to as the state store. The Supervisor can use this information to help determine whether a step has failed. Figure 1 illustrates the relationship between the Scheduler, the Agents, the Supervisor, and the state store.
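As a rough illustration of what the state store might hold, the sketch below models one record per step. The names StepState, StepRecord, and InMemoryStateStore are hypothetical, and the dictionary is only a stand-in for a durable store; a real implementation would persist these records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Dict, List, Optional, Tuple


class StepState(Enum):
    NOT_STARTED = "step not yet started"
    RUNNING = "step running"
    COMPLETED = "step completed"


@dataclass
class StepRecord:
    task_id: str
    step_name: str
    state: StepState = StepState.NOT_STARTED
    complete_by: Optional[datetime] = None  # upper limit for the step to finish


class InMemoryStateStore:
    """Stand-in for a durable state store (assumption: not persisted here)."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], StepRecord] = {}

    def save(self, record: StepRecord) -> None:
        self._records[(record.task_id, record.step_name)] = record

    def get(self, task_id: str, step_name: str) -> StepRecord:
        return self._records[(task_id, step_name)]

    def overdue(self, now: datetime) -> List[StepRecord]:
        """Steps still marked as running whose complete-by time has passed."""
        return [
            r for r in self._records.values()
            if r.state is StepState.RUNNING
            and r.complete_by is not None
            and r.complete_by < now
        ]


store = InMemoryStateStore()
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
store.save(StepRecord("task-1", "charge-card", StepState.RUNNING,
                      now - timedelta(minutes=5)))
store.save(StepRecord("task-1", "ship-order"))
```

The overdue query is the kind of lookup a Supervisor could use to find steps that may have failed.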
Figure 1 - The actors in the Scheduler Agent Supervisor pattern
Note:
This figure shows a simplified version of the pattern. In a real implementation, there might be many instances of the Scheduler running concurrently, each handling a subset of tasks. Similarly, the system could run multiple instances of each Agent, or even multiple Supervisors. In this case, Supervisors must coordinate their work with each other carefully to ensure that they do not compete to recover the same failed steps and tasks. The Leader Election pattern provides one possible solution to this problem.
When an application wants to run a task, it submits a request to the Scheduler. The Scheduler records initial state information about the task and its steps (for example, "step not yet started") in the state store and then begins performing the operations defined by the workflow. As the Scheduler starts each step, it updates the information about the state of that step in the state store (for example, "step running").
If a step references a remote service or resource, the Scheduler sends a message to the appropriate Agent. The message can contain the information that the Agent needs to pass to the service or to access the resource, in addition to the Complete By time for the operation. If the Agent completes its operation successfully, it returns a response to the Scheduler. The Scheduler can then update the state information in the state store (for example, "step completed") and perform the next step. This process continues until the entire task is complete.
An Agent can implement any retry logic that is necessary to perform its work. However, if the Agent does not complete its work before the Complete By time expires, the Scheduler will assume that the operation has failed. In this case, the Agent should stop its work and not try to return anything to the Scheduler (not even an error message), or attempt any form of recovery. The reason for this restriction is that, after a step has timed out or failed, another instance of the Agent might be scheduled to run the failed step (this process is described later).
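The Agent behavior described above can be sketched as a retry loop bounded by the Complete By time. This is a minimal sketch, not a definitive implementation: run_agent, call_remote, and the injectable clock are hypothetical names, and treating ConnectionError as the transient failure is an assumption made for illustration.

```python
import time
from typing import Callable, Optional


def run_agent(
    call_remote: Callable[[], str],
    complete_by: float,
    retry_delay: float = 0.0,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Retry call_remote until it succeeds or the complete-by deadline passes.

    Returns the remote result on success. Returns None once the deadline has
    passed, so that nothing (not even an error) goes back to the Scheduler.
    """
    while clock() < complete_by:
        try:
            result = call_remote()
        except ConnectionError:
            # Assumed to be transient: wait briefly and try again.
            time.sleep(retry_delay)
            continue
        # A result that arrives after the deadline is discarded.
        return result if clock() < complete_by else None
    return None
```

Passing the clock as a parameter keeps the deadline logic testable without real waiting; in production the default monotonic clock would be used.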
If the Agent itself fails, the Scheduler will not receive a response. The pattern may not distinguish between a step that has timed out and one that has genuinely failed.
If a step times out or fails, the state store will contain a record that indicates that the step is running ("step running"), but the Complete By time will have passed. The Supervisor looks for steps such as this and attempts to recover them. One possible strategy is for the Supervisor to update the Complete By value to extend the time available to complete the step, and then send a message to the Scheduler identifying the step that has timed out. The Scheduler can then try to repeat this step. However, such a design requires the tasks to be idempotent.
It may be necessary for the Supervisor to prevent the same step from being retried if it continually fails or times out. To achieve this, the Supervisor could maintain a retry count for each step, along with the state information, in the state store. If this count exceeds a predefined threshold, the Supervisor can adopt a strategy such as waiting for an extended period before notifying the Scheduler that it should retry the step, in the expectation that the fault will be resolved during this period. Alternatively, the Supervisor can send a message to the Scheduler to request that the entire task be undone by implementing a Compensating Transaction (this approach depends on the Scheduler and Agents providing the information necessary to implement the compensating operations for each step that completed successfully).
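One way to express such a retry policy is as a small decision function. The sketch below is an assumption-laden illustration: it combines the two strategies by using two hypothetical thresholds (retry_threshold and give_up_threshold), and all names and default durations are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class RecoveryDecision:
    action: str                             # "retry" or "compensate"
    retry_at: Optional[datetime] = None     # when to notify the Scheduler
    complete_by: Optional[datetime] = None  # new deadline for the retried step


def decide_recovery(
    retry_count: int,
    now: datetime,
    retry_threshold: int = 3,
    give_up_threshold: int = 6,
    step_timeout: timedelta = timedelta(minutes=5),
    extended_wait: timedelta = timedelta(minutes=30),
) -> RecoveryDecision:
    """Choose what the Supervisor should request for a timed-out step."""
    if retry_count > give_up_threshold:
        # Assumption: past this point the whole task is undone.
        return RecoveryDecision("compensate")
    # Past the first threshold, wait longer before asking for a retry.
    wait = extended_wait if retry_count > retry_threshold else timedelta(0)
    retry_at = now + wait
    return RecoveryDecision("retry", retry_at, retry_at + step_timeout)


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
early = decide_recovery(1, now)
late = decide_recovery(4, now)
final = decide_recovery(7, now)
```

The function only decides; carrying out the retry or the compensation remains the job of the Scheduler and Agents.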
Note:
It is not the purpose of the Supervisor to monitor the Scheduler and Agents and restart them if they fail. This aspect of the system should be handled by the infrastructure in which these components run. Similarly, the Supervisor should not have knowledge of the actual business operations that the tasks performed by the Scheduler are running, including how to compensate if these tasks fail. This is the purpose of the workflow logic implemented by the Scheduler. The sole responsibility of the Supervisor is to determine whether a step has failed and to arrange either for it to be repeated or for the entire task containing the failed step to be undone.
If the Scheduler is restarted after a failure, or the workflow being performed by the Scheduler terminates unexpectedly, the Scheduler should be able to determine the status of any in-flight task that it was handling when it failed, and be prepared to resume this task from the point at which it failed. The implementation details of this process are likely to be system-specific. If the task cannot be recovered, it may be necessary to undo the work already performed by the task. This may also require implementing a Compensating Transaction.
The key advantage of this pattern is that the system is resilient in the event of unexpected temporary or unrecoverable failures. The system can be constructed to be self-healing. For example, if an Agent or the Scheduler crashes, a new one can be started and the Supervisor can arrange for the task to be resumed. If the Supervisor fails, another instance can be started and can take over from where the failure occurred. If the Supervisor is scheduled to run periodically, a new instance can be started automatically after a predefined interval. The state store can be replicated for greater resiliency.
Issues and considerations
When deciding how to implement this pattern, you should consider the following points:
• This pattern can be nontrivial to implement and requires thorough testing of each possible failure mode of the system.
• The recovery/retry logic implemented by the Scheduler can be complex and depends on the state information held in the state store. It may also be necessary to record the information required to implement a Compensating Transaction in a durable data store.
• The frequency with which the Supervisor runs is important. It should run often enough to prevent any failed steps from blocking an application for an extended period, but not so often that it becomes an overhead.
• The steps performed by the agent can be executed more than once. The logic to implement these steps should be idempotent.
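To make the idempotency point concrete, here is a minimal sketch of a step that can safely run more than once. It assumes each step invocation carries a unique key; IdempotentStep and charge_card are hypothetical names, and the in-memory dictionary stands in for storage that would need to be durable in a real system.

```python
from typing import Callable, Dict, List


class IdempotentStep:
    """Runs an operation at most once per key and replays the stored result."""

    def __init__(self, operation: Callable[[str], str]) -> None:
        self._operation = operation
        self._results: Dict[str, str] = {}  # assumption: would be persisted

    def run(self, idempotency_key: str, payload: str) -> str:
        if idempotency_key in self._results:
            # A repeated request returns the earlier result without side effects.
            return self._results[idempotency_key]
        result = self._operation(payload)
        self._results[idempotency_key] = result
        return result


charges: List[str] = []


def charge_card(order_id: str) -> str:
    charges.append(order_id)  # the side effect that must not be duplicated
    return f"charged:{order_id}"


step = IdempotentStep(charge_card)
first = step.run("order-42/charge", "order-42")
second = step.run("order-42/charge", "order-42")  # retried by the Scheduler
```

With this shape, a retry after a timeout does not charge the customer twice even if the first attempt actually succeeded.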
When to use this pattern
Use this pattern when a process that runs in a distributed environment, such as the cloud, must be resilient to communications failure and/or operational failure.
This pattern might not be suitable for tasks that do not invoke remote services or access remote resources.
Example
A web application that implements an e-commerce system has been deployed on Microsoft Azure. Users can run this application to browse the products available from an organization and place orders for these products. The user interface runs as a web role, and the order processing elements of the application are implemented as a set of worker roles. Part of the order processing logic involves accessing a remote service, which may be prone to transient or more long-lasting faults. For this reason, the designers used the Scheduler Agent Supervisor pattern to implement the order processing elements of the system.
When a customer places an order, the application constructs a message that describes the order and posts this message to a queue. A separate submission process, running in a worker role, retrieves the message, inserts the details of the order into the Orders database, and creates a record for the order process in the state store. Note that the inserts into the Orders database and the state store are performed as part of the same operation. The submission process is designed to ensure that both inserts complete together.
The state information that the submission process creates for the order includes:
• OrderID: The ID of the order in the Orders database.
• LockedBy: The instance ID of the worker role handling the order. There may be multiple concurrent instances of the worker role running the Scheduler, but each order should be handled by only one instance.
• CompleteBy: The time by which the order should be processed.
• ProcessState: The current state of the task handling the order. The possible states are:
  ◦ Pending. The order has been created, but processing has not started.
  ◦ Processing. The order is currently being processed.
  ◦ Processed. The order has been processed successfully.
  ◦ Error. The order processing has failed.
• FailureCount: The number of times that processing has been attempted for the order.
In this state information, the OrderID field is copied from the order ID of the new order. The LockedBy and CompleteBy fields are set to null, the ProcessState field is set to Pending, and the FailureCount field is set to 0.
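The submission step can be sketched as a single transaction that writes both records. This sketch assumes, purely for illustration, that the Orders database and the state store are two tables in one SQLite database; the table layout and the create_schema and submit_order helpers are hypothetical, and the Azure solution described here would use different storage.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE Orders (OrderID TEXT PRIMARY KEY, Details TEXT NOT NULL);
        CREATE TABLE OrderState (
            OrderID      TEXT PRIMARY KEY,
            LockedBy     TEXT,
            CompleteBy   TEXT,
            ProcessState TEXT NOT NULL,
            FailureCount INTEGER NOT NULL
        );
    """)


def submit_order(conn: sqlite3.Connection, order_id: str, details: str) -> None:
    # "with conn" commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute(
            "INSERT INTO Orders (OrderID, Details) VALUES (?, ?)",
            (order_id, details),
        )
        conn.execute(
            "INSERT INTO OrderState "
            "(OrderID, LockedBy, CompleteBy, ProcessState, FailureCount) "
            "VALUES (?, NULL, NULL, 'Pending', 0)",
            (order_id,),
        )


conn = sqlite3.connect(":memory:")
create_schema(conn)
submit_order(conn, "o1", "2 x widget")
```

If the two stores were genuinely separate systems, a single local transaction would not be available and some other mechanism would be needed to keep the inserts together.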
Note:
In this example, the order handling logic is relatively simple and comprises only a single step that invokes a remote service. In a more complex multistep scenario, the submission process would likely involve several steps, so several records would be created in the state store, each one describing the state of an individual step.
The Scheduler also runs as part of a worker role and implements the business logic that handles the order. An instance of the Scheduler polling for new orders examines the state store for records where the LockedBy field is null and the ProcessState field is Pending. When the Scheduler finds a new order, it immediately populates the LockedBy field with its own instance ID, sets the CompleteBy field to an appropriate time, and sets the ProcessState field to Processing. The code that does this is designed to be exclusive and atomic to ensure that two concurrent instances of the Scheduler cannot try to handle the same order simultaneously.
The Scheduler then runs a business workflow to process the order asynchronously, passing it the value in the OrderID field from the state store. The workflow handling the order retrieves the details of the order from the Orders database and performs its work. When a step in the order processing workflow needs to invoke the remote service, it uses an Agent. The workflow step communicates with the Agent by using Azure Service Bus message queues as a request/response channel. Figure 2 shows a high-level view of the solution.
Figure 2 - Using the Scheduler Agent Supervisor pattern to handle orders in an Azure solution
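The exclusive, atomic claim that a Scheduler instance makes on a pending order can be approximated with a conditional update. The sketch below uses SQLite as a hypothetical stand-in for the state store; claim_next_order and the table layout are assumptions for illustration, not the storage used in the Azure solution.

```python
import sqlite3
from typing import Optional


def claim_next_order(
    conn: sqlite3.Connection, instance_id: str, complete_by: str
) -> Optional[str]:
    """Try to lock one pending order for this Scheduler instance."""
    with conn:
        row = conn.execute(
            "SELECT OrderID FROM OrderState "
            "WHERE LockedBy IS NULL AND ProcessState = 'Pending' LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The WHERE clause repeats the conditions, so the update succeeds only
        # if no other instance claimed the order in the meantime.
        claimed = conn.execute(
            "UPDATE OrderState "
            "SET LockedBy = ?, CompleteBy = ?, ProcessState = 'Processing' "
            "WHERE OrderID = ? AND LockedBy IS NULL AND ProcessState = 'Pending'",
            (instance_id, complete_by, row[0]),
        )
        return row[0] if claimed.rowcount == 1 else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OrderState (OrderID TEXT PRIMARY KEY, LockedBy TEXT, "
    "CompleteBy TEXT, ProcessState TEXT NOT NULL, FailureCount INTEGER NOT NULL)"
)
conn.execute("INSERT INTO OrderState VALUES ('o1', NULL, NULL, 'Pending', 0)")
conn.commit()

first = claim_next_order(conn, "scheduler-A", "2024-01-01T12:05:00Z")
second = claim_next_order(conn, "scheduler-B", "2024-01-01T12:05:00Z")
```

The key design choice is that the update itself checks the lock, rather than relying on the earlier read, so a losing instance simply sees zero affected rows.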
The message sent to the Agent from a workflow step describes the order and includes the CompleteBy time. If the Agent receives a response from the remote service before the CompleteBy time expires, it posts a reply message on the Service Bus queue on which the workflow is listening. When the workflow step receives a valid reply message, it completes its processing and the Scheduler sets the ProcessState field of the order state to Processed. At this point, the order processing has completed successfully.
If the CompleteBy time expires before the Agent receives a response from the remote service, the Agent simply halts its processing and stops handling the order. Similarly, if the workflow handling the order exceeds the CompleteBy time, it also terminates. In both cases, the state of the order in the state store remains set to Processing, but the CompleteBy time indicates that the time for processing the order has passed and the process is deemed to have failed. Note that if the Agent that is accessing the remote service, or the workflow that is handling the order (or both), terminates unexpectedly, the information in the state store will again remain set to Processing and will eventually have an expired CompleteBy value.
If the Agent detects an unrecoverable, nontransient fault while it is attempting to contact the remote service, it can send an error response back to the workflow. The Scheduler can set the status of the order to Error and raise an event that alerts an operator. The operator can then attempt to resolve the cause of the failure manually and resubmit the failed processing step.
The Supervisor periodically examines the state store, looking for orders with an expired CompleteBy value. If the Supervisor finds such a record, it increments the FailureCount field. If the FailureCount value is below a specified threshold, the Supervisor resets the LockedBy field to null, updates the CompleteBy field with a new expiration time, and sets the ProcessState field to Pending. An instance of the Scheduler can pick up this order and perform its processing as before. If the FailureCount value exceeds the specified threshold, the cause of the failure is assumed to be nontransient. The Supervisor sets the status of the order to Error and raises an event that alerts an operator, as described earlier.
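A single pass of the Supervisor over the order state records might look like the following sketch. It is storage-agnostic and works on plain dictionaries; the supervise function, its parameters, and the alert callback are hypothetical, and treating a FailureCount equal to the threshold as an error is an assumption, because the text only covers the below and above cases.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List


def supervise(
    records: List[Dict],
    now: datetime,
    max_failures: int = 3,
    timeout: timedelta = timedelta(minutes=5),
    alert: Callable[[str], None] = print,
) -> None:
    """One Supervisor pass over the order state records."""
    for rec in records:
        expired = (
            rec["ProcessState"] == "Processing"
            and rec["CompleteBy"] is not None
            and rec["CompleteBy"] < now
        )
        if not expired:
            continue
        rec["FailureCount"] += 1
        if rec["FailureCount"] < max_failures:
            # Release the order so that a Scheduler instance can pick it up.
            rec["LockedBy"] = None
            rec["CompleteBy"] = now + timeout
            rec["ProcessState"] = "Pending"
        else:
            # Assumed nontransient: mark as failed and alert an operator.
            rec["ProcessState"] = "Error"
            alert(rec["OrderID"])


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
past, future = now - timedelta(minutes=1), now + timedelta(minutes=1)
records = [
    {"OrderID": "o1", "LockedBy": "A", "CompleteBy": past,
     "ProcessState": "Processing", "FailureCount": 0},
    {"OrderID": "o2", "LockedBy": "B", "CompleteBy": past,
     "ProcessState": "Processing", "FailureCount": 2},
    {"OrderID": "o3", "LockedBy": "A", "CompleteBy": future,
     "ProcessState": "Processing", "FailureCount": 0},
]
alerts: List[str] = []
supervise(records, now, alert=alerts.append)
```

In this run, o1 is released for another attempt, o2 reaches the threshold and is marked as an error, and o3 is left alone because its deadline has not passed.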
Note:
In this example, the Supervisor is implemented as a separate task. You can use a variety of strategies to arrange for the Supervisor task to run, including using the Azure Scheduler service (not to be confused with the Scheduler component in this pattern). For more information about the Azure Scheduler service, visit the Scheduler page.
Although it is not shown in this example, the Scheduler may need to keep the application that originally submitted the order informed about the progress and status of the order. The application and the Scheduler are isolated from each other to eliminate any dependencies between them. The application has no knowledge of which instance of the Scheduler is handling the order, and the Scheduler is unaware of which specific application instance posted the order.
To enable the order status to be reported, the application could use its own private response queue. The details of this response queue would be included as part of the request sent to the submission process, which would include this information in the state store. The Scheduler would then post messages to this queue indicating the status of the order ("request received", "order completed", "order failed", and so on). It should include the order ID in these messages so that they can be correlated with the original request by the application.
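The status reporting idea can be sketched with an in-process queue standing in for the application's private response queue. The names report_status, drain_statuses, and the ResponseQueue field are hypothetical; a real state store would hold the name or address of the queue rather than the queue object itself.

```python
import queue
from typing import Dict, List


def report_status(state_record: Dict, status: str) -> None:
    """Scheduler side: post a status message to the application's queue."""
    state_record["ResponseQueue"].put(
        {"OrderID": state_record["OrderID"], "Status": status}
    )


def drain_statuses(response_queue: queue.Queue,
                   pending_orders: Dict[str, List[str]]) -> None:
    """Application side: match each message to its request by order ID."""
    while True:
        try:
            message = response_queue.get_nowait()
        except queue.Empty:
            return
        if message["OrderID"] in pending_orders:
            pending_orders[message["OrderID"]].append(message["Status"])


app_queue: queue.Queue = queue.Queue()
record = {"OrderID": "o1", "ResponseQueue": app_queue}
pending: Dict[str, List[str]] = {"o1": []}

report_status(record, "request received")
report_status(record, "order completed")
drain_statuses(app_queue, pending)
```

Because every message carries the order ID, the application needs no knowledge of which Scheduler instance sent it.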
This article is translated from MSDN: http://msdn.microsoft.com/en-us/library/dn589780.aspx