Source Code Analysis: The ElasticJob Failover Mechanism

Tags: event listener, failover

This section explores the ElasticJob failover mechanism. ElasticJob is a distributed task scheduling framework based on Quartz, where "distributed" refers chiefly to the distribution of data: a single task executes across multiple nodes, and each node processes part of the data (its sharding items). If a task node goes down when a schedule fires, the shards assigned to that node go unprocessed for that cycle. To avoid losing that portion of the data, you can enable failover, which transfers the crashed node's shards to the surviving nodes so that the scheduling cycle processes the same amount of data as it would have with all nodes alive. The ElasticJob failover class diagram is shown in the figure:
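As background for the data-sharding idea, here is a minimal, self-contained sketch (not ElasticJob code; all names are hypothetical) of how each node can process only the records belonging to its sharding item, e.g. by taking the record id modulo the total shard count:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of ElasticJob's data-sharding idea: each node receives a
// sharding item via the sharding context and processes only the records whose
// key maps to that item.
public class ShardingSketch {

    // Select the subset of record ids that belongs to the given sharding item.
    static List<Integer> itemsFor(List<Integer> recordIds, int shardingItem, int shardingTotalCount) {
        List<Integer> result = new ArrayList<>();
        for (int id : recordIds) {
            if (id % shardingTotalCount == shardingItem) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(0, 1, 2, 3, 4, 5);
        // With 3 shards, item 1 handles ids 1 and 4. If the node owning item 2
        // crashes, failover lets a surviving node pick up ids 2 and 5 as well.
        System.out.println(itemsFor(ids, 1, 3));
        System.out.println(itemsFor(ids, 2, 3));
    }
}
```

If the node owning one item crashes without failover, that item's subset of the data is simply skipped for the cycle; failover exists precisely to hand that subset to a survivor.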

1) FailoverListenerManager: the failover listener manager.
2) FailoverListenerManager$JobCrashedJobListener: the job instance crash (outage) event listener.
3) FailoverListenerManager$FailoverSettingsChangedJobListener: the failover configuration change event listener.

1. The failover event listeners in detail
1.1 JobCrashedJobListener: the job instance crash event listener

class JobCrashedJobListener extends AbstractJobListener {

    @Override
    protected void dataChanged(final String path, final Type eventType, final String data) {
        if (isFailoverEnabled() && Type.NODE_REMOVED == eventType && instanceNode.isInstancePath(path)) {    // @1
            String jobInstanceId = path.substring(instanceNode.getInstanceFullPath().length() + 1);    // @2
            if (jobInstanceId.equals(JobRegistry.getInstance().getJobInstance(jobName).getJobInstanceId())) {
                return;    // @3
            }
            List<Integer> failoverItems = failoverService.getFailoverItems(jobInstanceId);    // @4
            if (!failoverItems.isEmpty()) {    // @5
                for (int each : failoverItems) {
                    failoverService.setCrashedFailoverFlag(each);
                    failoverService.failoverIfNecessary();    // @7
                }
            } else {
                for (int each : shardingService.getShardingItems(jobInstanceId)) {    // @6
                    failoverService.setCrashedFailoverFlag(each);
                    failoverService.failoverIfNecessary();    // @7
                }
            }
        }
    }
}

Code @1: If the failover mechanism is enabled in the configuration, then when a delete event is observed on a child of the ${namespace}/jobName/instances node, that instance is considered to have crashed, and the failover logic is executed.
Code @2: Extracts the crashed task instance ID (jobInstanceId) from the removed path.
Code @3: Ignores the event if the removed instance ID equals the current instance's ID.
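The jobInstanceId extraction at code @2 is plain path arithmetic over the node layout ${namespace}/jobName/instances/{jobInstanceId}. A minimal sketch, assuming a hypothetical namespace and job name (this is not ElasticJob code):

```java
// Hypothetical sketch of how the listener derives the crashed jobInstanceId
// from the removed ZooKeeper path; the layout follows the article:
// ${namespace}/jobName/instances/{jobInstanceId}.
public class InstancePathSketch {

    static final String INSTANCES_FULL_PATH = "/my-namespace/myJob/instances";

    // Mirrors path.substring(instanceNode.getInstanceFullPath().length() + 1).
    static String jobInstanceId(String removedPath) {
        return removedPath.substring(INSTANCES_FULL_PATH.length() + 1);
    }

    public static void main(String[] args) {
        // The listener ignores the event when this id equals its own instance id.
        System.out.println(jobInstanceId("/my-namespace/myJob/instances/192.168.1.3@-@4721"));
    }
}
```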
Code @4: Gets the failover sharding item collection of the crashed job server, based on jobInstanceId. The implementation logic is in FailoverService#getFailoverItems:

/**
 * Get the failover sharding items of a job server.
 *
 * @param jobInstanceId primary key of the job running instance
 * @return collection of sharding items to fail over
 */
public List<Integer> getFailoverItems(final String jobInstanceId) {
    List<String> items = jobNodeStorage.getJobNodeChildrenKeys(ShardingNode.ROOT);
    List<Integer> result = new ArrayList<>(items.size());
    for (String each : items) {
        int item = Integer.parseInt(each);
        String node = FailoverNode.getExecutionFailoverNode(item);
        if (jobNodeStorage.isJobNodeExisted(node) && jobInstanceId.equals(jobNodeStorage.getJobNodeDataDirectly(node))) {
            result.add(item);
        }
    }
    Collections.sort(result);
    return result;
}

The method first obtains the direct children of the ${namespace}/jobName/sharding directory (the current sharding information), then for each item determines whether the ${namespace}/jobName/sharding/{item}/failover node exists and, if it does, whether its content equals the given instance ID; matching items are collected and returned. Its main purpose is to obtain the shards that had already been failed over to the given job instance.
Code @5: Determines whether any shards had already been failed over to the crashed instance. In the initial state this set is empty, so code @6 executes instead and sets up the failover preconditions.
Code @6: Gets all sharding items assigned to the crashed job instance, traverses them, and flags each as requiring failover, i.e. creates ${namespace}/jobName/leader/failover/items/{item}.
Code @7: Calls FailoverService#failoverIfNecessary, which decides whether to actually perform the failover.
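The lookup performed by getFailoverItems can be simulated with an in-memory map standing in for the ZooKeeper nodes; all names below are hypothetical, and only the matching logic is reproduced:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the node layout that
// FailoverService#getFailoverItems reads: sharding/{item}/failover stores the
// id of the instance the item was failed over to.
public class FailoverItemsSketch {

    // nodes maps "sharding/{item}/failover" -> owning jobInstanceId.
    static List<Integer> getFailoverItems(Map<String, String> nodes, Set<Integer> shardingItems, String jobInstanceId) {
        List<Integer> result = new ArrayList<>();
        for (int item : shardingItems) {
            String node = "sharding/" + item + "/failover";
            if (jobInstanceId.equals(nodes.get(node))) {    // node exists and matches
                result.add(item);
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> nodes = new HashMap<>();
        // Items 0 and 2 had been failed over to instance "crashed@-@1"
        // before that instance itself went down; item 1 has no failover node.
        nodes.put("sharding/0/failover", "crashed@-@1");
        nodes.put("sharding/2/failover", "crashed@-@1");
        System.out.println(getFailoverItems(nodes, Set.of(0, 1, 2), "crashed@-@1"));
    }
}
```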

/**
 * Perform job failover if necessary.
 */
public void failoverIfNecessary() {
    if (needFailover()) {
        jobNodeStorage.executeInLeader(FailoverNode.LATCH, new FailoverLeaderExecutionCallback());
    }
}

private boolean needFailover() {
    return jobNodeStorage.isJobNodeExisted(FailoverNode.ITEMS_ROOT)
            && !jobNodeStorage.getJobNodeChildrenKeys(FailoverNode.ITEMS_ROOT).isEmpty()
            && !JobRegistry.getInstance().isJobRunning(jobName);
}

The implementation idea: needFailover first checks whether the ${namespace}/jobName/leader/failover/items node exists and has children, and whether the job is currently not running on this instance; only when all three conditions hold is failover needed. Executing the failover then goes through a leader election: the distributed lock node is ${namespace}/jobName/leader/failover/latch, and whichever node acquires the lock first runs the concrete failover logic in FailoverLeaderExecutionCallback:

FailoverService$FailoverLeaderExecutionCallback:

class FailoverLeaderExecutionCallback implements LeaderExecutionCallback {

    @Override
    public void execute() {
        if (JobRegistry.getInstance().isShutdown(jobName) || !needFailover()) {    // @1
            return;
        }
        int crashedItem = Integer.parseInt(jobNodeStorage.getJobNodeChildrenKeys(FailoverNode.ITEMS_ROOT).get(0));    // @2
        log.debug("Failover job '{}' begin, crashed item '{}'", jobName, crashedItem);
        jobNodeStorage.fillEphemeralJobNode(FailoverNode.getExecutionFailoverNode(crashedItem),
                JobRegistry.getInstance().getJobInstance(jobName).getJobInstanceId());    // @3
        jobNodeStorage.removeJobNodeIfExisted(FailoverNode.getItemsNode(crashedItem));    // @4
        // TODO should not use triggerJob, but let the executor schedule it uniformly
        JobScheduleController jobScheduleController = JobRegistry.getInstance().getJobScheduleController(jobName);
        if (null != jobScheduleController) {    // @5
            jobScheduleController.triggerJob();
        }
    }
}

Code @1: Returns immediately if the current instance has shut down the job, or if failover is no longer needed.
Code @2: Gets the first shard to fail over: reads the first child of ${namespace}/jobName/leader/failover/items and parses it as the shard number crashedItem.
Code @3: Creates the ephemeral node ${namespace}/jobName/sharding/{item}/failover, whose content is the current instance ID.
Code @4: Deletes the ${namespace}/jobName/leader/failover/items/{item} node.
Code @5: Triggers a job run, which ends this node's failover handling and releases the lock; the next node then acquires the lock and takes over another failed shard still under the ${namespace}/jobName/leader/failover/items directory.
PS: The basic idea of failover is: when a task node goes down, the other nodes receive the instance-removed event, obtain the crashed instance ID from the instances directory, fetch from ZooKeeper the shards originally assigned to that instance, and mark those shards as requiring failover (by creating ${namespace}/jobName/leader/failover/items/{item}); each node then decides whether a failover operation is actually needed. The preconditions for performing failover are: 1) the job is not currently running on this instance; 2) the ${namespace}/jobName/leader/failover/items node exists and has children. When both hold, failover is performed: the surviving nodes compete for leadership through the distributed lock node ${namespace}/jobName/leader/failover/latch, and the node that acquires the lock claims a shard, as shown above; each surviving node competes for only one shard per failover round.
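The end-to-end flow described in the PS can be sketched with a sorted map standing in for ZooKeeper. This is a hypothetical, heavily simplified simulation (no real locking, watchers, or ephemeral nodes), reproducing only the two steps: flagging the crashed shards, then claiming one shard per lock acquisition:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical end-to-end sketch of the failover flow, with a sorted map
// standing in for ZooKeeper. Paths follow the article's node layout.
public class FailoverFlowSketch {

    final NavigableMap<String, String> zk = new TreeMap<>();

    // Step 1: surviving nodes flag each shard of the crashed instance
    // (mirrors setCrashedFailoverFlag creating leader/failover/items/{item}).
    void markCrashed(List<Integer> crashedItems) {
        for (int item : crashedItems) {
            zk.put("leader/failover/items/" + item, "");
        }
    }

    // Step 2: the node that wins the leader/failover/latch lock claims the
    // first pending item: it writes sharding/{item}/failover with its own id
    // and removes the items node, then releases the lock (implicit here).
    Integer claimOne(String myInstanceId) {
        for (String key : new ArrayList<>(zk.keySet())) {
            if (key.startsWith("leader/failover/items/")) {
                int item = Integer.parseInt(key.substring("leader/failover/items/".length()));
                zk.put("sharding/" + item + "/failover", myInstanceId);
                zk.remove(key);
                return item;
            }
        }
        return null;    // nothing left to fail over
    }

    public static void main(String[] args) {
        FailoverFlowSketch sketch = new FailoverFlowSketch();
        sketch.markCrashed(List.of(1, 3));
        // Two surviving nodes each take one item; only one item per lock round.
        System.out.println(sketch.claimOne("nodeA@-@1"));
        System.out.println(sketch.claimOne("nodeB@-@2"));
        System.out.println(sketch.zk);
    }
}
```

The one-shard-per-round behavior falls out of claiming only the first pending item before releasing the lock, which is how the surviving nodes end up sharing the crashed node's load.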

2. Analysis of re-executing the failed shards
The main function of the event listener above is: when a task node fails, the surviving nodes "divide up" the crashed node's shards and create the ${namespace}/jobName/sharding/{item}/failover nodes. At that point, however, the shard tasks have not actually been executed; this section combs through how the failed shards get executed. As we will see, failing over a shard means creating a failover node under the corresponding sharding item, and such shards take precedence when the sharding context is acquired, a point not covered in the analysis of the sharding process.
Therefore, before continuing, please read Source Code Analysis: the ElasticJob Sharding Mechanism.
Return to the entry point of scheduled task execution: AbstractElasticJobExecutor#execute.

/**
 * Execute the job.
 */
public final void execute() {
    try {
        jobFacade.checkJobExecutionEnvironment();
    } catch (final JobExecutionEnvironmentException cause) {
        jobExceptionHandler.handleException(jobName, cause);
    }
    ShardingContexts shardingContexts = jobFacade.getShardingContexts();    // get the sharding context
    // ...
}

LiteJobFacade#getShardingContexts:

@Override
public ShardingContexts getShardingContexts() {
    boolean isFailover = configService.load(true).isFailover();
    if (isFailover) {    // @1
        List<Integer> failoverShardingItems = failoverService.getLocalFailoverItems();    // @2
        if (!failoverShardingItems.isEmpty()) {
            return executionContextService.getJobShardingContext(failoverShardingItems);    // @3
        }
    }
    shardingService.shardingIfNecessary();
    List<Integer> shardingItems = shardingService.getLocalShardingItems();
    if (isFailover) {
        shardingItems.removeAll(failoverService.getLocalTakeOffItems());
    }
    shardingItems.removeAll(executionService.getDisabledItems(shardingItems));
    return executionContextService.getJobShardingContext(shardingItems);
}

Code @1: When acquiring the sharding context, if the failover mechanism is enabled, the failed-over shards are acquired preferentially.
Code @2: Gets the failover sharding items belonging to this node. The basic logic: traverse the child nodes under ${namespace}/jobName/sharding to obtain all current sharding items of the task; for each item, check whether ${namespace}/jobName/sharding/{item}/failover exists and whether its content equals the current instance ID; if so, add the item to the result.
Code @3: Builds a sharding context from the failed-over shard numbers and performs the task on those shards only ("AbstractElasticJobExecutor#execute(shardingContexts, JobExecutionEvent.ExecutionSource.NORMAL_TRIGGER);"). After this dispatch finishes, the failover flag of the shard is removed, and resharding takes place when the next schedule fires. The code that deletes the shard's failover flag is LiteJobFacade#registerJobCompleted:
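The precedence rule of codes @1 through @3 can be condensed into a small decision function; this is a sketch of the selection order, not the actual LiteJobFacade code, and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the selection order in LiteJobFacade#getShardingContexts:
// when failover is enabled and local failover items exist, they form the
// sharding context; otherwise the regularly assigned items run, minus any
// items another node has already taken off.
public class ShardingContextSketch {

    static List<Integer> pickItems(boolean failoverEnabled, List<Integer> localFailoverItems,
                                   List<Integer> localShardingItems, List<Integer> takenOffItems) {
        if (failoverEnabled && !localFailoverItems.isEmpty()) {
            return localFailoverItems;    // failed-over shards run first
        }
        List<Integer> items = new ArrayList<>(localShardingItems);
        if (failoverEnabled) {
            items.removeAll(takenOffItems);    // drop shards taken over elsewhere
        }
        return items;
    }

    public static void main(String[] args) {
        // A failover item is pending locally, so it wins over the regular assignment.
        System.out.println(pickItems(true, List.of(2), List.of(0, 1), List.of()));
        // No pending failover: run the regular items, minus those taken off.
        System.out.println(pickItems(true, List.of(), List.of(0, 1), List.of(1)));
    }
}
```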

public void registerJobCompleted(final ShardingContexts shardingContexts) {
    executionService.registerJobCompleted(shardingContexts);    // @1
    if (configService.load(true).isFailover()) {
        failoverService.updateFailoverComplete(shardingContexts.getShardingItemParameters().keySet());    // @2
    }
}

Code @1: Marks the shards of this dispatch as completed. It first sets the job to not-running in memory (JobRegistry.getInstance().setJobRunning(false)); if monitorExecution is enabled, it also removes each shard's running flag by deleting the ${namespace}/jobName/sharding/{item}/running node.
Code @2: If failover is enabled, calls the updateFailoverComplete method to mark the failover as processed by deleting the ${namespace}/jobName/sharding/{item}/failover node; at the next scheduled run, all shards are re-divided, and the failover is complete.
Summary:
Failover addresses the case where a sharding node goes down during one scheduling cycle, so the shards assigned to the crashed server are not executed and part of the data goes unprocessed in that cycle. To process that data in time, ElasticJob's failover transfers the crashed server's sharding context to the currently surviving nodes within the same dispatch; at the next schedule, tasks are reassigned and the shards are re-divided.
ElasticJob is a distributed task scheduling platform in which "distributed" refers chiefly to the distribution of data: one task runs across multiple shards, and each node processes the part of the data determined by its sharding context (data sharding).
