Solr 4.8.0 Source Code Analysis (20): SolrCloud's Recovery Strategy (Part 1)

Tags: solr

Preface:

When using SolrCloud, we often find that a replica shard enters the recovering state. This indicates that the data within the SolrCloud cluster has become inconsistent and needs to be recovered. While a shard is recovering, it does not write incoming updates to its index files; each update it accepts is written only to its own ulog (update log). There are three articles on recovery; this first one introduces the causes of recovery and the overall process.

1. Causes of recovery

Recovery generally occurs in the following three situations:

    • SolrCloud startup: an unexpected shutdown while the index was being built can leave some shards' data inconsistent with the leader, so at startup those shards synchronize data from the leader.
    • Leader election errors: these generally appear when the leader goes down and a replica has to be elected as the new leader; failures during that transition can require recovery.
    • Update forwarding failure: while an update is in progress, the leader's forward of the update to a replica fails for some reason, forcing that replica to perform recovery to resynchronize its data.

The first two situations are not covered here; this article introduces the third. The general principle is as follows:

As described previously in <Solr 4.8.0 Source Code Analysis (15): SolrCloud Indexing in Depth (2)>, no matter which shard an update request is first sent to, the order of distribution within SolrCloud is always from leader to replica. The leader receives the update request, writes the document into its own index files and the update into its ulog, and then forwards the update to each replica shard. This is the add process of the index chain mentioned earlier.

After the add process of the index chain completes, SolrCloud calls the doFinish() function to collect each replica's response and check whether the update succeeded on every replica. If it failed on a replica, the leader forces that shard into recovery by sending a RequestRecovery command to the replica whose update failed.

```java
private void doFinish() {
    // TODO: if not a forward and replication req is not specified, we could
    // send in a background thread

    cmdDistrib.finish();
    List<Error> errors = cmdDistrib.getErrors();
    // TODO - we may need to tell about more than one error...

    // if it's a forward, any fail is a problem -
    // otherwise we assume things are fine if we got it locally
    // until we start allowing min replication param
    if (errors.size() > 0) {
      // if one node is a RetryNode, this was a forward request
      if (errors.get(0).req.node instanceof RetryNode) {
        rsp.setException(errors.get(0).e);
      } else {
        if (log.isWarnEnabled()) {
          for (Error error : errors) {
            log.warn("Error sending update", error.e);
          }
        }
      }
      // else
      // for now we don't error - we assume if it was added locally, we
      // succeeded
    }

    // if it is not a forward request, for each fail, try to tell them to
    // recover - the doc was already added locally, so it should have been
    // legit

    for (final SolrCmdDistributor.Error error : errors) {
      if (error.req.node instanceof RetryNode) {
        // we don't try to force a leader to recover
        // when we cannot forward to it
        continue;
      }
      // TODO: we should force their state to recovering??
      // TODO: do retries??
      // TODO: what if it is already recovering? Right now recoveries queue up -
      // should they?
      final String recoveryUrl = error.req.node.getBaseUrl();

      Thread thread = new Thread() {
        {
          setDaemon(true);
        }
        @Override
        public void run() {
          log.info("try and ask " + recoveryUrl + " to recover");
          HttpSolrServer server = new HttpSolrServer(recoveryUrl);
          try {
            server.setSoTimeout(60000);
            server.setConnectionTimeout(15000);

            RequestRecovery recoverRequestCmd = new RequestRecovery();
            recoverRequestCmd.setAction(CoreAdminAction.REQUESTRECOVERY);
            recoverRequestCmd.setCoreName(error.req.node.getCoreName());
            try {
              server.request(recoverRequestCmd);
            } catch (Throwable t) {
              SolrException.log(log, recoveryUrl
                  + ": Could not tell a replica to recover", t);
            }
          } finally {
            server.shutdown();
          }
        }
      };
      ExecutorService executor = req.getCore().getCoreDescriptor()
          .getCoreContainer().getUpdateShardHandler().getUpdateExecutor();
      executor.execute(thread);
    }
}
```
2. Recovery's overall process

After a replica receives the RequestRecovery command from the leader, it starts a RecoveryStrategy thread and begins recovery. The overall process is as follows:

    • For the RequestRecovery request, only some (not all) of the request commands are listed here; the rest follow the normal index chain process.
    • When a shard accepts the RequestRecovery command, it starts a RecoveryStrategy thread to perform recovery.
```java
// if true, we are recovering after startup and shouldn't have (or be receiving)
// additional updates (except for local tlog recovery)
boolean recoveringAfterStartup = recoveryStrat == null;

recoveryStrat = new RecoveryStrategy(cc, cd, this);
recoveryStrat.setRecoveringAfterStartup(recoveringAfterStartup);
recoveryStrat.start();
recoveryRunning = true;
```

    • The shard sets its state to recovering. Note that if the shard detects that it is itself the leader, the recovery process exits, because recovery means synchronizing data from the leader.
```java
zkController.publish(core.getCoreDescriptor(), ZkStateReader.RECOVERING);
```

    • Next, it determines whether firstTime is true (when a shard restarts, it checks whether a previous replication was interrupted before finishing). firstTime controls whether the PeerSync recovery strategy is attempted first: if it is false, PeerSync is skipped and recovery goes straight to replication.
```java
if (recoveringAfterStartup) {
  // if we're recovering after startup (i.e. we have been down), then we need to know what the last versions were
  // when we went down. We may have received updates since then.
  recentVersions = startingVersions;
  try {
    if ((ulog.getStartingOperation() & UpdateLog.FLAG_GAP) != 0) {
      // last operation at the time of startup had the GAP flag set...
      // this means we were previously doing a full index replication
      // that probably didn't complete and buffering updates in the
      // meantime.
      log.info("Looks like a previous replication recovery did not complete - skipping peer sync. core="
          + coreName);
      firstTime = false; // skip peersync
    }
  } catch (Exception e) {
    SolrException.log(log, "Error trying to get ulog starting operation. core="
        + coreName, e);
    firstTime = false; // skip peersync
  }
}
```
    • The final choice is between the PeerSync strategy and the replication strategy, which were briefly mentioned in <Solr in Action Notes (4): SolrCloud Distributed Indexing Basics>. The specific differences are described in detail in the following two articles.
      • PeerSync: if the interruption was short, the recovering node has only missed a small number of update requests, and these can be obtained from the leader's update log. The threshold is 100 update requests; if more than 100 were missed, a full index snapshot recovery from the leader is performed instead.
      • Replication: if the node has been offline too long to sync the missed updates from the leader's update log, it uses Solr's HTTP-based index snapshot replication.
    • Finally, the shard's state is set to active, and successfulRecovery is checked; if recovery was not successful, further recovery attempts are made.
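The choice between the two strategies can be sketched as follows. This is a minimal, hypothetical model of the threshold logic, not the actual RecoveryStrategy code: the class, method, and parameter names are invented for illustration, and the 100-update window stands in for the number of recent updates the leader's ulog retains for PeerSync.

```java
// Hypothetical sketch of the PeerSync-vs-replication decision
// (not the actual Solr RecoveryStrategy implementation).
public class RecoveryChoice {
    // PeerSync can only replay updates still held in the leader's
    // update log; by default that window is about 100 updates.
    static final int PEER_SYNC_WINDOW = 100;

    enum Strategy { PEER_SYNC, REPLICATE }

    static Strategy choose(boolean firstTime, long missedUpdates) {
        // firstTime == false means PeerSync is skipped outright,
        // e.g. a previous replication left the GAP flag in the ulog.
        if (!firstTime || missedUpdates > PEER_SYNC_WINDOW) {
            return Strategy.REPLICATE; // full index snapshot from leader
        }
        return Strategy.PEER_SYNC;     // replay missed updates from ulog
    }

    public static void main(String[] args) {
        System.out.println(choose(true, 30));   // short outage: PEER_SYNC
        System.out.println(choose(true, 500));  // too many missed: REPLICATE
        System.out.println(choose(false, 10));  // GAP flag set: REPLICATE
    }
}
```

The design point is simply that PeerSync is cheap (replay a bounded tail of updates) while replication is expensive but always correct, so any doubt about the ulog's completeness falls through to replication.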

Summary:

This article mainly introduced the causes of recovery and the overall recovery process. As a brief overview the content is relatively simple; it mainly pointed out the two different recovery strategies, which will be described in detail in the next two articles.

