Solr 4.8.0 Source Code Analysis (20): SolrCloud's Recovery Strategy (Part 1)

Tags: solr

Preface:

When using SolrCloud, we often find that a replica shard enters the recovering state. This indicates that the data within the SolrCloud cluster has become inconsistent and needs to be recovered. While a shard is recovering, it does not write incoming updates to its index files; each update it accepts is written only to its own ulog (update log). There are three articles on recovery; this first one introduces the causes of recovery and the overall process.

1. Causes of recovery

Recovery generally occurs in the following three situations:

    • SolrCloud startup: an unexpected shutdown while the index was being built can leave some shards' data inconsistent with the leader, so at startup those shards synchronize data from the leader.
    • Leader election errors: these generally appear when the leader goes down and a replica has to be elected as the new leader; failures during that transition can require recovery.
    • Update forwarding failure: while an update is in progress, the leader's forward of the update to a replica fails for some reason, forcing that replica to perform recovery to resynchronize its data.

The first two situations are not covered here; this article introduces the third. The general principle is as follows:

As described previously in <Solr 4.8.0 Source Code Analysis (15): SolrCloud Indexing in Depth (2)>, no matter which shard an update request is first sent to, the order of distribution within SolrCloud is always from leader to replica. The leader receives the update request, writes the document into its own index files and the update into its ulog, and then forwards the update to each replica shard. This is the add process of the index chain mentioned earlier.

After the add process of the index chain completes, SolrCloud calls the doFinish() function to collect each replica's response and check whether the update succeeded on every replica. If it failed on a replica, the leader forces that shard into recovery by sending a RequestRecovery command to the replica whose update failed.

```java
private void doFinish() {
    // TODO: if not a forward and replication req is not specified, we could
    // send in a background thread

    cmdDistrib.finish();
    List<Error> errors = cmdDistrib.getErrors();
    // TODO - we may need to tell about more than one error...

    // if it's a forward, any fail is a problem -
    // otherwise we assume things are fine if we got it locally
    // until we start allowing min replication param
    if (errors.size() > 0) {
      // if one node is a RetryNode, this was a forward request
      if (errors.get(0).req.node instanceof RetryNode) {
        rsp.setException(errors.get(0).e);
      } else {
        if (log.isWarnEnabled()) {
          for (Error error : errors) {
            log.warn("Error sending update", error.e);
          }
        }
      }
      // else
      // for now we don't error - we assume if it was added locally, we
      // succeeded
    }

    // if it is not a forward request, for each fail, try to tell them to
    // recover - the doc was already added locally, so it should have been
    // legit

    for (final SolrCmdDistributor.Error error : errors) {
      if (error.req.node instanceof RetryNode) {
        // we don't try to force a leader to recover
        // when we cannot forward to it
        continue;
      }
      // TODO: we should force their state to recovering??
      // TODO: do retries??
      // TODO: what if it is already recovering? Right now recoveries queue up -
      // should they?
      final String recoveryUrl = error.req.node.getBaseUrl();

      Thread thread = new Thread() {
        {
          setDaemon(true);
        }
        @Override
        public void run() {
          log.info("try and ask " + recoveryUrl + " to recover");
          HttpSolrServer server = new HttpSolrServer(recoveryUrl);
          try {
            server.setSoTimeout(60000);
            server.setConnectionTimeout(15000);

            RequestRecovery recoverRequestCmd = new RequestRecovery();
            recoverRequestCmd.setAction(CoreAdminAction.REQUESTRECOVERY);
            recoverRequestCmd.setCoreName(error.req.node.getCoreName());
            try {
              server.request(recoverRequestCmd);
            } catch (Throwable t) {
              SolrException.log(log, recoveryUrl
                  + ": Could not tell a replica to recover", t);
            }
          } finally {
            server.shutdown();
          }
        }
      };
      ExecutorService executor = req.getCore().getCoreDescriptor()
          .getCoreContainer().getUpdateShardHandler().getUpdateExecutor();
      executor.execute(thread);
    }
}
```
2. Recovery's overall process

After a replica receives the RequestRecovery command from the leader, it starts a RecoveryStrategy thread and begins recovery. The overall process is as follows:

    • For the RequestRecovery request, only some (not all) of the request commands are listed here; the rest follow the normal index chain process.
    • When a shard accepts the RequestRecovery command, it starts a RecoveryStrategy thread to perform recovery.
```java
// if true, we are recovering after startup and shouldn't have (or be receiving)
// additional updates (except for local tlog recovery)
boolean recoveringAfterStartup = recoveryStrat == null;

recoveryStrat = new RecoveryStrategy(cc, cd, this);
recoveryStrat.setRecoveringAfterStartup(recoveringAfterStartup);
recoveryStrat.start();
recoveryRunning = true;
```

    • The shard sets its state to recovering. Note that if the shard detects that it is itself the leader, the recovery process exits, because recovery means synchronizing data from the leader.
```java
zkController.publish(core.getCoreDescriptor(), ZkStateReader.RECOVERING);
```

    • Next, it determines whether firstTime is true (when a shard restarts, it checks whether a previous replication was interrupted before finishing). firstTime controls whether the PeerSync recovery strategy is attempted first: if it is false, PeerSync is skipped and recovery goes straight to replication.
```java
if (recoveringAfterStartup) {
  // if we're recovering after startup (i.e. we have been down), then we need to know what the last versions were
  // when we went down. We may have received updates since then.
  recentVersions = startingVersions;
  try {
    if ((ulog.getStartingOperation() & UpdateLog.FLAG_GAP) != 0) {
      // last operation at the time of startup had the GAP flag set...
      // this means we were previously doing a full index replication
      // that probably didn't complete and buffering updates in the
      // meantime.
      log.info("Looks like a previous replication recovery did not complete - skipping peer sync. core="
          + coreName);
      firstTime = false; // skip peersync
    }
  } catch (Exception e) {
    SolrException.log(log, "Error trying to get ulog starting operation. core="
        + coreName, e);
    firstTime = false; // skip peersync
  }
}
```
    • The final choice is between the PeerSync strategy and the replication strategy, which were briefly mentioned in <Solr in Action Notes (4): SolrCloud Distributed Indexing Basics>. The specific differences are described in detail in the following two articles.
      • PeerSync: if the interruption was short, the recovering node has only missed a small number of update requests, and these can be obtained from the leader's update log. The threshold is 100 update requests; if more than 100 were missed, a full index snapshot recovery from the leader is performed instead.
      • Replication: if the node has been offline too long to sync the missed updates from the leader's update log, it uses Solr's HTTP-based index snapshot replication.
    • Finally, the shard's state is set to active, and successfulRecovery is checked; if recovery was not successful, further recovery attempts are made.
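The choice between the two strategies can be sketched as follows. This is a minimal, hypothetical model of the threshold logic, not the actual RecoveryStrategy code: the class, method, and parameter names are invented for illustration, and the 100-update window stands in for the number of recent updates the leader's ulog retains for PeerSync.

```java
// Hypothetical sketch of the PeerSync-vs-replication decision
// (not the actual Solr RecoveryStrategy implementation).
public class RecoveryChoice {
    // PeerSync can only replay updates still held in the leader's
    // update log; by default that window is about 100 updates.
    static final int PEER_SYNC_WINDOW = 100;

    enum Strategy { PEER_SYNC, REPLICATE }

    static Strategy choose(boolean firstTime, long missedUpdates) {
        // firstTime == false means PeerSync is skipped outright,
        // e.g. a previous replication left the GAP flag in the ulog.
        if (!firstTime || missedUpdates > PEER_SYNC_WINDOW) {
            return Strategy.REPLICATE; // full index snapshot from leader
        }
        return Strategy.PEER_SYNC;     // replay missed updates from ulog
    }

    public static void main(String[] args) {
        System.out.println(choose(true, 30));   // short outage: PEER_SYNC
        System.out.println(choose(true, 500));  // too many missed: REPLICATE
        System.out.println(choose(false, 10));  // GAP flag set: REPLICATE
    }
}
```

The design point is simply that PeerSync is cheap (replay a bounded tail of updates) while replication is expensive but always correct, so any doubt about the ulog's completeness falls through to replication.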

Summary:

This article mainly introduced the causes of recovery and the overall recovery process. As a brief overview the content is relatively simple; it mainly pointed out the two different recovery strategies, which will be described in detail in the next two articles.

