SolrCloud: Based on the Solr Wiki


This article was written based on the Apache Solr documentation; if any of the translation is inaccurate or the understanding falls short, corrections are welcome. Thank you!

Nodes, Cores, Clusters and Leaders

Nodes and Cores

In SolrCloud, a node is a JVM instance running Solr, commonly referred to as a server. Each Solr core can also be considered a node. Any node can contain both an instance of Solr and various kinds of data within it.

A Solr core is essentially an index of the text content and fields found in documents. A single Solr instance can contain multiple cores, which are separate from each other based on local criteria. These cores might serve different search audiences (users in the United States versus users in Canada), enforce privacy concerns (some users cannot access certain documents), or hold documents that are unrelated or hard to integrate (shoe data and DVD data).

When you start a new core in SolrCloud mode, the core automatically registers itself with ZooKeeper. This process creates an ephemeral node (which disappears if the Solr instance stops), registers the core, and records how to connect to it (for example, the Solr URL, core name, and so on). Clients and nodes in the cluster can use this information to determine who needs to handle a given request.
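If you want to see this registration happen, ZooKeeper's stock command-line client can list the ephemeral entries; this sketch assumes a local standalone ZooKeeper and no chroot (adjust the path if Solr runs under a ZooKeeper chroot):

zkCli.sh -server localhost:2181 ls /live_nodes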

New Solr cores can be created and associated with a collection via CoreAdmin. The new cloud-related parameters are described on the parameter reference page. The following parameters can be used with the CREATE action:

• collection: the collection this core belongs to. The default is the name of the core.

• shard: the shard ID this core represents.

• collection.<param>=<value>: if a new collection is being created, sets the given <param>=<value> properties on it. For example, collection.configName=<configName> indicates which config the new collection should use.

For example:

curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=mycore&collection=my_collection&shard=shard2'

Clusters

A cluster is a set of Solr nodes managed by ZooKeeper as a single unit. When you have a cluster, you can always send requests to it, and if a request is acknowledged you are assured it will be managed as a unit and be durable; that is, you won't lose data. Updates can be seen as soon as they are made, and the cluster can be expanded or contracted.

Creating a Cluster

A cluster is created as soon as more than one Solr instance is registered with ZooKeeper.

Resizing a Cluster

Clusters contain a configurable number of shards. You set the number of shards for a new cluster by passing the system property numShards when starting Solr. The numShards parameter must be passed on the first startup of any Solr node, and it determines how the instances are automatically assigned to shards. Once you have started more Solr nodes than numShards, the additional nodes will create replicas for each shard, distributed evenly across the nodes, as long as they all belong to the same collection.

To add more cores to your collection, simply start a new core. You can do this at any time, and the new core will synchronize with the current replicas in its shard before becoming active.

You can also bypass numShards by manually assigning a shard ID to a core.
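As an illustration, the first node in the classic Jetty-based example layout might be started like this, with numShards passed as a system property; the ZooKeeper address, config directory, and config name here are placeholders:

java -DzkHost=localhost:2181 -DnumShards=2 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -jar start.jar

Nodes started afterwards only need -DzkHost to join the cluster (plus -DshardId if you want to assign the shard manually, as described below).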

The number of shards determines how your index data is partitioned, so you cannot change the number of shards of an index after the cluster has been set up.

However, you do have the option of breaking your index into multiple shards to begin with, even if you have only one server, and then expanding to multiple servers later. To do that:

1. Set up your collection with multiple cores on a single physical server. Each of these cores is the leader of its shard.

2. When you are ready, migrate those shards to new servers by starting a new replica of each shard on the corresponding new server.

3. Delete the shards on the original server. ZooKeeper will promote the replicas to leaders of those shards.
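A rough sketch of steps 2 and 3, assuming the classic startup and the CoreAdmin API; the hosts, ports, and core name here are hypothetical:

# On the new server: join the cluster; the node picks up a replica of an
# existing shard (or assign the shard explicitly with -DshardId)
java -DzkHost=zk1:2181 -jar start.jar

# After the new replica has synced, remove the old core from the original server
curl 'http://oldhost:8983/solr/admin/cores?action=UNLOAD&core=my_collection_shard1_replica1'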

Leaders and Replicas

The concept of a leader is similar to that of a master in Solr's replication feature. The leader is responsible for making sure the replicas are consistent with the information stored in the leader.

With SolrCloud, however, you don't just have one master and one or more slaves; instead you are likely to be distributing queries and indexing across multiple servers. If you set up Solr with numShards=2, for example, your index is split across two shards. In that case, both shards are considered leaders. If you start more nodes after the initial two, those nodes automatically become replicas of the leaders.

Replicas are assigned to shards in the order they are started the first time they join the cluster. This is done in a round-robin manner, unless the new node is manually assigned to a shard with the shardId parameter during startup. This parameter is usually specified as a system property, -DshardId=1, whose value is the ID of the shard the new node should attach to.

On later restarts, each node joins the shard it was assigned to the first time it started (whether that assignment was manual or automatic). If the node that was originally the leader is not available, a replica can become the leader.

Consider this example:

• Node A is started with the bootstrap parameters, pointing to a standalone ZooKeeper, with numShards set to 2.

• Node B is started and pointed at the standalone ZooKeeper.

Nodes A and B are both shards, and fill the two shard slots we defined when we started Node A. If we look at the Solr Admin UI, we will see that both nodes are considered leaders (indicated by a solid circle).

• Node C is started and pointed at the standalone ZooKeeper.

Node C automatically becomes a replica of Node A, because we didn't specify any other shard for it to belong to, and it cannot become a new shard because we defined only two shards and both are taken.

• Node D is started and pointed at the standalone ZooKeeper.

Node D automatically becomes a replica of Node B, for the same reason Node C became a replica of Node A.

What happens on a restart if Node C starts before Node A? Node C becomes the leader, and Node A becomes a replica of Node C.

Shards and Indexing Data in SolrCloud

When your data is too large for one node, you can break it up and store it in sections by creating one or more shards. Each shard is a portion of the logical index, or core; together the shards contain the whole index.

A shard is a way of splitting a core over a number of servers, or nodes. For example, you might have a shard for data that represents each state, or different categories that are likely to be searched independently, but are often combined.

Before SolrCloud, Solr already supported distributed search, which allowed one query to be executed across multiple shards, so the query was executed against the entire Solr index and no documents would be missing from the results. So splitting a core across shards is not a concept unique to SolrCloud. There were, however, several problems with the distributed approach that SolrCloud needed to improve:

1. Splitting a core into shards was a manual process.

2. There was no support for distributed indexing, which meant you needed to explicitly send each document to a specific shard; Solr could not determine on its own which shard a document should go to.

3. There was no load balancing or failover, so if you received a high volume of queries you had to decide where to send them yourself, and if a shard crashed, it was simply gone.

SolrCloud addresses these problems. It supports distributed indexing and distributed queries automatically, and ZooKeeper provides failover and load balancing. Additionally, every shard can have multiple replicas for added robustness.

There are no masters and slaves in SolrCloud. Instead, there are leaders and replicas. Leaders are elected automatically, initially on a first-come-first-served basis, and then based on the ZooKeeper process described at http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection.

If a leader goes down, one of its replicas is automatically elected as the new leader. As each node is started, it is assigned to the shard with the fewest replicas. When all shards have the same number of replicas, the new node is assigned to the shard with the lowest shard ID.

When a document is sent to a server for indexing, the system first determines whether that server is a replica or a leader.

• If the server is a replica, the document is forwarded to the leader for processing.

• If the server is a leader, SolrCloud determines which shard the document should go to, forwards the document to that shard's leader, indexes the document there, and then forwards the indexed document to all of that shard's replicas.
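In practice this means a client can send an update to any node in the cluster and let SolrCloud do the routing; a minimal sketch, where the collection name, document, and field are illustrative (note the deliberate absence of an explicit commit):

curl 'http://localhost:8983/solr/my_collection/update' -H 'Content-Type: application/json' -d '[{"id":"12345","name_s":"example doc"}]'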

Document Routing

Solr lets you specify the router implementation when creating your collection, using the router.name parameter. If you use the "compositeId" router, you can send documents with a prefix in the document ID; the prefix is used to calculate the hash Solr uses to determine which shard the document goes to. The prefix can be anything you like (it does not have to be a shard name), but it must be consistent so Solr behaves consistently. For example, if you want to co-locate documents for a customer, you could use the customer name or ID as the prefix. If your customer is IBM and a document has the ID "12345", you would insert the prefix into the document's ID field: "ibm!12345". The exclamation mark ("!") is the separator that distinguishes the prefix used to determine which shard handles the document.

Then at query time, you include the prefix in the _route_ parameter of your query (that is, q=solr&_route_=ibm!) to direct the query to the specific shard. In some situations this improves query performance because it removes the network latency of querying all shards.

Tip: the _route_ parameter replaces shard.keys; shard.keys is deprecated and will be removed in a future Solr release.

The compositeId router supports prefixes with up to two levels of routing. For example, a prefix routing first by region, then by customer: "usa!ibm!12345".

Another use case is when a customer such as IBM has so many documents that you want to spread them across multiple shards. The syntax for this is "shard_key/num!document_id", where num is the number of bits from the shard key to use in the composite hash.

Therefore, "ibm/3!12345" will occupy 3bit in Shard Key, 29bit in the unique doc ID, and propagate 1/8 of tenants over Shard in collection. For example, a num value of 2 propagates the document across the 1/4 shard. In the query, directly to the specified shard with the _route_ parameter query, the bit num is included in the prefix (that is, q=solr&_route_=ibm/3!).

If you do not want to influence how documents are stored, simply don't specify a prefix in the document ID.
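Putting compositeId routing together, a brief sketch; the collection name is illustrative:

# Index a document whose ID carries a routing prefix
curl 'http://localhost:8983/solr/my_collection/update' -H 'Content-Type: application/json' -d '[{"id":"ibm!12345"}]'

# Query only the shard(s) that the "ibm" prefix hashes to
curl 'http://localhost:8983/solr/my_collection/select?q=solr&_route_=ibm!'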

If you created the collection with the "implicit" router at creation time, you can additionally define a router.field parameter to pick which field of each document identifies the shard it belongs to. If that field is missing from a document, the document is rejected. You can also use the _route_ parameter to name a specific shard.
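A collection using the implicit router might be created like this; note that with router.name=implicit you name the shards yourself instead of passing numShards (the collection, shard, and field names here are illustrative):

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=regions&router.name=implicit&shards=east,west&router.field=region_s'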

Shard Splitting

When you create a collection in SolrCloud, you decide on the initial number of shards. But it is difficult to know the number of shards you need in advance, especially when organizational requirements can change at a moment's notice, and the cost of getting it wrong is high: fixing it involves creating new cores and re-indexing all of your data.

The Collections API provides the ability to split a shard. Currently it allows splitting a shard into two pieces. The existing shard is left as-is, so the split actually produces two copies of the data as two new shards. You can delete the old shard when you are ready.

More details on splitting shards are in the Collections API section: https://cwiki.apache.org/confluence/display/solr/Collections+API
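For example, splitting shard1 of a collection in two is a single Collections API call (the collection name is illustrative):

curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=my_collection&shard=shard1'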

Ignoring Commits from Client Applications in SolrCloud

In most cases, when running in SolrCloud mode, indexing client applications should not send explicit commit requests. Rather, you should configure auto commits with openSearcher=false and auto soft-commits to make recent updates visible in search requests. This ensures that commits happen on a regular schedule across the cluster. To enforce a policy where client applications don't send explicit commits, you would have to update every client application that indexes data into SolrCloud. That is not always feasible, so Solr provides the IgnoreCommitOptimizeUpdateProcessorFactory, which lets you ignore explicit commit or optimize requests from client applications without refactoring the client code. To activate this request processor, add the following to solrconfig.xml:

<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">200</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

In the example above, the processor returns 200 to the client but ignores the commit or optimize request. Notice that you also need to wire in the implicit processors needed by SolrCloud, since this custom chain takes the place of the default chain.
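As an aside, the openSearcher=false auto-commit policy mentioned at the top of this section is configured in solrconfig.xml roughly as follows; the interval values here are only illustrative:

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>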

In the following variant, the processor raises an exception with a 403 code and a customized error message:

<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">403</int>
    <str name="responseMessage">Thou shall not issue a commit!</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Lastly, you can configure the processor to ignore only optimize requests while letting commits pass through:

<updateRequestProcessorChain name="ignore-optimize-only-from-client-403">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <str name="responseMessage">Thou shall not issue an optimize, but commits are OK!</str>
    <bool name="ignoreOptimizeOnly">true</bool>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Distributed Requests

Limiting Which Shards Are Queried

One of the advantages of SolrCloud is the ability to distribute queries across shards that may or may not contain the data you're looking for. You have the option of querying all of your data or just parts of it.

A query across all shards of a collection looks familiar; it's as if SolrCloud didn't even come into play:

http://localhost:8983/solr/gettingstarted/select?q=*:*

If, on the other hand, you want to search just one shard, you can specify that shard by its logical ID:

http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=shard1

If you want to search a group of shard IDs, you can specify them together:

http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=shard1,shard2

In both of the above examples, a random replica of each named shard is chosen to serve the request.

Alternatively, you can explicitly specify the addresses of the replicas you wish to use in place of shard IDs:

http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=localhost:7574/solr/gettingstarted,localhost:8983/solr/gettingstarted

Or you can specify a list of replicas to choose from for a single shard (for load balancing purposes) by using the pipe symbol (|):

http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=localhost:7574/solr/gettingstarted|localhost:7500/solr/gettingstarted

And of course, you can specify a list of shards (separated by commas) where each entry may itself be a list of replicas (separated by pipes). In this example, two shards are queried: the first is a randomly chosen replica of shard1, and the second is chosen from an explicit pipe-delimited pair:

http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=shard1,localhost:7574/solr/gettingstarted|localhost:7500/solr/gettingstarted

Configuring the ShardHandlerFactory

You can directly configure aspects of the concurrency and thread pooling used within distributed search in Solr. This allows for finer-grained control, so you can tune it to your own specific requirements. The default configuration favors throughput over latency.

To configure the standard handler, provide a configuration like this in solrconfig.xml:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- other params go here -->
  <shardHandler class="HttpShardHandlerFactory">
    <int name="socketTimeout">1000</int>
    <int name="connTimeout">5000</int>
  </shardHandler>
</requestHandler>

Configuring statsCache (Distributed IDF)

Document and term statistics are needed in order to calculate relevance. Solr provides four implementations out of the box for computing these statistics:

LocalStatsCache: This uses only local term and document statistics to compute relevance. For collections whose terms are distributed uniformly across shards, this works reasonably well. This is the default if no <statsCache> is configured.

ExactStatsCache: This implementation uses global values (across the collection) for document frequency.

ExactSharedStatsCache: This is functionally like ExactStatsCache, but the global statistics are reused for subsequent requests with the same terms.

LRUStatsCache: This implementation uses an LRU cache to hold global statistics, which are shared between requests.

The implementation is selected by setting <statsCache> in solrconfig.xml. For example, the following line makes Solr use the ExactStatsCache implementation:

<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>

Avoiding Distributed Deadlock

Each shard serves top-level query requests and then makes sub-requests to all of the other shards. Care should be taken to ensure that the maximum number of threads serving HTTP requests is greater than the possible number of requests from both top-level clients and other shards. If this is not the case, the configuration may result in a distributed deadlock.

For example, a deadlock might occur in the case of two shards, each with just a single thread to serve HTTP requests. Both threads could receive a top-level request concurrently and make sub-requests to each other. Because there are no remaining threads to service those requests, the incoming requests will be blocked until the other pending requests finish, but they will never finish because they are waiting on their own sub-requests. By ensuring that Solr is configured to handle a sufficient number of threads, you can avoid deadlock situations like this.

Prefer Local Shards

Solr allows you to pass an optional boolean parameter named preferLocalShards to indicate that a distributed query should prefer local replicas of a shard when available. In other words, if a query includes preferLocalShards=true, the query controller will look for local replicas to service the query instead of selecting replicas at random from across the cluster. This is useful when a query requests many fields or large fields to be returned, because it avoids moving large amounts of data over the network when the data is available locally. In addition, this feature can help minimize the impact of a problematic replica with degraded performance, as it reduces the likelihood that the degraded replica will be hit by other healthy replicas.
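For example, reusing the gettingstarted collection from the earlier examples:

http://localhost:8983/solr/gettingstarted/select?q=*:*&preferLocalShards=true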

Lastly, note that the value of this feature diminishes as the number of shards in a collection increases, because the query controller will have to direct the query to non-local replicas for most of the shards. In other words, this feature is mostly useful for optimizing queries directed at collections with a small number of shards and many replicas. Also, this option should only be used if you are load balancing requests across all nodes that host replicas of the collection you are querying, as Solr's CloudSolrClient does. Without load balancing, this feature can introduce a hotspot in the cluster because queries will not be evenly distributed across the cluster.


https://cwiki.apache.org/confluence/display/solr/SolrCloud

http://wiki.apache.org/solr/FrontPage

To be continued...
