Reference:
SolrCloud Chinese explanation
Deploying SolrCloud with ZooKeeper on Windows
Official documentation
SolrCloud Wiki
Solr Chinese documentation
What is SolrCloud
Explanation from the official documentation:
SolrCloud is designed to provide a highly available, fault tolerant environment for distributing your indexed content and query requests across multiple servers.
It's a system in which data is organized into multiple pieces, or shards, that can be hosted on multiple machines, with replicas providing redundancy for both scalability and fault tolerance, and a ZooKeeper server that helps manage the overall structure so that both indexing and search requests can be routed properly.
It can also be said that SolrCloud is one way to deploy Solr; besides SolrCloud, Solr can also be deployed as a single machine or as a multi-machine master-slave setup. Distributed indexing means using Solr's distributed index when the index becomes so large that a single machine cannot meet the disk requirements, or when a simple query takes too long. In a distributed index, the original large index is split into smaller indexes; Solr merges the results returned by these small indexes and returns them to the client.
Why use SolrCloud
The single-machine Solr deployment needs no further explanation, but as the business grows, the data volume increases, and the required throughput rises, a clustered, distributed solution has to be considered.
Master-slave mode
1. The master is responsible for receiving the index data. It is best not to include it among the Solr nodes that serve the cluster's queries, because the cost of building the index affects efficiency, and once the master goes down, users connected to it can no longer query normally.
2. A slave can be configured with what action (for example, on commit) and at what interval (for example, every 10 minutes) it polls the master. It then replicates the master's index after steps such as comparing version numbers and index size and transferring the data (a rough configuration sketch follows).
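As a rough illustration (not taken from the original article), master-slave replication is typically set up through Solr's ReplicationHandler in solrconfig.xml; the URL, core name, and interval below are placeholders:

<!-- On the master: replicate after every commit -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml</str>
  </lst>
</requestHandler>

<!-- On the slave: poll the master every 10 minutes -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8080/solr/my_core/replication</str>
    <str name="pollInterval">00:10:00</str>
  </lst>
</requestHandler>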
SolrCloud
First, let's describe a few key components and concepts:
1. Cluster: a cluster is a set of Solr nodes, logically managed as a unit; the entire cluster must use the same set of schema and solrconfig files.
2. Node: a JVM instance running Solr.
3. Collection: the complete logical index of a SolrCloud cluster. It is often divided into one or more shards, which all use the same config set. If there is more than one shard, the indexing scheme is a distributed index. SolrCloud lets a client refer to it by collection name, so the user does not need to care about the shard-related parameters that distributed retrieval would otherwise require.
4. Core: a Solr core. A Solr instance has one or more Solr cores, each of which can provide indexing and query capabilities independently; Solr cores exist to increase management flexibility and resource sharing. In SolrCloud the configuration they use lives in ZooKeeper, whereas a traditional Solr core keeps its configuration files in a directory on disk.
5. Config set: the set of configuration files a Solr core must have to provide its service; each config set has a name. The minimum requirement is solrconfig.xml and schema.xml; depending on what those two files reference, other files may also be needed, such as the dictionary files required for Chinese indexing. Config sets are stored in ZooKeeper, can be uploaded or updated with the upconfig command, and can be initialized or updated with Solr's bootstrap_confdir startup parameter.
6. Shard: a logical shard of a collection. Each shard has one or more replicas, which hold an election to determine which one is the leader.
7. Replica: a copy of a shard. Each replica lives in a Solr core; in other words, one Solr core corresponds to one replica. For example, a collection named "test" created with numShards=1 and replicationFactor=2 results in 2 replicas, i.e. 2 cores, stored on different machines or Solr instances; one will be named test_shard1_replica1 and the other test_shard1_replica2, and one of them will be elected leader.
8. Leader: the replica of a shard that wins the election. Each shard has multiple replicas, and these replicas elect a leader among themselves. Elections can happen at any time, but they are usually triggered only when a particular Solr instance fails. When an index operation is performed, SolrCloud routes the request to the leader of the corresponding shard, and the leader distributes it to all of that shard's replicas.
9. ZooKeeper: ZooKeeper provides distributed locking, which SolrCloud requires, and is mainly responsible for handling leader election. Solr can run with an embedded ZooKeeper or use a standalone one; Solr recommends a standalone ensemble of at least 3 hosts.
Here is a summary of some of the more important features:
1. Centralized management of configuration files
Configuration files such as solrconfig.xml and schema.xml are handed over to ZooKeeper for centralized, unified management.
2. Automatic fault tolerance
The leader is elected dynamically, and machine failures in the cluster are detected through watchers. If a machine fails, index creation and queries skip it; if the leader machine fails, a new leader is elected.
3. NRT (near real time) search
Solr provides soft commits (autoSoftCommit): the index is first committed to memory and becomes visible, so users can query the new content, but it has not yet been flushed to disk. Obviously, if the machine room loses power, that part of the index disappears. "Real time" is therefore relative, and should be applied based on the business and the volume of index data.
The general configuration is as follows:
Hard commit every 5 minutes:
<autoCommit>
<maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
Soft commit every second:
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:1000}</maxTime>
</autoSoftCommit>
4. Easier scaling
SolrCloud can be scaled and managed through a web API, the Collections API; a rough example is shown below.
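For example (a sketch only; the host, port, and collection name are just the ones used in this article's setup), a collection can be created over HTTP with the Collections API:

http://localhost:8081/solr/admin/collections?action=CREATE&name=test&numShards=2&replicationFactor=2&collection.configName=clusterconf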
Preparing the environment
This article uses Windows as the example; there are plenty of tutorials online for deploying on Linux, which you can follow step by step.
ZooKeeper: zookeeper-3.4.11
Tomcat: see the previous article on setting up a standalone Solr
Overall SolrCloud structure
ZooKeeper configuration
After extracting, the files look like this:
Rename zoo_sample.cfg under conf to zoo.cfg and change dataDir to point to a newly created data folder. Note that the path here uses forward slashes, which differs from the Windows default.
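A minimal zoo.cfg sketch (the dataDir path is only an example for this machine):

tickTime=2000
dataDir=D:/solr/zookeeper-3.4.11/data
clientPort=2181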
Tomcat Configuration
Note: since I am building SolrCloud on a single machine, I prepared 4 Tomcats. To run them at the same time, each needs a different port; just modify each server.xml, which I will not repeat here.
Here we pick the Tomcat on port 8081 as the "leader" and configure its initialization at startup. Locate the catalina.bat file under Tomcat's bin folder and set the run parameters:
set JAVA_OPTS=-Dbootstrap_confdir=D:\solr\server\solrhome-8081\my_solr\conf -Dcollection.configName=clusterconf -DzkRun -DzkHost=localhost:2181 -DnumShards=2
1. -Dbootstrap_confdir: ZooKeeper needs a copy of the cluster configuration; this parameter tells SolrCloud where that configuration is, and it becomes the common configuration for the whole cluster.
2. -Dcollection.configName: the name under which your configuration is stored once uploaded to ZooKeeper.
3. -DzkRun: starts an embedded ZooKeeper server inside Solr, which manages the cluster's configuration.
4. -DzkHost: similar in meaning to the parameter above; lets you specify by IP and port which ZooKeeper server to coordinate with.
5. -DnumShards=2: the number of shards, i.e. how many shards the index is split across.
The catalina.bat configuration for the other three Tomcats (8082, 8083, 8084) is as follows:
set JAVA_OPTS=-DzkRun -DzkHost=localhost:2181 -DnumShards=2
web.xml configuration
Modify the solr/home path specified in the Solr web application's web.xml; 8081 is shown as the example.
<!-- Configure Solr home location -->
<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-value>D:\solr\server\solrhome-8081</env-entry-value>
  <env-entry-type>java.lang.String</env-entry-type>
</env-entry>
solr.xml Configuration
Here the 8081 instance is shown as an example; the others follow the same pattern.
<!-- SolrCloud configuration combined with ZooKeeper: start -->
<solrcloud>
  <str name="host">${host:localhost}</str>
  <int name="hostPort">${port:8081}</int>
  <str name="hostContext">${hostContext:solr}</str>
  <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
  <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
  <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
</solrcloud>
<shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
  <int name="socketTimeout">${socketTimeout:600000}</int>
  <int name="connTimeout">${connTimeout:60000}</int>
</shardHandlerFactory>
<!-- SolrCloud configuration combined with ZooKeeper: end -->
Overall project file structure
Run
Run ZooKeeper
Run zkServer.cmd under .\zookeeper-3.4.11\bin
Visit port 8081, or any of the deployed Solr instances, to reach the following interface.
As you can see, two shards have been generated, and 8081 and 8082 are the leaders of the two shards respectively.
Looking at the files that ZooKeeper manages for us, you can see the surviving nodes, all current collections, and the clusterconf folder we entrusted to ZooKeeper (the name of this folder corresponds to the configuration name we set earlier through the JVM system parameters).
How SolrCloud creates the index
As shown in the figure, when we add an index to SolrCloud, the key questions are where our index is ultimately written and how it is written.
Once SolrCloud is set up, each shard corresponds to a hash range; take this article's setup as an example.
Shard1 corresponds to the hash range 00000000-7fffffff and shard2 to the range 80000000-ffffffff. SolrCloud then computes a hash value from the unique ID of the document being added, which determines which shard the document should be stored on.
SolrCloud places two requirements on how the hash value is obtained:
1. The hash calculation must be fast, because it is the first step of distributed indexing.
2. Hash values must be evenly distributed across the shards. If one shard holds more documents than another, querying it takes longer; since a SolrCloud query waits for every shard to finish before aggregating the results, the overall query speed is determined by the slowest shard.
Based on these two points, SolrCloud uses the MurmurHash algorithm, which is fast and distributes hash values evenly. I summarize the routing in 3 steps (a small sketch of the range check follows the note below):
1. Any replica receives an index-add request.
2. The request is forwarded to the leader of that replica's shard.
3. If the hash of the document to index does not fall within the current shard's hash range, the request is forwarded to the shard it belongs to.
Note: the routing between a shard's leader and its replicas is omitted here.
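A rough sketch of the range check (not the actual Solr implementation; the 32-bit hash function here is only a stand-in for MurmurHash, and the two ranges match the example above):

public class ShardRoutingSketch {

    // Two shards covering the full 32-bit hash space, as in the example above.
    static final long[][] SHARD_RANGES = {
            {0x00000000L, 0x7fffffffL},   // shard1
            {0x80000000L, 0xffffffffL}    // shard2
    };

    // Stand-in for MurmurHash: any fast 32-bit hash of the document id works for illustration.
    static long hash32(String docId) {
        return docId.hashCode() & 0xffffffffL; // keep it as an unsigned 32-bit value
    }

    static int shardFor(String docId) {
        long h = hash32(docId);
        for (int i = 0; i < SHARD_RANGES.length; i++) {
            if (h >= SHARD_RANGES[i][0] && h <= SHARD_RANGES[i][1]) {
                return i + 1;
            }
        }
        return -1; // unreachable: the ranges cover the whole space
    }

    public static void main(String[] args) {
        System.out.println("doc 123321 goes to shard" + shardFor("123321"));
    }
}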
Adding an index using SolrJ
package com.yvan.solrcloud;

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

/**
 * CloudSolrClient routes requests internally to the leader node,
 * which avoids extra routing overhead.
 * @author Yvan
 * March 26, 2018, 6:23:27 PM
 */
public class AppMain {

    @SuppressWarnings("resource")
    public static void main(String[] args) throws SolrServerException, IOException {
        // CloudSolrClient adds the index; 127.0.0.1:2181 is the ZooKeeper host
        CloudSolrClient client = new CloudSolrClient("127.0.0.1:2181");

        // The collection name, corresponding to the Solr core
        final String defaultCollection = "my_solr";
        final int zkClientTimeout = 20000;
        final int zkConnectTimeout = 1000;

        System.out.println("The cloud SolrServer instance has been created!");
        client.setDefaultCollection(defaultCollection);
        client.setZkClientTimeout(zkClientTimeout);
        client.setZkConnectTimeout(zkConnectTimeout);
        client.connect();

        SolrInputDocument document = new SolrInputDocument();
        document.addField("productCode", "123321");

        UpdateResponse ur = client.add(document);
        System.out.println(ur.getStatus() == 0 ? "Index add success" : "Index add fail");
        client.commit();
    }
}
How SolrCloud retrieves an index
The entire query is divided into these steps (a small SolrJ query sketch follows the list):
1. The index query request is routed to any replica of any shard.
2. That replica looks up the routing information and issues sub-queries to replicas of the other shards.
3. The sub-queries return their results.
4. The results are merged and the response is returned.
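A rough SolrJ sketch of such a query (a fragment, not a full class: it assumes the connected CloudSolrClient named client from the add example above, plus imports for org.apache.solr.client.solrj.SolrQuery and org.apache.solr.client.solrj.response.QueryResponse; the field name is just the one used earlier):

// Query sketch: slots into the main method of the add example, after client.connect()
SolrQuery query = new SolrQuery("productCode:123321");
query.set("debugQuery", "true");                      // ask Solr to return debug details

QueryResponse response = client.query(query);
response.getResults().forEach(System.out::println);   // matched documents
System.out.println(response.getDebugMap());           // per-shard debug information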
We just added a doc with the value '123321'; now let's search for it with debug mode turned on.
We send the request to port 8082, and the index can be queried there.
Let's take a look at the debug message.
In the EXECUTE_QUERY entries you can see that both shards were queried, while GET_FIELDS is returned by the shard made up of 8081 and 8083, showing that this index falls within that shard's hash range.
How to manage the configuration files in ZooKeeper
Uploading configuration files:
1. Upload a copy at startup by setting the -Dbootstrap_confdir parameter.
2. Upload configuration files via the Configsets API.
3. Upload configuration by calling the CloudSolrClient.uploadConfig method (the method may differ between SolrJ versions).
4. Upload files with org.apache.solr.cloud.ZkCLI:
java -classpath server/solr-webapp/webapp/WEB-INF/lib/* org.apache.solr.cloud.ZkCLI -zkhost localhost:2181 -cmd upconfig -confname {alias to use in ZooKeeper} -confdir {local path of the configuration directory to upload}
Managing a single file:
Use org.apache.solr.cloud.ZkCLI from the Solr core libraries and run the putfile command to replace a file. The example below is for Windows; on Linux just change the paths.
java -classpath d:/solr/server/apache-tomcat-solr-leader-8081/webapps/solr/WEB-INF/lib/* org.apache.solr.cloud.ZkCLI -zkhost localhost:2181 -cmd putfile /configs/clusterconf/synonyms.txt C:/users/yvan/desktop/synonyms.txt
/configs/clusterconf/synonyms.txt is the path of the file to replace in ZooKeeper's data; C:/users/yvan/desktop/synonyms.txt is the local file path.
Other applications of SolrCloud are still being explored; I will keep updating this article as my work progresses.