SolrCloud Usage Tutorial: Introduction to the Principles
SolrCloud is a distributed search solution based on Solr and ZooKeeper, and it is one of the core components under development in Solr 4.0. Its main idea is to use ZooKeeper as the configuration information center of the cluster.
It has several features: ① centralized configuration information ② automatic fault tolerance ③ near-real-time search ④ automatic load balancing of queries.
Here's a look at the wiki documentation:
1. SolrCloud
SolrCloud refers to the new set of distributed capabilities in Solr. These capabilities allow you to set up a highly available, fault-tolerant cluster of Solr servers. Use SolrCloud when you need large-scale, fault-tolerant, distributed indexing and retrieval capabilities.
Take a look at the "Start" section below to quickly learn how to start a cluster. There are 3 quick and simple examples that show how to start a progressively more complex cluster. After checking out the examples, you need to read through the sections that follow to learn more about the details.
2. A little about SolrCores and Collections
On a single Solr instance running by itself, you have something called a SolrCore (configured in solr.xml), which is essentially a single index. If you want multiple indexes, you create multiple SolrCores. With SolrCloud, a single logical index can span multiple Solr instances. This means that a single logical index can be made up of multiple SolrCores on different servers. We call all of the SolrCores that together make up one logical index a collection. A collection is essentially a single logical index spanning many SolrCores, which lets the index scale out and be replicated across redundant machines. If you wanted to move a two-SolrCore Solr setup to SolrCloud, you would have two collections, each made up of multiple individual SolrCores.
3. Start
Download Solr 4-Beta or a later version.
If you haven't already, work through the simple Solr tutorial to familiarize yourself with Solr. Note: reset all configuration and remove the tutorial's documents before trying out the cloud features, because copying an example directory that already contains a Solr index will throw the document counts off. Solr embeds and uses ZooKeeper as the repository for cluster configuration and coordination; think of it as a distributed filesystem that contains information about all of the Solr servers.
If you want to use a port other than 8983 for Solr, see the note about solr.xml in the "Parameter Reference" section below.
Example A: A simple 2-shard cluster
This example simply creates a cluster of two Solr servers hosting two different shards that together make up one complete index (collection). Since we will need two Solr servers, simply copy the example directory as a second server, making sure that no indexes are present in either.
rm -r example/solr/collection1/data/*
cp -r example example2
The following command starts up a Solr server and also bootstraps a new Solr cluster:
cd example
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
-DzkRun causes an embedded ZooKeeper server to run as part of this Solr server.
-Dbootstrap_confdir=./solr/collection1/conf: since we don't yet have a configuration in ZooKeeper, this parameter causes the local configuration directory ./solr/collection1/conf to be uploaded as the "myconf" configuration. "myconf" is taken from the collection.configName parameter defined below.
-Dcollection.configName=myconf sets the configuration name for the new collection. Omitting this parameter causes the configuration name to default to "configuration1".
-DnumShards=2 is the number of logical partitions (shards) we intend to split the index into.
Browse to http://localhost:8983/solr/#/~cloud to see the state of the cluster. From the ZooKeeper browser you can see that the Solr configuration files were uploaded as "myconf", and that a new document collection called "collection1" was created. Under collection1 is the list of shards that make up the complete collection.
Now we want to start our second server; since we didn't explicitly set a shard ID, it will automatically be assigned shard2. Start the second server, pointing it at the cluster:
cd example2
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
-Djetty.port=7574 is just one way to tell the Jetty servlet container to use a different port.
-DzkHost=localhost:9983 points to the ZooKeeper service(s) holding the cluster state. In this example we are running a single ZooKeeper server embedded in the first Solr server. By default, an embedded ZooKeeper server runs on port 9983.
If you refresh the browser, you should now see both shard1 and shard2 under collection1.
Next, index some documents. If you want to do something quick from Java, you can use the CloudSolrServer SolrJ implementation and simply initialize it with the ZooKeeper address (a sketch of this follows the commands below). Or you can choose any Solr instance to add documents to, and they will be automatically routed to where they belong.
cd exampledocs
java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar ipod_video.xml
java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar monitor.xml
java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar mem.xml
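If you are working from Java, the CloudSolrServer approach mentioned above can be sketched roughly as follows; this assumes the two-shard cluster from this example (embedded ZooKeeper on localhost:9983, collection "collection1") and uses the example schema's id and name fields.

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexExample {
    public static void main(String[] args) throws Exception {
        // Connect through ZooKeeper (the embedded server started in Example A).
        CloudSolrServer server = new CloudSolrServer("localhost:9983");
        server.setDefaultCollection("collection1");

        // Build a minimal document; field names follow the example schema.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "cloud-doc-1");
        doc.addField("name", "A document indexed through CloudSolrServer");

        // The document is automatically routed to the shard it belongs to.
        server.add(doc);
        server.commit();
        server.shutdown();
    }
}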
Now, send a request to any of the servers; a distributed search across the entire collection will serve the results:
http://localhost:8983/solr/collection1/select?q=*:*
If at any point you wish to start over or try a different configuration, you can delete all of the cloud state in ZooKeeper by simply deleting the solr/zoo_data directory after shutting down the servers.
Example B: A simple two-shard cluster with shard replicas
This example builds on the previous one by creating replicas of shard1 and shard2. Extra shard replicas are used for high availability and fault tolerance, or to increase the query capacity of the cluster.
First, run through the previous example so that we already have the two shards with some documents indexed into them. Then simply make copies of the two servers:
cp -r example exampleB
cp -r example2 example2B
Then start the two new servers on different ports, each in its own window:
cd exampleB
java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar

cd example2B
java -Djetty.port=7500 -DzkHost=localhost:9983 -jar start.jar
Refresh the ZooKeeper browse page to verify that all four Solr nodes are up, and that each shard now has two replicas. Because we told Solr that we wanted two logical shards, startup instances 3 and 4 are automatically assigned as additional replicas of those shards.
Now, send a request to any server to query the cluster:
http://localhost:7500/solr/collection1/select?q=*:*
Send this request multiple times and study the logs of the Solr servers. You should be able to see Solr load-balancing the requests across the replicas, satisfying each request with a different server. The server the browser sends the request to will log the top-level request, and there will also be log entries for the sub-requests that were merged to generate the complete response.
To demonstrate failover for availability, press Ctrl-C in the window of any one of the servers (except the one running ZooKeeper). Once that instance stops running, send another query request to any of the remaining running servers; you should still see the complete results.
SolrCloud can continue to serve complete query results as long as at least one server hosting each shard remains alive. You can demonstrate this high availability by explicitly shutting down individual instances and checking the query results. If you shut down all of the servers hosting a particular shard, requests to the other servers will return a 503 error. To return just the documents that are available from the shards that are still alive, add the shards.tolerant=true query parameter.
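The shards.tolerant parameter can also be set through SolrJ; a minimal sketch, again assuming the embedded ZooKeeper address and collection name used in these examples:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TolerantQueryExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("localhost:9983");
        server.setDefaultCollection("collection1");

        SolrQuery query = new SolrQuery("*:*");
        // Return whatever the still-running shards can provide instead of failing with an error.
        query.set("shards.tolerant", true);

        QueryResponse response = server.query(query);
        System.out.println("Documents found: " + response.getResults().getNumFound());
        server.shutdown();
    }
}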
SolrCloud is implemented with leader and overseer roles. This means that certain nodes and replicas play special roles. You don't need to worry about whether the instance you shut down was a shard leader or the cluster overseer; if it was, failover will automatically elect a new leader or a new overseer, and the new one will seamlessly take over its duties. Any Solr instance can be promoted to one of these roles.
Example C: A two-shard cluster with replicas and a full ZooKeeper ensemble
The problem with Example B is that while there are enough Solr servers to survive the loss of any one of them, there is only one ZooKeeper server containing the state of the cluster. If that ZooKeeper server goes down, distributed queries will still work using the last cluster state saved from ZooKeeper, but there will be no way for the cluster state to change.
Running multiple ZooKeeper servers in a ZooKeeper ensemble (ZooKeeper cluster) guarantees high availability of the ZooKeeper service. Every ZooKeeper server needs to know about every other server in the ensemble, and a majority of the servers must be up for it to provide service. For example, an ensemble of 3 ZooKeeper servers allows any one of them to fail, with the remaining 2 forming a majority that continues to provide service; an ensemble of 5 is needed to allow 2 servers to fail at a time.
For a production service, it is recommended to run an external ZooKeeper ensemble rather than the ZooKeeper servers embedded in Solr; you can read more about setting up a ZooKeeper ensemble on the Apache ZooKeeper site. For this example, for simplicity, we will use the embedded ZooKeeper servers.
First stop all 4 servers and clean up the ZooKeeper data:
rm -r example*/solr/zoo_data
We will run the servers on ports 8983, 7574, 8900, and 7500. By default, an embedded ZooKeeper server runs on the Solr port plus 1000, so if we run embedded ZooKeeper with the first three servers, their addresses will be localhost:9983, localhost:8574, and localhost:9900 respectively.
For convenience, we upload the first server's Solr configuration to the cluster again. You will notice that the first server blocks until you start the second one; this is because ZooKeeper needs to reach a quorum of servers before it can operate.
cd example
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2 -jar start.jar

cd example2
java -Djetty.port=7574 -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar

cd exampleB
java -Djetty.port=8900 -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar

cd example2B
java -Djetty.port=7500 -DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar
We now have a cluster running an ensemble of 3 embedded ZooKeeper servers, which keeps operating even if one of them is lost. Kill the exampleB server with Ctrl-C and then refresh the browser to see the ZooKeeper state; you can see that the ZooKeeper service still works. Note: when running on multiple hosts, the default localhost will not work; you need to set the exact host name and port on each instance, i.e. -DzkRun=hostname:port and the corresponding entries in -DzkHost.
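On the client side, SolrJ's CloudSolrServer accepts the same comma-separated ensemble address, so it keeps working as long as a ZooKeeper majority is alive. A minimal sketch, assuming the ports used in this example:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class EnsembleClientExample {
    public static void main(String[] args) throws Exception {
        // The full ZooKeeper ensemble, matching the -DzkHost value used above.
        CloudSolrServer server = new CloudSolrServer("localhost:9983,localhost:8574,localhost:9900");
        server.setDefaultCollection("collection1");

        // A simple distributed query; the loss of any single ZooKeeper server is tolerated.
        System.out.println(server.query(new SolrQuery("*:*")).getResults().getNumFound());
        server.shutdown();
    }
}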
4. ZooKeeper
For fault tolerance and high availability, multiple ZooKeeper servers are run together; this mode is called an ensemble. For production environments it is recommended to run a ZooKeeper ensemble instead of Solr's embedded ZooKeeper servers. Visit Apache ZooKeeper for more information on downloading and running an ensemble, and more specifically on starting and managing ZooKeeper; it is actually very simple to run. You can let Solr run ZooKeeper, but keep in mind that a ZooKeeper cluster is not easy to modify dynamically. Until dynamic modification is supported, it is best to make changes via rolling restarts. Handling ZooKeeper separately from Solr is usually the preferable option.
When Solr runs an embedded ZooKeeper server, it defaults to using the Solr port plus 1000 as the ZooKeeper client port, the client port plus 1 as the ZooKeeper server port, and the client port plus 2 as the leader election port. So in the first example, with Solr running on port 8983, the embedded ZooKeeper uses 9983 as the client port and 9984 and 9985 as the server ports.
Since getting ZooKeeper up and running is very fast, there are a few things to keep in mind: Solr does not place heavy demands on ZooKeeper, so much of this optimization may be unnecessary. Also, while adding more ZooKeeper nodes helps read performance, it slightly reduces write performance. And once the cluster is running steadily, Solr does not interact with ZooKeeper very much. If you do need to optimize ZooKeeper, here are a few helpful points:
1. ZooKeeper works best on a dedicated machine. ZooKeeper is a timely service, and a dedicated machine helps ensure timely responses. A dedicated machine is not required, however.
2. ZooKeeper works best with its transaction log and snapshots on different disk drives.
3. If ZooKeeper is co-located with Solr, using separate disks for each helps performance.
5. Managing collections through the Collections API
You can manage collections via the Collections API. Under the hood, this API generally uses the CoreAdmin API to manage the SolrCores on each server asynchronously; it is essentially a convenience that saves you from sending individual CoreAdmin commands, with the appropriate action parameter, to each server yourself.
① Creating a collection
http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=4
Related parameters:
name: the name of the collection to be created
numShards: the number of logical shards the collection is split into when it is created
replicationFactor: the number of copies of each document (replicas of each shard). A replicationFactor of 3 means each logical shard will have 3 replicas. Note: in Solr 4.0, replicationFactor was the number of *additional* replicas, not the total number of replicas.
maxShardsPerNode: a CREATE operation spreads numShards*replicationFactor shard replicas fairly across your live Solr nodes, and never places two replicas of the same shard on the same Solr node. If a Solr node is not live when the create operation runs, it will not receive any part of the new collection. This parameter prevents creating too many replicas on the same Solr node; the default value is 1. If its value, given the number of live Solr nodes, does not allow all numShards*replicationFactor replicas of the whole collection to be placed, nothing will be created.
createNodeSet: if this parameter is not provided, the CREATE operation spreads the shard replicas across all live Solr nodes. Provide this parameter to restrict the set of nodes used to create the shard replicas. The format of the value is: "<node-name1>,<node-name2>,...,<node-nameN>"
Example: createNodeSet=localhost:8983_solr,localhost:8984_solr,localhost:8985_solr
collection.configName: the name of the configuration to use for the new collection. If you do not provide this parameter, the collection name will be used as the configuration name.
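The CREATE call can also be issued from Java by sending a generic SolrJ request to the /admin/collections path of any live node. A minimal sketch, reusing the parameter values from the example URL above:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class CreateCollectionExample {
    public static void main(String[] args) throws Exception {
        // Any live node of the cluster can receive Collections API requests.
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "CREATE");
        params.set("name", "mycollection");
        params.set("numShards", 3);
        params.set("replicationFactor", 4);

        // The Collections API lives under /admin/collections rather than under a core.
        QueryRequest request = new QueryRequest(params);
        request.setPath("/admin/collections");
        server.request(request);
        server.shutdown();
    }
}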
Solr 4.2 adds collection aliases to the Collections API (the CREATEALIAS action).
Related parameters:
name: the name of the collection alias to be created
collections: a comma-delimited list of one or more collections the alias points to
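For example, an alias covering two collections could be created with a request of roughly this form (the alias and collection names here are only illustrative):
http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=myalias&collections=collection1,mycollection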
② Deleting a collection
http://localhost:8983/solr/admin/collections?action=DELETE&name=mycollection
Related parameters:
name: the name of the collection to be deleted