Apache SOLR Scaling

Tags: apache solr, solr, dedicated server, haproxy
Document directory

  • JVM Configuration
  • HTTP Cache
  • SOLR Cache
  • Better Schema Design
  • Index Policy
  • Script vs Java Replication
  • Getting started with multiple SOLR servers
  • Distributed search across multiple slave instances
  • Distribute search requests to different slaves
  • Sharding indexes
  • Joint use of replication and sharding (scale deep)

As your index grows, you will find that search response times slow down and indexing new content takes longer and longer. When that happens it is time to make some changes. Fortunately, SOLR anticipates these situations well, and often you only need to change your configuration.

The scaling of SOLR is described in the following three aspects:

• Tune a single SOLR server (scale high)

Optimize a single SOLR instance through caching and memory management. Deploy SOLR on a dedicated server with fast CPUs and fast hardware, and squeeze the best performance out of one machine.

• Use multiple SOLR servers (scale wide)

Use multiple SOLR servers. If your avgTimePerRequest is within an acceptable range (the data volume is usually up to several million documents), you can replicate the index on your master to slave machines through configuration. If individual queries are slow, distribute the load of a single query across multiple SOLR servers using shards.

• Use replication and sharding (scale deep)

When your data volume is large enough that you need both replication and sharding, each shard corresponds to one master and several slaves. This is the most complex architecture.

We will watch three performance metrics:

• TPS (transactions per second). You can view http://localhost:8983/solr/mbtracks/admin/stats.jsp, or the avgTimePerRequest and avgRequestsPerSecond statistics of a requestHandler.

• CPU usage. On Windows you can use perfmon to obtain CPU usage information; on UNIX systems use top.

• Memory usage, which can be viewed with perfmon, top, or jconsole.

Next we will introduce scaling for SOLR.

 

Tune a SOLR server (scale high)

SOLR provides a series of optional configurations to enhance performance; how to use them depends on your application. The most common ones are described below.

JVM Configuration

SOLR runs on the JVM, so tuning the JVM directly affects SOLR performance. However, be careful when changing JVM parameters, because a small change can cause big problems.

JVM parameters can be specified at startup:

java -Xms512m -Xmx1024m -server -jar start.jar

Your -Xmx value should still leave enough memory for the operating system and the other processes running on the server. For example, if you have 4 GB of index files, a machine with 6 GB of RAM (and a generous cache configured) can achieve good performance.

In addition, if possible, use a recent version of Java, because the performance of newer Java virtual machines keeps improving.
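
A rough sketch of a fuller startup line: pinning -Xms to -Xmx avoids heap-resize pauses, and a concurrent collector can reduce GC stalls during searches. The collector flag below is an illustrative assumption, not a recommendation from this article:

java -Xms1024m -Xmx1024m -server -XX:+UseConcMarkSweepGC -jar start.jar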

 

HTTP Cache

Because many SOLR operations are performed over HTTP, SOLR has strong support for HTTP caching. If you want to use the HTTP cache, you need to configure solrconfig.xml as follows:

<httpCaching lastModifiedFrom="openTime" etagSeed="Solr" never304="false">
  <cacheControl>max-age=43200, must-revalidate</cacheControl>
</httpCaching>

By default, SOLR never returns a 304 Not Modified status to the client; it always returns 200 OK. The configuration above sets max-age to 43200 seconds. The following is an example:

>> curl -v 'http://localhost:8983/solr/mbartists/select/?q=smashing+pumpkins'

< HTTP/1.1 200 OK
< Cache-Control: max-age=43200
< Expires: Thu, 11 Jun 2009 15:02:00 GMT
< Last-Modified: Thu, 11 Jun 2009 02:55:39 GMT
< ETag: "YWFkZWIyNjVmODgwMDAwMFNvbHI="
< Content-Type: text/xml; charset=UTF-8
< Content-Length: 1488
< Server: Jetty(6.1.3)

Once the HTTP cache configuration takes effect, the client can also send an If-Modified-Since header; the server compares it against the index's last-modified time and returns the full data only if the index is newer:

> curl -v -z "Thu, 11 Jun 2009 02:55:40 GMT" 'http://localhost:8983/solr/mbartists/select/?q=smashing+pumpkins'

* About to connect() to localhost port 8983 (#0)
*   Trying ::1... connected
* Connected to localhost (::1) port 8983 (#0)
> GET /solr/mbartists/select/?q=smashing+pumpkins HTTP/1.1
> User-Agent: curl/7.16.3 (powerpc-apple-darwin9.0) libcurl/7.16.3 OpenSSL/0.9.7l zlib/1.2.3
> Host: localhost:8983
> Accept: */*
> If-Modified-Since: Thu, 11 Jun 2009 02:55:40 GMT
>
< HTTP/1.1 304 Not Modified
< Cache-Control: max-age=43200
< Expires: Thu, 11 Jun 2009 15:13:43 GMT
< Last-Modified: Thu, 11 Jun 2009 02:55:39 GMT
< ETag: "YWFkZWIyNjVmODgwMDAwMFNvbHI="
< Server: Jetty(6.1.3)

The entity tag (ETag) is a newer identification mechanism that is more robust and flexible than the Last-Modified date. An ETag is a string, and after the SOLR index is updated the current ETag changes accordingly.
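
For example, a conditional request can reuse the ETag from the response above via curl's -H flag (a minimal sketch; the server should answer 304 Not Modified as long as the index is unchanged):

curl -v -H 'If-None-Match: "YWFkZWIyNjVmODgwMDAwMFNvbHI="' 'http://localhost:8983/solr/mbartists/select/?q=smashing+pumpkins'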

 

SOLR Cache

SOLR uses an LRU algorithm to cache data in memory. Each cache is associated with an index searcher, which maintains a snapshot of the data. After a commit, a new index searcher is opened and auto-warmed. Auto-warming means that entries from the previous searcher's caches are copied to the new searcher. After that, the warming queries defined in solrconfig.xml are run: add typical queries (including the fields you sort on) to the newSearcher and firstSearcher listeners so that the new searcher can serve searches with warm caches.

Solr 1.4 uses FastLRUCache, which is faster than LRUCache because it does not need a separate thread to remove unused items.

On SOLR's statistics page you can see the size and hit rate of each cache, and adjust the cache sizes to fit your actual workload.
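
A minimal sketch of what such tuning looks like in solrconfig.xml (the cache sizes, the warming query, and the sort field name_sort are illustrative assumptions):

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- warm the caches for a sorted query before the new searcher goes live -->
    <lst><str name="q">*:*</str><str name="sort">name_sort asc</str></lst>
  </arr>
</listener>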

Better Schema Design

You need to consider whether each field really needs to be indexed, stored, and so on; this depends on your application. If you store large amounts of text in your index, consider the field's compressed option to compress it. If you do not always need to read all fields, enable lazy field loading in solrconfig.xml:

<enableLazyFieldLoading>true</enableLazyFieldLoading>

This can help considerably.

Note: if compressed is used, you may want to combine it with lazy field loading to reduce decompression costs. In addition, reducing the amount of text analysis will effectively improve performance, because text analysis consumes a lot of CPU time and greatly enlarges your index.
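
A minimal schema.xml sketch (the field name body is hypothetical; the compressed attribute applies to stored text/string fields in Solr 1.4):

<field name="body" type="text" indexed="true" stored="true" compressed="true"/>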

Index Policy

One way to accelerate indexing is batch indexing, which significantly improves performance. However, performance per batch declines as documents get larger. As a rule of thumb, send large documents in batches of 10 and small documents in batches of 100, and commit per batch.
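
For example, a batch can be posted as a single <add> request containing several documents (a sketch; the field names are hypothetical):

curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '
<add>
  <doc><field name="id">1</field><field name="name">first doc</field></doc>
  <doc><field name="id">2</field><field name="name">second doc</field></doc>
</add>'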

In addition, indexing with multiple threads improves performance further.
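
A minimal SolrJ sketch of multi-threaded indexing using Solr 1.4's StreamingUpdateSolrServer (the URL, the queue size of 20, and the 4 threads are illustrative assumptions):

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ThreadedIndexer {
  public static void main(String[] args) throws Exception {
    // buffers up to 20 documents and sends them on 4 parallel threads
    StreamingUpdateSolrServer server =
        new StreamingUpdateSolrServer("http://localhost:8983/solr", 20, 4);
    for (int i = 0; i < 1000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", Integer.toString(i)); // hypothetical schema fields
      doc.addField("name", "document " + i);
      server.add(doc);
    }
    server.commit(); // flush anything still buffered
  }
}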

 

Cancel the document uniqueness check (allowDups)

By default, SOLR checks during indexing whether primary keys are duplicated, to prevent different documents from using the same primary key. If you are certain your documents contain no duplicate primary keys, add allowDups=true to the URL to skip the check. For CSV documents, use overwrite=false instead.
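
For example, the CSV loader from the indexing example later in this article could skip the check like this (a sketch reusing that command's core and file):

curl 'http://localhost:8983/solr/mbreleases/update/csv?overwrite=false' -F stream.file=/root/examples/9/mb_releases.csv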

 

Commit/optimize factors

For large indexes with frequent updates, use a large mergeFactor; this setting determines how many segments Lucene lets accumulate before merging them.
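
In solrconfig.xml this is a one-line setting inside the index configuration (the value 25 is an illustrative assumption; the default is 10):

<mergeFactor>25</mergeFactor>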

 

Optimize performance using term vectors

Term vectors are the list of terms produced when a field passes through text analysis; they generally include the term frequency, document frequency, and position offsets within the text. Enabling them can improve the performance of MoreLikeThis and highlighting queries.

However, enabling term vectors increases the index size, and they may never even be used if your queries do not involve MoreLikeThis or highlighting.
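
A minimal schema.xml sketch (the field name content is hypothetical):

<field name="content" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>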

 

Improve phrase query performance

On large indexes, phrase queries can be slow because the individual words of a phrase may appear in many documents. One solution is to filter out common words such as "the" with a stop-word filter, but this makes searches less precise. A better solution is shingling, which splits text in an n-gram-like way: for example, "The quick brown fox jumped over the lazy dog" becomes "the quick", "quick brown", "brown fox", "fox jumped", "jumped over", "over the", "the lazy", "lazy dog". Rough tests show this can improve phrase query performance by at least 2-3 times.

 

Use multiple SOLR servers (scale wide)

When a single SOLR server still cannot meet your performance requirements, you should consider splitting query requests across different machines. Horizontal scaling (scale wide) is the most basic property of a scalable system.

Script vs Java Replication

Before Solr 1.4, replication was performed with UNIX scripts. In general this solution worked, but it could get complicated: you had to write shell scripts, cron jobs, and an rsync daemon.

Since 1.4, SOLR has included a Java-based replication implementation, which eliminates the complex shell scripts and runs faster.

Replication is configured in solrconfig.xml, and the configuration file itself can be replicated between master and slave servers. Replication now works on both UNIX and Windows and is integrated into the admin interface. The admin interface can control replication, for example forcing replication to start or aborting a failed replication. Replication is performed through a REST API provided by the ReplicationHandler.
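
For example, the handler's commands can be driven directly over HTTP (a sketch; master_host and slave_host stand in for real hostnames). The first call asks the master which index version it exposes; the second forces a slave to fetch the master's index immediately:

curl 'http://master_host:8983/solr/replication?command=indexversion'
curl 'http://slave_host:8983/solr/replication?command=fetchindex'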

 

 

Getting started with multiple SOLR servers

If you use the same solrconfig.xml file across multiple SOLR servers, you need to specify the following parameters at startup:

• -Dslave=disabled: specifies that the current SOLR server is a master. The master is responsible for serving index files to all slave servers. You index documents on the master and query them on the slaves.

• -Dmaster=disabled: specifies that the current SOLR server is a slave. A slave either polls the master periodically for index updates, or you trigger an update manually through the admin interface. A group of slaves, managed by a load balancer (such as HAProxy), serves external searches.

If you want to run multiple SOLR servers on the same machine, use -Djetty.port=8984 to specify a different port and -Dsolr.data.dir=./solr/data8984 to specify a different data directory.
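
Putting it together, a second instance on the same machine might be started like this (a sketch; whether it acts as master or slave comes from the -D flags described above):

java -Dmaster=disabled -Djetty.port=8984 -Dsolr.data.dir=./solr/data8984 -jar start.jar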

 

Configure Replication

It is easy to configure replication. A sample configuration is shown in ./examples/cores/mbreleases/solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="${master:master}">
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
    <str name="confFiles">stopwords.txt</str>
  </lst>
  <lst name="${slave:slave}">
    <str name="masterUrl">http://localhost:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>

Note that ${} allows values to be filled in at runtime: whether a block applies as master or slave is determined by passing -Dmaster=disabled or -Dslave=disabled. The master here is configured to replicate after startup and after each commit. You can also use the confFiles attribute to specify configuration files to replicate. Replicating configuration files is useful because you can modify the configuration at runtime without redeploying: modify a file on the master, and once it has been replicated, the slave notices the change and reloads its core.

See http://wiki.apache.org/solr/SolrReplication

 

Implementation of Replication

The master is unaware of the slaves. Each slave periodically polls the master to check the current index version; if a slave finds a new version, it starts the replication process. The procedure is as follows:

1. The slave issues a filelist command to collect the list of files. This command returns metadata for each file (size, lastmodified, alias, etc.).

2. The slave checks whether these files exist locally, then starts downloading the missing files (using the filecontent command). If a connection fails, the download ends; the slave retries five times and gives up if it still fails.

3. Files are downloaded into a temporary directory, so a download error does not affect the slave's working index.

4. The ReplicationHandler executes a commit, and the new index is loaded.

 

Distributed search across multiple slave instances

Indexing files into the master

You can use SSH to open two sessions: one to run the SOLR service, and another to index some files:

> curl http://localhost:8983/solr/mbreleases/update/csv \
  -F f.r_attributes.split=true -F f.r_event_country.split=true \
  -F f.r_event_date.split=true -F f.r_attributes.separator=' ' \
  -F f.r_event_country.separator=' ' -F f.r_event_date.separator=' ' \
  -F commit=true -F stream.file=/root/examples/9/mb_releases.csv

The preceding command indexes a CSV file. You can monitor the operation through the admin interface.

 

Configure the slaves

Once the files are indexed on the master, they need to reach the slaves. Connect to each slave machine over SSH and configure its masterUrl as follows:

<lst name="${slave:slave}">
  <str name="masterUrl">
    http://ec2-67-202-19-216.compute-1.amazonaws.com:8983/solr/mbreleases/replication
  </str>
  <str name="pollInterval">00:00:60</str>
</lst>

You can view the current replication status on the admin interface.

 

Distribute search requests to different slaves

Because there are multiple slave instances, clients have no single fixed URL to send requests to, so we need a load balancer; HAProxy is used here.

On the master machine, configure /etc/haproxy.cfg and list the URLs of your slave machines:

listen solr-balancer 0.0.0.0:80
       balance roundrobin
       option forwardfor
       server slave1 ec2-174-129-87-5.compute-1.amazonaws.com:8983 weight 1 maxconn 512 check
       server slave2 ec2-67-202-15-128.compute-1.amazonaws.com:8983 weight 1 maxconn 512 check

The solr-balancer listener listens on port 80 and redirects requests to each slave machine according to its weight. Run

> service haproxy start

to start HAProxy.

Of course, SolrJ also provides an API for load balancing. LBHttpSolrServer requires the client to know the addresses of all the slave machines, and it is not as robust as HAProxy because its implementation is very simple. See:

http://wiki.apache.org/solr/LBHttpSolrServer
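
A minimal SolrJ sketch (the slave URLs are reused from the HAProxy example above, and the query term from the sharding example below):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.LBHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class LoadBalancedSearch {
  public static void main(String[] args) throws Exception {
    // round-robins requests across the listed slaves
    LBHttpSolrServer server = new LBHttpSolrServer(
        "http://ec2-174-129-87-5.compute-1.amazonaws.com:8983/solr",
        "http://ec2-67-202-15-128.compute-1.amazonaws.com:8983/solr");
    QueryResponse rsp = server.query(new SolrQuery("r_a_name:Joplin"));
    System.out.println("hits: " + rsp.getResults().getNumFound());
  }
}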

Sharding indexes

Sharding is the general strategy of splitting a single database once it holds too much data. In SOLR, sharding takes one of two forms: splitting a single SOLR core across multiple SOLR servers, or splitting a single-core SOLR instance into multiple cores. SOLR can distribute a single query request across multiple shards and aggregate the results into a single result set for the caller.

 

Combine the capabilities of multiple SOLR servers when a query on a single server executes too slowly. If your queries are already fast enough and you just want to scale out to serve more users, we recommend replication with full indexes instead.

Sharding is not a completely transparent operation. The key constraint is that when you index a document, you need to decide which shard it belongs in; SOLR has no built-in logic for distributing indexed documents. When searching, you add the shards parameter to the URL to determine which shards results are collected from, which means the client needs to know SOLR's architecture. In addition, each document needs a unique ID, because you are splitting the data by rows and SOLR needs the ID to tell documents apart when merging results.

 

Distribute documents to shards

A good approach is to hash each document's ID and assign the document to a shard based on the hash modulo the number of shards:

shards = ['http://ec2-174-129-178-110.compute-1.amazonaws.com:8983/solr/mbreleases',
          'http://ec2-75-101-213-59.compute-1.amazonaws.com:8983/solr/mbreleases']

unique_id = document[:id]
if unique_id.hash % shards.size == local_thread_id
  # index to shard
end

In this way each document reliably finds its shard, as long as the set of shards does not change.

 

Searching across shards

This functionality is already built into the standard query request handler, so no additional configuration is needed. To search across two shards, you only need to add the relevant parameters to the query URL:

> http://[shard_1]:8983/solr/select?shards=ec2-174-129-178-110.compute-1.amazonaws.com:8983/solr/mbreleases,ec2-75-101-213-59.compute-1.amazonaws.com:8983/solr/mbreleases&indent=true&q=r_a_name:Joplin

Note that the URLs in the shards parameter do not include the http:// prefix, and you can list as many shards as you like, as long as the GET URL does not exceed its maximum length of about 4000 characters.

 

Note the following when using shards:

• Shards only support some components: query, faceting, highlighting, stats, and debug.

• Each document must have a unique ID. SOLR uses it to merge the search results.

• If multiple shards return documents with the same ID, the first one is kept and the rest are ignored.

 

Joint use of replication and sharding (scale deep)

If you have applied the previous methods and still find that performance cannot meet your requirements, it is time to combine the two into a higher-level architecture.

You configure a set of masters for the shards in the same way as before, so that you end up with a full tree of masters and slaves. You can even have a dedicated SOLR server that holds no index itself and is only responsible for distributing queries to the shards and returning the merged results to users.

Data is updated on the master machines and replicated to the slaves. The front end needs load-balancer support, but with this architecture SOLR can handle very large amounts of data.

Scaling SOLR beyond this has been discussed many times on the SOLR mailing list. Some work shows that Hadoop can provide a robust, reliable file system, and another interesting project is ZooKeeper, which manages distributed systems in a centralized way; there are ongoing efforts to integrate ZooKeeper with SOLR.

 

References:

  • Solr 1.4 Enterprise Search Server
  • http://wiki.apache.org/solr/ (SOLR wiki homepage)
