This article is in three parts: Part I covers general SOLR tuning, Part II covers targeted tuning, and Part III covers SOLR queries. The general advice applies broadly; the targeted advice is more situational. In every case, adjust the parameters and compare performance against the characteristics of your specific application: the application must be controlled as a whole, and all of these factors work together.
Part I: SOLR General Tuning
Original article: http://wiki.apache.org/solr/SolrPerformanceFactors
Schema Design Considerations

Indexed fields
The number of indexed fields affects, among other things, memory use during indexing, segment merge time, optimization time, and index size.
This impact can be reduced by setting omitNorms="true" on fields that need neither length normalization nor index-time boosts.
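A minimal sketch of what this looks like in schema.xml (the field names are illustrative, not from the original):

<!-- schema.xml: norms are dropped for fields that need neither
     length normalization nor index-time boosts -->
<field name="title" type="text"   indexed="true" stored="true"  omitNorms="true"/>
<field name="tag"   type="string" indexed="true" stored="false" omitNorms="true"/>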
Stored fields
Retrieving stored fields is a real cost, and the overhead grows with the number of bytes stored per document. The more space each document occupies, the more sparsely documents sit on disk, and the more I/O is needed to read them back (this typically matters when you store large fields, such as the full body of an article).
Consider keeping large fields outside of SOLR. If that feels awkward, consider compressed fields instead: they increase CPU load when the field is stored and read, but in exchange reduce the I/O burden.
If you do not need the stored fields on every request, enable lazy loading of stored fields; this can save a lot of work, especially when compressed fields are in use.
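Lazy loading is switched on in the <query> section of solrconfig.xml; a minimal sketch:

<!-- solrconfig.xml: stored fields are read from disk only when a request
     actually asks for them -->
<query>
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>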
Configuration Considerations

mergeFactor
The merge factor roughly determines the number of segments in the index.
The mergeFactor value tells Lucene how many segments to accumulate before merging them into one; it behaves like the base of a number system.
For example, with a merge factor of 10, a new segment is created for every 1,000 documents added to the index. When the tenth segment of 1,000 documents is added, all ten are merged into one segment of 10,000 documents. When ten segments of 10,000 documents have accumulated, they are merged into one segment of 100,000 documents, and so on.
This value is set in solrconfig.xml, in the mainIndex section (do not set it in indexDefaults).
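A sketch of the relevant solrconfig.xml fragment (10 is the customary default value):

<!-- solrconfig.xml: the merge factor goes in mainIndex, not indexDefaults -->
<mainIndex>
  <mergeFactor>10</mergeFactor>
</mainIndex>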
mergeFactor tradeoffs
A higher merge factor improves indexing speed, but merges happen less often, so more index files accumulate, which reduces search efficiency.
A lower merge factor yields fewer index files, which speeds up searching, but merges happen more often, which slows down indexing.
HashDocSet maxSize Considerations
HashDocSet is an optimization option in solrconfig.xml, used for filters (docSets). Smaller sets mean less memory consumption and faster traversal and insertion.
The right HashDocSet value ultimately depends on the total number of indexed documents: the larger the index, the larger the value should be.
Calculate 0.005 of the total number of documents you are going to store. Try values on either side of that number to arrive at the best query times. When query times plateau and performance shows little difference between the higher and the lower value, use the higher one.
Note: HashDocSet is no longer part of SOLR as of version 1.4.0; see SOLR-1169.
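For the older versions that still support it, the option looks roughly like this in solrconfig.xml; the maxSize below follows the 0.005 rule for an index of about 600,000 documents, so the numbers are illustrative only:

<!-- solrconfig.xml (Solr 1.3 and earlier): doc sets smaller than maxSize
     are stored as hash sets instead of bitsets -->
<HashDocSet maxSize="3000" loadFactor="0.75"/>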
Cache autowarmCount Considerations
When a new searcher is opened, its caches can be prewarmed ("autowarmed") with data from the old searcher's caches. autowarmCount is the number of objects copied from the old cache into the new one, so it directly affects autowarm time. A compromise is sometimes needed between searcher start-up time and how well the caches are warmed: the warmer the caches, the longer warming takes, and we usually do not want searcher start-up to take too long. The autowarm parameters are set in the solrconfig.xml file.
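As a sketch, autowarmCount is set per cache in solrconfig.xml (sizes and counts here are illustrative):

<!-- solrconfig.xml: when a new searcher opens, the most recently used
     entries are copied over from the old searcher's caches -->
<query>
  <filterCache      class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
</query>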
See the SOLR wiki for the detailed configuration.
Cache Hit Rate
You can inspect cache statistics in the SOLR admin interface. Increasing the size of SOLR's caches is often a shortcut to better performance. If you use faceted search, pay particular attention to the filterCache, a cache implemented by SOLR.
For more information, see the SolrCaching wiki page.
Explicit Warming of Sort Fields
If you sort on many fields, add queries that exercise those sorts to the "newSearcher" and "firstSearcher" event listeners, so that the FieldCache is populated during warming; see the sketch below.
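A sketch of such listeners in solrconfig.xml; the sort fields price and date are assumptions, substitute the fields your application actually sorts on:

<!-- solrconfig.xml: run sorted queries when a searcher opens so the
     FieldCache for the sort fields is filled before real traffic arrives -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">solr</str><str name="sort">price asc</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">solr</str><str name="sort">date desc</str></lst>
  </arr>
</listener>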
Optimization Considerations
Optimizing the index is a common step: for example, once an index has been built and will not change again, it is worth running an optimize.
However, if the index changes frequently, weigh the following factors.
As more and more segments are added to the index, query performance degrades. Lucene limits the number of segments, and merges them automatically into one once the limit is exceeded.
With caching disabled in both cases, an unoptimized index performs about 10% worse than an optimized one.
Autowarming takes longer, because it relies on running searches.
Optimization affects distribution (replication) of the index.
During optimization, the files temporarily occupy up to twice the index size on disk, but eventually shrink back to the original size, or smaller.
Optimization merges all segments into a single segment, so it also helps avoid the "too many open files" problem raised by the file system.
Updates and Commit Frequency Tradeoffs
If a slave pulls updates from the master too frequently, its performance suffers. To avoid that, you need to understand how the slave performs updates, so the relevant parameters (commit frequency, snappullers, autowarming/autowarmCount) can be tuned precisely and slave updates kept from becoming too frequent.
Performing a commit causes SOLR to take a new snapshot. If the postCommit parameter is set to true, an optimize also produces a snapshot.
The snappuller program on the slave generally runs from crontab; it asks the master whether a newer snapshot exists, and once it finds one, downloads it and runs snapinstaller.
Each time a new searcher is opened, its caches are prewarmed; the new index is not put into service until warming completes.
Three relevant parameters are discussed here:
number/frequency of snapshots: how often snapshots are created.
snappullers run from crontab; they can run every second, once a day, or at any other interval. On each run, a snappuller downloads only the latest version that is not already on the slave.
Cache autowarming, configured in the solrconfig.xml file.
If the effect you want is a frequently updated index on the slave, one that behaves like a "live index", then snapshots must be taken as often as possible and the snappuller must run frequently. Done this way, updates every five minutes are achievable with good performance. The cache hit rate is critical here, and the cache warming time in turn limits how frequent the updates can be.
Caches matter a great deal for performance. On the one hand, the new cache must hold enough entries that subsequent queries benefit from it. On the other hand, warming the cache can take a long time, especially since it runs on a single thread and a single CPU. If snapinstaller runs too often, the slave ends up in a degraded state: it is still warming one new cache when an even newer searcher is opened.
One way to handle this is to cancel the first searcher and warm the newer, second one instead. But then the second may never be used before a third arrives, and so on: a vicious cycle in which the caches never actually pay off. It is even possible that the moment warming finishes, a new round of warming starts, so the caching contributes nothing at all. When this happens, reducing the snapshot frequency is the hard truth.
Query Response Compression
In some cases it is worth compressing SOLR's XML response before sending it: if responses are large, they can run into the NIC I/O limit.
Compression adds CPU load, and SOLR is a service whose speed typically depends on the CPU, so adding compression will reduce query performance somewhat. However, the compressed data is around one sixth the size of the uncompressed response, at a cost of roughly 15% in SOLR query performance.
How to enable this depends on the server you deploy SOLR on; consult the relevant documentation.
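As one example, with Tomcat the HTTP connector can gzip responses; a sketch with illustrative values (attribute names as in older Tomcat releases):

<!-- Tomcat server.xml: compress XML responses larger than 2 KB -->
<Connector port="8080" protocol="HTTP/1.1"
           compression="on"
           compressionMinSize="2048"
           compressableMimeType="text/xml,application/xml,text/plain"/>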
Embedded vs HTTP Post
Building the index through the embedded interface is about 50% faster than posting documents in XML format over HTTP.
RAM Usage Considerations

OutOfMemoryErrors

If your SOLR instance is not given enough memory, the Java virtual machine may throw an OutOfMemoryError. This does not damage the index data, but while the condition persists no add/delete/commit operation can succeed.
Memory allocated to the Java VM
The simplest fix, assuming the machine still has unused memory, is to increase the heap of the JVM running SOLR.
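For the Jetty-based example server that ships with SOLR, this is just a matter of JVM flags; the heap sizes below are illustrative:

java -Xms512m -Xmx1024m -jar start.jar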
Factors Affecting RAM Usage
You might also consider reducing the amount of memory SOLR uses. One factor is the size of the input documents: when an add is performed via XML, two limits apply.
The fields of a document are held in memory while it is indexed; the maxFieldLength setting, which caps the number of tokens indexed per field, may help here (see the sketch below).
Each additional field also increases memory use.
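maxFieldLength lives in solrconfig.xml; a sketch (10,000 is the traditional default):

<!-- solrconfig.xml: index at most this many tokens per field -->
<indexDefaults>
  <maxFieldLength>10000</maxFieldLength>
</indexDefaults>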
Part II: SOLR Targeted Tuning
1. Multi-core
With multiple cores, switching many cores at the same time puts too much pressure on memory and CPU. The SOLR code can be extended to limit the number of core switches that execute concurrently, guaranteeing that high-load or high-CPU incidents do not occur.
2. High-availability applications
Run no fewer than 2 working nodes, and ideally place the 2 nodes on different machines.
For offline/online switching: if the data volume is small, indexing and searching can share a node; if the data volume exceeds roughly 50 million documents, it is recommended to run indexing offline, or on a node other than the search node.
3. Cache parameter configuration
If updates are frequent, commits and reopens are frequent too; in that case, close the caches if at all possible.
If performance depends heavily on cache hits: with no facet requirement, it is best to turn cache warming off; with facet requirements that rely heavily on the fieldValueCache, turn cache warming on.
With real-time updates the documentCache hit rate is usually low, so that cache can be left disabled. A configuration sketch follows.
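A sketch of the corresponding solrconfig.xml choices (sizes are illustrative): setting autowarmCount to 0 turns warming off, and commenting a cache element out disables that cache entirely.

<!-- solrconfig.xml for a frequently committed index: no autowarming;
     remove or comment out a cache element to disable it completely -->
<query>
  <filterCache      class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <!-- <documentCache class="solr.LRUCache" size="512" initialSize="512"/> -->
</query>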
4. reopen and commit
If possible, keep the primary on-disk index out of segment merging, and write new index segments to separate directories, so that a reopen leaves the main index unchanged.
Make commit and reopen asynchronous.
5. For data that never changes, consider an in-memory cache or a local cache to balance performance against space overhead, while also avoiding full GC.
6. Compress intermediate variables; use single instances
Throughout querying and indexing, create as few objects as possible; change object values through setters and reuse single instances to improve performance. For larger intermediate variables, apply integer compression where possible.
7. Redefine object representations
For objects such as dates, regions, URLs, and bytes, consider difference (delta) encoding, location codes, storing only the needed parts, compression, and similar structures, so that memory overhead drops, memory utilization improves, and performance benefits indirectly.
8. Separate index and store
Let the index deliver query performance, and let the store deliver storage and response performance.
That is, do not put all content into the index; set stored="false" on fields wherever possible, as in the sketch below.
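A sketch in schema.xml, with illustrative field names; the full content would then be served from an external store keyed by the document id:

<!-- schema.xml: body is searchable but not stored; retrieval of the
     original content goes to an external store -->
<field name="id"   type="string" indexed="true" stored="true"/>
<field name="body" type="text"   indexed="true" stored="false"/>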
9. Use the latest version of SOLR and Lucene
10. Share tokenizer instances
For custom tokenizers (analyzers), be sure to use a singleton. Do not create a new tokenizer object per document.
Part III: SOLR Queries
1. Sorting on a specified field
At display time, for count-like fields, show only the last one or three months of data; for fields such as price, this helps prevent cheating.
When dumping data or building the index, check values against upper and lower bounds to catch numbers that are well-formed in themselves but unreasonable in practice.
2. Sort variability
The default ordering must have its own tunable parameters and balance the competing requirements.
Rankings should change over time, but without large fluctuations. The ranking details need not be public, but the ranking results must be explainable.
3. Online and offline
Some scoring can be computed offline and some online; it depends on the requirements.
4. Multi-field queries
If you query multiple fields by default, consider merging those fields into a single field and querying that one field instead; see the sketch below.
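In SOLR the standard way to merge fields is copyField in schema.xml; a sketch with illustrative names:

<!-- schema.xml: title and body are copied into one catch-all field,
     so the default search targets a single field instead of several -->
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="text"/>
<copyField source="body"  dest="text"/>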
5. Highlighting
Highlighting can be performed inside or outside SOLR; it does not have to run in SOLR.
Similarly, tokenization can be executed online, while the dump performs only simple whitespace tokenization.
6. Statistics
Facet statistics can combine offline and online computation; they need not depend entirely on online counting.
7. Active search
Query strings for active search must be handled strictly: invalid query strings must be rejected, and valid ones appropriately expanded.
Keep the query path clear, and define the handling for hit=0 (zero-result) cases.