First, search optimization:
In engineering, the problems that look simple and clear-cut often turn out to be the hardest to solve. A near real-time search engine has only one problem to solve: performance. That covers fast indexing, fast searching, and a short delay between a document entering the index and becoming searchable. The following summarizes practical experience with a near real-time search engine over fast-changing data at the scale of millions of records:
1. Technical optimization
1.1 Numeric search optimization: narrow the range of values. Use an int rather than a long where an int will do, and a float rather than a double; where a string can stand in for a number, use the string, and avoid range queries (especially wide ones). These rules follow from how Lucene indexes numeric values and how it executes range queries.
1.2 Balance a simple search syntax against support for advanced search: be cautious with the wildcards "*" and "?", and prohibit leading-wildcard queries such as "*AA". If such queries must be supported, tell users about the performance problems that "*" can cause.
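The wildcard rule in 1.2 can be enforced before a query ever reaches the engine. The following is a minimal sketch, assuming queries arrive as whitespace-separated terms with an optional "field:" prefix; the function name `review_query` and the reject/warn split are hypothetical and are not part of Lucene.

```python
def review_query(query):
    """Split query terms into those to reject and those to warn about.

    Hypothetical pre-check, not a Lucene API: terms that start with a
    wildcard are rejected, terms that contain one elsewhere are flagged.
    """
    rejected, warned = [], []
    for term in query.split():
        # Drop an optional "field:" prefix so "title:*AA" is checked as "*AA".
        text = term.split(":", 1)[-1]
        if text[:1] in ("*", "?"):
            rejected.append(term)   # leading wildcard, e.g. "*AA"
        elif "*" in text or "?" in text:
            warned.append(term)     # allowed, but may be slow
    return rejected, warned


rejected, warned = review_query("title:*AA name:ab* plain")
print(rejected)  # terms the front end should refuse
print(warned)    # terms the front end should warn the user about
```

A check like this keeps the policy in one place, so the user-facing warning about "*" can be shown at the moment the query is entered.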
2. Business optimization
2.1 Range queries deserve special attention and must be optimized. Avoid numeric queries over a wide range; where a range query is unavoidable, keep the range as narrow as the business allows. For example, a query such as A > 0 should be rewritten as a more meaningful bounded query such as [A: 0-100].
2.2 Where a short string search (especially on a string field that is not tokenized) can replace a numeric search, use the string and not the number. Searching for one specific numeric value is fast, but a numeric range query has a cost, and used carelessly that cost becomes very large.
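The two business rules can be illustrated with a small sketch. It assumes non-negative integer values, a business-defined upper bound of 100 (taken from the [A: 0-100] example), and a bucket width of 10. The helper names and the "b3"-style tokens are hypothetical; they show one possible way to replace a numeric range with short untokenized strings, not the method the original system used.

```python
def clamp_range(lower, upper=None, business_max=100):
    """Turn an open-ended range such as A > 0 into a bounded one like [0, 100]."""
    if upper is None or upper > business_max:
        upper = business_max
    return lower, upper


def bucket_token(value, width=10):
    """Map a number to a short keyword token to index, e.g. 37 -> 'b3'."""
    return f"b{value // width}"


def range_as_tokens(lower, upper, width=10):
    """Expand a bounded range into the keyword tokens that cover it.

    The tokens cover whole buckets, so the result is a superset of the
    exact range; exact edges would need a filter on the stored value.
    """
    return [f"b{i}" for i in range(lower // width, upper // width + 1)]


lower, upper = clamp_range(0)           # "A > 0" becomes (0, 100)
print(range_as_tokens(lower, upper))    # a short list of exact-match terms
```

At index time each document would store `bucket_token(value)` in an untokenized string field; at query time the range turns into a handful of exact term lookups.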
Second, index optimization:
A good search engine balances fast index creation against fast searching. For a near real-time search engine this balance is especially hard to achieve. A series of tests and validations showed that it is not easy for Lucene to balance "search", "index", and "optimize". Under frequent searching and indexing, online optimization rarely gets to take effect. The priorities can be understood as: Search > Index > Optimize. A search takes the least time and an optimize takes the longest. While the first two operations run frequently, optimize never gets a chance (forcing it only pauses searching and indexing, which is unacceptable for a near real-time system). The relevant parameters must therefore be set, mainly: cache size, index memory size, the maximum size of a single index commit (roughly equivalent to the merge factor, which Lucene does not necessarily enforce strictly), and the maximum number of concurrent searches.
Most index optimizations follow from the search business, as described above, for example using string fields instead of numeric range queries. How the index is built directly affects search, and the merge factor setting is critical for Lucene (given enough memory). Put simply, a small merge factor makes indexing slower, although the effect on bulk indexing is small as long as the index memory is large enough, and it keeps the number of index segments within a reasonable range, which directly determines search performance. Conversely, a large merge factor makes indexing fast but leaves many segments, and the effect on search performance is fatal if the index is not optimized in time.
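The link between merge factor and segment count can be seen in a toy model. The sketch below assumes a deliberately simplified rule: every flush writes one new segment, and as soon as a level holds `merge_factor` segments they are merged into one segment at the next level. Lucene's real merge policies are more elaborate, so the numbers only illustrate the trend.

```python
def segments_after(flushes, merge_factor):
    """Count segments left after `flushes` flushes under the simplified rule."""
    levels = []  # levels[i] = number of segments currently at level i
    for _ in range(flushes):
        if not levels:
            levels.append(0)
        levels[0] += 1  # each flush writes one new level-0 segment
        level = 0
        # A full level is merged into a single segment one level up,
        # which may in turn fill that level and cascade further.
        while levels[level] == merge_factor:
            levels[level] = 0
            if level + 1 == len(levels):
                levels.append(0)
            levels[level + 1] += 1
            level += 1
    return sum(levels)


for factor in (2, 10, 50):
    print(factor, segments_after(999, factor))
```

In this model a small merge factor merges often (more indexing work) and leaves few segments to search, while a large one merges rarely and leaves many.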
Third, distributed balance:
A single Lucene server can hardly deliver near real-time search unless the search volume is very small. Once a single server handles more than about 5 searches per second, Lucene-based real-time search becomes almost unsustainable, not because of searching itself, but because the server must keep indexing while keeping search real-time. Indexing directly affects search if it produces too many segments. In short, a server needs idle time to build the index; that is, while indexing is under way, searches on that server must not be too frequent. Distributing the search load across servers is therefore essential.
Summary:
1. There are currently 3 Lucene-based PC servers, with peak concurrency of about 15 searches per second.
2. The data volume is in the millions (fewer than 2 million records), with 80 fields of about 200 characters each (Chinese, English, and digits).
3. Searches typically contain 7-20 keywords, and the average search time is 98 ms.
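One way to spread search pressure is to route each search away from servers that are busy indexing. The following is a minimal sketch, assuming a front end that knows which servers are currently indexing; the `SearchRouter` class and the server names are hypothetical and do not describe the original deployment.

```python
from itertools import cycle


class SearchRouter:
    """Round-robin over servers, skipping any that are currently indexing."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.indexing = set()           # servers busy building the index
        self._ring = cycle(self.servers)

    def pick(self):
        # Try each server at most once per call.
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server not in self.indexing:
                return server
        # Every server is indexing: fall back to plain round-robin.
        return next(self._ring)


router = SearchRouter(["s1", "s2", "s3"])
router.indexing.add("s2")               # s2 is building its index
print([router.pick() for _ in range(4)])
```

With the figures in the summary, a peak of about 15 searches per second over 3 servers works out to about 5 per server, which is the per-server level the text describes as the practical limit.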
Summary of near real-time search engine optimization based on Lucene