Author: caocao (Web hermit), http://www.caocao.name, http://www.caocao.mobi
When reprinting, please cite the source: http://www.javaeye.com/topic/78884
I. Origin
Lucene's search performance degrades badly once the index files grow past the gigabyte mark: any search takes 0.x seconds. With single-threaded searching the performance is still acceptable, since results always come back in 0.x seconds.
With Web-based multi-threaded access, however, Lucene's internal mechanism loads large amounts of data into memory and discards it immediately after use. This triggers frequent JVM garbage collection, and performance becomes extremely poor.
Long-lived connections pile up everywhere. This is the Lucene bottleneck that everyone complains about. Is there a solution?
II. Ideas
Let's look at how Google and Baidu search behave. The general impression is that keywords with many results take less time to search and keywords with few results take more time. When there are many results, the page says "about *** results".
The hermit's guess is that Google and Baidu find the first n results and then stop scanning the index, and that the total number of results is inferred from those first n results. Baidu's limit on how far you can page through results partially confirms this.
Now look at Lucene: its Hits.length() always returns an exact count. If Lucene could also return an approximate count, it could handle an index easily even at 10 GB.
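The guess above, inferring a total from the first n results, could be sketched roughly as follows. This is only one possible extrapolation, and it rests on a loud assumption: matching documents are spread evenly across the index. The source does not say how Google, Baidu, or the hermit's own program compute the estimate, and `HitEstimateSketch`, `estimateTotal`, and its parameters are hypothetical names, not Lucene API.

```java
/** Hypothetical helper for illustration only; not part of Lucene. */
final class HitEstimateSketch {

    /**
     * Extrapolates a total hit count from a partial scan, ASSUMING that
     * matches are spread evenly over doc ids 0..maxDoc-1.
     *
     * @param collected hits found before the scan stopped
     * @param lastDoc   doc id of the last hit that was collected
     * @param maxDoc    number of documents in the index
     * @param exhausted true if the scan reached the end of the index
     */
    static long estimateTotal(int collected, int lastDoc, int maxDoc, boolean exhausted) {
        if (exhausted || collected == 0) {
            return collected; // nothing to extrapolate: the count is exact
        }
        // Scale the partial count by the inverse of the fraction scanned.
        return Math.round((double) collected * maxDoc / (lastDoc + 1));
    }

    public static void main(String[] args) {
        // 1,000 hits found within the first 50,000 of 1,000,000 documents.
        long estimate = estimateTotal(1000, 49999, 1000000, false);
        System.out.println("about " + estimate + " results");
    }
}
```

If matches cluster in one part of the index, an estimate like this can be off by a large factor, which is why the count can only be presented as "about *** results".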
III. Exploration
The hermit searched far and wide for experts who had tackled this problem, but unfortunately found no prior work. Perhaps the hermit did not look hard enough; if a similar solution already exists, please do not hesitate to point it out.
With no other option, the hermit studied the Lucene 2.1.0 source code in detail and prepared to reinvent the wheel.
In most search applications the Query ends up as a BooleanQuery, so the hermit started there. Tracing the code all the way down, one method in BooleanScorer2 caught the hermit's attention. The code is as follows:
Java code
    public void score(HitCollector hc) throws IOException {
        // Build the combined scorer lazily on first use.
        if (countingSumScorer == null) {
            initCountingSumScorer();
        }
        // Visit every document that satisfies the boolean query.
        while (countingSumScorer.next()) {
            hc.collect(countingSumScorer.doc(), score());
        }
    }
Embedding logging code in the while loop confirms that the number of iterations equals the size of the result set. countingSumScorer.next() finds the next document that satisfies the boolean rules. Each document found is passed to the HitCollector, and what the HitCollector gathers ends up in the familiar Hits.
If a break is embedded in this while loop so that it exits after a certain number of iterations, the performance improvement is considerable. The change is simple and the gain is large. The side effect is that the results are no longer exact, which can be compensated for by adjusting the business model and logic. All in all, this is an effective way to improve Lucene performance.
It is precisely this break that makes keywords with large result sets exit early, so their search time is short. Keywords with small result sets inevitably scan the entire index, so their search time is longer.
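The early exit described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the hermit's actual patch and not Lucene's API: `DocIterator`, `Collector`, `over`, `scoreWithLimit`, and `maxHits` are hypothetical names that only mimic the shape of the loop in BooleanScorer2.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins for illustration; these are NOT Lucene classes. */
final class EarlyStopSketch {

    /** Mimics the next()/doc() shape of the scorer used in the loop. */
    interface DocIterator {
        boolean next(); // advance to the next matching document
        int doc();      // id of the current matching document
    }

    /** Plays the role of HitCollector: receives each matching document. */
    interface Collector {
        void collect(int doc);
    }

    /** Iterates over a fixed array of matching doc ids. */
    static DocIterator over(final int[] matchingDocs) {
        return new DocIterator() {
            private int pos = -1;
            public boolean next() { return ++pos < matchingDocs.length; }
            public int doc() { return matchingDocs[pos]; }
        };
    }

    /**
     * The same kind of loop, plus a break after maxHits documents
     * (maxHits is assumed to be at least 1).
     * Returns the number of documents actually collected.
     */
    static int scoreWithLimit(DocIterator it, Collector hc, int maxHits) {
        int collected = 0;
        while (it.next()) {
            hc.collect(it.doc());
            if (++collected >= maxHits) {
                break; // stop scanning the index early
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        List<Integer> hits = new ArrayList<>();
        int n = scoreWithLimit(over(new int[] {2, 5, 9, 14, 20}), doc -> hits.add(doc), 3);
        System.out.println(n + " hits collected: " + hits);
    }
}
```

When the iterator runs out before `maxHits` is reached, the loop ends on its own and the count is exact; only the capped case yields an approximate result.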
IV. Effect
Because embedding the code is an extremely tedious process, the hermit will explain it in detail in the second part. Here, let's start with the big picture.
After much hardship, the hermit finally finished the program. The effect can be seen in the video search at http://so.mdbchina.com/video/%e7%be%8e%e5%a5%b3.
The keyword "beauty" matches 180,000 videos and used to take 0.5 seconds on average to return results. With the new algorithm the results come back near-instantly, and they are good enough. Although there is a large gap between the estimated 85,000 results and the actual 180,000, this is acceptable because the difference stays within a factor of 2 to 3.
By the nature of the algorithm, hc.collect inside the while loop always completes in constant time, and the number of iterations is at most a constant. The time complexity of the algorithm therefore depends only on the complexity of the BooleanQuery, the size of the index file, and the density with which matching documents are distributed in the index file.
These factors determine how many checks countingSumScorer.next() performs and how many times the index is read.
countingSumScorer.next() is the time-consuming part of the whole algorithm.
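The effect of hit density on the work done before the cap is reached can be illustrated with a toy model. It is deliberately simplified and is an assumption for illustration only: real Lucene walks posting lists rather than testing every doc id, and `ScanCostSketch`, `docsExamined`, and `stride` are hypothetical names. The count of documents examined stands in for the work done inside countingSumScorer.next().

```java
/** Toy cost model for illustration only; not how Lucene reads its index. */
final class ScanCostSketch {

    /**
     * Counts how many documents a linear scan examines before it has
     * collected maxHits matches, when every stride-th document matches.
     */
    static int docsExamined(int maxDoc, int stride, int maxHits) {
        int examined = 0;
        int collected = 0;
        for (int doc = 0; doc < maxDoc && collected < maxHits; doc++) {
            examined++;
            if (doc % stride == 0) {
                collected++; // this document matches the query
            }
        }
        return examined;
    }

    public static void main(String[] args) {
        int maxDoc = 1000000;
        // Dense term: every 2nd document matches, so the cap is hit early.
        System.out.println("dense term:  " + docsExamined(maxDoc, 2, 1000));
        // Sparse term: only 200 matches exist, so the whole index is scanned.
        System.out.println("sparse term: " + docsExamined(maxDoc, 5000, 1000));
    }
}
```

In this model the dense keyword stops after about two thousand documents, while the sparse keyword never reaches the cap and touches all one million, matching the observation that keywords with few results take longer.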
The index file for this video search is now close to 3 GB, and popular keywords return results near-instantly. The hermit believes that even if the index file grows to 10 GB in the future, results will still come back just as quickly.
(Note: in actual use, the performance of this video search is somewhat degraded because the background indexing currently runs on the same machine. It will be moved to separate servers in the future; for now they share one machine.)
Well, I did not expect to write this much. The next part (http://www.javaeye.com/topic/80073) will present the code, so stay tuned.