My latest project requires full-text retrieval, so I looked at several open-source .NET retrieval projects, such as Lucene.Net, Sphinx, and Hubble.Net. In the end I chose Lucene.Net for full-text search, for the following reasons:
1) Sphinx has excellent performance and integrates well with MySQL. However, we currently use SQL Server, so that advantage does not help us. The key issue is that Sphinx needs to load the entire index into memory; when the index is large, memory runs out. Since I found no good solution to this problem, I had to give it up.
2) Hubble.Net performs well and integrates very well with SQL Server, so we initially planned to use it. But we found that Hubble.Net's retrieval engine handles only the full-text part: if you want exact-match queries on integer, time, or keyword fields, you still have to fall back to SQL Server. I have heard that the author intends to bring integer, time, and other typed data into Hubble.Net's retrieval as well; for now we can only look forward to that.
3) Lucene.Net, which is what we currently use. Lucene.Net may be inferior to the other two in raw performance, but it at least meets our current needs, and it offers flexible retrieval methods, strong community support, and abundant documentation. That is why I decided on Lucene.Net.
If the Sphinx and Hubble.Net shortcomings I described have since been solved, or if there are workarounds, please let me know. Thank you.
The test server is an HP server running Windows Server 2008, with a 12-core CPU and 16 GB of memory.
The software stack is .NET 4.0 + SQL Server 2008 + Lucene.Net 3.0 + the PanGu word segmenter.
The data volume is about 23 million records, and the index is 30 GB.
Test method: a single-index (30 GB) performance test, a multi-index test (3 indexes of 10 GB each), and a multi-index parallel test (the same 3 indexes of 10 GB each).
Note: A single index is implemented through the IndexSearcher class.
Multi-index retrieval is implemented through MultiSearcher.
Multi-index parallel retrieval is implemented through ParallelMultiSearcher.
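The three setups above can be sketched roughly as follows (Lucene.Net 3.0 API; the index paths and the query field name are made-up placeholders):

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

class SearcherSetup
{
    static void Demo()
    {
        // Single index: one IndexSearcher over the whole 30 GB index.
        var single = new IndexSearcher(
            IndexReader.Open(FSDirectory.Open(@"D:\index\all"), true));

        // Multiple indexes: MultiSearcher queries the three 10 GB indexes in sequence.
        var parts = new IndexSearcher[]
        {
            new IndexSearcher(IndexReader.Open(FSDirectory.Open(@"D:\index\part1"), true)),
            new IndexSearcher(IndexReader.Open(FSDirectory.Open(@"D:\index\part2"), true)),
            new IndexSearcher(IndexReader.Open(FSDirectory.Open(@"D:\index\part3"), true)),
        };
        var multi = new MultiSearcher(parts);

        // Multi-index parallel: the same sub-indexes, searched on worker threads.
        var parallel = new ParallelMultiSearcher(parts);

        // All three expose the same Search API:
        var query = new TermQuery(new Term("content", "keyword"));
        TopDocs hits = parallel.Search(query, 100);
    }
}
```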
Test results:
25 consecutive searches in a single thread, averaged. (Different keywords were used each time, to avoid Lucene.Net's cache.)
Single index | Multiple indexes | Multi-index parallel
408 ms | 420 ms | 230 ms
Multiple threads (15), each searching 25 consecutive times, averaged. (Different keywords were used each time, to avoid Lucene.Net's cache.)
Single index | Multiple indexes | Multi-index parallel
1150 ms | 1000 ms | 900 ms
With more threads, CPU usage hits 100%, so retrieval is clearly CPU-intensive; to handle real concurrency you must add a cache. The results show that multi-index parallel retrieval is both the fastest and the most CPU-consuming.
The following is an example using Lucene.Net; the indexing and retrieval implementation is in the source code. If you have good ideas, feel free to discuss them. Below is the search page. I am not good at front-end work, so the page is ugly. Sorry.
Summary:
Having used Lucene.Net, I would like to share my personal experience regarding functionality and performance.
1) Lucene.Net is very CPU-intensive during retrieval, but it consumes very little memory (unless, of course, you use the in-memory mode).
2) Multi-index parallel retrieval with ParallelMultiSearcher performs better than a single index; we recommend using it.
3) Open the IndexSearcher object once and keep it open; do not open it frequently, as opening is time-consuming.
4) For near-real-time indexing, we recommend appending documents directly rather than merging indexes (merging is time-consuming), and merging later during idle time. Appended records cannot be retrieved until the IndexSearcher is reopened, and reopening itself takes a lot of time, so decide when to reopen the IndexSearcher based on your needs.
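A minimal sketch of this append-then-reopen pattern, assuming the Lucene.Net 3.0 API (the analyzer, paths, and class names are illustrative; error handling and locking are omitted):

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class NearRealTimeIndex
{
    IndexWriter writer;
    IndexReader reader;
    IndexSearcher searcher;

    public NearRealTimeIndex(string path)
    {
        var dir = FSDirectory.Open(new DirectoryInfo(path));
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                                 IndexWriter.MaxFieldLength.UNLIMITED);
        reader = IndexReader.Open(dir, true);
        searcher = new IndexSearcher(reader);
    }

    public void Append(Document doc)
    {
        // Append only; no Optimize() here -- merging is the expensive part
        // and can wait for idle hours.
        writer.AddDocument(doc);
        writer.Commit();
    }

    public void RefreshSearcher()
    {
        // Appended records stay invisible until the reader is reopened,
        // and reopening is costly -- call this on your own schedule,
        // not after every append.
        IndexReader newReader = reader.Reopen();
        if (newReader != reader)   // Reopen() returns the old instance if nothing changed
        {
            reader.Dispose();
            reader = newReader;
            searcher = new IndexSearcher(reader);
        }
    }

    public void MergeDuringIdleTime()
    {
        writer.Optimize();  // heavy segment merge, run off-peak
    }
}
```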
5) When searching keywords in Lucene.Net, a very long keyword is slow, because segmentation produces too many terms to search. After segmentation, it is recommended to filter out useless single-character words, since retrieving a single character is nearly meaningless and time-consuming. For example, the phrase 我们在希望的田野上 ("we are on the field of hope") segments into 我们/在/希望/的/田野/上; the single characters 在, 的, and 上 carry little meaning and should be filtered out to improve performance.
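This filtering is just a post-segmentation pass over the token list. A minimal self-contained sketch (the token array below stands in for PanGu's actual segmentation output):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TokenFilter
{
    // Drop single-character tokens left after word segmentation; querying
    // on them is nearly meaningless and slows the search down.
    public static IEnumerable<string> DropSingleCharTokens(IEnumerable<string> tokens)
    {
        return tokens.Where(t => t.Length > 1);
    }

    static void Main()
    {
        // 我们在希望的田野上 ("we are on the field of hope") segments into:
        var tokens = new[] { "我们", "在", "希望", "的", "田野", "上" };
        Console.WriteLine(string.Join("/", DropSingleCharTokens(tokens)));
        // prints: 我们/希望/田野
    }
}
```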
6) I have heard that using solid-state drives directly increases search speed. Since I do not have an SSD, I cannot provide test results here; if any of you have tested it, please share.
7) If a single server cannot meet the search load, you must go distributed. You can use Lucene.Net + WCF + a cache server: WCF for communication and Lucene for retrieval, where the central server sends requests to each retrieval server and then merges their results, and the cache server handles the high-concurrency problem. I have not tested this solution yet; it is just an idea. Once I verify it, I will discuss it with you.
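The merge step in that idea can be sketched independently of the WCF plumbing. This assumes each retrieval server returns (docId, score) pairs and the central server keeps the global top N by score (all names here are hypothetical, not part of any real API):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class MergeSketch
{
    // One hit as returned by a retrieval server: which server, which doc, what score.
    public struct Hit
    {
        public int ServerId;
        public int DocId;
        public float Score;
    }

    // Merge the per-server result lists into one global top-N, highest score first.
    public static List<Hit> MergeTopN(IEnumerable<IEnumerable<Hit>> perServerResults, int n)
    {
        return perServerResults
            .SelectMany(hits => hits)
            .OrderByDescending(h => h.Score)
            .Take(n)
            .ToList();
    }
}
```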
Thanks to eaglet, the author of PanGu, for providing the .NET word segmentation component.
Thanks to Baoyu for solving the problem of PanGu not working with Lucene.Net 3.0.
LuceneProject source code download