Comparison of Open-Source Search Engines (V): 5.3 Overall Evaluation
Based on the results above, this article ran experiments on document collections of different sizes. The engines whose indexing time was measured are ht://Dig, Indri, IXE, Lucene, MG4J, Swish-E, Swish++, Terrier, XMLSearch, and Zettair. The index storage sizes fall into three groups: for Lucene, MG4J, Swish-E, Swish++, XMLSearch, and Zettair the index is 25%-35% of the dataset size; for Terrier it is about 50%; and for ht://Dig, Omega, and OmniFind it exceeds 100% of the dataset size. The final aspect is memory usage during the indexing phase. ht://Dig, Lucene, and XMLSearch have an essentially fixed memory footprint, and for the first two it is independent of the dataset size (roughly 30 MB to 120 MB). In contrast, the memory footprint of IXE, MG4J, Swish-E, Swish++, and Terrier is much larger and grows linearly with the dataset: from about 600 MB for the small dataset up to roughly 1 GB for the large one.
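To make these figures concrete, the following is a minimal sketch of how indexing time and the index-size-to-dataset ratio could be measured. The directory names and the wrapper script build_index.sh are illustrative assumptions, not the setup used in these experiments; every engine has its own indexing command line.

    import os
    import subprocess
    import time

    def dir_size_bytes(path):
        # Sum the sizes of all files under a directory tree.
        total = 0
        for root, _, files in os.walk(path):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        return total

    corpus_dir = "corpus"   # documents to index (assumed layout)
    index_dir = "index"     # where the engine writes its index (assumed)
    # Hypothetical wrapper; replace with the engine's real indexing command.
    index_cmd = ["./build_index.sh", corpus_dir, index_dir]

    start = time.time()
    subprocess.run(index_cmd, check=True)
    elapsed = time.time() - start

    ratio = 100.0 * dir_size_bytes(index_dir) / dir_size_bytes(corpus_dir)
    print("indexing time: %.1f s" % elapsed)
    print("index size: %.1f%% of the dataset" % ratio)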
This article also finds that the search engines which use a database to store and manage their indexes (DataparkSearch, mnogoSearch, and OpenFTS) perform noticeably worse in the indexing phase; their indexing time is roughly three to six times that of the fastest engines.
The second experiment shows that, for a given dataset and query type (one-word or two-word queries), the query times of the search engines are similar: one-word queries take from under 10 ms up to about 90 ms, and two-word queries from under 10 ms up to about 110 ms. The fastest engines at query time include Indri, IXE, Lucene, and XMLSearch. One caveat is that when a query term is a low-frequency word in the dataset, most engines retrieve only zero or one document, so the retrieval percentage is not representative in those cases.
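As an illustration of how such per-query times could be obtained, the sketch below times a batch of one-word and two-word queries and reports the average latency in milliseconds. The run_query function is a hypothetical stand-in for whatever query interface (command line, API, or HTTP endpoint) a given engine exposes; it is not part of any of the engines discussed, and the query sets are made up.

    import time

    def run_query(query):
        # Hypothetical hook: call the engine's query interface here and
        # return its result list. Left as a stub in this sketch.
        raise NotImplementedError

    def average_latency_ms(queries, runs=1):
        # Time each query and return the mean latency in milliseconds.
        timings = []
        for q in queries:
            for _ in range(runs):
                start = time.perf_counter()
                run_query(q)
                timings.append((time.perf_counter() - start) * 1000.0)
        return sum(timings) / len(timings)

    one_word = ["retrieval", "index"]                    # illustrative query set
    two_word = ["information retrieval", "query time"]   # illustrative query set
    # print(average_latency_ms(one_word), average_latency_ms(two_word))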
Experiments based on the WT10g dataset show that only Indri, IXE, MG4J, Terrier, and Zettair can index the entire collection without degrading too much compared with the earlier TREC-4 dataset. Swish-E and Swish++ cannot index the entire dataset at all under the given system configuration (operating system, memory, etc.), and the indexing times of ht://Dig and Lucene soar, so these engines are left out of the final comparison. Zettair is the fastest at indexing, and its average precision/recall figures are similar to those of Indri, MG4J, and Terrier. IXE has the lowest average precision/recall of the engines compared. If this result is compared with results from related TREC tracks, IXE, MG4J, and Terrier also appear near the top of the list of engines. The difference from the official TREC evaluations is that there the developers fine-tune their engines for each specific track; many of these detailed adjustments are track-specific and are not fully documented in the released versions.
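The precision/recall figures referred to here are the standard ranked-retrieval measures. The following sketch, using made-up relevance judgments, shows how precision, recall, and non-interpolated average precision can be computed for a single ranked result list; it is illustrative only and is not the evaluation code used in these experiments or at TREC.

    def precision_recall_ap(ranked_ids, relevant_ids):
        # ranked_ids: document ids in ranked order; relevant_ids: set of relevant ids.
        relevant = set(relevant_ids)
        hits = 0
        precisions_at_hits = []
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant:
                hits += 1
                precisions_at_hits.append(hits / rank)
        precision = hits / len(ranked_ids) if ranked_ids else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        avg_precision = sum(precisions_at_hits) / len(relevant) if relevant else 0.0
        return precision, recall, avg_precision

    # Toy example: documents d2 and d5 are the relevant ones.
    print(precision_recall_ap(["d2", "d1", "d5", "d7"], {"d2", "d5"}))
    # -> precision 0.5, recall 1.0, average precision (1/1 + 2/3) / 2 = 0.83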
Chapter 4 Conclusion
This article surveys different open-source search engines and draws conclusions from experiments on document collections of different sizes. First, 17 search engines (out of 29 known ones) were selected for comparison. Testing showed that only 10 of them could index a 2.7 GB document collection within a reasonable time (less than one hour); these 10 engines were then used for the query tests. During indexing, memory usage varies, and the storage size of the indexes also differs across engines. In the query test, the engines that could index the largest dataset showed no significant performance differences.
The last experiment compares index-building performance on a large dataset (10 GB) and analyzes retrieval accuracy at different levels. Only five search engines could index this dataset on the given server configuration. Judging by average precision/recall, Zettair is the winner, with Indri performing essentially the same. Comparing the results in this article with the official TREC results shows a gap; in practice, however, the developers participating in each TREC track fine-tune their engines, and those adjustments are not recorded.
The preliminary test results of the excluded search engines (DataparkSearch, mnogoSearch, Namazu, OpenFTS, and Glimpse) were also analyzed, and their time overheads are indeed poor.
From the information above, the features and performance of all the usable search engines in the indexing and query stages can be seen. Table 6.1 gives the indexing time, index storage size, and average query time for the engines that indexed the 2.7 GB dataset. The query-time ranking is based on that dataset, taking all query sets into account (one-word and two-word queries, with both the original term distribution and the normalized distribution). This article also lists the top five search engines that indexed the WT10g dataset.
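As a rough illustration of the two query distributions mentioned above, the sketch below draws query terms from a vocabulary either in proportion to their frequency in the collection (the "original" distribution) or uniformly (a "normalized" distribution). The vocabulary and counts are made up, and this is only one plausible reading of how such query sets could be generated, not the exact procedure used in the experiments.

    import random

    # Made-up vocabulary with collection frequencies.
    vocab = {"search": 500, "engine": 300, "zettair": 5, "recall": 40}

    def sample_queries(n, words_per_query=1, use_original_distribution=True):
        terms = list(vocab)
        weights = [vocab[t] for t in terms] if use_original_distribution else None
        queries = []
        for _ in range(n):
            picked = random.choices(terms, weights=weights, k=words_per_query)
            queries.append(" ".join(picked))
        return queries

    print(sample_queries(3, words_per_query=2, use_original_distribution=True))
    print(sample_queries(3, words_per_query=1, use_original_distribution=False))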
Based on the small dataset (TREC-4) and the large dataset (WT10g), the overall performance of the search engines is analyzed. Zettair emerges as one of the most complete engines, whether in handling large amounts of information faster than the other engines (its indexing time is less than half that of the second-fastest engine) or in achieving the best precision/recall figures on WT10g.
In addition, to decide which search engine to use, the specific requirements of the site must also be taken into account. Aspects such as the programming language (for example, for secondary development of the source code) and server capacity (for example, available memory) still need to be considered. If the dataset to be indexed is quite large and is expected to change frequently, both indexing time and query time matter; Zettair, MG4J, and Swish++ look like wise choices because they are faster, and Swish-E may also work. From another perspective, if disk space is limited and storage needs to be saved, Lucene is a good choice: its storage overhead is small and its queries are fast, with the drawback that its indexing time is long. Finally, if the dataset changes little, then, since the query times of all the engines are not very different, one can choose based on the programming language the site intends to use, which greatly shortens the development cycle: for Java, consider MG4J, Terrier, or Lucene; for C/C++, consider Swish-E, Swish++, ht://Dig, XMLSearch, and Zettair.
Shi Chunqi,
Search engineer,
Graduated from the Institute of Computing Technology, Chinese Academy of Sciences,
[Email protected]