(Updated 2007-5-22) How far is Lucene (Nutch) from commercial text search engines?
Author: rushed out of the universe http://lotusroots.bokee.com
Time: 2007.2.13
Update: 2007.5.9
Update: 2007.5.22
Note: Please credit the author when reprinting.
Note (2007-5-22): For this latest update I studied Lucene once again. After reading Lucene in Action and building a small search system with Lucene, I felt ashamed, because I had always been dissatisfied with Lucene and thought it was poorly done (perhaps I was influenced by certain Chinese websites that build search engines on Lucene, because they all build them badly; maybe they do not really understand Lucene any better than I did). I have now discovered that Lucene's authors thought things through far more thoroughly than I had (although I am still not sure whether some of the features are actually useful). My current impressions:
1) Lucene has a deep understanding of queries and covers almost every query requirement you can think of. Ordinary commercial search engines (such as Baidu) only consider the Boolean query mode, far fewer than the query types Lucene offers;
2) Lucene's filtering and sorting can meet 90% of real requirements. Filtering removes unwanted results, while sorting orders the results by the value of some field. A basic vertical search engine needs both, because they shrink the result set and improve the user experience. Baidu and other general-purpose commercial search engines obviously do not need (or at least do not expose) these features.
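To make the filtering and sorting concrete, here is a minimal sketch against the Lucene 2.x API that was current when this was written; the index path and the field names ("content", "category", "date", "title") are hypothetical:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

IndexSearcher searcher = new IndexSearcher("/path/to/index");
Query query = new TermQuery(new Term("content", "beijing"));
// Filter: keep only documents whose "category" field is exactly "news"
Filter filter = new QueryFilter(new TermQuery(new Term("category", "news")));
// Sort: order the surviving results by the "date" field, newest first, instead of by score
Sort sort = new Sort(new SortField("date", SortField.STRING, true));
Hits hits = searcher.search(query, filter, sort);
for (int i = 0; i < hits.length(); i++) {
    System.out.println(hits.doc(i).get("title"));
}
searcher.close();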
3) Lucene's score calculation produces good results. Setting aside the standard IR scoring algorithm (everyone says that Lucene uses a simplified Vector Space Model in general, but for Boolean-style queries it actually uses a simplified Extended Boolean Model scoring formula), Lucene supports most of the extra, experience-based scoring adjustments. For example, for the query "Beijing, China", the actual ordering of several documents is as follows (simply parsing this query string will not produce this result; see point 4):
Beijing, China
Beijing China
Cute Beijing, China
Beijing belongs to China
4) Lucene does not provide genuinely usable query string parsing. Its grammar-based query parser has very poor fault tolerance, so for practical purposes you can assume that Lucene does not give you a usable query string parser. The "Beijing, China" example above therefore has to be built by hand:
PhraseQuery query = new PhraseQuery();
query.add(new Term("name", "China"));
query.add(new Term("name", "Beijing"));
query.setSlop(1000);
Only then do you get the ordering above. So the first step in using Lucene is to provide your own query string parser.
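A toy starting point for such a parser, assuming the same "name" field as above: whitespace-separated words are simply ANDed together. A real parser would add Chinese tokenization, operators, quoted phrases, and the slop handling shown earlier.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

class SimpleParser {
    // Whitespace-separated words become an AND (all-required) Boolean query.
    static Query parseSimple(String input) {
        BooleanQuery bq = new BooleanQuery();
        for (String word : input.trim().toLowerCase().split("\\s+")) {
            bq.add(new TermQuery(new Term("name", word)), BooleanClause.Occur.MUST);
        }
        return bq;
    }
}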
5) Understand Lucene's algorithmic limits. Doug Cutting is not a magician; he cannot make fundamental algorithm problems disappear, so almost all of the usual algorithmic limits show up in Lucene. For example, RangeQuery is very slow, and Sort needs a large amount of memory to buffer field values. These are not Doug Cutting's mistakes but algorithmic constraints, and I doubt anyone can fully solve them. (We know from IR that range queries can be sped up with a B+ tree or similar structure, but we cannot blame Lucene: it uses a compact file structure into which a B+ tree is hard to fit.) If these issues really matter to you, you can modify Lucene's code or build extra extensions on top of it.
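A common mitigation in that era, sketched below assuming a string-sortable "date" field (and reusing the searcher from the earlier sketch): RangeQuery expands into a BooleanQuery over every term in the range, whereas a RangeFilter walks the term index without scoring each term, which is often the lesser evil.

// same imports as the sketches above
Query slow = new RangeQuery(new Term("date", "20070101"), new Term("date", "20071231"), true);

// Often cheaper: restrict by the range as a filter, score only the real query terms.
Filter byDate = new RangeFilter("date", "20070101", "20071231", true, true);
Hits hits = searcher.search(new TermQuery(new Term("content", "lucene")), byDate);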
6) Understand Lucene's other limits. In real applications you will run into all kinds of restrictions. Let me point out two: a) cache consistency; b) write/read visibility. Lucene's built-in caching is weak, so you will probably extend it yourself; whenever you do, always watch the consistency between the cache and the underlying data. Write/read visibility is a genuine Lucene issue: when you read and write a directory at the same time, newly written data is not immediately visible to readers, so you either have to buffer your own writes or reopen the reader after writing.
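A small sketch of the visibility issue, again against the 2.x-era API (dir is an already-opened Directory; the field name and content are made up): a searcher opened before a write does not see documents added afterwards until it is reopened.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false); // dir: an existing Directory
IndexSearcher searcher = new IndexSearcher(dir);   // a snapshot of the index as of this moment

Document doc = new Document();
doc.add(new Field("name", "Beijing China", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.close();                                    // flush the new segment to disk

// The old searcher still cannot see the new document; reopen to pick it up.
searcher.close();
searcher = new IndexSearcher(dir);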
Lucene is an open-source Java search engine library that covers only information retrieval (IR). It is neither the only open-source search engine nor necessarily the best; arguably better ones exist, such as Egothor. It is, however, the most complete and the best documented. Nutch builds on Lucene and adds distributed processing and a crawler. In this article the author tries to describe the distance between the technology they use and the technology used by typical commercial text search engines. The author's level is limited, with less than two years of search research and practical experience, so please point out any shortcomings. Thank you.
1. Web search engine architecture
A professional Web search engine consists of at least three parts: crawling, processing, and retrieval. Their general functions are:
Crawling: a crawler (also called a spider) program walks a specific network (or the whole Web) and downloads its pages and other files to local storage. The hard parts are JavaScript analysis and the authentication issues that come with the modern Web.
Processing: programs for classification, information extraction, data mining and so on analyze the fetched pages, for example to categorize site content, extract the article from a news page, induce page templates, or compute relationships between sites.
Retrieval: the information retrieval (IR) program loads the documents into the index, then, given a query string, finds the most relevant documents in it (a minimal Lucene index-and-search sketch follows).
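For the retrieval component, a minimal Lucene sketch against the 2.x-era API; the index path, field name, and content are made up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

// 1) Fill the index ("database") with documents.
IndexWriter writer = new IndexWriter("/tmp/demo-index", new StandardAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("title", "Beijing, China", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.close();

// 2) Find the most relevant documents for a query string.
IndexSearcher searcher = new IndexSearcher("/tmp/demo-index");
Query q = new QueryParser("title", new StandardAnalyzer()).parse("beijing");
Hits hits = searcher.search(q);
System.out.println(hits.length() + " hit(s), first: " + hits.doc(0).get("title"));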
From the perspective of the search engine architecture, the missing part of Lucene is information processing. Information processing is precisely the core technology of the entire search engine.
2. Capturing information
Web information capture covers web pages, text files, and other file types. For sites using the HTTP (and HTTPS) protocol, the main capture loop is as follows (a minimal sketch of the loop appears after the steps):
Start from the first page at a specified URI address;
Analyze the page structure, extract the hyperlink addresses, and add them to the queue of links to download;
While there are undownloaded links, download the corresponding page, save it, and go back to step 2;
When all links have been downloaded, exit.
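A minimal, single-threaded sketch of that loop; fetch(), save() and extractLinks() are hypothetical stand-ins for the HTTP downloader and HTML parser discussed below.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

Deque<String> frontier = new ArrayDeque<String>();
Set<String> seen = new HashSet<String>();
frontier.add(seedUri);          // step 1: the specified starting URI
seen.add(seedUri);
while (!frontier.isEmpty()) {   // steps 3/4: loop until nothing is left to download
    String uri = frontier.poll();
    byte[] page = fetch(uri);   // download the page
    save(uri, page);            // keep a local copy
    for (String link : extractLinks(page)) {   // step 2: pull hyperlinks out of the page
        if (seen.add(link)) {   // Set.add returns false when the link was already queued
            frontier.add(link);
        }
    }
}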
This kind of plain information capture is what is usually called a spider. It uses a basic HTML parser (such as HTMLParser, NekoHTML, or JTidy) to parse each page and pull out its hyperlinks. A spider generally consists of the following two parts:
HTTP download component. Given a URI, it downloads the data at that address. You may think this is easy, but it is not. Besides plain HTTP, the HTTPS secure protocol is also extremely common on the Web. When the data requires authentication, the downloader has to support authentication; when it requires a login, the downloader has to support cookies. So a downloader that only speaks plain HTTP is far from enough, and on this point Nutch still falls short.
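A bare sketch of those concerns using only the JDK (a modern JDK is assumed; the URL, cookie, and credentials are made up). A real crawler also needs redirects, robots.txt, retries, and per-host politeness:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

URL url = new URL("https://example.com/protected/page.html");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty("Cookie", "JSESSIONID=abc123");   // session cookie obtained after login
conn.setRequestProperty("Authorization",
        "Basic " + Base64.getEncoder().encodeToString("user:pass".getBytes(StandardCharsets.UTF_8)));
conn.setConnectTimeout(10000);
conn.setReadTimeout(30000);
InputStream in = conn.getInputStream();
byte[] page = in.readAllBytes();                          // Java 9+; loop over read() on older JDKs
in.close();
System.out.println(page.length + " bytes downloaded");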
HTML page parser. It cannot support only HTML: many pages today are actually built as XML, and although HTML and XML look similar, their tag vocabularies differ considerably. WAP pages for mobile devices are another format you may need to support. Nutch uses NekoHTML or JTidy. In practice JTidy's results are mediocre, while NekoHTML's are quite good; among open-source software it is probably the best HTML parser, though still some distance from the parsers inside the IE or Mozilla browsers. Also, this kind of spider only turns a page into a DOM tree and cannot handle AJAX-based pages effectively. To handle pages that rely heavily on AJAX or JavaScript, the spider also needs JavaScript processing capability (which could alternatively be pushed into the data processing stage).
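A link-extraction sketch with NekoHTML, assuming the html string holds the downloaded page text (note that NekoHTML upper-cases element names by default):

import java.io.StringReader;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

DOMParser parser = new DOMParser();
parser.parse(new InputSource(new StringReader(html)));
Document dom = parser.getDocument();
NodeList anchors = dom.getElementsByTagName("A");   // element names are upper-cased by default
for (int i = 0; i < anchors.getLength(); i++) {
    Node href = anchors.item(i).getAttributes().getNamedItem("href");
    if (href != null) {
        System.out.println(href.getNodeValue());     // a candidate link for the crawl queue
    }
}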
3. Information Processing
It may seem unprofessional to discuss information processing in the context of Lucene, since Lucene does not support it at all. Still, you can write the information processing part yourself and plug it in around Lucene. The topic is too big, and the author does not have the nerve or the level to cover it, although it is the main direction of the author's own research.
The only thing that can be said with confidence is that information processing rests on at least two things: 1) machine learning (ML) and 2) natural language processing (NLP). Typical examples of the former are SVM and HMM; of the latter, HNC and HowNet. I believe I have a reasonable understanding of ML, but I am not very familiar with NLP and have not yet formed a complete solution.
4. Information Retrieval
Information retrieval has been studied for many decades, since before most of us were born. It splits into two steps: the first loads documents into the index (the so-called index building); the second takes the user's input and returns the most relevant documents (the so-called search). Searching is essentially document similarity: the user's input string is treated as a document, and the engine finds and returns the stored documents most similar to it. A real search engine has to worry about speed and cannot do this as naively as the theory suggests. A search generally has three steps:
Query string parsing. Queries can be organized in many ways, Boolean queries for instance. A typical commercial text search engine supports queries like: "Beijing Yellow River and Tom cat (or beautiful -Motherland) Jerry +mouse", that is, compound expressions with AND (+), OR (|), NOT (-), parentheses, and quotation marks. Merely parsing such strings is not enough; you have to optimize them too. For example, should the string "Beijing Beijing Beijing" be treated the same as "Beijing"? Lucene has done a lot of work here, but you will still need to write your own query string parser, one closer to Chinese usage and to commercial search-engine conventions, and decide whether to rewrite the expression into a form suited to parallel evaluation or into the form needing minimal computation.
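To make the operator support concrete, here is how a compound expression with AND, OR, and NOT maps onto Lucene clauses once parsed (the "content" field is made up; duplicate terms such as "Beijing Beijing Beijing" would be collapsed at this stage):

// same Lucene imports as the sketches above
// Roughly: beijing AND (yellow OR river) NOT tom
BooleanQuery inner = new BooleanQuery();
inner.add(new TermQuery(new Term("content", "yellow")), BooleanClause.Occur.SHOULD);
inner.add(new TermQuery(new Term("content", "river")), BooleanClause.Occur.SHOULD);

BooleanQuery q = new BooleanQuery();
q.add(new TermQuery(new Term("content", "beijing")), BooleanClause.Occur.MUST);
q.add(inner, BooleanClause.Occur.MUST);
q.add(new TermQuery(new Term("content", "tom")), BooleanClause.Occur.MUST_NOT);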
Speaking of Boolean queries: part of the reason I think Lucene is still some way from practical use lies here. Why? Because Lucene's query parser is generated with JavaCC (a Java analogue of YACC), and its grammar is too rigid; it cannot sensibly handle a string like:
We and they
Being a perfectionist, the author cannot tolerate such an "error".
Querying. There is not much to say about the query itself, except that the algorithms are not optimized. For a project still in early development and short of developers, however, that is sufficient.
Ranking the results. Lucene's scoring is close to what IR theory prescribes, but real-world scoring is much messier. One example: Lucene's default scoring ignores how close the query terms are to each other. For a query like "Beijing store", a document where "Beijing" and "store" appear right next to each other should rank clearly above one where they are far apart, because the former really is about a "Beijing store". Lucene can be made to do this, but it will seriously reduce search speed (several times slower).
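One way to let proximity influence ranking, at exactly the speed cost the author describes: OR a sloppy PhraseQuery onto the plain Boolean query so that documents with the terms close together score higher (field name assumed):

// same Lucene imports as the sketches above
BooleanQuery q = new BooleanQuery();
q.add(new TermQuery(new Term("content", "beijing")), BooleanClause.Occur.MUST);
q.add(new TermQuery(new Term("content", "store")), BooleanClause.Occur.MUST);

PhraseQuery near = new PhraseQuery();
near.add(new Term("content", "beijing"));
near.add(new Term("content", "store"));
near.setSlop(5);                           // only matches when the terms are within 5 positions
q.add(near, BooleanClause.Occur.SHOULD);   // optional clause, but it boosts close matches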
In practice you also often need to fold attribute values into the score. In news search, for example, shouldn't today's news carry more weight than yesterday's? Lucene does not support this.
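The closest built-in facility in the 2.x-era API is a static, index-time document boost; it cannot be recomputed at query time (and "today" changes every day), which is the author's point. The freshness factor and field names below are made up, and writer is an open IndexWriter as in the earlier sketches.

// same Lucene imports as the sketches above
Document doc = new Document();
doc.add(new Field("title", "Some news headline", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("date", "20070509", Field.Store.YES, Field.Index.UN_TOKENIZED));
float freshness = 1.5f;          // e.g. larger for newer articles, computed yourself
doc.setBoost(freshness);         // multiplies into the score of every hit on this document
writer.addDocument(doc);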
5. Speed first
For a search engine, speed is absolutely the first consideration. Crawl speed is mostly outside what software alone can fix; it is improved with more bandwidth and multi-level update policies. Processing speed is not discussed here. For retrieval, speed matters in two places:
Indexing speed. This covers the speed of adding documents, modifying documents, and deleting documents. At coarse granularity, the MapReduce-based distributed index building adopted by Nutch is a good architecture; at fine granularity, Lucene's in-memory buffering and merging of small on-disk segments is also a good framework, and their index-building performance is commendable. For deletion, Lucene's approach of marking documents before removing them is understandable. Where Lucene does poorly is modifying and appending to documents: the only way to modify a document is to delete it and then re-add it, which is clearly unsatisfying.
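The knobs behind that buffering and merging, and the delete-then-re-add dance that stands in for modification, sketched against the Lucene 2.1-era API (earlier versions delete through IndexReader instead; the "id" field and newVersionOfDoc are made up):

// same Lucene imports as the sketches above
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
writer.setMaxBufferedDocs(1000);   // documents held in RAM before a new segment is flushed
writer.setMergeFactor(10);         // how many segments may accumulate before a merge

// "Modify" the document whose id field is 42: delete the old copy, add the new one.
writer.deleteDocuments(new Term("id", "42"));
writer.addDocument(newVersionOfDoc);   // newVersionOfDoc built elsewhere
writer.close();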
Query speed. Lucene (Nutch) takes the view that distributed querying is not worthwhile because it is slower. In truth, fast distributed querying needs an architecture designed for it across every part of the search engine; it is not something Nutch can bolt on, so Lucene (Nutch) queries run on a single machine. Also, because of the scoring method Lucene uses, it does not need to load the position information of words within documents, so it looks faster than other engines; in reality it trades precision for speed. On top of that, its file structure severely limits its speed.
The absence of caching is another serious failing of Lucene. With this architecture you can build a query-result cache (a level-1 cache) yourself, but it is hard to build a per-term index cache (a level-2 cache). For a large search system that is unthinkable; common commercial search engines have three levels of cache.
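A minimal level-1 (query-result) cache of the kind you would have to build yourself: a plain LRU map from the query string to the matching document ids (capacity and key format are up to you):

import java.util.LinkedHashMap;
import java.util.Map;

// A tiny LRU map from query string to matching document ids.
class QueryResultCache extends LinkedHashMap<String, int[]> {
    private final int capacity;
    QueryResultCache(int capacity) {
        super(16, 0.75f, true);   // access-order iteration gives LRU behaviour
        this.capacity = capacity;
    }
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, int[]> eldest) {
        return size() > capacity; // evict the least recently used entry once over capacity
    }
}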
6. Precision second
Judging by the query types and scoring Lucene supports, Lucene has no precision problem, because it computes over all of the data. Commercial search engines, because of their data volumes, have to use estimation algorithms to reduce disk reads, which costs a little precision. The flip side is that Lucene cannot handle truly large-scale data.
7. Efficiency third
Efficiency here mainly means space efficiency: the program's memory footprint and disk usage. Lucene compresses stored document content with zlib and uses a simple scheme to compress integers, but overall it still compresses too little. The consequences of compressing so little are: 1) the file structure and code stay simple; 2) queries are slower; 3) segment merging is fast; 4) disk usage grows. The lack of compression is also closely tied to the lack of caching, because a compressed index fits into memory far more easily, which is exactly what a cache needs.
Postscript:
Because the author is extremely keen on algorithms, this article mostly compares the open-source search engine Lucene (Nutch) with basic commercial search engines at the algorithmic level; to keep it readable, almost no actual algorithms are spelled out here.
There is too much to write; I will add more when I have time.
07.5.9 Postscript:
Based on recent insights, I revised this article and added some important points.