Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the document library. It measures how completely the retrieval system (search engine) finds the relevant material. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. It measures how accurate the results of the retrieval system (search engine) are. For a retrieval system, recall and precision generally cannot both be high at once: when recall is high, precision tends to be low, and when precision is high, recall tends to be low. Therefore, the precision of a retrieval system is often reported as the average of the 11 precision values measured at 11 recall levels (that is, the 11-point average precision). For a search engine system, because no search engine can collect all Web pages, recall is difficult to calculate. Currently, search engine systems are mainly concerned with precision.
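The two ratios and the 11-point average can be made concrete with a small sketch. The function names and the toy document IDs are illustrative assumptions, and the interpolation rule used here (take the best precision at any recall at or above each level) is one common convention, not something the text prescribes.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def eleven_point_average_precision(ranking, relevant):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    points = []  # (recall, precision) measured after each rank position
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for i in range(11):
        level = i / 10
        # Assumed interpolation: best precision at any recall >= level.
        candidates = [p for r, p in points if r >= level]
        total += max(candidates, default=0.0)
    return total / 11


# Toy example: a ranked result list and the set of truly relevant documents.
ranking = ["a", "x", "b", "y"]
relevant = {"a", "b"}
p, r = precision_recall(ranking, relevant)
avg = eleven_point_average_precision(ranking, relevant)
```

Returning all four documents reaches full recall but only half of them are relevant, which illustrates the trade-off described above.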
There are many factors that affect the performance of a search engine system. The most important is the information retrieval model, which includes the representation of documents and queries, the matching strategy used to evaluate the relevance of documents to user queries, the ranking of query results, and the mechanism for user relevance feedback.
III. Main technologies
A search engine consists of four parts: the searcher, the indexer, the retriever, and the user interface.
1. Searcher
The searcher roams the Internet to discover and collect information. It is usually a computer program that runs day and night. It needs to collect as much new information as possible, as quickly as possible. At the same time, because information on the Internet changes quickly, it also needs to regularly refresh the information it has already collected, to avoid dead links and invalid links. There are currently two methods to collect information:
● Starting from a set of starting URLs and following the hyperlinks in those pages, the searcher discovers information iteratively across the Internet in a breadth-first, depth-first, or heuristic manner. The starting URLs can be arbitrary, but they are often popular websites that contain many links (such as Yahoo!).
● Divide the Web space by domain name, IP address, or country-code domain name. Each searcher is then responsible for exhaustively searching one subspace.
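The first method, in its breadth-first form, can be sketched as follows. This is a minimal illustration rather than a real crawler: `fetch_links` is a hypothetical callback standing in for downloading a page and extracting its hyperlinks, and `TOY_WEB` is a made-up link graph used in place of the Internet.

```python
from collections import deque


def crawl_breadth_first(start_urls, fetch_links, max_pages=100):
    """Visit pages in breadth-first order, starting from start_urls."""
    queue = deque(dict.fromkeys(start_urls))  # drop duplicate starting URLs
    seen = set(queue)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()  # FIFO queue gives breadth-first order
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return visited


# Made-up link graph standing in for the Web (hypothetical URLs).
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://d.example/"],
    "http://c.example/": ["http://d.example/", "http://a.example/"],
    "http://d.example/": [],
}

order = crawl_breadth_first(["http://a.example/"], lambda u: TOY_WEB.get(u, []))
```

Replacing `queue.popleft()` with `queue.pop()` turns the queue into a stack and gives a roughly depth-first traversal instead. A heuristic crawler would use a priority queue ordered by some page score.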
The searcher collects many types of information, including HTML, XML, newsgroup articles, FTP files, word processing documents, and multimedia information.
The implementation of searchers often uses distributed and parallel computing technologies to speed up information discovery and updating. Commercial search engines can discover millions of webpages every day.
2. Indexer
The indexer analyzes the information collected by the searcher, extracts index items from it to represent each document, and generates the index table of the document library.
There are two types of index items: objective index items and content index items. Objective index items are independent of the semantic content of the document, such as the author name, URL, update time, encoding, length, and link popularity. Content index items reflect the content of a document, such as keywords and their weights, phrases, and words. Content index items can be divided into single index items and multi-index items (or phrase index items). In English, a single index item is a word, which is easy to extract because there are natural separators (spaces) between words. In Chinese and other languages written without spaces between words, the text must first be segmented into words.
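For English text, extracting single index items and building an index table might look like the sketch below. The function names and the sample documents are assumptions made for illustration, and the table is modeled here as an inverted index (term to documents), which is one common layout rather than the only one.

```python
import re
from collections import defaultdict


def extract_terms(text):
    """Split English text into lowercase word index items."""
    # Spaces and punctuation act as the natural separators between words.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to {doc_id: occurrence count} (an inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in extract_terms(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


# Made-up documents keyed by hypothetical document IDs.
docs = {
    "d1": "Search engines index the Web",
    "d2": "The index table",
}
index = build_index(docs)
```

This approach does not carry over to Chinese text, where a word segmentation step would have to replace the regular expression.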
In a search engine, a single index item is usually assigned a weight that indicates how well the index item distinguishes the document, and this weight is used to calculate the relevance of the query results. The weights are generally computed using statistical methods, information theory methods, or probability methods. The methods for extracting phrase index items include statistical methods, probability methods, and linguistic methods.
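As one example of a statistical weighting method, the sketch below computes TF-IDF weights. The text does not name a specific scheme, so the choice of TF-IDF, the exact formula variant (term frequency normalized by document length, unsmoothed logarithmic IDF), and all names are assumptions.

```python
import math


def tf_idf_weights(doc_terms):
    """doc_terms: {doc_id: [term, ...]} -> {doc_id: {term: weight}}."""
    n_docs = len(doc_terms)
    doc_freq = {}  # number of documents that contain each term
    for terms in doc_terms.values():
        for term in set(terms):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weights = {}
    for doc_id, terms in doc_terms.items():
        weights[doc_id] = {}
        for term in set(terms):
            tf = terms.count(term) / len(terms)       # frequency within the document
            idf = math.log(n_docs / doc_freq[term])   # rarity across the library
            weights[doc_id][term] = tf * idf
    return weights


# Made-up term lists for two hypothetical documents.
weights = tf_idf_weights({
    "d1": ["search", "engine", "index"],
    "d2": ["search", "crawler"],
})
```

A term that appears in every document, such as "search" here, gets a weight of zero because it does nothing to distinguish one document from another.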