How to build a better search engine by measuring query quality

Source: Internet
Author: User

Better search engines are built by evaluating them against standard sets of real-world test cases, which allows developers to measure the relative effectiveness of alternative methods. This article discusses the NIST Text Retrieval Conference (TREC) project, which created an infrastructure for measuring the quality of search results.

We tend to take for granted the ability to search text in our native language, but web search engines such as Yahoo, Google, and Bing have not been around for very long, and web content is not the only material we need to search. As data becomes more ubiquitous, search requirements have expanded accordingly. People search for different purposes (such as re-finding known items, answering specific questions, learning about a topic, monitoring data streams, and browsing) across different media (such as text, web pages, tweets, voice recordings, still images, and video). In many cases, the techniques needed to support these different kinds of search are still being refined. How should search technology develop? How can search engine developers know what works, and why?

Careful measurement of search engine performance on standard, real-world tasks by participants from a large, diverse search community has proven to be critical. Through the Text Retrieval Conference (TREC) project, the National Institute of Standards and Technology (NIST) has run community evaluations for the past quarter century, and those evaluations have driven the development of search and search-related technologies.

The origins of TREC

Search algorithms are usually developed using test collections, that is, by comparing each alternative method against a baseline on the same task. The first test collection grew out of a series of experiments on aeronautical indexing languages conducted at the Cranfield College of Aeronautics in the 1960s.1 The Cranfield collection contains a set of abstracts of aeronautics journal articles, a set of queries over those abstracts, and the correct answers for each query. By today's standards it may seem trivial, but the Cranfield collection was groundbreaking at the time: it created the first shared measurement tool for information retrieval systems. Researchers could run their own search engines over the abstracts and measure the returned results against the answer key.

Other research groups began to follow the experimental approach pioneered at Cranfield, and several more test collections were created during the 1970s and 1980s. By the early 1990s, however, dissatisfaction with the methodology was growing. Even when research groups used the same test collection, they often did not use the same documents or the same evaluation metrics, so results could not be compared across retrieval systems. Commercial search engine companies did not incorporate the research community's retrieval results into their products because they felt the test collections used by researchers were too small to be worth borrowing from.

In the face of this dissatisfaction, NIST was asked to build a large test collection for evaluating the text retrieval techniques being developed as part of the U.S. Defense Advanced Research Projects Agency's (DARPA) TIPSTER project.2 NIST agreed to build the collection in the context of a workshop that would also examine broader issues, such as how test collections should be used. That workshop became the first TREC meeting in 1992, and a TREC meeting has been held every year since. TREC met its initial goal of building a large test collection early on; in fact, TREC has built dozens of test collections that are still used by the international research community today. TREC's greater achievement, however, is the creation and validation of a research paradigm that continues to be extended to new tasks and applications each year.

Community Assessment

The research paradigm is centered on community-based evaluations, sometimes called "coopetitions" (cooperative competitions), in which cooperation among competitors produces greater benefits than any group could achieve alone.

The main element of the paradigm is the evaluation task, which is typically an abstraction of a user task that makes precise what is expected of a system. The evaluation task is associated with one or more metrics that reflect the quality of a system's response and for which the infrastructure needed to compute the metrics can be built. Together, the task, the metrics, and statements about how scores can validly be interpreted constitute an evaluation methodology. A standard evaluation methodology allows results to be compared across different systems, and this matters not because the comparison crowns a winner of a search contest, but because it lets the work of different research groups build on one another.

As a specific example of the paradigm, consider the main ad hoc task of the early TRECs, which extended the Cranfield method of the time. The ad hoc evaluation task was to retrieve relevant documents (or rather, to produce a ranked list in which all relevant documents precede the non-relevant ones) given a document set and a natural language statement of an information need called a topic. Retrieval output was measured by precision (the ratio of relevant documents retrieved to total documents retrieved) and recall (the ratio of relevant documents retrieved to all relevant documents in the collection), given a set of known relevant documents for each topic (in other words, the answer key). TREC's innovation was to use pooling3 to build those sets of relevant documents for large document collections.
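
The two measures are simple to state in code. Below is a minimal sketch, assuming a system's output for one topic is a ranked list of document IDs and the answer key is a set of IDs judged relevant; the names (precision, recall, run, qrels) are illustrative and not part of any TREC tooling.

# Minimal sketch of the precision and recall calculations described above.
# `retrieved` is the ranked list of document IDs a system returned for one
# topic; `relevant` is the set of document IDs judged relevant (the answer key).

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return hits / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return hits / len(relevant)

# Example: a system returns four documents for a topic with three relevant ones.
run = ["doc42", "doc7", "doc13", "doc99"]
qrels = {"doc42", "doc13", "doc55"}
print(precision(run, qrels))  # 0.5   (2 of the 4 retrieved are relevant)
print(recall(run, qrels))     # 0.67  (2 of the 3 relevant were retrieved)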

The pool for a topic is the union of the top X documents retrieved for that topic by each participating system. Human assessors judge the relevance of only the documents in the pool; all other documents are treated as not relevant when effectiveness scores are computed. Although only a small fraction of the full document set is judged, subsequent tests showed that pooling in TREC finds most of the relevant documents. Further tests also verified that, in general, retrieval systems that score higher on a test collection built through pooling tend to be more effective in practical applications than systems that score lower.4 At the same time, those tests exposed limits on how scores computed from a test collection can validly be used. Because the absolute value of a score is determined by many factors other than the retrieval system itself (for example, using different human assessors generally has some effect on scores), it is only valid to compare scores computed on exactly the same test collection. This means it is not valid to compare TREC scores across years, because each TREC builds a new (different) test collection.

For pooling to be an effective strategy, the pool must contain the output of a wide range of retrieval approaches. The TREC community factor, that is, many different retrieval methods being run over the same document set, is therefore essential to building a good test collection. The community has also been important to TREC's success in other ways. The state of the art can be validated only when all the leading retrieval methods are represented. The annual TREC meeting not only brings different research groups together but also promotes technical exchange between research and development organizations, and it provides an effective mechanism for resolving methodological questions. Finally, community members are often the source of data and use cases for new tasks.
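
The pooling step itself can be sketched in a few lines. The following is a minimal illustration, assuming each participating system's run for a topic is a ranked list of document IDs; the function and variable names (build_pool, pool_depth, system_runs) are illustrative and not taken from any TREC software.

# Minimal sketch of pooling for one topic: take the top-k documents from each
# participating system's ranked run and merge them into the single pool that
# human assessors will judge.

def build_pool(runs, pool_depth=100):
    """runs: list of ranked lists of document IDs, one list per system.
    Returns the set of unique documents to be judged for this topic."""
    pool = set()
    for ranked_list in runs:
        pool.update(ranked_list[:pool_depth])
    return pool

# Example with three toy systems and a pool depth of 2.
system_runs = [
    ["doc3", "doc1", "doc9"],
    ["doc1", "doc7", "doc3"],
    ["doc5", "doc3", "doc2"],
]
print(sorted(build_pool(system_runs, pool_depth=2)))
# ['doc1', 'doc3', 'doc5', 'doc7']

Documents outside this set are simply never shown to the assessors, which is why, as noted above, the diversity of the participating systems determines how complete the relevance judgments are.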

When TREC began, it was also doubted whether the statistical systems built in research labs (as opposed to the operational systems of the day, which used Boolean search over manually indexed collections) could retrieve documents from large collections at all. The TREC ad hoc task demonstrated that the laboratory search engines of the early 1990s not only scaled to large collections but also improved steadily. Their effectiveness has been demonstrated both in the TREC test-collection laboratory and in the operational systems that have since incorporated the technology. Moreover, the technology is now applied to collections far larger than anything envisioned in 1992. Web search engines are the most prominent example of the power of statistical retrieval: their ability to bring users the information they are looking for has been the foundation of their success. As mentioned earlier, improvement in search effectiveness cannot simply be judged by comparing TREC scores across years. However, the developers of the SMART search system froze a copy of the system they used in each of the eight TREC ad hoc tasks in which they participated.5 After each TREC, they ran every test collection through every frozen system. For each test collection, the newer SMART versions were consistently more effective than the older ones, with the final scores roughly double those of the earliest version. Although this evidence directly concerns only one system, the SMART results tracked the results of other systems in each TREC, so the SMART trajectory can be taken as representative of the state of the art.
