Search Engine Technology and Trends

Source: Internet
Author: User
Keywords: search engines, trends


Li: graduated from Harbin Institute of Technology in 1982 and received a doctorate from the Computer Science Department of Stevens Institute of Technology in the United States in 1986. He is currently a professor of computer science and technology at Peking University and a doctoral supervisor. His research direction is parallel and distributed computing. Jianguo: associate professor, Computer Department, Peking University.


As the Internet develops rapidly and the volume of web information grows, users looking for information face something like finding a needle in a haystack. Search engine technology addresses this problem by providing users with information retrieval services, and it is becoming a focus of research and development in both the computer industry and academia.

Search engine technology has developed gradually since 1995 alongside the rapid growth of web information. According to "Accessibility of information on the web," published in Nature in July 1999, the Web then held an estimated 800 million pages and more than 9 TB of useful data, and was still doubling every 4 months. Users searching such a vast ocean of information will inevitably come away empty-handed without help. Search engines are the technology that emerged to solve this problem of getting lost in information. A search engine uses some strategy to collect and discover information on the Internet, then understands, extracts, organizes, and processes it, and provides users with retrieval services, thereby acting as a navigator for information. The navigation service that search engines provide has become a very important Internet service, and search engine sites are also known as "network portals." The purpose of this article is to briefly introduce the key technologies of search engines as a starting point for further discussion.




Classification: By how they collect information and how they deliver their service, search engine systems fall into three main categories:




1. Catalog search engine: Information is collected manually or semi-automatically. After editors review it, they write a summary and place the entry in a predetermined classification framework. Most of the information is organized by web site, and the engine offers directory browsing as well as direct retrieval. Because human intelligence is involved, the information is accurate and the navigation quality is high. The drawbacks are that it requires manual intervention, the maintenance workload is large, the amount of information is small, and updates are not timely. Representatives of this type include Yahoo!, LookSmart, Open Directory, Go Guide, and so on.




2. Robot search engine: A robot program called a spider automatically collects and discovers information on the Internet according to some strategy; an indexer builds an index over the collected information; and a retriever searches the index according to the user's query and returns the results to the user. The service mode is full-text search of web pages. The advantages are a large volume of information, timely updates, and no need for human intervention. The disadvantage is that too many results are returned, many of them irrelevant, so the user must filter them. Representatives of this type include AltaVista, Northern Light, Excite, Infoseek, Inktomi, FAST, Lycos, and Google; in China, "Skynet", Youyou, OpenFind, and so on.




3. Meta search engine: This type has no data of its own. It submits the user's query to several search engines at once, removes duplicates from the returned results, re-ranks them, and returns them to the user as its own results. The service mode is full-text search of web pages. The advantage is that the results are larger in volume and more complete. The disadvantages are that the features of the underlying search engines cannot be fully used and that the user has to do more filtering. Representatives of this type include WebCrawler, InfoMarket, and so on.
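The merging step of a meta search engine can be sketched roughly as follows. This is a minimal illustration, not the method of any engine named above: the result lists are made-up placeholder URLs, and the reciprocal-rank scoring and the `normalize` helper are assumptions chosen only to show duplicate removal and re-ranking.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Reduce a URL to a comparable key (lowercase host, no trailing slash)."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")

def merge_results(result_lists):
    """Merge ranked URL lists from several engines into one ranking.

    Each URL earns 1/rank from every engine that returned it, so pages
    ranked highly by several engines rise to the top. Duplicates collapse
    onto the first spelling of the URL that was seen.
    """
    scores = {}
    first_seen = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            key = normalize(url)
            scores[key] = scores.get(key, 0.0) + 1.0 / rank
            first_seen.setdefault(key, url)
    ordered = sorted(scores, key=lambda k: -scores[k])
    return [first_seen[k] for k in ordered]

# Hypothetical answers from two underlying engines for the same query.
engine_a = ["http://a.example/x", "http://b.example/", "http://c.example/y"]
engine_b = ["http://B.example", "http://d.example/z", "http://a.example/x/"]
merged = merge_results([engine_a, engine_b])
```

A real meta search engine would also have to translate the query into each underlying engine's syntax, which is where the loss of engine-specific features mentioned above comes from.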




Performance metrics: Web search can be viewed as an information retrieval problem, namely retrieving the documents relevant to a user's query from a document collection made up of web pages. A search engine's performance can therefore be measured with the parameters used for traditional information retrieval systems: recall and precision. Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the collection; it measures how completely the retrieval system (search engine) finds relevant material. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved; it measures how accurate the results are. A retrieval system generally cannot achieve both at once: when recall is high, precision tends to be low, and when precision is high, recall tends to be low. For this reason a system is often scored by the mean of its precision at 11 recall levels (11-point average precision). For search engines, recall is hard to calculate because no engine can collect every web page, so current systems focus mainly on precision. Many factors affect a search engine's performance. The most important is the information retrieval model, which covers how documents and queries are represented, the matching strategy used to evaluate the relevance of documents to a query, how query results are ranked, and the mechanism for user relevance feedback.
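The two measures and the 11-point average can be written out in a few lines. The sketch below assumes the common interpolated form of 11-point average precision (at each recall level 0.0, 0.1, ..., 1.0, take the best precision reached at that recall or higher); the document IDs and relevance judgments are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against the relevant set."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def eleven_point_average_precision(ranking, relevant):
    """Mean interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points = []  # (recall, precision) after each rank position
    hits = 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))
    interpolated = []
    for level in (i / 10 for i in range(11)):
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11

# Hypothetical ranked result list; d6 is relevant but was never retrieved.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
p, r = precision_recall(ranking, relevant)
avg = eleven_point_average_precision(ranking, relevant)
```

Here precision is 2/5 and recall is 2/3. Note that computing recall requires knowing every relevant document in the collection, which is exactly what is impractical on the Web.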




Main technology: A search engine consists of four parts: the searcher, the indexer, the retriever, and the user interface.




1. Searcher: The searcher roams the Internet, discovering and collecting information. It is usually a program that runs around the clock. It should collect as much new information of as many types as possible, as quickly as possible, and because information on the Internet changes quickly, it must also regularly refresh what it has already collected to avoid dead and invalid links. There are currently two strategies for gathering information. The first starts from a set of seed URLs and follows the hyperlinks on those pages, traversing the Web breadth-first, depth-first, or heuristically; the seed URLs can be arbitrary, but are often popular sites that contain many links (such as Yahoo!). The second divides the web space by domain name, IP address, or country-code domain, with each searcher responsible for exhaustively searching one subspace. The searcher collects many types of information, including HTML, XML, newsgroup articles, FTP files, word processing documents, and multimedia. Implementations often use distributed and parallel computing to speed up information discovery and updating. Commercial search engines can discover millions of pages per day.
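The first strategy, starting from seed URLs and following hyperlinks breadth-first, can be sketched as below. To keep the example runnable without network access, the Web is replaced by a small hypothetical link graph (`WEB`), and `fetch_links` is a stand-in for downloading a page and extracting its links.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to.
WEB = {
    "http://portal.example/": ["http://a.example/", "http://b.example/"],
    "http://a.example/": ["http://c.example/", "http://portal.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": [],
    "http://d.example/": ["http://a.example/"],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return WEB.get(url, [])

def crawl(seeds, max_pages=100):
    """Visit pages breadth-first from the seed URLs, each page at most once."""
    queue = deque(seeds)
    seen = set(seeds)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()  # FIFO order gives breadth-first traversal
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

visited = crawl(["http://portal.example/"])
```

Taking URLs from the other end of the queue (`queue.pop()`) would turn this into a depth-first traversal, and replacing the queue with a priority queue ordered by some score would give a heuristic one.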




2. Indexer: The indexer analyzes the information gathered by the searcher, extracts index terms from it, uses them to represent documents, and generates the index table for the document collection. Index terms are of two kinds: objective index terms, which are independent of the document's semantic content (author name, URL, update time, encoding, length, link popularity, and so on), and content index terms, which reflect the document's content (keywords and their weights, phrases, words, and so on). Content index terms are further divided into single terms and multi-word (phrase) terms. For English, a single term is a word, which is easy to extract because words are separated by spaces; for Chinese and other languages written without word separators, word segmentation is required. Search engines usually assign each single term a weight that indicates how well the term distinguishes the document, and this weight is also used to compute the relevance of query results. Methods for doing so include statistical, information-theoretic, and probabilistic approaches. Phrase terms are extracted with statistical, probabilistic, and linguistic methods. The index table generally takes some form of inverted list, in which an index term leads to the documents that contain it. The index table may also record where each term occurs within a document, so that the retriever can compute adjacency or proximity relationships between terms. Indexers can use centralized or distributed indexing algorithms. When the data volume is very large, indexing must be done in real time (instant indexing), otherwise it cannot keep pace with the rapid growth of information. The indexing algorithm strongly affects the indexer's performance, for example response time under peak query load. The effectiveness of a search engine depends largely on the quality of its index.
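A positional inverted list with a statistical term weight might look like the following sketch. It assumes whitespace-separated English text and uses TF-IDF as one example of a statistical weighting; the sample documents and the function names are illustrative only.

```python
import math
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def tf_idf(index, term, doc_id, n_docs):
    """Term frequency times log inverse document frequency."""
    postings = index.get(term, {})
    if doc_id not in postings:
        return 0.0
    tf = len(postings[doc_id])
    idf = math.log(n_docs / len(postings))
    return tf * idf

def adjacent(index, first, second, doc_id):
    """True if `second` directly follows `first` somewhere in the document."""
    pos_first = index.get(first, {}).get(doc_id, [])
    pos_second = set(index.get(second, {}).get(doc_id, []))
    return any(p + 1 in pos_second for p in pos_first)

docs = {
    1: "search engine technology",
    2: "distributed search systems",
    3: "engine design for parallel search engine",
}
index = build_index(docs)
weight = tf_idf(index, "engine", 3, len(docs))
```

A term such as "search" that occurs in every document gets weight 0, reflecting that it does nothing to distinguish one document from another. The stored positions are what make the adjacency check possible.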




3. Retriever: The retriever quickly looks up documents in the index according to the user's query, evaluates the relevance of each document to the query, ranks the results to be output, and implements some mechanism for user relevance feedback. Four kinds of information retrieval model are in common use: set-theoretic, algebraic, probabilistic, and hybrid models.
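As one instance of an algebraic model, the sketch below ranks documents by the cosine of the angle between term-count vectors for the query and each document (the vector space model). It is a simplified illustration that scans every document instead of consulting an index, and the sample documents are invented.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a text as a bag of lowercase words with counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, docs):
    """Return IDs of documents with non-zero similarity, best first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(text)), doc_id) for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

docs = {
    "d1": "search engine index",
    "d2": "parallel distributed processing",
    "d3": "search engine search results",
}
ranking = rank("search engine", docs)
```

A set-theoretic (Boolean) model would instead return the documents containing all query terms with no ordering among them, which is one reason ranking models matter when result sets are large.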




4. User interface: The user interface accepts the user's query, displays the query results, and provides the user relevance feedback mechanism. Its main purpose is to make the search engine convenient to use, so that users can obtain useful, timely information efficiently and in multiple ways. Its design and implementation draw on the theory and methods of human-computer interaction so as to fit the way people think. Query input interfaces are either simple or complex. A simple interface offers only a text box for entering a query string. A complex interface lets the user constrain the query with logical operators (AND, OR, NOT, +, -), proximity operators (adjacent, NEAR), domain ranges (such as .edu or .com), position of occurrence (such as title or body), date, length, and so on. Some companies and organizations are currently considering standards for query options.
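To make the + and - operators concrete, here is a small sketch of how a query string could be split into required, excluded, and optional terms and then applied to documents. The exact semantics vary between engines; the rules below (a bare term is optional, and with no required terms at least one optional term must match) are an assumption for illustration.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, excluded, optional

def matches(text, required, excluded, optional):
    """Apply the parsed constraints to one document."""
    words = set(text.lower().split())
    if not required <= words or excluded & words:
        return False
    # With no required terms, at least one optional term must appear.
    return bool(required) or bool(optional & words) or not optional

docs = {
    "d1": "robot search engine",
    "d2": "meta search engine",
    "d3": "catalog directory",
    "d4": "full-text search service",
}
required, excluded, optional = parse_query("+search -meta engine")
hits = [doc_id for doc_id, text in docs.items()
        if matches(text, required, excluded, optional)]
```

Domain, position, and date restrictions would be handled the same way, as additional filters checked against the objective attributes stored for each document.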




Future trends: Search engines have become a new area of research and development. Because they draw on theory and technology from information retrieval, artificial intelligence, computer networks, distributed processing, databases, data mining, digital libraries, natural language processing, and other fields, the area is both broad and challenging. And because search engines have a large number of users and considerable economic value, they attract great attention from computer science and the information industry worldwide. Research and development are currently very active, and several trends are worth noting.




1. Great attention to improving the precision of query results and the effectiveness of search: Users care less about how many results are returned than about whether the results match their needs. For a single query, a traditional search engine often returns hundreds of thousands or even millions of documents, and the user has to filter them. There are several ways to deal with excessive results. The first is to discover, by various means, the user's real intent that the query itself does not express. This includes using intelligent agents to track the user's retrieval behavior, analyzing user models, and using a relevance feedback mechanism in which the user tells the search engine which documents are relevant to the need (and how relevant) and which are not, so that the query is refined step by step over several interactions. The second is to classify the results with text categorization techniques and display the category structure with visualization techniques, so that users browse only the categories that interest them. The third is to cluster results by site or by content, reducing the total amount of information presented.
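The relevance feedback loop can be illustrated with Rocchio's method, a classic technique from the information retrieval literature; the article does not name a specific algorithm, so this is a substituted example. The refined query moves toward documents the user marked relevant and away from those marked irrelevant. The weights `alpha`, `beta`, and `gamma` and the sample texts are illustrative choices.

```python
from collections import Counter

def rocchio(query, relevant_docs, irrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Return refined term weights; terms whose weight is not positive are dropped."""
    weights = Counter()
    for term, w in query.items():
        weights[term] += alpha * w
    for doc in relevant_docs:
        for term, w in doc.items():
            weights[term] += beta * w / len(relevant_docs)
    for doc in irrelevant_docs:
        for term, w in doc.items():
            weights[term] -= gamma * w / len(irrelevant_docs)
    return {t: w for t, w in weights.items() if w > 0}

# The user searched for "jaguar" meaning the animal, then marked results.
query = Counter("jaguar".split())
relevant = [Counter("jaguar cat habitat".split()), Counter("jaguar cat".split())]
irrelevant = [Counter("jaguar car dealer".split())]
refined = rocchio(query, relevant, irrelevant)
```

After one round of feedback the refined query gives weight to "cat" and "habitat" in addition to "jaguar", which is how repeated interaction can surface intent that the original one-word query did not express.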




2. Intelligent-agent-based information filtering and personalized service: Information agents are another mechanism for making use of Internet information. An agent uses automatically acquired domain models (such as knowledge of the Web, of information processing, of information resources related to the user's interests, and of how the domain is organized) and user models (such as the user's background, interests, behavior, and style) to collect, index, and filter information (including both interest filtering and filtering of objectionable content), and automatically delivers to the user the information that is interesting and useful. Agents can learn continuously and adapt to changes in the information and in the user's interests, and so provide personalized service. They can run on the client side or on the server side.




3. Distributed architectures to improve system scale and performance: A search engine can be implemented with a centralized or a distributed architecture, and each approach has its own advantages. However, once the system reaches a certain scale (for example, when the number of web pages reaches a billion), some form of distribution becomes necessary to improve performance. Every component of a search engine except the user interface can be distributed. Searchers on multiple machines can cooperate and divide the work of information discovery among themselves, speeding up discovery and updating. Indexers can spread the index across different machines, reducing the demands the index places on any one machine. Retrievers can search documents in parallel on different machines, improving retrieval speed and performance.
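Two of these ideas, dividing crawl work among searchers and querying an index spread over several machines, can be sketched as follows. Assigning URLs by a hash of the host name is one common partitioning choice and is an assumption here, as are the function names; the shards are plain dictionaries standing in for index servers.

```python
import zlib
from urllib.parse import urlsplit

def assign_worker(url, n_workers):
    """Pick a worker for a URL from a stable hash of its host name."""
    host = urlsplit(url).netloc.lower()
    return zlib.crc32(host.encode("utf-8")) % n_workers

def partition(urls, n_workers):
    """Split a URL list into one bucket per worker."""
    buckets = [[] for _ in range(n_workers)]
    for url in urls:
        buckets[assign_worker(url, n_workers)].append(url)
    return buckets

def search_shards(shards, term):
    """Query every index shard and merge the partial posting lists."""
    hits = []
    for shard in shards:  # in a real system these lookups would run in parallel
        hits.extend(shard.get(term, []))
    return sorted(hits)

urls = ["http://a.example/1", "http://a.example/2",
        "http://b.example/", "http://c.example/x"]
buckets = partition(urls, 3)

# Each shard maps terms to the IDs of documents stored on that machine.
shards = [{"search": [1, 4]}, {"search": [2], "engine": [3]}]
doc_ids = search_shards(shards, "search")
```

Hashing on the host keeps all pages of one site with the same searcher, which avoids two workers fetching the same site at once. A stable hash such as CRC32 is used because Python's built-in `hash` for strings changes between runs.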




4. Attention to research and development in cross-language retrieval: In cross-language information retrieval, the user submits a query in his or her native language, and the search engine searches databases in multiple languages and returns documents in any language that answer the user's question. With machine translation added, the results can be displayed in the user's native language. The technology is still at a preliminary stage; the main difficulty lies in the uncertainty of how expressions and meanings correspond between languages. Nevertheless, with economic globalization and an Internet that crosses national borders, it is undoubtedly of great significance.




Academic research: Commercial development in the search engine field is currently very active. The major search engine companies have invested heavily in developing search engine systems, products with new features keep appearing, and search engines have become one of the industries of the information field. Against this background, academic research on search engine technology has drawn the attention of universities and research institutions. For example, Stanford University developed the Google search engine within its Digital Library project, carrying out in-depth research on efficient search of web information, document relevance evaluation, large-scale indexing, and other topics, with good results. Steve Lawrence and C. Lee Giles of the NEC Research Institute published articles reviewing search engine technology in Science and Nature in 1998 and 1999, respectively. The well-known information retrieval conference TREC added a Web Track in 1998 to examine how web documents differ in nature from other types of documents, and to test the performance of information retrieval algorithms on large web collections (such as 100 GB). The international conference on search engines sponsored by Infonortics has been held once a year since 1996 to summarize, discuss, and look ahead at search engine technology; its participants include well-known search engine companies and scholars from universities and research institutes, and it has done much to promote search engine technology. In addition, more and more articles on search engine research are being published at the International Web Conference and at the Human-Computer Interaction conference hosted by IEEE.

In China, universities and research institutes such as Peking University, Tsinghua University, and the National Intelligence Research Center are carrying out research on search engine technology and have developed several good systems. For example, the "Skynet" Chinese and English search engine (http://pccms.pku.edu.cn:8000/gbindex.htm), developed by the Computer Department of Peking University, has reached the technical level of mid-sized foreign search engine systems in both scale and performance. It provides domestic users with a very good Internet search service and has been well received by them.
