Research and Design of Meta Search Engine
Institute of Computing Technology Li Rui
Colin719@126.com
Abstract: This paper briefly introduces meta search engines and proposes a design concept for a meta search engine system. The system uses a feedback mechanism to learn and adjust its results online. The design covers a search syntax, an automatic scheduling mechanism for member search engines based on user preferences, and support for personalized services, and the key technologies for building such a system are described. Finally, the significance of the system and the problems that remain to be solved are analyzed.
Keywords: Internet; search engine; meta search engine; information search; syntax
I. Introduction
In the early stages of Internet development there were relatively few websites and webpages, so information was easy to find. With the rapid growth of the Internet, people increasingly rely on it to find the information they need. However, as the volume of content has exploded, finding a specific piece of information has become like looking for a needle in a haystack: ordinary users are lost in an ocean of information, a phenomenon described as "rich in information, poor in knowledge". The search engine is a technology designed to solve this problem. A search engine (SE) collects and discovers information on the Internet according to certain policies, then understands, extracts, organizes, and processes it, and provides users with a retrieval service, thereby serving as a means of information navigation.
There are many search engines on the Internet, including Google, Yahoo, AltaVista, Dogpile, and Baidu. According to how they collect information and provide services, search engines can be divided into three categories: directory-based search engines, represented by Yahoo (which has recently moved to full-text search technology); full-text search engines, represented by Google; and meta search engines, represented by Dogpile.
II. Meta Search Engine Overview
It is estimated that a single search engine covers at most 30-50% [3] of all Internet resources, so recall cannot be guaranteed. In addition, every search engine has its own index scope, features, usage, and intended user group, so for the same request the results returned by different search engines overlap by less than 34% [5], and precision cannot be guaranteed either. To obtain comprehensive and accurate results, a user therefore has to query several search engines and then compare, filter, and verify what they return. The meta search engine came into being to meet this need.
2.1 Definition
A meta search engine (MSE) is an engine that is built on top of independent search engines and calls them; it is also known as "the mother of search engines". Here "meta" carries the meaning of "overall" and "beyond". A meta search engine integrates, calls, controls, and optimizes the use of multiple independent search engines. Relative to the meta search engine, each independent search engine it can use is called a "source search engine" or "component search engine"; this paper also uses the term "member search engine". Functionally, the meta search engine works like a filtering channel: it takes the output of multiple independent search engines as input, applies operations such as extraction and elimination to form the final result, and returns that result to the user.
2.2 Typical workflow of a meta search engine
The typical workflow can be summarized as follows:
① The user enters a query through the unified query interface, and the meta search engine preprocesses it.
② The meta search engine selects several member search engines according to its member search engine scheduling mechanism.
③ For each selected member search engine, the meta search engine localizes the original query, converting it into the query format string that engine requires.
④ The formatted queries are sent to the member search engines, and the meta search engine waits for the results.
⑤ The results returned by each member search engine are collected.
⑥ The returned results are processed together, for example by eliminating duplicate links and dead links, to form the final result.
⑦ The final result is returned to the user in a chosen format.
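The seven steps can be pictured as a small pipeline. The following Python sketch is only an illustration under stated assumptions: the two member engines are hypothetical in-memory stubs, the scheduler simply selects every engine, and localization is the identity function. None of these names come from the design described in this paper.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for member search engines: each maps a query string
# to a ranked list of result URLs. Real engines would be remote services.
Engine = Callable[[str], List[str]]


def engine_a(query: str) -> List[str]:
    return ["http://example.com/1", "http://example.com/2"]


def engine_b(query: str) -> List[str]:
    return ["http://example.com/2", "http://example.com/3"]


ENGINES: Dict[str, Engine] = {"A": engine_a, "B": engine_b}


def preprocess(query: str) -> str:
    # Step 1: placeholder preprocessing, here only case and whitespace cleanup.
    return " ".join(query.lower().split())


def select_engines(query: str) -> List[str]:
    # Step 2: a trivial scheduler that selects every registered engine.
    return sorted(ENGINES)


def localize(query: str, engine_name: str) -> str:
    # Step 3: convert to the engine's own query format (identity in this sketch).
    return query


def meta_search(raw_query: str) -> List[str]:
    query = preprocess(raw_query)
    merged: List[str] = []
    seen = set()
    for name in select_engines(query):
        results = ENGINES[name](localize(query, name))  # steps 4 and 5
        for url in results:  # step 6: drop duplicate links
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged  # step 7: hand the final list back to the caller


print(meta_search("  Meta   Search "))
```

The engines are called one after another here to keep the control flow visible; a real system would contact them concurrently.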
2.3 Features of a meta search engine
Unlike independent search engines, meta search engines have the following features:
① They do not need to build a large web database, which saves storage.
② They provide a unified interface for submitting a query to multiple independent search engines.
③ They perform secondary processing on the results of independent search engines.
④ They indicate the source search engine and the local relevance of each result record, and provide a global relevance.
III. Development Trend of Meta Search Engines
Meta search is currently a very active area of research and development. It is a comprehensive and challenging subject that draws on theories and technologies from information retrieval, artificial intelligence, databases, data mining, natural language understanding, and other fields. Because search engines have a large number of users, they have created many business opportunities and considerable economic value; the global market is estimated to be worth billions of dollars. This has attracted close attention from the computer science community, the information industry, and the business community around the world, which have invested substantial manpower and resources and have achieved notable results.
An ideal meta-search engine must meet the following functional requirements:
① It covers a large number of search resources, lets the user select and call independent search engines freely, and can schedule them automatically according to a scheduling policy.
② It offers as many options as possible, such as resource type selection (websites, webpages, news, software, FTP, MP3, Flash, images, videos, etc.), wait time control, control of the number of returned results, result time range selection, filter selection, and result display mode selection.
③ It has powerful query processing functions (such as Boolean search, phrase search, and natural language search) and can translate between the syntaxes of different search engines; for example, for a search engine that does not support the "near" operator, "near" can be automatically converted to "and".
④ It describes search results in detail (for example webpage name, URL, summary, source search engine, and the relevance between the result and the user's request).
⑤ It supports searching in multiple languages, such as Chinese and English.
⑥ It can classify results automatically, for example by domain name, country, resource type, or region.
⑦ It can provide personalized services for different users.
There are many meta search engines on the Internet today. Each has its own focus in the functions it implements, and few come close to the "ideal". Some do well in certain respects but have defects or room for improvement in others; for example, most meta search engines do not support natural language retrieval or Chinese retrieval. The functions of a meta search engine are constrained both by the source search engines and by meta search technology. On the one hand, the powerful features of a source search engine are limited by the meta search engine and cannot be fully exploited. On the other hand, no meta search technology can discover and use all the functions of an independent search engine.
As new technologies emerge, meta search engines will improve and achieve higher user satisfaction. These technologies include:
1. Improving the search engine's intelligent understanding of users' questions, reflected in support for natural language queries.
2. Defining the scope of information collection to make search more targeted, reflected in topic search and multimedia search.
3. Information filtering and personalized services based on intelligent agents.
4. Research and development of cross-language search [9], providing multi-language search support and localized search services.
5. Improving the accuracy of query results and the effectiveness of search.
IV. Design Concept of a Meta Search Engine
Based on the above research, we propose a design concept for a meta search engine. The design adopts a feedback mechanism, but we have not refined every step in detail. We give an overall framework, analyze the function and implementation technology of each functional module in the architecture, and offer several optional technologies. A subset of the modules can be implemented to reduce system complexity, and further modules can be added to extend the system; in other words, the design is highly scalable.
4.1 System architecture framework
4.2 Introduction to functional modules
4.2.1 Graphical user interface (GUI)
This part is the interface between the program and the user. It accepts the user's original query and displays the final results. It can be implemented with several kinds of interface, such as a command line interface or a GUI. Because this part does no data processing, several views can be provided over the same data. Various human-computer interaction technologies can be considered for submitting the user's query to the system.
On this page the user can configure the member search engine list, including the maximum wait time and the number of returned results for each member search engine, as well as the display method, sorting policy, and classification method. These settings can be stored in a cookie on the client, so that users do not have to re-enter them every time; this also supports personalized services. Cookies can also store the user's search records, so that search history and search habits can be mined for patterns.
4.2.2 Query preprocessor
This part accepts the original query sent from the GUI and preprocesses it, providing functions such as cross-language retrieval and natural language support. It relies on a query syntax and a set of operation rules. The query syntax and operation rules we designed are as follows:
Boolean logical operations
These include and, or, not, and (), which are the most basic and most commonly used syntax rules:
and: the search result must contain all the keywords. It can be replaced by '+' (plus sign) or a space.
or: the search result must contain at least one of the keywords. It can be replaced by a comma.
not: results containing the keyword after not are excluded. It can be replaced by '-' (minus sign) or '!' (exclamation mark). For example, a search for jfc not MFC returns results that contain jfc but not MFC.
(): sets precedence, in the same way as parentheses in arithmetic.
Other simple and commonly used syntax rules
"": supports phrase search. The search engine treats the keywords enclosed in "" as one complete phrase. For example, to search for information about search engines, enter "search engine"; the search engine then searches for "search engine" as a phrase. Without the quotation marks, it returns every page that contains both search and engine, many of which are clearly irrelevant.
Wildcards stand in for one or more characters, similar to regular expressions. '*' represents any number of characters, and '?' represents any single character at the current position.
Common advanced search syntax rules
near: requires the keywords to appear within the same region of the text. The keywords need not be adjacent; the smaller the interval, the closer together they are. near/N controls the interval, where N is a number: the keywords may be at most N words apart.
intitle: searches for keywords only in the title.
inurl: searches for keywords only in the URL.
insite: searches only resources within a given site.
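To make the operator shorthands concrete, the sketch below normalizes a raw query into a list of canonical tokens. It is a minimal illustration, not the authors' parser: the token names AND, OR, and NOT, the function names, and the decision to treat '-' and '!' as operators only at the start of a word are all assumptions made here. near, wildcards, and the field operators are passed through unchanged.

```python
import re
from typing import List

OPERATORS = {"AND", "OR", "NOT"}

# A token is a quoted phrase, a parenthesis or comma, or a run of other characters.
TOKEN = re.compile(r'"[^"]*"|[(),]|[^\s(),"]+')


def insert_implicit_and(tokens: List[str]) -> List[str]:
    # A space between two operands means "and", so make that explicit.
    out: List[str] = []
    for tok in tokens:
        if out:
            prev = out[-1]
            prev_ends_operand = prev == ")" or (prev not in OPERATORS and prev != "(")
            tok_starts_operand = tok == "(" or (tok not in OPERATORS and tok != ")")
            if prev_ends_operand and tok_starts_operand:
                out.append("AND")
        out.append(tok)
    return out


def tokenize(query: str) -> List[str]:
    tokens: List[str] = []
    for raw in TOKEN.findall(query):
        if raw in ("(", ")"):
            tokens.append(raw)
        elif raw == ",":
            tokens.append("OR")
        elif raw.lower() in ("and", "or", "not"):
            tokens.append(raw.upper())
        elif raw == "+":
            tokens.append("AND")
        elif raw in ("-", "!"):
            tokens.append("NOT")
        elif raw[0] in "-!":
            tokens.extend(["NOT", raw[1:]])  # -word or !word
        elif raw[0] == "+":
            tokens.extend(["AND", raw[1:]])  # +word
        else:
            tokens.append(raw)
    return insert_implicit_and(tokens)


print(tokenize('"search engine" java, python'))
print(tokenize("(a b) !c"))
```

Keeping the output as a flat token list leaves precedence handling to a later stage, which keeps the example short.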
A user's query can be described by the following parts: the keywords to include, the keywords to exclude (exclude), the keywords of which any one may match (any), the phrases or sentences that must all be included (all), and the area, field, topic, and position of the query. The original query string sent from the GUI is handled as follows:
1. Perform natural language parsing and query the database. If a matching answer is found, return it to the user.
2. Scan the query string according to the search syntax rules to produce a formatted query string, that is, determine which parts must all be included and which parts must be excluded.
3. Read the stop words from the database and compare them with the formatted query string, removing keywords that are clearly not worth searching for.
4. Perform stemming on the keywords in the formatted string. This step can be left to the member search engines to reduce processing complexity.
5. Derive the domain, topic, region, and location of the query from the keywords.
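Steps 2 and 3 can be sketched as follows. This is a simplified illustration under several assumptions: the stop word list is a hypothetical in-memory sample rather than a database table, the class and field names are invented for the example, parentheses are ignored, and the natural language, stemming, and domain-detection steps are left out.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Hypothetical sample; the design reads the real list from the database.
STOP_WORDS: Set[str] = {"a", "an", "the", "of", "to", "in"}


@dataclass
class FormattedQuery:
    all_words: List[str] = field(default_factory=list)  # must all appear
    any_words: List[str] = field(default_factory=list)  # at least one appears
    exclude: List[str] = field(default_factory=list)    # must not appear
    phrases: List[str] = field(default_factory=list)    # exact phrases


def format_query(raw: str) -> FormattedQuery:
    fq = FormattedQuery()
    # Pull out quoted phrases first so their words are not treated as keywords.
    fq.phrases = re.findall(r'"([^"]+)"', raw)
    rest = re.sub(r'"[^"]*"|[()]', " ", raw)
    words = rest.replace(",", " or ").lower().split()
    negate = after_or = False
    last: Optional[List[str]] = None  # list that received the previous keyword
    for word in words:
        if word in ("and", "+"):
            continue
        if word == "or":
            # The keyword just before "or" becomes optional as well.
            if last is fq.all_words and fq.all_words:
                fq.any_words.append(fq.all_words.pop())
            after_or = True
            continue
        if word in ("not", "-", "!"):
            negate = True
            continue
        if word[0] in "-!":
            word, negate = word[1:], True
        elif word[0] == "+":
            word = word[1:]
        if word and word not in STOP_WORDS:  # step 3: drop stop words
            last = fq.exclude if negate else fq.any_words if after_or else fq.all_words
            last.append(word)
        negate = after_or = False
    return fq


print(format_query('the "search engine" java, python -MFC'))
```

Keywords are lower-cased for comparison with the stop word list, while phrases are kept exactly as typed.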
4.2.3 Member search engine scheduler
When the program starts, several member search engines are selected by default based on the user's search history and habits. Users who are not satisfied can set the member search engine list themselves. In addition, the program has its own automatic scheduling mechanism, which generates a list of suitable member search engines from the topic, domain, region, and other information of the user's query, together with the past performance of the member search engines (response time, number of returned results, user satisfaction, domain focus, and supported advanced search functions).
Because information about member search engines (especially the format of their query strings) changes often, it is unreasonable to hard-code it in the meta search engine program. We therefore use a member search engine description file, written in XML, that describes each engine formally. To add a new member search engine to the system, we only need to create a description file for it in this form.
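A description file of this kind might look like the XML embedded in the sketch below, which also shows how it could be loaded and used to build a request URL. The element and attribute names, the engine name, and the host are all hypothetical; the paper does not specify the actual schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict
from urllib.parse import urlencode

# Hypothetical description file for one member search engine.
DESCRIPTION = """
<engine name="ExampleSearch">
  <host>search.example.com</host>
  <path>/search</path>
  <param role="query">q</param>
  <param role="count">num</param>
  <supports phrase="true" near="false" wildcard="false"/>
  <timeout seconds="5"/>
</engine>
"""


@dataclass
class EngineDescription:
    name: str
    host: str
    path: str
    params: Dict[str, str]    # logical role -> engine-specific parameter name
    supports: Dict[str, bool]  # advanced search features the engine accepts
    timeout: float             # maximum wait time in seconds


def load_description(xml_text: str) -> EngineDescription:
    root = ET.fromstring(xml_text)
    return EngineDescription(
        name=root.get("name"),
        host=root.findtext("host"),
        path=root.findtext("path"),
        params={p.get("role"): p.text for p in root.findall("param")},
        supports={k: v == "true" for k, v in root.find("supports").attrib.items()},
        timeout=float(root.find("timeout").get("seconds")),
    )


def build_url(desc: EngineDescription, query: str, count: int = 10) -> str:
    args = {desc.params["query"]: query, desc.params["count"]: count}
    return f"http://{desc.host}{desc.path}?{urlencode(args)}"


desc = load_description(DESCRIPTION)
print(build_url(desc, "meta search", 20))
```

With this arrangement, adding an engine means adding a file; no code in the scheduler or the query proxy has to change.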
4.2.4 Query distributor
The query distributor receives the scheduling list generated by the member search engine scheduler, connects to the database, and reads the information for the listed member search engines, including host information, connection information, and query parameter string formatting information. Based on this information it starts several threads concurrently, each of which connects to one member search engine and sends it the query produced by the query preprocessor. A large part of this function is database access. In principle each query proxy could connect to the database itself, but to reduce the number of database connections this work is centralized here: connect once, read everything, and reuse it many times.
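The fan-out to member search engines can be sketched with a thread pool. In this illustration the engines are hypothetical local functions (one of them deliberately slow), the database lookup is omitted, and a single overall wait limit stands in for the per-engine wait times described earlier.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Dict, List


def fast_engine(query: str) -> List[str]:
    return [f"fast:{query}"]


def slow_engine(query: str) -> List[str]:
    time.sleep(0.6)  # simulates a member engine that answers too late
    return [f"slow:{query}"]


def distribute(
    query: str,
    engines: Dict[str, Callable[[str], List[str]]],
    max_wait: float,
) -> Dict[str, List[str]]:
    """Send the query to every engine in parallel and keep the timely answers."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(engines)))
    futures = {pool.submit(fn, query): name for name, fn in engines.items()}
    done, _not_done = wait(futures, timeout=max_wait)
    results: Dict[str, List[str]] = {}
    for fut in done:
        if fut.exception() is None:  # ignore engines that raised an error
            results[futures[fut]] = fut.result()
    pool.shutdown(wait=False)  # do not block on engines that timed out
    return results


print(distribute("meta search", {"fast": fast_engine, "slow": slow_engine}, max_wait=0.1))
```

Engines that miss the deadline are simply left out of the result, which matches the "give up after the timeout" behaviour of the query proxy.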
4.2.5 Query proxy
The query proxy is the interface between the meta search engine and one specific member search engine. It first receives the query format string from the query distributor, then obtains the query parameter information for its engine from the query distributor, and converts the query format string into the format that engine expects. One detail needs attention here: some member search engines do not support some of the advanced search functions of the meta search engine, such as phrase search and wildcards. In that case the corresponding request information is removed from the original query string.
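The handling of unsupported features could look like the following sketch. It takes a list of query tokens and a capability table for one engine; both the token format and the capability keys ("near", "phrase", "wildcard") are assumptions made for the example. Replacing near with AND follows the conversion mentioned in Section III, while reducing a phrase to its separate words and stripping wildcard characters is one possible reading of "removing the request information".

```python
import re
from typing import Dict, List

NEAR = re.compile(r"near(/\d+)?", re.IGNORECASE)


def localize(tokens: List[str], supports: Dict[str, bool]) -> str:
    """Rewrite a token list so that it uses only features the engine supports."""
    out: List[str] = []
    for tok in tokens:
        is_phrase = len(tok) >= 2 and tok.startswith('"') and tok.endswith('"')
        if NEAR.fullmatch(tok):
            # Fall back from proximity search to plain conjunction.
            out.append(tok if supports.get("near") else "AND")
        elif is_phrase:
            if supports.get("phrase"):
                out.append(tok)
            else:
                out.extend(tok[1:-1].split())  # keep the words, drop the quotes
        elif "*" in tok or "?" in tok:
            if supports.get("wildcard"):
                out.append(tok)
            else:
                bare = tok.replace("*", "").replace("?", "")
                if bare:
                    out.append(bare)
        else:
            out.append(tok)
    return " ".join(out)


tokens = ['"search engine"', "NEAR/5", "java*"]
print(localize(tokens, {"phrase": False, "near": False, "wildcard": False}))
print(localize(tokens, {"phrase": True, "near": True, "wildcard": True}))
```

The capability table would normally come from the member search engine description file rather than being written inline.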
The localized query is then sent to the member search engine, and the proxy waits for the results. Because a service is sometimes unavailable, the proxy can first test whether the server is reachable, using something similar to the ping command, before sending the query. It sets a wait time threshold once the connection starts and gives up after the timeout.
After receiving the response, the proxy uses an HTML parser to extract the results from the results page. Each result must contain the following information: the link, the member search engine that returned it, its rank in that member search engine, the site of the target page, the description of the target page, and the anchor text.
4.2.6 Integrated processing module
This is the core module of the meta search engine, and the efficiency of the meta search engine depends closely on how it is implemented. It consists of several sub-modules, described below; for specific implementation techniques, see section 4.3.
The result collection module receives the results returned by the member search engines concurrently and presents the results from the first member search engine to respond, to reduce the user's waiting time.
The webpage filtering module removes duplicate links from the returned results according to the criteria for duplicate results, and removes unwanted links according to the user's resource requirements, time restrictions, domain restrictions, and other constraints.
The webpage sorting module merges the search results using a result fusion technique.
The comprehensive processing module submits the final results to the GUI, which presents them to the user. This module is also responsible for evaluating the search using the search evaluation mechanism and for recording the search in the client's cookie.
4.2.7 Database
The database here is a general concept: it covers both an actual database and configuration files and settings, and stores the data the system needs at run time. This includes answers to natural language questions, member search engine information (host information, function information, parameter information, search performance information), user information (search history, personalized settings, personal information, etc.), the stop word list, and vocabulary (synonyms, antonyms, translation information, domain information, topic information, etc.). In a specific implementation, some of this information can be stored on the client to reduce storage pressure on the server.
4.3 Key Technologies in implementation
Criteria for duplicate results [12]
The link, anchor text, and description of search results can be used to determine whether two results are duplicates. We apply the following policies:
1. First, check whether the two results have the same hyperlink. If so, they are considered the same result.
2. Compare the URLs. If the host IP address, path, and file name are identical, they are considered the same result.
3. Compare the meta information of the documents, such as title, author, abstract, and size. Results whose similarity exceeds a threshold are considered the same. This step can be skipped to speed up the system response.
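The first two criteria can be approximated by comparing normalized URLs, as in the sketch below. Several simplifications are assumptions of this example rather than part of the criteria: host names are compared instead of resolved IP addresses (to avoid network lookups), a leading "www." and a trailing slash are ignored, and the query string is kept in the comparison key so that distinct dynamic pages are not merged. The third criterion is not covered.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlsplit


def url_key(url: str) -> Tuple[str, str, str]:
    """Reduce a URL to a comparison key of (host, path, query)."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return host, path, parts.query


def remove_duplicates(results: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first result for each distinct URL key, preserving order."""
    seen = set()
    unique: List[Dict[str, str]] = []
    for result in results:
        key = url_key(result["url"])
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique


hits = [
    {"url": "http://WWW.Example.com/a/", "engine": "A"},
    {"url": "http://example.com/a", "engine": "B"},
    {"url": "http://example.com/b", "engine": "B"},
]
print(remove_duplicates(hits))
```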
Result fusion technology [6, 13, 14]
From the working principle of the meta search engine it is clear that result fusion is crucial, and many methods have been proposed. A simple method is to present the results of the fastest-responding search engine first and display the results returned by each search engine without any processing. A more sophisticated approach fuses the results according to some policy.
In [13], Zhang Weifeng et al. give four synthesis algorithms, and in [14], J. P. Callan et al. give four typical synthesis algorithms for different situations; see the references for details. In essence, result fusion is a re-ranking of the search results. We propose a technique based on this view: the importance of a search result depends on three things: the number of member search engines that retrieved it, its ranking position in each of those member search engines, and the performance evaluation of each of those member search engines. Suppose a result is retrieved by m member search engines, its position in the i-th member search engine is R_i, and the performance evaluation of the i-th member search engine is W_i. The final weight P of the result is then:
$$P = \sum_{i=1}^{m} R_i \cdot W_i$$
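A small worked example may help. The sketch below computes P for each result and sorts by it. One point is an assumption of this example and is not stated in the text: so that a larger P means a better result, the position term is taken as a score of 1/position (first place scores 1, second place 0.5, and so on) rather than the raw position number. The engine names and weights are invented.

```python
from collections import defaultdict
from typing import Dict, List


def fuse(ranked_lists: Dict[str, List[str]], weights: Dict[str, float]) -> List[str]:
    """Merge ranked lists by the weighted sum P = sum_i R_i * W_i."""
    scores: Dict[str, float] = defaultdict(float)
    for engine, urls in ranked_lists.items():
        for position, url in enumerate(urls, start=1):
            rank_score = 1.0 / position  # assumed form of R_i
            scores[url] += rank_score * weights[engine]
    # Highest weight first; ties are broken by URL to keep the order stable.
    return sorted(scores, key=lambda u: (-scores[u], u))


ranked = {"A": ["x", "y"], "B": ["y", "z"]}
weights = {"A": 1.0, "B": 0.8}
print(fuse(ranked, weights))
```

Here y scores 0.5 × 1.0 + 1.0 × 0.8 = 1.3, x scores 1.0, and z scores 0.4, so a result found by two engines overtakes one that a single engine ranked first.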
Depending on user settings, the search results can be processed further: checking whether each target page still exists in order to eliminate dead links; retrieving the target page of each result for text analysis, in order to judge relevance more accurately and to provide web snapshots; and classifying the processed results by domain, topic, site, and so on.
Effective information extraction technology
After the results are received from a member search engine, an important problem is how to extract the desired search results from the results page. Member search engines use different technologies and their result pages are structured quite differently, which makes correct extraction difficult. Our approach rests on one observation: the search results are generated dynamically, so they must be wrapped, that is, a header and a tail can be found, and the content between them is what we need. The current method is to find the header and tail manually and record them in the configuration information; the query proxy then extracts the desired results based on that information.
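The header-and-tail idea can be sketched in a few lines. The markers and the sample page below are invented for illustration, and links are pulled out with a simple regular expression to keep the example short; a real query proxy would use an HTML parser, as described in section 4.2.5.

```python
import re
from typing import List, Tuple

# Matches <a href="...">anchor text</a> inside the results block.
LINK = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def extract_results(page: str, header: str, tail: str) -> List[Tuple[str, str]]:
    """Return (url, anchor text) pairs found between the header and the tail."""
    start = page.find(header)
    if start == -1:
        return []
    start += len(header)
    end = page.find(tail, start)
    if end == -1:
        return []
    return LINK.findall(page[start:end])


# Hypothetical results page; the markers would come from configuration.
PAGE = (
    '<html><a href="/ad">Ad</a><!--results-->'
    '<a href="http://a.example/">A</a> <a href="http://b.example/">B</a>'
    '<!--end--><a href="/next">Next</a></html>'
)
print(extract_results(PAGE, "<!--results-->", "<!--end-->"))
```

Links outside the marked region, such as advertisements and navigation, are never seen by the extractor.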
There is also a statistical approach that uses artificial intelligence techniques to give the system a self-learning capability, so that the extraction information for a member search engine can be derived automatically without manual intervention.
Google currently provides web services from which the relevant information (such as search results, response time, number of results, and document relevance) can be obtained directly. However, only registered users can use the service without restrictions. This may be a better solution, because independent search service providers know their own systems and technologies best and can provide the results we need more directly.
Member search engine scheduling mechanism and performance evaluation mechanism [3, 12]
Member search engine scheduling is a core technology of the meta search engine: it determines which member search engines a user query should be sent to in order to obtain good search results. Each member search engine has its own text database made up of a set of documents. For each query, the scheduling technology produces a list of the member search engines most likely to contain useful documents, which is crucial to the efficiency of the meta search engine. Four methods are mentioned in [12]: the simple algorithm, the qualitative method, the quantitative method, and the learning-based method.
To schedule member search engines automatically, we use user feedback to implement a learning-based scheduling mechanism. This requires a performance evaluation mechanism for member search engines.
The evaluation is based on recorded data from many retrieval activities, including response time and the number of results returned. The most important component is document relevance, which can be submitted by users (the best method, although users generally search without giving feedback) or obtained by tracking which links users click. The evaluation is carried out at several levels: per search result, per search activity, per search term, per search domain, and overall search performance.
For a user's search request, if the system has evaluation data for each search engine on the requested keyword, it selects the search engines with the best evaluation for that keyword. Otherwise, the system determines the domain to which the request belongs; if it has evaluation data for each member search engine in that domain, it selects the best search engines for the domain. Otherwise, it selects the search engines with the best overall search performance. If the user asks for the fastest speed or the largest number of returned results, the search engines that rank highest on that measure are selected.
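The fallback from keyword-level to domain-level to overall evaluation can be written as a short cascade. The score tables, engine names, and the parameter k below are hypothetical, and the user overrides for fastest speed or most results are left out of this sketch.

```python
from typing import Dict, List, Optional

Scores = Dict[str, float]  # engine name -> evaluation score (higher is better)


def select_engines(
    keyword: str,
    domain: Optional[str],
    keyword_scores: Dict[str, Scores],
    domain_scores: Dict[str, Scores],
    overall_scores: Scores,
    k: int = 2,
) -> List[str]:
    """Pick the k best engines using the most specific evaluation data available."""

    def top(scores: Scores) -> List[str]:
        return sorted(scores, key=lambda e: (-scores[e], e))[:k]

    if keyword in keyword_scores:  # evaluation data for this keyword
        return top(keyword_scores[keyword])
    if domain is not None and domain in domain_scores:  # fall back to the domain
        return top(domain_scores[domain])
    return top(overall_scores)  # last resort: overall performance


KEYWORD_SCORES = {"java": {"A": 0.9, "B": 0.4, "C": 0.7}}
DOMAIN_SCORES = {"programming": {"A": 0.5, "B": 0.8, "C": 0.6}}
OVERALL_SCORES = {"A": 0.6, "B": 0.7, "C": 0.9}

print(select_engines("java", "programming", KEYWORD_SCORES, DOMAIN_SCORES, OVERALL_SCORES))
print(select_engines("rust", "programming", KEYWORD_SCORES, DOMAIN_SCORES, OVERALL_SCORES))
print(select_engines("tea", None, KEYWORD_SCORES, DOMAIN_SCORES, OVERALL_SCORES))
```

As feedback accumulates, more queries are answered from the keyword and domain tables and fewer fall through to the overall ranking.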
V. Analysis of the Application of Search Engines in E-commerce
Search engines can also be extended into e-commerce. Many websites now rely on Google's paid ranking service to run their businesses, and this is also one of the revenue channels of some search service providers. In addition, a user's registration information, search history, keyword domains, search habits, access records, and so on can reveal the user's potential purchasing interests and the products the user cares about. E-commerce sites can use this information to discover potential customers and to send them regularly updated lists of products they are interested in.
VI. Summary
The Internet is a huge source of information and is expanding rapidly. People now tend to look for the information they need on the Internet. Search engines, developed from the research results of the relevant disciplines, are the most commonly used tools for taking advantage of its rich resources.
An efficient meta search engine built on top of existing independent search engines can extend their processing capability, increase recall, and may also improve precision. However, the autonomy of member search engines makes integration difficult, mainly because of differences in search interfaces, document indexing methods, related functions, query parameters, and search functions. The system we designed absorbs some of the strengths of successful systems and has its own features: it provides its own search syntax; it evaluates the search performance of member search engines; it includes an automatic scheduling mechanism for member search engines; it uses search engine description files to make the system highly scalable; it provides a result fusion algorithm; and it accepts user feedback for autonomous learning and adjustment, which makes the system adaptive.
VII. Follow-up work
We chose Java as the programming language for implementing the system. We have analyzed the system structure using object-oriented software engineering theory, implemented the functions of some of the classes in Java, and defined several interfaces. The next step is to implement the entire system and to further optimize and improve some of the algorithms. Functional modules can also be added as needed to make the system more powerful and robust.
[References]
[1] Lawrence S., Giles C. L. Accessibility of information on the Web [J]. Nature, 1999, 400 (7): 7-109
[2] Barker J. Meta-search engines. Teaching Library Internet Workshops, University of California, Berkeley [EB/OL]. http://www.lib.berkeley.edu/TeachngLib/Guides/Internet/MetaSearch.html, April 2000
[3] Chen Junjie, Xue Yun, Song Hantao, Lu Yuchang, Yu Xueli. Agent-based meta-search engine research and design. Computer Engineering and Application, 2003, Vol. 10
[4] Min Zhiyu. Integrated search engines and meta search engines. http://www.sowang.com/zhuanjia/xzhy.htm
[5] Zhang Weifeng, Xu Baowen, Zhou Xiaoyu, Li Dong, Xu Lei. Meta search engine research. Computer Science, 2001, Vol. 28
[6] Yang Xiaohua, Liu Zhenyu, Tan Minsheng, Liu Jie, Zhang Minjie. Constraints of the synthesis algorithm of the meta-search engine system. Journal of Software, 2002, Vol. 14, No. 7
[7] Li Xiaoming, Liu Jianguo. Search engine technologies and trends.
[8] Yang M. H., Yang C. C., Chung Y. M. A natural language processing based Internet agent [J]. System Man and Cybernetics, IEEE, 1997 (1): 100-105
[9] Zhang Yongkui, Wang Shufeng. Research progress of cross-language information retrieval. Computer Engineering and Application, 2002, Vol. 19
[10] Building efficient and effective metasearch engines. ACM Computing Surveys, Vol. 34, No. 1, March 2002
[11] Xue Hongming. Search engines and Internet information acquisition. Yancheng Normal University
[12] Peng Honghui, Lin Zuoying. iSeeker: an efficient meta-search engine. Computer Engineering, 2003, Vol. 29, No. 10
[13] Zhang Weifeng, Xu Baowen, Zhou Xiaoyu, Xu Lei, Li Dong. Research on results generation technology of meta-search engines. Small Computer System, Vol. 24, No. 1
[14] Callan J. P., Lu Z., Croft W. B. Searching distributed collections with inference networks. In: Fox E. A., Ingwersen P., Fidel R., eds. Proceedings of the 18th International Conference on Research and Development in Information Retrieval. ACM Press, 1995. 21-28