Research on distributed search engines


Shizenhua Yang Yanjuan


(Department of Computer Science and Technology, Nanjing 210094)

 

This paper introduces distributed search engine technology and applies it to distance education. To centralize, classify and organize the information resources currently available in the education field, we build a WWW-based information resource navigation library so that users can quickly find resources according to their own needs, which improves the efficiency and precision of information retrieval in distance education.

Keywords: distributed, search engine, distance education, HJ-YHS

With the rapid development of the Internet, the World Wide Web (WWW) has become a huge information space that provides users with valuable information resources. Faced with so many resources, browsing step by step through a browser has become very inconvenient, and how to obtain the required information from the WWW quickly and accurately has become a crucial issue. The advent of search engines has greatly improved people's ability to gather information. However, existing search engines still have problems and difficulties in search efficiency, information maintenance, duplicated information, and network and site load.

Architecturally, most current search engines are centralized: pages are retrieved from the Internet, analyzed and processed, all index information is stored at one site, and users run their queries by accessing that site. These engines usually do not collaborate. Each one searches and processes information independently, which results in much duplicated work, serious waste of bandwidth, and sometimes even network congestion. Because this architecture adapts poorly to the growing scale of the network, the industry has proposed building distributed search engines.

1 Distributed search engine

A distributed search engine divides the whole network into several autonomous regions according to criteria such as geography, subject, or IP address, and sets up a retrieval server in each region. Each retrieval server consists of three parts: an information search robot, an index database, and an agent. The robot collects information within its own autonomous region, and the resulting index information is stored in the index database. The agent provides the query interface to users and exchanges information with the agents of the other retrieval servers. Queries can be redirected: if the local index database cannot satisfy a query, the agent can send the request to another retrieval server.
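
The redirection behaviour described above can be illustrated with a minimal Python sketch. It is not the implementation of any system discussed in this paper: the class name `RetrievalServer`, its fields, and the example URL are hypothetical, the robot is reduced to an `add_page` method, and the index is an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalServer:
    """One autonomous region: a local index plus an agent that can redirect.

    All names here are illustrative assumptions, not taken from the paper.
    """
    name: str
    index: dict = field(default_factory=dict)  # keyword -> list of URLs
    peers: list = field(default_factory=list, repr=False, compare=False)

    def add_page(self, url, keywords):
        # Stand-in for the search robot: record which keywords a page contains.
        for word in keywords:
            self.index.setdefault(word.lower(), []).append(url)

    def query(self, keyword, visited=None):
        # The agent answers from the local index when it can.
        visited = visited if visited is not None else set()
        visited.add(self.name)
        hits = self.index.get(keyword.lower(), [])
        if hits:
            return self.name, hits
        # Otherwise it redirects the query to peers not yet asked.
        for peer in self.peers:
            if peer.name not in visited:
                answered_by, hits = peer.query(keyword, visited)
                if hits:
                    return answered_by, hits
        return None, []
```

The `visited` set keeps a query from circulating forever when agents point at each other; a real deployment would also need timeouts and a policy for choosing which peer to ask first.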

1.1 Distributed search engine architecture

Compared with a centralized search engine, a distributed search engine has the following advantages:

The retrieval servers share resources with each other, and each site supplies information only to the information search robot of its own autonomous region, which reduces the load on the network and on each site.

Collaboration among agents and the redirection of queries make the service more complete.

It suits the distributed nature of the Web itself, scales well, and is easy to maintain.

The index information is divided among the individual index databases, so each database is relatively small and query response time is relatively short.

If some retrieval servers fail, the others can continue to work correctly.

A Web server cluster is a typical distributed processing system. A Web cluster uses a high-speed network to connect a number of originally independent servers so that they provide service as a whole. Incoming requests are assigned to the backend servers in the cluster, which share the load and I/O and improve performance through parallel processing. This raises two technical problems: the request allocator and load balancing.

All user requests are gathered at the request allocator, which then assigns them to the retrieval servers for parallel processing. At present, the main implementation techniques are IP address translation, TCP proxy, dynamic DNS and HTTP redirection; typical corresponding products include Cisco's LocalDirector and DistributedDirector, IBM's Network Dispatcher, and the NCSA Scalable Web Server at UIUC. Request allocators most often use the TCP proxy technique. For each request, the client establishes a separate TCP connection, which the server tears down after the response is complete. With TCP proxying, the request allocator mainly performs the following tasks:

Receive the user's request and forward it to a retrieval server;

Receive the query results returned by the retrieval server and forward them to the client;

If one end closes its TCP connection, immediately close the TCP connection at the other end.

To increase the efficiency and throughput of the request allocator, multithreading and multiplexed blocking I/O are used, as in Microsoft Internet Information Server and Netscape Enterprise Server. As soon as the allocator's connection-listening process accepts a TCP connection from a client, it creates a worker thread in memory. All subsequent work, including establishing a TCP connection with the server, receiving and sending data, and tearing down the connections, is done by that thread. When the response is complete, the request allocator immediately removes the thread from memory. Multiplexed blocking I/O is implemented by calling the select primitive: each worker thread monitors two TCP connections, one to the client and one to the server. When no data has arrived, select puts the thread into a very low-cost sleep state; once data arrives, the thread wakes up and receives it. The same applies to sending data.
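
The thread-per-connection relay with select can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of any product named above: the function names `relay`, `handle` and `serve` are hypothetical, a single fixed backend address is used, and error handling is reduced to the essentials.

```python
import select
import socket
import threading


def relay(client, server, bufsize=4096):
    """Copy bytes both ways until either side closes, then close both ends."""
    peer_of = {client: server, server: client}
    try:
        while True:
            # Sleep cheaply until at least one of the two connections has data.
            readable, _, _ = select.select([client, server], [], [])
            for sock in readable:
                data = sock.recv(bufsize)
                if not data:  # this end closed its connection
                    return
                peer_of[sock].sendall(data)
    finally:
        # Closing one side always tears down the other as well.
        client.close()
        server.close()


def handle(client, backend_addr):
    """Worker thread body: connect to the backend, relay, then exit."""
    try:
        server = socket.create_connection(backend_addr)
    except OSError:
        client.close()
        return
    relay(client, server)


def serve(listen_addr, backend_addr):
    """Listener: accept clients and start one short-lived worker per connection."""
    listener = socket.create_server(listen_addr)
    while True:
        client, _ = listener.accept()
        threading.Thread(
            target=handle, args=(client, backend_addr), daemon=True
        ).start()
```

Calling `serve(("127.0.0.1", 8080), ("127.0.0.1", 9000))` would forward local port 8080 to a backend on port 9000; both addresses are placeholders. A real allocator would pick the backend per request instead of using one fixed address.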

Load balancing is typically achieved by having the request allocator select the target retrieval server. There are currently three main request allocation algorithms: the "rotation method" (round-robin), the "least connection method" and the "fastest connection method". To make allocation efficient and suited to heterogeneous server clusters, the request allocator should know the processing capacity of each retrieval server, be able to analyze the content of each user request, and accurately track the load on each server.
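
The three allocation methods can be compared in a short Python sketch. Several points are assumptions made for illustration: the class name `Allocator` and its methods are hypothetical, "fastest connection" is interpreted as lowest smoothed response time, and the capacity weighting in the least-connection method is one possible way to account for heterogeneous servers.

```python
import itertools


class Allocator:
    """Illustrative request allocator; names and weighting are assumptions."""

    def __init__(self, capacities):
        # capacities: server name -> relative processing capacity
        self.capacity = dict(capacities)
        self.servers = list(self.capacity)
        self._cycle = itertools.cycle(self.servers)
        self.active = {s: 0 for s in self.servers}     # open connections
        self.latency = {s: 0.0 for s in self.servers}  # smoothed response time

    def rotation(self):
        # Rotation method: hand out servers in a fixed cyclic order.
        return next(self._cycle)

    def least_connection(self):
        # Fewest open connections relative to the server's capacity.
        return min(self.servers, key=lambda s: self.active[s] / self.capacity[s])

    def fastest_connection(self):
        # Lowest smoothed response time observed so far.
        return min(self.servers, key=lambda s: self.latency[s])

    def started(self, server):
        self.active[server] += 1

    def finished(self, server, seconds, alpha=0.5):
        # Track load and update an exponential moving average of latency.
        self.active[server] -= 1
        self.latency[server] = (
            alpha * seconds + (1 - alpha) * self.latency[server]
        )
```

The rotation method needs no state about the servers, which is why it suits clusters of identical machines; the other two depend on the allocator tracking each server's load, as the paragraph above requires.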

Web server clustering provides greater processing power, wider I/O bandwidth, good scalability and high reliability, with costs that are easy to manage. There are two main kinds of parallel Web server cluster:

The "isolated" type, represented by Cisco's LocalDirector, allocates requests using the "least connection" or "fastest connection" method.

The "non-isolated" type, represented by the NCSA Scalable Web Server, allocates requests using the "rotation method".

In either case, the access paths and content of the Web information on the backend servers must be exactly the same. The difference is whether those servers are visible to users on the Internet. An "isolated" cluster works like a proxy: only the request allocator has an external IP address, all user requests go to the allocator, the allocator forwards each request to a backend server in the cluster, and the result is returned to the client through the allocator. In a "non-isolated" cluster, every server has its own IP address, request assignment is implemented through dynamic DNS, and the response does not pass through the request allocator but goes directly from the server to the client.

2 A distributed search engine example

We take as an example the distance education information resource search system of Hanjiang Middle School in Jiangsu Province, the Voyage search system (HJ-YHS). As society progresses and science and technology develop, all aspects of education, including educational ideas, concepts, means and methods, need to be reformed accordingly. Traditional education can no longer meet the needs of the times. An important teaching method of the information age is interactive distance education based on the WWW. On the Internet, resources are scattered, data types are numerous, and addresses change frequently, so it is not easy to find information quickly. It is therefore necessary to centralize, classify and organize the information resources on the Internet and to build a WWW-based information resource navigation library so that users can quickly find resources according to their own needs. HJ-YHS was designed and developed against this background. It aims to use distributed search engine technology to improve the efficiency and precision of searching distance education information resources.

2.1 Application system structure and function realization

HJ-YHS uses Windows NT 4.0 as its development platform and ASP (Active Server Pages) to generate dynamic query pages and display results. The Web server in the background is IIS 4.0, and the database server is SQL Server 7.0, which provides the data services. Web clients run Windows 95/98 with the IE 4.0 browser, and the development tools are Visual InterDev 6.0 or VB 6.0. The features of HJ-YHS are:

Educational information and academic content are the main subject matter, the target users are middle schools of all kinds and education departments, and information resources are selected according to academic standards.

Related pages are searched selectively according to predefined topics, unrelated pages are avoided, and the index information is stored in the index database.

The large number of preliminary search results is analyzed and classified, and the retrieval range is further narrowed according to feedback from user interaction, which improves retrieval accuracy.
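
The topic-based selection in the second feature can be pictured with a small Python sketch. The paper does not describe how HJ-YHS decides whether a page matches a topic, so everything here is an assumption: the `TOPICS` vocabulary, the function name `match_topic`, and the rule that a page must contain at least `min_hits` vocabulary words are all hypothetical.

```python
import re

# Hypothetical topic vocabularies; a real system would use far larger lists.
TOPICS = {
    "mathematics": {"algebra", "geometry", "equation"},
    "physics": {"force", "energy", "motion"},
}


def match_topic(text, topics=TOPICS, min_hits=2):
    """Return the best-matching topic for a page, or None if it is unrelated."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best, best_hits = None, 0
    for topic, vocabulary in topics.items():
        hits = len(words & vocabulary)  # distinct vocabulary words on the page
        if hits > best_hits:
            best, best_hits = topic, hits
    # Pages below the threshold are skipped and never reach the index.
    return best if best_hits >= min_hits else None
```

A crawler built this way would index only the pages for which `match_topic` returns a topic, which keeps unrelated pages out of the index database.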

2.1.1 HJ-YHS architecture

2.1.2 System function modules

Web search module: Periodically starts the Web collection system, which collects information within the specified range according to the given site names.

Information analysis module: Analyzes and collates the collected Web pages, extracts keywords and abstracts, and stores the index information in the index database.

Resource upload module: Receives files uploaded from clients, stores them in a specific directory on the Web server, and adds the corresponding information to the index database. Uploading is allowed only for users with the appropriate permissions.

ASP information retrieval module: Started from the user query interface, it provides three query methods: keyword query, subject classification query, and grade query. For a keyword query, a second-level query interface is generated from the information the user submits in order to clarify the search intention, and all of the user's information is then combined to run a full-text query over the Web pages. For subject classification and grade queries, the search can be restricted to the range the user specifies.

Dynamic page generation module: Outputs results in order of relevance. The dynamically generated page gives the title, URL, content summary and so on of each page the query returns.

Static page generation module: Organized as a subject-specific catalogue, it generates static pages for the different categories of the directory.
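
How the information analysis, retrieval and dynamic page generation modules fit together can be sketched in Python. HJ-YHS itself is built on ASP and SQL Server, and the paper gives no schema or ranking formula, so this is only an illustration under assumptions: the record fields, the stopword list, the function names `analyze` and `search`, and the use of keyword frequency as the relevance score are all hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real analyzer would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}


def analyze(url, title, text, subject, grade, top_n=5):
    """Information analysis: extract keywords and a summary for one page."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    return {
        "url": url,
        "title": title,
        "summary": text[:80],  # crude abstract: the first 80 characters
        "keywords": dict(counts.most_common(top_n)),
        "subject": subject,
        "grade": grade,
    }


def search(index, keyword, subject=None, grade=None):
    """Retrieval: keyword query, optionally restricted by subject and grade."""
    keyword = keyword.lower()
    hits = [
        rec for rec in index
        if keyword in rec["keywords"]
        and (subject is None or rec["subject"] == subject)
        and (grade is None or rec["grade"] == grade)
    ]
    # Dynamic page generation would list these in order of relevance.
    return sorted(hits, key=lambda rec: rec["keywords"][keyword], reverse=True)
```

Each returned record already carries the title, URL and summary that the dynamic page generation module needs, so rendering a result page is a matter of formatting the list.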

In summary, given the advantages of distributed search engine technology and the rapid growth of information resources on the global Internet, distributed search engine technology will be widely applied in the field of distance education.


Note: The original address is: http://www.bjx.com.cn/files/wx/jsjyxxjs/2002-7/11.htm
