Search Engine Robot Technology

Source: Internet
Author: User
Tags: manual, file size, return, sort, access
Internet users have long relied on search engines such as AltaVista, InfoSeek, HotBot, Network Compass, North Skynet, and Chinaok. The index databases of the largest of these (AltaVista and HotBot) cover more than 100 million Internet pages, and the Skynet engine has also collected about 320,000 domestic WWW pages. Building such an index database requires visiting each of these pages and then indexing it. To handle access on that scale, today's search engines, whether for English or Chinese, use online robots to search the Web automatically (Yahoo! is an exception).

Online robot
Online robots (Robots) are also known as Spiders, Worms, or Wanderers; their core purpose is to acquire information on the Internet. A robot uses the hypertext links embedded in pages to traverse the Web, crawling from one HTML document to another through URL references. The information collected by an online robot serves many purposes: indexing, HTML file validation, URL link validation, retrieval of updated information, site mirroring, and so on.

The algorithm of robot searching WWW document
Because robots crawl across the Web, they must maintain a URL list that records their access path. Since the Web is hypertext, the URLs pointing to other documents are embedded in each document and must be extracted by parsing it. Robots are generally used to generate index databases, and all WWW search programs follow similar steps:
1) The robot takes a URL from the starting URL list and reads the document's contents from the Internet;
2) It extracts some information from each document and puts it into the index database;
3) It extracts the URLs pointing to other documents and adds them to the URL list;
4) It repeats the above three steps until no new URLs are found or a limit (time or disk space) is exceeded;
5) It attaches a query interface to the index database and publishes it to Internet users.
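The five steps above can be sketched as a minimal crawl loop. To keep the sketch self-contained, the "Web" is an invented in-memory map from URL to (text, links); in a real robot these would come from HTTP fetches and an HTML parser.

```python
from collections import deque

# A tiny invented "web": URL -> (document text, outgoing URLs).
PAGES = {
    "http://a.example/": ("home page about robots", ["http://a.example/docs"]),
    "http://a.example/docs": ("robot algorithm docs", ["http://a.example/"]),
}

def crawl(start_urls, limit=100):
    url_list = deque(start_urls)             # the URL list (step 1)
    seen = set(start_urls)
    index = {}                               # the index database (step 2)
    while url_list and len(index) < limit:   # repeat until done or limited (step 4)
        url = url_list.popleft()
        if url not in PAGES:
            continue
        text, links = PAGES[url]             # step 1: read the document
        index[url] = text.split()            # step 2: index its words
        for link in links:                   # step 3: extract new URLs
            if link not in seen:
                seen.add(link)
                url_list.append(link)
    return index                             # step 5: queried via a search interface

index = crawl(["http://a.example/"])
```

The `seen` set prevents the robot from revisiting a page when documents cross-reference each other, which is essential on a real Web full of link cycles.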
The algorithm has two basic search strategies, depth-first and breadth-first.
The robot's search strategy is determined by how the URL list is accessed:
1) First in, first out, which yields a breadth-first search. When the starting list contains a large number of Web server addresses, breadth-first search produces good initial results, but it is difficult to drill deep into any one server.
2) First in, last out, which yields a depth-first search. This produces a better document distribution and makes it easier to discover a document's structure, that is, to find the maximum number of cross-references.
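The two strategies differ only in how the URL list is serviced: taking URLs from the front (FIFO) gives breadth-first order, while taking them from the back (LIFO) gives depth-first order. A hypothetical three-level link structure illustrates the different visit orders.

```python
from collections import deque

# Hypothetical link structure: page A links to B and C, B links to D and E.
LINKS = {"A": ["B", "C"], "B": ["D", "E"], "C": [], "D": [], "E": []}

def traverse(start, breadth_first):
    url_list = deque([start])
    order, seen = [], {start}
    while url_list:
        # FIFO (popleft) -> breadth-first; LIFO (pop) -> depth-first.
        url = url_list.popleft() if breadth_first else url_list.pop()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                url_list.append(link)
    return order

print(traverse("A", breadth_first=True))   # visits whole level before descending
print(traverse("A", breadth_first=False))  # follows one branch to the bottom first
```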

Result processing Technology
The main factors of Web page selection
A search engine should be able to find the sites that match a query and sort the results by relevance. Relevance here refers to the frequency with which the search keywords appear in a document, normalized to a maximum of 1; the higher the frequency, the more relevant the document is considered to be. But because current search engines are not intelligent, unless you already know the title of the document you are looking for, the top-ranked result is not necessarily the "best" one. Some documents, despite a high relevance score, are not the documents users actually need.
A search engine is a computer network application with high technical content, drawing on network technology, database technology, retrieval technology, intelligent processing, and more. Much of the advanced foreign technology in this area is built around Western-language processing, so it cannot simply be copied. For a Chinese search engine, the question is how to play to our strengths in Chinese processing and develop core technology of our own, so as to occupy a favorable position in the Chinese search engine competition.

Four main factors of Web page selection:
A. The size of the Web page database, mainly determined by manual browsing.

B. Retrieval response time, mainly measured by a program.
The program records the time just before submitting a query to the search engine, records the time again once the results are returned, and subtracts the two to obtain the retrieval response time.
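The measurement described above can be sketched as follows; `run_query` is a placeholder for an actual request to a search engine, simulated here with a short sleep so the sketch stays self-contained.

```python
import time

def run_query(keyword):
    # Placeholder for sending the query to a search engine and
    # waiting for the result page; simulated with a 50 ms delay.
    time.sleep(0.05)
    return ["result1", "result2"]

start = time.monotonic()            # note the time before the query
results = run_query("robot")
elapsed = time.monotonic() - start  # subtracting gives the response time
print(f"retrieved {len(results)} results in {elapsed:.3f} s")
```

`time.monotonic()` is used rather than wall-clock time so the measurement is unaffected by system clock adjustments during the query.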

C. The quality of Web pages, mainly judged by manual review.
A search engine ultimately returns the search results to the user, and how those results are presented directly affects how useful the engine is. Therefore, how the result content is organized, how it is sorted, and whether enough related information is provided (internal code, file size, file date, etc.) greatly influence the user's judgment of the search results.

D. The relevance of each Web site, i.e., the ability to differentiate the relevance of search results (pertinency). It can be gauged by:
  • manually assigning each site a correlation coefficient, such as 1.0 for Yahoo and 0.94 for goyoyo;
  • the number of keywords appearing in each link's summary;
  • the recorded return time, that is, the retrieval response time.

Result processing
(1) Sort by keyword frequency
In general, the more keywords a page contains, the better it matches the search target; this is a straightforward and sensible approach.
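A minimal sketch of frequency-based ranking, normalizing the keyword count by document length so the score tops out at 1, as described earlier; the two documents are invented for illustration.

```python
def relevance(keyword, words):
    # Keyword frequency relative to document length, at most 1.0.
    return words.count(keyword) / len(words) if words else 0.0

docs = {
    "page1": "robot robot search".split(),
    "page2": "search engine index robot".split(),
}

# Sort pages by descending relevance to the keyword "robot".
ranked = sorted(docs, key=lambda d: relevance("robot", docs[d]), reverse=True)
print(ranked)  # page1 first: score 2/3 vs 1/4
```

This is exactly the limitation the text notes: page1 wins purely on frequency, whether or not it is the page the user actually wants.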

(2) Sort by page access frequency
With this approach, the search engine records how often each page it has indexed is accessed. Pages that are visited more often usually contain more information or have some other attractive quality. This approach suits ordinary search users, and since most search engine users are not professionals, it is well suited to general-purpose search engines.

(3) Further refinement (refine) of results
The search results are optimized again according to certain conditions; for example, the user can narrow them by category, related words, and so on.

