Deep research on search engine technology

Source: Internet
Author: User
Keywords: Web Spider


With the rapid development of network technology, people rely more and more on Internet search engines. In the 21st century, as online resources multiply and the demand for information keeps rising, search technology holds a commanding position on the Internet. People routinely use search engines to find multimedia material, the latest news, maps, and other information.

First, the basic principles of search engines

A search engine is a system that collects web page data, builds a database from it, and answers queries against that database.

1.1 Structure of the search engine

Collection is done by web spiders. A spider fetches a page, follows the links on that page to other pages, and in this way eventually reaches a large number of pages. The fetched pages are compressed and stored in a page repository. The spider keeps re-crawling the whole network so that the stored information stays current and valid.

Preprocessing takes the collected pages and performs link analysis, page importance calculation, and keyword extraction in order to build an index database. The structure of this database must make searching easy, and the information it contains should be as comprehensive as possible.

Service is the part that faces the user. When the user enters a keyword, the engine quickly looks up the relevant information in the index database and returns it to the user.
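The preprocessing and service stages can be illustrated with a small inverted index. This is only a minimal sketch: the page URLs and texts are made up, the function names (tokenize, build_index, search) are hypothetical, and a real engine would also store positions, weights, and compressed postings.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Preprocessing: map each term to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def search(index, query):
    """Service: return the pages that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)

# Hypothetical pages standing in for what a spider has collected.
pages = {
    "http://example.com/a": "Search engines crawl web pages",
    "http://example.com/b": "An index makes search fast",
    "http://example.com/c": "Spiders crawl links between pages",
}
index = build_index(pages)
print(search(index, "crawl pages"))
```

The query is answered from the index alone, without rescanning the page texts, which is why the index structure matters for query speed.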

1.2 Classification of search engines

Search engines can be divided into three categories: full-text search engines, directory search engines, and meta search engines.

A full-text search engine uses web spiders to crawl pages, extracts their information, and stores it in a database. When a user submits keywords, the engine matches them against the database and returns the matching information. This is the most widely used type of search engine; Google and Baidu belong to it.

A directory search engine classifies resources according to some scheme and builds them into a very large directory system. Users browse the directory level by level until they find the information they want. Strictly speaking, a directory search engine is not a real search engine. Yahoo and Sina are familiar examples.

A meta search engine calls other search engines, so it can cover more resources and provide a more comprehensive service. Widely used examples are Dogpile, Vivisimo, and, in China, Search Star.

Each of the three types suits different situations and has its own advantages and disadvantages. Full-text search engines are generally used for comprehensive searches. Their advantages are a large volume of information, timely updates, and no need for manual intervention; their disadvantage is that the volume of information to process is large and the processing is difficult. Directory search engines are mostly website-oriented and offer both directory browsing and direct search. Their advantage is that manual intervention improves the accuracy of search results; their disadvantages are that this manual work makes maintenance costly, updates slow, and the volume of information small. Meta search engines query several other search engines at once, so they are particularly suited to searches that need high recall. However, the underlying engines differ in how they build their index databases and in the methods and rules they use to execute queries, which greatly affects the quality of meta search results.
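Because the underlying engines score results in different ways, a meta search engine needs a merging rule that does not depend on any engine's own scores. The sketch below uses reciprocal rank fusion, which is one common choice and is not taken from this article; the function name fuse_rankings, the constant k=60, and the result lists are assumptions for illustration.

```python
from collections import defaultdict

def fuse_rankings(rankings, k=60):
    """Merge ranked URL lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) to a URL's score, so only the
    position in each list matters, not the engine's own scoring scale.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, url in enumerate(ranking, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by URL for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical result lists returned by two underlying engines.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4", "u1"]
print(fuse_rankings([engine_a, engine_b]))
```

URLs that appear near the top of several lists rise in the merged list, while a URL returned by only one engine is still kept, which matches the recall-oriented use of meta search described above.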

Second, several key technologies of search engine implementation

2.1 Web Spider

Web spiders can be implemented in several ways:

(1) Breadth first. A breadth-first spider visits links in the order in which it encounters them. It is one of the simplest of all web spider strategies.

(2) Depth first. Following the depth-first idea, the spider calculates the similarity between each web page and the search topic according to the selected criteria and follows the most similar link next. Cosine similarity is usually used for this calculation.

(3) Page rating. The spider rates the collected documents using page ratings together with the content being searched for, and selects the highest-rated link as the next page to visit.

(4) Infospider. Infospider uses evolving keyword lists and a neural network to calculate how similar the linked pages are to the topic, and chooses the next page to visit based on the result. It then calculates the relevance of the newly fetched document to the topic and the cost of fetching it, and adjusts the agent's energy accordingly. Depending on its energy level, the agent dies, reproduces, or survives.
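Strategies (1) and (2) can be compared on a toy link graph. The sketch below is illustrative only: WEB is a hypothetical in-memory stand-in for the network, and the similarity-guided crawler scores the text of the target page directly, whereas a real spider has not fetched that page yet and would have to estimate its relevance from the linking page or the anchor text.

```python
import heapq
import math
from collections import Counter, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "A": ("search engine overview", ["B", "C"]),
    "B": ("cooking recipes", ["D"]),
    "C": ("search engine index design", ["E"]),
    "D": ("more recipes", []),
    "E": ("search engine ranking", []),
}

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts treated as term-count vectors."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def crawl_breadth_first(seed):
    """Visit pages in the order in which their links were discovered."""
    order, seen, queue = [], {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in WEB[url][1]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

def crawl_by_similarity(seed, topic):
    """Always visit the discovered page that is most similar to the topic."""
    order, seen = [], {seed}
    # heapq is a min-heap, so similarities are stored negated.
    frontier = [(-cosine_similarity(WEB[seed][0], topic), seed)]
    while frontier:
        _, url = heapq.heappop(frontier)
        order.append(url)
        for link in WEB[url][1]:
            if link not in seen:
                seen.add(link)
                score = cosine_similarity(WEB[link][0], topic)
                heapq.heappush(frontier, (-score, link))
    return order

print(crawl_breadth_first("A"))
print(crawl_by_similarity("A", "search engine"))
```

The only difference between the two crawlers is the frontier data structure: a FIFO queue for breadth first, a priority queue ordered by similarity for the topic-guided strategy.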

2.2 Evaluation of the importance of Web pages

There are two main methods of evaluating the importance of web pages: one is based on links, the other on similarity.

The link-based method assumes that there is a credible mapping between link information and the pages being linked. The following measures are frequently used in practice:

(1) In-degree: the number of links that point to this page;

(2) Out-degree: the number of links from this page to other pages;

(3) PageRank: the probability that a user is visiting the page at any given moment.

This method is widely used and effective.
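The three link-based measures can be computed on a small link graph as follows. This is a simplified sketch, not a production algorithm: the graph is hypothetical, every link target is assumed to appear as a key in the graph, and the damping factor 0.85 and the fixed iteration count are conventional assumptions rather than values given in this article.

```python
def degrees(graph):
    """Return (in_degree, out_degree) dictionaries for a link graph."""
    in_deg = {page: 0 for page in graph}
    out_deg = {page: len(links) for page, links in graph.items()}
    for links in graph.values():
        for target in links:
            in_deg[target] += 1
    return in_deg, out_deg

def pagerank(graph, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    Pages without outgoing links spread their rank evenly over all pages.
    """
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        dangling = sum(rank[p] for p, links in graph.items() if not links)
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in graph}
        for page, links in graph.items():
            for target in links:
                new_rank[target] += damping * rank[page] / len(links)
        rank = new_rank
    return rank

# Hypothetical four-page link graph: page -> pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
in_deg, out_deg = degrees(graph)
ranks = pagerank(graph)
print(in_deg, out_deg)
print(max(ranks, key=ranks.get))
```

In this graph page C has the highest in-degree and also ends up with the highest PageRank, while page D, which nobody links to, keeps only the baseline share.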

The similarity-based method uses the vector space model: the query string and each text are converted into vectors, and the similarity between the text and the query string is then evaluated.
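A minimal sketch of the vector space model is shown below. The article does not say how the vector components are weighted, so the TF-IDF weighting with a smoothed logarithm is an assumption, as are the function names and the sample texts.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (an assumed weighting scheme).
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_texts(query, texts):
    """Return (similarity, text) pairs, most similar text first."""
    docs = [text.lower().split() for text in texts]
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the collection get weight 0.
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(query.lower().split()).items()}
    scored = [(cosine(q, vec), text) for vec, text in zip(vectors, texts)]
    return sorted(scored, key=lambda pair: -pair[0])

# Hypothetical texts standing in for indexed pages.
texts = [
    "web spider crawls pages",
    "page rank measures link importance",
    "spider follows every link",
]
for score, text in rank_texts("spider link", texts):
    print(round(score, 3), text)
```

The text that contains both query terms is ranked first, and a query made only of unknown terms yields a similarity of 0 for every text.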

2.3 Building the search engine hardware system

The hardware system is the backbone of the whole search engine. To provide faster queries, it generally adopts a distributed architecture. Google's servers, for example, are distributed around the world and also use parallel processing to speed up execution. The hardware design of the index database is also important, because it is critical to data access speed.
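The idea of a distributed, parallel index can be sketched by splitting the index into shards and querying them concurrently. This is only a toy model under stated assumptions: the shards live in one process, threads stand in for separate index servers, and NUM_SHARDS, shard_for, build_shards, and search_shards are hypothetical names. It does not describe how any real engine lays out its hardware.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 3  # hypothetical number of index servers

def shard_for(url):
    """Pick a shard from a stable hash of the URL."""
    return zlib.crc32(url.encode("utf-8")) % NUM_SHARDS

def build_shards(pages):
    """Give each shard its own small inverted index: term -> set of URLs."""
    shards = [dict() for _ in range(NUM_SHARDS)]
    for url, text in pages.items():
        index = shards[shard_for(url)]
        for term in text.lower().split():
            index.setdefault(term, set()).add(url)
    return shards

def search_shards(shards, term):
    """Query every shard in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda index: index.get(term, set()), shards))
    return sorted(set().union(*partials))

# Hypothetical pages standing in for a crawled collection.
pages = {
    "http://example.com/a": "search engines crawl web pages",
    "http://example.com/b": "an index makes search fast",
    "http://example.com/c": "spiders crawl links between pages",
}
shards = build_shards(pages)
print(search_shards(shards, "crawl"))
```

Each shard only has to scan its own part of the index, and the merged answer is the same regardless of which shard a page was assigned to.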

Third, development trends of search engines

Future search engines will have the following characteristics:

(1) They will be able to collect almost all the information on the Internet;

(2) They will be able to block illegal information;

(3) Recall and precision will improve;

(4) They will recognize not only text queries but also images, audio, video, and so on;

(5) Information will be updated faster;

(6) Cross-database queries will become easier;

(7) Interfaces will be user-friendly, personalized, and interactive;

(8) Intelligent search will be realized;

(9) Mobile search will make great progress.

Fourth, summary

This article has explained search engines in detail, analyzed how their key technologies are implemented, and outlined their future development trends. As technology advances and user demands grow, search engines will become more intelligent, more efficient, and more practical.
