In-Depth Research on Search Engine Technology

With the rapid development of network science and technology, people have become increasingly dependent on search engines. In the 21st century, with abundant network resources and a growing demand for online information, search technology occupies a commanding position on the Internet. Today, people routinely use search engines to find multimedia material, the latest news, maps, and other resources.

1. Basic principles of search engines

A search engine is a system that collects web page information from websites, builds a database from it, and provides query services.

1.1 Search engine structure

Web page collection crawls pages with a web crawler, which follows the links on each page to reach further pages. In the end, a large number of pages can be crawled, compressed, and stored in a repository. The web spider program crawls the network continuously to keep the information timely and effective.
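
A minimal sketch of this collection step, assuming a standard-library Python crawler; the seed URL and the page limit are placeholders for illustration, not values from the text:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags on a fetched page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=50):
    """Breadth-first collection: fetch a page, store it, queue its links."""
    store = {}                      # url -> raw HTML (the "knowledge base")
    queue = deque([seed])
    seen = {seed}
    while queue and len(store) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                # skip unreachable or malformed pages
        store[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return store


# Example with a hypothetical seed URL:
# pages = crawl("https://example.com", max_pages=10)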

Preprocessing analyzes the links of the collected web pages, computes the importance of each page, and extracts keywords to build an index database. The structure of this database must be suited to searching, and the information it holds should be as comprehensive as possible.
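
As a rough illustration of the indexing step, the sketch below builds an inverted index (keyword -> set of page IDs) from collected pages; the trivial regular-expression tokenizer stands in for real keyword extraction and is only an assumption:

import re
from collections import defaultdict


def tokenize(text):
    """Very crude keyword extraction: lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each keyword to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index


# `pages` would come from the collection step, e.g. {url: page_text, ...}
index = build_index({
    "page1": "search engine technology",
    "page2": "web spider and index technology",
})
print(index["technology"])   # {'page1', 'page2'}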

The query service serves users directly. After a user enters a keyword, the system quickly finds the relevant information in the index database based on that keyword and returns it to the user.
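
Continuing the previous sketch (and reusing its tokenize helper and index), the query step can be reduced to posting-list lookups; intersecting the lists for a multi-keyword query is a simplification of what a real ranking service does:

def search(index, query):
    """Return page IDs that contain every keyword in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = index.get(tokens[0], set()).copy()
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


print(search(index, "index technology"))   # {'page2'}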

1.2 Search engine classification

There are three types of search engines: full-text search engines, directory search engines, and meta search engines.

A full-text search engine crawls web pages with a web crawler, extracts information from them, and stores that information in a database. When a user searches, the keywords the user enters are matched against the database and the matching information is returned. This is the most widely used type of search engine; Google and Baidu belong to it.

A directory search engine classifies search resources according to a scheme and finally builds a large directory system. When querying, the user can open and browse the directory level by level until the desired information is found. Strictly speaking, a directory search engine is not a real search engine. Yahoo and Sina, which we often use, belong to this type.

A meta search engine is an engine that calls other search engines. It can cover more resources and provide a more comprehensive service. Dogpile, Vivisimo, and domestic meta search engines are widely used.

The three kinds of search engines above suit different scenarios and have their own advantages and disadvantages. Full-text search engines are generally used for comprehensive searching. Their advantages are a large amount of information, timely updates, and no need for manual intervention; their disadvantage is that they process so much information that filtering it is difficult. Directory search engines are mostly oriented to websites and provide directory browsing and direct retrieval. Their advantage is that manual intervention improves the accuracy of results; their disadvantages are the high maintenance cost of that manual work, slow updates, and a small amount of information. Because a meta search engine can query several other search engines, it is particularly suitable for scenarios that require broad coverage of results. However, the underlying engines use different methods and rules to build their index databases and process queries, which greatly affects the retrieval performance of meta search tools.

2. Several key technologies of search engine implementation

2.1 Web spider

A web spider can be implemented with the following strategies:

(1) Breadth-first. The breadth-first algorithm visits links in the order they are discovered. It is the simplest of all web spider strategies.

(2) Depth-first. Based on the idea of depth precedence, the similarity between a web page and the search topic is computed according to the chosen criteria, and the link with the highest similarity is selected for the next search. Cosine similarity is usually used in this calculation (a frontier-selection sketch follows this list).

(3) Web page rating. Based on web page ratings, the rating and content of a page are used to score the list of candidate documents, and the link with the highest computed rating is chosen as the next object to search.

(4) InfoSpider. InfoSpider uses an evolving keyword table and a neural network to compute the similarity between web pages and the topic, and uses the result to decide which object to search next. At the same time, it computes the relevance of each newly obtained document to the topic and adjusts the agent's energy by the cost of obtaining the document; agents are revoked, regenerated, or kept alive according to their energy level.
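
A sketch of how strategies (2) and (3) prioritize the crawl frontier, assuming hypothetical score(url) and fetch_links(url) helpers are available; the priority queue always expands the highest-scoring candidate next:

import heapq


def best_first_crawl(seed, score, fetch_links, max_pages=50):
    """Greedy best-first crawl: always expand the highest-scoring link.

    score(url) and fetch_links(url) are assumed helpers; in strategy (2)
    the score would be the cosine similarity between the page and the
    search topic, in strategy (3) a web page rating.
    """
    frontier = [(-score(seed), seed)]      # max-heap via negated scores
    visited = set()
    order = []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:
                heapq.heappush(frontier, (-score(link), link))
    return order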

2.2 Web page importance evaluation

There are two main methods of judging the importance of web pages: link-based computation and similarity-based computation.

Link-based computation relies on a certain trust relationship between the link information and the linked object. The following quantities are often used in practice:

(1) In-degree: the number of web pages whose links point to the current page;

(2) Out-degree: the number of links from the current page to other pages;

(3) PageRank: the probability that a user is visiting the page at any given moment.

This method is widely used and very effective.
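
A minimal power-iteration sketch of PageRank over a small, made-up link graph; the damping factor 0.85 and the iteration count are conventional defaults, not values from the text:

def pagerank(graph, damping=0.85, iterations=50):
    """graph: {page: [pages it links to]}. Returns {page: rank}."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in graph.items():
            if not outlinks:                 # dangling page: spread its rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph:
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))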

Similarity-based computation generally uses the vector space model: the query string and the text are converted into vectors, and the similarity between the text and the query string is then evaluated.
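
A sketch of this similarity computation under the vector space model: query and document are turned into term-frequency vectors and compared with the cosine measure (plain term counts are used here instead of a full TF-IDF weighting):

import math
from collections import Counter


def cosine_similarity(query, document):
    """Cosine of the angle between two term-frequency vectors."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


print(cosine_similarity("search engine", "a web search engine indexes pages"))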

2.3 Establishment of the search engine hardware system

The hardware system of a search engine is the pillar of the whole system. To provide faster query speeds, the hardware generally adopts a distributed structure: Google's servers, for example, are distributed all over the world, and parallel technologies are used to speed up execution. In addition, the hardware design of the index database is very important and is critical to fast data access.
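
To illustrate the distributed idea only, the sketch below fans a query out to several index shards in parallel and merges the partial results; the shard contents and the search_shard helper are purely illustrative assumptions:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each shard holds its own small inverted index.
shards = [
    {"search": {"page1"}, "engine": {"page1"}},
    {"search": {"page7"}, "spider": {"page9"}},
]


def search_shard(shard, keyword):
    """Look a keyword up in one shard's index."""
    return shard.get(keyword, set())


def distributed_search(keyword):
    """Query all shards in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: search_shard(s, keyword), shards)
    result = set()
    for part in partials:
        result |= part
    return result


print(distributed_search("search"))   # {'page1', 'page7'}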

3. Development trends of search engines

Future search engines will have the following features:

(1) the ability to collect almost all information on the Internet;

(2) blocking of illegal information;

(3) improved query accuracy and precision;

(4) recognition not only of text but also of images, audio, video, and so on;

(5) faster information updates;

(6) convenient cross-database queries;

(7) user-friendly, personalized interactive interfaces;

(8) intelligent search;

(9) great strides in mobile search.

4. Summary

This article gives a detailed explanation of search engines, analyzes the implementation of their key technologies, and puts forward future development trends. As technology develops and people's needs grow, search engines will become more intelligent, more efficient, and more practical.
