Technical Principles of Search Engines

1. Overview

A search engine is a system that uses specific computer programs to collect information from the Internet according to certain policies, then organizes and processes that information and provides a retrieval service to users.

2. Search Engine Classification

According to how they collect information and how they provide services, search engines can be divided into three categories: full-text search engines, directory index (search directory) engines, and meta search engines.

2.1 Full-Text Search Engine

A full-text search engine is a search engine in the true sense. Google is the best-known example abroad, and Baidu is a famous one in China. These engines extract information from each website on the Internet (mainly webpage text), build a database, retrieve the records matching the user's query, and return the results in a certain order.

2.2 Directory Index Search Engine

A directory index engine collects information manually or semi-automatically. After editors review the information, they write a summary by hand and place the entry into a predefined classification framework. Although a directory index offers a search function, it is not a search engine in the strict sense; it is only a list of website links organized by category. Users do not need to enter keywords; they can find the desired information simply by browsing the category directory. The most representative directory index is the famous Yahoo!; other well-known examples include the Open Directory Project (DMOZ), LookSmart, and About. In China, the searches offered by Sohu, Sina, and NetEase also belong to this category.

2.3 Meta Search Engine

A meta search engine has no data of its own. Instead, it submits the user's query to multiple search engines simultaneously, then processes the returned results, removing duplicates and re-sorting. By integrating the output of several engines and applying its own ranking and information filtering, it aims to improve user satisfaction.
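
As a rough illustration of that merge step, the sketch below (all engine names, URLs, and scores are invented) dedupes results returned by several engines and re-sorts them, favoring URLs that more engines agree on:

```python
# Sketch of a meta search merge: dedupe results from several engines, then re-sort.
def meta_search(results_by_engine):
    """results_by_engine: {engine: [(url, score), ...]} with engine-local scores."""
    merged = {}
    for results in results_by_engine.values():
        for url, score in results:
            # Duplicate removal: keep one entry per URL, accumulating evidence.
            merged.setdefault(url, []).append(score)
    # Re-sorting: a URL returned by more engines (and with higher scores) ranks higher.
    return sorted(merged, key=lambda u: (len(merged[u]), sum(merged[u])), reverse=True)

results = {
    "engineA": [("u1", 0.9), ("u2", 0.7)],
    "engineB": [("u2", 0.8), ("u3", 0.6)],
}
print(meta_search(results))  # ['u2', 'u1', 'u3'] - u2 appears in both engines
```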

3. Full-Text Search Engine

A typical search engine consists of three modules: the information collection module (crawler), the index module (indexer), and the query module (searcher).

Crawler: collects webpage data from the Web.

Indexer: analyzes the data collected by the crawler and generates an index.

Searcher: receives query requests, obtains results through a query algorithm, and returns them to the user.
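
A minimal sketch of how these three modules might hand data to one another (all class and variable names are illustrative, and a dict stands in for real HTTP fetching):

```python
# Illustrative three-module pipeline: crawler -> indexer -> searcher.
from dataclasses import dataclass, field


@dataclass
class Crawler:
    """Collects (url, text) pairs; here fed from a dict instead of the live Web."""
    pages: dict  # url -> raw text

    def collect(self):
        yield from self.pages.items()


@dataclass
class Indexer:
    """Builds an inverted index: term -> set of URLs."""
    index: dict = field(default_factory=dict)

    def add(self, url, text):
        for term in text.lower().split():
            self.index.setdefault(term, set()).add(url)


@dataclass
class Searcher:
    """Answers keyword queries from the indexer's output."""
    index: dict

    def search(self, term):
        return sorted(self.index.get(term.lower(), set()))


crawler = Crawler({"http://a.example": "search engine basics",
                   "http://b.example": "engine maintenance"})
indexer = Indexer()
for url, text in crawler.collect():
    indexer.add(url, text)
searcher = Searcher(indexer.index)
print(searcher.search("engine"))  # ['http://a.example', 'http://b.example']
```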

3.1 Web Crawlers Fetch Webpages from the Internet

The "robot" or "spider" of a full-text search engine is network software whose core purpose is to obtain information on the Internet. It is generally defined as "software that retrieves a file on the network, automatically follows the file's hypertext structure, and recursively retrieves all referenced files." A robot uses the hypertext links on a page to traverse the WWW, crawling from one HTML document to another via URL references. The information collected by robots can serve multiple purposes: building indexes, validating HTML files, verifying URL links, monitoring for updated information, and mirroring websites.

Because robots crawl across the Internet, they need to maintain a URL list that records where they have been. Since the Web is hypertext, the URLs pointing to other documents are embedded inside documents and must be parsed and extracted. Robots are generally used to build index databases. Every WWW search program follows these steps (a condensed sketch follows the list):

(1) The robot takes a URL from the starting URL list and reads the content it points to from the Internet;

(2) it extracts some information (such as keywords) from each document and puts it into the index database;

(3) it extracts the URLs pointing to other documents and adds them to the URL list;

(4) it repeats the preceding three steps until no new URLs remain or a limit (time or disk space) is exceeded;

(5) finally, a search interface is added to the index database, which is then published or made available to online users for retrieval.
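
Steps (1)-(4) condense into a short crawl loop. In the sketch below, a dict stands in for the live Web and all names are illustrative:

```python
# Sketch of steps (1)-(4): fetch, index, extract links, repeat until done.
from collections import deque

# Stand-in for the live Web: url -> (document text, list of outgoing URLs).
WEB = {
    "u1": ("alpha beta", ["u2", "u3"]),
    "u2": ("beta gamma", ["u3"]),
    "u3": ("gamma delta", []),
}

def crawl(start_urls, max_pages=100):
    url_list = deque(start_urls)               # the robot's URL list
    seen = set(start_urls)                     # access track, to avoid revisits
    index = {}                                 # keyword -> set of URLs
    while url_list and len(seen) <= max_pages:  # step (4): stop conditions
        url = url_list.popleft()               # step (1): take a URL, read its content
        text, outlinks = WEB.get(url, ("", []))
        for word in text.split():              # step (2): extract keywords, index them
            index.setdefault(word, set()).add(url)
        for link in outlinks:                  # step (3): extract URLs, extend the list
            if link not in seen:
                seen.add(link)
                url_list.append(link)
    return index

print(crawl(["u1"]))
```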

Search algorithms generally follow one of two basic strategies: depth-first and breadth-first. The way the robot consumes its URL list determines the strategy. Processing URLs first-in-first-out yields a breadth-first search: when the starting list contains many WWW server addresses, breadth-first search quickly produces good initial results, but it has difficulty going deep into any one server. Processing URLs last-in-first-out yields a depth-first search, which produces a better document distribution and makes it easier to discover a document's structure, i.e., to find the maximum number of cross references. One can also traverse the address space directly, incrementing 32-bit IP addresses one by one to search the entire Internet.
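
The only difference between the two strategies is how the URL list is consumed: first-in-first-out gives breadth-first, last-in-first-out gives depth-first. A minimal illustration over a made-up link graph:

```python
# Same crawl loop, two orders: FIFO -> breadth-first, LIFO -> depth-first.
from collections import deque

LINKS = {"root": ["a", "b"], "a": ["a1"], "b": [], "a1": []}

def traverse(start, breadth_first=True):
    frontier = deque([start])
    order, seen = [], {start}
    while frontier:
        # FIFO (popleft) -> breadth-first; LIFO (pop) -> depth-first.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(traverse("root", breadth_first=True))   # ['root', 'a', 'b', 'a1']
print(traverse("root", breadth_first=False))  # ['root', 'b', 'a', 'a1']
```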

A full-text search engine is a highly technical network application system. It draws on network technology, database technology, dynamic indexing technology, retrieval technology, automatic classification technology, machine learning, and other artificial intelligence techniques.

3.2 Creating the Index

Indexing is one of the core technologies of a search engine. The engine must organize, classify, and index the collected information to generate an index database. The core of a Chinese search engine is word segmentation: using rules and a lexicon, sentences are split into words in preparation for automatic indexing. Currently most indexes are built with a non-clustered method. This technology depends heavily on an understanding of language and text; the key points are as follows (a minimal segmentation sketch follows the list):

(1) Store a syntax library and use it together with the vocabulary library to separate the words in a sentence;

(2) store a vocabulary library, recording each word's usage frequency and common collocations at the same time;

(3) divide the lexicon into different domain-specific libraries to facilitate the processing of specialized documents;

(4) treat sentences that cannot be segmented as single words.
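
One common lexicon-based technique consistent with these points is forward maximum matching: at each position, take the longest dictionary word that matches. A minimal sketch with a toy lexicon (the dictionary entries are invented for illustration):

```python
# Forward-maximum-matching word segmentation: a simple lexicon-based splitter.
LEXICON = {"搜索", "搜索引擎", "引擎", "技术"}  # toy dictionary
MAX_WORD_LEN = max(len(w) for w in LEXICON)

def segment(sentence):
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first, then shrink toward single characters.
        for length in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in LEXICON:
                words.append(candidate)  # unknown single chars pass through as-is
                i += length
                break
    return words

print(segment("搜索引擎技术"))  # ['搜索引擎', '技术']
```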

The indexer generates an index table mapping keywords to URLs. The index usually takes the form of an inverted list, i.e., looking up an index term yields the corresponding URLs. The index table should also record the positions where each term occurs in a document, so that the searcher can compute adjacency and proximity relationships between terms; it is stored on disk in a specific data structure.
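
A minimal positional inverted index along those lines (term -> {URL: [positions]}), already enough to check adjacency between two terms, might look like this sketch (all names and documents are illustrative):

```python
# Positional inverted index: term -> {url: [positions]}, enabling proximity checks.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(dict)
    for url, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(url, []).append(pos)
    return index

def adjacent(index, term1, term2):
    """URLs where term2 appears immediately after term1."""
    hits = []
    for url, positions in index.get(term1, {}).items():
        later = index.get(term2, {}).get(url, [])
        if any(p + 1 in later for p in positions):
            hits.append(url)
    return hits

docs = {"d1": "full text search engine", "d2": "search the full index"}
idx = build_index(docs)
print(adjacent(idx, "search", "engine"))  # ['d1']
```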

Different search engine systems may adopt different indexing methods. For example, WebCrawler uses full-text retrieval technology, indexing every word on a webpage; Lycos indexes only the page name, the title, the 100 most significant annotation words, and other selected words; Infoseek provides concept search and phrase search, and supports Boolean operators such as AND, OR, NEAR, and NOT. Indexing methods can be divided into three types: automatic indexing, manual indexing, and user registration (site submission).
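
Those Boolean operators map naturally onto set operations over posting sets. The sketch below is illustrative only, not Infoseek's actual implementation; a NEAR operator would additionally require the positional information shown in the previous sketch:

```python
# Boolean retrieval as set algebra over posting sets (illustrative data).
INDEX = {
    "search": {"d1", "d2", "d3"},
    "engine": {"d1", "d3"},
    "directory": {"d2"},
}
ALL_DOCS = {"d1", "d2", "d3"}

def AND(a, b): return INDEX.get(a, set()) & INDEX.get(b, set())
def OR(a, b):  return INDEX.get(a, set()) | INDEX.get(b, set())
def NOT(a):    return ALL_DOCS - INDEX.get(a, set())

print(AND("search", "engine"))    # {'d1', 'd3'}
print(OR("engine", "directory"))  # {'d1', 'd2', 'd3'}
print(NOT("directory"))           # {'d1', 'd3'}
```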

3.3 Query

The searcher's main function is to look up the keywords entered by the user in the inverted table built by the indexer, evaluate the relevance between each page and the query, sort the results for output, and implement a user relevance-feedback mechanism.

A search engine often returns hundreds of thousands of results. To surface useful information, the common approach is to score webpages by importance or relevance and sort them by relevance. Here, relevance refers to the frequency of the search keywords in a document: the higher the frequency, the more relevant the document is considered to be. Visibility is another common measure: the visibility of a webpage is the number of hyperlinks pointing to it. The underlying idea is that the more a page is referenced by other pages, the more valuable it is; in particular, the more it is referenced by important pages, the more important it is itself. Result processing techniques can be summarized as follows (a ranking sketch appears after the list):

(1) Frequency-based ranking. Generally, the more often a page contains the query keywords, the more relevant it is to the search target. This is a common solution.

(2) Popularity-based ranking. The search engine records how often each page is accessed; pages visited by many people usually contain more information or are otherwise attractive. Since most search users are not professionals, this solution suits general-purpose search engines.

(3) Secondary refinement of results: further refine the initial results, optimizing them according to certain conditions, for example by letting the user select categories or related words for a secondary search.
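
Scheme (1) in miniature: the sketch below ranks documents by raw keyword frequency. A real engine would normalize by document length and combine many more signals; all data here is invented:

```python
# Frequency-based ranking: score documents by how often the query terms occur.
DOCS = {
    "d1": "engine engine index",
    "d2": "engine crawl",
    "d3": "directory listing",
}

def score(doc_text, query_terms):
    words = doc_text.lower().split()
    return sum(words.count(t) for t in query_terms)

def rank(query):
    terms = query.lower().split()
    scored = [(score(text, terms), url) for url, text in DOCS.items()]
    return [url for s, url in sorted(scored, reverse=True) if s > 0]

print(rank("engine"))  # ['d1', 'd2'] - d1 first: higher keyword frequency
```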

Because current search engines are not intelligent, the first result is not necessarily the "best" one unless you already know the title of the document you are looking for. Some documents, although highly relevant, are not necessarily the ones you need most.
