Analysis of the key points of the search engine

With the rapid development of the Internet and the growth of web information, finding what one needs in this ocean of data is like looking for a needle in a haystack; search engine technology solves exactly this problem by providing users with an information retrieval service. A search engine is a site that provides search services on the Internet: using network search software (such as a web search robot) or site registration, the server collects page information from a large number of web sites, processes it locally, and builds an information database and an index database in order to give users the information they need or pointers to it. The main retrieval modes include free-word full-text search, keyword search, category browsing, and special-purpose retrieval (for example enterprise names, person names, or telephone yellow pages). The following uses the web search robot as an example to illustrate search engine technology.

1. Network Robot Technology

A network robot (Robot) is also called a spider, worm, or wanderer; its core purpose is to obtain information on the Internet. It is generally defined as "software that retrieves files on a network and automatically follows the hypertext structure of those files, looping through all referenced files." The robot uses the hypertext links in a home page to traverse the WWW, crawling from one HTML document to another through URL references. The information collected by robots can serve many purposes, such as building indexes, validating the legality of HTML files, verifying and confirming URL links, monitoring for updated information, and site mirroring. Because robots crawl across the web, they need to maintain a URL list to record the pages visited. Since hypertext embeds the URLs of other documents inside a document, those URLs must be parsed out and extracted; the robot's output is typically used to generate an index database. All WWW search programs follow these work steps:

(1) The robot takes a URL from the starting URL list and reads the document it points to from the Internet.

(2) Extract some information (such as keywords) from each document and put it in the index database.

(3) Extract the URLs pointing to other documents and add them to the URL list.

(4) Repeat steps (1) to (3) until no new URLs appear or a limit (time or disk space) is exceeded.

(5) Provide a retrieval interface on top of the index database and publish it or make it available to users.

Search generally follows one of two basic strategies: depth first and breadth first. The robot determines the strategy by how it takes URLs from the list. First-in, first-out yields breadth-first search: when the start list contains many WWW server addresses, breadth-first search produces good initial results but has difficulty going deep into any one server. Last-in, first-out yields depth-first search, which produces a better document distribution and makes it easier to find the structure of a document, that is, the maximum number of cross-references. A traversal search is also possible: enumerate the 32-bit IP address space directly and search the entire Internet.
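The work steps and strategy choice above can be sketched as a single loop in which the frontier data structure decides the search order: taking URLs first-in, first-out gives breadth-first search, last-in, first-out gives depth-first. This is a minimal illustrative sketch, not a real crawler; `fetch`, `extract_keywords`, and `extract_urls` are assumed helper callables supplied by the caller.

```python
from collections import deque

def crawl(start_urls, fetch, extract_keywords, extract_urls,
          breadth_first=True, max_pages=100):
    """Minimal sketch of the robot's work steps.  The callables are
    assumptions: fetch(url) -> document, extract_keywords(doc) ->
    list of terms, extract_urls(doc) -> list of linked URLs."""
    frontier = deque(start_urls)            # the URL list
    seen = set(start_urls)
    index = {}                              # keyword -> set of URLs
    pages = 0
    while frontier and pages < max_pages:   # step (4): repeat until done
        # FIFO (popleft) = breadth-first; LIFO (pop) = depth-first.
        url = frontier.popleft() if breadth_first else frontier.pop()
        doc = fetch(url)                    # step (1): read the document
        for kw in extract_keywords(doc):    # step (2): index it
            index.setdefault(kw, set()).add(url)
        for link in extract_urls(doc):      # step (3): collect new URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        pages += 1
    return index                            # step (5): basis for retrieval
```

Swapping `popleft()` for `pop()` is the only change needed to switch strategies, which is exactly the point the text makes about how the URL list is accessed.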

A search engine is a network application system with high technical content. It involves network technology, database technology, dynamic indexing technology, retrieval technology, automatic classification technology, and artificial intelligence techniques such as machine learning.

2. Indexing technology

Indexing technology is one of the core technologies of a search engine. The engine organizes, classifies, and indexes the collected information to produce an index library, and for Chinese search engines the core is word segmentation technology. Word segmentation uses rules and a lexicon to cut a sentence into words, preparing it for automatic indexing. Current indexes mostly use a non-clustered method. This technique depends heavily on linguistic knowledge, specifically:

(1) Store a grammar library alongside the vocabulary library, and use it to separate the words in a sentence;

(2) When storing vocabulary, record both word frequency and common collocations;

(3) For breadth of vocabulary, divide it into separate specialized libraries to ease the processing of professional literature;

(4) For sentences that cannot be segmented, treat each character as a word.
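As an illustration of rule-and-lexicon segmentation, the forward maximum matching sketch below always takes the longest lexicon entry at the current position and falls back to a single character, as in point (4) above. The lexicon here is a toy assumption; real segmenters also use the frequency and collocation information the text describes.

```python
def segment(sentence, lexicon, max_word_len=4):
    """Forward maximum matching over a set of known words."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first, down to a single character.
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words
```

For example, with the lexicon `{"ab", "cd"}` the string `"abcd"` segments into `["ab", "cd"]`, while characters not covered by the lexicon become single-character words.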

The indexer generates a relational index table from keywords to URLs. Index tables typically use some form of inverted table (inversion list), in which an index entry leads to the corresponding URLs. The index table also records where each index entry appears in the document, so that the retriever can compute adjacency or proximity relationships between index entries, and it is stored on disk in a specific data structure.
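A positional inverted table of this kind can be sketched as follows. The structure (term → URL → positions) is what lets the retriever compute adjacency later; this is an in-memory illustration, not the on-disk layout a real engine would use.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {url: list of index terms in document order}.
    Returns {term: {url: [positions]}} -- an inversion list that
    records where each entry appears, as described above."""
    index = defaultdict(dict)
    for url, terms in docs.items():
        for position, term in enumerate(terms):
            index[term].setdefault(url, []).append(position)
    return dict(index)
```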

Different search engine systems may use different indexing methods. For example, WebCrawler uses full-text retrieval technology, indexing every word on a web page; Lycos indexes only selected words such as the page name, the title, and the 100 most important annotation words; InfoSeek provides concept retrieval and phrase retrieval and supports Boolean operations such as AND, OR, NEAR, and NOT. Indexing methods can be divided into three categories: automatic indexing, manual indexing, and user registration.

3. Retrieval and result processing technology

The main function of the retriever is to search the inverted table built by the indexer using the user's input keywords, evaluate the relevance between each page and the query, sort the results for output, and implement some form of user relevance feedback.

A search engine often returns hundreds of thousands of results. To surface useful information, the common method is to grade pages by importance or relevance and rank them accordingly. Relevance here refers to how often the search keywords appear in a document: the more often, the higher the document's relevance. Visibility is another common metric: the visibility of a web page is the number of hyperlinks pointing to it. This approach rests on the view that the more a page is referenced by other pages, the more valuable it is; in particular, the more important the pages that reference it, the more important the page. Result processing techniques can be summarized as follows:

(1) Sort by frequency. Usually, if a page contains more of the query keywords, its relevance to the search target should be higher. This is a common-sense solution.

(2) Sort by page access count. In this method, the search engine records how often the pages it indexes are accessed. Frequently visited pages should usually contain more information or have other attractions. Since most search engine users are not professional searchers, this scheme suits general-purpose search engines.

(3) Refine results with a second retrieval. The first results are optimized according to certain conditions; the user can select a category or related words and search again within them.
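The frequency and visibility measures above can be combined into one ranking function, as in the sketch below. The weighted-sum scoring rule and its weights are illustrative assumptions, not values from any production engine.

```python
def rank_results(keyword_freq, inlinks, freq_weight=1.0, link_weight=0.5):
    """Rank URLs by a weighted sum of keyword frequency (relevance)
    and number of incoming links (visibility)."""
    def score(url):
        return (freq_weight * keyword_freq[url]
                + link_weight * inlinks.get(url, 0))
    # Highest score first.
    return sorted(keyword_freq, key=score, reverse=True)
```

A page with few keyword matches but many incoming links can thus outrank a page with more matches, which is precisely the effect the visibility measure is meant to add.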

Because current search engines are not intelligent, unless you already know the title of the document you are looking for, the top-ranked result is not necessarily the "best" one. Some documents, although highly relevant, are not necessarily the documents users need most.

Industry applications of search engine technology:

Industry application of search engines generally refers to the various application modes of search engine technology in different industries and products, such as those of KW Communication, which fall into the following forms:

1. Government agencies industry application

- Track and collect information sources related to government work in real time.

- Fully meet internal staff's need for a global view of Internet information.

- Promptly solve the problem of sourcing information for the government extranet and intranet, and realize dynamic publishing.

- Quickly satisfy the main government web sites' need to obtain information at the local level.

- Comprehensively integrate information, achieving cross-regional, cross-departmental information sharing and effective communication within the government.

- Save the manpower, material resources, and time spent on information collection, improving office efficiency.

2. Enterprise Industry Application

- Monitor and track competitors' dynamics in real time and with accuracy: a sharp weapon for obtaining competitive intelligence.

- Obtain competitors' public information in a timely way in order to study industry development and market demand.

- Provide decision-making departments and management with a convenient, multi-channel tool for strategic decisions.

- Improve the efficiency of information acquisition and use, and save the costs of information collection, storage, and mining: the key to raising an enterprise's core competitiveness.

- Improve the enterprise's overall analytical and research ability and its speed of market response, and build a competitive-intelligence data warehouse centered on knowledge management: the nerve center of core competitiveness.

3. News media Industry Application

- Quickly and accurately track and collect information from thousands of network media automatically, expanding news leads and improving acquisition speed.

- Support effective crawling of tens of thousands of news items daily; the depth and breadth of the monitored range can be configured.

- Support intelligent extraction and auditing of the required content.

- Integrate the collection, browsing, editing, management, and publishing of Internet information content.

4. Industry website Application

- Track and collect information sources related to the site in real time.

- Track industry information source web sites, updating site information automatically, quickly, and dynamically.

- Integrate the collection, browsing, editing, management, and publishing of Internet information content.

- Offer a business management model for commercial web sites, greatly improving their business applications.

- Generate classified directories for information sites with a user-defined classification structure that can be extended and updated in real time, without limits on depth, greatly benefiting industry applications.

- Provide professional SEO services, quickly improving the promotion of industry web sites.

- Provide advertising cooperation with CCDC call search engines and establish industry web site alliances to raise the popularity of industry web sites.

5. Network information monitoring and surveillance

- Network public opinion systems, such as the "KW Communication network public opinion radar monitoring system".

- Web site information and content monitoring and surveillance systems, such as the "KW Communication web site information and content monitoring system (In-Station God)".

With the rapid development of the Internet and the growth of web information, finding information in this ocean is like looking for a needle in a haystack; search engine technology solves this problem by providing users with an information retrieval service. At present, search engine technology is becoming an object of research and development in both the computer industry and academia.

The search engine is a technology that has developed gradually since 1995 with the rapid growth of web information. According to "Accessibility of Web Information," published in Science in July 1999, there were then more than 800 million destination web pages worldwide holding over 9 TB of effective data, and the web was still doubling every 4 months. Users searching such a vast ocean of information would otherwise fish for a needle in the sea with little to show for the effort. Search engines arose to solve this problem: following certain strategies, they collect and discover information on the Internet, then understand, extract, organize, and process it, and provide users with a retrieval service, thereby serving as information navigation. This navigation service has become a very important network service, and search engine sites are also called "network portals." Search engine technology has therefore become an object of research and development in the computer industry and academia. This article briefly introduces the key technologies of search engines in the hope of being useful.

Classification

According to the different methods of information collection and service delivery, search engine systems can be divided into three main categories:

1. Directory search engine: Collects information manually or semi-automatically. After editors review the information, they write a summary and place it into a predetermined classification framework. Most of the information is site-oriented, and the service provides directory browsing and direct retrieval. Because human intelligence is involved, the information is accurate and the navigation quality is high; the drawbacks are the need for human intervention, a heavy maintenance burden, a small amount of information, and untimely updates. Representatives of this kind of search engine include Yahoo, LookSmart, Open Directory, and Go Guide.

2. Robot search engine: A robot program called a spider automatically collects and discovers information on the Internet following a certain strategy; an indexer builds an index from the collected information; and a retriever searches the index library according to the user's query and returns the results. The service mode is page-oriented full-text retrieval. The advantages of this kind of engine are a large volume of information, timely updates, and no need for human intervention; the drawback is that too many results are returned, with much irrelevant information, so the user must filter them. Representatives include AltaVista, Northern Light, Excite, Infoseek, Inktomi, FAST, Lycos, and Google; domestic representatives include Skynet, Youyou, and Openfind.

3. Meta search engine: Such engines have no data of their own; instead they forward the user's query to several search engines at once, remove duplicates from the returned results, re-rank them, and return them to the user as their own results. The service mode is page-oriented full-text retrieval. The advantage is that the returned results are larger and more complete; the drawbacks are that the capabilities of the underlying engines cannot be fully exploited and the user must do more filtering. Representatives include WebCrawler and InfoMarket.
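The merge-deduplicate-rerank pipeline of a meta search engine can be sketched in a few lines. The scoring rule below (crediting each URL by how highly each engine ranked it) is an illustrative assumption, and the `engines` callables stand in for real engine queries.

```python
def meta_search(query, engines, limit=10):
    """Forward one query to several engines, drop duplicate URLs,
    and re-rank by combining each engine's ranking.  `engines` is a
    list of callables, each returning an ordered list of URLs."""
    scores = {}
    for engine in engines:
        for rank, url in enumerate(engine(query)):
            # A URL listed by several engines, or ranked highly,
            # accumulates a larger score; duplicates merge here.
            scores[url] = scores.get(url, 0.0) + 1.0 / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:limit]
```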

Performance Indicators

The search for web information can be viewed as an information retrieval problem: retrieving the documents relevant to a user query from a document library made up of web pages. We can therefore measure the performance of a search engine with the performance parameters of a traditional information retrieval system: recall and precision.

Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the library; it measures the completeness of the retrieval system (search engine). Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved; it measures the accuracy of the system. For a retrieval system, recall and precision trade off against each other: raising recall tends to lower precision, and raising precision tends to lower recall. Therefore, the average of the precisions at 11 recall levels (the 11-point average precision) is often used to measure a system's accuracy. For a search engine, recall is difficult to calculate because no engine can collect all web pages, so current search engine systems care most about precision.
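The two definitions above reduce to simple set arithmetic, sketched below for a query whose retrieved and relevant document sets are known.

```python
def precision_recall(retrieved, relevant):
    """retrieved, relevant: sets of document ids.
    Precision = hits / |retrieved|; recall = hits / |relevant|."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if a query retrieves 3 documents of which 1 is relevant, while the library holds 2 relevant documents, precision is 1/3 and recall is 1/2, showing how the two measures can diverge.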

Many factors affect the performance of a search engine system. The most important is the information retrieval model, which covers how documents and queries are represented, the matching strategy for evaluating the relevance between a document and a user query, the method of ranking query results, and the mechanism for user relevance feedback.

Main Technologies

A search engine consists of four parts: the searcher, the indexer, the retriever, and the user interface.

1. Searcher

The function of the searcher is to roam the Internet, discovering and collecting information. It is usually a computer program that runs day and night, collecting as many types of new information as possible, as quickly as possible. Because information on the Internet changes quickly, it must also regularly refresh the old information it has collected to avoid dead and invalid links. There are currently two strategies for gathering information:

Start from a set of starting URLs and follow the hyperlinks in them, traversing the Internet breadth-first, depth-first, or heuristically to find information. These starting URLs can be any URLs, but are often very popular sites containing many links (such as Yahoo!).

Divide the web space by domain name, IP address, or country-code domain, and make each searcher responsible for an exhaustive search of one subspace. The searcher collects many types of information, including HTML, XML, newsgroup articles, FTP files, word-processing documents, and multimedia. Searchers are often implemented with distributed and parallel computing techniques to improve the speed of information discovery and updating; commercial search engines can discover millions of pages per day.
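One simple way to partition the web space among searchers, consistent with the division by domain described above, is to hash the host name of each URL. This is a sketch under that assumption; real systems also balance load and handle very large sites specially.

```python
import hashlib

def assign_searcher(url, n_searchers):
    """Partition the web space by host so each searcher exhaustively
    covers one subspace.  Hashing the host (rather than the full URL)
    keeps all pages of a site with the same searcher."""
    host = url.split("/")[2] if "//" in url else url.split("/")[0]
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % n_searchers
```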

2. Indexers

The function of the indexer is to understand the information found by the searcher, extract index entries from it, use them to represent the document, and generate the index table of the document library.

Index entries come in two kinds, objective entries and content entries. Objective entries are independent of the document's semantic content, such as author name, URL, update time, encoding, length, and link popularity. Content entries reflect the document's content, such as keywords and their weights, phrases, and single words. Content entries can be divided into single index entries and multiple (phrase) index entries. In English a single index entry is simply an English word, which is easy to extract because words have a natural separator (the space); for Chinese and other continuously written languages, word segmentation is required.

In a search engine, it is common to assign each single index entry a weight indicating how well it distinguishes the document, which is also used to compute the relevance of query results. The methods used include statistical, information-theoretic, and probabilistic approaches; phrase index entries are extracted with statistical, probabilistic, and linguistic methods.
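A classic statistical weighting of the kind mentioned above is TF-IDF: a term weighs more when it is frequent in the document but rare across the library, which is exactly the "degree to which the index entry distinguishes the document". This is one standard choice among the statistical methods, sketched minimally here.

```python
import math

def tf_idf(term, doc_terms, all_docs):
    """doc_terms: list of terms in one document; all_docs: list of
    such lists for the whole library.  Returns the term's weight:
    term frequency times inverse document frequency."""
    tf = doc_terms.count(term) / len(doc_terms)
    df = sum(1 for d in all_docs if term in d)      # document frequency
    idf = math.log(len(all_docs) / df) if df else 0.0
    return tf * idf
```

A term that appears in every document gets weight 0 (its IDF is log 1 = 0), capturing the intuition that such a term distinguishes nothing.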

Index tables typically use some form of inverted table (inversion list), in which an index entry leads to the corresponding documents. The index table may also record where each entry appears in the document, so that the retriever can compute adjacency or proximity relationships between entries.

An indexer can use either a centralized or a distributed indexing algorithm. When the volume of data is very large, indexing must happen in real time (instant indexing), or it cannot keep pace with the rapid growth of information. The indexing algorithm strongly affects the indexer's performance, for example its response speed under large peak query loads. The effectiveness of a search engine depends largely on the quality of the index.

3. Retriever

The function of the retriever is to quickly find documents in the index library according to the user's query, evaluate the relevance between the documents and the query, sort the results for output, and implement some form of user relevance feedback.

There are four kinds of information retrieval models: set-theoretic, algebraic, probabilistic, and hybrid models.

4. User interface

The function of the user interface is to accept user queries, display query results, and provide a user relevance feedback mechanism. Its main purpose is to let users use the search engine conveniently and efficiently, obtaining effective and timely information in a variety of ways. The design and implementation of the user interface draws on the theory and methods of human-computer interaction to fit human habits of thought.

The user input interface can be divided into simple and complex interfaces.

A simple interface provides only a text box for entering the query string; a complex interface lets users constrain the query with logical operators (AND, OR, NOT, +, -), proximity relationships (adjacency, NEAR), domain-name ranges (such as .edu, .com), places of occurrence (such as title or body), information date, length, and so on. Some companies and organizations are currently considering standards for query options.
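The logical operators a complex interface exposes can be evaluated directly against an inverted index mapping each term to its set of URLs. The toy evaluator below covers AND, OR, and NOT; NEAR would additionally need the term positions stored in the index, and the function shape here is an illustrative assumption.

```python
def evaluate(index, all_urls, op, *terms):
    """Evaluate one boolean operator over an inverted index
    (term -> set of URLs).  all_urls is the full document set,
    needed as the universe for NOT."""
    sets = [index.get(t, set()) for t in terms]
    if op == "AND":
        return set.intersection(*sets) if sets else set()
    if op == "OR":
        return set.union(*sets) if sets else set()
    if op == "NOT":
        return all_urls - sets[0]
    raise ValueError("unknown operator: " + op)
```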

Future trends

The search engine has become a new field of research and development. Because it applies theory and technology from information retrieval, artificial intelligence, computer networks, distributed processing, databases, data mining, digital libraries, natural language processing, and other fields, it is both comprehensive and challenging. And because search engines have huge numbers of users and excellent economic value, they attract great attention from computer science and the information industry worldwide. Research and development are very active, and several trends are worth noting.

1. Great attention is being paid to improving the precision of query results. What matters to the effectiveness of a search is not the number of results returned but whether the results match the user's needs. For a single query, a traditional search engine often returns hundreds of thousands or millions of documents, which the user then has to filter. There are several ways to address this overload. First, obtain the true intent the user did not express in the query statement: use intelligent agents to track retrieval behavior and analyze a user model, or use a relevance feedback mechanism that lets users tell the engine which documents are relevant to their needs (and how relevant) and which are not, gradually refining the results over several interactions. Second, classify the results with text categorization technology and display the category structure with visualization techniques, so users can browse only the categories they are interested in. Third, cluster results by site or by content to reduce the total amount of information.
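The third approach, site clustering, can be sketched as simply grouping result URLs by host, so the user sees one entry per site instead of many near-duplicate hits. This is a minimal illustration; content clustering would instead compare the documents themselves.

```python
def cluster_by_site(urls):
    """Group result URLs by host name ('site clustering' above)."""
    clusters = {}
    for url in urls:
        host = url.split("/")[2] if "//" in url else url.split("/")[0]
        clusters.setdefault(host, []).append(url)
    return clusters
```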

2. Information filtering and personalized service based on intelligent agents

An information intelligent agent is another mechanism for using Internet information. It uses an automatically acquired domain model (such as knowledge of the web, of information processing, of the information resources related to the user's interests, and of domain structure) and a user model (such as the user's background, interests, behavior, and style) to collect, index, and filter information (including interest filtering and filtering of bad information), and automatically delivers the information the user is interested in or finds useful. An intelligent agent can learn continuously and adapt to dynamic changes in the information and in the user's interests, thereby providing a personalized service. It can run on the client side or the server side.

3. Improve system scale and performance with a distributed architecture

A search engine can adopt either a centralized or a distributed architecture, two different approaches. When the scale of the system reaches a certain level (for example, a billion web pages), some kind of distribution is needed to improve performance. All search engine components except the user interface can be distributed: searchers can cooperate across multiple machines, dividing up information discovery to improve the speed of discovery and updating; indexers can distribute the index across machines to reduce the demands on each one; and retrievers can run on different machines.
