Analysis on the Development of Search Engine Technology

Source: Internet
Author: User

In the previous issue, we introduced several traditional search engine technologies. How will search engine technology develop in the future? As AI technology matures and information services diversify, search engines are evolving toward intelligence and personalization.

With the "attention economy" sweeping across the Internet, capital is flowing rapidly into the search engine market, which draws more attention than any other. Numerous surveys show that the search engine market is growing rapidly and will become one of the most promising industries in the next few years.

How long should it take to log on to a website, search the Internet for a certain type of content, and obtain the latest and most comprehensive information?

A few years ago, people expected results in a dozen or so seconds, 30 seconds at most. Now the expectation is 1 to 2 seconds: with a click of the mouse, the page changes and the titles of the top 10 or 20 results are already in front of you.

Currently, search engine technology is the second core technology of the Internet after the portal. It is comprehensive and challenging, drawing on theories and technologies from information retrieval, artificial intelligence, computer networks, distributed processing, databases, data mining, digital libraries, natural language processing, and other fields. With the spread of the Internet and the explosive growth of online information, it has attracted more and more attention.

Deep Processing of search results

When you use a search engine, you care less about the number of returned results than about whether the results meet your needs. For a common query, a traditional search engine often returns hundreds of thousands or even millions of documents. A result set of that size is of little practical use.

There are several ways to deal with result sets that are too large. First, various methods can be used to infer the real intent that users did not express in the query, including intelligent agents that track user search behavior, analysis of user behavior models, and relevance feedback mechanisms, so that the relevance between documents and user needs can be determined and retrieval accuracy improved. Second, text classification can be used to categorize the results, with visualization used to display the category structure, so that users browse only the categories they are interested in. Third, site clustering or content clustering can be performed to reduce the total amount of information and help users find what they need among a large number of returned results.
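The third approach, site clustering, can be illustrated with a minimal sketch. It assumes each result is a (URL, title) pair and simply groups results by host name; the function name `cluster_by_site` and the sample data are hypothetical, and a real engine would likely use a more elaborate notion of "site".

```python
from collections import defaultdict
from urllib.parse import urlparse


def cluster_by_site(results):
    """Group (url, title) results by host name, keeping rank order within each group."""
    clusters = defaultdict(list)
    for url, title in results:
        host = urlparse(url).netloc.lower()
        clusters[host].append(title)
    return dict(clusters)


# Hypothetical ranked result list.
results = [
    ("http://example.com/a", "Page A"),
    ("http://example.org/x", "Page X"),
    ("http://example.com/b", "Page B"),
]
clusters = cluster_by_site(results)
for host, titles in clusters.items():
    # Show one representative title per site plus the group size.
    print(host, len(titles), titles[0])
```

Showing one entry per site instead of every page is one way to shrink a long result list without discarding anything; the user can expand a group on demand.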

Provide personalized services

To provide personalized services, the search engine needs to obtain information about user interests. Either of the following two methods can be used. The first is based on training. In the training phase, keywords are classified using information theory and assigned a feature degree (keywords are divided by contribution rate into positive feature words, negative feature words, and zero feature words); the feature degree of the title is then defined, and statistics are collected on the various feature words.

In the test phase, the interest description file (usually stored in XML format) is applied to obtain the user's interests dynamically and deliver the pages the user is interested in. This method avoids the difficulty of describing users' interests: users find it hard to describe their interests, but they can readily judge whether a given article meets their needs.

The other method is to dynamically update user interests based on the user's bookmarks, the keywords entered in each search, and the user's responses. By analyzing the intent behind a user's behavior, you can obtain information about the user's interests and the sensitivity of those interests. In addition, keywords entered by users are also used as positive feature words to dynamically update the user interest file.
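The dynamic update described above can be sketched roughly as follows. This is an illustration under simple assumptions, not the scheme of any particular engine: the class `InterestProfile`, its methods, and the fixed step size are all hypothetical. Words with a positive weight play the role of positive feature words, words with a negative weight play the role of negative feature words, and words absent from the profile count as zero feature words.

```python
from collections import defaultdict


class InterestProfile:
    """Keyword weights: positive = interest, negative = disinterest, absent = zero."""

    def __init__(self, step=1.0):
        self.weights = defaultdict(float)
        self.step = step  # assumed fixed update size

    def add_query(self, keywords):
        # Keywords typed by the user are treated as positive feature words.
        for word in keywords:
            self.weights[word.lower()] += self.step

    def add_feedback(self, page_words, liked):
        # A user response to a page moves that page's words up or down.
        delta = self.step if liked else -self.step
        for word in {w.lower() for w in page_words}:
            self.weights[word] += delta

    def score(self, page_words):
        # Sum the weights of the distinct words on a page; unknown words add 0.
        return sum(self.weights.get(w, 0.0) for w in {x.lower() for x in page_words})


profile = InterestProfile()
profile.add_query(["search", "engine"])
profile.add_feedback(["search", "ranking"], liked=True)
profile.add_feedback(["celebrity", "gossip"], liked=False)
print(profile.score(["search", "ranking", "gossip"]))
```

Pages can then be ordered by `score` so that those closest to the accumulated interests come first.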

Smart Search

There is no doubt that search engines are evolving toward intelligence. Smart Hunter builds on the current development trend of search engines. In addition to traditional functions such as fast network-wide retrieval and relevance ranking, it provides user role registration, automatic recognition of user interests, semantic understanding of content, and intelligent information filtering and pushing, giving users a truly personalized and intelligent tool for collecting network information (see the figure below).

Intelligent search engines use technologies such as neural networks, decision trees, association rules, case-based reasoning, fuzzy clustering, rough sets, and hidden Markov models to implement distributed parallel search. Data mining and knowledge discovery are the main means. In addition, natural language understanding technology further analyzes the search results and filters out information that is irrelevant or only weakly relevant to user needs, improving system performance and retrieval accuracy.

1. Natural language search

Intelligent search engines are built on natural language search. Backed by a large-scale knowledge base, a powerful inference engine analyzes the search request expressed in natural language and forms a search strategy, which is then executed. Users only need to state their requirements to the computer to get the search results, which frees them from complicated search rules.

Natural language query takes two forms:

In the first, the user enters a natural-language sentence. The system splits it and extracts multiple word pairs to form a finite state machine, then matches them against the database and accumulates the hit frequency of each retrieved record. After several searches, the results are sorted by hit frequency and returned to the user. This form applies natural language analysis only to the query request.

The second applies natural language analysis to the target documents. This involves not only word segmentation, lexical analysis, syntactic analysis, and semantic analysis, but also analysis of discourse structure, in order to understand the meaning of the article. It is technically difficult, and there is basically no successful model yet.
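The first form, which analyzes only the query, can be sketched as follows. This is a simplification under stated assumptions: words are split on whitespace (Chinese text would need real word segmentation), adjacent words form the pairs, and a plain set-membership test stands in for the finite state machine mentioned in the text. The names `word_pairs` and `rank_by_hits` and the sample records are hypothetical.

```python
from collections import Counter


def word_pairs(sentence):
    """Split a query sentence into adjacent word pairs."""
    words = sentence.lower().split()
    return list(zip(words, words[1:]))


def rank_by_hits(sentence, records):
    """Run one search per word pair and rank records by accumulated hit count."""
    hits = Counter()
    for first, second in word_pairs(sentence):
        for record_id, text in records.items():
            tokens = set(text.lower().split())
            if first in tokens and second in tokens:
                hits[record_id] += 1
    # Highest hit frequency first; ties broken by record id for stable output.
    return sorted(hits, key=lambda record_id: (-hits[record_id], record_id))


records = {
    "d1": "cheap hotel rooms in Beijing",
    "d2": "Beijing hotel guide",
    "d3": "train timetable",
}
ranking = rank_by_hits("cheap hotel Beijing", records)
print(ranking)
```

Records that never match a pair do not appear in the ranking at all, which mirrors the idea that only retrieved records accumulate a frequency.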

2. Mobile agent technology

A mobile agent is a program that can move across the network and run independently to complete specified tasks as required by users.

Mobile agent technology is a new generation of distributed computing technology and differs fundamentally from traditional distributed computing. In the mobile agent model, the client does not submit simple requests to the server; instead, it sends a mobile object that contains both code and data. The mobile object acts on behalf of the user and, following the principle of "moving the program to the data", travels between servers to complete data processing tasks.

Applications based on the mobile agent model can greatly reduce network bandwidth use, effectively overcome problems caused by network latency, and execute asynchronously. This replaces the traditional search engine model of "moving the data to the program", greatly reducing network traffic and saving network resources.
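The idea of moving the program to the data can be illustrated with a small in-process simulation. Nothing here actually travels over a network: the `Server` and `SearchAgent` classes and the `dispatch` function are hypothetical stand-ins, and a real mobile agent (mobile proxy) platform would also have to handle code transfer, security, and failure recovery.

```python
class Server:
    """A host that stores documents and can run visiting code against them."""

    def __init__(self, name, documents):
        self.name = name
        self.documents = documents

    def execute(self, agent):
        # The agent's code runs here, next to the data it needs.
        agent.visit(self.name, self.documents)


class SearchAgent:
    """Carries code (visit) and data (keyword, results) from server to server."""

    def __init__(self, keyword):
        self.keyword = keyword.lower()
        self.results = []

    def visit(self, server_name, documents):
        # Only matching documents are kept; the rest never leave the server.
        for doc in documents:
            if self.keyword in doc.lower():
                self.results.append((server_name, doc))


def dispatch(agent, itinerary):
    """Send the agent along an itinerary of servers and return what it collected."""
    for server in itinerary:
        server.execute(agent)
    return agent.results


servers = [
    Server("s1", ["search engine history", "cooking tips"]),
    Server("s2", ["parallel search", "travel notes"]),
]
found = dispatch(SearchAgent("search"), servers)
print(found)
```

The bandwidth saving comes from the filter running on each server: only the matches travel back with the agent, rather than every document being shipped to the client.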

3. Parallel retrieval

Parallel information retrieval uses a computer system consisting of multiple processing units or processors that work simultaneously to retrieve information. An information retrieval system can use parallel strategies such as task parallelism, data parallelism, or a hybrid of the two. One approach builds the information search process on a neural network.

If a neural network is not used, existing information retrieval algorithms can be parallelized by partitioning the data and the computation.

Data is partitioned by logical or physical document partitioning. Logical document partitioning requires extending the inverted file so that each parallel process can directly access the part of the index that corresponds to the subset of documents handled by that processor. Physical document partitioning divides the documents into separate, self-contained subsets; each subset is assigned to one parallel processor and has its own inverted file.
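Physical document partitioning can be sketched as follows. The sketch assumes whitespace tokenization and round-robin assignment of documents to partitions; the function names are hypothetical. Threads are used only to show the structure of the computation, since CPython threads do not provide true CPU parallelism, so a real system would use separate processes or machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_inverted_file(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def partition(docs, n):
    """Split the documents into n self-contained subsets (round-robin)."""
    shards = [dict() for _ in range(n)]
    for i, (doc_id, text) in enumerate(sorted(docs.items())):
        shards[i % n][doc_id] = text
    return shards


def parallel_search(term, indexes):
    """Query every partition's own inverted file and merge the partial results."""
    term = term.lower()
    with ThreadPoolExecutor(max_workers=len(indexes)) as pool:
        partial = list(pool.map(lambda idx: idx.get(term, set()), indexes))
    return sorted(set().union(*partial))


docs = {
    1: "parallel retrieval systems",
    2: "inverted file structure",
    3: "parallel inverted index",
}
shards = partition(docs, 2)
indexes = [build_inverted_file(shard) for shard in shards]
print(parallel_search("parallel", indexes))
print(parallel_search("inverted", indexes))
```

Because each partition has its own inverted file, a partition can be rebuilt or moved without touching the others; the cost is that every query must visit every partition and the partial results must be merged.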

4. Distributed retrieval

A distributed search engine stores and maintains information in physically distributed storage across the network, and combines a wide range of distributed, heterogeneous documents into a logical whole, providing users with distributed information retrieval.

In addition to a large amount of text, a distributed document collection also includes other types of data: graphics, images, video, audio, and other multimedia. The purpose of distributed information retrieval is to identify and retrieve documents from the distributed collection according to consistent information descriptions. A distributed information collection tool guides the user through the distributed information space, selects the appropriate document collections, and retrieves the data.

Search engine technology is comprehensive and challenging, involving artificial intelligence, computer networks, distributed processing, parallel computing, data mining, knowledge discovery, natural language processing, and many other technologies. As these technologies develop further, performance will keep improving, and search engines that better meet users' needs will emerge.

Grandstand

Search engine tips

◆ Search with logical words

Common logical operators include AND, OR, NOT, and NEAR (the two words must appear close to each other).

◆ Use double quotation marks for exact search

If you are looking for a phrase or a string of several Chinese characters, the best way is to enclose it in double quotation marks.

◆ Use the plus or minus sign to limit search

A plus sign (+) before a search term means that the term must be included in the search results. A minus sign (-) before a search term means that the term must not be included in the search results.

◆ Case sensitive

Many search engines distinguish between uppercase and lowercase letters.

◆ Restrict the query range

The more tightly you restrict the query range, the more accurately you can find the information you want.

◆ Use as few spaces as possible

When entering Chinese characters as keywords, do not add unnecessary spaces after them, because a space is treated as a special operator that works in the same way as AND.

◆ Search for the author name, organization name, or company name from the top or bottom of the page
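The plus and minus operators described in the tips above can be illustrated with a minimal sketch. Real search engines differ in how they parse queries, so this is only one plausible reading: terms with `+` must appear, terms with `-` must not appear, and plain terms are optional. The functions `parse_query` and `matches` are hypothetical, and text is matched on whitespace-separated words.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)
    return required, excluded, optional


def matches(query, text):
    """Return True if the text satisfies the plus/minus constraints of the query."""
    required, excluded, optional = parse_query(query)
    words = set(text.lower().split())
    if any(term not in words for term in required):
        return False
    if any(term in words for term in excluded):
        return False
    # Without required terms, at least one optional term must appear.
    return bool(required) or not optional or any(t in words for t in optional)


print(matches("+search -image engine", "a fast search engine"))
print(matches("+search -image", "image search tools"))
```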
