SE handles queries, builds summaries, and determines importance

Source: Internet
Author: User


For brevity, "search engine" is abbreviated as "SE" below.

How does an SE handle users' queries?

A query is the form in which an SE lets users submit their information needs. Because users differ in background and in what they are looking for, no single form suits everyone. It is generally held that the most natural approach for ordinary users is to "type what you want", but that is a rather vague notion. For example, a user who enters "China Railway Express" may want the company's contact information, or may want to read reports and outside commentary about the company (for instance, news about it on other authoritative websites). These are two quite different needs.

In other cases the user is after indirect information. For "the height of the Himalayas", the answer 8848 meters is probably what the user needs, yet that figure is unlikely to appear in the query phrase itself. A user who enters a line of verse such as "bright moonlight" probably wants to know who wrote the poem, or to be reminded of the preceding line. Even so, the mainstream SE query mode is still to express the need directly as a word or phrase and to expect result pages to contain that word or the words in that phrase. This is both because it covers the majority of cases and because it is relatively easy to implement. In general, then, what the system receives is a query phrase.

In English the query phrase is a sequence of words; in Chinese it is a string of text containing several words. We use q0 to denote the original query submitted by the user, for example q0 = "Network and distributed Systems Lab". The query first has to be segmented, that is, divided into a sequence of words. For the example above this gives: network, and, distributed, systems, lab (note that different word segmentation software may produce different results). Next, words that carry no query meaning or that appear on almost every page (stop words) are deleted, in this case "and". The resulting list of query terms used for matching is written q = {t1, t2, ..., tm}; in this case q = {network, distributed, systems, lab}.
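The two steps above (segmentation, then stop-word removal) can be sketched roughly as follows. This is a minimal illustration, not how any particular SE does it: the stop-word list and the function names `segment` and `preprocess` are assumptions, and whitespace splitting only works for languages such as English; Chinese text would need a dedicated word segmenter.

```python
# Hypothetical stop-word list; a real engine would use a much larger one.
STOP_WORDS = {"and", "the", "of", "a", "an"}


def segment(query: str) -> list[str]:
    """Split a query into lowercase words on whitespace (English only)."""
    return query.lower().split()


def preprocess(q0: str) -> list[str]:
    """Turn the raw query q0 into the term list q = [t1, ..., tm]."""
    terms = []
    for word in segment(q0):
        # Drop stop words and repeated terms, keeping the original order.
        if word not in STOP_WORDS and word not in terms:
            terms.append(word)
    return terms


q = preprocess("Network and distributed Systems Lab")
print(q)  # ['network', 'distributed', 'systems', 'lab']
```

A list is used instead of a set so that the order of the terms in the query is preserved.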

How is a web page summary generated?

The result an SE returns is an ordered list of entries, each with three basic elements: title, URL, and summary. The summary has to be generated from the body of the web page. Generating a suitable abstract from a text is an important topic in natural language understanding, and years of work have produced some results. Applying those techniques in a web SE, however, runs into two basic difficulties.

The first is that web pages are usually not written to any standard and their text is fairly casual, so they are hard to handle from the standpoint of language understanding. The second is that complex language understanding algorithms take too much time and do not suit an SE's need to process massive amounts of web information efficiently. By some measurements, even word segmentation alone (the foundation of text understanding) runs at only about 20 pages per second on a high-end microcomputer. SEs therefore generate summaries in much simpler ways, which fall into two basic approaches. The first is static, that is, independent of the query: during preprocessing, some text is extracted from the page content according to fixed rules, such as taking the first 512 bytes of the page body (corresponding to 256 Chinese characters) or concatenating the first sentence of each paragraph. The summary is stored in the query subsystem and, once the page is selected as a match for a query, is read out and returned to the user.

This approach is clearly the easiest for the query subsystem, since it requires no extra processing at query time. Its biggest drawback is that the summary is unrelated to the query. One web page may be a result for several different queries, and a user who submits a query generally hopes the summary will highlight the text that corresponds directly to that query and show the sentences related to the words the user cares about. Hence the second approach, the "dynamic summary": in response to a query, the SE locates the query words in the page, extracts the surrounding text, and highlights the query words when displaying it. This is the approach most SEs use today. To keep queries efficient, the position of each keyword in the page has to be recorded during the preprocessing stage.
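The two static rules mentioned above can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, the GBK encoding is assumed because it stores a Chinese character in two bytes (which matches the "512 bytes, 256 characters" figure), and the set of sentence-ending characters is a guess.

```python
import re


def head_summary(body: str, max_bytes: int = 512, encoding: str = "gbk") -> str:
    """Return the text that fits in the first max_bytes bytes of the body."""
    raw = body.encode(encoding)[:max_bytes]
    # The cut may split a two-byte character; ignore the broken tail byte.
    return raw.decode(encoding, errors="ignore")


def first_sentences(body: str) -> str:
    """Join the first sentence of every paragraph (split on blank lines)."""
    picked = []
    for para in re.split(r"\n\s*\n", body):
        para = para.strip()
        if not para:
            continue
        # Assumed sentence enders: . ! ? and their Chinese counterparts.
        match = re.match(r".*?[.!?。!?]", para, flags=re.S)
        picked.append(match.group(0) if match else para)
    return " ".join(picked)


print(len(head_summary("中" * 300)))                   # 256
print(first_sentences("One. Two.\n\nThree! Four."))    # One. Three!
```

Either function would run once per page during preprocessing, so nothing has to be computed when a query arrives.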
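A dynamic summary can be sketched along the following lines. The names `build_positions` and `dynamic_summary`, the window size, and the use of square brackets as the highlight marker are all assumptions made for illustration; a real SE would store positions in its index and emit markup for highlighting.

```python
import re
from collections import defaultdict


def build_positions(body: str) -> dict[str, list[int]]:
    """Preprocessing step: record the character offset of every word."""
    positions = defaultdict(list)
    for m in re.finditer(r"\w+", body):
        positions[m.group(0).lower()].append(m.start())
    return positions


def dynamic_summary(body, positions, terms, window=30):
    """Extract text around the first hit of each query term and mark the term."""
    snippets = []
    for term in terms:
        offsets = positions.get(term.lower())
        if not offsets:
            continue  # this term does not occur in the page
        start = offsets[0]
        end = start + len(term)
        left = max(0, start - window)
        right = min(len(body), end + window)
        snippets.append(
            body[left:start] + "[" + body[start:end] + "]" + body[end:right]
        )
    return " ... ".join(snippets)


body = "The quick brown fox jumps over the lazy dog"
positions = build_positions(body)
print(dynamic_summary(body, positions, ["fox"], window=6))  # brown [fox] jumps
```

Because the offsets are computed once in `build_positions`, the query-time work is only a lookup and a slice, which is the reason for recording keyword positions in advance.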

How is the importance of a web page determined?

Information on the Web is heterogeneous and dynamic. Because of time and storage constraints, even the largest SE cannot crawl every web page in the world. A good crawling strategy is to fetch important pages first, so that the most important pages are collected in the shortest possible time. Meeting this requirement calls for a distributed, parallel architecture on the one hand and a way to prioritize important pages on the other. How the importance of a page is assessed depends on the application the collected information serves, so different crawling strategies can be adopted. For smaller-scale applications, such as topic-focused web crawling systems designed to discover specialized information, the crawler can work from a customized set of keywords and give priority to pages that contain all or some of those keywords, which it does by raising the weight of those pages' URLs and of the URLs they contain. For a scalable web crawling system designed to handle massive data, how to evaluate the "importance" of a web page is still a question worth studying and discussing.
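The keyword-driven prioritization described above could look roughly like the sketch below. Everything here is hypothetical: the `Frontier` class, the scoring rule (fraction of keywords found in the page), and the example.com URLs are assumptions chosen to keep the illustration short.

```python
import heapq


class Frontier:
    """Priority queue of URLs to crawl; higher weight is popped first."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker: equal weights keep insertion order

    def push(self, url, weight):
        # heapq is a min-heap, so the weight is negated.
        heapq.heappush(self._heap, (-weight, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]


def keyword_weight(page_text, keywords):
    """Fraction of the customized keywords that occur in the page text."""
    text = page_text.lower()
    hits = sum(1 for k in keywords if k.lower() in text)
    return hits / len(keywords) if keywords else 0.0


def enqueue_links(frontier, page_text, out_links, keywords):
    """Give every link found on a page the keyword weight of that page."""
    weight = keyword_weight(page_text, keywords)
    for url in out_links:
        frontier.push(url, weight)


keywords = ["search", "engine"]
frontier = Frontier()
enqueue_links(frontier, "news about cats", ["http://example.com/a"], keywords)
enqueue_links(frontier, "a search engine primer", ["http://example.com/b"], keywords)
print(frontier.pop())  # http://example.com/b
```

Links found on a page that matches more keywords are crawled earlier, which is one simple way to raise the weight of the URLs a relevant page contains.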

According to crawling experience, the characteristics that indicate an important web page are:

1 The in-degree of the page is large, meaning that many other pages link to it;

2 The in-degree of the page's parent pages is large;

3 The page has many mirrors, which shows that its content is popular and therefore important;

4 The directory depth of the page is small, so users can reach it easily while browsing.

Here the URL directory depth is defined as the number of directory levels in the page URL after the domain name is removed. These characteristics are not speculation; they reflect general regularities observed over years of SE work and in logs of user behavior. Some examples: the page of an important academic paper is cited often, which shows up as a large in-degree. A page can also be considered valuable and important if it is referenced by important pages or by other sites. A page whose URL directory depth is small sits "shallow" in its website; the person who edited it usually considered it important and placed it somewhere easy to reach. The home page of a website, or of each of its sections, is usually browsed frequently and therefore appears important.
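The four characteristics can be combined into a single score in many ways. The sketch below assumes a plain weighted sum with made-up weights, and it assumes that a final path segment containing a dot is a file name rather than a directory; neither choice comes from the text above.

```python
from urllib.parse import urlparse


def url_depth(url: str) -> int:
    """Number of directory levels in the URL path, ignoring the domain name."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Assumption: a last segment containing a dot is a file, not a directory.
    if parts and "." in parts[-1]:
        parts.pop()
    return len(parts)


def importance(in_degree, parent_in_degree, mirrors, depth,
               weights=(1.0, 0.5, 0.5, 1.0)):
    """Hypothetical score: links and mirrors add, directory depth subtracts."""
    w_in, w_parent, w_mirror, w_depth = weights
    return (w_in * in_degree
            + w_parent * parent_in_degree
            + w_mirror * mirrors
            - w_depth * depth)


print(url_depth("http://example.com/"))                 # 0
print(url_depth("http://example.com/a/b/page.html"))    # 2
print(importance(10, 5, 2, url_depth("http://example.com/")))  # 13.5
```

With these weights a home page outranks a deeply nested page that has the same link counts, matching the observation that shallow pages tend to be the ones editors consider important.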
