What Is the Value of a Page to an Internet Search Engine?

Source: Internet
Author: User
Keywords: search engine, user, value, Internet, page

Search engines handle hundreds of millions of queries a day, each of which represents a user's specific need for a resource. Most of the time these needs are met by the pages returned for the query, so we can assume that certain pages in the results are valuable for a particular user's specific need. So what is page value from a search engine's point of view, why should we study it, and how do we judge it? This article answers these questions one by one.

One, what is page value

We said that when a page meets a specific need of a user, that reflects the page's value to the user. So what is its value to a search engine? A simple corollary is that every page that may be of value to users is valuable to the search engine: such pages can be built into the search engine's index to meet the needs of its users. We call this retrieval value. As long as a page can satisfy some user's information need and can be reached through some normal search request, it has retrieval value.

Zhang, an elementary school pupil, likes to keep a diary on Qzone, writing about what he ate the day before yesterday and what he played today. These entries are valuable: to Zhang's parents, classmates and teachers, to other elementary school pupils, and to anyone interested in the diaries of elementary school pupils. For this body of information, the name "Zhang" is the "key" for retrieval.

Some units of information have only "browsing" value and cannot be reached by retrieval. Such a resource may be valuable, but its retrieval value is very low. For example, a map of the area near the Baidu building is valuable from a browsing perspective, but if there is no surrounding descriptive text (or link anchor text), only a bare map, it has no retrieval value. Of course, if image content recognition can one day automatically identify that this is "a map of the area near the Baidu building", or automatically extract the names of the buildings, streets, restaurants and so on shown on the map, then the picture will acquire retrieval value. So whether a page has retrieval value depends on two points:

1. Whether it can satisfy a specific need (value)

2. Whether the information can be reached through a regular search method (retrieval)

So is a page without retrieval value worthless to a search engine? On reflection, the answer is no. Indexing is only one link in the search engine, and for the other links, pages without retrieval value may help us better include the pages whose retrieval value is high. Take the spider, which is responsible for crawling Internet resources: some pages have no retrieval value themselves, but by crawling and analyzing them we can learn more quickly the important fact that this category of pages has no retrieval value, and so save traffic for more effective crawling.

Since this value can be counted as an "indirect" retrieval value that is ultimately based on index value, it is not discussed further in this article; we focus only on the fundamental problem of "retrieval value". The "page value" mentioned below refers specifically to the "retrieval value" of a page.

Two, why study page value

First, Internet pages are endless while a search engine's hardware resources are limited. To cover an endless Internet with limited resources, we need to judge page value: do not include pages without retrieval value, and include fewer pages with low retrieval value. This is the application of page value in inclusion control.

Second, the spider's crawling capacity is limited. For the sake of access friendliness, the crawl rate for a website or an IP needs an upper limit. Within this limit, crawling and page updates need an order, and the main reference for this ordering is page value, or a prediction of page value (when the page has not yet been crawled). This is the application of page value in spider scheduling.
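The scheduling idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `schedule_crawl`, the `per_host_limit` parameter and the idea of a single crawl "round" are hypothetical, and the predicted values are taken as given rather than computed.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlparse


def schedule_crawl(candidates, per_host_limit):
    """Order URLs by predicted page value, capped per host.

    candidates: iterable of (url, predicted_value) pairs.
    per_host_limit: hypothetical cap on fetches per host in one round.
    """
    # heapq is a min-heap, so negate the value to pop the best page first.
    heap = [(-value, url) for url, value in candidates]
    heapq.heapify(heap)

    fetched_per_host = defaultdict(int)
    plan = []
    while heap:
        _, url = heapq.heappop(heap)
        host = urlparse(url).netloc
        # Skip the URL in this round once the host has used up its budget.
        if fetched_per_host[host] < per_host_limit:
            fetched_per_host[host] += 1
            plan.append(url)
    return plan
```

With a limit of two fetches per host, the two most valuable pages of a busy site are crawled first, and the remaining budget goes to other hosts even if their pages are predicted to be less valuable.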

Third, for some pages, a change in content takes their retrieval value from something to nothing; the typical cases are a page becoming a dead link or being hacked. For these pages, a good search engine will remove them from the index at the first opportunity, or filter them out at query time, to ensure that the results returned to the user are "good pages" with retrieval value. Other pages not only have high retrieval value but are also strongly "time-sensitive", and letting users retrieve these pages at the first opportunity greatly improves the search experience. For a search engine, faster inclusion and faster index updates mean more additional resource overhead, so deciding which pages to include and update within a short period needs to be guided by the analysis of page value. These two aspects are the application of page value to two search engine index metrics: dead link rate and timeliness.

Finally, page value in the general sense also informs the order of the results a search engine returns to users. Ideally, results are sorted by relevance to the query, and when relevance is roughly the same, users are more inclined to view pages whose general page value is high. This is the application of page value in ranking.

It can be said that judging retrieval value is relatively basic work for a search engine: how accurately page value is understood and judged directly affects major metrics such as coverage, dead link rate and timeliness.

Three, how to judge page value

Earlier we mentioned the example of the pupil Zhang's Qzone diary. We think this page is valuable to Zhang's classmates, friends and family. Similarly, Baidu CEO Robin Li posted an i-post of a little over ten words on the i-bar, which is also valuable, in this case to Li's tens of millions of fans. Although Li's i-post may be far shorter than Zhang's diary, we share a common understanding that, in general terms, the value of Li's i-post is much greater than that of Zhang's diary. (Of course, for Zhang's mother the value relationship may well be the opposite.)

As another example, suppose someone searches for a person's mobile phone number and the search engine returns one result: a reply by this person in a forum. Although few people care about this phone number, the resource is absolutely scarce, so for queries about this phone number the page is completely irreplaceable and therefore has very high value.

In addition, a page's retrieval value is affected by its quality. Similar pages often differ greatly in how well they meet users' needs, for example in resource download speed, page layout and the amount of advertising. Let's call this kind of difference page quality.

Finally, some pages clearly have the nature of a public topic: these resources tend to attract a great deal of attention when they first appear, and their heat drops significantly over time, which is characteristic of "news". Typical examples are the various "-gate" scandals and large-scale disasters such as earthquakes and fires. We consider such resources to be "time-sensitive".

Therefore, the retrieval value of a page is roughly affected by the following four elements:

The size of the interested audience

The degree of scarcity of the page (how replaceable it is)

The quality of the page

The timeliness characteristics of the page

These four elements are referred to as audience, scarcity, quality and timeliness.
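One way to picture how the four elements might be combined is a simple weighted score. The article does not give a formula, so everything below is an assumption made for illustration: the `PageSignals` type, the requirement that each signal is already normalized to the range 0 to 1, and the weights themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageSignals:
    """Four signals, each assumed to be normalized to [0, 1]."""
    audience: float
    scarcity: float
    quality: float
    timeliness: float


# Hypothetical weights, chosen only for illustration.
DEFAULT_WEIGHTS = PageSignals(audience=0.4, scarcity=0.3, quality=0.2, timeliness=0.1)


def retrieval_value(signals, weights=DEFAULT_WEIGHTS):
    """Weighted average of the four signals; the result stays in [0, 1]."""
    total_weight = (weights.audience + weights.scarcity
                    + weights.quality + weights.timeliness)
    weighted = (signals.audience * weights.audience
                + signals.scarcity * weights.scarcity
                + signals.quality * weights.quality
                + signals.timeliness * weights.timeliness)
    return weighted / total_weight
```

Under these weights, a page with a tiny audience but absolute scarcity (the phone number example) can still outscore a page with the same audience that is easily replaced, which matches the intuition in the text. A real system would more plausibly learn the combination from data than fix it by hand.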

1. Audience

The size of the audience represents the size of users' retrieval demand. It is judged mainly from the audience and the content of the information source. Specific factors include, but are not limited to:

Size of the site's loyal user group

Generally speaking, well-known websites with their own loyal user groups succeed because their content and services attract and satisfy users better than others do. From this perspective, we can infer that content on a site with more loyal users has a larger existing and potential audience than content on a site with fewer loyal users. The size of the loyal user group can thus become a measure of the retrieval value of the resources within a site. The advantage of the loyal user group as a signal is that it is variable: if a website gets worse, users will vote with their feet. Compared with links, it is also less exposed to expired links and to cheating, because faking a user group is difficult. What is commonly called website popularity is closely related to the size of the loyal user group.

Distribution of resources within a site

Audience size is also reflected in how resources are distributed within a website. Take the promoted content on Sina News pages. Why did Sina's editors promote that content? Because they think it is what users will find most interesting. From the point of view of index value, this is equivalent to having a large editorial team that has already labeled the content as "in line with public taste"; the search engine only needs to enjoy the fruits of their work. In this way, the link depth of a resource relative to certain structurally key pages (home page, channel pages, etc.) can also serve as a measure of the resource's audience size.
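Link depth relative to key pages can be computed with a breadth-first search over the site's link graph. The sketch below assumes the graph is already available as a dictionary from each page to the pages it links to; the function name `link_depths` and the page names in the usage note are hypothetical.

```python
from collections import deque


def link_depths(links, key_pages):
    """Return the minimum number of clicks from any key page to each page.

    links: dict mapping a page to the list of pages it links to.
    key_pages: structurally key pages (home page, channel pages, ...).
    """
    depths = {page: 0 for page in key_pages}
    queue = deque(depths)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # The first visit in a breadth-first search is the shortest path.
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths
```

For a graph such as `{"home": ["channel", "promo"], "channel": ["story"]}`, the promoted page sits at depth 1 and the story at depth 2. Pages that cannot be reached from any key page do not appear in the result, which is itself a hint that few users will find them by browsing.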

Access popularity

Audience size can also be considered from the perspective of access popularity. This is the most direct measure, but it requires third-party tools to obtain the critical data. In this way we can learn not only which pages need to be included, but also the patterns in which users access a website.

Hyperlinks

Hyperlinks also reflect the size of the audience. The higher the quality of a resource, the more exposure it gets, and the more often it earns normal links.

Content features

A: I write a blog post: "Rumors about Guo De's program at the Spring Festival Gala."

B: I write a blog post: "I had breakfast today."

From the same source, the former is bound to have a larger audience than the latter. That is, when the publishing source is the same, content of a public nature scores higher.

2. Scarcity

Scarcity mainly describes the uniqueness of a page on the Internet. Scarcity is often thought of in terms of duplication, but does scarcity simply mean no duplication? How should we interpret this concept? Consider an example:

Someone publishes an original blog post about a news event, and Sina then reprints it on its news channel. In terms of what is described, this is a duplicate. But it duplicates only the main content. On the one hand, the reprint brings gains in access speed, stability and other respects, and search users may also use "news event + Sina" to retrieve the news. This can be called site gain. On the other hand, the reprint may change the page title, and thanks to Sina's audience the reprinted page may attract more valuable comments and replies, and may link to news about other related events. These can be called content gain. So even if the main content does not change at all, Sina's reprint is valuable and its degree of scarcity is high.

Conversely, if the reprinting site is obscure, it brings no gain in site name, stability or speed. Worse, if the reprinted page is stuffed with ads that get in the way of reading, or reproduces only an incomplete part of the content, then such a reprint (or scrape) is pure duplication, and the scraped copy has no retrieval value.

To sum up, for pages whose main content is duplicated, we should evaluate whether there is site gain and content gain; only the large number of duplicate pages with no gain at all should be considered low in scarcity.
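The rule in the summary above can be sketched as follows. This is a deliberately simplified illustration: it assumes duplicates are found by an exact hash of the normalized main content (so partial or lightly edited copies would be missed), that the earliest page seen is the original, and that the `site_gain` and `content_gain` flags have already been decided elsewhere. The `Page` type and all field names are hypothetical.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    main_content: str
    first_seen: int             # hypothetical crawl timestamp
    site_gain: bool = False     # e.g. better speed, stability, site name
    content_gain: bool = False  # e.g. valuable comments, related links


def fingerprint(text):
    """Hash the main content after collapsing case and whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def low_scarcity_pages(pages):
    """Return URLs of duplicates that add neither site gain nor content gain."""
    groups = defaultdict(list)
    for page in pages:
        groups[fingerprint(page.main_content)].append(page)

    flagged = []
    for group in groups.values():
        # Assumption: the earliest page seen is treated as the original.
        original = min(group, key=lambda p: p.first_seen)
        for page in group:
            if page is not original and not (page.site_gain or page.content_gain):
                flagged.append(page.url)
    return sorted(flagged)
```

Given an original blog post, a Sina reprint marked with both gains, and a bare copy on an obscure site, only the bare copy is flagged as low in scarcity.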

3. Quality

The quality of a page is a manifestation of how well it satisfies the need. Page quality should be judged progressively, starting from the most basic needs.

First, the page must not be a dead link, the site must be reasonably stable, and access speed must be satisfactory.

Second, the main content should be complete, the layout and fonts easy to read, and advertising of all kinds not excessive.

Finally, whether the information is rich and whether extended secondary needs are met.
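The progressive judgment described above can be sketched as a tiered check that stops at the first level a page fails. The `PageCheck` fields and the thresholds `max_load_seconds` and `max_ad_ratio` are hypothetical values chosen for illustration; the article does not specify any.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageCheck:
    reachable: bool          # not a dead link
    site_stable: bool
    load_seconds: float
    content_complete: bool
    readable_layout: bool
    ad_ratio: float          # share of the page taken up by ads, 0 to 1
    has_related_info: bool   # comments, related recommendations, ...


def quality_tier(page, max_load_seconds=3.0, max_ad_ratio=0.3):
    """Return how many of the three quality levels the page passes (0 to 3)."""
    # Level 1: basic availability.
    if not (page.reachable and page.site_stable
            and page.load_seconds <= max_load_seconds):
        return 0
    # Level 2: the main content is complete and readable.
    if not (page.content_complete and page.readable_layout
            and page.ad_ratio <= max_ad_ratio):
        return 1
    # Level 3: rich information that serves secondary needs.
    if not page.has_related_info:
        return 2
    return 3
```

Checking in this order means a dead link is never credited for its layout, and an ad-heavy page is never credited for its related recommendations.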

Typical low-quality pages have some of the following characteristics:

Main need invalid or unmet (expired classified ads or software download pages, invalid download links, etc.)

Dead link

False information or fraud

Empty page

Site instability

Permission issues that affect the main need (downloading or browsing requires member registration, points, etc.)

Incomplete information (incomplete reprints)

Poor browsing experience (advertising, fonts, page layout, etc.)

Typical high-quality pages have the following characteristics:

Fast access (fast page loading, fast resource download)

The page is clean and tidy and the main content is in a prominent position.

The page information is complete.

Rich page elements (text, pictures, comments, related recommendations, etc.)

4. Timeliness

"Timeliness" is an attribute of page value, generally reflected in two aspects. First, the thing the page describes is a strong public topic and spreads easily; this is in fact a manifestation of audience. Second, the thing the page describes draws high heat only at first, and the heat drops significantly over time; this is its "news" nature. For a page with both attributes, we consider the page time-sensitive if the search engine spider discovers it while the thing is in its "burst" period.

It should be explained that "timeliness" in the search engine's broad sense means making all valuable new resources searchable in a timely manner. For a large part of those valuable new resources, however, faster inclusion does little to improve the user's search experience, for example a knowledge article on how to lose weight, or Zhang's diary. The "timeliness" of page value refers to burst timeliness, that is, to the pages which, among all valuable pages, most need to be included promptly. Judging page timeliness guides us to put the search engine's limited resources where they matter most, for the best cost-effectiveness.

The timeliness value of a page is judged mainly through the following channels:

Whether the page itself shows a sudden increase within a short time, such as a burst of hyperlinks. Jia's post is a classic example.

Whether Internet pages describing the same thing increase suddenly within a short time. The Jia event set off a large number of related discussions and reports within a short period, and all content related to the event has the timeliness attribute.

Inferring the timeliness value of a set of pages from whether pages in the set show the two characteristics above. For example, the World of Warcraft forum often produces popular posts and public topics, so we infer that the timeliness "potential value" of posts from it is relatively high.
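The channels above can be sketched as a simple burst test on a daily count series, for instance new inbound links to one page or new pages about one event, plus a set-level rate. The function names and the parameters `window`, `factor` and `min_count` are hypothetical, and the threshold rule is only one plausible choice, not a method stated in the article.

```python
from statistics import mean


def is_bursting(daily_counts, window=7, factor=3.0, min_count=10):
    """True if the latest day far exceeds the recent daily average.

    daily_counts: counts per day, oldest first (e.g. new inbound links).
    """
    if len(daily_counts) < 2:
        return False
    latest = daily_counts[-1]
    # Baseline: up to `window` days immediately before the latest day.
    history = daily_counts[-(window + 1):-1]
    baseline = mean(history)
    # max(..., 1.0) avoids flagging tiny absolute changes on quiet pages.
    return latest >= min_count and latest > factor * max(baseline, 1.0)


def burst_potential(series_by_page, **kwargs):
    """Share of pages in a set (e.g. one forum) that are currently bursting."""
    if not series_by_page:
        return 0.0
    hits = sum(is_bursting(series, **kwargs) for series in series_by_page.values())
    return hits / len(series_by_page)
```

A forum whose pages burst often would get a high `burst_potential`, which could be used to raise the crawl priority of its new posts before any individual post has accumulated links.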

Four, the focus of page value research

This article has introduced the meaning of page value, the significance of studying it and the methods for judging it. Finally, let us look at where this research is heading from a technical point of view. Research on page value focuses on three aspects:

1. Understanding of the page value system. We currently understand page value through the four dimensions above. Whether this understanding is comprehensive, and how these dimensions should be extended and changed for the changing Internet environment and user needs so as to better improve the overall search experience, is a very important question.

2. Extraction of page features that reflect page value. Mining more page features and extracting them more accurately and reasonably is the basis for improving the accuracy of page value judgments.

3. Strategies for combining page features (machine learning). For different application directions, the corresponding features need to be used to fit the final page value assessment through a reasonable and efficient strategy.
