A search engine processes hundreds of millions of queries every day. Each query represents a user's specific need for some resource. Most of the time, these needs are met by the web pages returned in the search results, so we can say that some pages in the results produce value for the specific needs of specific users. So what is the value of a page to a search engine, why should we study it, and how can we determine it technically?
I. What is page value?
As mentioned above, when a page meets a specific need of a user, that reflects the value of the page to the user. So what does this mean for a search engine? By a simple inference, every page that may generate value for users is valuable to the search engine: such pages can be indexed to meet the needs of the users who eventually retrieve them. We call this value search value (retrieval value). As long as a page can satisfy some user's information need, and users can reach it through normal retrieval, it has search value.
Zhang San, a primary school student, likes to keep a diary on Qzone, writing about what he ate the day before yesterday and what he played today. This content is valuable: to his parents, classmates, and teachers, to other primary school students, and to anyone interested in primary school diaries. For this body of information, the name "Zhang San" is the key for retrieval.
Some information units have only browsing value and no retrieval path that leads to them. Such a resource may be valuable, but its retrieval value is very low. For example, a map of the area around the Baidu Building is valuable from a browsing perspective. However, if there is no surrounding text description (or link anchor text) and only the bare map image, it has no search value. Of course, if image content recognition could one day automatically identify it as a "map near the Baidu Building", or automatically extract the names of the buildings, streets, restaurants, and so on in the map, this example would gain retrieval value. Therefore, whether a page has retrieval value depends on two points:
- Whether it can satisfy a specific need (value)
- Whether the information can be reached through a normal search method (search)
So do pages without retrieval value have no value at all for a search engine? On reflection, the answer is no. Indexing is only one part of a search engine. In other respects, pages with no search value may help us better include the pages with high search value. For example, for the spider responsible for crawling Internet resources, some pages have no retrieval value, but crawling and analyzing them helps us quickly learn the key characteristics of this type of valueless page, which saves traffic for more effective crawling.
This kind of value can be regarded as an "indirect" search value that still rests on index value, so it is not discussed further here; this article focuses only on the fundamental problem of retrieval value. The "page value" mentioned below refers to the retrieval value of the page.
II. Why study page value?
First, the pages on the Internet are endless, while the hardware resources of a search engine are limited. To cover an endless Internet with limited resources, we need to judge the value of each page: pages with no search value are not included, and pages with low search value are included sparingly. This is an application of page value in indexing control.
Second, the crawling capability of a search engine spider is limited. For the sake of access friendliness, there must be an upper limit on the crawl rate for any one website or IP address. Under this restriction, crawling new pages and refreshing known ones must happen in some order, and the main reference for this ordering is the page value, or the predicted page value when the page has not yet been crawled. This is an application of page value in spider scheduling.
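The scheduling idea above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `schedule_crawl`, the `per_host_limit` parameter, and the example URLs and values are hypothetical, and a real spider would also account for time windows and IP addresses.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlparse


def schedule_crawl(candidates, per_host_limit):
    """Pick URLs in descending predicted value, capped per host.

    candidates: iterable of (url, predicted_value) pairs.
    per_host_limit: hypothetical politeness cap for one scheduling round.
    """
    # heapq is a min-heap, so negate the value to pop the highest first.
    heap = [(-value, url) for url, value in candidates]
    heapq.heapify(heap)

    fetched_per_host = defaultdict(int)
    plan = []
    while heap:
        _neg_value, url = heapq.heappop(heap)
        host = urlparse(url).netloc
        if fetched_per_host[host] >= per_host_limit:
            continue  # cap reached: this URL waits for a later round
        fetched_per_host[host] += 1
        plan.append(url)
    return plan


candidates = [
    ("http://news.example.com/a", 0.9),
    ("http://news.example.com/b", 0.8),
    ("http://news.example.com/c", 0.7),
    ("http://blog.example.org/x", 0.5),
]
# With a cap of 2 per host, page "c" is deferred and the blog page is fetched.
print(schedule_crawl(candidates, per_host_limit=2))
```

The cap forces a choice among pages of the same site, and the predicted value decides which pages win that choice.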
Third, for some pages the content changes so that their retrieval value goes from present to absent. Typical changes are becoming a dead link or being hacked. A good search engine removes such pages from the index as soon as possible, or blocks them at retrieval time, to ensure that the results returned to users are good pages with higher search value. Other pages not only have high search value but are also highly time-sensitive; letting users retrieve them quickly greatly improves the search experience. For a search engine, the faster pages are indexed and the shorter the index update cycle, the greater the additional resource overhead, so choosing which pages get fast indexing and short update cycles requires an analysis of page value. These two aspects are applications of page value in improving two search engine metrics: dead link rate and timeliness.
Finally, page value in the general sense also helps guide the ranking of results returned to users. Ideally, search results are sorted by relevance to the query. When relevance is roughly the same, users tend to prefer pages with high general page value. This is an application of page value in ranking.
It can be said that research on page retrieval value is a foundational task in search engines. How well and how accurately page value is understood directly affects coverage, dead link rate, timeliness, and other major search engine metrics.
III. How to determine page value
Earlier we gave the example of primary school student Zhang San's Qzone diary. We believe this page is valuable: it is valuable to Zhang San's classmates, friends, and family. Similarly, when Li Yanhong posts a message of a dozen or so characters on i Tieba, that post is valuable to his tens of millions of fans. Although Li Yanhong's post may be far shorter than Zhang San's diary, we share a common understanding of the relative value of the two pages: in the general sense, Li Yanhong's post is much more valuable than Zhang San's diary. (Of course, for Zhang San's mother, the relationship is probably the reverse.)
Another example is searching for a person's mobile phone number. The search engine returns a single result: a reply that this person posted on a forum. Although few people care about this phone number, the page is completely irreplaceable because the resource is absolutely scarce.
In addition, the retrieval value of a page is affected by page quality. Pages with similar content often differ greatly in how well they meet user needs, for example in resource download speed, page layout, and the amount of advertising. This type of difference is called page quality.
Finally, some pages concern topics of obvious public interest. Such resources tend to attract very high attention when they first appear, and their popularity drops significantly as time passes, which gives them a news-like character. Typical examples include the various "-gate" scandals, as well as earthquakes, fires, and other major disasters. We say that such resources are time-sensitive.
Therefore, the retrieval value of a page is roughly affected by the following four factors:
- Size of the target audience
- Scarcity of the page (how replaceable it is)
- Quality of the page
- Timeliness features of the page
For short, we call these four factors audience, scarcity, quality, and timeliness.
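As a rough illustration of how the four factors could be folded into one number, here is a minimal Python sketch. It assumes each factor has already been normalised to the range 0 to 1, and it uses a plain weighted average. The weights, the `PageSignals` type, and the example scores are our own placeholders, not values given in this article.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    # Each signal is assumed to be normalised to [0, 1] beforehand.
    audience: float
    scarcity: float
    quality: float
    timeliness: float


# Illustrative weights only; the article does not specify any.
WEIGHTS = {"audience": 0.4, "scarcity": 0.2, "quality": 0.3, "timeliness": 0.1}


def retrieval_value(signals: PageSignals) -> float:
    """Weighted average of the four factors, in [0, 1]."""
    return sum(WEIGHTS[name] * getattr(signals, name) for name in WEIGHTS)


# Made-up scores for the two example pages discussed earlier.
celebrity_post = PageSignals(audience=0.9, scarcity=0.6, quality=0.7, timeliness=0.5)
pupil_diary = PageSignals(audience=0.1, scarcity=0.6, quality=0.7, timeliness=0.0)
print(retrieval_value(celebrity_post) > retrieval_value(pupil_diary))  # True
```

A weighted average is the simplest possible choice. It cannot express cases such as the phone number page, where scarcity alone makes a page irreplaceable, so a real system would likely need a richer combination.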
1. Audience
Audience size represents the size of the search demand. It is evaluated mainly from the audience of the publishing source and the audience of the content itself. Specific factors include, but are not limited to:
- The size of the website's loyal user base. Generally, a well-known website with its own loyal user base succeeds because its content and services attract and satisfy users better than those of other sites. From this we can infer that content on a website with more loyal users has a larger potential audience than content on a website with fewer loyal users. The size of the loyal user base can therefore serve as a measure of the retrieval value of resources within the site. The advantage of this signal is that it changes dynamically: if a website gets worse, users vote with their feet. Hyperlinks go stale and are easy to cheat with, whereas real users are hard to fake. Generally, a website's visibility is closely related to the size of its loyal user base.
- Resource distribution rules. We consider the audience size reflected by how resources are distributed within a website, for example the items on the Sina news homepage. Why do Sina's editors promote this content? Because they believe it is what users are most interested in. From the perspective of index value, a huge editorial team has already labeled this content as "suits the tastes of the masses", and the search engine only needs to enjoy the fruits of their work. In this way, the link depth of a resource relative to structural key pages (home page, channel pages, etc.) can also be used to measure its audience size.
- Access popularity. We also consider audience size from the perspective of access popularity. This is the most direct measure, although it requires third-party tools to obtain the key data. With this data we can learn not only which pages need to be indexed, but also the access patterns of users visiting a website.
- Hyperlinks. To some extent, hyperlinks also reflect audience size. The higher the quality of a resource and the larger its audience, the more natural inbound links it has.
- Content features. Suppose A writes a blog post: "It is rumored that Guo Degang is about to appear at the gala." and B writes a blog post: "I had breakfast today." The audience of the former is certainly larger than that of the latter. That is, when the publishing source is the same, content with public attributes scores higher.
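Of the factors above, link depth relative to structural key pages is easy to illustrate. The sketch below computes each page's click distance from the nearest key page with a breadth-first search. The function name `link_depths` and the toy site graph are hypothetical, and how depth should translate into an audience score is left open.

```python
from collections import deque


def link_depths(links, key_pages):
    """Click distance of every reachable page from the nearest key page.

    links: dict mapping a URL to the list of URLs it links to.
    key_pages: structural key pages (home page, channel pages) at depth 0.
    """
    depths = {page: 0 for page in key_pages}
    queue = deque(key_pages)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # in BFS the first visit is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Toy site: the headline is two clicks from the home page, the old story three.
site = {
    "/": ["/news", "/sports"],
    "/news": ["/news/headline", "/news/archive"],
    "/news/archive": ["/news/2009/old-story"],
}
depths = link_depths(site, ["/"])
print(depths["/news/headline"], depths["/news/2009/old-story"])  # 2 3
```

A shallower depth suggests that the site's editors placed the resource closer to where most visitors arrive, which is the editorial signal described above.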
2. Scarcity
Scarcity mainly describes how unique a page is on the Internet. When it comes to scarcity, we often think of repetition. Is scarcity equivalent to non-repetition? How should we interpret this concept? Let's look at an example:
Suppose someone publishes an original blog post about a news event, and Sina then reprints it on its news channel. By description, this is repetition. However, only the main content is repeated. On the one hand, the reprint brings gains in access speed, stability, and so on, and users may search for the news with a query such as "news event + Sina". This can be called site gain. On the other hand, the title may be changed during reprinting, and, thanks to Sina's audience, the reprint page may attract more valuable comments and replies or carry links to news about related events. These can be called content gain. Therefore, even though the main content is unchanged, Sina's reprint is still valuable, and its scarcity is still high.
Conversely, if the reprinting website is not well known, it cannot produce gains in site name, stability, or speed. Worse, some sites add a large number of advertisements that impede reading, or reproduce only part of the content. Such reprinting or scraping is pure repetition and, compared with the source, has no search value.
To sum up, for pages with repeated content we should evaluate whether there is site gain or content gain. Only when a page has a large number of duplicates and no gain at all should we consider its scarcity low.
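The rule above can be sketched as follows. Detecting repeated main content with word shingles and Jaccard similarity is our own substitution for illustration, not a method this article describes, and the thresholds (`threshold=0.8`, `many=10`) are arbitrary placeholders.

```python
def shingles(text, k=3):
    """Set of k-word shingles used to compare main content."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def is_repeat(text_a, text_b, threshold=0.8):
    """Treat two texts as repeats when their shingle sets mostly overlap."""
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b) >= threshold


def scarcity_is_low(copy_count, site_gain, content_gain, many=10):
    """Low scarcity only for a page with many copies and no gain at all."""
    return copy_count >= many and not (site_gain or content_gain)


original = "the quake struck at dawn and rescue teams reached the town by noon"
reprint = "The quake struck at dawn and rescue teams reached the town by noon"
print(is_repeat(original, reprint))  # True

# A well-known portal's reprint keeps its value; a bare scraped copy does not.
print(scarcity_is_low(50, site_gain=True, content_gain=True))    # False
print(scarcity_is_low(50, site_gain=False, content_gain=False))  # True
```

How site gain and content gain are detected in practice is the hard part and is not covered here; the flags are simply passed in.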
3. Quality
Page quality is the degree to which a page meets the user's needs. Judging page quality should proceed progressively, starting from the most basic requirements.
First of all, the page must not be a dead link, the website must be reasonably stable, and the access speed should be satisfactory.
Second, the main content should be complete, the layout and fonts should be easy to read, and there should not be too many advertisements.
Finally, the page should meet diversified and extended secondary needs.
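The progressive judgment can be pictured as a series of gates, where a page is only checked against the next level once it passes the previous one. The sketch below is an assumption-laden illustration: the `PageCheck` fields, the three-second load limit, and the 30% advertisement limit are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class PageCheck:
    reachable: bool          # False for a dead link or an unstable site
    load_seconds: float
    content_complete: bool
    ad_ratio: float          # share of the page area taken by advertisements
    has_extras: bool         # comments, related recommendations, and so on


def quality_tier(page, max_load_seconds=3.0, max_ad_ratio=0.3):
    """Return 0 to 3: how many requirement levels the page passes in order."""
    if not page.reachable or page.load_seconds > max_load_seconds:
        return 0  # fails the most basic requirements
    if not page.content_complete or page.ad_ratio > max_ad_ratio:
        return 1  # reachable, but the main need is poorly served
    if not page.has_extras:
        return 2  # main need met, no extended needs
    return 3


print(quality_tier(PageCheck(True, 1.0, True, 0.1, True)))    # 3
print(quality_tier(PageCheck(True, 1.0, False, 0.1, True)))   # 1
print(quality_tier(PageCheck(False, 0.5, True, 0.0, True)))   # 0
```

Ordering the checks this way means a dead link is never credited for rich page elements it can no longer deliver.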
Typical low-quality pages have the following features:
- Main need expired or unfulfilled (expired classified ads, outdated software download pages, invalid download links, etc.)
- Dead link
- False information/fraud
- Blank page
- Site instability
- Permission barriers that block the main need (membership registration or points required for downloading or browsing)
- Incomplete information (content only partially reproduced)
- Poor browsing experience (advertisements, fonts, page layout, etc.)
Typical high-quality pages have the following features:
- Fast access (fast page loading, fast resource download)
- The page is clean and tidy, and the subject content is in a prominent position.
- Complete page information.
- Rich page elements (text, images, comments, and related recommendations)
4. Timeliness
"Timeliness" is a property of page value. It generally shows in two aspects. First, the thing described on the page is of strong public interest and spreads easily; this is really a manifestation of audience. Second, the thing described is highly popular only at the beginning, and its popularity drops significantly over time; this is its news-like character. For a page with both attributes, if the search engine spider finds it while the event is in its outbreak period, we consider the page time-sensitive.
It should be noted that, in the broad sense, the "timeliness" of a search engine means making all valuable new resources searchable in a timely manner. For most valuable new resources, however, faster indexing does little for the user's search experience; examples include knowledge articles on how to lose weight, or Zhang San's diary. The "timeliness" in page value refers to burst timeliness, that is, the pages that most need timely indexing among all valuable pages. The purpose of judging page timeliness is to guide us in investing a search engine's limited resources where they matter most, to get the best return for the cost.
The following methods are used to determine the timeliness value of a page:
- Whether the page itself shows a sudden increase in audience within a short time, for example through hotlinking. Jia Junpeng's post is a typical example.
- Whether pages across the Internet about the same event show a burst over time. The Jia Junpeng incident set off a large number of discussions and reports within a short time, and all content related to the incident carried the timeliness attribute.
- The timeliness value of a set of pages, estimated from whether the pages in the set show the two features above. For example, the World of Warcraft forum often produces popular posts and public topics, so we infer that posts from it have relatively high "potential" timeliness value.
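A sudden increase in audience can be detected, in the simplest case, by comparing recent activity with an earlier baseline. The following Python sketch assumes hourly visit (or mention) counts are available. The window size, the multiplication factor, and the minimum rate are arbitrary illustrative values, and the set-level estimate is simply the share of pages currently bursting.

```python
def is_bursting(hourly_counts, recent_hours=3, factor=5.0, min_recent=10):
    """Flag a burst when recent activity far exceeds the earlier baseline."""
    if len(hourly_counts) <= recent_hours:
        return False  # not enough history to form a baseline
    baseline = hourly_counts[:-recent_hours]
    recent = hourly_counts[-recent_hours:]
    baseline_rate = sum(baseline) / len(baseline)
    recent_rate = sum(recent) / len(recent)
    # max(..., 1.0) keeps a near-zero baseline from flagging trivial activity.
    return recent_rate >= min_recent and recent_rate >= factor * max(baseline_rate, 1.0)


def set_timeliness(pages_counts):
    """Share of pages in a set (for example, one forum) that are bursting."""
    flags = [is_bursting(counts) for counts in pages_counts]
    return sum(flags) / len(flags) if flags else 0.0


quiet_page = [2, 3, 1, 2, 2, 3, 2, 1]
viral_page = [2, 3, 1, 2, 2, 400, 900, 1500]
print(is_bursting(quiet_page), is_bursting(viral_page))  # False True
print(set_timeliness([quiet_page, viral_page]))          # 0.5
```

The same function can be applied to counts for one page or to counts aggregated over all pages about one event, which corresponds to the first two methods in the list.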
IV. Research focus on page value
The preceding sections introduced the meaning of page value, the significance of studying it, and the methods of judging it. Finally, let's look at the key research areas in this direction from a technical point of view. Research on page value focuses mainly on three aspects:
- Understanding of the page value system. Our current understanding of page value comes from the four dimensions described above. Is this understanding comprehensive? As the Internet environment and user needs keep changing, how should these dimensions be expanded and revised to better serve the overall search experience?
- Extraction of page features that reflect page value. Mining more page features is difficult, and more accurate and reasonable feature extraction is the basis for judging page value more accurately.
- Combination of page features (machine learning). For different application directions, select appropriate features and fit the final page value evaluation with a reasonable and efficient strategy.
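To make the last point concrete, here is a minimal sketch of fitting feature weights from labeled examples. Logistic regression trained by gradient descent is our own choice for illustration; the article does not name a model. The feature order (audience, scarcity, quality, timeliness) and the training data are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, epochs=2000, lr=0.5):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


def predict(weights, bias, x):
    """Estimated probability that a page is worth indexing."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Made-up feature vectors: [audience, scarcity, quality, timeliness].
samples = [
    [0.9, 0.8, 0.9, 0.7], [0.8, 0.9, 0.6, 0.2], [0.7, 0.5, 0.8, 0.9],
    [0.2, 0.1, 0.3, 0.0], [0.1, 0.3, 0.2, 0.1], [0.3, 0.2, 0.1, 0.0],
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = judged valuable by a human rater

weights, bias = train_logistic(samples, labels)
print(predict(weights, bias, samples[0]) > predict(weights, bias, samples[3]))  # True
```

Different applications (indexing control, spider scheduling, ranking) would use different labels and feature sets, which is why the fitting step is kept separate from feature extraction.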
This article is available at http://www.nowamagic.net/librarys/veda/detail/1485.