Search engine web page ranking algorithms

Source: Internet
Author: User
Tags: idf

2.1 First-generation search engines: ranking based on word frequency statistics and word position weighting

Ranking pages by the frequency and position of keywords in a document is the core idea of the earliest search engines. It is also the most mature of the ranking technologies, constituted the first generation of search engine ranking, was applied very widely, and is still a core ranking technique in many search engines. The basic principle is: the higher the frequency of a keyword in a document, and the more prominent the positions in which it appears, the more relevant the document is considered to be to the search term.

1) Word Frequency statistics

The word frequency of a document refers to how often a query keyword appears in that document: the higher the frequency of the query keyword in the document, the greater the relevance. But when the keyword is a common word, it contributes very little to the relevance judgment. TF/IDF solves this problem well, and the TF/IDF scheme is considered one of the most important inventions in information retrieval. TF (term frequency) is the single-document term frequency: the number of occurrences of the keyword divided by the total word count of the page; the quotient is called the "keyword frequency." IDF (inverse document frequency) rests on the principle that if a keyword appears in n pages, then the larger n is, the smaller the weight of that keyword, and vice versa. When a keyword is a common word, its weight becomes very small, which remedies the defect of pure word frequency statistics.
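To make the TF/IDF description above concrete, here is a minimal Python sketch under the definitions just given (TF as occurrences divided by page word count, IDF as the logarithm of total pages over pages containing the term). The toy corpus, the exact IDF formula, and the function names are illustrative assumptions, not part of the original article.

```python
import math

def tf(term, document):
    """Term frequency: occurrences of the term divided by the page's total word count."""
    words = document.lower().split()
    return words.count(term.lower()) / len(words) if words else 0.0

def idf(term, corpus):
    """Inverse document frequency: the more pages contain the term, the smaller its weight."""
    df = sum(1 for doc in corpus if term.lower() in doc.lower().split())
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, document, corpus):
    """High when the term is frequent on this page but rare across the collection."""
    return tf(term, document) * idf(term, corpus)

# Toy corpus, for illustration only.
corpus = [
    "rice blast is a fungal disease of rice",
    "rice is a staple food in many countries",
    "the weather today is sunny and warm",
]
for term in ("rice", "blast"):
    print(term, [round(tf_idf(term, doc, corpus), 3) for doc in corpus])
```

Note how the rarer term "blast" ends up with a higher score on the page that contains it than the common term "rice", even though "rice" occurs more often.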

2) Word position weighting

In a search engine, word position weighting is applied mainly to web pages, so analysis of page layout information is critical. By assigning different weights to a search keyword according to where it appears on the page and how it is laid out, the engine can judge the value of a result and its relevance to the keyword query. Layout information worth considering includes: whether the keyword appears in the title, in the meta keywords, or in the body text, its font size, whether it is bold, and so on. Anchor text is also very important, since it usually gives an accurate description of the content of the page it points to.
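Word position weighting can be sketched as a simple weighted sum over the layout fields just listed. The field names and weight values below are purely illustrative assumptions; a real engine would tune them empirically.

```python
# Illustrative field weights; the exact values are assumptions, not from the article.
FIELD_WEIGHTS = {
    "title": 5.0,     # keyword appears in the page title
    "anchor": 4.0,    # keyword appears in anchor text of links pointing to the page
    "keywords": 3.0,  # keyword appears in the meta keywords
    "bold": 2.0,      # keyword appears in bold text
    "body": 1.0,      # keyword appears in the ordinary body text
}

def position_score(keyword, page_fields):
    """Sum the weights of every layout field in which the keyword appears.

    page_fields maps a field name ("title", "body", ...) to that field's text.
    """
    keyword = keyword.lower()
    score = 0.0
    for field, text in page_fields.items():
        if keyword in text.lower().split():
            score += FIELD_WEIGHTS.get(field, 0.0)
    return score

page = {
    "title": "Rice blast control",
    "body": "Rice blast is a fungal disease of rice plants",
    "anchor": "guide to rice diseases",
}
print(position_score("blast", page))  # title + body -> 6.0
```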

2.2 Second-generation search engines: ranking based on link analysis

The idea of link-analysis ranking comes from the citation-index mechanism: the more often a paper is cited, or the more authoritative the papers citing it, the more valuable the paper is. Link-analysis ranking applies the same idea: the more often a web page is referenced by other pages, or the more authoritative the referencing pages, the greater its value. The more a page is referenced, the more popular and authoritative it is, and the higher its quality. Link-analysis ranking algorithms can be broadly divided into the following categories: those based on a random-walk model, such as PageRank and the Reputation algorithm; those based on a probabilistic model, such as SALSA and PHITS; those based on mutually reinforcing hubs and authorities, such as HITS and its variants; and those based on a Bayesian model, such as the Bayesian algorithm and its simplified version. In practice, all of these algorithms are combined with traditional content-analysis techniques. This article mainly introduces the following classic ranking algorithms:

1) PageRank algorithm

The PageRank algorithm was proposed by Sergey Brin and Lawrence Page, then PhD students at Stanford University. PageRank is the core ranking algorithm of Google, one of the most successful search engines in the world, and it started the wave of research on link analysis.

The basic idea of PageRank is that the importance of a page is measured by its PageRank value, which reflects two things: the number of pages that reference the page and the importance of those referencing pages. When a page A is referenced by another page B, this can be viewed as B recommending A: B distributes its own importance (its PageRank value) evenly among all the pages it references. So the more pages reference A, the more PageRank value is passed to A, the higher A's PageRank value, and the more important A is. In addition, the more important B is, the more PageRank value the pages it references receive, so again the higher A's PageRank value and the more important A is.

The formula is:

PR(A) = (1 - d) + d × ( PR(P1)/C(P1) + PR(P2)/C(P2) + … + PR(Pn)/C(Pn) )

where P1 … Pn are the pages that link to A, and:

PR(A): the PageRank value of page A;

d: the damping factor. Some pages have no incoming or outgoing links, so their PageRank cannot be computed; the damping factor was introduced to avoid this problem (the "link sink" problem). It is usually set to 0.85;

PR(Pi): the PageRank value of page Pi;

C(Pi): the number of outgoing links on page Pi.

All pages start the computation with the same initial PageRank value. So as not to ignore the important fact that pages linked to by important pages are themselves important, the computation must be iterated; according to Zhang Yinghai's results, the link evaluation values stabilize after more than ten iterations, that is, the system's PR values converge after several iterations.
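The iterative computation can be sketched as a simple fixed-point iteration in Python. The toy link graph, iteration limit, and convergence tolerance are assumptions for illustration; the update itself follows the PR(A) = (1 - d) + d × Σ PR(Pi)/C(Pi) form given above.

```python
def pagerank(links, d=0.85, max_iter=50, tol=1.0e-6):
    """Iterate PR(A) = (1 - d) + d * sum(PR(Pi) / C(Pi)) over the pages Pi linking to A.

    links maps each page to the list of pages it links to (a toy adjacency list).
    Pages with no outgoing links (the "link sink" problem) are not treated specially here.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}                  # same initial value for every page
    out_count = {p: len(links[p]) for p in pages}

    for _ in range(max_iter):
        new_pr = {}
        for page in pages:
            incoming = sum(pr[q] / out_count[q] for q in pages if page in links[q])
            new_pr[page] = (1 - d) + d * incoming
        # Stop once the values have stabilised (typically after ten or so iterations).
        if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:
            return new_pr
        pr = new_pr
    return pr

# Toy link graph, an assumption for illustration only.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))
```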

PageRank has the advantage of being a static, query-independent algorithm, so the PageRank values of all web pages can be computed offline. This reduces the ranking work needed at retrieval time and greatly reduces query response time. However, PageRank has two defects. First, it severely penalizes newly added pages, because a new page usually has very few incoming and outgoing links and therefore a very low PageRank value. Second, PageRank ranks pages only by the number and importance of external links and ignores topical relevance, so pages unrelated to the topic (such as advertising pages) can obtain a large PageRank value and degrade the precision of the search results. For this reason, a variety of topic-related algorithms have appeared, of which the following are the most typical.

2) Topic-Sensitive PageRank algorithm

Because the original PageRank algorithm did not consider topic-related factors, Taher Haveliwala of the Stanford University Computer Science Department proposed the Topic-Sensitive PageRank algorithm to address the "topic drift" problem. The algorithm takes into account that a page considered important in one field is not necessarily important in other fields.

When page A links to page B, this can be viewed as page A scoring page B. If page A and page B are on the same topic, the score can be considered more reliable: A and B can be viewed as peers, peers usually understand each other better than non-peers do, and so a peer's score is usually more reliable than a non-peer's. Unfortunately, TSPR does not use topic relevance to improve the accuracy of the link score itself.
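A minimal sketch of the topic-sensitive idea follows: instead of a single global PageRank vector, one vector is computed per topic, with the random-jump (teleportation) probability concentrated on a set of seed pages for that topic. The seed set, toy graph, and parameter values are assumptions for illustration only.

```python
def topic_sensitive_pagerank(links, topic_pages, d=0.85, iters=50):
    """PageRank whose random jump lands only on pages belonging to one topic.

    links maps each page to the pages it links to; topic_pages is the set of
    seed pages considered on-topic (an illustrative assumption).
    """
    pages = list(links)
    # Teleportation vector: jump only to on-topic pages instead of uniformly.
    v = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in pages}
    pr = {p: 1.0 / len(pages) for p in pages}

    for _ in range(iters):
        pr = {
            p: (1 - d) * v[p]
               + d * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            for p in pages
        }
    return pr

# One ranking vector per topic; at query time the vector of the query's topic is used.
graph = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}
print(topic_sensitive_pagerank(graph, topic_pages={"A"}))
```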

3) Hilltop algorithm

Hilltop is a 2001 patent by Google engineer Bharat. Hilltop is a query-dependent link-analysis algorithm that overcomes the query-independence shortcoming of PageRank. The Hilltop algorithm holds that links coming from related documents on the same topic are of greater value to the searcher. Hilltop considers only those expert pages (expert sources) whose purpose is to guide people to resources. When a query is received, Hilltop first computes a list of the expert pages most relevant to the query topic, and then ranks the target pages according to the number and relevance of the non-affiliated expert pages that point to them.
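A rough sketch of this two-phase flow, under assumed data structures, is shown below: first select expert pages relevant to the query, then score target pages by how many non-affiliated experts (experts on different hosts) point to them. The scoring details are simplified assumptions; the patented algorithm is considerably more involved.

```python
def hilltop_rank(query_terms, experts, top_experts=10):
    """Simplified two-phase ranking in the spirit of Hilltop.

    experts is a list of dicts, each holding the expert page's 'host', its
    'text', and the 'targets' (URLs) it links to; these structures are
    assumptions made purely for illustration.
    """
    query_terms = {t.lower() for t in query_terms}

    # Phase 1: score expert pages by how many query terms they mention.
    def expert_score(expert):
        return len(query_terms & set(expert["text"].lower().split()))

    relevant = sorted((e for e in experts if expert_score(e) > 0),
                      key=expert_score, reverse=True)[:top_experts]

    # Phase 2: a target page scores by the number of non-affiliated experts
    # (distinct hosts) that point to it.
    target_hosts = {}
    for expert in relevant:
        for url in expert["targets"]:
            target_hosts.setdefault(url, set()).add(expert["host"])

    return sorted(((url, len(hosts)) for url, hosts in target_hosts.items()),
                  key=lambda item: item[1], reverse=True)
```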

Because Hilltop determines how well a page matches the search keywords, it replaces the approach of relying too heavily on PageRank to find authoritative pages, and it defeats many cheating techniques that try to raise a page's PageRank by adding large numbers of meaningless links. Hilltop keeps the evaluation relevant to the keywords by scoring at different levels, keeps it relevant to the topic (industry) by scoring keyword occurrences at different positions, and prevents keyword stuffing by counting only distinguishable phrases.

However, the discovery and identification of expert pages plays a key role in the algorithm, and the quality of the expert pages is decisive for its accuracy, while the influence of the vast majority of non-expert pages is ignored. Expert pages make up a very small fraction of the Internet (about 1.79%) and cannot represent all of its pages, so Hilltop has certain limitations. At the same time, unlike PageRank, the Hilltop algorithm runs online, which puts great pressure on the system's response time.

4) HITS algorithm

The HITS (Hyperlink-Induced Topic Search) algorithm was proposed by Kleinberg in 1998 and is one of the most famous hyperlink-analysis ranking algorithms. Based on the direction of hyperlinks, the algorithm divides web pages into two types: authority pages and hub pages. An authority page, also called an authoritative page, is a page that best matches a query keyword or combination of keywords; a hub page, also called a directory page, mainly contains a large number of links pointing to authority pages, and its main function is to gather these authority pages together. For an authority page P, the more hub pages point to P and the higher their quality, the higher the authority value of P; for a hub page H, the more authority pages H points to and the higher their quality, the higher the hub value of H. Across the whole web collection, authorities and hubs are interdependent and mutually reinforcing. This mutually reinforcing relationship between authorities and hubs is the basis of the HITS algorithm.

The basic idea of HITS is to measure the importance of a page through its in-degree (the hyperlinks pointing to the page) and its out-degree (the links from the page to other pages). After the scope of pages is fixed, a matrix is built from the pages' in-links and out-links, and the two vectors of authority values and hub values are updated iteratively through matrix operations until they converge within a defined threshold.
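The iterative hub/authority update can be sketched as follows. The toy graph, iteration count, and the L2 normalisation step are assumptions for illustration.

```python
import math

def hits(links, iters=50):
    """Iterative HITS: authority(p) sums the hub scores of pages linking to p,
    hub(p) sums the authority scores of the pages p links to, normalised each round.
    """
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iters):
        # Authority update: good authorities are pointed to by good hubs.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # Hub update: good hubs point to good authorities.
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        # Normalise so the scores converge instead of growing without bound.
        for vec in (auth, hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm
    return auth, hub

# Toy link graph restricted to the neighbourhood of a query's result set.
graph = {"A": ["B", "C"], "B": ["C"], "C": [], "D": ["B", "C"]}
authority, hub_scores = hits(graph)
print(sorted(authority.items(), key=lambda kv: kv[1], reverse=True))
```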

Experimental data show that the ranking accuracy of HITS is higher than that of PageRank, and the design of the HITS algorithm conforms to general standards for evaluating the quality of network resources, which helps users make better use of network information retrieval tools to access Internet resources.

However, HITS has the following defects: first, it computes only the principal eigenvector and does not handle the topic-drift problem well; second, topic generalization can occur when processing a narrowly focused query; third, HITS can be said to be an experimental approach: it must be computed, after the content-based retrieval step of the network information retrieval system, from the link relationships between the result pages of that retrieval and the pages directly connected to them. Although improvements such as purpose-built link-structure computation servers (connectivity servers) can achieve a certain degree of online real-time computation, the computational cost remains unacceptable.

2.3 Third-generation search engines: intelligent ranking

The ranking algorithm is especially important in a search engine, and many search engines are now studying new ranking methods to improve user satisfaction. At present, however, second-generation search engines have two shortcomings, described below; against this background, third-generation search engines based on intelligent ranking have emerged.

1) The relevance problem

Relevance refers to the degree to which a search term and a page are related. Because language is complicated, judging the relevance of search terms and pages only from links and the surface features of web pages is one-sided. For example, when searching for "rice blast", a page may describe rice pests and diseases without containing the phrase "rice blast", and the search engine simply cannot retrieve it. The same causes explain why a large amount of search engine spam cannot be eliminated. The way to solve the relevance problem is to add semantic understanding and analyze the relevance of the search keywords to the page: the more accurate the relevance analysis, the better the user's search results. At the same time, pages of low relevance can be eliminated, effectively preventing search engine cheating. Because the relevance of search keywords to web pages must be computed online, this puts great time pressure on the system; a distributed architecture can be adopted to improve the system's scale and performance.

2) The problem of uniform search results

In a search engine, anyone searching for the same word gets the same results, which does not meet the needs of all users: different users want different retrieval results. For example, an ordinary farmer searching for "rice blast" only wants information about the disease and how to control it, but an agricultural expert or technologist may want papers about rice blast.

The way to solve the uniformity of search results is to provide personalized service and achieve intelligent search. Through web data mining, a user model (covering, for example, the user's background, interests, behavior, and style) is built to provide personalized service.


Source: http://blog.csdn.net/a479898045/article/details/9749493
