Abstract: This paper briefly introduces and compares the ranking algorithms currently used by search engines, chiefly the word-frequency/position weighted ranking algorithms and the link analysis ranking algorithms, with emphasis on the ideas behind the PageRank and HITS algorithms and a comparison of their respective advantages and disadvantages.
Keywords: search engine; ranking; PageRank; HITS
1 Objective
The rise of Google and Baidu is largely due to the fact that they used more advanced ranking techniques than earlier search engines. Because people tend to look only at the top 10 or 20 search results, it is important to place the information most relevant to the user's query at the front of the result list. For example, pages under the .jp, .de, and .edu domains are often more useful than pages under the .com and .net domains [1]. How to place the pages the user cares about at the front of the search results is therefore the direction in which search engine companies continually optimize. Drawing on published papers and web materials, the author summarizes and introduces the most important ranking algorithms: the word-frequency/position weighted ranking algorithm and the link analysis ranking algorithm.
2 Word-Frequency/Position Weighted Ranking Algorithm
This kind of technique developed out of traditional information retrieval technology: the more frequently a user's search term appears in a web page, and the more prominent the positions in which it appears, the more relevant the page is judged to be to that search term, and the higher it is placed in the results. Early search engines such as InfoSeek, Excite, and Lycos used this kind of ranking method.
2.1 Word Frequency Weighting
Word frequency weighting determines a page's relevance weight from the number of times a user-supplied search term appears in the page. Word frequency weighting methods include absolute frequency weighting, relative frequency weighting, inverse frequency weighting, and weighting based on a word's discriminating value. A single-word search engine can assign a weight simply by counting how often the word appears in a page, whereas a search engine with logical query grouping must use other weighting methods. Because the results for a multi-term query relate to every term in the query, and the total frequency of each term across all pages differs, ranking by raw total weight would produce irrelevant results. This can be solved in several ways. For example, under the relative word frequency weighting principle, statistics gathered over a large number of web pages can be used to assign a lower initial weight to words that occur frequently across all pages, and a higher weight to words whose frequency across all pages is low [2].
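A minimal sketch of this inverse-frequency idea follows, assuming a toy in-memory corpus of tokenized pages; the function and variable names are illustrative, not from any search engine's actual implementation.

```python
import math
from collections import Counter

def term_weights(pages):
    """pages: list of token lists, one per web page.
    Returns per-page term weights: words common across the whole corpus
    get lower weights, rare words get higher ones."""
    n = len(pages)
    doc_freq = Counter()
    for tokens in pages:
        doc_freq.update(set(tokens))   # count each word once per page
    weights = []
    for tokens in pages:
        tf = Counter(tokens)
        total = len(tokens)
        # relative frequency in the page, scaled down for corpus-wide frequency
        weights.append({w: (tf[w] / total) * math.log(n / doc_freq[w])
                        for w in tf})
    return weights

pages = [["search", "engine", "ranking"],
         ["search", "query", "results"],
         ["ranking", "algorithm", "search"]]
print(term_weights(pages)[0])  # "search" occurs everywhere, so its weight is 0
```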
2.2 Word Position Weighting
By giving different weights to different positions and layouts within a web page, we can determine the relevance of a result to the search terms from those weights. Positions include the page title element, the meta description and keywords elements, body headings, body content, link text, the logo, and so on; layout includes font, font size, bold emphasis, and so on. For example, suppose a user searching to learn about ranking technology queries for "ranking technology" and gets two results: one titled "Search Engine Ranking Technology", the other titled "Web Information Retrieval" whose content merely mentions search engine ranking technology. Clearly the first result is more relevant, so the term "ranking technology" should be given a larger weight in the first result.
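A minimal sketch of position weighting, assuming pages have already been parsed into named regions; the region names and weight values are illustrative assumptions, not figures used by any real engine.

```python
# Illustrative per-region weights: prominent positions count for more.
POSITION_WEIGHTS = {
    "title": 5.0,     # page title element
    "meta": 3.0,      # meta description / keywords elements
    "heading": 2.5,   # body headings
    "anchor": 2.0,    # link text
    "body": 1.0,      # ordinary body content
}

def position_score(term, page_regions):
    """page_regions: dict mapping region name -> list of tokens.
    Sums the region weight for every occurrence of the term."""
    return sum(POSITION_WEIGHTS[region] * tokens.count(term)
               for region, tokens in page_regions.items())

page = {"title": ["search", "engine", "ranking", "technology"],
        "body": ["this", "page", "discusses", "ranking"]}
print(position_score("ranking", page))  # the title hit outweighs the body hit
```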
2.3 Advantages and Disadvantages of These Algorithms
The main advantages of these methods are that they are easy to implement and the most mature in development; they form the core technical foundation of essentially all current search engine ranking. However, because the quality of content on the web cannot be guaranteed, some pages, in order to rank higher, hide large numbers of hot keywords in a layer the same color as the page background: visitors browsing the page cannot see them at all, but the search engine finds them while indexing. This problem has been mitigated to some extent, but it has still not been fully eradicated.
3 Link Analysis Ranking
The idea behind link analysis ranking algorithms actually comes from the citation index mechanism for academic papers: the more often a paper or document is cited, the higher its academic value. By analogy, the more links point to a web page, the more important the page. Link analysis algorithms fall mainly into random-walk models, such as the PageRank algorithm; models based on the mutual reinforcement of hubs and authorities, such as HITS and its variants; probabilistic models, such as SALSA; and Bayesian models, such as the Bayesian algorithm and its simplified version. Each of these algorithms is described below.
3.1 PageRank Algorithm
The Google search engine has two important features that enable it to achieve high-accuracy results. First, it uses the link structure of the web to compute a quality rank for each page; this is PageRank. Second, it uses links to improve search results [3].
The simple principle of PageRank, as shown in Fig. 1, is that a link from page A to page B is treated as a supporting vote by page A for page B, and Google determines a page's importance from the number of votes it receives. But Google does not look only at the number of votes (i.e., the number of links); it also analyzes the pages that cast them: votes cast by pages of high importance carry more weight.
Fig. 1 The simple principle of PageRank [4]
The initial PageRank algorithm: PR(A) = (1-d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Type: PR (a): Page a page PageRank value; PR (TI): Link to page A of the page Ti pagerank value; C (Ti): The number of outbound links to the Web ti; d: damping coefficient, 0<d<1. Lawrence Page and Sergey Brin provide a very simple and intuitive explanation for the above PageRank algorithm. They see PageRank as a model where users do not care about the content of the page and click the link randomly.
In the second version of the algorithm: PR(A) = (1-d)/N + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Here N is the total number of pages on the internet. Algorithm 2 is not fundamentally different from algorithm 1. In the random-surfer model, a page's PageRank value under algorithm 2 is the actual probability of reaching that page after clicking many links. The PageRank values of all pages on the internet therefore form a probability distribution, and the sum of all PageRank values is 1.
Since PR(A) depends on the PageRank values of the pages linking to page A, and those pages' PR values in turn depend on the pages linking to them, the computation is a recursive process that might seem to require infinitely many steps to obtain a page's PR value. According to the test in reference [5], which ran the recursive computation over 322 million links on the web, a convergent, stable PageRank value is obtained after 52 iterations, and after about 45 iterations for half as many links. The experiment shows that the number of iterations grows logarithmically with the number of links: to compute the PageRank values of n links, a stable result is obtained after only about log n iterations [5].
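The recursion is resolved in practice by iterating the formula until the values converge. Below is a minimal power-iteration sketch of the second version of the formula, PR(A) = (1-d)/N + d·ΣPR(Ti)/C(Ti); the tiny graph and the even redistribution from dangling pages are illustrative assumptions.

```python
def pagerank(links, d=0.85, iterations=52):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}        # start from a uniform distribution
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if not outs:                     # dangling page: spread its PR evenly
                for q in pages:
                    new[q] += d * pr[p] / n
            else:                            # each out-link carries PR(p)/C(p)
                for q in outs:
                    new[q] += d * pr[p] / len(outs)
        pr = new
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))  # the values sum to 1, matching the probability reading
```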
3.2 HITS Algorithm
The PageRank algorithm treats all links equally, assuming each link contributes the same weight, whereas in reality some links point to advertisements while others point to authoritative pages. Clearly, uniformly distributed weights do not match the actual situation. Dr. Jon Kleinberg of Cornell University first proposed the HITS algorithm in 1998.
The HITS algorithm's assessment of page quality is reflected in two values it computes for each page: an authority value (content authority) and a hub value (link authority).
The authority value relates to the quality of the content the page itself provides: the more pages link to it, the higher its authority value. Correspondingly, the hub value relates to the quality of the hyperlinks the page provides: the more high-quality pages it links to, the higher its hub value. The query is first submitted to a traditional keyword-based search engine, which returns many pages; the first n of these form the root set. Adding the pages linked to by pages in the root set, and the pages that link into the root set, expands it into the base set. The HITS algorithm outputs the set of pages with large hub values and the set of pages with large authority values [6].
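A minimal sketch of the HITS mutual-reinforcement iteration on an already-built base set; the small graph below stands in for pages gathered from a keyword search engine, and the names are illustrative.

```python
def hits(links, iterations=20):
    """links: dict page -> list of pages it links to (the base set)."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority value: sum of hub values of the pages linking to it
        auth = {p: sum(hub[q] for q, outs in links.items() if p in outs)
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {p: v / norm for p, v in auth.items()}
        # hub value: sum of authority values of the pages it links to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub

graph = {"A": ["B", "C"], "B": ["C"], "D": ["C"]}
auth, hub = hits(graph)
print(sorted(auth, key=auth.get, reverse=True))  # "C" emerges as the top authority
```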
Unlike a practical algorithm such as PageRank, the HITS algorithm is more experimental in nature. On the surface, HITS has far fewer pages to rank, but it takes time at query time to retrieve the root set from a content-based search engine and to expand it into the base set. PageRank, although it processes far more data than HITS, performs its computation on the server side, independently of and in advance of user queries, so the client does not have to wait for it. For this reason, measured by user-side waiting time, the PageRank algorithm should be faster than HITS [7].
3.3 Other Link Analysis Ranking Algorithms
The PageRank algorithm is based on the intuition of a user randomly browsing forward through the web, while the HITS algorithm considers the mutually reinforcing relationship between authority pages and hub pages. In practice, users mostly browse forward, but they often also go back to previously visited pages. Based on this intuition, R. Lempel and S. Moran proposed the SALSA (Stochastic Approach for Link-Structure Analysis) algorithm [8], which accounts for users navigating backward while browsing. It preserves PageRank's random walk and HITS's division of web pages into authorities and hubs, but eliminates the mutual reinforcement between authorities and hubs.
Allan Borodin et al. proposed a fully Bayesian statistical method to identify hub and authority pages. Suppose there are M hub pages and N authority pages (the two sets may coincide). Each hub page i has an unknown real parameter e_i, representing its general tendency to have hyperlinks, and an unknown non-negative parameter h_i, representing its tendency to link to authority pages; each authority page j has an unknown non-negative parameter a_j, representing the level of j's authority. The statistical model gives the prior probability of a link from hub page i to authority page j as P(i,j) = exp(e_i + h_i·a_j)/(1 + exp(e_i + h_i·a_j)); when hub page i has no link to authority page j, P(i,j) = 1/(1 + exp(e_i + h_i·a_j)). From this formula it can be seen that if e_i is very large (hub page i has a strong tendency to point to any page), or if h_i·a_j is very large (i is a high-quality hub and j is a high-quality authority page), then the probability of a link i→j is large [9].
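A minimal sketch of this link prior, assuming the parameter values below, which are chosen purely for illustration.

```python
import math

def link_prior(e_i, h_i, a_j):
    """Prior probability that hub i links to authority j:
    exp(e_i + h_i*a_j) / (1 + exp(e_i + h_i*a_j))."""
    x = e_i + h_i * a_j
    return math.exp(x) / (1.0 + math.exp(x))

# A hub with a strong general linking tendency (large e_i), or a
# high-quality hub pointing at a high-quality authority (large h_i*a_j),
# yields a link probability close to 1.
print(link_prior(e_i=3.0, h_i=0.5, a_j=0.5))   # driven mostly by e_i
print(link_prior(e_i=-1.0, h_i=2.0, a_j=2.0))  # driven mostly by h_i*a_j
```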
4 Other Ranking Techniques
In addition to the two categories of ranking algorithms above, there are other ranking methods. For example, bid ranking is a paid promotion method launched by search engine companies such as Baidu, which ranks results by the price paid; however, the authenticity of bidders' information needs to be strictly vetted, or else users' trust in the search engine will be exploited by the gray industry [10]. Other methods improve ranking accuracy through user feedback, increase ranking relevance through query understanding, and reduce duplication in ranked results through intelligent filtering.
5 Conclusion
In summary, the ranking methods of current search engines such as Google are very complex; they must weigh many factors rather than rely on any single algorithm described above. The author believes that search engines will become more personalized in the future, ranking and filtering results according to user preferences, and that professional search engines for specific domains, such as finance and sports, will gradually develop. It is believed that search engines will exert an even greater influence as browsers continue to grow more powerful.
References:
[1] Dennis Fetterly, Mark Manasse, Marc Najork, Janet Wiener. A Large-Scale Study of the Evolution of Web Pages. In: Proc. of the 12th Int'l World Wide Web Conf. New York: ACM Press, 2003: 669-678.
[2] Yang Silo. Research on Ranking Technology of Search Engines [J]. Modern Library and Information Technology, 2005, (01).
[3] S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Proceedings of the Seventh International World Wide Web Conference (WWW7) / Computer Networks, Amsterdam, 1998.
[4] Page L, Brin S, et al. The PageRank Citation Ranking: Bringing Order to the Web [J]. Stanford Digital Libraries Working Paper, 1998, (6): 102-107.
[5] T. Haveliwala. Efficient Computation of PageRank. Technical Report 1999-31, 1999.
[6] http://www.360doc.com/showWeb/0/0/569471.aspx
[7] He Xiaoyang, Wu Qiang, Wu Zhiong. Comparative Analysis of the HITS Algorithm and the PageRank Algorithm [J]. Intelligence Magazine, 2004, (2).
[8] http://hi.baidu.com/en_seo/blog/item/ba54f586b343f13c67096e97.html
[9] Wei, Wang Chao, Li June, et al. Research on Web Hyperlink Analysis Algorithms [J]. Computer Science, 2003, 30(1).
[10] Changlu, Shazuch. Several Common Sorting Algorithms for Search Engines [J]. Library and Information Work, 2003, (6).