http://www.cnblogs.com/zxjyuan/archive/2010/01/06/1640136.html

1. Introduction

The World Wide Web (WWW) is a huge, globally distributed information service that is expanding at a rapid pace. In 1998 there were about 350 million documents on the WWW [14], roughly 1 million documents were being added each day [6], and the total number of documents was doubling in less than 9 months [14]. Compared with traditional documents, documents on the Web have many new characteristics: they are distributed, heterogeneous, and unstructured or semi-structured, which poses new challenges for traditional information retrieval technology.

Most traditional Web search engines are based on keyword matching and return the documents that contain the query terms; there are also search engines based on directory classification. The results from these search engines are often unsatisfactory. Some sites deliberately inflate the frequency of keywords to increase their apparent importance to search engines, undermining the objectivity and accuracy of the results. In addition, some important pages do not themselves contain the query terms. Classified directories, for their part, cannot cover every category comprehensively, and most are maintained by hand, which is subjective, costly, and slow to update [2].

In recent years, many researchers have found that the hyperlink structure of the WWW is a very rich and important resource: if fully exploited, it can greatly improve the quality of search results. Based on this idea of hyperlink analysis, Sergey Brin and Lawrence Page proposed the PageRank algorithm in 1998 [1]; in the same year, J. Kleinberg proposed the HITS algorithm [5]. Other scholars have since proposed further link analysis algorithms, such as SALSA, PHITS, and Bayesian algorithms. Some of these algorithms have been implemented in production systems and have achieved good results.

Section 2 of this article analyzes the various link analysis algorithms in detail in chronological order and compares them. Section 3 evaluates and summarizes these algorithms and points out open problems and directions for improvement.

2. Web hyperlink analysis algorithms

2.1 Google and PageRank algorithms

The search engine Google began as a prototype system [2] built by Sergey Brin and Lawrence Page, then PhD students at Stanford University, and has since developed into one of the best search engines on the WWW. Google's architecture is similar to that of a traditional search engine; the biggest difference is that it sorts result pages by authority, so that the most important pages appear at the top of the results. Google computes a PageRank value for every page using the PageRank algorithm; this value determines the page's position in the result set: the higher a page's PageRank, the earlier it appears in the results.

2.1.1 PageRank algorithm

The PageRank algorithm is based on the following two premises:

Premise 1: A Web page that is referenced many times is likely to be important; a page that is referenced only a few times but is referenced by an important page may also be important. The importance of a page is distributed evenly among the pages it references. Such important pages are called authoritative (authority) pages.

Premise 2: Assume a user initially visits a random page in the page collection and then browses forward along that page's outgoing links, never navigating backward. The probability of visiting a given page next is then the PageRank value of that page.

The simple PageRank algorithm is described as follows. Let u be a Web page, F(u) the set of pages u points to, B(u) the set of pages that point to u, and N(u) = |F(u)| the number of outgoing links of u; let c be a factor used for normalization (Google usually takes 0.85). (This notation also applies to the algorithms introduced later.) The rank value of u is computed as:

R(u) = c · Σ_{v ∈ B(u)} R(v) / N(v)

This is the formal description of the algorithm; it can also be described with a matrix. Let A be a square matrix whose rows and columns correspond to the pages of the page set, with A(i,j) = 1/N(j) if page j has a link to page i, and A(i,j) = 0 otherwise. If V is the vector of rank values over the page set, then V = cAV; that is, V is an eigenvector of A with eigenvalue 1/c. In fact only the eigenvector corresponding to the largest eigenvalue is required; it gives the final PageRank values for the page set and can be computed by iteration.

Suppose there are two pages A and B that point only to each other and to no other page, and a page C that points to one of them, say A. Then in the iterative computation the rank flowing into A and B is never redistributed outward and keeps accumulating (a "rank sink").

To solve this problem, Sergey Brin and Lawrence Page improved the algorithm by introducing a decay factor E(u), a vector over the page set that acts as a source of rank and corresponds to the initial rank values. The improved algorithm is:

R'(u) = c · ( Σ_{v ∈ B(u)} R'(v) / N(v) + E(u) )

where the values are normalized so that Σ_u R'(u) = 1; the corresponding matrix form is V' = c(AV' + E).
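The improved computation above can be sketched as a simple power iteration. This is an illustrative sketch, not Google's actual implementation; the uniform decay term E(u) = (1−c)/n and the uniform redistribution of dangling-page rank are common conventions assumed here:

```python
# Minimal PageRank power-iteration sketch (assumptions: uniform E(u),
# dangling pages spread their rank evenly over all pages).
# `graph` maps each page to the list of pages it links to.
def pagerank(graph, c=0.85, iterations=50):
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # decay term: every page receives a base amount of rank
        new_rank = {p: (1.0 - c) / n for p in pages}
        for p in pages:
            out = graph[p]
            if not out:
                # dangling page: redistribute its rank uniformly
                for q in pages:
                    new_rank[q] += c * rank[p] / n
            else:
                # distribute rank evenly over the N(p) outgoing links
                for q in out:
                    new_rank[q] += c * rank[p] / len(out)
        rank = new_rank
    return rank
```

On the three-page example from the rank-sink discussion (with C's outgoing link restored), the ranks sum to 1 and a page with more inbound rank flow scores higher.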

There are also some special links that point to pages with no outgoing links (dangling pages). In the PageRank computation such links are first removed; after the computation finishes they are added back, which has very little effect on the rank values already computed.

In addition to sorting search results, the PageRank algorithm can be applied to other tasks, such as estimating network traffic, predicting backward links, and providing navigation for users [2].

2.1.2 Problems with the PageRank algorithm

Google implements the PageRank algorithm in combination with text matching [2]: it first retrieves the pages containing the query terms and then sorts them by rank, with the highest-ranked pages first. But if the most important pages are not in the result page set, the PageRank algorithm is powerless. For example, a Google search for "search engine" should surface pages like Google, Yahoo, and AltaVista, yet these pages may not appear in the results Google returns. The same kind of query also illustrates another problem: Google and Yahoo are among the most popular pages on the WWW; if they appear in the result set for the query "car", the many pages pointing to them give them high rank values, even though they are not in fact closely related to "car".

Other researchers have proposed improvements on the basis of the PageRank algorithm. Matthew Richardson and Pedro Domingos of the Department of Computer Science and Engineering at the University of Washington proposed a PageRank algorithm that combines link and content information: it removes Premise 2 and instead allows for a user who jumps directly from one page to another page that is not directly adjacent but is related in content [3]. Taher Haveliwala of the Department of Computer Science at Stanford University proposed a topic-sensitive PageRank algorithm [4]. Arvind Arasu, also of the Department of Computer Science at Stanford University, showed experimentally that the computational efficiency of the PageRank algorithm can be greatly improved [22].

2.2 The HITS algorithm and its variants

The PageRank algorithm distributes a page's weight evenly among its outgoing links; that is, it ignores the differing importance of different links. Web links have the following characteristics:

1. Some links are annotative, while others serve navigation or advertising functions. Only annotative links are useful for judging authority.

2. For commercial or competitive reasons, few Web pages point to the authoritative pages of their competitors.

3. Authoritative pages rarely describe themselves explicitly; for example, Google's home page does not explicitly describe itself as a "Web search engine".

Evenly distributed weights therefore do not match the reality of Web links [17]. The HITS algorithm proposed by J. Kleinberg [5] introduces another kind of Web page, the hub page. A hub is a page that provides links to authoritative pages; it may not be important in itself, and few pages may point to it, but it provides a collection of links to the sites most important for a topic, such as a list of recommended references on a course home page. In general, a good hub points to many good authorities, and a good authority is pointed to by many good hubs. This relationship between hub and authority pages can be used to discover authoritative pages and to automatically discover Web structures and resources; it is the basic idea of the hub/authority approach.

2.2.1 The HITS algorithm

HITS (Hyperlink-Induced Topic Search) is a search method based on the hub/authority idea. The algorithm proceeds as follows. The query q is submitted to a traditional keyword-matching search engine, which returns many pages; the first n of these are taken as the root set, denoted S. S satisfies the following three conditions:

1. The number of pages in S is relatively small.

2. Most pages in S are related to the query q.

3. S contains relatively many authoritative pages.

S is then expanded into a larger set T by adding the pages that pages in S reference and the pages that reference pages in S.

Take the hub pages in T as the vertex set V1 and the authority pages in T as the vertex set V2; the hyperlinks from pages in V1 to pages in V2 form the edge set E, giving a bipartite graph SG = (V1, V2, E). For each vertex v in V1, h(v) denotes the hub value of page v; for each vertex u in V2, a(u) denotes the authority value of page u. Initially h(v) = a(u) = 1. The I operation is performed on each u to update a(u), the O operation is performed on each v to update h(v), and then a(u) and h(v) are normalized. The following I and O operations are repeated until a(u) and h(v) converge. (A proof that the algorithm converges is given in [5].)

I operation: a(u) = Σ_{v:(v,u)∈E} h(v)   (1)

O operation: h(v) = Σ_{u:(v,u)∈E} a(u)   (2)

After each iteration a(u) and h(v) are normalized, e.g. so that Σ_u a(u)² = 1 and Σ_v h(v)² = 1.

Formula (1) reflects that if a page is pointed to by many good hubs, its authority value increases accordingly (the authority value becomes the sum of the hub values of all pages pointing to it). Formula (2) reflects that if a page points to many good authorities, its hub value increases accordingly (the hub value becomes the sum of the authority values of all pages it links to).

As with the PageRank algorithm, HITS can be described in matrix form; the details are omitted here.

The HITS algorithm outputs the set of pages with the largest hub values and the set of pages with the largest authority values.
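The I/O iteration above can be sketched as follows. This is a minimal illustration; the adjacency-dict representation and the L2 normalization are conventions assumed here, not prescribed by the original paper:

```python
# Minimal HITS sketch over a directed graph (`graph`: page -> list of outlinks).
def hits(graph, iterations=50):
    pages = set(graph) | {q for out in graph.values() for q in out}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # I operation: authority = sum of hub values of pages pointing in
        auth = {p: sum(hub[v] for v in pages if p in graph.get(v, []))
                for p in pages}
        # O operation: hub = sum of authority values of pages pointed to
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        # normalize so the values stay bounded
        for d in (auth, hub):
            norm = sum(x * x for x in d.values()) ** 0.5 or 1.0
            for p in d:
                d[p] /= norm
    return auth, hub
```

On a toy graph where two hubs point at one page and only one hub points at another, the doubly-cited page ends up with the larger authority value, as formulas (1) and (2) suggest.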

2.2.2 Problems with HITS

The HITS algorithm has several problems:

1. In practice, generating T from S is expensive: all the links contained in each page of S must be downloaded and analyzed, and duplicate links excluded. T is generally much larger than S, and building the graph from T is also time-consuming. The a/h values of the pages must be computed separately, so the computational cost is higher than for the PageRank algorithm.

2. Sometimes many documents on one host A point to a document on another host B, which inflates the hub values of the documents on A and the authority value of the document on B; the opposite direction behaves similarly. HITS assumes that authority judgments about a document are made by different organizations or individuals, so such host-level linking distorts the hub and authority values of the documents on A and B [7].

3. Some irrelevant links in a page affect the computation of the a and h values. When pages are generated, some development tools automatically add links, most of which are unrelated to the query topic. Links within the same site are meant to help users navigate and are likewise unrelated to the query topic; there are also commercial advertisements, sponsor links, and reciprocal "friendship" links, all of which reduce the accuracy of the HITS algorithm [8].

4. The HITS algorithm computes only the principal eigenvector, that is, the main community in the set T, and ignores the other communities, which may also be very important [12].

5. The biggest weakness of HITS is its poor handling of topic drift [7,8], namely the tightly-knit community (TKC) effect. If the set T contains a few pages that are unrelated to the query topic but tightly interlinked, the HITS algorithm may return those pages as results, because HITS can only discover the main community, thus drifting away from the original query topic. The TKC problem is addressed by the SALSA algorithm discussed below [8].

6. Using HITS for narrow-topic queries can cause topic generalization [5,9]: the expansion introduces new topics that carry more link weight than the original topic, and the new topics may be irrelevant to the original query. This generalization arises because pages contain outgoing links on different topics, and the links to the new topics happen to be more important.

2.2.3 Variants of HITS

Most of the problems HITS encounters arise because HITS is a purely link-based algorithm that ignores textual content. After J. Kleinberg proposed the HITS algorithm, many researchers improved it and proposed a number of variants, chiefly the following:

2.2.3.1 Monika R. Henzinger and Krishna Bharat's improvements to HITS

For problem 2 above, Monika R. Henzinger and Krishna Bharat proposed an improvement in [7]. Suppose k pages on host A point to a document d on host B; then those k documents together contribute a total authority weight of 1 to d, each contributing 1/k, instead of each contributing 1 (a total of k) as in HITS. Similarly for hub values: if a document t on host A points to m documents on host B, those m documents together contribute a total of 1 to the hub value of t, each contributing 1/m. The I and O operations become:

I operation: a(u) = Σ_{v:(v,u)∈E} h(v) × auth_wt(v, u)

O operation: h(v) = Σ_{u:(v,u)∈E} a(u) × hub_wt(v, u)

where auth_wt(v, u) = 1/k when the host of v has k pages pointing to u, and hub_wt(v, u) = 1/m when v points to m pages on the host of u.

The adjusted algorithm, called the imp algorithm, effectively solves problem 2.
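One I step with the 1/k host weighting might look like this. This is a sketch, not the authors' code; the `host` mapping and the function name are hypothetical:

```python
# Sketch of the imp host-weighted I operation.
# inlinks: target page -> list of source pages; host: page -> host name;
# hub: page -> current hub value.
def imp_authority(inlinks, host, hub):
    auth = {}
    for target, sources in inlinks.items():
        # group the sources by host
        per_host = {}
        for s in sources:
            per_host.setdefault(host[s], []).append(s)
        total = 0.0
        for pages in per_host.values():
            k = len(pages)
            # each host contributes at most one full hub's worth of weight,
            # split 1/k over its k linking pages
            total += sum(hub[s] / k for s in pages)
        auth[target] = total
    return auth
```

With three unit-weight hubs on two hosts (two on host A, one on host B) all pointing at one document, the document's authority is 2, not 3.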

On this basis, Monika R. Henzinger and Krishna Bharat also introduced traditional information retrieval content analysis techniques to solve problems 4 and 5 (and, in effect, problem 3 as well). The method is as follows: extract the first 1000 words from each document in the root set S and concatenate them as the query topic Q; the similarity of a document Dj to the topic Q is then:

sim(Q, Dj) = Σ_i w_{iQ} · w_{iDj}  (suitably normalized), where w_{iQ} = freq(i, Q) · IDF(i) and w_{iDj} = freq(i, Dj) · IDF(i); freq(i, Q) is the number of occurrences of term i in the query Q, freq(i, Dj) is the number of occurrences of term i in document Dj, and IDF(i) is an estimate based on the number of documents on the WWW containing term i.

After S is expanded to T, the topic similarity of each document is computed and a threshold is chosen: the median similarity of all documents, the median similarity of the root-set documents, or one tenth of the maximum document similarity. Documents that fall below the threshold are deleted, and the imp algorithm is then run on the remainder to compute the documents' a/h values. These variants are called med, startmed, and maxby10, respectively.

In this improved algorithm, computing the document similarities is the most expensive step.
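A TF-IDF-style similarity along these lines can be sketched as follows. This is hedged: the exact weighting and normalization used in [7] may differ, and the `doc_freq` argument and log-based IDF estimate are assumptions for illustration:

```python
import math

# Hedged TF-IDF similarity sketch for the med/startmed/maxby10 filters.
# query_counts, doc_counts: term -> occurrence count;
# doc_freq: term -> number of documents containing the term (an estimate);
# n_docs: estimated total number of documents.
def similarity(query_counts, doc_counts, doc_freq, n_docs):
    score = 0.0
    for term, qc in query_counts.items():
        dc = doc_counts.get(term, 0)
        if dc == 0:
            continue
        # idf: rarer terms weigh more
        idf = math.log(n_docs / doc_freq.get(term, 1))
        # w_iQ * w_iDj = (qc * idf) * (dc * idf)
        score += qc * dc * idf * idf
    return score
```

Documents sharing no terms with the topic Q score zero and would be filtered out by any of the three thresholds.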

2.2.3.2 The ARC algorithm

The Clever project team at IBM's Almaden Research Center proposed the ARC (Automatic Resource Compilation) algorithm, which improves on the original HITS by initializing the link matrix of the page set from the anchor text of links, thereby adapting to the fact that different links carry different weights.

The ARC algorithm differs from HITS mainly in the following three respects:

1. When the root set S is expanded to T, HITS expands only along link paths of length 1, that is, only to pages directly adjacent to S, while ARC expands along link paths up to length 2; the expanded page set is called the augment set.

2. In the HITS algorithm, the matrix entry for each link is set to 1; in fact each link differs in importance, and the ARC algorithm uses the text around a link to judge its importance. Consider a link p->q marked up as: text1 <a href="q">anchor text</a> text2. Let n(t) be the number of occurrences of the query term t in text1, the anchor text, and text2; then w(p,q) = 1 + n(t). Experiments set the lengths of text1 and text2 to 50 bytes [10]. Construct the matrix W with W(i,j) = w(i,j) if there is a link i->j and W(i,j) = 0 otherwise; initialize h to 1 and let Z be the transpose of W. The iteration then performs the following three operations:

(1) a = Zh   (2) h = Wa   (3) normalize a and h

3. The goal of the ARC algorithm is to find the top 15 most important pages, so only the relative order of the top 15 a/h values needs to become stable; full convergence of a/h is not required. The number of iterations in step 2 can therefore be very small ([10] finds that 5 iterations suffice), so the ARC algorithm is computationally efficient; the main overhead lies in expanding the root set.
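The anchor-text weight w(p,q) = 1 + n(t) from point 2 can be sketched as below. The 50-byte windows are approximated with 50-character windows, term counting is whitespace-based, and the function name is hypothetical:

```python
# Sketch of the ARC link weight w(p,q) = 1 + n(t): n(t) counts occurrences
# of the query term t in the anchor text plus ~50 characters of surrounding
# text on either side of the link (approximating the 50-byte windows).
def arc_weight(term, before, anchor, after, window=50):
    context = (before[-window:] + " " + anchor + " " + after[:window]).lower()
    n = context.split().count(term.lower())
    return 1 + n
```

A link whose surrounding text never mentions the query term keeps the baseline weight of 1, exactly as in plain HITS.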

2.2.3.3 The hub-averaging (Hub-Averaging-Kleinberg) algorithm

Allan Borodin et al. [11] pointed out the following phenomenon. Suppose there are m+1 hub pages and m+1 authority pages; the first m hubs each point to the first authority page, and the (m+1)-th hub points to all m+1 authority pages. Clearly, by the HITS algorithm the first authority page is the most important and has the highest authority value, which is what we want. However, by HITS the (m+1)-th hub page has the highest hub value, even though it points not only to the first authority page (with its high authority value) but also to other pages with low authority values; its hub value should not exceed the hub values of the first m hubs. Therefore, Allan Borodin et al. modified the O operation of HITS:

O operation: h(v) = (1/n) Σ_{u:(v,u)∈E} a(u), where n is the number of pages v points to.

After this adjustment, a hub that points only to high-authority pages has a higher hub value than a hub that points to both high- and low-authority pages. This is called the hub-averaging (Hub-Averaging-Kleinberg) algorithm.
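A single hub-averaging O step might be sketched as follows (illustrative only; the adjacency-dict representation is an assumption):

```python
# Sketch of the hub-averaging O step: a hub's value is the MEAN (not the
# sum) of the authority values it points to, so linking to weak authorities
# dilutes the hub score instead of always increasing it.
def hub_averaging_step(graph, auth):
    return {p: sum(auth[q] for q in out) / len(out) if out else 0.0
            for p, out in graph.items()}
```

A hub pointing only at a strong authority now outranks a hub that points at the same strong authority plus a weak one.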

2.2.3.4 The threshold (Threshold-Kleinberg) algorithms

In the same paper [11], Allan Borodin et al. also proposed three threshold-control algorithms: the hub-threshold algorithm, the authority-threshold algorithm, and the full-threshold algorithm combining the two.

When computing the authority value of a page p, instead of counting the hub contributions of all pages that point to it, only the pages whose hub value exceeds the average hub value are counted. This is the hub-threshold method.

The authority-threshold algorithm is analogous to the hub-threshold method: when computing the hub value of p, instead of counting the authority contributions of all pages p points to, only the top k authority pages are counted, on the premise that the goal of the algorithm is to find the k most important authority pages.

The algorithm that applies both the authority threshold and the hub threshold is the full-threshold algorithm.

2.3 The SALSA algorithm

The PageRank algorithm is based on the intuition of a user randomly browsing forward, while the HITS algorithm rests on the mutually reinforcing relationship between authority and hub pages. In practice, users usually browse forward but also frequently navigate back to earlier pages. Based on this intuition, R. Lempel and S. Moran proposed the SALSA (Stochastic Approach for Link-Structure Analysis) algorithm [8], which takes the user's backward browsing into account. It retains PageRank's random walk and HITS's division of pages into authorities and hubs, but eliminates the mutual reinforcement between authorities and hubs.

The specific algorithm is as follows:

1. As in the first step of the HITS algorithm, obtain the root set and expand it to the page collection T, removing isolated nodes.

2. From the set T, construct the bipartite graph G' = (Vh, Va, E):

Vh = {sh | s ∈ T and out-degree(s) > 0}   (the hub side of G')

Va = {sa | s ∈ T and in-degree(s) > 0}   (the authority side of G')

E = {(sh, ra) | s->r in T}

This defines two Markov chains: the authority chain and the hub chain.

3. Define the transition matrices of the two Markov chains, which are stochastic matrices: the hub matrix H and the authority matrix A.

4. Compute the principal eigenvectors of the matrices H and A, which are the stationary distributions of the corresponding Markov chains.

5. The pages with large values in the stationary distributions are the important pages.

The SALSA algorithm has no iterative mutual-reinforcement process as in HITS, so its computational cost is much lower. SALSA considers only the influence of directly adjacent pages on a page's a/h values, while HITS computes the influence of the entire page set T.

In practice, SALSA ignores a number of irrelevant links when expanding the root set, such as:

1. Links within the same site, since most of these serve only navigation.

2. CGI script links.

3. Advertising and sponsorship links.

Experimental results show that for the single-topic query "java", SALSA gives more accurate results than HITS; for the multi-topic query "abortion", HITS's results concentrate on some aspects of the topic, while SALSA's results cover many aspects. That is, against the TKC effect, the SALSA algorithm is more robust than HITS.
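The authority chain's stationary distribution can be approximated by power iteration over the back-then-forward walk. This is a deliberately naive sketch for clarity (each step scans all link pairs), not an efficient implementation:

```python
# Minimal SALSA sketch: from an authority q, step backward along one of its
# in-links to a hub p, then forward along one of p's out-links; the
# stationary distribution of this chain gives the authority weights.
def salsa_authority(graph, iterations=100):
    links = [(p, q) for p, out in graph.items() for q in out]
    in_deg, out_deg = {}, {}
    for p, q in links:
        out_deg[p] = out_deg.get(p, 0) + 1
        in_deg[q] = in_deg.get(q, 0) + 1
    auth_nodes = list(in_deg)
    a = {q: 1.0 / len(auth_nodes) for q in auth_nodes}
    for _ in range(iterations):
        new_a = {q: 0.0 for q in auth_nodes}
        for p, q in links:          # step back from q to hub p ...
            for p2, q2 in links:    # ... then forward from p to q2
                if p2 == p:
                    new_a[q2] += a[q] / (in_deg[q] * out_deg[p])
        a = new_a
    return a
```

On a small connected example, the stationary authority weights come out proportional to in-degree within the component, consistent with Lempel and Moran's analysis.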

2.3.1 The BFS (Backward Forward Step) algorithm

When the SALSA algorithm computes the authority value of a page, it considers only the page's popularity within its directly adjacent page set, ignoring the influence of other pages. The HITS algorithm, by contrast, considers the structure of the entire graph: after n steps, the authority weight of page i reflects the number of paths of length n arriving at i, so the contribution of a page j ≠ i to i equals the number of such paths from j to i. If a path from j to i contains a cycle, j's contribution to i grows exponentially as the iteration proceeds, which is not what the algorithm intends, since the cycle may have nothing to do with the query.

Therefore, Allan Borodin et al. [11] proposed the BFS (Backward Forward Step) algorithm, which is both an expansion of SALSA and a restriction of HITS. The basic idea is this: where SALSA considers only the influence of directly adjacent pages, BFS expands this to consider the influence of pages reachable within a path of length n; and where HITS counts the number of paths, BFS counts each reachable node once, so that a node j contributes according to its distance from i rather than the number of paths between them. The weight of node i is computed as follows:

w(i) = |B(i)| + |BF(i)| + |BFB(i)| + |BFBF(i)| + … , truncated after n alternating backward/forward steps.

The algorithm starts from node i, going backward in the first step, and thereafter going either forward or backward at each step; each newly encountered node joins the weight computation, and a node is counted only the first time it is visited.
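The weight computation described above can be sketched as follows. This is a sketch of the description in the text; the exact step limit and direction rules in [11] may differ:

```python
# Sketch of the BFS (Backward Forward Step) weight: start at page i, go
# backward on the first step, then either direction; each node is counted
# toward w(i) only on its first visit.
def bfs_weight(i, out_links, in_links, steps=3):
    frontier = {i}
    visited = {i}
    weight = 0
    for step in range(steps):
        nxt = set()
        for node in frontier:
            neighbours = set(in_links.get(node, []))   # backward step (B)
            if step > 0:
                neighbours |= set(out_links.get(node, []))  # forward step (F)
            nxt |= neighbours
        new = nxt - visited
        weight += len(new)   # count each node once, at its first visit
        visited |= new
        frontier = nxt
    return weight
```

Because nodes are counted once rather than per path, a cycle near i no longer inflates the weight exponentially.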

2.4 PHITS

D. Cohn and H. Chang proposed PHITS (Probabilistic analogue of HITS), a statistical algorithm for computing hubs and authorities [12]. They present a probabilistic model in which a latent factor or topic z influences the appearance of a link from document d to document c. They further assume that, given a factor z, there is a conditional distribution P(c|z) over cited documents c, and that, given a document d, there is a conditional distribution P(z|d) over factors z.

P(d, c) = P(d) Σ_z P(c|z) P(z|d)

Based on these conditional distributions, a likelihood function L is defined:

L = Π_{(d,c)} P(d, c)^{M(d,c)}, where M is the link matrix of the page set (M(d,c) counting the links from d to c).

The PHITS algorithm then uses the EM algorithm of Dempster et al. [20] to estimate the unknown conditional probabilities so as to maximize L, that is, to best explain the links between the pages. The algorithm requires the number of factors z to be given in advance. Allan Borodin et al. point out that the EM algorithm used in PHITS may converge to a local maximum rather than the true global maximum [11]. D. Cohn and T. Hofmann [13] also propose a probabilistic model that combines document content and hyperlinks.

2.5 Bayesian algorithm

Allan Borodin et al. proposed a fully Bayesian statistical approach to determining hub and authority pages [11]. Suppose there are M hub pages and N authority pages (the two sets may coincide). Each hub page i has an unknown real parameter e_i expressing its general tendency to have hyperlinks and an unknown non-negative parameter h_i expressing its tendency to link to authority pages. Each authority page j has an unknown non-negative parameter a_j expressing its level of authority.

The statistical model is as follows. The prior probability of a link from hub page i to authority page j is given by:

P(i, j) = exp(e_i + h_i·a_j) / (1 + exp(e_i + h_i·a_j))

and the probability that hub page i has no link to authority page j is P(i, j) = 1 / (1 + exp(e_i + h_i·a_j)).

From the formula above it can be seen that if e_i is very large (hub page i has a strong tendency to point to any page), or if h_i and a_j are both very large (i is a high-quality hub and j is a high-quality authority), then the probability of the link i->j is large.

To conform to Bayesian practice, the 2M+N unknown parameters (e_i, h_i, a_j) must be given prior distributions; these should be general and uninformative, should not depend on the observed data, and should have only a small influence on the results. Allan Borodin et al. let e_i follow the normal distribution N(μ, δ²) with mean μ = 0 and standard deviation δ = 10, and let h_i and a_j follow the Exp(1) distribution, i.e. for x ≥ 0, P(h_i ≥ x) = P(a_j ≥ x) = exp(−x).

The remaining processing is the standard Bayesian estimation of these parameters, playing the role that the computation of the matrix's principal eigenvector plays in HITS.

2.5.1 The simplified Bayesian algorithm

Allan Borodin et al. also proposed a simplified Bayesian algorithm that removes the parameter e_i entirely, eliminating the need for the prior parameters μ and δ. The link probability becomes P(i, j) = h_i·a_j / (1 + h_i·a_j), and the probability that hub page i has no link to authority page j becomes P(i, j) = 1 / (1 + h_i·a_j).

Allan Borodin et al. point out that the results of the simplified Bayesian algorithm are very similar to those of the SALSA algorithm.

2.6 The reputation algorithm

All of the algorithms above start from a query term or topic and produce result pages through algorithmic processing. Alberto Mendelzon and Davood Rafiei of the Department of Computer Science at the University of Toronto proposed the inverse problem: take the URL of a Web page as input and output a set of topics on which the page has a reputation [16]. For example, for the input www.gamelan.com, a likely output is "java"; the actual system can be accessed at http://www.cs.toronto.edu/db/topic.

Given a page p, to compute its reputation on a topic t, two parameters are first defined: penetration and focus. For simplicity, a page is considered to be "on topic t" if it contains the term t.

Let R(p, t) be the number of pages that point to p and contain t, L(p) the number of pages that point to p, and T(t) the number of pages that contain t. The penetration of p on t is R(p, t)/T(t), and the focus is R(p, t)/L(p). To combine these with the unconditional probabilities, N, the total number of pages on the Web, is introduced; the reputation of p on t is then defined as the probability that a random page both points to p and contains t.

R(p, t), L(p), and T(t) can be obtained from the results of a search engine (such as AltaVista), and the total number of pages on the Web is estimated and regularly published by several organizations. Since a constant factor does not affect the ordering of RM, the reputation measure finally computed is:

RM(p, t) ∝ R(p, t)² / (L(p) · T(t))

Given a page p and a topic t, RM can be computed as above; but in most cases only the page p is given, and the topics must first be extracted before RM can be computed. The goal of the algorithm is to find a set of topics t for which RM(p, t) is large. The Topic system extracts topics from the anchor text of pages that point to p (as discussed above, anchor text describes the target page well, with high precision); this avoids downloading all the pages that point to p, and since computing RM(p, t) is very simple, the algorithm is quite efficient. When extracting topics, text used for navigation and repeated links is ignored, and stop words such as "a", "the", "for", and "in" are filtered out.
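Under the reconstruction above, the ranking score is just the product of penetration and focus. This is a hedged sketch of that product; the argument names are hypothetical:

```python
# Hedged sketch of the reputation score: with r_pt = pages that contain
# topic t AND point to p, l_p = pages pointing to p, t_t = pages containing
# t, the score is penetration (r_pt/t_t) times focus (r_pt/l_p), which is
# proportional to r_pt**2 / (l_p * t_t).
def reputation(r_pt, l_p, t_t):
    if l_p == 0 or t_t == 0:
        return 0.0
    return (r_pt / t_t) * (r_pt / l_p)
```

For example, if 10 of the 20 pages pointing at p mention t, and 100 pages on the Web mention t in total, the score is 0.1 × 0.5 = 0.05; a constant factor of N would not change the ordering of topics.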

The reputation algorithm is also based on the random walk model and can be seen as a combination of the PageRank and SALSA algorithms.

3. Classification and evaluation of link algorithms

Link analysis algorithms can be used to improve search engine query results, to discover important communities on the WWW, to analyze the topology, reputation, and categorization of a website, and to classify documents automatically, among other applications. In the final analysis, they help users find the information they need on the WWW. This is a rapidly developing area of research.

From a historical perspective, we have summarized the development of link analysis algorithms, introduced the basic ideas and implementations of the algorithms in detail, and discussed their outstanding problems. Some of these algorithms are still at the research stage, while others have been implemented in concrete systems. The algorithms can be broadly divided into categories: those based on the random walk model, such as PageRank and the reputation algorithm; those based on the mutual reinforcement of hubs and authorities, such as HITS and its variants; those based on probabilistic models, such as SALSA and PHITS; and those based on Bayesian models, such as the Bayesian algorithm and its simplified version. In practical applications, all of these algorithms are combined with traditional content analysis techniques for optimization. Several real systems implement some of the algorithms and have achieved good results: Google implements the PageRank algorithm, the Clever project at IBM's Almaden Research Center implements the ARC algorithm, and the Department of Computer Science at the University of Toronto has implemented the prototype system Topic, which computes the topics on which a given page has a reputation.

Brian Amento of the AT&T Shannon Laboratory pointed out that ranking page quality by authority values is consistent with the evaluations of human experts, and that the results of the various link analysis algorithms differ very little in most cases [15]. However, Allan Borodin et al. point out that no single algorithm is perfect: the same algorithm may produce good results on some queries and poor results on others [11]. An appropriate algorithm should therefore be chosen according to the query at hand.

Link analysis algorithms provide an objective way to measure page quality that is independent of language and content and requires no human intervention; they can automatically discover important resources on the Web, mine important Web communities, and classify documents automatically. But some common problems affect the accuracy of these algorithms.

1. The quality of the root set. The root set must be of very high quality; otherwise the expanded page set will include many irrelevant pages, causing topic drift and topic generalization, and greatly increasing the amount of computation. No algorithm, however good, can find many high-quality pages within a set of low-quality pages.

2. Noise links. Not every link on the Web carries useful information: advertisements, site navigation, sponsor links, and reciprocal "link exchange" links not only contribute nothing to link analysis but also distort the results. How to effectively remove these irrelevant links is a key point of any algorithm.

3. The use of anchor text. Anchor text has high precision and describes both the link and the target page fairly accurately. The algorithms above all use anchor text to optimize their results in concrete implementations. How accurately and fully anchor text is used has a great influence on an algorithm's accuracy.

4. The classification of queries. Each algorithm has its own field of application; different queries call for different algorithms in order to obtain the best results. Classifying queries is therefore also very important.

Of course, these problems are highly subjective. For example, quality cannot be precisely defined; there is no effective method for judging exactly whether a link carries important information; analyzing anchor text raises semantic issues; and query classification has no clear boundaries. If the algorithms are to achieve better results, in-depth research in these areas must continue. I believe that in the near future more interesting and useful results will appear.

Search engine Algorithms and research

theEDGE recommended [2007-9-2]

Source: Yi Tian Rui Xin

Christine Churchill

As a search engine or SEO professional, do you really need to understand the algorithms and techniques behind search engines? At a recent Search Engine Strategies conference, the experts on the search engine algorithms and research panel answered emphatically: absolutely.

This is a special report from the Search Engine Strategies conference held February 28 to March 3, 2005 in New York.

The members of the search engine algorithms and research panel included: Rahul Lahiri, vice president of product management and search technology at Ask Jeeves; Mike Grehan, CEO of Smart Interactive (recently acquired by WebSourced); and Dr. Edel Garcia of Mi Islita.com.

What are the ins and outs of the problem?

"Do we really need to know everything at the search engine technology level?" Grehan asked. Yes, he answered unequivocally, and he went on to explain the competitive advantage that comes from understanding search engine algorithms.

"If you know what causes one document to rank higher than another, you can optimize strategically and serve your customers better. And if your client asks, 'Why is my competitor always in the top 20 and I'm not? How does the search engine work?' and you answer, 'I don't know, that's just how they are,' how long do you think you will keep the account?"

Grehan illustrated his point by citing Brian Pinkerton, who developed the first full-text Web search engine in 1994. He painted a picture: "A customer walks into a large travel goods store that stocks everything needed to travel around the world, looks at the young salesperson there, and asks for 'travel'. Where do you think that salesperson should start?"

Search engine users want to achieve their goals with minimal effort and maximum satisfaction. They do not think carefully when entering queries; they search with a few imprecise words and never learn how to compose a proper query. This makes the search engine's job harder.

The evolution of search methods, the abundance problem, and algorithms

Grehan went on to discuss the important role search plays in document ranking. "A combination of many fascinating things creates a ranking. We should learn as much as possible, so that when we talk about why one article ranks higher than another, we can at least show some evidence of what is happening."

Grehan illustrated how search engine algorithms have evolved over time. In early search engines, text was extremely important. But search researcher Jon Kleinberg identified what he called the "abundance problem": when a search returns thousands of pages containing the right text, how do you know which page is the most important or the most relevant? How does the search engine decide which page should appear at the top of the results list? Search engine algorithms have had to grow steadily more sophisticated to cope with the abundance problem.
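Kleinberg's own answer to the abundance problem was the mutually reinforcing hub/authority iteration (HITS), mentioned in the survey above. The sketch below is a toy illustration of that iteration over an invented four-page graph; it is not the production code of any engine discussed here:

```python
# Toy HITS-style iteration: good hubs point to good authorities,
# and good authorities are pointed to by good hubs.
def hits(graph, iterations=50):
    pages = list(graph)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in graph[p]) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# h1 links to both authorities, so it is the stronger hub;
# a1 is linked by both hubs, so it is the stronger authority.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(graph)
```

Unlike PageRank, these scores are computed per query over a small subgraph of results, which is exactly how HITS attacks the abundance problem: it separates the thousands of matching pages into hubs and authorities instead of treating them uniformly.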

The view from Ask Jeeves

According to Rahul Lahiri, Ask Jeeves is the seventh-ranked property on the Internet and the fourth-ranked search engine. Lahiri listed several things that are key to the Ask Jeeves search engine, including index size, freshness of content, and data structure. Ask Jeeves's attention to data structure is unique and distinguishes it from other search engines.

There are two key drivers in Web search: content analysis and link analysis. Lahiri confirmed that Ask Jeeves treats the Web as a graph, examines the link relationships between pages, and attempts to draw clusters of related information.

By dividing the Web into distinct clusters of information, Ask Jeeves can better understand a query and answer the searcher with authoritative "knowledge" from each cluster. If you have a smaller site that is highly relevant within its cluster, it may rank better than larger sites outside the cluster that provide related information.

Why co-occurrence is important

Dr. Edel Garcia was delayed and unable to present to the panel in person, but he had prepared a PowerPoint file with recorded voice-over narration. The moderator, Chris Sherman, told the audience that the slides would be explained in Dr. Garcia's own words.

Dr. Garcia is an expert with a special interest in AI and information retrieval. He explained that terms which co-occur tend to be regarded as relevant or "interrelated". In addition, semantic association affects our understanding of a word: when we see "Aloha" (the Hawaiian greeting and farewell), we think of "Hawaii". Why? Because of the semantic association between the words. According to Garcia, co-occurrence theory can be used to understand the semantic associations between words, trademarks, products, services, and so on.
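Garcia's point can be made concrete in a few lines. This toy example (the three "documents" are invented for illustration) counts how often word pairs appear in the same document, which is the raw statistic underlying co-occurrence analysis:

```python
from collections import Counter
from itertools import combinations

# Count how often each word pair co-occurs in the same short document.
docs = [
    "aloha hawaii beach",
    "aloha hawaii surf",
    "paris france tower",
]

pair_counts = Counter()
for doc in docs:
    # Sort the unique words so each pair has one canonical ordering.
    words = sorted(set(doc.split()))
    pair_counts.update(combinations(words, 2))

# ("aloha", "hawaii") co-occurs in two documents, so a co-occurrence
# model would treat the two terms as semantically associated.
```

In practice these raw counts are normalized (e.g. into association ratios) before being used, but even the counts alone show why "Aloha" evokes "Hawaii": the pair keeps appearing together.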

Dr. Garcia then raised a question: why should we care about the co-occurrence of words in search engines? His answer: consider keyword-trademark associations, which have important implications for search marketing.

To learn more about Dr. Garcia's theory, visit the Search Engine Watch forum thread on keyword co-occurrence and semantic connectivity.

The session ended with a lively question-and-answer period. What is the evolutionary trend of search engine algorithms? Grehan had a ready answer: he expects the introduction of probabilistic latent semantic indexing and probabilistic hypertext induced topic selection. What does all that jargon mean? You will have to attend the next search engine conference to find out.

About

Christine Churchill is the president of KeyRelevance.com, a full-service search engine marketing firm offering natural search engine optimization, strategic link building, usability testing, and pay-per-click management services.

Research on several search engine algorithms