When parallel computing applications come up, many people think of the PageRank algorithm: the links of thousands upon thousands of web pages have to be analyzed to determine a ranking, which makes it a good scenario for parallel computing. PageRank, the invention Google was founded on, has long attracted study. It is said that Google's founders excitedly went to Yahoo! to say they had found a better search engine algorithm, only for Yahoo!'s technical staff to pour cold water on them, saying that what they cared about was not better technology but the profit from search. Google later grew into a new generation of search engine with more advanced technology, gradually took over the market, and became profitable.
Since the PageRank algorithm is so widely known, we use it as an example of the classic combination of parallel computing and data algorithms. In addition, its analysis process, "concurrent processing of massive data, with convergence after multiple iterations", is similar to that of other data mining and machine learning applications, so it serves as a good reference.
The formula for PageRank is as follows:
PR(A) = (PR(B)/L(B) + PR(C)/L(C) + ...) * q + 1 - q
We could simply explain the formula itself and show how to apply it with parallel computing to obtain the PageRank value of each web page. But then, although the PageRank computation would get done, we would still not understand where the formula comes from.
So let's put the PageRank formula aside for now and look at a gambling game:
There are three players, A, B, and C, with the following win/lose relationship:
A loses his money to B and C
B loses his money to C
C loses his money to A
For example, suppose A, B, and C each start with a stake of 100 yuan. Playing one round under the relationship above:
A loses 50 yuan to B and 50 yuan to C
B loses 100 yuan to C
C loses 100 yuan to A
If they play just one round, it is easy to work out who wins and who loses.
But what happens if they keep the same win/lose relationship and put all their winnings into another round, again and again?
We can write a standalone program to find out. For ease of computing, the initial stake is set to 1 yuan, with x1, x2, and x3 representing A, B, and C:
double x1 = 1.0, x2 = 1.0, x3 = 1.0;
x1_income, x2_income, and x3_income represent the money each player wins in one round. Based on the win/lose relationship:
double x2_income = x1 / 2.0;
double x3_income = x1 / 2.0 + x2;
double x1_income = x3;
Finally, each player's money is overwritten with what he won, and the calculation continues. The complete program is as follows:
// Gamble standalone program
public class Gamble
{
    // Current money of A, B, and C; everyone starts with 1 yuan.
    public static double x1 = 1.0, x2 = 1.0, x3 = 1.0;

    // Plays one round and prints each player's money afterwards.
    public static void playGame() {
        double x2_income = x1 / 2.0;       // B wins half of A's money
        double x3_income = x1 / 2.0 + x2;  // C wins half of A's money and all of B's
        double x1_income = x3;             // A wins all of C's money
        x1 = x1_income;
        x2 = x2_income;
        x3 = x3_income;
        System.out.println("X1:" + x1 + ", X2:" + x2 + ", X3:" + x3);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 500; i++) {
            System.out.print("Round " + i + " ");
            playGame();
        }
    }
}
Running 500 rounds, we find that from round 107 onward each player's result stays fixed at:
X1: 1.2000000000000002, X2: 0.6000000000000001, X3: 1.2000000000000002
......
You may not have expected such a regularity: keep betting like this and, although every round has winners and losers, the results after many rounds settle into a balance and stop changing. In technical terms, the iteration converges after multiple rounds. In everyday terms, A and C do not lose, while B, however long he keeps gambling out of refusal to accept defeat, never gets a chance to win his money back.
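The "convergence after multiple rounds" described above can be checked without reading 500 lines of output. The following is a minimal sketch, not part of the original demo: the class name GambleConvergence, the method names, and the tolerance 1e-12 are all assumptions chosen for illustration. It replays the first game and stops as soon as no player's money changes by more than the tolerance.

```java
// Illustrative sketch: names and tolerance are assumptions, not from the original demo.
public class GambleConvergence {
    // One round of the first game: A loses half to B and half to C,
    // B loses everything to C, C loses everything to A.
    static double[] playRound(double[] x) {
        double x1Income = x[2];
        double x2Income = x[0] / 2.0;
        double x3Income = x[0] / 2.0 + x[1];
        return new double[] {x1Income, x2Income, x3Income};
    }

    // Plays until the largest per-player change in a round is below epsilon.
    // Returns the number of rounds played, or -1 if maxRounds was not enough.
    // The final state is written back into x.
    static int runUntilStable(double[] x, double epsilon, int maxRounds) {
        for (int round = 1; round <= maxRounds; round++) {
            double[] next = playRound(x);
            double maxChange = 0.0;
            for (int i = 0; i < x.length; i++) {
                maxChange = Math.max(maxChange, Math.abs(next[i] - x[i]));
                x[i] = next[i];
            }
            if (maxChange < epsilon) {
                return round;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 1.0, 1.0};
        int rounds = runUntilStable(x, 1e-12, 500);
        System.out.println("rounds: " + rounds
                + ", X1: " + x[0] + ", X2: " + x[1] + ", X3: " + x[2]);
    }
}
```

A looser tolerance stops earlier and a tighter one later, so the round count this prints depends on the chosen epsilon and need not match the count quoted in the text.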
Let's change the win/lose relationship so that C loses his money to A and B (half each):
double x2_income = x1 / 2.0 + x3 / 2.0;
double x3_income = x1 / 2.0 + x2;
double x1_income = x3 / 2.0;
Running 10000 rounds, we find that it converges again:
X1: 0.6666666666666667, X2: 1.0, X3: 1.3333333333333333
...
But this time the outcome is "A loses, B breaks even, and C wins". We find that the converged result can be used for ranking. If we give them a gambling rank, it is clearly: "C ranks first, B second, and A third".
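The step from converged values to a ranking can be sketched in a few lines. This is an illustrative example rather than the author's code: the class GambleRanking, the share matrix representation, and the choice of 200 rounds are assumptions. share[i][j] holds the fraction of player i's money lost to player j each round, filled in here for the second game (C loses half to A and half to B).

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch: the matrix form and all names are assumptions.
public class GambleRanking {
    // Starts everyone at 1.0 and plays the given number of rounds.
    // share[i][j] is the fraction of player i's money lost to player j per round.
    static double[] iterate(double[][] share, int rounds) {
        int n = share.length;
        double[] x = new double[n];
        Arrays.fill(x, 1.0);
        for (int r = 0; r < rounds; r++) {
            double[] income = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    income[j] += x[i] * share[i][j];
                }
            }
            x = income;
        }
        return x;
    }

    // Returns the player names ordered from most money to least.
    static String[] rank(String[] names, double[] x) {
        Integer[] order = new Integer[names.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -x[i]));
        String[] ranked = new String[names.length];
        for (int i = 0; i < order.length; i++) {
            ranked[i] = names[order[i]];
        }
        return ranked;
    }

    public static void main(String[] args) {
        double[][] share = {
            {0.0, 0.5, 0.5},  // A loses half to B and half to C
            {0.0, 0.0, 1.0},  // B loses everything to C
            {0.5, 0.5, 0.0}   // C loses half to A and half to B
        };
        double[] x = iterate(share, 200);
        System.out.println(Arrays.toString(x));
        System.out.println(String.join(" > ", rank(new String[] {"A", "B", "C"}, x)));
    }
}
```

Writing the relationship as a matrix makes it easy to try other win/lose relationships by editing the rows instead of the update formulas.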
So does this convergence happen in every situation, and if not, what happens instead?
Look back at the relationships above. A, B, and C each both win and lose, so nobody ever runs out of money and the three of them can keep gambling. Now let's change the win/lose relationship so that A only loses money and never wins, as shown below:
double x2_income = x1 / 2.0 + x3 / 2.0;
double x3_income = x1 / 2.0 + x2;
double x1_income = 0;
So what is the result?
We find that after many rounds, all values are 0. Let's analyze the process. After the first round, A has lost all of his money without winning a penny. B and C still win and lose against each other, but after more than 2000 rounds B's money is all gone too; with neither A nor B having any money to put in, C has nothing left to win, and the final result for everyone is 0.
Now look at the win/lose relationship. Once A has lost all of his money to B and C, B and C bet against each other with C winning more and losing less, so the money drifts toward C; but in the program the half of C's money that would have gone to A is discarded (x1_income = 0), so the total shrinks every round and no balance can be maintained. Therefore, to maintain balance and convergence, whoever wins money must not be allowed to walk away with it; the winner must also lose to others, so that the money keeps circulating among the three. In other words, if someone only wins or only loses, the game cannot continue.
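To see why everything ends at 0 in this variant, it helps to watch the total amount of money rather than the individual values. The sketch below is illustrative only (the class name LosingOnlyGame and its methods are assumptions); it replays the same three update formulas and reports the total after a given number of rounds.

```java
// Illustrative sketch: class and method names are assumptions.
public class LosingOnlyGame {
    // Returns {x1, x2, x3} after the given number of rounds of the
    // "A only loses" game, starting from 1.0 each.
    static double[] play(int rounds) {
        double x1 = 1.0, x2 = 1.0, x3 = 1.0;
        for (int i = 0; i < rounds; i++) {
            double x2Income = x1 / 2.0 + x3 / 2.0;
            double x3Income = x1 / 2.0 + x2;
            double x1Income = 0; // A never wins, so half of C's money leaves the game
            x1 = x1Income;
            x2 = x2Income;
            x3 = x3Income;
        }
        return new double[] {x1, x2, x3};
    }

    // Total money still in the game after the given number of rounds.
    static double total(int rounds) {
        double[] x = play(rounds);
        return x[0] + x[1] + x[2];
    }

    public static void main(String[] args) {
        for (int rounds = 0; rounds <= 6; rounds++) {
            System.out.println("after " + rounds + " rounds, total = " + total(rounds));
        }
        System.out.println("after 3000 rounds, total = " + total(3000));
    }
}
```

From the second round on, the total halves every two rounds, which is consistent with the values only reaching 0 after a couple of thousand rounds, once they fall below what a double can represent.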
With the gambling game finished, let's return to the PageRank formula:
PR(A) = (PR(B)/L(B) + PR(C)/L(C) + ...) * q + 1 - q
Here L(B) is the number of links from page B to other pages. For example:
Assume there are three web pages, A, B, and C, linked as follows:
A contains links to B and C
B contains a link to C
C contains a link to A
According to the formula above (ignoring q for now), the PR value of each web page is:
PR(B) = PR(A)/2;
PR(C) = PR(A)/2 + PR(B);
PR(A) = PR(C);
Compare this with what we did earlier: replace A, B, and C with x1, x2, and x3 and you have exactly the gambling game example above.
So what is q? The q in the formula is called the damping factor (literally, the "escape factor"). The name is abstract, but its job is to solve the "only loses" or "only wins" non-convergence problem seen in the gambling game: the 1 - q term ensures that when one PR value is 0, the calculation does not drag everything down to 0. But does wrapping the sum in the (...) * q + 1 - q relationship change the overall PR total?
When the initial PR value of each page is 1 and 0 <= q <= 1 (usually 0.8 in calculation), we can add up the PR values of all pages. Suppose there are n web pages:
PR(X1) + PR(X2) + ... + PR(Xn)
= ((PR(X2)/L(X2) + ...) * q + 1 - q) + ... + ((PR(X1)/L(X1) + ...) * q + 1 - q)
= (PR(X1) * L(X1)/L(X1) + PR(X2) * L(X2)/L(X2) + ... + PR(Xn) * L(Xn)/L(Xn)) * q + n * (1 - q)
= (PR(X1) + PR(X2) + ... + PR(Xn)) * q + n - n * q
= n * q + n - n * q
= n
Since every initial PR value is 1, the sum of the PR values of all pages is still n and stays unchanged (this assumes every page has at least one outgoing link). So adding the (...) * q + 1 - q relationship keeps PR values from collapsing to 0 while still allowing convergence, and the converged values can be used for ranking.
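The derivation above can be checked numerically. The following is a minimal sketch under stated assumptions: the class DampedPageRank, the index-based link table, and the 100-round limit are illustrative choices, and q = 0.8 is simply the value mentioned in the text. It applies PR = (sum of PR(X)/L(X)) * q + 1 - q to the three-page example (A links to B and C, B links to C, C links to A) and prints the total.

```java
import java.util.Arrays;

// Illustrative sketch: names and the link-table layout are assumptions.
public class DampedPageRank {
    // links[i] lists the indices of the pages that page i links to.
    // Every page must have at least one outgoing link.
    static double[] iterate(int[][] links, double q, int rounds) {
        int n = links.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0); // initial PR value of every page is 1
        for (int r = 0; r < rounds; r++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                for (int target : links[i]) {
                    next[target] += pr[i] / links[i].length; // PR(X) / L(X)
                }
            }
            for (int i = 0; i < n; i++) {
                next[i] = next[i] * q + 1 - q; // (...) * q + 1 - q
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        int[][] links = {
            {1, 2}, // A links to B and C
            {2},    // B links to C
            {0}     // C links to A
        };
        double[] pr = iterate(links, 0.8, 100);
        System.out.println("PR(A), PR(B), PR(C) = " + Arrays.toString(pr));
        System.out.println("sum = " + (pr[0] + pr[1] + pr[2]));
    }
}
```

With q = 1 the update reduces to the first gambling game; with q < 1 the individual values shift but the total stays at n, as the derivation predicts.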
Of course, in practical applications the formula can be made more complex and solved with matrix methods from advanced algebra. Here we only want to understand the principle, not build a search algorithm, so we will not go further.
Summary: many things in the world are zero-sum games, just like stock trading: the money retail investors make is the money institutions lose, and the money institutions make is the money retail investors lose. Perhaps retail investors should study the PageRank algorithm to see whether the stock market is converging. If it is, they will never win their money back and the institutions will never lose.
How to calculate the PR values using parallel computing:
Here we design the computation using the parallel computing modes provided by fourinone. There are several possible approaches.
First-time users can refer to the distributed computing getting-started demo guide. Development kit: http://www.skycn.com/soft/68321.html
Approach 1: use the mechanism of mutual merging between workers (for how workers merge with one another and use receive, see the sayhello demo). Each worker analyzes the links of its current web page and casts a PR vote for each link, sending the vote through receive directly to the worker machine that holds the linked page. After one round of mutual voting, each worker counts the votes received by the pages on its machine to obtain their new PR values. The drawback is that every link vote requires a separate receive request to another worker machine, which consumes a lot of bandwidth; when a page has many links, many receive requests are needed and performance is poor.
Approach 2: PR calculation is characterized by large input and small output. The thousands upon thousands of web pages take up a lot of space, but the computed PR values take up little and can be held in memory. So we prefer to let each worker process the web pages on its own machine, compute the votes for the pages that each link points to, and return them to the contractor, which merges them into the PR value of each page. This uses the most basic "total-split-total" parallel computing mode (see the distributed computing getting-started demo guide).
The splitting and merging for the parallel computation are designed as follows:
As you can see:
Each worker is responsible for tallying the PR votes for every link in the web pages on its own machine.
The contractor is responsible for merging and accumulating the votes into the new PR value of each linked web page, and for driving the iterative calculation.
Program Implementation:
PageRankWorker: the implementation of a PageRank worker. For ease of demonstration, it uses a string array to represent the links a page contains (in practice they should be obtained from the local web page file):
links = new String[]{"B", "C"};
It then casts a PR vote for each link in the link set:
for (String p : links)
    outhouse.setObj(p, pr / links.length);
PageRankCtor: the implementation of a PageRank contractor. It sets the initial PageRank value of the three web pages A, B, and C to 1.00 and then runs the computation in stages through doTaskBatch. doTaskBatch provides a barrier mechanism: it returns only after every worker has finished its calculation. The contractor then merges and accumulates the voting results returned by the workers:
pagepr = pagepr + (Double) prwh.getObj(page);
This yields the new PR value of each web page (here q is taken as 1), and the iteration is repeated for 500 rounds.
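The division of labor described above (workers vote, a barrier waits for all of them, the contractor merges) can be tried out in a single process before setting up fourinone. The sketch below deliberately does not use the fourinone API: it substitutes a java.util.concurrent thread pool for the workers, and ExecutorService.invokeAll plays the role of the barrier. The class name SplitMergePageRank and its method names are assumptions made for illustration, and q is taken as 1 as in the demo.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative in-process stand-in for the worker/contractor split; not fourinone code.
public class SplitMergePageRank {
    // Worker role: split the page's current PR evenly among the pages it links to.
    static Map<String, Double> vote(String page, String[] links, Map<String, Double> pr) {
        Map<String, Double> votes = new HashMap<>();
        double share = pr.get(page) / links.length;
        for (String target : links) {
            votes.merge(target, share, Double::sum);
        }
        return votes;
    }

    // Contractor role: run every worker for one round, wait for all of them,
    // then merge and accumulate the returned votes into the new PR values.
    static Map<String, Double> round(ExecutorService pool, Map<String, String[]> graph,
                                     Map<String, Double> pr) throws Exception {
        List<Callable<Map<String, Double>>> tasks = new ArrayList<>();
        for (Map.Entry<String, String[]> e : graph.entrySet()) {
            tasks.add(() -> vote(e.getKey(), e.getValue(), pr));
        }
        Map<String, Double> merged = new TreeMap<>();
        // invokeAll returns only once every task has completed (the barrier).
        for (Future<Map<String, Double>> f : pool.invokeAll(tasks)) {
            for (Map.Entry<String, Double> v : f.get().entrySet()) {
                merged.merge(v.getKey(), v.getValue(), Double::sum);
            }
        }
        return merged;
    }

    static Map<String, Double> run(int rounds) throws Exception {
        Map<String, String[]> graph = new LinkedHashMap<>();
        graph.put("A", new String[] {"B", "C"});
        graph.put("B", new String[] {"C"});
        graph.put("C", new String[] {"A"});
        Map<String, Double> pr = new TreeMap<>();
        for (String page : graph.keySet()) {
            pr.put(page, 1.0);
        }
        ExecutorService pool = Executors.newFixedThreadPool(graph.size());
        try {
            for (int i = 0; i < rounds; i++) {
                pr = round(pool, graph, pr);
            }
        } finally {
            pool.shutdown();
        }
        return pr;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(500));
    }
}
```

Only the small vote maps travel back to the merging step, which mirrors the "large input, small output" reasoning behind Approach 2.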
Running steps:
1. Start ParkServerDemo (its IP and port are already specified in the configuration file)
java -cp fourinone.jar; ParkServerDemo
2. Run PageRankWorker for A, B, and C, passing a different IP address and port number to each
java -cp fourinone.jar; PageRankWorker localhost 2008 A
java -cp fourinone.jar; PageRankWorker localhost 2009 B
java -cp fourinone.jar; PageRankWorker localhost 2010 C
3. Run PageRankCtor
java -cp fourinone.jar; PageRankCtor
We can see that the result is the same as that of the standalone program at the beginning. Meanwhile, each worker window prints its own PR values in sequence.
The complete demo source code is as follows:
// ParkServerDemo
import com.fourinone.BeanContext;

public class ParkServerDemo {
    public static void main(String[] args) {
        BeanContext.startPark(); // start the park service the workers and contractor connect to
    }
}
// PageRankWorker
import com.fourinone.MigrantWorker;
import com.fourinone.WareHouse;
import com.fourinone.Workman;

public class PageRankWorker extends MigrantWorker {
    public String page = null;
    public String[] links = null;

    public PageRankWorker(String page, String[] links) {
        this.page = page;
        this.links = links;
    }

    // Called once per round: reads this page's PR and votes for each linked page.
    public WareHouse doTask(WareHouse inhouse) {
        double pr = (Double) inhouse.getObj(page);
        System.out.println(pr);
        WareHouse outhouse = new WareHouse();
        for (String p : links)
            outhouse.setObj(p, pr / links.length); // PR vote for each included link
        return outhouse;
    }

    // Arguments: host, port, page name (A, B, or C)
    public static void main(String[] args) {
        String[] links = null;
        if (args[2].equals("A"))
            links = new String[]{"B", "C"}; // links included on page A
        else if (args[2].equals("B"))
            links = new String[]{"C"};
        else if (args[2].equals("C"))
            links = new String[]{"A"};
        PageRankWorker mw = new PageRankWorker(args[2], links);
        mw.waitWorking(args[0], Integer.parseInt(args[1]), "pagerankworker");
    }
}
// PageRankCtor
import com.fourinone.Contractor;
import com.fourinone.WareHouse;
import com.fourinone.WorkerLocal;
import java.util.Iterator;

public class PageRankCtor extends Contractor {
    public WareHouse giveTask(WareHouse inhouse) {
        WorkerLocal[] wks = getWaitingWorkers("pagerankworker");
        System.out.println("wks.length:" + wks.length);
        for (int i = 0; i < 500; i++) { // 500 rounds
            // Barrier: returns only after every worker has finished this round
            WareHouse[] hmarr = doTaskBatch(wks, inhouse);
            WareHouse prwh = new WareHouse();
            // Merge and accumulate the votes returned by each worker
            for (WareHouse result : hmarr) {
                for (Iterator iter = result.keySet().iterator(); iter.hasNext();) {
                    String page = (String) iter.next();
                    Double pagepr = (Double) result.getObj(page);
                    if (prwh.containsKey(page))
                        pagepr = pagepr + (Double) prwh.getObj(page);
                    prwh.setObj(page, pagepr);
                }
            }
            inhouse = prwh; // the merged values are the input of the next iteration
            System.out.println("No." + i + ":" + inhouse);
        }
        return inhouse;
    }

    public static void main(String[] args) {
        PageRankCtor a = new PageRankCtor();
        WareHouse inhouse = new WareHouse();
        inhouse.setObj("A", 1.00d); // the initial PR value of A
        inhouse.setObj("B", 1.00d); // the initial PR value of B
        inhouse.setObj("C", 1.00d); // the initial PR value of C
        a.giveTask(inhouse);
        a.exit();
    }
}