1: Overview
2: Introduction to the principle
3: Code implementation
4: Problem description
One: Overview
Graph-based models are an important part of recommender systems. In fact, many researchers also classify neighborhood-based models as graph-based models, because a neighborhood-based model can be viewed as a simple form of graph-based model.
Before studying graph-based models, the user behavior data must first be expressed as a graph. In the discussion below, the user behavior data is a set of two-tuples, where each tuple (u, i) means that user u has acted on item i. Such data is easily represented as a bipartite graph.
Let G(V, E) denote the user-item bipartite graph, where V = V_U ∪ V_I consists of the user vertex set V_U and the item vertex set V_I. For each tuple (u, i) in the dataset there is a corresponding edge e(v_u, v_i) in the graph, where v_u is the vertex for user u and v_i is the vertex for item i. Figure 2-18 is a simple user-item bipartite graph in which circular nodes represent users, square nodes represent items, and an edge between a circular node and a square node represents a user's behavior on an item. For example, user node A is connected to item nodes a, b, and d, indicating that user A has acted on items a, b, and d.
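As a minimal sketch of this representation, the (u, i) tuples can be stored as a dict of dicts in which every edge appears once from each side. The function name `build_bipartite_graph` and the sample `records` are made up for illustration, and the sketch assumes that user labels and item labels never collide (here users are uppercase and items are lowercase).

```python
from collections import defaultdict


def build_bipartite_graph(records):
    """Turn (user, item) pairs into an undirected bipartite graph.

    graph[x][y] = 1 means that x and y are joined by an edge.
    Users and items share one key space, so their labels must differ.
    """
    graph = defaultdict(dict)
    for user, item in records:
        graph[user][item] = 1  # the edge as seen from the user vertex
        graph[item][user] = 1  # the same edge as seen from the item vertex
    return dict(graph)


# Hypothetical behavior log, not taken from any real dataset.
records = [("A", "a"), ("A", "b"), ("A", "d"), ("B", "a"), ("B", "c")]
graph = build_bipartite_graph(records)

print(sorted(graph["A"]))  # items that user A acted on
print(sorted(graph["a"]))  # users who acted on item a
```

Storing both directions makes it cheap to look up the neighbors of any vertex, which is what a walk over the graph needs at every step.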
Two: Introduction to the principle
Once the user behavior data is represented as a bipartite graph, the next step is to make recommendations on that graph. Recommending items to user u then becomes the task of measuring the relevance between the user vertex v_u and the item vertices that are not directly connected to v_u: the more relevant an item is, the higher its weight and the earlier it appears in the recommendation list.
So how do we measure the relevance of two vertices? It generally depends on three factors:
1: The number of paths between the two vertices
2: The length of the paths between the two vertices
3: The vertices that the paths between the two vertices pass through
A pair of highly relevant vertices typically has the following characteristics:
1: Many paths connect the two vertices
2: The paths connecting the two vertices are short
3: The paths connecting the two vertices do not pass through vertices with a large degree
As a simple example, in Figure 2-19 user A has no edge to item c or item e, but user A is connected to item c by one path of length 3 and to item e by two paths of length 3. The relevance between vertices A and e is therefore higher than that between A and c, so item e should rank before item c in user A's recommendation list. The two paths between A and e are (A, b, C, e) and (A, d, D, e). The vertices on (A, b, C, e) have degrees (3, 2, 2, 2), while the vertices on (A, d, D, e) have degrees (3, 2, 3, 2). Because (A, d, D, e) passes through the higher-degree vertex D, it contributes less to the relevance between A and e than (A, b, C, e) does.
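To make the three factors concrete, here is a small sketch that lists the paths of a given length between two vertices and the degrees of the vertices along each path. It does not reproduce Figure 2-19, whose full edge list is not given in the text. Instead it uses the example graph from the code section of this article, and the helper names `simple_paths` and `path_degrees` are hypothetical.

```python
# Example graph from the code section of this article:
# uppercase keys are users, lowercase keys are items, all edge weights are 1.
G = {'A': {'a': 1, 'c': 1},
     'B': {'a': 1, 'b': 1, 'c': 1, 'd': 1},
     'C': {'c': 1, 'd': 1},
     'a': {'A': 1, 'B': 1},
     'b': {'B': 1},
     'c': {'A': 1, 'B': 1, 'C': 1},
     'd': {'B': 1, 'C': 1}}


def simple_paths(graph, start, goal, length):
    """Return all paths from start to goal with exactly `length` edges
    that never visit the same vertex twice."""
    found = []

    def extend(path):
        node = path[-1]
        if len(path) - 1 == length:
            if node == goal:
                found.append(tuple(path))
            return
        for nxt in graph[node]:
            if nxt not in path:
                extend(path + [nxt])

    extend([start])
    return sorted(found)


def path_degrees(graph, path):
    """Degree of every vertex on the path, in path order."""
    return tuple(len(graph[v]) for v in path)


# User A is not connected to items b and d, so compare those two candidates.
for item in ('b', 'd'):
    paths = simple_paths(G, 'A', item, 3)
    print(item, len(paths), [(p, path_degrees(G, p)) for p in paths])
```

On this graph item d is reachable from A by more length-3 paths than item b, so by the first factor d should be the more relevant of the two.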
The PersonalRank algorithm, which is based on random walks and is similar to the PageRank algorithm, is described below.
Suppose we want to make personalized recommendations for user u. We start a random walk on the user-item bipartite graph from the user's vertex v_u. On arriving at any vertex, we first decide with probability alpha whether to continue the walk; otherwise we stop and start again from v_u. If we continue, the next vertex is chosen uniformly at random from the vertices that the current vertex points to. After many such random walks, the probability of visiting each item vertex converges to a fixed value. The weight of an item in the final recommendation list is the visit probability of its vertex.
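The walk described above can also be simulated directly. The following is only an illustrative sketch: the function name `simulate_walk`, the step count, and the fixed seed are assumptions made here, and the graph is the example graph from the code section of this article. It estimates the visit probabilities by counting how often each vertex is visited during one long walk with restarts.

```python
import random

# Example graph from the code section of this article.
G = {'A': {'a': 1, 'c': 1},
     'B': {'a': 1, 'b': 1, 'c': 1, 'd': 1},
     'C': {'c': 1, 'd': 1},
     'a': {'A': 1, 'B': 1},
     'b': {'B': 1},
     'c': {'A': 1, 'B': 1, 'C': 1},
     'd': {'B': 1, 'C': 1}}


def simulate_walk(graph, root, alpha, steps, seed=0):
    """Estimate visit probabilities with one long random walk.

    At every step the walk continues to a uniformly chosen neighbor
    with probability alpha, and otherwise restarts from root.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    neighbors = {v: sorted(graph[v]) for v in graph}
    visits = {v: 0 for v in graph}
    node = root
    for _ in range(steps):
        visits[node] += 1
        if rng.random() < alpha:
            node = rng.choice(neighbors[node])  # continue the walk
        else:
            node = root  # stop and start again from the root
    return {v: count / steps for v, count in visits.items()}


freq = simulate_walk(G, 'A', 0.85, 200000)
for vertex, p in sorted(freq.items(), key=lambda kv: kv[1], reverse=True):
    print("%s: %.3f" % (vertex, p))
```

A simulation like this converges slowly and its estimates are noisy, which is one reason the iterative computation in the next section is the more practical choice.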
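Expressed as a formula, the description above becomes:

$$
PR(v) =
\begin{cases}
\alpha \sum_{v' \in in(v)} \dfrac{PR(v')}{|out(v')|} & (v \neq v_u) \\
(1 - \alpha) + \alpha \sum_{v' \in in(v)} \dfrac{PR(v')}{|out(v')|} & (v = v_u)
\end{cases}
$$

Here alpha is the probability of continuing the random walk, PR(v') is the probability of visiting vertex v', in(v) is the set of vertices that point to v, and out(v') is the set of vertices that v' points to.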
Three: Code implementation
```python
# -*- coding: utf-8 -*-
"""
Created on June 16, 2016
@author: gamer Think
"""


def personalrank(G, alpha, root, max_step):
    """
    G: bipartite graph
    alpha: probability of continuing the random walk
    root: start node of the walk
    max_step: number of iterations
    """
    rank = {x: 0 for x in G.keys()}
    rank[root] = 1
    # Start iterating
    for k in range(max_step):
        tmp = {x: 0 for x in G.keys()}
        # i is a vertex; ri maps each vertex adjacent to i to the edge weight
        for i, ri in G.items():
            # j is a vertex adjacent to i and wij is the weight of edge (i, j).
            # Every weight is 1 here, so wij plays no practical role.
            for j, wij in ri.items():
                # i is the head of one of j's incoming edges. The two nested
                # loops visit every incoming edge of every vertex, and one
                # full pass corresponds to one step of the walk.
                tmp[j] += alpha * rank[i] / (1.0 * len(ri))
        # Every walk restarts from root, so root receives an extra (1 - alpha).
        # In "Recommender System Practice" the author puts this statement
        # inside the `for j, wij in ri.items()` loop, which I think is wrong.
        tmp[root] += (1 - alpha)
        rank = tmp
        # Print the weight of every node after each iteration
        print("iter: " + str(k) + "\t", end="")
        for key, value in rank.items():
            print("%s:%.3f, \t" % (key, value), end="")
        print()
    return rank


if __name__ == '__main__':
    # G is the bipartite graph. Each key is a node, and the dictionary behind
    # it maps every connected vertex to the weight of the connecting edge.
    G = {'A': {'a': 1, 'c': 1},
         'B': {'a': 1, 'b': 1, 'c': 1, 'd': 1},
         'C': {'c': 1, 'd': 1},
         'a': {'A': 1, 'B': 1},
         'b': {'B': 1},
         'c': {'A': 1, 'B': 1, 'C': 1},
         'd': {'B': 1, 'C': 1}}
    personalrank(G, 0.85, 'A', 100)
```

Running the script prints the weight of every node after each iteration.
Result Description:
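Ranked by relevance to user A, the vertices are A (0.269), c (0.190), B (0.185), a (0.154), C (0.086), d (0.076), b (0.039). After removing A itself and the items a and c that A is already connected to, the remaining vertices in order are B, C, d, b. Of these, B and C are users, so the items recommended to A are d and then b.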
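The filtering step can be sketched as follows. The `rank` values are the ones reported above, while the function name `recommend`, its `keep` parameter, and the convention that item labels are lowercase are assumptions made for this illustration.

```python
# Example graph from the code section of this article.
G = {'A': {'a': 1, 'c': 1},
     'B': {'a': 1, 'b': 1, 'c': 1, 'd': 1},
     'C': {'c': 1, 'd': 1},
     'a': {'A': 1, 'B': 1},
     'b': {'B': 1},
     'c': {'A': 1, 'B': 1, 'C': 1},
     'd': {'B': 1, 'C': 1}}

# PersonalRank weights for root A, as reported in the result description.
rank = {'A': 0.269, 'c': 0.190, 'B': 0.185, 'a': 0.154,
        'C': 0.086, 'd': 0.076, 'b': 0.039}


def recommend(graph, rank, user, keep=None, top_n=10):
    """Return (vertex, weight) pairs sorted by weight, best first.

    The user and every vertex already connected to the user are dropped.
    `keep` is an optional predicate that selects which vertices count as
    items; this sketch assumes that item labels are lowercase.
    """
    seen = set(graph[user]) | {user}
    candidates = [(v, w) for v, w in rank.items()
                  if v not in seen and (keep is None or keep(v))]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_n]


print(recommend(G, rank, 'A'))                    # all remaining vertices
print(recommend(G, rank, 'A', keep=str.islower))  # item vertices only
```

In a real system the user and item vertices would normally carry an explicit type rather than relying on letter case.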
Four: Problem description
Although the PersonalRank algorithm has a neat explanation in terms of random walks, it has an obvious weakness in time complexity. To produce recommendations for a single user, it has to iterate over the entire user-item bipartite graph until the PR value of every vertex converges. This is so expensive that it not only rules out real-time online recommendation but also makes generating recommendations offline time-consuming.
There are two ways to reduce the high time complexity caused by iterating over the full graph for every user. The first is the obvious one: reduce the number of iterations and stop before convergence. This affects the final accuracy, but the impact is generally not large. The second is to redesign the algorithm using matrix theory.
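One possible way to stop early is to end the loop as soon as no weight changes by more than a tolerance. This is a sketch under that assumption: the name `personalrank_until`, the `tol` parameter, and the chosen tolerances are not from the original code, and the update rule simply mirrors the iteration in the code section.

```python
# Example graph from the code section of this article.
G = {'A': {'a': 1, 'c': 1},
     'B': {'a': 1, 'b': 1, 'c': 1, 'd': 1},
     'C': {'c': 1, 'd': 1},
     'a': {'A': 1, 'B': 1},
     'b': {'B': 1},
     'c': {'A': 1, 'B': 1, 'C': 1},
     'd': {'B': 1, 'C': 1}}


def personalrank_until(graph, alpha, root, max_step=100, tol=1e-4):
    """Iterate PersonalRank until the largest change in any weight is
    below tol, or until max_step iterations have run.

    Returns the weights and the number of iterations actually used.
    """
    rank = {v: 0.0 for v in graph}
    rank[root] = 1.0
    for step in range(1, max_step + 1):
        tmp = {v: 0.0 for v in graph}
        for i, neighbors in graph.items():
            share = alpha * rank[i] / len(neighbors)
            for j in neighbors:
                tmp[j] += share
        tmp[root] += 1 - alpha  # restart mass always goes back to root
        change = max(abs(tmp[v] - rank[v]) for v in graph)
        rank = tmp
        if change < tol:
            return rank, step
    return rank, max_step


rank_loose, steps_loose = personalrank_until(G, 0.85, 'A', tol=1e-2)
rank_tight, steps_tight = personalrank_until(G, 0.85, 'A', max_step=1000, tol=1e-8)
print(steps_loose, steps_tight)
```

A looser tolerance trades accuracy for fewer passes over the graph, which is the trade-off the first solution accepts.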
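Readers who are familiar with matrix operations can easily convert PersonalRank into matrix form. Let M be the transition probability matrix of the user-item bipartite graph, i.e.:

$$
M(v, v') =
\begin{cases}
\dfrac{1}{|out(v)|} & \text{if } v' \in out(v) \\
0 & \text{otherwise}
\end{cases}
$$

Then the iterative formula can be rewritten as:

$$
r = (1 - \alpha) r_0 + \alpha M^T r
$$

where r is the vector of PR values and r_0 is the vector that is 1 at the root vertex v_u and 0 elsewhere.
A reader who is slightly familiar with matrix theory can solve the above equation and get:

$$
r = (1 - \alpha)(I - \alpha M^T)^{-1} r_0
$$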
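As a rough check of the matrix form, the sketch below assumes the solution r = (1 - alpha)(I - alpha M^T)^{-1} r_0, where M(v, v') = 1/|out(v)| when v' is in out(v), and r_0 is 1 at the root and 0 elsewhere. Rather than forming the inverse, it solves the linear system (I - alpha M^T) r = (1 - alpha) r_0 with plain Gaussian elimination. The helper names `solve_linear` and `personalrank_closed_form` are hypothetical, and a real implementation would use a sparse linear algebra library instead of dense Python lists.

```python
# Example graph from the code section of this article.
G = {'A': {'a': 1, 'c': 1},
     'B': {'a': 1, 'b': 1, 'c': 1, 'd': 1},
     'C': {'c': 1, 'd': 1},
     'a': {'A': 1, 'B': 1},
     'b': {'B': 1},
     'c': {'A': 1, 'B': 1, 'C': 1},
     'd': {'B': 1, 'C': 1}}


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs stay untouched.
    m = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def personalrank_closed_form(graph, alpha, root):
    """Solve (I - alpha * M^T) r = (1 - alpha) * r0 for r."""
    nodes = sorted(graph)
    index = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    # Start from the identity matrix, then subtract alpha * M^T.
    a = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for v in nodes:
        for w in graph[v]:
            # M^T[w][v] = M[v][w] = 1 / |out(v)|
            a[index[w]][index[v]] -= alpha / len(graph[v])
    b = [0.0] * n
    b[index[root]] = 1 - alpha
    return dict(zip(nodes, solve_linear(a, b)))


rank = personalrank_closed_form(G, 0.85, 'A')
for vertex, weight in sorted(rank.items(), key=lambda kv: kv[1], reverse=True):
    print("%s: %.3f" % (vertex, weight))
```

Because the matrix I - alpha M^T does not depend on the root, only the right-hand side changes from one user to the next.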