The idea behind the item-based recommendation algorithm in Mahout's MapReduce implementation
I recently wanted to write a MapReduce version of the user-based algorithm, so I first studied Mahout's implementation of the item-based algorithm. Item-based looks simple, but the implementation details are fairly involved, and the MapReduce implementation even more so.
The essence of item-based:
To predict a user's rating for an item,
look at the user's ratings for other items; the more similar another item is to the target item, the higher its weight.
Finally, take the weighted average.
Core steps of item-based:
1. Compute the item similarity matrix (a product of two matrices).
2. Multiply the user rating matrix by the item similarity matrix to get the user rating prediction matrix.
Of course, the so-called matrix multiplication here is not multiplication in the strict mathematical sense, where each entry is the inner product of a row vector of the first matrix and a column vector of the second. Here it is often more than an inner product: there may be a normalization step, a downsampling step, and so on.
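As a baseline, the pure mathematical version of step 2 can be sketched as a plain matrix product. This is a toy sketch with made-up values, not Mahout code; the real job adds normalization and downsampling as noted above.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both as nested lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

# Rows: users, columns: items. 0.0 means "not rated". Values are illustrative.
ratings = [[5.0, 0.0],
           [4.0, 3.0],
           [2.0, 0.0]]

# Item-item similarity matrix (symmetric, 1.0 on the diagonal).
similarity = [[1.0, 0.8],
              [0.8, 1.0]]

# User rating matrix x item similarity matrix = raw prediction scores.
predictions = matmul(ratings, similarity)
for row in predictions:
    print([round(x, 2) for x in row])
```

Note how user1's unrated item2 already picks up a score (5.0 * 0.8) from the rated, similar item1; the real algorithm then divides by the similarity sum to turn this into a weighted average.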
Input file data format: userid,itemid,pref
user1,item1,pref
user2,item1,pref
user2,item2,pref
user3,item1,pref
user_vectors: userid,vector[(itemid,pref)]
user1,vector[(item1,pref)]
user2,vector[(item1,pref),(item2,pref)]
user3,vector[(item1,pref)]
rating_matrix: itemid,vector[(userid,pref)]
item1,vector[(user1,pref),(user2,pref),(user3,pref)]
item2,vector[(user2,pref)]
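A minimal sketch (not Mahout code; the preference values are made up) of how the input triples above are grouped into user_vectors, and re-grouped by item into rating_matrix, which is effectively the transpose:

```python
from collections import defaultdict

# Input triples: (userid, itemid, pref), matching the format above.
triples = [("user1", "item1", 5.0),
           ("user2", "item1", 4.0),
           ("user2", "item2", 3.0),
           ("user3", "item1", 2.0)]

user_vectors = defaultdict(list)   # userid -> [(itemid, pref)]
rating_matrix = defaultdict(list)  # itemid -> [(userid, pref)]
for user, item, pref in triples:
    user_vectors[user].append((item, pref))
    rating_matrix[item].append((user, pref))

print(dict(user_vectors))
print(dict(rating_matrix))
```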
rating_matrix -> similarity_matrix
The item similarity matrix is obtained by computing the similarity between the rows of rating_matrix.
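As a sketch, the row-by-row similarity could look like this, using cosine similarity on sparse rows. Mahout supports several similarity measures; the measure and the values here are illustrative, not taken from the real job.

```python
import math

# rating_matrix rows as sparse vectors: itemid -> {userid: pref}.
rating_matrix = {"item1": {"user1": 5.0, "user2": 4.0, "user3": 2.0},
                 "item2": {"user2": 3.0}}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Pairwise similarity between every pair of distinct item rows.
items = list(rating_matrix)
similarity = {i: {j: cosine(rating_matrix[i], rating_matrix[j])
                  for j in items if j != i}
              for i in items}
print(similarity)
```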
Mapper:
similarity_matrix -> itemid,vector[(itemid,sim)]
item1,vector[(item2,sim)]
item2,vector[(item1,sim)]
user_vectors -> itemid,(userid,pref)
item1,(user1,pref)
item1,(user2,pref)
item2,(user2,pref)
item1,(user3,pref)
(The format is the same as the input file, but the data structure is stored differently.)
Reducer: itemid,(vector[(itemid,sim)],(vector[userid],vector[pref]))
For the current item: the list of items similar to it (take the top K), the list of users who have rated it, and their ratings.
item1,(vector[(item2,sim)],(vector[user1,user2,user3],vector[pref,pref,pref]))
item2,(vector[(item1,sim)],(vector[user2],vector[pref]))
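The mapper/reducer join above can be simulated in memory like this. This is a toy sketch, not Mahout's actual classes; the "SIM"/"PREF" tags are my own labels for telling the two input kinds apart, and the values are made up.

```python
from collections import defaultdict

similarity = {"item1": [("item2", 0.6)], "item2": [("item1", 0.6)]}
user_vectors = {"user1": [("item1", 5.0)],
                "user2": [("item1", 4.0), ("item2", 3.0)],
                "user3": [("item1", 2.0)]}

# Mapper: re-key both inputs by itemid, tagging each record with its origin.
emitted = []
for item, sims in similarity.items():
    emitted.append((item, ("SIM", sims)))
for user, prefs in user_vectors.items():
    for item, pref in prefs:
        emitted.append((item, ("PREF", (user, pref))))

# Shuffle + reducer: records with the same itemid are grouped together,
# yielding the (similar items, rating users + prefs) pair per item.
grouped = defaultdict(lambda: {"sims": [], "prefs": []})
for item, (tag, payload) in emitted:
    if tag == "SIM":
        grouped[item]["sims"] = payload
    else:
        grouped[item]["prefs"].append(payload)

print(dict(grouped))
```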
Mapper: userid,(pref(cur_item),vector[(itemid,sim)])
This means the user's rating for cur_item is pref, and vector holds the items similar to cur_item together with their similarities.
user1,(pref(item1),vector[(item2,sim)])
user2,(pref(item1),vector[(item2,sim)])
user3,(pref(item1),vector[(item2,sim)])
user2,(pref(item2),vector[(item1,sim)])
For example, the first line means: to predict user1's rating for the unrated item2, since item1 is similar to item2, take user1's rating for item1 into account.
Reducer: userid,itemid,pref
With the mapper above, all of one user's data lands in the same reducer,
so you get the items the user has rated and each item's similarity to the other items.
userid | item1 | item2
item1,pref | null | sim
item2,pref | sim | null

user1 | item1 | item2
item1,pref | null | sim
item2,unknownpref | sim | null

user2 | item1 | item2
item1,pref | null | sim
item2,pref | sim | null

user3 | item1 | item2
item1,pref | null | sim
item2,unknownpref | sim | null
Based on this per-user (item x item) matrix, you can predict the user's rating for any item the user has not yet rated.
pred(u,n) = sum_i(pref(u,i) * sim(n,i)) / sum_i(sim(n,i))
To predict user u's preference for item n:
take the items i that u has already rated,
u's preferences for those items,
and the similarity between n and each i,
then compute the weighted average.
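The formula can be checked numerically with a small sketch (toy preferences and similarities, not Mahout output):

```python
def predict(prefs, sims):
    """Weighted-average prediction.
    prefs: {itemid: pref(u, i)} for items i the user has rated.
    sims:  {itemid: sim(n, i)} for the target item n."""
    num = sum(p * sims[i] for i, p in prefs.items() if i in sims)
    den = sum(sims[i] for i in prefs if i in sims)
    return num / den if den else 0.0

# With a single rated item, the weighted average collapses to that rating.
print(predict({"item1": 5.0}, {"item1": 0.6}))

# With two rated items: (5.0*0.6 + 1.0*0.2) / (0.6 + 0.2) = 4.0.
print(predict({"item1": 5.0, "item3": 1.0}, {"item1": 0.6, "item3": 0.2}))
```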
It is worth mentioning that, for performance, Mahout does not actually do a complete matrix multiplication.
For example, for item similarity it keeps only the top K entries per item and discards the rest (whose similarities are too small to matter anyway).
Therefore, for a given user, the set of items to predict is not the full item set minus the user's rated items. Instead, it is the union of the top-K most similar items of the items the user has rated. An item outside this set has such a small similarity to every item the current user has rated that we can simply say the user is not interested: its predicted score is 0, so no computation is needed.
As for how the top K most similar items are chosen, I have not looked into it carefully; K may be a fixed constant, or there may be a threshold below which similarities are dropped. Of the two, the latter is more reliable and symmetrical.
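The constant-K variant of this downsampling can be sketched as follows; a threshold variant would filter by value instead. K and the similarity values here are illustrative, not Mahout's defaults.

```python
import heapq

K = 2  # illustrative; Mahout exposes this as a job parameter
sims = {"item1": [("item2", 0.9), ("item3", 0.1), ("item4", 0.5)]}

# Keep only the K most similar neighbours per item row.
topk = {item: heapq.nlargest(K, neighbours, key=lambda t: t[1])
        for item, neighbours in sims.items()}

# Threshold variant: keep neighbours whose similarity exceeds a cutoff.
threshold = 0.3
above = {item: [n for n in neighbours if n[1] >= threshold]
         for item, neighbours in sims.items()}

print(topk)
print(above)
```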
Original article: http://blog.csdn.net/lingerlanlan/article/details/42656161 (author: linger)