The idea of the item-based recommendation algorithm in the MapReduce version of Mahout

I recently wanted to write a MapReduce version of the user-based algorithm, so I first studied Mahout's implementation of the item-based algorithm. Item-based CF looks simple, but the implementation details are somewhat involved, and the MapReduce implementation is even more so.

The essence of item-based CF:

To predict a user's rating for an item,

look at the user's ratings for the other items; the more similar another item is to the target item, the higher its weight.

Finally, take a weighted average.

Core steps of item-based CF:

1. Compute the item-item similarity matrix (a multiplication of two matrices).

2. Multiply the user rating matrix by the item similarity matrix to get the predicted user rating matrix.

Of course, the so-called matrix multiplication here is not multiplication in the strict mathematical sense. Mathematically, each entry would be the inner product of a row vector of the first matrix with a column vector of the second. Here it is often more than a plain inner product: the result may be normalized, the data may be downsampled, and so on.

Input file data format: userid,itemid,pref

user1,item1,pref

user2,item1,pref

user2,item2,pref

user3,item1,pref

user_vectors: userid, vector[(itemid,pref)]

user1, vector[(item1,pref)]

user2, vector[(item1,pref), (item2,pref)]

user3, vector[(item1,pref)]

rating_matrix: itemid, vector[(userid,pref)]

item1, vector[(user1,pref), (user2,pref), (user3,pref)]

item2, vector[(user2,pref)]
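
As an illustration only, here is a plain-Java sketch (hypothetical class and variable names, not Mahout's actual code, and made-up preference values) of how user_vectors and rating_matrix are just the same (userid, itemid, pref) triples grouped once by user and once by item:

    import java.util.*;

    public class BuildVectors {
        public static void main(String[] args) {
            // Input triples: userid, itemid, pref (the numeric prefs are placeholders).
            String[][] input = {
                {"user1", "item1", "4.0"},
                {"user2", "item1", "5.0"},
                {"user2", "item2", "3.0"},
                {"user3", "item1", "2.0"},
            };

            // user_vectors: userid -> (itemid -> pref)
            Map<String, Map<String, Double>> userVectors = new HashMap<>();
            // rating_matrix: itemid -> (userid -> pref), the same data keyed by item
            Map<String, Map<String, Double>> ratingMatrix = new HashMap<>();

            for (String[] row : input) {
                String user = row[0], item = row[1];
                double pref = Double.parseDouble(row[2]);
                userVectors.computeIfAbsent(user, k -> new HashMap<>()).put(item, pref);
                ratingMatrix.computeIfAbsent(item, k -> new HashMap<>()).put(user, pref);
            }

            System.out.println("user_vectors:  " + userVectors);
            System.out.println("rating_matrix: " + ratingMatrix);
        }
    }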

rating_matrix -> similarity_matrix

By computing the similarity between every pair of rows of rating_matrix, we obtain the item-item similarity matrix:

             item1   item2
    item1    null    sim
    item2    sim     null
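
For concreteness, here is a minimal sketch of the row-to-row similarity step, assuming cosine similarity over the user dimension (Mahout supports several similarity measures, so cosine is only one possible choice, and the class below is hypothetical):

    import java.util.*;

    public class RowSimilarity {
        // Cosine similarity between two item rows (userid -> pref maps) of rating_matrix.
        static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, normA = 0, normB = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                Double other = b.get(e.getKey());
                if (other != null) dot += e.getValue() * other; // only co-rating users contribute
                normA += e.getValue() * e.getValue();
            }
            for (double v : b.values()) normB += v * v;
            return dot == 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        // similarity_matrix: itemid -> (other itemid -> sim); the diagonal is skipped.
        static Map<String, Map<String, Double>> itemSimilarities(
                Map<String, Map<String, Double>> ratingMatrix) {
            Map<String, Map<String, Double>> sims = new HashMap<>();
            for (String i : ratingMatrix.keySet())
                for (String j : ratingMatrix.keySet())
                    if (!i.equals(j))
                        sims.computeIfAbsent(i, k -> new HashMap<>())
                            .put(j, cosine(ratingMatrix.get(i), ratingMatrix.get(j)));
            return sims;
        }
    }

Applied to the rating_matrix above, this yields the item1/item2 similarity entries used in the rest of the walkthrough.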

Mapper:

similarity_matrix -> itemid, vector[(itemid,sim)]

item1, vector[(item2,sim)]

item2, vector[(item1,sim)]

user_vectors -> itemid, (userid,pref)

item1, (user1,pref)

item1, (user2,pref)

item2, (user2,pref)

item1, (user3,pref)

(The format is the same as the input file, but the data structure it is stored in is different.)

Reducer: itemid, (vector[(itemid,sim)], (vector[userid], vector[pref]))

That is: the current item, the list of items similar to the current item (top-K only), and the list of users who have rated the current item together with their ratings.

item1, (vector[(item2,sim)], (vector[user1,user2,user3], vector[pref,pref,pref]))

item2, (vector[(item1,sim)], (vector[user2], vector[pref]))
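
Roughly speaking, this first pass re-keys both datasets by itemid on the map side and joins them on the reduce side. A plain-Java sketch of the join result (hypothetical names; the real Mahout job works with its own writable vector types):

    import java.util.*;

    public class PrefAndSimilarityJoin {
        // Per-item join result: the item's similarity vector plus the users who
        // rated the item and their prefs.
        static class VectorAndPrefs {
            Map<String, Double> similarItems = new HashMap<>(); // itemid -> sim
            List<String> userIds = new ArrayList<>();
            List<Double> prefValues = new ArrayList<>();
        }

        static Map<String, VectorAndPrefs> reduce(
                Map<String, Map<String, Double>> similarityMatrix,  // itemid -> (itemid -> sim)
                Map<String, Map<String, Double>> ratingMatrix) {    // itemid -> (userid -> pref)
            Map<String, VectorAndPrefs> joined = new HashMap<>();
            for (String item : ratingMatrix.keySet()) {
                VectorAndPrefs vp = new VectorAndPrefs();
                vp.similarItems.putAll(similarityMatrix.getOrDefault(item, Map.of()));
                for (Map.Entry<String, Double> e : ratingMatrix.get(item).entrySet()) {
                    vp.userIds.add(e.getKey());
                    vp.prefValues.add(e.getValue());
                }
                joined.put(item, vp);
            }
            return joined;
        }
    }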

Mapper: userid, (pref(cur_item), vector[(itemid,sim)])

This means: the user's rating for cur_item is pref, and the vector holds the items similar to cur_item together with their similarities.

user1, (pref(item1), vector[(item2,sim)])

user2, (pref(item1), vector[(item2,sim)])

user3, (pref(item1), vector[(item2,sim)])

user2, (pref(item2), vector[(item1,sim)])

For example, the first line means: to predict user1's rating for an item user1 has not yet rated (here item2), and since item1 is similar to it, user1's rating of item1 is taken into account.
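
Continuing the sketch above (again with hypothetical names), this map step flips the join result from item-keyed to user-keyed, emitting one (pref, similar-items vector) pair per rating:

    import java.util.*;

    public class PartialPrediction {
        // One value per (user, rated item): the user's pref for that item
        // plus the list of items similar to it.
        static class PrefAndSimilarities {
            final double pref;
            final Map<String, Double> similarItems; // itemid -> sim
            PrefAndSimilarities(double pref, Map<String, Double> similarItems) {
                this.pref = pref;
                this.similarItems = similarItems;
            }
        }

        // Map side: userid -> list of (pref(cur_item), items similar to cur_item).
        static Map<String, List<PrefAndSimilarities>> map(
                Map<String, PrefAndSimilarityJoin.VectorAndPrefs> joined) {
            Map<String, List<PrefAndSimilarities>> byUser = new HashMap<>();
            for (PrefAndSimilarityJoin.VectorAndPrefs vp : joined.values())
                for (int k = 0; k < vp.userIds.size(); k++)
                    byUser.computeIfAbsent(vp.userIds.get(k), u -> new ArrayList<>())
                          .add(new PrefAndSimilarities(vp.prefValues.get(k), vp.similarItems));
            return byUser;
        }
    }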

Reducer: userid, itemid, pref

With the mapper above, all data for the same user falls into the same reducer,

so for each user we get the items that user has rated, together with each item's similarity to the other items.

In table form (one such table per user; the rows are the items involved, the columns are items, a cell holds the similarity between the row item and the column item, and "unknownpref" marks an item the user has not rated):

    userid          item1   item2
    item1, pref     null    sim
    item2, pref     sim     null

    user1                item1   item2
    item1, pref          null    sim
    item2, unknownpref   sim     null

    user2           item1   item2
    item1, pref     null    sim
    item2, pref     sim     null

    user3                item1   item2
    item1, pref          null    sim
    item2, unknownpref   sim     null

Based on this per-user (item x item) matrix, you can predict the user's rating for any item they have not yet rated.

P(u,n) = sum_i( pref(u,i) * sim(n,i) ) / sum_i( sim(n,i) )

That is, to predict user u's preference for item n:

take the items i that u has already rated,

u's preferences for those items i,

and the similarity between n and each i,

then compute the weighted average.
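
A direct transcription of this formula, continuing the sketches above (hypothetical names):

    import java.util.*;

    public class Predict {
        // P(u,n) = sum_i( pref(u,i) * sim(n,i) ) / sum_i( sim(n,i) ),
        // where i runs over the items user u has already rated.
        static double predict(String targetItem,
                              List<PartialPrediction.PrefAndSimilarities> ratedByUser) {
            double numerator = 0, denominator = 0;
            for (PartialPrediction.PrefAndSimilarities ps : ratedByUser) {
                Double sim = ps.similarItems.get(targetItem);
                if (sim == null) continue; // target item not among this item's similar items
                numerator += ps.pref * sim;
                denominator += sim;
            }
            return denominator == 0 ? 0 : numerator / denominator;
        }
    }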

It is worth mentioning that, for performance, Mahout does not actually perform the complete matrix multiplication.

For example, for each item only the top-K most similar items are kept; the rest are discarded (their similarity is too small to matter anyway).

Therefore, for a given user, the set of items to predict is not the full item set minus the items the user has already rated. Instead, it is the union of the top-K items most similar to the items the user has rated. For an item outside this set, its similarity to every item the current user has rated is so small that we can directly conclude the user is not interested; its predicted score is 0, so there is no need to compute it.
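
A sketch of that kind of pruning, assuming a fixed K per item (hypothetical code; how Mahout actually picks the cutoff is discussed next):

    import java.util.*;

    public class TopKPruning {
        // Keep only the K most similar items for each item; everything else is
        // treated as "similarity too small to matter" and dropped.
        static Map<String, Double> topK(Map<String, Double> similarItems, int k) {
            List<Map.Entry<String, Double>> entries = new ArrayList<>(similarItems.entrySet());
            entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
            Map<String, Double> pruned = new LinkedHashMap<>();
            for (Map.Entry<String, Double> e : entries.subList(0, Math.min(k, entries.size())))
                pruned.put(e.getKey(), e.getValue());
            return pruned;
        }
    }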

As for how the top-K most similar items are selected, I have not looked into it carefully; K might be a fixed constant, or there might be a similarity threshold below which items are dropped. Of the two, the latter seems more reliable, and it is also symmetric.



Original article: http://blog.csdn.net/lingerlanlan/article/details/42656161 (by linger)
