Mahout Collaborative Filtering: Item-Based RecommenderJob Source Analysis


From: http://blog.csdn.net/heyutao007/article/details/8612906

Mahout provides two kinds of MapReduce jobs that implement item-based collaborative filtering:
I. ItemSimilarityJob
II. RecommenderJob

Below we analyze RecommenderJob; the version is mahout-distribution-0.7.

Source package location: org.apache.mahout.cf.taste.hadoop.item.RecommenderJob


The first few stages of RecommenderJob are the same as ItemSimilarityJob, but ItemSimilarityJob stops once it has computed the item similarity matrix, while RecommenderJob goes on to use that matrix to compute, for each user, the top N items that should be recommended to them. RecommenderJob's input is also in userID, itemID[, preferenceValue] format. RecommenderJob consists mainly of the following series of jobs:


1 PreparePreferenceMatrixJob (same as in ItemSimilarityJob)
Input: (userID, itemID, pref)
1.1 itemIDIndex: converts the long itemID into an int index
1.2 toUserVectors: converts the input (userID, itemID, pref) records into user vectors user_vectors (userID, VectorWritable<itemID, pref>)
1.3 toItemVectors: uses user_vectors to build the item vectors rating_matrix (itemID, VectorWritable<userID, pref>)
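As a rough illustration of steps 1.2 and 1.3, here is a plain-Java sketch with nested maps standing in for the Hadoop writables (the PreferenceMatrix class and its method names are hypothetical, not Mahout code):

```java
import java.util.*;

// Sketch: group (userID, itemID, pref) triples into user vectors,
// then transpose them into item vectors, as steps 1.2 and 1.3 do.
public class PreferenceMatrix {
    // userID -> (itemID -> pref)
    public static Map<Long, Map<Integer, Float>> toUserVectors(long[][] ids, float[] prefs) {
        Map<Long, Map<Integer, Float>> users = new HashMap<>();
        for (int i = 0; i < prefs.length; i++) {
            users.computeIfAbsent(ids[i][0], u -> new HashMap<>())
                 .put((int) ids[i][1], prefs[i]);  // itemID already index-mapped to int (step 1.1)
        }
        return users;
    }

    // itemID -> (userID -> pref): the transpose of the user vectors
    public static Map<Integer, Map<Long, Float>> toItemVectors(Map<Long, Map<Integer, Float>> users) {
        Map<Integer, Map<Long, Float>> items = new HashMap<>();
        for (Map.Entry<Long, Map<Integer, Float>> u : users.entrySet()) {
            for (Map.Entry<Integer, Float> e : u.getValue().entrySet()) {
                items.computeIfAbsent(e.getKey(), it -> new HashMap<>())
                     .put(u.getKey(), e.getValue());
            }
        }
        return items;
    }
}
```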


2 RowSimilarityJob (same as in ItemSimilarityJob)
2.1 normsAndTranspose

Computes the norm of each item vector and transposes the matrix into user vectors.
Input: rating_matrix
(1) Uses similarity.normalize to process each item vector, and similarity.norm to compute each item's norm, which is written to HDFS;
(2) Transposes the item vectors, i.e. input: item -> (user, pref), output: user -> (item, pref). The purpose of this step is to find items liked by the same user: only when two items are liked by the same user do they co-occur, and only then is it necessary to compute their similarity below.
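What similarity.norm computes in (1) depends on the configured measure; for a cosine-style measure it is the Euclidean norm, which can be sketched as follows (Norms is a hypothetical helper, for illustration only):

```java
// Sketch: the Euclidean (L2) norm of a dense item vector,
// the kind of per-item value step (1) writes to HDFS.
public class Norms {
    public static double norm(double[] itemVector) {
        double sum = 0.0;
        for (double v : itemVector) {
            sum += v * v;  // accumulate squared prefs
        }
        return Math.sqrt(sum);
    }
}
```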


2.2 pairwiseSimilarity
Computes the similarity between pairs of items.
Input: the user vectors user -> (item, pref) produced by 2.1 (2)
Map: CooccurrencesMapper
Uses a double loop to pair up the items within each user vector: item m becomes the key, and the value is a vector holding, for every item n that follows item m, the value computed by similarity.aggregate on the two prefs.
Reduce: SimilarityReducer
(1) Sums the aggregate values of the same item pair across different users, yielding item m -> ((item m+1, aggregate m+1), (item m+2, aggregate m+2), (item m+3, aggregate m+3), ...)
(2) Then computes the similarity between item m and every item after it. The computation uses similarity.similarity: the first parameter is the summed aggregate value of the two items, and the last two parameters are the norms of the two items, already produced by the previous job. The result has item m as key and, as value, the vector of similarities between item m and every later item n, i.e. item m -> ((item m+1, simi m+1), (item m+2, simi m+2), (item m+3, simi m+3), ...)
What we actually have at this point is one triangular half of the similarity matrix.
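The mapper's double loop and the reducer's summation can be sketched together as one in-memory pass (PairwiseAggregate is a hypothetical helper; the aggregate shown is the dot-product contribution prefM * prefN used by cosine-style measures, not necessarily the measure configured in your job):

```java
import java.util.*;

// Sketch: for each user vector, pair items m < n and sum the
// aggregate prefM * prefN for each pair across all users.
public class PairwiseAggregate {
    // key "m,n" (m < n) -> summed aggregate over all users
    public static Map<String, Double> aggregate(double[][] userVectors) {
        Map<String, Double> sums = new HashMap<>();
        for (double[] user : userVectors) {
            for (int m = 0; m < user.length; m++) {
                if (user[m] == 0) continue;          // item m not rated by this user
                for (int n = m + 1; n < user.length; n++) {
                    if (user[n] == 0) continue;      // item n not rated by this user
                    sums.merge(m + "," + n, user[m] * user[n], Double::sum);
                }
            }
        }
        return sums;
    }
}
```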


2.3 asMatrix
Constructs the complete similarity matrix (so far only a triangular half has been obtained).
Input: the output of 2.2 reduce (2): item m as key, the vector of similarities between item m and every later item n as value
Map: UnsymmetrifyMapper
(1) Inverts the records: from item m -> (item m+1, simi m+1), it emits item m+1 -> (item m, simi m+1)
(2) Uses a priority queue to find item m's top maxSimilaritiesPerRow (a configurable parameter) most similar items; for example, when maxSimilaritiesPerRow = 2, it might output
item m -> ((item m+1, simi m+1), (item m+3, simi m+3))
Reduce: MergeToTopKSimilaritiesReducer
(1) Merges the two kinds of vectors above for the same item m, thus forming a complete row of the similarity matrix: item m -> ((item 1, simi 1), (item 2, simi 2), ..., (item n, simi n))
(2) Uses Vectors.topKElements to find each item's top maxSimilaritiesPerRow similar items. The top-N selection in map step (2) is evidently a pre-optimization of this step.
The final output is item m -> ((item a, simi a), (item b, simi b), ..., (item n, simi n)), where the number of items from a to n is maxSimilaritiesPerRow.
RowSimilarityJob ends here. From here on, RecommenderJob differs from ItemSimilarityJob.
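The priority-queue top-K selection used in both the map and reduce steps above can be sketched as follows (TopK is a hypothetical helper, not Mahout code; each pair is {itemID, similarity}):

```java
import java.util.*;

// Sketch: keep only the k most similar items per row using a min-heap,
// as UnsymmetrifyMapper / MergeToTopKSimilaritiesReducer do with
// maxSimilaritiesPerRow.
public class TopK {
    public static List<double[]> topK(double[][] itemSimPairs, int k) {
        // min-heap on similarity: the weakest kept entry is evicted first
        PriorityQueue<double[]> heap =
            new PriorityQueue<>(Comparator.comparingDouble((double[] p) -> p[1]));
        for (double[] pair : itemSimPairs) {   // pair = {itemID, similarity}
            heap.offer(pair);
            if (heap.size() > k) {
                heap.poll();                   // drop the least similar
            }
        }
        List<double[]> result = new ArrayList<>(heap);
        result.sort((a, b) -> Double.compare(b[1], a[1]));  // most similar first
        return result;
    }
}
```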


3 prePartialMultiply1 + prePartialMultiply2 + partialMultiply
These three jobs join the user vectors generated in 1.2 with the similarity matrix from 2.3 reduce (2), keyed by the same item, effectively preparing for the matrix multiplication described below. VectorOrPrefWritable is the shared value type for both kinds of records: it can hold either one item's column of the similarity matrix, or one (userID, prefValue) entry for that item taken from a user vector.

    public final class VectorOrPrefWritable implements Writable {
      private Vector vector;
      private long userID;
      private float value;
    }



The details are as follows:
3.1 prePartialMultiply1
Input: the similarity matrix produced by 2.3 reduce (2).
With each item as key, one row of the similarity matrix is wrapped into a VectorOrPrefWritable as value. Matrix multiplication really needs columns, but for a symmetric similarity matrix rows and columns are identical.


3.2 prePartialMultiply2
Input: the user_vectors generated in 1.2
For each user, every item becomes a key, and the userID together with that user's prefValue for the item is wrapped into a VectorOrPrefWritable as value.


3.3 partialMultiply
The outputs of 3.1 and 3.2 are combined as input, producing records with item as key and a VectorAndPrefsWritable as value. VectorAndPrefsWritable contains one item's column of the similarity matrix, a List<Long> of userIDs, and a List<Float> of pref values.

    public final class VectorAndPrefsWritable implements Writable {
      private Vector vector;
      private List<Long> userIDs;
      private List<Float> values;
    }





4 itemFiltering
If the user has configured items to be filtered out, the user/item pairs to filter also need to be converted into (itemID, VectorAndPrefsWritable) form.


5 aggregateAndRecommend
With everything prepared, the recommendation vectors are now computed. The prediction formula is:
Prediction(u,i) = sum(all n from N: similarity(i,n) * rating(u,n)) / sum(all n from N: abs(similarity(i,n)))
u = a user
i = an item not yet rated by u
N = all items similar to i
As you can see, the numerator is the product of the similarity matrix and a user vector. The implementation of this matrix multiplication differs from the traditional row-by-row approach; its pseudocode:


assign R to be the zero vector
for each column i in the co-occurrence matrix
  multiply column vector i by the ith element of the user vector
  add this vector to R
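The pseudocode above can be sketched over plain arrays like this (ColumnMultiply is a hypothetical helper; dense matrices are used for simplicity):

```java
// Sketch of column-wise matrix-vector multiplication: instead of row-by-row
// dot products, scale each column of the similarity matrix by the user's
// pref for that item and accumulate the result into R.
public class ColumnMultiply {
    public static double[] multiply(double[][] sim, double[] userPrefs) {
        int n = userPrefs.length;
        double[] r = new double[n];              // R starts as the zero vector
        for (int i = 0; i < n; i++) {
            if (userPrefs[i] == 0) continue;     // unrated items contribute nothing
            for (int row = 0; row < n; row++) {
                r[row] += sim[row][i] * userPrefs[i];  // add column i, scaled
            }
        }
        return r;
    }
}
```

This form is convenient in MapReduce because each column can be emitted and scaled independently before the final sum.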


Assuming the similarity matrix has size n, the code above, for a given user, takes each item's column of the similarity matrix, scales it by the user's prefValue for that item, obtains n vectors, and then sums those vectors, producing a recommendation vector over the n items for that user. To do this, we first need to gather, for each user, the user's prefValue for every item together with that item's column of the similarity matrix. The implementation:
Input: the outputs of 3.3 and 4
Map: PartialMultiplyMapper
Converts the (itemID, VectorAndPrefsWritable) records into userID as key and a PrefAndSimilarityColumnWritable as value. PrefAndSimilarityColumnWritable contains the user's prefValue for an item together with that item's column of the similarity matrix; it is built from the vector and values fields of VectorAndPrefsWritable.

    public final class PrefAndSimilarityColumnWritable implements Writable {
      private float prefValue;
      private Vector similarityColumn;
    }




Reduce: AggregateAndRecommendReducer
After collecting all the PrefAndSimilarityColumnWritable values belonging to one user, the reducer carries out the matrix multiplication.
Depending on whether booleanData is set, one of two code paths runs:
(1) reduceBooleanData
Simply sums the similarityColumn of every PrefAndSimilarityColumnWritable; the item prefs are not used.
(2) reduceNonBooleanData
Uses the item prefs in the calculation.
The numerator is the result of the matrix multiplication: following the pseudocode above, each PrefAndSimilarityColumnWritable's similarityColumn is multiplied by its prefValue, producing multiple vectors, which are then summed. The denominator is the sum of all the similarityColumn vectors (taken with absolute values). Here is the code:

Code:

    for (PrefAndSimilarityColumnWritable prefAndSimilarityColumn : values) {
      Vector simColumn = prefAndSimilarityColumn.getSimilarityColumn();
      float prefValue = prefAndSimilarityColumn.getPrefValue();
      // numerator: scale each similarityColumn by the item pref, then sum the vectors
      numerators = numerators == null
          ? prefValue == BOOLEAN_PREF_VALUE ? simColumn.clone() : simColumn.times(prefValue)
          : numerators.plus(prefValue == BOOLEAN_PREF_VALUE ? simColumn : simColumn.times(prefValue));
      simColumn.assign(ABSOLUTE_VALUES);
      // denominator: the sum of all the (absolute-valued) similarityColumn vectors
      denominators = denominators == null ? simColumn : denominators.plus(simColumn);
    }
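Re-expressed over plain arrays, the loop accumulates a numerator and a denominator vector and then divides element-wise (a sketch, not Mahout code; simColumns[i] is the similarity column paired with prefValues[i], and AggregateRecommend is a hypothetical helper):

```java
// Sketch of reduceNonBooleanData: numerator = sum(simColumn * pref),
// denominator = sum(|simColumn|), prediction = numerator / denominator.
public class AggregateRecommend {
    public static double[] predict(double[][] simColumns, double[] prefValues) {
        int n = simColumns[0].length;
        double[] numerators = new double[n];
        double[] denominators = new double[n];
        for (int i = 0; i < simColumns.length; i++) {
            for (int j = 0; j < n; j++) {
                numerators[j] += simColumns[i][j] * prefValues[i];
                denominators[j] += Math.abs(simColumns[i][j]);
            }
        }
        double[] prediction = new double[n];
        for (int j = 0; j < n; j++) {
            // guard against items with no similarity mass at all
            prediction[j] = denominators[j] == 0 ? 0 : numerators[j] / denominators[j];
        }
        return prediction;
    }
}
```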



Dividing the numerator by the denominator yields a value that reflects how strongly each item should be recommended.
Then writeRecommendedItems uses a priority queue to take the top recommendations, converts the indexes back into real itemIDs, and finishes.


In the analysis above, similarity is an implementation of the VectorSimilarityMeasure interface, the interface for similarity algorithms. Its main methods are:
(1) Vector normalize(Vector vector);
(2) double norm(Vector vector);
(3) double aggregate(double nonZeroValueA, double nonZeroValueB);
(4) double similarity(double summedAggregations, double normA, double normB, int numberOfColumns);
(5) boolean consider(int numNonZeroEntriesA, int numNonZeroEntriesB, double maxValueA, double maxValueB, double threshold);
Many similarity algorithms implement this interface; for example, the similarity implementation of TanimotoCoefficientSimilarity:

    public double similarity(double dots, double normA, double normB, int numberOfColumns) {
      return dots / (normA + normB - dots);
    }
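To see how the interface methods compose end to end, here is a cosine-style sketch (for illustration only; CosineLike is a hypothetical class, and Mahout's actual CosineSimilarity differs in details such as pre-normalization):

```java
// Sketch: aggregate multiplies two prefs, norm is the Euclidean norm,
// and similarity divides the summed aggregates by the product of the norms.
public class CosineLike {
    public static double norm(double[] v) {
        double s = 0;
        for (double x : v) s += x * x;
        return Math.sqrt(s);
    }
    public static double aggregate(double a, double b) {
        return a * b;  // per-user dot-product contribution
    }
    public static double similarity(double summedAggregations, double normA, double normB) {
        return summedAggregations / (normA * normB);
    }
    // end-to-end: cosine of two item vectors via the three hooks above
    public static double cosine(double[] a, double[] b) {
        double dots = 0;
        for (int i = 0; i < a.length; i++) dots += aggregate(a[i], b[i]);
        return similarity(dots, norm(a), norm(b));
    }
}
```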

