The recommendation algorithms implemented in Mahout are based on collaborative filtering, and both user-based CF (UserCF) and item-based CF (ItemCF) rely on user similarity or item similarity. This article walks through some of the similarity algorithms in Mahout. The relationships among Mahout's similarity-related classes are as follows:
[Figure: class diagram of the Mahout similarity classes. A little messy (^.^)]
As the figure above shows, Mahout's similarity classes cover user similarity and item similarity. Apart from the item-only classes in the last part (TrackItemSimilarity and HybridSimilarity), every class can be used to compute both user and item similarity. The analysis below is in three parts: classes that inherit AbstractSimilarity; classes that do not inherit AbstractSimilarity but can still compute both user and item similarity; and classes that inherit only AbstractItemSimilarity.
Implementation classes that inherit AbstractSimilarity
The following classes are used for user behavior data that carries preference values.
Before introducing EuclideanDistanceSimilarity, PearsonCorrelationSimilarity, and UncenteredCosineSimilarity, let's look at AbstractSimilarity first. AbstractSimilarity is the core of this part and has three important methods: userSimilarity(), itemSimilarity(), and computeResult(). computeResult() is an abstract method that subclasses implement. userSimilarity() and itemSimilarity() first compute the required sums, then call the subclass's computeResult(), and post-process its return value to obtain the similarity.
In userSimilarity(), the following values are computed first: sumX and sumX2, the sum and the sum of squares of User1's preference values x; sumY and sumY2, the sum and the sum of squares of User2's preference values y; sumXY, the sum of the products x * y; and sumXYdiff2, the sum of the squared differences (x - y)^2 between the two users' preference values for the same item. The code is as follows:
sumXY += x * y;
sumX += x;
sumX2 += x * x;
sumY += y;
sumY2 += y * y;
double diff = x - y;
sumXYdiff2 += diff * diff;
count++;
When one user has a preference for an item and the other user does not, the missing value is inferred by a PreferenceInferrer implementation, with the following code:
// Only one user expressed a preference, but infer the other one's preference and tally
// as if the other user expressed that preference
if (compare < 0) {
  // X has a value; infer Y's
  x = hasPrefTransform
      ? prefTransform.getTransformedValue(xPrefs.get(xPrefIndex))
      : xPrefs.getValue(xPrefIndex);
  y = inferrer.inferPreference(userID2, xIndex);
} else {
  // compare > 0
  // Y has a value; infer X's
  x = inferrer.inferPreference(userID1, yIndex);
  y = hasPrefTransform
      ? prefTransform.getTransformedValue(yPrefs.get(yPrefIndex))
      : yPrefs.getValue(yPrefIndex);
}
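As a rough illustration of what an inferrer can do, the sketch below fills a missing preference with the user's own average preference. The class `AverageInferrerSketch`, its method names, and its map-based storage are hypothetical and are not Mahout's PreferenceInferrer API; Mahout's own inferrer implementations may behave differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Mahout's API: infer a missing preference as the
// average of the preferences the user has already expressed.
final class AverageInferrerSketch {
    // userID -> (itemID -> preference value)
    private final Map<Long, Map<Long, Double>> prefs = new HashMap<>();

    void setPreference(long userID, long itemID, double value) {
        prefs.computeIfAbsent(userID, k -> new HashMap<>()).put(itemID, value);
    }

    // Returns the stored value if present, otherwise the user's mean preference.
    // A user with no preferences at all yields NaN.
    double preferenceOrInferred(long userID, long itemID) {
        Map<Long, Double> userPrefs = prefs.get(userID);
        if (userPrefs == null || userPrefs.isEmpty()) {
            return Double.NaN;
        }
        Double known = userPrefs.get(itemID);
        if (known != null) {
            return known;
        }
        double sum = 0.0;
        for (double v : userPrefs.values()) {
            sum += v;
        }
        return sum / userPrefs.size();
    }
}
```

With preferences 2.0 and 4.0 stored for a user, asking for an unrated item returns 3.0, so the pair can still be tallied into the running sums.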
The computeResult() method is then called. Each commented-out line shows the full expansion of a centered sum, which simplifies to the line below it. The code is as follows:
// "Center" the data. If my math is correct, this'll do it.
double result;
if (centerData) {
  double meanX = sumX / count;
  double meanY = sumY / count;
  // double centeredSumXY = sumXY - meanY * sumX - meanX * sumY + n * meanX * meanY;
  double centeredSumXY = sumXY - meanY * sumX;
  // double centeredSumX2 = sumX2 - 2.0 * meanX * sumX + n * meanX * meanX;
  double centeredSumX2 = sumX2 - meanX * sumX;
  // double centeredSumY2 = sumY2 - 2.0 * meanY * sumY + n * meanY * meanY;
  double centeredSumY2 = sumY2 - meanY * sumY;
  result = computeResult(count, centeredSumXY, centeredSumX2, centeredSumY2, sumXYdiff2);
} else {
  result = computeResult(count, sumXY, sumX2, sumY2, sumXYdiff2);
}
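The shortened centered sums follow from meanX = sumX / n and meanY = sumY / n: after substitution, the last two terms of the full expansion cancel. The sketch below checks this numerically for the centered sum of x * y. The class name `CenteringCheck` and the sample data are assumptions for illustration only.

```java
// Hypothetical numeric check, not Mahout code: the shortened centered sum
// equals the fully expanded one when the means are sumX / n and sumY / n.
final class CenteringCheck {
    // Returns {full expansion, shortened form} of the centered sum of x * y.
    static double[] centeredSumXY(double[] xs, double[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int n = xs.length;
        double sumX = 0.0;
        double sumY = 0.0;
        double sumXY = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += xs[i];
            sumY += ys[i];
            sumXY += xs[i] * ys[i];
        }
        double meanX = sumX / n;
        double meanY = sumY / n;
        // meanX * sumY == n * meanX * meanY, so those two terms cancel.
        double full = sumXY - meanY * sumX - meanX * sumY + n * meanX * meanY;
        double shortened = sumXY - meanY * sumX;
        return new double[] {full, shortened};
    }
}
```

For x = {1, 2, 3, 5} and y = {2, 4, 5, 9}, both forms evaluate to 15 (up to rounding), which matches summing (x - meanX) * (y - meanY) directly.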
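Finally, result passes through two more steps, similarityTransform and normalizeWeightResult(), and the outcome is the similarity value. The snippet below uses the variable names from itemSimilarity(); userSimilarity() applies the same two steps to User1 and User2. The code is as follows: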
if (similarityTransform != null) {
  result = similarityTransform.transformSimilarity(itemID1, itemID2, result);
}
if (!Double.isNaN(result)) {
  result = normalizeWeightResult(result, count, cachedNumUsers);
}
1. PearsonCorrelationSimilarity
Similarity calculation based on Pearson correlation: sumXY / sqrt(sumX2 * sumY2), where the sums are taken over centered data. The code is as follows:
@Override
double computeResult(int n, double sumXY, double sumX2, double sumY2, double sumXYdiff2) {
  if (n == 0) {
    return Double.NaN;
  }
  // Note that sum of X and sum of Y don't appear here since they are assumed to be 0;
  // the data is assumed to be centered.
  double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
  if (denominator == 0.0) {
    // One or both parties has -all- the same ratings;
    // can't really say much similarity under this measure
    return Double.NaN;
  }
  return sumXY / denominator;
}
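To see the centering step and the Pearson formula working together, here is a minimal standalone sketch over two aligned arrays of preference values. `PearsonSketch` and its array-based signature are hypothetical; Mahout works on its own preference arrays rather than plain `double[]` inputs.

```java
// Hypothetical sketch, not Mahout code: Pearson correlation over aligned
// arrays, centering the sums first and then applying sumXY / sqrt(sumX2 * sumY2).
final class PearsonSketch {
    static double pearson(double[] xs, double[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int n = xs.length;
        if (n == 0) {
            return Double.NaN;
        }
        double sumX = 0.0, sumY = 0.0, sumXY = 0.0, sumX2 = 0.0, sumY2 = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += xs[i];
            sumY += ys[i];
            sumXY += xs[i] * ys[i];
            sumX2 += xs[i] * xs[i];
            sumY2 += ys[i] * ys[i];
        }
        double meanX = sumX / n;
        double meanY = sumY / n;
        double centeredSumXY = sumXY - meanY * sumX;
        double centeredSumX2 = sumX2 - meanX * sumX;
        double centeredSumY2 = sumY2 - meanY * sumY;
        double denominator = Math.sqrt(centeredSumX2) * Math.sqrt(centeredSumY2);
        if (denominator == 0.0) {
            // At least one side has no variance; correlation is undefined.
            return Double.NaN;
        }
        return centeredSumXY / denominator;
    }
}
```

For example, `PearsonSketch.pearson(new double[] {1, 2, 3}, new double[] {2, 4, 6})` is approximately 1.0, while reversing the second array gives approximately -1.0.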
2. EuclideanDistanceSimilarity
Similarity calculation based on Euclidean distance: 1 / (1 + distance / sqrt(n)), where distance = sqrt(sumXYdiff2) and n is the number of preference pairs counted. The code is as follows:
@Override
double computeResult(int n, double sumXY, double sumX2, double sumY2, double sumXYdiff2) {
  return 1.0 / (1.0 + Math.sqrt(sumXYdiff2) / Math.sqrt(n));
}
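A small standalone sketch makes the mapping from distance to similarity easier to try out. `EuclideanSketch` and its array inputs are hypothetical names for illustration; only the final formula is taken from the code above.

```java
// Hypothetical sketch, not Mahout code: maps Euclidean distance between two
// aligned preference arrays to a similarity in (0, 1].
final class EuclideanSketch {
    static double similarity(double[] xs, double[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int n = xs.length;
        double sumXYdiff2 = 0.0;
        for (int i = 0; i < n; i++) {
            double diff = xs[i] - ys[i];
            sumXYdiff2 += diff * diff;
        }
        // Dividing by sqrt(n) keeps the distance from growing just because
        // more pairs were compared. With n == 0 this evaluates to NaN.
        return 1.0 / (1.0 + Math.sqrt(sumXYdiff2) / Math.sqrt(n));
    }
}
```

Identical arrays give 1.0, and the value falls toward 0 as the arrays move apart; for {1, 2, 3} against {2, 3, 4} the result is 0.5.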
3. UncenteredCosineSimilarity
Similarity calculation based on cosine similarity, computed on the raw (uncentered) preference values: sumXY / (sqrt(sumX2) * sqrt(sumY2)). The code is as follows:
@Override
double computeResult(int n, double sumXY, double sumX2, double sumY2, double sumXYdiff2) {
  if (n == 0) {
    return Double.NaN;
  }
  double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
  if (denominator == 0.0) {
    // One or both parties has -all- the same ratings;
    // can't really say much similarity under this measure
    return Double.NaN;
  }
  return sumXY / denominator;
}
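The formula is the same as in the Pearson case; the difference is that the sums are not centered. The hypothetical `CosineSketch` below (not Mahout code) shows the effect on plain arrays: for {1, 2, 3} and {3, 2, 1} the uncentered cosine is positive, whereas the Pearson correlation of the same data is -1.

```java
// Hypothetical sketch, not Mahout code: cosine similarity on raw
// (uncentered) preference values.
final class CosineSketch {
    static double cosine(double[] xs, double[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        if (xs.length == 0) {
            return Double.NaN;
        }
        double sumXY = 0.0, sumX2 = 0.0, sumY2 = 0.0;
        for (int i = 0; i < xs.length; i++) {
            sumXY += xs[i] * ys[i];
            sumX2 += xs[i] * xs[i];
            sumY2 += ys[i] * ys[i];
        }
        double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
        if (denominator == 0.0) {
            // One vector is all zeros; the angle is undefined.
            return Double.NaN;
        }
        return sumXY / denominator;
    }
}
```

Here `CosineSketch.cosine(new double[] {1, 2, 3}, new double[] {3, 2, 1})` is 10 / 14, roughly 0.714, because all values are positive and the vectors point in broadly the same direction.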
Classes that inherit AbstractItemSimilarity and implement UserSimilarity
The following classes are used for user behavior data with no preference values.
1. TanimotoCoefficientSimilarity
Similarity based on the Tanimoto coefficient: the size of the intersection of the items preferred by userID1 and by userID2, divided by the size of their union (item similarity is computed analogously). The code is as follows:
public double userSimilarity(long userID1, long userID2) throws TasteException {
  DataModel dataModel = getDataModel();
  FastIDSet xPrefs = dataModel.getItemIDsFromUser(userID1);
  FastIDSet yPrefs = dataModel.getItemIDsFromUser(userID2);
  int xPrefsSize = xPrefs.size();
  int yPrefsSize = yPrefs.size();
  if (xPrefsSize == 0 && yPrefsSize == 0) { return Double.NaN; }
  if (xPrefsSize == 0 || yPrefsSize == 0) { return 0.0; }
  // Compute the intersection size
  int intersectionSize =
      xPrefsSize < yPrefsSize ? yPrefs.intersectionSize(xPrefs) : xPrefs.intersectionSize(yPrefs);
  if (intersectionSize == 0) { return Double.NaN; }
  // Compute the union size
  int unionSize = xPrefsSize + yPrefsSize - intersectionSize;
  return (double) intersectionSize / (double) unionSize;
}
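The same computation can be sketched with plain `java.util.Set` values instead of Mahout's FastIDSet. `TanimotoSketch` is a hypothetical class for illustration; it mirrors the special cases in the listing above, including returning NaN when the sets do not overlap.

```java
import java.util.Set;

// Hypothetical sketch, not Mahout code: Tanimoto (Jaccard) coefficient
// over two sets of item IDs.
final class TanimotoSketch {
    static double similarity(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return Double.NaN;
        }
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        // Iterate over the smaller set and probe the larger one.
        Set<Long> smaller = a.size() <= b.size() ? a : b;
        Set<Long> larger = a.size() <= b.size() ? b : a;
        int intersectionSize = 0;
        for (Long id : smaller) {
            if (larger.contains(id)) {
                intersectionSize++;
            }
        }
        if (intersectionSize == 0) {
            return Double.NaN;
        }
        int unionSize = a.size() + b.size() - intersectionSize;
        return (double) intersectionSize / unionSize;
    }
}
```

For the item sets {1, 2, 3} and {2, 3, 4, 5} the intersection has 2 elements and the union 5, so the coefficient is 0.4.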
2. LogLikelihoodSimilarity
Similarity based on the log-likelihood ratio: a refinement of the Tanimoto-based similarity that also takes into account how unlikely the observed overlap is to occur by chance. The code is as follows:
public double userSimilarity(long userID1, long userID2) throws TasteException {
  DataModel dataModel = getDataModel();
  FastIDSet prefs1 = dataModel.getItemIDsFromUser(userID1);
  FastIDSet prefs2 = dataModel.getItemIDsFromUser(userID2);
  long prefs1Size = prefs1.size();
  long prefs2Size = prefs2.size();
  // Compute the intersection size
  long intersectionSize =
      prefs1Size < prefs2Size ? prefs2.intersectionSize(prefs1) : prefs1.intersectionSize(prefs2);
  if (intersectionSize == 0) { return Double.NaN; }
  long numItems = dataModel.getNumItems();
  double logLikelihood =
      LogLikelihood.logLikelihoodRatio(intersectionSize,
                                       prefs2Size - intersectionSize,
                                       prefs1Size - intersectionSize,
                                       numItems - prefs1Size - prefs2Size + intersectionSize);
  return 1.0 - 1.0 / (1.0 + logLikelihood);
}
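The four arguments form a 2x2 contingency table: items both users prefer, items only one of them prefers, and items neither prefers. The sketch below computes a log-likelihood ratio for such a table using the common entropy formulation of the G-squared statistic. That Mahout's LogLikelihood class matches this formulation term for term is an assumption here, and `LogLikelihoodSketch` is a hypothetical name.

```java
// Hypothetical sketch: log-likelihood ratio (G^2) of a 2x2 contingency table
// via unnormalized entropies, then mapped to a similarity in [0, 1).
final class LogLikelihoodSketch {
    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // N * ln(N) - sum(k * ln(k)) for the given counts, where N is their total.
    private static double entropy(long... counts) {
        long total = 0;
        double parts = 0.0;
        for (long c : counts) {
            parts += xLogX(c);
            total += c;
        }
        return xLogX(total) - parts;
    }

    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    static double similarity(long intersection, long prefs1Size, long prefs2Size, long numItems) {
        double llr = logLikelihoodRatio(intersection,
                                        prefs2Size - intersection,
                                        prefs1Size - intersection,
                                        numItems - prefs1Size - prefs2Size + intersection);
        return 1.0 - 1.0 / (1.0 + llr);
    }
}
```

A table whose rows and columns are independent, such as (1, 1, 1, 1), yields a ratio of 0 and therefore a similarity of 0, while a strongly associated table such as (10, 0, 0, 10) yields 40 * ln 2, about 27.7, and a similarity close to 1.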
3. CityBlockSimilarity
Similarity based on city block (Manhattan) distance: 1 / (1 + distance). The code is as follows:
public double userSimilarity(long userID1, long userID2) throws TasteException {
  DataModel dataModel = getDataModel();
  FastIDSet prefs1 = dataModel.getItemIDsFromUser(userID1);
  FastIDSet prefs2 = dataModel.getItemIDsFromUser(userID2);
  int prefs1Size = prefs1.size();
  int prefs2Size = prefs2.size();
  int intersectionSize =
      prefs1Size < prefs2Size ? prefs2.intersectionSize(prefs1) : prefs1.intersectionSize(prefs2);
  return doSimilarity(prefs1Size, prefs2Size, intersectionSize);
}
// Calculate City Block Distance from total non-zero values and intersections
// and map it to a similarity value.
private static double doSimilarity(int pref1, int pref2, int intersection) {
  int distance = pref1 + pref2 - 2 * intersection;
  return 1.0 / (1.0 + distance);
}
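For data without preference values, each user is effectively a 0/1 vector over items, so the city block distance is simply the number of items that exactly one of the two users has. The hypothetical `CityBlockSketch` below (not Mahout code) shows this with plain sets.

```java
import java.util.Set;

// Hypothetical sketch, not Mahout code: city block distance between two
// item sets, mapped to a similarity in (0, 1].
final class CityBlockSketch {
    static double similarity(Set<Long> a, Set<Long> b) {
        int intersectionSize = 0;
        for (Long id : a) {
            if (b.contains(id)) {
                intersectionSize++;
            }
        }
        // Items present in exactly one of the two sets.
        int distance = a.size() + b.size() - 2 * intersectionSize;
        return 1.0 / (1.0 + distance);
    }
}
```

For {1, 2, 3} and {2, 3, 4, 5}, three items appear in only one set, so the distance is 3 and the similarity is 0.25. Unlike the Tanimoto listing, disjoint sets give a small positive value rather than NaN.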
Implementation classes that inherit only AbstractItemSimilarity
1. TrackItemSimilarity
The code is as follows:
public double itemSimilarity(long itemID1, long itemID2) {
  if (itemID1 == itemID2) { return 1.0; }
  TrackData data1 = trackData.get(itemID1);
  TrackData data2 = trackData.get(itemID2);
  if (data1 == null || data2 == null) { return 0.0; }
  // Arbitrarily decide that same album means "very similar"
  if (data1.getAlbumID() != TrackData.NO_VALUE_ID && data1.getAlbumID() == data2.getAlbumID()) { return 0.9; }
  // ... and same artist means "fairly similar"
  if (data1.getArtistID() != TrackData.NO_VALUE_ID && data1.getArtistID() == data2.getArtistID()) { return 0.7; }
  // Tanimoto coefficient similarity based on genre, but maximum value of 0.25
  FastIDSet genres1 = data1.getGenreIDs();
  FastIDSet genres2 = data2.getGenreIDs();
  if (genres1 == null || genres2 == null) { return 0.0; }
  int intersectionSize = genres1.intersectionSize(genres2);
  if (intersectionSize == 0) { return 0.0; }
  int unionSize = genres1.size() + genres2.size() - intersectionSize;
  return (double) intersectionSize / (4.0 * unionSize);
}
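The method is a tiered, content-based rule: album match first, then artist match, then a genre overlap scaled down to at most 0.25. A standalone sketch of the same tiers follows. The `Track` holder class and the `NO_VALUE` sentinel of -1 are assumptions made for this example and do not reflect Mahout's TrackData class.

```java
import java.util.Set;

// Hypothetical sketch, not Mahout code: tiered content-based similarity.
final class TrackSimilaritySketch {
    static final long NO_VALUE = -1L; // assumed sentinel for "unknown"

    static final class Track {
        final long albumID;
        final long artistID;
        final Set<Long> genreIDs; // may be null when genres are unknown

        Track(long albumID, long artistID, Set<Long> genreIDs) {
            this.albumID = albumID;
            this.artistID = artistID;
            this.genreIDs = genreIDs;
        }
    }

    static double similarity(Track t1, Track t2) {
        if (t1 == null || t2 == null) {
            return 0.0;
        }
        if (t1.albumID != NO_VALUE && t1.albumID == t2.albumID) {
            return 0.9; // same album: "very similar"
        }
        if (t1.artistID != NO_VALUE && t1.artistID == t2.artistID) {
            return 0.7; // same artist: "fairly similar"
        }
        if (t1.genreIDs == null || t2.genreIDs == null) {
            return 0.0;
        }
        int intersectionSize = 0;
        for (Long genre : t1.genreIDs) {
            if (t2.genreIDs.contains(genre)) {
                intersectionSize++;
            }
        }
        if (intersectionSize == 0) {
            return 0.0;
        }
        int unionSize = t1.genreIDs.size() + t2.genreIDs.size() - intersectionSize;
        // Tanimoto coefficient over genres, capped at 0.25 by the factor 4.
        return (double) intersectionSize / (4.0 * unionSize);
    }
}
```

Two tracks with different albums and artists but identical genre sets score exactly 0.25, which keeps genre-only matches below the artist and album tiers.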
2. HybridSimilarity
Similarity based on blending: the product of LogLikelihoodSimilarity and TrackItemSimilarity. The code is as follows:
HybridSimilarity(DataModel dataModel, File dataFileDirectory) throws IOException {
  super(dataModel);
  cfSimilarity = new LogLikelihoodSimilarity(dataModel);
  contentSimilarity = new TrackItemSimilarity(dataFileDirectory);
}
public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
  return contentSimilarity.itemSimilarity(itemID1, itemID2) * cfSimilarity.itemSimilarity(itemID1, itemID2);
}
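The blending pattern itself is independent of Mahout and can be sketched with a small functional interface. `HybridSketch`, `ItemSim`, and `product` are hypothetical names introduced only for this example.

```java
// Hypothetical sketch, not Mahout code: combine a content-based and a
// collaborative-filtering similarity by multiplying their scores.
final class HybridSketch {
    interface ItemSim {
        double itemSimilarity(long itemID1, long itemID2);
    }

    static ItemSim product(ItemSim content, ItemSim cf) {
        return (itemID1, itemID2) ->
            content.itemSimilarity(itemID1, itemID2) * cf.itemSimilarity(itemID1, itemID2);
    }
}
```

Because the scores are multiplied, a pair must rate well under both measures to rank highly: a content score of 0.9 and a collaborative score of 0.5 combine to 0.45. A NaN from either side propagates to the product, so callers that rank on the result need to handle NaN.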