Interpretation of some similarity algorithms in Mahout

Source: Internet
Author: User

The recommendation algorithms implemented in Mahout are based on collaborative filtering, and both UserCF and ItemCF rely on user similarity or item similarity. This article walks through some of the similarity algorithms in Mahout. The relationships among Mahout's similarity classes are as follows:

A little messy (^.^)

As the figure above shows, Mahout's similarity classes cover user similarity and item similarity. All of them except HybridSimilarity can compute both user and item similarity; HybridSimilarity can compute item similarity only. The analysis below has three parts: classes that inherit AbstractSimilarity; classes that do not inherit AbstractSimilarity but can still compute both user and item similarity; and classes that inherit only AbstractItemSimilarity.

Implementation classes that inherit AbstractSimilarity

The following classes are used for user behavior data that has preference values.
Before introducing EuclideanDistanceSimilarity, PearsonCorrelationSimilarity, and UncenteredCosineSimilarity, let's look at AbstractSimilarity first, since it is the core of this part. It has three important methods: userSimilarity(), itemSimilarity(), and computeResult(). computeResult() is an abstract method that subclasses implement. userSimilarity() and itemSimilarity() compute the required sums, call the subclass's computeResult(), and post-process the result to obtain the similarity.
userSimilarity() first accumulates the following values: sumX, the sum of User1's preference values x, and sumX2, the sum of their squares; sumY, the sum of User2's preference values y, and sumY2, the sum of their squares; sumXY, the sum of the products x * y; and sumXYdiff2, the sum of the squared differences between the two users' preference values for the same item. The code is as follows:

sumXY += x * y;
sumX += x;
sumX2 += x * x;
sumY += y;
sumY2 += y * y;
double diff = x - y;
sumXYdiff2 += diff * diff;
count++;
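
The running sums can be reproduced outside Mahout. The following is a minimal sketch, assuming both users rated the same items and their preference values arrive as two parallel arrays. The class `SumsSketch` and its methods are illustrative names, not part of Mahout.

```java
/** Illustrative only: keeps the same running sums as the snippet above. */
final class SumsSketch {
    double sumXY, sumX, sumX2, sumY, sumY2, sumXYdiff2;
    int count;

    /** Tallies one co-rated item: x is User1's preference, y is User2's. */
    void add(double x, double y) {
        sumXY += x * y;
        sumX += x;
        sumX2 += x * x;
        sumY += y;
        sumY2 += y * y;
        double diff = x - y;
        sumXYdiff2 += diff * diff;
        count++;
    }

    /** Builds the sums from two parallel arrays of preference values. */
    static SumsSketch of(double[] xs, double[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        SumsSketch sums = new SumsSketch();
        for (int i = 0; i < xs.length; i++) {
            sums.add(xs[i], ys[i]);
        }
        return sums;
    }

    public static void main(String[] args) {
        SumsSketch s = of(new double[] {1, 2, 3}, new double[] {2, 4, 5});
        System.out.println("sumXY=" + s.sumXY + " sumXYdiff2=" + s.sumXYdiff2 + " count=" + s.count);
    }
}
```

Every similarity measure in this part is then a function of these seven numbers only, which is why the subclasses need to implement nothing but computeResult().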

When one user has a preference for an item and the other user does not, the missing value is inferred by a PreferenceInferrer implementation, with the following code:

// Only one user expressed a preference, but infer the other one's preference and tally
// as if the other user expressed that preference
if (compare < 0) {
    // X has a value; infer Y's
    x = hasPrefTransform
        ? prefTransform.getTransformedValue(xPrefs.get(xPrefIndex))
        : xPrefs.getValue(xPrefIndex);
    y = inferrer.inferPreference(userID2, xIndex);
} else {
    // compare > 0
    // Y has a value; infer X's
    x = inferrer.inferPreference(userID1, yIndex);
    y = hasPrefTransform
        ? prefTransform.getTransformedValue(yPrefs.get(yPrefIndex))
        : yPrefs.getValue(yPrefIndex);
}
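
To make the role of `compare` concrete, here is a minimal sketch of the merge walk over two users' preferences, each sorted by item ID. It assumes both users have at least one preference, and it uses a hypothetical inferrer that fills a missing value with that user's average known preference. The class `InferenceSketch` and its method names are illustrative, not Mahout's API.

```java
/** Illustrative only: merge walk over two sorted preference lists with inference. */
final class InferenceSketch {

    /** Hypothetical stand-in for a PreferenceInferrer: the user's mean known preference. */
    static double inferFromAverage(double[] knownValues) {
        double sum = 0.0;
        for (double v : knownValues) {
            sum += v;
        }
        return sum / knownValues.length;
    }

    /**
     * Returns {sumXY, count} over the union of both users' items.
     * Item ID arrays must be sorted ascending and non-empty.
     */
    static double[] tally(long[] xItems, double[] xVals, long[] yItems, double[] yVals) {
        double xAvg = inferFromAverage(xVals);
        double yAvg = inferFromAverage(yVals);
        double sumXY = 0.0;
        int count = 0;
        int i = 0;
        int j = 0;
        while (i < xItems.length || j < yItems.length) {
            int compare;
            if (i >= xItems.length) {
                compare = 1;               // only Y has items left
            } else if (j >= yItems.length) {
                compare = -1;              // only X has items left
            } else {
                compare = Long.compare(xItems[i], yItems[j]);
            }
            double x;
            double y;
            if (compare == 0) {            // both users rated this item
                x = xVals[i++];
                y = yVals[j++];
            } else if (compare < 0) {      // X has a value; infer Y's
                x = xVals[i++];
                y = yAvg;
            } else {                       // Y has a value; infer X's
                x = xAvg;
                y = yVals[j++];
            }
            sumXY += x * y;
            count++;
        }
        return new double[] {sumXY, count};
    }

    public static void main(String[] args) {
        double[] r = tally(new long[] {1, 2}, new double[] {4, 2},
                           new long[] {2, 3}, new double[] {5, 1});
        System.out.println("sumXY=" + r[0] + " count=" + (int) r[1]);
    }
}
```

With inference switched on, the tally covers the union of the two users' items rather than only the items they share.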

The computeResult() method is then called. The code is as follows (each commented-out line is the full expansion of the shorter expression that follows it):

// "Center" the data. If my math is correct, this'll do it.
double result;
if (centerData) {
    double meanX = sumX / count;
    double meanY = sumY / count;
    // double centeredSumXY = sumXY - meanY * sumX - meanX * sumY + n * meanX * meanY;
    double centeredSumXY = sumXY - meanY * sumX;
    // double centeredSumX2 = sumX2 - 2.0 * meanX * sumX + n * meanX * meanX;
    double centeredSumX2 = sumX2 - meanX * sumX;
    // double centeredSumY2 = sumY2 - 2.0 * meanY * sumY + n * meanY * meanY;
    double centeredSumY2 = sumY2 - meanY * sumY;
    result = computeResult(count, centeredSumXY, centeredSumX2, centeredSumY2, sumXYdiff2);
} else {
    result = computeResult(count, sumXY, sumX2, sumY2, sumXYdiff2);
}
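
The short centering expressions work because meanX * sumY equals n * meanX * meanY, so two terms of the full expansion cancel. The sketch below checks this numerically by comparing the direct definition of the centered cross-sum with the shortcut sumXY - meanY * sumX. The class `CenteringSketch` is an illustrative name, and both arrays are assumed to be non-empty and of equal length.

```java
/** Illustrative only: compares direct centering with the running-sum shortcut. */
final class CenteringSketch {

    /** Direct definition: sum over i of (x_i - meanX) * (y_i - meanY). */
    static double centeredDirect(double[] xs, double[] ys) {
        int n = xs.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += xs[i];
            meanY += ys[i];
        }
        meanX /= n;
        meanY /= n;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += (xs[i] - meanX) * (ys[i] - meanY);
        }
        return sum;
    }

    /** Shortcut from running sums: sumXY - meanY * sumX. */
    static double centeredShortcut(double[] xs, double[] ys) {
        int n = xs.length;
        double sumXY = 0.0;
        double sumX = 0.0;
        double sumY = 0.0;
        for (int i = 0; i < n; i++) {
            sumXY += xs[i] * ys[i];
            sumX += xs[i];
            sumY += ys[i];
        }
        double meanY = sumY / n;
        return sumXY - meanY * sumX;
    }

    public static void main(String[] args) {
        double[] xs = {1, 2, 3};
        double[] ys = {2, 4, 5};
        System.out.println(centeredDirect(xs, ys) + " vs " + centeredShortcut(xs, ys));
    }
}
```

The shortcut needs only one pass over the data, which is why the sums can be tallied before the means are known.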

Finally, after the two steps similarityTransform and normalizeWeightResult(), result is the similarity value of User1 and User2. The code is as follows:

if (similarityTransform != null) {
    result = similarityTransform.transformSimilarity(itemID1, itemID2, result);
}

if (!Double.isNaN(result)) {
    result = normalizeWeightResult(result, count, cachedNumUsers);
}

1.PearsonCorrelationSimilarity
Similarity based on the Pearson correlation: sumXY / sqrt(sumX2 * sumY2), computed on centered data. The code is as follows:

@Override
double computeResult(int n, double sumXY, double sumX2, double sumY2, double sumXYdiff2) {
    if (n == 0) {
        return Double.NaN;
    }
    // Note that sum of X and sum of Y don't appear here since they are assumed to be 0;
    // the data is assumed to be centered.
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    if (denominator == 0.0) {
        // One or both parties has -all- the same ratings;
        // can't really say much similarity under this measure
        return Double.NaN;
    }
    return sumXY / denominator;
}
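
As a self-contained illustration, the sketch below computes the same quantity directly from two parallel arrays of co-rated preference values: it centers each array on its mean and then applies sumXY / (sqrt(sumX2) * sqrt(sumY2)). The class `PearsonSketch` is an illustrative name, and the arrays are assumed to have equal length.

```java
/** Illustrative only: Pearson correlation of two equally long preference arrays. */
final class PearsonSketch {

    static double pearson(double[] xs, double[] ys) {
        int n = xs.length;
        if (n == 0) {
            return Double.NaN;
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += xs[i];
            meanY += ys[i];
        }
        meanX /= n;
        meanY /= n;

        // Sums over the centered values.
        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (int i = 0; i < n; i++) {
            double x = xs[i] - meanX;
            double y = ys[i] - meanY;
            sumXY += x * y;
            sumX2 += x * x;
            sumY2 += y * y;
        }
        double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
        if (denominator == 0.0) {
            // One side has identical ratings everywhere; correlation is undefined.
            return Double.NaN;
        }
        return sumXY / denominator;
    }

    public static void main(String[] args) {
        System.out.println(pearson(new double[] {1, 2, 3, 4}, new double[] {3, 5, 7, 9}));
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {3, 2, 1}));
    }
}
```

Because of the centering, a user who rates everything one point higher than another user still correlates perfectly with that user.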

2.EuclideanDistanceSimilarity
Similarity based on Euclidean distance: 1 / (1 + distance / sqrt(n)), where distance = sqrt(sumXYdiff2) and n is the number of items tallied. The code is as follows:

@Override
double computeResult(int n, double sumXY, double sumX2, double sumY2, double sumXYdiff2) {
    return 1.0 / (1.0 + Math.sqrt(sumXYdiff2) / Math.sqrt(n));
}
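
A small worked example may help here. The sketch below applies the same expression to two parallel arrays of co-rated preference values. The class `EuclideanSketch` is an illustrative name, and the arrays are assumed to have equal length.

```java
/** Illustrative only: Euclidean-distance similarity of two equally long arrays. */
final class EuclideanSketch {

    static double similarity(double[] xs, double[] ys) {
        int n = xs.length;
        double sumXYdiff2 = 0.0;
        for (int i = 0; i < n; i++) {
            double diff = xs[i] - ys[i];
            sumXYdiff2 += diff * diff;
        }
        // Distance divided by sqrt(n), then mapped into (0, 1].
        return 1.0 / (1.0 + Math.sqrt(sumXYdiff2) / Math.sqrt(n));
    }

    public static void main(String[] args) {
        System.out.println(similarity(new double[] {1, 2, 3}, new double[] {1, 2, 3}));
        System.out.println(similarity(new double[] {1, 2, 3}, new double[] {2, 4, 5}));
    }
}
```

For {1, 2, 3} and {2, 4, 5} the squared differences sum to 9, so the similarity is 1 / (1 + 3 / sqrt(3)) = 1 / (1 + sqrt(3)). Identical ratings give a distance of 0 and a similarity of 1.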

3.UncenteredCosineSimilarity
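Similarity based on the cosine of the angle between the two preference vectors: sumXY / (sqrt(sumX2) * sqrt(sumY2)), computed on the raw, uncentered values. The code is as follows: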

@Override
double computeResult(int n, double sumXY, double sumX2, double sumY2, double sumXYdiff2) {
    if (n == 0) {
        return Double.NaN;
    }
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    if (denominator == 0.0) {
        // One or both parties has -all- the same ratings;
        // can't really say much similarity under this measure
        return Double.NaN;
    }
    return sumXY / denominator;
}
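
The formula is the same as for the Pearson correlation, and the only difference is that the data is not centered first. The sketch below shows the effect on a small example. The class `CosineSketch` is an illustrative name, and the arrays are assumed to have equal length.

```java
/** Illustrative only: uncentered cosine similarity of two equally long arrays. */
final class CosineSketch {

    static double cosine(double[] xs, double[] ys) {
        int n = xs.length;
        if (n == 0) {
            return Double.NaN;
        }
        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (int i = 0; i < n; i++) {
            sumXY += xs[i] * ys[i];
            sumX2 += xs[i] * xs[i];
            sumY2 += ys[i] * ys[i];
        }
        double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
        if (denominator == 0.0) {
            return Double.NaN;
        }
        return sumXY / denominator;
    }

    public static void main(String[] args) {
        // Opposite rating trends: the Pearson correlation of these arrays is -1,
        // but the uncentered cosine stays positive because all values are positive.
        System.out.println(cosine(new double[] {1, 2, 3}, new double[] {3, 2, 1}));
    }
}
```

For {1, 2, 3} and {3, 2, 1}, sumXY is 10 and both squared sums are 14, so the cosine is 10 / 14.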

Classes that inherit AbstractItemSimilarity and implement UserSimilarity

The following classes are used for user behavior data with no preference values.

1.TanimotoCoefficientSimilarity
Similarity based on the Tanimoto coefficient: the size of the intersection of userID1's and userID2's preferred items divided by the size of their union (item similarity is computed analogously). The code is as follows:

public double userSimilarity(long userID1, long userID2) throws TasteException {
    DataModel dataModel = getDataModel();
    FastIDSet xPrefs = dataModel.getItemIDsFromUser(userID1);
    FastIDSet yPrefs = dataModel.getItemIDsFromUser(userID2);
    int xPrefsSize = xPrefs.size();
    int yPrefsSize = yPrefs.size();
    if (xPrefsSize == 0 && yPrefsSize == 0) {
        return Double.NaN;
    }
    if (xPrefsSize == 0 || yPrefsSize == 0) {
        return 0.0;
    }
    // compute the intersection size
    int intersectionSize = xPrefsSize < yPrefsSize
        ? yPrefs.intersectionSize(xPrefs)
        : xPrefs.intersectionSize(yPrefs);
    if (intersectionSize == 0) {
        return Double.NaN;
    }
    // compute the union size
    int unionSize = xPrefsSize + yPrefsSize - intersectionSize;
    return (double) intersectionSize / (double) unionSize;
}
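
The same calculation can be sketched with plain `java.util.Set` in place of Mahout's FastIDSet. The class `TanimotoSketch` is an illustrative name. The edge cases mirror the listing above, including the NaN result for an empty intersection.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: Tanimoto coefficient of two sets of item IDs. */
final class TanimotoSketch {

    static double tanimoto(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return Double.NaN;
        }
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        if (intersection.isEmpty()) {
            return Double.NaN;
        }
        // |A union B| = |A| + |B| - |A intersect B|
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        System.out.println(tanimoto(Set.of(1L, 2L, 3L), Set.of(2L, 3L, 4L)));
    }
}
```

For the item sets {1, 2, 3} and {2, 3, 4}, the intersection has 2 elements and the union has 4, so the coefficient is 0.5.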

2.LogLikelihoodSimilarity
Similarity based on the log-likelihood ratio: a refinement of the Tanimoto-coefficient similarity that also takes the total number of items into account. The code is as follows:

public double userSimilarity(long userID1, long userID2) throws TasteException {
    DataModel dataModel = getDataModel();
    FastIDSet prefs1 = dataModel.getItemIDsFromUser(userID1);
    FastIDSet prefs2 = dataModel.getItemIDsFromUser(userID2);
    long prefs1Size = prefs1.size();
    long prefs2Size = prefs2.size();
    // compute intersection size
    long intersectionSize = prefs1Size < prefs2Size
        ? prefs2.intersectionSize(prefs1)
        : prefs1.intersectionSize(prefs2);
    if (intersectionSize == 0) {
        return Double.NaN;
    }
    long numItems = dataModel.getNumItems();
    double logLikelihood = LogLikelihood.logLikelihoodRatio(
        intersectionSize,
        prefs2Size - intersectionSize,
        prefs1Size - intersectionSize,
        numItems - prefs1Size - prefs2Size + intersectionSize);
    return 1.0 - 1.0 / (1.0 + logLikelihood);
}
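
The listing passes a 2x2 contingency table to logLikelihoodRatio(): items both users prefer, items only one of them prefers, and items neither prefers. The sketch below uses the textbook entropy form of the log-likelihood ratio (G-squared) statistic for such a table. It is an assumption that this matches what Mahout's LogLikelihood class computes internally, and the class `LlrSketch` and its method names are illustrative.

```java
/** Illustrative only: textbook log-likelihood ratio for a 2x2 contingency table. */
final class LlrSketch {

    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized entropy: xLogX(sum of counts) minus the sum of xLogX(count). */
    private static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long c : counts) {
            result += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - result;
    }

    /** k11: both, k12 and k21: exactly one, k22: neither. */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    /** Maps the ratio into [0, 1) with the same expression as the listing above. */
    static double similarity(long prefs1Size, long prefs2Size, long intersectionSize, long numItems) {
        double llr = logLikelihoodRatio(
            intersectionSize,
            prefs2Size - intersectionSize,
            prefs1Size - intersectionSize,
            numItems - prefs1Size - prefs2Size + intersectionSize);
        return 1.0 - 1.0 / (1.0 + llr);
    }

    public static void main(String[] args) {
        System.out.println(logLikelihoodRatio(1, 1, 1, 1));
        System.out.println(similarity(10, 10, 10, 20));
    }
}
```

A table in which the two users' preferences look independent, such as 1, 1, 1, 1, yields a ratio of 0 and therefore a similarity of 0. A large overlap relative to the catalogue size pushes the similarity toward 1.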

3.CityBlockSimilarity
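Similarity based on city block (Manhattan) distance. With no preference values, the distance is the number of items preferred by exactly one of the two users. The code is as follows: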

public double userSimilarity(long userID1, long userID2) throws TasteException {
    DataModel dataModel = getDataModel();
    FastIDSet prefs1 = dataModel.getItemIDsFromUser(userID1);
    FastIDSet prefs2 = dataModel.getItemIDsFromUser(userID2);
    int prefs1Size = prefs1.size();
    int prefs2Size = prefs2.size();
    int intersectionSize = prefs1Size < prefs2Size
        ? prefs2.intersectionSize(prefs1)
        : prefs1.intersectionSize(prefs2);
    return doSimilarity(prefs1Size, prefs2Size, intersectionSize);
}

// Calculate city block distance from total non-zero values and intersections,
// and map it to a similarity value.
private static double doSimilarity(int pref1, int pref2, int intersection) {
    int distance = pref1 + pref2 - 2 * intersection;
    return 1.0 / (1.0 + distance);
}
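
A quick way to see what the distance counts is to compute it on plain sets. The sketch below uses `java.util.Set` in place of Mahout's FastIDSet. The class `CityBlockSketch` and its method names are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: city block similarity of two sets of item IDs. */
final class CityBlockSketch {

    /** Number of items that exactly one of the two users prefers. */
    static int distance(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return a.size() + b.size() - 2 * intersection.size();
    }

    static double similarity(Set<Long> a, Set<Long> b) {
        return 1.0 / (1.0 + distance(a, b));
    }

    public static void main(String[] args) {
        Set<Long> user1 = Set.of(1L, 2L, 3L);
        Set<Long> user2 = Set.of(2L, 3L, 4L);
        System.out.println(distance(user1, user2) + " " + similarity(user1, user2));
    }
}
```

For {1, 2, 3} and {2, 3, 4}, items 1 and 4 are each preferred by only one user, so the distance is 2 and the similarity is 1 / 3. Unlike the Tanimoto listing, this measure returns a number even when the sets do not overlap.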

Implementation classes that inherit only AbstractItemSimilarity

1.TrackItemSimilarity
The code is as follows:

public double itemSimilarity(long itemID1, long itemID2) {
    if (itemID1 == itemID2) {
        return 1.0;
    }
    TrackData data1 = trackData.get(itemID1);
    TrackData data2 = trackData.get(itemID2);
    if (data1 == null || data2 == null) {
        return 0.0;
    }
    // Arbitrarily decide that same album means "very similar"
    if (data1.getAlbumID() != TrackData.NO_VALUE_ID
        && data1.getAlbumID() == data2.getAlbumID()) {
        return 0.9;
    }
    // ... and same artist means "fairly similar"
    if (data1.getArtistID() != TrackData.NO_VALUE_ID
        && data1.getArtistID() == data2.getArtistID()) {
        return 0.7;
    }
    // Tanimoto coefficient similarity based on genre, but maximum value of 0.25
    FastIDSet genres1 = data1.getGenreIDs();
    FastIDSet genres2 = data2.getGenreIDs();
    if (genres1 == null || genres2 == null) {
        return 0.0;
    }
    int intersectionSize = genres1.intersectionSize(genres2);
    if (intersectionSize == 0) {
        return 0.0;
    }
    int unionSize = genres1.size() + genres2.size() - intersectionSize;
    return (double) intersectionSize / (4.0 * unionSize);
}

2.HybridSimilarity
Hybrid similarity: the product of LogLikelihoodSimilarity (collaborative filtering) and TrackItemSimilarity (content-based). The code is as follows:

HybridSimilarity(DataModel dataModel, File dataFileDirectory) throws IOException {
    super(dataModel);
    cfSimilarity = new LogLikelihoodSimilarity(dataModel);
    contentSimilarity = new TrackItemSimilarity(dataFileDirectory);
}

public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
    return contentSimilarity.itemSimilarity(itemID1, itemID2) * cfSimilarity.itemSimilarity(itemID1, itemID2);
}
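
The blending idea is independent of the two concrete measures. The sketch below multiplies any two item-similarity functions. The interface `ItemSim` and the class `HybridSketch` are hypothetical stand-ins, not Mahout types, and the constant scores in `main` are made up for illustration.

```java
/** Illustrative only: combines two item-similarity functions by multiplication. */
final class HybridSketch {

    /** Hypothetical minimal stand-in for an item-similarity measure. */
    interface ItemSim {
        double itemSimilarity(long itemID1, long itemID2);
    }

    /** Returns a measure whose score is the product of the two given scores. */
    static ItemSim hybrid(ItemSim content, ItemSim cf) {
        return (a, b) -> content.itemSimilarity(a, b) * cf.itemSimilarity(a, b);
    }

    public static void main(String[] args) {
        ItemSim content = (a, b) -> a == b ? 1.0 : 0.7; // made-up content-based score
        ItemSim cf = (a, b) -> 0.5;                     // made-up collaborative score
        System.out.println(hybrid(content, cf).itemSimilarity(1L, 2L));
    }
}
```

One consequence of multiplying is that a NaN or a zero from either measure dominates the result, so a pair of items scores high only when both measures agree.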
