Interpretation of some similarity algorithms in Mahout

Last Update:2018-07-26 Source: Internet

Author: User

The recommendation algorithms implemented in Mahout are collaborative filtering algorithms, and both UserCF and ItemCF rely on user similarity or item similarity. This article interprets some of the similarity algorithms in Mahout. The relationships among Mahout's similarity classes are as follows:

[Figure: class diagram of Mahout's similarity classes]

A little messy (^.^)

As the figure above shows, Mahout's similarity classes compute user similarity and item similarity. Apart from HybridSimilarity, which can compute item similarity only, they can compute both user similarity and item similarity. The analysis below has three parts: classes that inherit AbstractSimilarity; classes that do not inherit AbstractSimilarity but can still compute both user and item similarity; and classes that inherit only AbstractItemSimilarity.

Implementation classes that inherit AbstractSimilarity

The following classes are used for user behavior data with preference values.
Before introducing EuclideanDistanceSimilarity, PearsonCorrelationSimilarity, and UncenteredCosineSimilarity, let's look at AbstractSimilarity, which is the core of this part. It has three important methods: userSimilarity(), itemSimilarity(), and computeResult(). computeResult() is abstract and is left for subclasses to implement. userSimilarity() and itemSimilarity() first compute a set of running sums, then call the subclass's computeResult(), and finally post-process the returned value to obtain the similarity.
userSimilarity() first accumulates the following sums over the items being compared: sumX and sumX2, the sum and the sum of squares of User1's preference values x; sumY and sumY2, the sum and the sum of squares of User2's preference values y; sumXY, the sum of the products x * y; and sumXYdiff2, the sum of the squared differences between User1's and User2's preference values. The code is as follows:

    sumXY += x * y;
    sumX += x;
    sumX2 += x * x;
    sumY += y;
    sumY2 += y * y;
    double diff = x - y;
    sumXYdiff2 += diff * diff;
    count++;
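To make the accumulation step concrete, here is a minimal, self-contained sketch. The class name RunningSums and the assumption that the two users' preferences arrive as aligned arrays are illustrative only; they are not part of Mahout.

```java
/** Illustrative holder for the running sums; not a Mahout class. */
final class RunningSums {
    double sumXY;
    double sumX;
    double sumX2;
    double sumY;
    double sumY2;
    double sumXYdiff2;
    int count;

    /** Accumulates the sums over aligned arrays: xs[i] and ys[i] refer to the same item. */
    static RunningSums of(double[] xs, double[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("xs and ys must have the same length");
        }
        RunningSums s = new RunningSums();
        for (int i = 0; i < xs.length; i++) {
            double x = xs[i];
            double y = ys[i];
            s.sumXY += x * y;
            s.sumX += x;
            s.sumX2 += x * x;
            s.sumY += y;
            s.sumY2 += y * y;
            double diff = x - y;
            s.sumXYdiff2 += diff * diff;
            s.count++;
        }
        return s;
    }

    public static void main(String[] args) {
        RunningSums s = of(new double[] {1, 2, 3}, new double[] {2, 4, 6});
        System.out.println("sumXY=" + s.sumXY + ", sumXYdiff2=" + s.sumXYdiff2 + ", count=" + s.count);
    }
}
```

For x = (1, 2, 3) and y = (2, 4, 6) the sums are sumXY = 28, sumX = 6, sumX2 = 14, sumY = 12, sumY2 = 56, sumXYdiff2 = 14, and count = 3.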

When one user has a preference for an item and the other user does not, the missing value is inferred by a PreferenceInferrer implementation, using the following code:

    // Only one user expressed a preference, but infer the other one's preference and tally
    // as if the other user expressed that preference
    if (compare < 0) {
        // X has a value; infer Y's
        x = hasPrefTransform
            ? prefTransform.getTransformedValue(xPrefs.get(xPrefIndex))
            : xPrefs.getValue(xPrefIndex);
        y = inferrer.inferPreference(userID2, xIndex);
    } else {
        // compare > 0
        // Y has a value; infer X's
        x = inferrer.inferPreference(userID1, yIndex);
        y = hasPrefTransform
            ? prefTransform.getTransformedValue(yPrefs.get(yPrefIndex))
            : yPrefs.getValue(yPrefIndex);
    }
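The compare < 0 and compare > 0 branches only make sense inside a loop that walks both users' item lists in step. The sketch below is a simplified version of such a merge, assuming each user's item IDs are sorted in ascending order. The Inferrer interface is a hypothetical stand-in for PreferenceInferrer, and the loop omits the preference transform and the other bookkeeping of the real method.

```java
/** Simplified merge over two sorted item lists with inference of missing values. */
final class InferredPairs {
    /** Hypothetical stand-in for Mahout's PreferenceInferrer. */
    interface Inferrer {
        double infer(long userID, long itemID);
    }

    /** Returns {sumXY, count} after pairing every item that either user has rated. */
    static double[] tally(long user1, long[] xItems, double[] xVals,
                          long user2, long[] yItems, double[] yVals,
                          Inferrer inferrer) {
        double sumXY = 0.0;
        int count = 0;
        int i = 0;
        int j = 0;
        while (i < xItems.length || j < yItems.length) {
            int compare;
            if (i >= xItems.length) {
                compare = 1;   // only user 2 has items left
            } else if (j >= yItems.length) {
                compare = -1;  // only user 1 has items left
            } else {
                compare = Long.compare(xItems[i], yItems[j]);
            }
            double x;
            double y;
            if (compare == 0) {
                // Both users rated this item
                x = xVals[i++];
                y = yVals[j++];
            } else if (compare < 0) {
                // X has a value; infer Y's
                x = xVals[i];
                y = inferrer.infer(user2, xItems[i]);
                i++;
            } else {
                // Y has a value; infer X's
                x = inferrer.infer(user1, yItems[j]);
                y = yVals[j];
                j++;
            }
            sumXY += x * y;
            count++;
        }
        return new double[] {sumXY, count};
    }

    public static void main(String[] args) {
        Inferrer constant = (userID, itemID) -> 3.0; // always guess a neutral 3.0
        double[] r = tally(1L, new long[] {1, 2, 3}, new double[] {5, 3, 4},
                           2L, new long[] {2, 3, 4}, new double[] {2, 5, 1}, constant);
        System.out.println("sumXY=" + r[0] + ", count=" + (int) r[1]);
    }
}
```

With a constant inferrer, items rated by only one user still contribute a pair, so four items are tallied here even though the users share only two.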

The computeResult() method is then called. The code is as follows:

    // "Center" the data. If my math is correct, this'll do it.
    double result;
    if (centerData) {
        double meanX = sumX / count;
        double meanY = sumY / count;
        // double centeredSumXY = sumXY - meanY * sumX - meanX * sumY + n * meanX * meanY;
        double centeredSumXY = sumXY - meanY * sumX;
        // double centeredSumX2 = sumX2 - 2.0 * meanX * sumX + n * meanX * meanX;
        double centeredSumX2 = sumX2 - meanX * sumX;
        // double centeredSumY2 = sumY2 - 2.0 * meanY * sumY + n * meanY * meanY;
        double centeredSumY2 = sumY2 - meanY * sumY;
        result = computeResult(count, centeredSumXY, centeredSumX2, centeredSumY2, sumXYdiff2);
    } else {
        result = computeResult(count, sumXY, sumX2, sumY2, sumXYdiff2);
    }
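Each centered sum in the code appears in a long form and a short form, and the two are equal. A short derivation, writing n for count, \bar{x} for meanX, and \bar{y} for meanY (notation introduced here for illustration only):

```latex
\begin{align*}
\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})
  &= \sum_i x_i y_i - \bar{y}\sum_i x_i - \bar{x}\sum_i y_i + n\,\bar{x}\,\bar{y} \\
  &= \sum_i x_i y_i - \bar{y}\sum_i x_i
  && \text{since } \bar{x}\sum_i y_i = n\,\bar{x}\,\bar{y}, \\[4pt]
\sum_{i=1}^{n}(x_i-\bar{x})^2
  &= \sum_i x_i^2 - 2\,\bar{x}\sum_i x_i + n\,\bar{x}^2 \\
  &= \sum_i x_i^2 - \bar{x}\sum_i x_i
  && \text{since } n\,\bar{x}^2 = \bar{x}\sum_i x_i .
\end{align*}
```

The same argument with y in place of x gives the short form of centeredSumY2.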

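Finally, result passes through two more steps, similarityTransform and normalizeWeightResult(), and the outcome is the similarity value of User1 and User2. The excerpt below comes from itemSimilarity(), which is why its arguments are itemID1, itemID2, and cachedNumUsers; userSimilarity() applies the same two steps to the user IDs. The code is as follows:

    if (similarityTransform != null) {
        result = similarityTransform.transformSimilarity(itemID1, itemID2, result);
    }

    if (!Double.isNaN(result)) {
        result = normalizeWeightResult(result, count, cachedNumUsers);
    }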

1. PearsonCorrelationSimilarity
Similarity based on the Pearson correlation: sumXY / sqrt(sumX2 * sumY2), computed on the centered sums. The code is as follows:

    @Override
    double computeResult(int n, double sumXY, double sumX2, double sumY2, double sumXYdiff2) {
        if (n == 0) {
            return Double.NaN;
        }
        // Note that sum of X and sum of Y don't appear here since they are assumed to be 0;
        // the data is assumed to be centered.
        double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
        if (denominator == 0.0) {
            // One or both parties has -all- the same ratings;
            // can't really say much similarity under this measure
            return Double.NaN;
        }
        return sumXY / denominator;
    }
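As an end-to-end illustration, the sketch below combines the accumulation, the centering shortcut, and the Pearson formula for two aligned preference arrays. PearsonSketch and its array-based signature are assumptions made for this example, not Mahout's API.

```java
/** Pearson correlation from running sums, using the centering shortcut; illustrative only. */
final class PearsonSketch {
    static double pearson(double[] xs, double[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("xs and ys must have the same length");
        }
        int n = xs.length;
        if (n == 0) {
            return Double.NaN;
        }
        double sumX = 0.0;
        double sumY = 0.0;
        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += xs[i];
            sumY += ys[i];
            sumXY += xs[i] * ys[i];
            sumX2 += xs[i] * xs[i];
            sumY2 += ys[i] * ys[i];
        }
        double meanX = sumX / n;
        double meanY = sumY / n;
        // Centered sums, in the short form
        double centeredSumXY = sumXY - meanY * sumX;
        double centeredSumX2 = sumX2 - meanX * sumX;
        double centeredSumY2 = sumY2 - meanY * sumY;
        double denominator = Math.sqrt(centeredSumX2) * Math.sqrt(centeredSumY2);
        if (denominator == 0.0) {
            return Double.NaN; // at least one side has no variance
        }
        return centeredSumXY / denominator;
    }

    public static void main(String[] args) {
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {2, 4, 6}));
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {6, 4, 2}));
        System.out.println(pearson(new double[] {3, 3, 3}, new double[] {1, 2, 3}));
    }
}
```

The three calls in main cover a perfectly correlated pair, a perfectly anti-correlated pair, and a user whose ratings are all identical, which yields NaN because the denominator is zero.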

2. EuclideanDistanceSimilarity
Similarity based on Euclidean distance: 1 / (1 + distance / sqrt(n)), where distance = sqrt(sumXYdiff2) and n is the number of preference pairs compared. The code is as follows:

    @Override
    double computeResult(int n, double sumXY, double sumX2, double sumY2, double sumXYdiff2) {
        return 1.0 / (1.0 + Math.sqrt(sumXYdiff2) / Math.sqrt(n));
    }
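A small worked sketch of this formula on aligned arrays follows. EuclideanSketch is a hypothetical name, and the array-based signature is an assumption for the example.

```java
/** Euclidean-distance similarity with the distance scaled by sqrt(n); illustrative only. */
final class EuclideanSketch {
    static double similarity(double[] xs, double[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("xs and ys must have the same length");
        }
        int n = xs.length;
        double sumXYdiff2 = 0.0;
        for (int i = 0; i < n; i++) {
            double diff = xs[i] - ys[i];
            sumXYdiff2 += diff * diff;
        }
        // Identical vectors give distance 0 and therefore similarity 1
        return 1.0 / (1.0 + Math.sqrt(sumXYdiff2) / Math.sqrt(n));
    }

    public static void main(String[] args) {
        System.out.println(similarity(new double[] {1, 2, 3}, new double[] {1, 2, 3}));
        System.out.println(similarity(new double[] {0, 0}, new double[] {3, 4}));
    }
}
```

For (0, 0) against (3, 4) the distance is 5 over n = 2 pairs, so the similarity is 1 / (1 + 5 / sqrt(2)). Dividing by sqrt(n) keeps users with many rated items in common from being penalized merely for having more terms in the sum.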

3. UncenteredCosineSimilarity
Similarity based on the cosine of the angle between the two preference vectors, without centering. The code is as follows:

    @Override
    double computeResult(int n, double sumXY, double sumX2, double sumY2, double sumXYdiff2) {
        if (n == 0) {
            return Double.NaN;
        }
        double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
        if (denominator == 0.0) {
            // One or both parties has -all- the same ratings;
            // can't really say much similarity under this measure
            return Double.NaN;
        }
        return sumXY / denominator;
    }
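The formula is the same as the Pearson one; the difference lies in the inputs, which are the raw sums rather than the centered ones. A minimal sketch on aligned arrays (CosineSketch is a hypothetical name used only here):

```java
/** Uncentered cosine similarity computed from raw sums; illustrative only. */
final class CosineSketch {
    static double cosine(double[] xs, double[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("xs and ys must have the same length");
        }
        if (xs.length == 0) {
            return Double.NaN;
        }
        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (int i = 0; i < xs.length; i++) {
            sumXY += xs[i] * ys[i];
            sumX2 += xs[i] * xs[i];
            sumY2 += ys[i] * ys[i];
        }
        double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
        if (denominator == 0.0) {
            return Double.NaN; // at least one vector is all zeros
        }
        return sumXY / denominator;
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[] {1, 0}, new double[] {0, 1}));
        System.out.println(cosine(new double[] {1, 2}, new double[] {2, 4}));
    }
}
```

Orthogonal vectors such as (1, 0) and (0, 1) score 0, and vectors pointing in the same direction such as (1, 2) and (2, 4) score 1 regardless of their lengths.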

Implementation classes that inherit AbstractItemSimilarity and implement UserSimilarity

The following classes are used for user behavior data with no preference values.
1. TanimotoCoefficientSimilarity
Similarity based on the Tanimoto coefficient: the size of the intersection of userID1's and userID2's preferred items divided by the size of their union (item similarity is computed in the same way). The code is as follows:

    public double userSimilarity(long userID1, long userID2) throws TasteException {
        DataModel dataModel = getDataModel();
        FastIDSet xPrefs = dataModel.getItemIDsFromUser(userID1);
        FastIDSet yPrefs = dataModel.getItemIDsFromUser(userID2);
        int xPrefsSize = xPrefs.size();
        int yPrefsSize = yPrefs.size();
        if (xPrefsSize == 0 && yPrefsSize == 0) {
            return Double.NaN;
        }
        if (xPrefsSize == 0 || yPrefsSize == 0) {
            return 0.0;
        }
        // Compute the intersection size
        int intersectionSize =
            xPrefsSize < yPrefsSize ? yPrefs.intersectionSize(xPrefs) : xPrefs.intersectionSize(yPrefs);
        if (intersectionSize == 0) {
            return Double.NaN;
        }
        // Compute the union size
        int unionSize = xPrefsSize + yPrefsSize - intersectionSize;
        return (double) intersectionSize / (double) unionSize;
    }
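The same computation can be sketched with plain java.util sets. TanimotoSketch is a hypothetical class that uses Set<Long> in place of FastIDSet, and it mirrors the NaN and 0.0 special cases of the method above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Tanimoto (Jaccard) coefficient on plain sets; illustrative only. */
final class TanimotoSketch {
    static double similarity(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return Double.NaN;
        }
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        // Iterate over the smaller set and probe the larger one
        Set<Long> small = a.size() < b.size() ? a : b;
        Set<Long> large = small == a ? b : a;
        int intersectionSize = 0;
        for (Long id : small) {
            if (large.contains(id)) {
                intersectionSize++;
            }
        }
        if (intersectionSize == 0) {
            return Double.NaN;
        }
        int unionSize = a.size() + b.size() - intersectionSize;
        return (double) intersectionSize / unionSize;
    }

    public static void main(String[] args) {
        Set<Long> u1 = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> u2 = new HashSet<>(Arrays.asList(2L, 3L, 4L));
        System.out.println(similarity(u1, u2));
    }
}
```

For item sets {1, 2, 3} and {2, 3, 4} the intersection has 2 elements and the union has 4, so the coefficient is 0.5.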

2. LogLikelihoodSimilarity
Similarity based on the log-likelihood ratio, an upgraded version of the Tanimoto-coefficient similarity. The code is as follows:

    public double userSimilarity(long userID1, long userID2) throws TasteException {
        DataModel dataModel = getDataModel();
        FastIDSet prefs1 = dataModel.getItemIDsFromUser(userID1);
        FastIDSet prefs2 = dataModel.getItemIDsFromUser(userID2);
        long prefs1Size = prefs1.size();
        long prefs2Size = prefs2.size();
        // Compute the intersection size
        long intersectionSize =
            prefs1Size < prefs2Size ? prefs2.intersectionSize(prefs1) : prefs1.intersectionSize(prefs2);
        if (intersectionSize == 0) {
            return Double.NaN;
        }
        long numItems = dataModel.getNumItems();
        double logLikelihood =
            LogLikelihood.logLikelihoodRatio(intersectionSize,
                                             prefs2Size - intersectionSize,
                                             prefs1Size - intersectionSize,
                                             numItems - prefs1Size - prefs2Size + intersectionSize);
        return 1.0 - 1.0 / (1.0 + logLikelihood);
    }
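The four arguments passed to logLikelihoodRatio form a 2x2 contingency table: items both users have, items only the second has, items only the first has, and items neither has. The sketch below computes the standard log-likelihood ratio (G-squared) statistic for such a table from unnormalized entropies and then applies the 1 - 1 / (1 + llr) mapping. It is an assumption that this matches what Mahout's LogLikelihood class does internally; LlrSketch and its method names are hypothetical.

```java
/** Log-likelihood ratio for a 2x2 table of counts, mapped to [0, 1); illustrative only. */
final class LlrSketch {
    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized entropy of the counts: N*ln(N) - sum(k*ln(k)), with N the total. */
    private static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            parts += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - parts;
    }

    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    static double similarity(long both, long onlySecond, long onlyFirst, long neither) {
        double llr = logLikelihoodRatio(both, onlySecond, onlyFirst, neither);
        return 1.0 - 1.0 / (1.0 + llr);
    }

    public static void main(String[] args) {
        System.out.println(similarity(1, 1, 1, 1));     // overlap is what chance would predict
        System.out.println(similarity(10, 0, 0, 10));   // overlap far beyond chance
    }
}
```

Unlike the Tanimoto coefficient, this measure takes the total number of items into account: an overlap that is no larger than chance would predict maps to a similarity near 0, while an unlikely overlap pushes the value toward 1.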

3. CityBlockSimilarity
Similarity based on city block (Manhattan) distance. The code is as follows:

    public double userSimilarity(long userID1, long userID2) throws TasteException {
        DataModel dataModel = getDataModel();
        FastIDSet prefs1 = dataModel.getItemIDsFromUser(userID1);
        FastIDSet prefs2 = dataModel.getItemIDsFromUser(userID2);
        int prefs1Size = prefs1.size();
        int prefs2Size = prefs2.size();
        int intersectionSize =
            prefs1Size < prefs2Size ? prefs2.intersectionSize(prefs1) : prefs1.intersectionSize(prefs2);
        return doSimilarity(prefs1Size, prefs2Size, intersectionSize);
    }

    // Calculate City Block Distance from total non-zero values and intersections
    // and map it to a similarity value.
    private static double doSimilarity(int pref1, int pref2, int intersection) {
        int distance = pref1 + pref2 - 2 * intersection;
        return 1.0 / (1.0 + distance);
    }
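The expression pref1 + pref2 - 2 * intersection is the Manhattan distance between the two users' 0/1 preference vectors: every item that exactly one user has contributes 1, and shared items contribute 0. The sketch below checks this on explicit vectors; CityBlockSketch and its method names are hypothetical.

```java
/** Shows that the set-size formula equals the L1 distance of 0/1 vectors; illustrative only. */
final class CityBlockSketch {
    /** Manhattan (L1) distance between two 0/1 preference vectors of equal length. */
    static int manhattan(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < a.length; i++) {
            distance += Math.abs(a[i] - b[i]);
        }
        return distance;
    }

    /** The same distance computed from set sizes only. */
    static int distanceFromSizes(int size1, int size2, int intersection) {
        return size1 + size2 - 2 * intersection;
    }

    static double similarity(int distance) {
        return 1.0 / (1.0 + distance);
    }

    public static void main(String[] args) {
        // Items 1..4; user 1 has {1, 2, 3}, user 2 has {2, 3, 4}
        int[] u1 = {1, 1, 1, 0};
        int[] u2 = {0, 1, 1, 1};
        System.out.println(manhattan(u1, u2) + " == " + distanceFromSizes(3, 3, 2));
        System.out.println(similarity(manhattan(u1, u2)));
    }
}
```

Working from set sizes avoids materializing a vector over every item in the data model, which matters when users have rated only a small fraction of the catalogue.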

Implementation classes that inherit only AbstractItemSimilarity

1. TrackItemSimilarity
The code is as follows:

    public double itemSimilarity(long itemID1, long itemID2) {
        if (itemID1 == itemID2) {
            return 1.0;
        }
        TrackData data1 = trackData.get(itemID1);
        TrackData data2 = trackData.get(itemID2);
        if (data1 == null || data2 == null) {
            return 0.0;
        }
        // Arbitrarily decide that same album means "very similar"
        if (data1.getAlbumID() != TrackData.NO_VALUE_ID && data1.getAlbumID() == data2.getAlbumID()) {
            return 0.9;
        }
        // ... and same artist means "fairly similar"
        if (data1.getArtistID() != TrackData.NO_VALUE_ID && data1.getArtistID() == data2.getArtistID()) {
            return 0.7;
        }
        // Tanimoto coefficient similarity based on genre, but maximum value of 0.25
        FastIDSet genres1 = data1.getGenreIDs();
        FastIDSet genres2 = data2.getGenreIDs();
        if (genres1 == null || genres2 == null) {
            return 0.0;
        }
        int intersectionSize = genres1.intersectionSize(genres2);
        if (intersectionSize == 0) {
            return 0.0;
        }
        int unionSize = genres1.size() + genres2.size() - intersectionSize;
        return (double) intersectionSize / (4.0 * unionSize);
    }
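The method is a tiered rule: same album, then same artist, then genre overlap scaled down by a factor of 4. The sketch below reproduces the tiers on a simplified data type. The Track class and the NO_VALUE sentinel are hypothetical stand-ins for TrackData and TrackData.NO_VALUE_ID, whose real definitions may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Tiered content similarity for music tracks; illustrative only. */
final class TrackRuleSketch {
    /** Hypothetical sentinel meaning "album or artist unknown". */
    static final long NO_VALUE = -1L;

    /** Hypothetical, simplified stand-in for TrackData. */
    static final class Track {
        final long albumId;
        final long artistId;
        final Set<Long> genres;

        Track(long albumId, long artistId, Long... genres) {
            this.albumId = albumId;
            this.artistId = artistId;
            this.genres = new HashSet<>(Arrays.asList(genres));
        }
    }

    static double similarity(Track a, Track b) {
        if (a.albumId != NO_VALUE && a.albumId == b.albumId) {
            return 0.9; // same album: "very similar"
        }
        if (a.artistId != NO_VALUE && a.artistId == b.artistId) {
            return 0.7; // same artist: "fairly similar"
        }
        Set<Long> common = new HashSet<>(a.genres);
        common.retainAll(b.genres);
        if (common.isEmpty()) {
            return 0.0;
        }
        int unionSize = a.genres.size() + b.genres.size() - common.size();
        // Tanimoto coefficient on genres, capped at 0.25 by the factor 4
        return common.size() / (4.0 * unionSize);
    }

    public static void main(String[] args) {
        Track base = new Track(10, 20, 1L, 2L);
        System.out.println(similarity(base, new Track(10, 21, 3L)));     // same album
        System.out.println(similarity(base, new Track(11, 20, 3L)));     // same artist
        System.out.println(similarity(base, new Track(12, 22, 2L, 3L))); // one shared genre
    }
}
```

The fixed values 0.9 and 0.7 and the cap of 0.25 order the three signals so that a genre match alone can never outrank an artist or album match.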

2. HybridSimilarity
Hybrid similarity: LogLikelihoodSimilarity * TrackItemSimilarity. The code is as follows:

    HybridSimilarity(DataModel dataModel, File dataFileDirectory) throws IOException {
        super(dataModel);
        cfSimilarity = new LogLikelihoodSimilarity(dataModel);
        contentSimilarity = new TrackItemSimilarity(dataFileDirectory);
    }

    public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
        return contentSimilarity.itemSimilarity(itemID1, itemID2) * cfSimilarity.itemSimilarity(itemID1, itemID2);
    }
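Multiplying the two scores means the content-based score acts as a gate on the collaborative-filtering score. A minimal sketch of the combination, using a hypothetical single-method ItemSim interface rather than Mahout's ItemSimilarity:

```java
/** Product of a content-based and a collaborative-filtering similarity; illustrative only. */
final class HybridSketch {
    /** Hypothetical minimal interface; Mahout's ItemSimilarity declares more methods. */
    interface ItemSim {
        double itemSimilarity(long itemID1, long itemID2);
    }

    static ItemSim hybrid(ItemSim content, ItemSim cf) {
        return (a, b) -> content.itemSimilarity(a, b) * cf.itemSimilarity(a, b);
    }

    public static void main(String[] args) {
        ItemSim sameAlbum = (a, b) -> 0.9;  // stand-in for a content score
        ItemSim coRated = (a, b) -> 0.5;    // stand-in for a collaborative score
        System.out.println(hybrid(sameAlbum, coRated).itemSimilarity(1L, 2L));
    }
}
```

Two consequences follow from the product: a content score of 0.0 zeroes the result no matter how strongly the items are co-rated, and a NaN from either side propagates to the hybrid value.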

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
