Reprinted from (http://www.geek521.com/?p=1423)
Mahout's recommendation algorithms fall into the following major categories.
GenericUserBasedRecommender
Algorithm:
1. Based on similarity between users
2. Depends on how the neighborhood of similar users is defined and on its size
Characteristics:
1. Easy to understand
2. Fast when the number of users is small
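As a rough illustration of the user-based idea, the sketch below estimates a preference as the similarity-weighted average of the ratings given by the n most similar users who rated the item. It is plain Java and is not Mahout's GenericUserBasedRecommender; the class name UserBasedSketch, the nested-map data layout, and the 1 / (1 + Euclidean distance) similarity are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical, simplified user-based estimator.
class UserBasedSketch {

    // Assumed similarity: 1 / (1 + Euclidean distance) over co-rated items.
    static double similarity(Map<Long, Double> a, Map<Long, Double> b) {
        double sumSq = 0.0;
        int common = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                double d = e.getValue() - other;
                sumSq += d * d;
                common++;
            }
        }
        return common == 0 ? 0.0 : 1.0 / (1.0 + Math.sqrt(sumSq));
    }

    // prefs maps userID -> (itemID -> rating). Returns NaN if no neighbor rated the item.
    static double estimate(long user, long item, int n, Map<Long, Map<Long, Double>> prefs) {
        Map<Long, Double> mine = prefs.get(user);
        List<double[]> candidates = new ArrayList<>(); // each entry is {similarity, rating}
        for (Map.Entry<Long, Map<Long, Double>> e : prefs.entrySet()) {
            if (e.getKey() == user) {
                continue; // skip the target user
            }
            Double rating = e.getValue().get(item);
            if (rating == null) {
                continue; // this user did not rate the item
            }
            double sim = similarity(mine, e.getValue());
            if (sim > 0.0) {
                candidates.add(new double[] {sim, rating});
            }
        }
        // Most similar users first.
        candidates.sort(Comparator.comparingDouble((double[] c) -> c[0]).reversed());

        double weighted = 0.0;
        double total = 0.0;
        for (int i = 0; i < Math.min(n, candidates.size()); i++) {
            weighted += candidates.get(i)[0] * candidates.get(i)[1];
            total += candidates.get(i)[0];
        }
        return total == 0.0 ? Double.NaN : weighted / total;
    }
}
```

Every estimate scans all users, which is why this approach is fast only while the number of users stays small.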
GenericItemBasedRecommender
Algorithm:
1. Based on similarity between items
Characteristics:
1. Faster when the number of items is small
2. Very useful when an external notion of item similarity is easy to understand and obtain
SlopeOneRecommender (item-based)
Algorithm:
1. Based on the Slope One algorithm (uses the average rating difference between pairs of items)
Characteristics:
1. Fast at recommendation time
2. Requires precomputation
3. Works well when the number of items is small
4. The number of stored diffs must be limited, otherwise memory use grows too quickly
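To make the precomputation and the diff storage concrete, here is a minimal sketch of weighted Slope One in plain Java. It is not Mahout's SlopeOneRecommender; the class name SlopeOneSketch and the nested-map layout are assumptions for illustration. The constructor is the precomputation step, and the diffs and counts maps hold one entry per ordered pair of co-rated items, which is where the memory growth comes from.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a hypothetical in-memory weighted Slope One.
class SlopeOneSketch {
    // diffs.get(i).get(j) = average of (rating of i - rating of j) over users who rated both.
    final Map<Long, Map<Long, Double>> diffs = new HashMap<>();
    // counts.get(i).get(j) = number of users who rated both i and j.
    final Map<Long, Map<Long, Integer>> counts = new HashMap<>();

    // Precomputation: prefs maps userID -> (itemID -> rating).
    SlopeOneSketch(Map<Long, Map<Long, Double>> prefs) {
        for (Map<Long, Double> ratings : prefs.values()) {
            for (Map.Entry<Long, Double> a : ratings.entrySet()) {
                for (Map.Entry<Long, Double> b : ratings.entrySet()) {
                    if (a.getKey().equals(b.getKey())) {
                        continue;
                    }
                    diffs.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                         .merge(b.getKey(), a.getValue() - b.getValue(), Double::sum);
                    counts.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                          .merge(b.getKey(), 1, Integer::sum);
                }
            }
        }
        // Turn the accumulated sums into averages.
        for (Map.Entry<Long, Map<Long, Double>> row : diffs.entrySet()) {
            Map<Long, Integer> rowCounts = counts.get(row.getKey());
            row.getValue().replaceAll((j, sum) -> sum / rowCounts.get(j));
        }
    }

    // Estimate a rating for item from the user's known ratings; NaN if nothing applies.
    double estimate(Map<Long, Double> userRatings, long item) {
        Map<Long, Double> row = diffs.get(item);
        if (row == null) {
            return Double.NaN;
        }
        double weighted = 0.0;
        int total = 0;
        for (Map.Entry<Long, Double> r : userRatings.entrySet()) {
            Double diff = row.get(r.getKey());
            if (diff == null) {
                continue; // no user rated both items
            }
            int c = counts.get(item).get(r.getKey());
            weighted += (r.getValue() + diff) * c; // weight by number of co-ratings
            total += c;
        }
        return total == 0 ? Double.NaN : weighted / total;
    }
}
```

With m items the two maps can hold up to m * (m - 1) entries each, so the approach stays practical only while the item count is small or the stored diffs are capped.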
SVDRecommender (item-based)
Algorithm:
1. Singular value decomposition (items are represented as feature vectors and scored along each dimension)
Characteristics:
1. Needs an estimation step
2. Recommendation quality is good
KnnItemBasedRecommender (item-based)
Works like the user-based neighborhood approach of GenericUserBasedRecommender, but over similar items.
The main difference from GenericItemBasedRecommender is how the weights are calculated: the weights are not the result of a similarity metric. Instead, the algorithm computes the optimal set of weights to use between all pairs of items (which looks laborious).
TreeClusteringRecommender
Algorithm:
1. Recommendation based on tree-based clustering
Characteristics:
1. Well suited when the number of users is small
2. Fast at recommendation time
3. Requires precomputation
Model-based recommendation algorithms and recommendation algorithms based on degree of satisfaction are not implemented.
Data input in Mahout
DataModel
Implementations include the following.
GenericDataModel
In-memory data interface class
Internally uses a FastByIDMap to hold PreferenceArray objects; each PreferenceArray stores one user's preference values for items
GenericBooleanPrefDataModel
In-memory data interface class
Stores no preference values
Uses FastByIDMap<FastIDSet> to hold the items associated with each user and the users associated with each item
FileDataModel
File-based data interface; internally uses a GenericDataModel to hold the actual preference data
Supports compressed files (.zip, .gz)
Supports dynamic updates: update files must follow a naming pattern, e.g. for foo.txt.gz a subsequent update file must be named foo.1.txt.gz
Judging from the code, a reload happens only after a configurable minimum interval, but it appears to reload all the data rather than only the changes (see the code later)
JDBCDataModel
Database-based data interface. The current implementation is MySQLJDBCDataModel (supports MySQL 5.x); a MySQLJDBCDataModel can be created from a MysqlDataSource
Note: in version 0.7 I could not find the MySQLJDBCDataModel class, but there is an additional MySQLJDBCIDMigrator
I don't know how the two are related.
PlusAnonymousUserDataModel
Data model for making recommendations to anonymous users; it treats all anonymous users as a single user and wraps another DataModel internally
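The storage idea behind GenericBooleanPrefDataModel (no preference values, only sets of IDs indexed in both directions) can be sketched in plain Java as follows. This is only an analogy built on java.util collections; Mahout itself uses FastByIDMap and FastIDSet, and the class and method names here are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: boolean preferences kept as two ID-set indexes.
class BooleanPrefModelSketch {
    private final Map<Long, Set<Long>> itemsByUser = new HashMap<>();
    private final Map<Long, Set<Long>> usersByItem = new HashMap<>();

    // Record that the user is associated with the item; there is no rating value.
    void addPreference(long user, long item) {
        itemsByUser.computeIfAbsent(user, k -> new HashSet<>()).add(item);
        usersByItem.computeIfAbsent(item, k -> new HashSet<>()).add(user);
    }

    // All items associated with a user (empty if the user is unknown).
    Set<Long> itemsFor(long user) {
        return itemsByUser.getOrDefault(user, Set.of());
    }

    // All users associated with an item (empty if the item is unknown).
    Set<Long> usersFor(long item) {
        return usersByItem.getOrDefault(item, Set.of());
    }
}
```

Keeping both indexes doubles the storage but makes lookups by user and by item equally cheap.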
Similarity calculation in Mahout
Similarity is primarily user-based or item-based.
GenericItemSimilarity contains the inner class GenericItemSimilarity.ItemItemSimilarity
GenericUserSimilarity contains the inner class GenericUserSimilarity.UserUserSimilarity
These keep the similarity results in memory, using FastByIDMap<FastByIDMap<Double>>
CachingItemSimilarity
CachingUserSimilarity
These cache computed similarities so that the calculation is not repeated on every request
Internally they use a Cache<LongPair,Double> named similarityCache to hold the similarities
How their usage differs from GenericUserSimilarity is not yet clear to me
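The caching idea can be sketched as a wrapper that remembers each pair's similarity after the first computation. This is a plain-Java illustration, not Mahout's CachingUserSimilarity: the class name, the use of an unbounded HashMap, and the List<Long> key standing in for LongPair are assumptions, and the similarity is assumed to be symmetric so that (a, b) and (b, a) share one entry.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

// Illustrative only: memoizes a symmetric pairwise similarity function.
class CachingSimilaritySketch {
    private final ToDoubleBiFunction<Long, Long> delegate;
    private final Map<List<Long>, Double> cache = new HashMap<>();
    int delegateCalls = 0; // exposed so the effect of caching can be observed

    CachingSimilaritySketch(ToDoubleBiFunction<Long, Long> delegate) {
        this.delegate = delegate;
    }

    double similarity(long a, long b) {
        // Order the IDs so both argument orders map to the same cache key.
        long lo = Math.min(a, b);
        long hi = Math.max(a, b);
        List<Long> key = List.of(lo, hi);
        Double cached = cache.get(key);
        if (cached == null) {
            delegateCalls++;
            cached = delegate.applyAsDouble(lo, hi);
            cache.put(key, cached);
        }
        return cached;
    }
}
```

A production cache would also bound its size and be invalidated when the underlying data is refreshed; both are omitted here.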
Similarity measures implemented in Mahout, by algorithm:
PearsonCorrelationSimilarity: Pearson correlation
EuclideanDistanceSimilarity: Euclidean distance
CosineMeasureSimilarity: cosine similarity (renamed UncenteredCosineSimilarity in 0.7)
SpearmanCorrelationSimilarity: Spearman rank correlation
TanimotoCoefficientSimilarity: Tanimoto coefficient
LogLikelihoodSimilarity: generally better than TanimotoCoefficientSimilarity (I do not yet understand why)
CityBlockSimilarity: based on Manhattan distance
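For reference, the textbook formulas behind four of these measures can be sketched in plain Java. These are not Mahout's implementations, which handle weighting and missing overlap in their own way; the class name SimilaritySketch and the choice to map Euclidean distance to a similarity as 1 / (1 + distance) are assumptions. The array-based methods expect the two users' ratings for the same co-rated items in the same order.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: textbook versions of common similarity measures.
class SimilaritySketch {

    // Pearson correlation; NaN when either vector has zero variance.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0.0, sumY = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
        }
        double meanX = sumX / n, meanY = sumY / n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Cosine of the angle between the vectors, without mean-centering.
    static double uncenteredCosine(double[] x, double[] y) {
        double dot = 0.0, xx = 0.0, yy = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            xx += x[i] * x[i];
            yy += y[i] * y[i];
        }
        return dot / Math.sqrt(xx * yy);
    }

    // Assumed mapping from Euclidean distance to a similarity in (0, 1].
    static double euclidean(double[] x, double[] y) {
        double sumSq = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sumSq += d * d;
        }
        return 1.0 / (1.0 + Math.sqrt(sumSq));
    }

    // Tanimoto (Jaccard) coefficient on item sets: |A and B| / |A or B|.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return union == 0 ? 0.0 : (double) intersection.size() / union;
    }
}
```

Pearson and cosine look only at the shape of the ratings, Euclidean also reacts to their magnitude, and Tanimoto ignores rating values entirely, which is why it suits boolean preference data.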
Typical use of similarity
// model is an existing DataModel
UserSimilarity similarity = new CachingUserSimilarity(
    new SpearmanCorrelationSimilarity(model), model);
Handling missing data
PreferenceInferrer: when data is missing or too sparse, AveragingPreferenceInferrer can be used to fill in missing preferences with an average value
In general a PreferenceInferrer has no real effect on the recommendations other than increasing the amount of computation, because the inferred values are derived from the existing data; it is therefore mostly used in research.
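The averaging idea is small enough to show directly. The sketch below returns a known rating when there is one and otherwise falls back to the mean of the user's existing ratings. It is a plain-Java illustration, not Mahout's AveragingPreferenceInferrer; the class and method names are hypothetical.

```java
import java.util.Map;

// Illustrative only: fill a missing preference with the user's mean rating.
class AveragingInferrerSketch {

    // userRatings maps itemID -> rating. Returns NaN if the user has no ratings at all.
    static double inferPreference(Map<Long, Double> userRatings, long item) {
        Double known = userRatings.get(item);
        if (known != null) {
            return known; // nothing to infer
        }
        if (userRatings.isEmpty()) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (double rating : userRatings.values()) {
            sum += rating;
        }
        return sum / userRatings.size();
    }
}
```

Because the filled-in value is computed entirely from ratings the user already has, it adds no new information.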
Similarity between clusters
ClusterSimilarity
Cluster similarity measures the distance between two different clusters (analogous to distance in a coordinate system)
Only the following two implementations currently exist (there is no better algorithm for now)
NearestNeighborClusterSimilarity: takes the minimum of the distances between all pairs of items drawn from the two clusters
FarthestNeighborClusterSimilarity: takes the maximum of the distances between all pairs of items drawn from the two clusters
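The two rules correspond to what is usually called single linkage and complete linkage. A minimal sketch in plain Java, assuming cluster members are points in a coordinate space compared by Euclidean distance (Mahout's classes work on a similarity between members instead, and the names below are hypothetical):

```java
// Illustrative only: nearest- and farthest-neighbor distance between two clusters.
class ClusterLinkageSketch {

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] p, double[] q) {
        double sumSq = 0.0;
        for (int i = 0; i < p.length; i++) {
            double d = p[i] - q[i];
            sumSq += d * d;
        }
        return Math.sqrt(sumSq);
    }

    // Minimum distance over all cross-cluster pairs (single linkage).
    static double nearestNeighbor(double[][] a, double[][] b) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] p : a) {
            for (double[] q : b) {
                best = Math.min(best, distance(p, q));
            }
        }
        return best;
    }

    // Maximum distance over all cross-cluster pairs (complete linkage).
    static double farthestNeighbor(double[][] a, double[][] b) {
        double worst = Double.NEGATIVE_INFINITY;
        for (double[] p : a) {
            for (double[] q : b) {
                worst = Math.max(worst, distance(p, q));
            }
        }
        return worst;
    }
}
```

Both rules compare every member of one cluster with every member of the other, so the cost of one comparison grows with the product of the two cluster sizes.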
Mahout Recommendation Algorithm Basics