Mahout Recommendation Algorithm Basics

Last Update:2014-11-13 Source: Internet

Author: User

Reprinted from (http://www.geek521.com/?p=1423)

Mahout's recommendation algorithms fall into the following major categories.

GenericUserBasedRecommender

Algorithm:

1. Based on user similarity

2. Requires a neighborhood definition and size (which users count as similar, and how many of them)

Characteristics:

1. Easy to understand

2. Fast when the number of users is small
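
The idea can be sketched in plain Java without Mahout. This is only an illustration under assumptions, not Mahout's implementation: the class name UserBasedSketch, the nested-map data layout, and the choice of cosine similarity over co-rated items are all hypothetical. The neighborhood is the n most similar users who have rated the item.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not Mahout's GenericUserBasedRecommender.
class UserBasedSketch {

    // Cosine similarity computed over the items both users have rated.
    static double similarity(Map<Long, Double> a, Map<Long, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double y = b.get(e.getKey());
            if (y == null) continue; // item not rated by both users
            dot += e.getValue() * y;
            na += e.getValue() * e.getValue();
            nb += y * y;
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    // Estimates a user's preference for an item as the similarity-weighted
    // average rating of the n most similar users who rated that item.
    static double estimate(Map<Long, Map<Long, Double>> prefs, long user, long item, int n) {
        Map<Long, Double> mine = prefs.get(user);
        List<double[]> candidates = new ArrayList<>(); // each entry: {similarity, rating}
        for (Map.Entry<Long, Map<Long, Double>> e : prefs.entrySet()) {
            if (e.getKey() == user) continue;
            Double rating = e.getValue().get(item);
            if (rating == null) continue;
            double s = similarity(mine, e.getValue());
            if (s > 0) candidates.add(new double[] {s, rating});
        }
        candidates.sort((x, y) -> Double.compare(y[0], x[0])); // most similar first
        double num = 0, den = 0;
        for (double[] c : candidates.subList(0, Math.min(n, candidates.size()))) {
            num += c[0] * c[1];
            den += c[0];
        }
        return den == 0 ? Double.NaN : num / den;
    }
}
```

Every estimate compares the target user against all other users, which is why this approach is fast only while the number of users stays small.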

GenericItemBasedRecommender

Algorithm:

1. Based on item similarity

Characteristics:

1. Even faster when the number of items is small

2. Very useful when an external notion of item similarity is easy to understand and obtain
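
A minimal plain-Java sketch of the item-based idea, assuming the item-item similarity comes from some external source (for example item metadata). The names ItemBasedSketch and ItemSim are hypothetical and this is not Mahout's code.

```java
import java.util.Map;

// Hypothetical illustration; not Mahout's GenericItemBasedRecommender.
class ItemBasedSketch {

    // Any source of item-item similarity, e.g. one derived from item metadata.
    interface ItemSim {
        double between(long a, long b);
    }

    // Estimates the preference for the target item as the similarity-weighted
    // average of the ratings the user has already given to other items.
    static double estimate(Map<Long, Double> userPrefs, long target, ItemSim sim) {
        double num = 0, den = 0;
        for (Map.Entry<Long, Double> e : userPrefs.entrySet()) {
            double s = sim.between(target, e.getKey());
            if (s <= 0) continue; // ignore dissimilar items
            num += s * e.getValue();
            den += s;
        }
        return den == 0 ? Double.NaN : num / den;
    }
}
```

Only the items the user has already rated are visited, so the cost depends on the item side rather than on the number of users.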

SlopeOneRecommender (item-based)

Algorithm:

1. Based on the Slope One algorithm (a rating-difference rule)

Characteristics:

Fast

Requires precomputation

Works well when the number of items is small

The number of stored diffs must be limited, otherwise memory use grows too fast
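
The precomputation and the diff storage can be sketched as follows. This is a plain-Java illustration of weighted Slope One under assumptions (the class name SlopeOneSketch and the nested-map layout are hypothetical), not Mahout's SlopeOneRecommender.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of weighted Slope One; not Mahout's implementation.
class SlopeOneSketch {
    // diffs.get(i).get(j): average of (rating of i - rating of j) over users who rated both.
    final Map<Long, Map<Long, Double>> diffs = new HashMap<>();
    // counts.get(i).get(j): number of users who rated both i and j.
    final Map<Long, Map<Long, Integer>> counts = new HashMap<>();

    // Precomputation step: one diff entry per ordered pair of co-rated items.
    SlopeOneSketch(Collection<Map<Long, Double>> allUserPrefs) {
        for (Map<Long, Double> prefs : allUserPrefs) {
            for (Map.Entry<Long, Double> a : prefs.entrySet()) {
                for (Map.Entry<Long, Double> b : prefs.entrySet()) {
                    if (a.getKey().equals(b.getKey())) continue;
                    diffs.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                         .merge(b.getKey(), a.getValue() - b.getValue(), Double::sum);
                    counts.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                          .merge(b.getKey(), 1, Integer::sum);
                }
            }
        }
        // Turn the summed differences into averages.
        for (Map.Entry<Long, Map<Long, Double>> row : diffs.entrySet()) {
            Map<Long, Integer> c = counts.get(row.getKey());
            row.getValue().replaceAll((j, sum) -> sum / c.get(j));
        }
    }

    // Weighted Slope One estimate of the target item for one user.
    double estimate(Map<Long, Double> userPrefs, long target) {
        Map<Long, Double> d = diffs.getOrDefault(target, Collections.emptyMap());
        Map<Long, Integer> c = counts.getOrDefault(target, Collections.emptyMap());
        double num = 0;
        int den = 0;
        for (Map.Entry<Long, Double> e : userPrefs.entrySet()) {
            Double diff = d.get(e.getKey());
            if (diff == null) continue; // no user rated both items
            int n = c.get(e.getKey());
            num += (e.getValue() + diff) * n;
            den += n;
        }
        return den == 0 ? Double.NaN : num / den;
    }
}
```

The diffs map holds up to one entry per ordered pair of items, so its size grows roughly with the square of the number of items. That is the memory growth the note about limiting stored diffs refers to.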

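SVDRecommender (item-based)

Algorithm:

Singular value decomposition, not support vector machines (item features are represented as vectors, and each dimension is scored)

Characteristics:

Requires a preliminary estimation step (the feature vectors must be computed first)

Recommendation quality is good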
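
To make the "feature vectors" concrete: with a factorization, a predicted preference is the dot product of a user vector and an item vector. The sketch below also shows one stochastic-gradient update, which is one common way to fit such vectors. It is an assumption for illustration only and not necessarily how Mahout's SVDRecommender computes its factorization; the names FactorSketch, lr and reg are hypothetical.

```java
// Hypothetical illustration of a factor model; not Mahout's SVDRecommender.
class FactorSketch {

    // Predicted preference = dot product of user factors p and item factors q.
    static double dot(double[] p, double[] q) {
        double s = 0;
        for (int k = 0; k < p.length; k++) s += p[k] * q[k];
        return s;
    }

    // One stochastic-gradient step on a single observed rating r.
    // lr is the learning rate, reg the regularization weight; p and q are updated in place.
    static void sgdStep(double[] p, double[] q, double r, double lr, double reg) {
        double err = r - dot(p, q);
        for (int k = 0; k < p.length; k++) {
            double pk = p[k], qk = q[k]; // use the old values for both updates
            p[k] += lr * (err * qk - reg * pk);
            q[k] += lr * (err * pk - reg * qk);
        }
    }
}
```

Repeating such steps over all known ratings is the preliminary work; afterwards each prediction is a single dot product.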

KnnItemBasedRecommender (item-based)

Similar to the neighborhood approach of GenericUserBasedRecommender, but based on similar items

The main difference from GenericItemBasedRecommender is how the weights are calculated: the weights are not the result of some similarity metric. Instead, the algorithm calculates the optimal set of weights to use between all pairs of items (see the literature for details)

TreeClusteringRecommender

Algorithm:

A recommendation algorithm based on tree-style (hierarchical) clustering

Characteristics:

Well suited to cases where the number of users is small

Fast

Requires precomputation

Model-based recommendation algorithms and recommendation algorithms based on satisfaction degree are not implemented.

Data input in Mahout

DataModel

The implementations include the following.

GenericDataModel

An in-memory data interface class

Internally it uses a FastByIDMap to hold PreferenceArray objects; each PreferenceArray stores a user's preference values for items

GenericBooleanPrefDataModel

An in-memory data interface class

It stores no preference values

It uses FastByIDMap<FastIDSet> to hold the items associated with each user and the users associated with each item
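
The boolean layout can be pictured with standard JDK collections. In this sketch HashMap and HashSet stand in for FastByIDMap and FastIDSet, and the class name BooleanPrefsSketch is hypothetical; it illustrates the data layout only, not Mahout's class.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of a preference store without preference values.
class BooleanPrefsSketch {
    final Map<Long, Set<Long>> itemsByUser = new HashMap<>();
    final Map<Long, Set<Long>> usersByItem = new HashMap<>();

    // Records only that the user is associated with the item; there is no rating.
    void add(long user, long item) {
        itemsByUser.computeIfAbsent(user, k -> new HashSet<>()).add(item);
        usersByItem.computeIfAbsent(item, k -> new HashSet<>()).add(user);
    }
}
```

Keeping both directions makes "which items does this user have" and "which users have this item" equally cheap to answer, at the cost of storing each association twice.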

FileDataModel

A file-based data interface; internally it uses a GenericDataModel to hold the actual user preference data

It adds support for compressed files (.zip, .gz)

It supports dynamic updates. Update files must follow a naming convention: if the main file is foo.txt.gz, a subsequent update file must be named like foo.1.txt.gz

The code appears to reload after a configurable time interval, but it seems to reload everything rather than only the updates (see the code later)

JDBCDataModel

A database-backed data interface. The current implementation is MySQLJDBCDataModel (supports MySQL 5.x); a MySQLJDBCDataModel can be created from a MysqlDataSource

Note: in version 0.7 I could not find the MySQLJDBCDataModel class, only a MySQLJDBCIDMigrator

I do not know how the two are related.

PlusAnonymousUserDataModel

A data model for making recommendations to anonymous users; it treats all anonymous users as a single user (internally it wraps another DataModel)

Similarity calculation in Mahout

Similarity is computed mainly between users (user-based) or between items (item-based).

GenericItemSimilarity contains the inner class GenericItemSimilarity.ItemItemSimilarity

GenericUserSimilarity contains the inner class GenericUserSimilarity.UserUserSimilarity

Both keep the computed similarities in memory, using FastByIDMap<FastByIDMap<Double>>

CachingItemSimilarity

CachingUserSimilarity

These keep computed similarities in a cache so that the calculation is not repeated on every request

Internally they use a Cache<LongPair,Double> named similarityCache to hold the similarities

I do not yet understand how the usage of the caching classes differs from that of GenericUserSimilarity
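
The caching idea amounts to memoizing a similarity function by the pair of IDs. Below is a minimal plain-Java sketch under assumptions: the names CachingSimilaritySketch and Similarity are hypothetical, a HashMap stands in for Mahout's Cache, and the key is normalized on the assumption that similarity is symmetric.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a caching wrapper; not Mahout's CachingUserSimilarity.
class CachingSimilaritySketch {

    interface Similarity {
        double between(long a, long b);
    }

    private final Similarity delegate;
    private final Map<List<Long>, Double> cache = new HashMap<>();
    int misses = 0; // exposed only to show how often the delegate is actually called

    CachingSimilaritySketch(Similarity delegate) {
        this.delegate = delegate;
    }

    double between(long a, long b) {
        // Assumes a symmetric measure, so (a, b) and (b, a) share one cache entry.
        List<Long> key = Arrays.asList(Math.min(a, b), Math.max(a, b));
        Double cached = cache.get(key);
        if (cached == null) {
            misses++;
            cached = delegate.between(a, b);
            cache.put(key, cached);
        }
        return cached;
    }
}
```

Unlike this sketch, a real cache would normally bound its size and evict old entries.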

Similarity measures based on different algorithms implemented in Mahout:

PearsonCorrelationSimilarity: Pearson correlation

EuclideanDistanceSimilarity: Euclidean distance

CosineMeasureSimilarity: cosine similarity (becomes UncenteredCosineSimilarity in 0.7)

SpearmanCorrelationSimilarity: Spearman rank correlation

TanimotoCoefficientSimilarity: Tanimoto coefficient

LogLikelihoodSimilarity: generally better than TanimotoCoefficientSimilarity (I do not understand why)

CityBlockSimilarity: based on Manhattan distance
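
Three of these measures are small enough to write out. The sketch below uses textbook formulas in plain Java; the class name SimilaritySketch is hypothetical, and mapping the Euclidean distance d to 1 / (1 + d) is one common convention assumed here, not necessarily the exact formula Mahout uses.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of three similarity measures; not Mahout's classes.
class SimilaritySketch {

    // Pearson correlation of paired ratings x[i], y[i] for the co-rated items.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx, dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return (sxx == 0 || syy == 0) ? Double.NaN : sxy / Math.sqrt(sxx * syy);
    }

    // Euclidean distance d mapped into (0, 1] as 1 / (1 + d).
    static double euclidean(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += (x[i] - y[i]) * (x[i] - y[i]);
        return 1.0 / (1.0 + Math.sqrt(sum));
    }

    // Tanimoto coefficient: size of intersection / size of union; ignores preference values.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return union == 0 ? 0.0 : (double) inter.size() / union;
    }
}
```

Pearson and Euclidean need preference values, while Tanimoto only needs to know which items each user has, so it fits data without preference values.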

Typical use of similarity

UserSimilarity similarity = new CachingUserSimilarity(
    new SpearmanCorrelationSimilarity(model), model);

Handling missing data

PreferenceInferrer: when data is missing or too sparse, AveragingPreferenceInferrer can be used to fill in missing values with an average.

Generally speaking, a PreferenceInferrer has no effect on the recommendation results other than increasing the amount of computation (the inferred values are derived from the existing data), so it is generally used only for research.
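
Filling with an average can be sketched in a few lines of plain Java. The class name AveragingInferrerSketch is hypothetical, and using the user's own mean rating is an assumption made for illustration.

```java
import java.util.Map;

// Hypothetical illustration of average-based inference; not Mahout's class.
class AveragingInferrerSketch {

    // Returns the known preference if present, otherwise the user's mean rating.
    static double infer(Map<Long, Double> userPrefs, long item) {
        Double known = userPrefs.get(item);
        if (known != null) return known;
        return userPrefs.values().stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(Double.NaN); // user has no ratings at all
    }
}
```

The inferred value is a function of ratings that are already present, so it adds work without adding information, which matches the remark that inference is mainly of research interest.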

Cluster similarity

ClusterSimilarity

Cluster similarity measures the distance between two different clusters (similar to a distance within a coordinate system)

Currently only the following two implementations of inter-cluster distance exist (there is no better algorithm for the time being)

NearestNeighborClusterSimilarity: takes the minimum over all item-to-item distances between the two clusters

FarthestNeighborClusterSimilarity: takes the maximum over all item-to-item distances between the two clusters
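
Following the description above, the two rules differ only in taking the minimum or the maximum over all cross-cluster pairs. Here is a plain-Java sketch expressed in terms of distance, as in the text; the names ClusterLinkageSketch and Distance are hypothetical and the code is not Mahout's.

```java
// Hypothetical illustration of the two cluster-distance rules; not Mahout's classes.
class ClusterLinkageSketch {

    interface Distance {
        double between(long a, long b);
    }

    // Nearest neighbor: smallest distance over all pairs (a in c1, b in c2).
    static double nearest(long[] c1, long[] c2, Distance d) {
        double best = Double.POSITIVE_INFINITY;
        for (long a : c1) {
            for (long b : c2) best = Math.min(best, d.between(a, b));
        }
        return best;
    }

    // Farthest neighbor: largest distance over all pairs (a in c1, b in c2).
    static double farthest(long[] c1, long[] c2, Distance d) {
        double best = Double.NEGATIVE_INFINITY;
        for (long a : c1) {
            for (long b : c2) best = Math.max(best, d.between(a, b));
        }
        return best;
    }
}
```

Both rules inspect every cross-cluster pair, so the cost of one comparison is the product of the two cluster sizes.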

