Item-based collaborative filtering: reading notes on "Item-based collaborative filtering recommendation algorithms"

Recently, I took part in the KDD Cup 2012 competition and chose Track 1, the Weibo recommendation task. While reading up on recommendation I found "Item-based collaborative filtering recommendation algorithms", a classic paper in the field. Many popular recommendation algorithms are improvements on the algorithms proposed in this paper.

I. Collaborative Filtering Algorithm Description

A recommender system uses data analysis to work out what a user is most likely to like and recommends it to that user. Many e-commerce websites now have one. Among current approaches, collaborative filtering (CF) is one of the more mature and widely used recommendation algorithms. The basic idea of CF is to recommend items to a user based on that user's past preferences and on the choices of other users with similar interests.


In CF, an m × n matrix records the users' preferences for the items. A preference is usually expressed as a rating: the higher the rating, the more the user likes the item, and 0 means the user has not purchased the item. In the figure, each row represents a user, each column represents an item, and u_ij is the rating that user i gave to item j. CF consists of two phases: a prediction phase and a recommendation phase. The prediction phase estimates the rating a user would give to items they have not purchased. The recommendation phase uses those predictions to recommend the single most likely item or the top-N items to the user.
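To make the two phases concrete, here is a minimal sketch in Python. The toy matrix, the function names, and above all the placeholder predictor (the mean of an item's known ratings) are assumptions made for illustration; the predictors from the paper are discussed in section III.

```python
# Toy m x n rating matrix: rows are users, columns are items, 0 = not purchased.
ratings = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
]

def predict(ratings, user, item):
    """Prediction phase (placeholder): mean of the item's known ratings.

    The `user` argument is unused by this stand-in; a real CF predictor
    would personalize the estimate.
    """
    known = [row[item] for row in ratings if row[item] != 0]
    return sum(known) / len(known) if known else 0.0

def recommend(ratings, user, top_n=1):
    """Recommendation phase: rank the user's unrated items by predicted rating."""
    unrated = [j for j, r in enumerate(ratings[user]) if r == 0]
    ranked = sorted(unrated, key=lambda j: predict(ratings, user, j), reverse=True)
    return ranked[:top_n]

print(recommend(ratings, user=1, top_n=2))
```

The point of the sketch is the separation of concerns: `recommend` only ranks, so any prediction method can be swapped in without touching it.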

II. Comparison between user-based and item-based algorithms

CF algorithms fall into two categories: memory-based and model-based. User-based and item-based algorithms are both memory-based. For a detailed classification, see the Wikipedia description.

The basic idea of the user-based approach is as follows. Suppose user a likes item A, user b likes items A, B, and C, and user c likes items A and C. User a is then similar to users b and c, because all three like item A. Users b and c also both like item C, so item C is recommended to user a. The algorithm uses nearest-neighbor search to find a user's neighbor set, whose members have preferences similar to that user's, and then predicts the user's ratings from the neighbors' preferences.
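The three-user example can be sketched as follows. This is a simplified illustration, not the nearest-neighbor procedure from the paper: it assumes binary "likes" instead of ratings, treats any user who shares at least one liked item as a neighbor, and ranks candidate items by how many neighbors like them. All names are hypothetical.

```python
# Binary preferences from the example: users a, b, c and items A, B, C.
likes = {
    "a": {"A"},
    "b": {"A", "B", "C"},
    "c": {"A", "C"},
}

def neighbors(likes, user):
    """Users who share at least one liked item with `user` (simplified neighbor set)."""
    return [v for v in likes if v != user and likes[user] & likes[v]]

def recommend_user_based(likes, user):
    """Rank items the user has not liked by the number of neighbors who like them."""
    votes = {}
    for v in neighbors(likes, user):
        for item in likes[v] - likes[user]:
            votes[item] = votes.get(item, 0) + 1
    # Most votes first; ties broken alphabetically for a stable order.
    return sorted(votes, key=lambda item: (-votes[item], item))

print(recommend_user_based(likes, "a"))
```

Item C collects a vote from both neighbors and item B from only one, so C is ranked first for user a.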

The user-based algorithm has two major problems:

1. Data sparsity. A large e-commerce recommender system typically has a very large number of items. A user may have bought fewer than 1% of them, and the items bought by different users overlap very little, so the algorithm may be unable to find a user's neighbors, that is, users with similar preferences.

2. Algorithm scalability. The computational cost of the nearest-neighbor algorithm grows with the number of users and the number of items, which makes it unsuitable for large data volumes.

The basic idea of the item-based approach is to first compute the similarity between items from the historical preferences of all users, and then recommend to a user items that are similar to the ones the user already likes. In the previous example, item A and item C are very similar, because users who like item A also like item C. User a likes item A, so item C is recommended to user a.

Because the similarity between items is relatively stable, the similarities between items can be computed offline in advance and stored in a table. At recommendation time, the predicted ratings are computed with table lookups, which addresses both of the problems above.
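The offline/online split can be sketched like this. The sketch reuses the three-user example with binary likes and, as a stand-in for the similarity measures of the paper, uses Jaccard similarity between the sets of users who like each item; that choice and all function names are assumptions made for illustration.

```python
from itertools import combinations

likes = {
    "a": {"A"},
    "b": {"A", "B", "C"},
    "c": {"A", "C"},
}

def build_similarity_table(likes):
    """Offline step: similarity for every item pair, stored in a nested dict."""
    fans = {}  # item -> set of users who like it
    for user, items in likes.items():
        for item in items:
            fans.setdefault(item, set()).add(user)
    table = {item: {} for item in fans}
    for i, j in combinations(sorted(fans), 2):
        # Jaccard similarity (stand-in measure, not taken from the paper).
        sim = len(fans[i] & fans[j]) / len(fans[i] | fans[j])
        table[i][j] = table[j][i] = sim
    return table

def recommend_item_based(likes, table, user, top_n=1):
    """Online step: only table lookups, no similarity computation."""
    scores = {}
    for liked in likes[user]:
        for other, sim in table[liked].items():
            if other not in likes[user]:
                scores[other] = scores.get(other, 0.0) + sim
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]

table = build_similarity_table(likes)  # computed once, ahead of time
print(recommend_item_based(likes, table, "a"))
```

Only `build_similarity_table` touches every user's history; the per-request work in `recommend_item_based` depends on how many items the user likes and how many neighbors each has in the table.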

III. Detailed process of the item-based algorithm

(1) Similarity calculation

Item-based algorithms use the following methods to calculate similarity between items:

1. Cosine-based similarity. The similarity between two items is the cosine of the angle between their rating vectors. The formula is as follows:

$$\mathrm{sim}(i,j) = \cos(\vec{i},\vec{j}) = \frac{\vec{i}\cdot\vec{j}}{\lVert\vec{i}\rVert_2 \, \lVert\vec{j}\rVert_2}$$

The numerator is the inner product of the two vectors, that is, the entries at the same position in the two vectors are multiplied and the products are summed.
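A minimal sketch of cosine-based similarity in plain Python. It assumes each item is represented by a list of ratings in the same user order, with 0 for "not rated"; the function name and the choice to return 0.0 for an all-zero vector are assumptions.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))        # inner product
    norm_x = math.sqrt(sum(a * a for a in x))     # Euclidean norm of x
    norm_y = math.sqrt(sum(b * b for b in y))     # Euclidean norm of y
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an item nobody rated is similar to nothing
    return dot / (norm_x * norm_y)

item_a = [5, 4, 1, 0]  # ratings of one item by four users
item_c = [4, 5, 0, 1]  # ratings of another item by the same four users
print(cosine_similarity(item_a, item_c))
```

Because only the angle matters, scaling one vector by a positive constant leaves the similarity unchanged.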

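2. Correlation-based similarity. The similarity between two items is the Pearson-r correlation between their rating vectors, computed over the set U of users who rated both item i and item j. The formula is as follows:

$$\mathrm{sim}(i,j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_i)(R_{u,j} - \bar{R}_j)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_i)^2}\,\sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_j)^2}}$$

Here R_{u,i} is user u's rating of item i, and \bar{R}_i is the average rating of item i.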
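A minimal sketch of correlation-based similarity. It assumes each item's ratings are stored as a dict from user id to rating, restricts the computation to users who rated both items, and takes each item's mean over those co-rating users; the last point is an assumption, since the text above does not say which users the mean covers.

```python
import math

def pearson_similarity(ratings_i, ratings_j):
    """Pearson correlation between two items over the users who rated both."""
    common = sorted(set(ratings_i) & set(ratings_j))
    if len(common) < 2:
        return 0.0  # assumption: too little overlap means no evidence of similarity
    mean_i = sum(ratings_i[u] for u in common) / len(common)
    mean_j = sum(ratings_j[u] for u in common) / len(common)
    num = sum((ratings_i[u] - mean_i) * (ratings_j[u] - mean_j) for u in common)
    den_i = math.sqrt(sum((ratings_i[u] - mean_i) ** 2 for u in common))
    den_j = math.sqrt(sum((ratings_j[u] - mean_j) ** 2 for u in common))
    if den_i == 0 or den_j == 0:
        return 0.0  # constant ratings carry no correlation information
    return num / (den_i * den_j)

item_i = {"u1": 1, "u2": 2, "u3": 3}
item_j = {"u1": 2, "u2": 4, "u3": 5, "u4": 1}
print(pearson_similarity(item_i, item_j))
```

User u4 rated only one of the two items, so that rating is ignored.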

3. Adjusted cosine similarity. Cosine-based similarity does not take the rating habits of different users into account: some users tend to give high ratings and others tend to give low ratings. This method subtracts each user's average rating to remove the effect of those habits. The formula is as follows:

$$\mathrm{sim}(i,j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2}\,\sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}$$

Here \bar{R}_u is the average of user u's ratings.
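A minimal sketch of adjusted cosine similarity. It assumes ratings are stored as a dict from user id to a dict of item ratings, and that each user's average is taken over all items that user has rated; the data layout and names are hypothetical.

```python
import math

def adjusted_cosine(ratings, i, j):
    """Adjusted cosine similarity between items i and j.

    `ratings` maps user -> {item: rating}. Only users who rated both
    items contribute, and each rating is centered on that user's mean.
    """
    num = den_i = den_j = 0.0
    for user_ratings in ratings.values():
        if i in user_ratings and j in user_ratings:
            mean_u = sum(user_ratings.values()) / len(user_ratings)
            di = user_ratings[i] - mean_u
            dj = user_ratings[j] - mean_u
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    den = math.sqrt(den_i) * math.sqrt(den_j)
    return num / den if den else 0.0

# A generous rater and a harsh rater with the same taste.
ratings = {
    "generous": {"A": 5, "B": 5, "C": 3},
    "harsh": {"A": 3, "B": 3, "C": 1},
}
print(adjusted_cosine(ratings, "A", "B"), adjusted_cosine(ratings, "A", "C"))
```

After centering, both users agree that A and B sit above their personal average and C sits below it, so A and B come out as maximally similar and A and C as maximally dissimilar, despite the different rating scales.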

(2) Prediction calculation

Based on the similarity between items previously calculated, the following two prediction methods are available:

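1. Weighted sum.

The prediction is a weighted sum of the ratings that user u has given to the items similar to item i, where each weight is the similarity between that item and item i. The sum is then divided by the sum of the absolute values of those similarities, which gives user u's predicted rating for item i. The formula is as follows:

$$P_{u,i} = \frac{\sum_{N} s_{i,N}\, R_{u,N}}{\sum_{N} \lvert s_{i,N} \rvert}$$

The sums run over all items N similar to item i. Here s_{i,N} is the similarity between item i and item N, and R_{u,N} is user u's rating of item N.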
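A minimal sketch of the weighted-sum prediction. It assumes the similarities between the target item and its similar items have already been computed and are passed in as a dict; similar items the user has not rated are skipped. The names and the 0.0 fallback are assumptions.

```python
def predict_weighted_sum(user_ratings, similarities):
    """Predict a user's rating for a target item.

    user_ratings: {item: rating} for the items the user has rated.
    similarities: {item: similarity to the target item}.
    """
    num = 0.0
    den = 0.0
    for item, sim in similarities.items():
        if item in user_ratings:
            num += sim * user_ratings[item]  # similarity-weighted rating
            den += abs(sim)                  # normalizer: sum of |similarity|
    return num / den if den else 0.0  # assumption: no usable neighbors -> 0.0

user_ratings = {"A": 4, "B": 2}
similarities = {"A": 0.9, "B": 0.3, "C": 0.5}  # similarity of each item to the target
print(predict_weighted_sum(user_ratings, similarities))
```

Here the prediction is (0.9 × 4 + 0.3 × 2) / (0.9 + 0.3), which is about 3.5; item C is ignored because the user has not rated it. Dividing by the sum of absolute similarities keeps the prediction within the range of the user's own ratings when all similarities are non-negative.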

2. Regression.

This method is similar to the weighted sum above, but it does not use the raw ratings of the similar item N directly. Similarity computed with the cosine or Pearson correlation method has a pitfall: two rating vectors may be far apart in Euclidean distance and still be highly similar. Different users have different rating habits; some tend to rate high and some tend to rate low. If two users like the same items but rate on different scales, their rating vectors may be far apart in Euclidean distance even though they should count as highly similar. Predicting from the raw ratings of the similar items then gives poor results. The regression method uses linear regression to re-estimate each rating and then predicts with the same weighted sum as above. The re-estimate is computed as follows:

$$R'_N = \alpha R_i + \beta + \epsilon$$

Here item N is an item similar to item i, R_i and R_N are the rating vectors of items i and N, α and β are obtained by linear regression on those two vectors, and ε is the error of the regression model. The paper does not describe how to carry out the linear regression; refer to other literature for that.
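Since the fitting procedure is not described, the sketch below simply assumes ordinary least squares on the ratings of users who rated both items, using the closed-form solution for a single predictor. It shows only how α and β could be estimated; how the re-estimated ratings are then fed into the weighted sum is not covered. All names are hypothetical.

```python
def fit_linear(x, y):
    """Ordinary least squares fit of y ≈ alpha * x + beta for paired samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        return 0.0, mean_y  # x is constant: fall back to predicting the mean of y
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta

# Ratings of item i and of a similar item N by the same four users.
r_i = [1, 2, 3, 4]
r_n = [3, 5, 7, 9]
alpha, beta = fit_linear(r_i, r_n)
re_estimated = [alpha * r + beta for r in r_i]  # fitted values for item N
print(alpha, beta, re_estimated)
```

In this toy data item N's ratings are an exact linear function of item i's, so the fit recovers the slope and intercept and the residual ε is zero for every user.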

IV. Conclusion

The experiments in the paper lead to three conclusions: 1. The prediction quality of the item-based algorithm is higher than that of the user-based algorithm. 2. Because the item-based algorithm can compute item similarities in advance, its online prediction performance is higher than that of the user-based algorithm. 3. Using only a small subset of the items can still produce high-quality predictions.

If you reprint this article, please credit the source. Original address: http://blog.csdn.net/huagong_adu/article/details/7362908
