Problems and solutions of collaborative filtering algorithm

Last Update:2018-08-02 Source: Internet

Author: User

Article reprint: http://blog.csdn.net/cserchen/article/details/5838333

1 Problems existing in the application of collaborative filtering

Although collaborative filtering has been applied with great success in e-commerce recommendation systems, the growth in site size, content complexity, and number of users confronts recommendation systems based on collaborative filtering with two major challenges:

1) Improve the scalability of the collaborative filtering algorithm

A collaborative filtering algorithm can readily serve a modest number of users, but e-commerce sites often need to provide recommendations to millions of users. On the one hand, the system must respond quickly enough to give users real-time recommendations; on the other hand, it must keep storage requirements in check, so that running the recommendation system places as little burden on the site as possible.

2) Improve the quality of personalized recommendations

Users need recommendations they can trust to help them find products they like. If a user follows a recommendation, buys the product, and then finds that they do not like it, their trust in the recommendations drops and they become unwilling to use the recommendation system again.

In a sense, these two challenges conflict: improving the scalability and response time of the algorithm tends to cost recommendation quality. How to reconcile the two requirements, so that the recommendation system is both useful and practical, is therefore an important consideration when implementing collaborative filtering.

To improve collaborative filtering and keep pace with the development of recommender systems, we must first analyze the problems that arise when implementing it and then make targeted improvements. Research on collaborative filtering and recommendation systems points to the following problems in its implementation.

1.1 Sparsity issues

Collaborative filtering first needs a user-item rating matrix to represent user information. This is simple in theory, but in practice many e-commerce recommender systems handle a very large volume of data, and in these systems a typical user has purchased only about 1% of the items on the site, so the rating matrix (user-item matrix) is very sparse. With data this large and this sparse, it is hard to find the nearest-neighbor user set, and the similarity computation is very expensive.

At the same time, because the data is so sparse, information is lost when the target user's nearest-neighbor set is formed, which degrades the recommendations. One example is the loss of transitivity in neighbor relationships: user A is highly correlated with user B, and user B is highly correlated with user C, but because users A and C have rated few items in common, the correlation computed between them is low, and the potential association between A and C is lost to the sparsity of the data.
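The lost-transitivity effect can be reproduced on a toy data set. The following is a minimal sketch, not the article's method: the `ratings` dictionary, the item names, and the choice of cosine similarity over co-rated items are all assumptions made for illustration.

```python
from math import sqrt

# Hypothetical ratings (user -> {item: score}); all names and values are made up.
ratings = {
    "A": {"i1": 5, "i2": 4, "i3": 1},
    "B": {"i1": 5, "i2": 5, "i3": 1, "i4": 4, "i5": 5, "i6": 2},
    "C": {"i4": 4, "i5": 5, "i6": 1},
}


def sparsity(ratings, n_items):
    """Fraction of cells in the user-item matrix that hold no rating."""
    filled = sum(len(row) for row in ratings.values())
    return 1.0 - filled / (len(ratings) * n_items)


def cosine_on_common(u, v):
    """Cosine similarity over co-rated items only; 0.0 when there are none."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


print("sparsity:", round(sparsity(ratings, n_items=6), 3))
for a, b in [("A", "B"), ("B", "C"), ("A", "C")]:
    print(a, b, round(cosine_on_common(ratings[a], ratings[b]), 3))
```

In this sketch A and B agree closely, as do B and C, yet A and C share no rated item, so their similarity falls back to zero and the indirect link through B is invisible to the neighbor search.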

1.2 Cold start problem

The cold start problem, also called the first-rater problem or the new-item problem, can be seen as an extreme case of the sparsity problem. Traditional collaborative filtering derives the target user's recommendations from similar users or items, so when a new item first appears and no user has rated it, pure collaborative filtering can neither predict a score for it nor recommend it. [25] Moreover, while a new item still has only a few ratings, recommendations involving it are inaccurate. Likewise, the system makes poor recommendations for new users. The extreme case arises when a collaborative filtering recommendation system has just started running: every user faces a cold start problem on every item.
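A minimal sketch of how cold-start cases might be detected before running collaborative filtering. The function name `cold_start_report` and the sample data are hypothetical, introduced only for illustration.

```python
def cold_start_report(ratings, all_users, all_items):
    """Return (items nobody has rated, users who have rated nothing)."""
    rated_items = {item for row in ratings.values() for item in row}
    cold_items = sorted(set(all_items) - rated_items)
    cold_users = sorted(u for u in all_users if not ratings.get(u))
    return cold_items, cold_users


# Hypothetical data: item "c" is new, user "u3" has no ratings, "u4" is unknown.
ratings = {"u1": {"a": 4, "b": 2}, "u2": {"a": 5}, "u3": {}}
cold_items, cold_users = cold_start_report(
    ratings, all_users=["u1", "u2", "u3", "u4"], all_items=["a", "b", "c"]
)
print("cold items:", cold_items)
print("cold users:", cold_users)
```

Items and users flagged this way have no overlap with anyone else, so a purely rating-based similarity has nothing to work with and some other source of information is needed.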

1.3 Scalability issues

Among collaborative filtering recommendation algorithms, the global numerical (memory-based) algorithm can use the latest information promptly to produce relatively accurate predictions of user interest. But as the number of users grows and the amount of data increases rapidly, the scalability of the algorithm (that is, its ability to adapt to the growing scale of the system) becomes an important factor limiting the recommendation system. Compared with model-based algorithms, the global numerical algorithm saves the training time needed to build a model, but the computation needed to identify the nearest neighbors grows sharply with the number of users and items, and at a scale of millions the usual algorithm hits a severe scalability bottleneck. Left unsolved, this problem directly limits the ability of a system based on collaborative filtering to provide real-time recommendations; the better a system's real-time performance and the higher its accuracy, the more readily users accept it.

Although model-based algorithms can solve the scalability problem to some extent, they are better suited to cases where users' interests are relatively stable, because the user model has to be learned and then updated. They are therefore worse than the global numerical algorithm at exploiting the latest information.

The problems analyzed above share a common point: they all involve forming the nearest-neighbor set (including whether user information is sufficient, the computational cost, and so on). To obtain the nearest-neighbor users, a collaborative filtering system must compute the similarity between users, then determine the best number of neighbors and form the neighbor user set. If the similarity is computed over the entire data set, the approach is direct, but the computation and time costs are too high for a real business system. If the experiment is run on a training set (a subset of the entire data set), the whole data set need not be processed, but the results have to be aggregated over many experimental runs before they can be relied on, which undoubtedly increases both the cost and the error of the recommendations. And once the dynamic changes of the data set are taken into account, the practical value of the nearest-neighbor user set approach shrinks further. A more effective way of forming the neighbor user set is therefore needed for collaborative filtering applications.
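To make the cost of neighbor formation concrete, here is a minimal brute-force sketch. The similarity measure (cosine over co-rated items), the function names, and the sample ratings are assumptions for illustration; the article does not prescribe them.

```python
from math import sqrt


def cosine_on_common(u, v):
    """Cosine similarity over co-rated items only; 0.0 when there are none."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def nearest_neighbors(target, ratings, k):
    """Compare the target user with every other user and keep the k best."""
    sims = [
        (cosine_on_common(ratings[target], row), other)
        for other, row in ratings.items()
        if other != target
    ]
    sims.sort(reverse=True)
    return [(other, sim) for sim, other in sims[:k]]


def all_pairs_count(n_users):
    """Number of user pairs a full similarity matrix has to evaluate."""
    return n_users * (n_users - 1) // 2


# Hypothetical ratings, made up for this example.
ratings = {
    "A": {"i1": 5, "i2": 4, "i3": 1},
    "B": {"i1": 5, "i2": 5, "i3": 1, "i4": 4},
    "C": {"i1": 1, "i2": 1, "i3": 5},
    "D": {"i4": 3},
}
print(nearest_neighbors("A", ratings, k=2))
print(all_pairs_count(1_000_000))
```

Each query scans every other user, and precomputing all pairs grows quadratically with the number of users, which is the bottleneck the text describes.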

2 Methods for solving sparsity problems in collaborative filtering

2.1 Content-based collaborative filtering approach

Content-based recommendation is a recommendation technology that extracts the feature attributes of items from their content, and is a continuation and development of information filtering technology.

In a content-based recommender system, an item or object is defined by its feature attributes. For example, a text recommendation system such as the newsgroup filtering system NewsWeeder uses the words of the text as features. The basic idea of this approach is that a user is more likely to like items similar to those he has already purchased. In such methods, historical information is used to reflect the relationships between items, such as the purchase of one item often leading to the purchase of another item or group of items. The method therefore uses the user-item matrix to analyze the similarity between items and, on that basis, computes the top-N recommended items.

A content-based recommender system learns the user's interests from the features of the objects the user has rated. Schafer, Konstan and Riedl called this approach item-to-item correlation. Because it does not need to identify neighboring users, the recommendation algorithm is much faster.

The advantage of content-based recommendation, which learns from historical information about item attributes, is that it improves the scalability of recommendation and offers a better explanation of the recommended results. Content-based recommendation can find items that match the user's known interests, but cannot discover new content.
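The idea of describing items by their textual vocabulary and recommending what resembles the user's past choices can be sketched as follows. This is an illustrative sketch only: the bag-of-words features, the cosine measure, the `catalog` texts, and the function names are assumptions, not the method of NewsWeeder or of the article.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Word counts of a description; the words act as the item's features."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_by_content(liked, catalog, n=1):
    """Rank items the user has not seen by similarity to the liked items."""
    profile = Counter()
    for item in liked:
        profile.update(bag_of_words(catalog[item]))
    scores = {
        item: cosine(profile, bag_of_words(text))
        for item, text in catalog.items()
        if item not in liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical catalog of item descriptions.
catalog = {
    "d1": "nba playoffs basketball score",
    "d2": "basketball nba trade rumors",
    "d3": "stock market earnings report",
}
print(recommend_by_content(["d1"], catalog, n=1))
```

No neighbor users are needed here, which is why such methods scale well, but the ranking can only surface items that share vocabulary with what the user already liked.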

Content-based recommendation has the following deficiencies:

1) Information mining is not comprehensive. Content-based recommendation can generally perform only a simple analysis of certain content. In some domains, such as movies, music, or restaurants, item attributes do not capture hidden features. Even for a text document, the recommender captures only limited aspects of the content, while many other aspects affect the user's experience. For example, a recommendation based on the content of a Web page completely ignores aesthetic quality, all multimedia information (including text embedded in images), and network factors such as loading time.

2) The recommended content is limited. Not only content-based recommendation but many other recommendation techniques suffer from the so-called "over-specialization" problem. When the system recommends only on the basis of user profiles or item descriptions, the user is limited to items similar to those already familiar, which does not help uncover the user's potential interests.

3) Lack of user feedback. This is a common problem of recommendation systems. Rating items is a burden for the user, so the fewer ratings required the better. In content-based recommendation, however, the description of an item's attributes is the only factor that affects future recommendations, which means that recommendation performance falls as the number of ratings falls.

2.2 Item-based collaborative filtering algorithm

Item-based collaborative filtering predicts the user's score for a target item from the user's scores on similar items. It rests on the assumption that if most users give similar scores to certain items, the current user will also give similar scores to those items. An item-based collaborative filtering recommendation system uses statistical techniques to find several nearest neighbors of the target item. Because the current user's ratings of these nearest neighbors are similar to his rating of the target item, the user's score for the target item can be predicted from his ratings of the nearest neighbors. [22] The few items with the highest predicted scores are then returned to the user as the recommended results.

Table 2 User rating data

User   Star Wars   Titanic   Lord of the Rings   Bridge Dream
A      4           2         4                   4
B      3           5         3                   3
C      2           2         3                   2
D      5           1         5                   ?

The core of the item-based collaborative filtering recommendation algorithm is that the user's ratings of the target item's nearest neighbors drive the final recommendation: the user's score for the target item is approximated by the weighted average of the user's scores on those nearest neighbors. For example, with the rating data shown in Table 2, the algorithm needs to predict user D's score for "Bridge Dream". The data shows that users rate "Bridge Dream" very similarly to "Star Wars", so "Star Wars" is the best neighbor of "Bridge Dream" and D's score for "Star Wars" has the greatest impact on the predicted value. Next, users rate "Lord of the Rings" similarly to "Bridge Dream", so D's score for "Lord of the Rings" also has a relatively large impact. "Titanic" is not a good neighbor of "Bridge Dream", because users' ratings of the two conflict, so D's score for "Titanic" has relatively little impact. In the actual prediction process, only the few neighbors most similar to the target item are searched, and the user's score for the target item is predicted according to the size of each similarity.
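The weighted average described above is commonly written in the following form. The notation is introduced here for illustration and does not come from the original article: $P_{u,i}$ is the predicted score of user $u$ for target item $i$, $N(i)$ is the set of nearest-neighbor items of $i$ that $u$ has rated, $\mathrm{sim}(i,j)$ is the similarity between items $i$ and $j$, and $R_{u,j}$ is the score user $u$ gave to item $j$.

```latex
P_{u,i} = \frac{\sum_{j \in N(i)} \mathrm{sim}(i,j)\, R_{u,j}}
               {\sum_{j \in N(i)} \left| \mathrm{sim}(i,j) \right|}
```

As a check against Table 2: if the two nearest neighbors of "Bridge Dream" are taken to be "Star Wars" and "Lord of the Rings", the fourth user rated both of them 5, so any weighted average with positive weights gives a predicted score of 5.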

Unlike the user-based collaborative filtering recommendation algorithm, the item-based algorithm selects the nearest-neighbor set of the target item by computing the similarity between items, and predicts the user's score for the target item from the current user's scores on those nearest neighbors. It then returns the items with the highest predicted scores to the user as recommendations. The item-based collaborative filtering recommendation algorithm can be divided into the following two stages:

1) Nearest-neighbor query: search for the nearest neighbors of the target item.

2) Recommendation: predict the user's score for the target item from the user's scores on its nearest neighbors, and produce the top-N recommendations.
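The two stages can be sketched on the ratings from Table 2. This is a minimal illustration under stated assumptions: the four users are labeled A to D, item similarity is measured with cosine similarity over users who rated both items (the article does not fix a measure), and the function names are hypothetical.

```python
from math import sqrt

# Ratings from Table 2; the fourth user's score for "Bridge Dream" is unknown.
ratings = {
    "A": {"Star Wars": 4, "Titanic": 2, "Lord of the Rings": 4, "Bridge Dream": 4},
    "B": {"Star Wars": 3, "Titanic": 5, "Lord of the Rings": 3, "Bridge Dream": 3},
    "C": {"Star Wars": 2, "Titanic": 2, "Lord of the Rings": 3, "Bridge Dream": 2},
    "D": {"Star Wars": 5, "Titanic": 1, "Lord of the Rings": 5},
}


def item_similarity(ratings, i, j):
    """Cosine similarity of two items over the users who rated both."""
    pairs = [(row[i], row[j]) for row in ratings.values() if i in row and j in row]
    if not pairs:
        return 0.0
    dot = sum(a * b for a, b in pairs)
    norm_i = sqrt(sum(a * a for a, _ in pairs))
    norm_j = sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_i * norm_j)


def predict(ratings, user, target, k=2):
    """Predict the user's score for the target item from its k nearest items."""
    # Stage 1: nearest-neighbor query among the items the user has rated.
    sims = sorted(
        ((item_similarity(ratings, target, j), j) for j in ratings[user] if j != target),
        reverse=True,
    )[:k]
    # Stage 2: similarity-weighted average of the user's own scores.
    total = sum(sim for sim, _ in sims)
    if total == 0:
        return None
    return sum(sim * ratings[user][j] for sim, j in sims) / total


for other in ["Star Wars", "Lord of the Rings", "Titanic"]:
    print(other, round(item_similarity(ratings, "Bridge Dream", other), 3))
print("prediction for D:", predict(ratings, "D", "Bridge Dream", k=2))
```

With this measure "Star Wars" comes out as the closest neighbor of "Bridge Dream" and "Titanic" as the farthest, matching the discussion of Table 2. A different similarity measure, such as Pearson correlation, would change the weights but not the overall procedure.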

2.3 Combined recommendation algorithm based on content recommendation and collaborative recommendation

Although traditional collaborative filtering produces strong results, cold start and sparsity remain important problems that limit its performance. The crux is that collaborative filtering needs many users to rate items, so that users can help one another and filter items together. A new item that nobody has rated cannot take part in recommendation, so for that item the recommendation system loses its function. The more people rate items, the better the system can perform. For a new item, it is therefore necessary to work out how it can take part in recommendation from the very beginning, so that the whole recommendation system moves into a virtuous circle.

Because it approaches the problem from a different angle, content-based recommendation can effectively address the cold start and sparsity problems of collaborative filtering. A simple example: suppose one user has rated an NBA Web page from ESPN.com favorably, while another user has rated an NBA Web page from cnnsi.com favorably. Collaborative filtering alone cannot discover that the two users are similar. Content-based analysis, however, learns the user's interests from the features of the rated objects (items), so it can find that the two items are similar. Then, from the attributes of similar items and the user's ratings of them, a preliminary prediction of the user's score for a new item can be made, which addresses the cold start problem and effectively alleviates the sparsity problem.

Based on the above analysis, a researcher has proposed a combined recommendation method that uses content-based recommendation to improve collaborative filtering: content-based recommendation finds similar items, compensating for the weakness of collaborative filtering on new items and thus effectively addressing the cold start and sparsity problems. The proposed algorithm first uses content-based recommendation to analyze the feature attributes of items and find items similar to the new item, then predicts the score of the new item from the user's ratings of those similar items, and finally uses traditional collaborative filtering to compute the neighbor users and give the final predicted score, so that the new item takes part in recommendation.
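The first two steps of the combined method (find content-similar items, then predict the new item's score from the user's ratings of them) can be sketched as follows. The feature dictionaries, item names, and function names are hypothetical, and the final collaborative filtering pass over neighbor users is left out of this sketch.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse feature dictionaries."""
    dot = sum(a[f] * b.get(f, 0.0) for f in a)
    norm_a = sqrt(sum(x * x for x in a.values()))
    norm_b = sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_new_item(user_ratings, features, new_item, k=2):
    """Score an unrated item from the user's ratings of content-similar items."""
    sims = sorted(
        ((cosine(features[new_item], features[j]), j) for j in user_ratings),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in sims)
    if total == 0:
        return None  # no content overlap: the item stays cold
    return sum(sim * user_ratings[j] for sim, j in sims) / total


# Hypothetical content features; "nba_cnnsi" is the new, unrated item.
features = {
    "nba_espn": {"nba": 1, "basketball": 1},
    "nba_tv": {"nba": 1, "tv": 1},
    "cooking": {"recipe": 1, "food": 1},
    "nba_cnnsi": {"nba": 1, "basketball": 1, "news": 1},
}
user_ratings = {"nba_espn": 5, "nba_tv": 3, "cooking": 1}
print(predict_new_item(user_ratings, features, "nba_cnnsi", k=2))
```

The preliminary score produced this way can then be written into the rating matrix, so that the new item is no longer empty when the ordinary neighbor-based prediction runs.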

2.4 Clustering-based collaborative filtering recommendation algorithm

With the further expansion of e-commerce systems, the real-time requirements placed on collaborative filtering recommendation algorithms pose a growing challenge. [4] In a system with tens of thousands of users and items, it is increasingly difficult to provide real-time recommendation services to tens of thousands of users at the same time.

GroupLens was the first automated collaborative filtering recommendation system to handle large-scale data sets and provide real-time news recommendation to its users. In the GroupLens recommendation system, the news must first be classified by hand, with different kinds of news assigned to different newsgroups. Because each user is in a particular newsgroup at any given time, both the nearest-neighbor query and the recommendations for that user are restricted to that newsgroup. This reduces the search space and effectively meets the real-time challenge faced by the collaborative filtering recommendation algorithm.

This solution has the following deficiencies. First, manual classification of items by content is highly subjective and therefore not accurate. Second, in many large e-commerce systems the number of items is enormous, so manual classification is very time-consuming and therefore unrealistic. Finally, this method can recommend items only within a certain range, losing the opportunity to offer users other valuable recommendations.

One effective way to solve these problems is to classify items automatically from their content information. This scheme also has drawbacks: the main one is that all the content information must be entered, and in many cases the content information of an item, such as graphics, images, or video, cannot be obtained.

Another solution is the clustering-based (cluster-based) collaborative filtering recommendation algorithm. The entire user space is divided into several clusters according to users' buying habits and rating characteristics, so that users within a cluster rate items as similarly as possible while users in different clusters rate items as differently as possible. A virtual user is generated from the rating information of each cluster and represents the typical ratings of the users in that cluster. The ratings of all virtual users form a new search space, in which the nearest neighbors of the current user are queried to produce the corresponding recommendations. Because the virtual user space is much smaller than the original user space, the nearest-neighbor query is much more efficient, which effectively improves the real-time response of the recommendation algorithm.

Cluster analysis has been studied extensively in the field of data mining. The K-means clustering algorithm is one of the simplest and most widely used clustering algorithms. The main steps for clustering the entire user space with the K-means clustering algorithm are as follows:

1) Randomly select K users as seed nodes, and use the rating data of these K users as the initial cluster centers.
2) For each remaining user, calculate the similarity to each of the K cluster centers and assign the user to the cluster with the highest similarity.
3) For each newly formed cluster, calculate the average rating of all users in the cluster to generate a new cluster center.
4) Repeat steps 2 and 3 until the clusters no longer change.
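The four steps above can be sketched in plain Python. Several things here are assumptions made for illustration: each user is a dense list of ratings with no missing values, squared Euclidean distance stands in for the similarity measure (the article does not specify one), and the `seed` and `max_iter` parameters are added only to keep the sketch reproducible and bounded.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance; a smaller value means more similar users."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick K users at random as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every user to the closest (most similar) center.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Step 4: stop once the clusters no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean rating of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


# Hypothetical rating vectors: two users like items 1-2, two like item 3.
users = [[5, 5, 1], [4, 5, 1], [1, 1, 5], [1, 2, 5]]
centers, assignment = kmeans(users, k=2)
print(assignment)
print(centers)
```

The returned centers are averaged rating vectors rather than real users; they play the role of the virtual users described in the text.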

After the clusters are generated, the cluster-based collaborative filtering recommendation algorithm can be divided into the following two steps:

1) Virtual user set generation: generate a cluster center for each cluster. The cluster center has the minimum distance to the other users in the cluster and represents the typical ratings of the users in that cluster. All cluster centers together form the virtual user set.

2) Recommendation: use a similarity measure on the virtual user set to search for several nearest neighbors of the current user, then produce the corresponding recommendations from the nearest neighbors' ratings. The nearest-neighbor search and recommendation methods are similar to those of the ordinary collaborative filtering recommendation algorithm and are not repeated here.
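A minimal sketch of the recommendation step over a virtual user set. The cluster centers below are hypothetical values, `None` marks an item the current user has not rated, and cosine similarity over the user's rated items is one possible measure among many; none of these choices come from the article.

```python
from math import sqrt


def similarity(user, virtual):
    """Cosine similarity over the items the real user has rated."""
    idx = [i for i, r in enumerate(user) if r is not None]
    dot = sum(user[i] * virtual[i] for i in idx)
    norm_u = sqrt(sum(user[i] ** 2 for i in idx))
    norm_v = sqrt(sum(virtual[i] ** 2 for i in idx))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(user, virtual_users, k=1, top_n=1):
    """Return indices of unrated items, best predicted score first."""
    # Nearest-neighbor search runs over the small virtual user set only.
    neighbors = sorted(virtual_users, key=lambda v: similarity(user, v), reverse=True)[:k]
    weights = [similarity(user, v) for v in neighbors]
    total = sum(weights)
    scores = {}
    if total > 0:
        for i, r in enumerate(user):
            if r is None:
                scores[i] = sum(w * v[i] for w, v in zip(weights, neighbors)) / total
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical cluster centers (virtual users) over four items.
virtual_users = [[4.5, 5.0, 1.0, 4.0], [1.0, 1.5, 5.0, 2.0]]
current_user = [5, None, 1, None]
print(recommend(current_user, virtual_users, k=1, top_n=2))
```

The search cost now depends on the number of clusters rather than the number of real users, which is where the real-time gain comes from.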
