Collaborative Filtering Algorithm Problems and Solutions

Source: Internet
Author: User
1. Problems with collaborative filtering in applications

Although collaborative filtering has been very successful in e-commerce recommendation systems, recommendation systems based on it face two major challenges as site structures grow, content becomes more complex, and the number of users increases:

1) Improve the scalability of collaborative filtering algorithms

Collaborative filtering algorithms can readily produce good recommendations for a few thousand users, but e-commerce websites often need to serve millions of users. This demands fast response times so that recommendations can be made in real time. Storage requirements must also be taken into account, so that the recommendation system places as little burden on the site as possible when it runs.

2) Improve the quality of personalized recommendations

Users need recommendations they can trust to help them find products they like. If a user trusts a recommendation, buys the product, and then finds that he does not like it, his trust in the recommendations drops and he becomes unwilling to use the recommendation system again.

In a sense, the two challenges conflict: improving scalability and response time generally comes at some cost in recommendation quality. It is therefore important to balance the two requirements so that the recommendation system is both useful and practical.

To improve collaborative filtering and meet the needs of recommendation systems, we must first analyze the problems that arise when collaborative filtering is implemented, so that improvements can be targeted. From studying collaborative filtering and recommendation systems, we found the following main problems.

1.1 Sparsity

To implement collaborative filtering, user information must first be represented as a user-item rating matrix. Although this is simple in theory, many e-commerce recommendation systems must process very large amounts of data. In these systems, the items purchased by users account for only about 1% of the items on the website, so the rating matrix (user-item matrix) is very sparse. When the data is both large and sparse, it is difficult to find the nearest neighbor user set, and similarity calculation is very expensive.

Because the data is so sparse, information is also lost when the target user's nearest neighbor set is formed, which degrades recommendation quality. One example is the loss of transitivity in neighbor relationships: user A is highly correlated with user B, and user B is highly correlated with user C, but because users A and C have rated few items in common, the correlation between A and C is judged to be low. This is a direct consequence of data sparsity.
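The loss of transitivity can be seen in a small sketch. The rating data, the item names i1 to i6, and the `min_common` threshold below are invented for illustration, and Pearson correlation over co-rated items is only one common choice of similarity measure.

```python
from math import sqrt

def pearson(r1, r2, min_common=2):
    """Pearson correlation over co-rated items; None if too few items are shared."""
    common = sorted(set(r1) & set(r2))
    if len(common) < min_common:
        return None  # not enough overlap to say anything about these two users
    m1 = sum(r1[i] for i in common) / len(common)
    m2 = sum(r2[i] for i in common) / len(common)
    num = sum((r1[i] - m1) * (r2[i] - m2) for i in common)
    den = (sqrt(sum((r1[i] - m1) ** 2 for i in common))
           * sqrt(sum((r2[i] - m2) ** 2 for i in common)))
    return num / den if den else None

# Hypothetical ratings: A and C each overlap with B, but not with each other.
ratings = {
    "A": {"i1": 5, "i2": 1, "i3": 4},
    "B": {"i1": 5, "i2": 2, "i3": 4, "i4": 1, "i5": 5, "i6": 2},
    "C": {"i4": 1, "i5": 5, "i6": 3},
}

sim_ab = pearson(ratings["A"], ratings["B"])  # high: A and B agree on i1..i3
sim_bc = pearson(ratings["B"], ratings["C"])  # high: B and C agree on i4..i6
sim_ac = pearson(ratings["A"], ratings["C"])  # None: no co-rated items at all
print(sim_ab, sim_bc, sim_ac)
```

A and C may well share tastes (both agree with B), but with no co-rated items the system has no way to measure it.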

1.2 Cold start problems

The cold start problem, also called the first-rater problem or the new-item problem, can be viewed as an extreme case of the sparsity problem. Traditional collaborative filtering derives recommendations for the target user from similar users or items, so when a new item first appears and no user has rated it, collaborative filtering alone cannot predict ratings for it or recommend it. Moreover, a new item has few ratings early on, so recommendation accuracy for it is poor [25]. Similarly, the recommendation system performs poorly for new users. The extreme case of cold start is a collaborative filtering recommendation system that has just started running: every user faces the cold start problem for every item.

1.3 Scalability problems

Among collaborative filtering recommendation algorithms, the global numerical algorithm can use the latest information to produce relatively accurate predictions of user interest and recommendations. However, as the number of users and the volume of data grow sharply, the scalability of the algorithm (that is, its ability to adapt to the increasing scale of the system) has become an important factor limiting the deployment of recommendation systems. Compared with model-based algorithms, the global numerical algorithm saves the training time needed to build a model, but the computation required to identify the nearest neighbors increases greatly as users and items are added. With users and items numbering in the millions, typical algorithms hit a serious scalability bottleneck. How well this problem is solved directly determines whether a recommendation system based on collaborative filtering can serve recommendations in real time. The better the real-time performance and the higher the accuracy, the more readily users accept the system.

Although model-based algorithms can address scalability to some extent, they are usually suited to scenarios where users' interests are relatively stable. Because the user model must first be learned and then updated, these algorithms make poorer use of the latest information than the global numerical algorithm.

The problems above share a common thread: the formation of the nearest neighbors (including whether enough user information is available, the computing cost, and so on). To obtain the nearest neighbors of a user, collaborative filtering must compute the similarity between users, determine the optimal number of neighbors, and then form the neighbor user set. If similarity is computed over the entire dataset, the approach is direct, but the computation and time costs are extremely high and cannot keep up with a real business system. If instead a training set (a subset of the entire dataset) is chosen by experiment, the entire dataset need not be processed, but the experiment must be repeated many times before a result is obtained, which increases both the cost and the error of the recommendation results. In addition, because datasets change dynamically, the practical value of this way of forming the neighbor user set keeps shrinking. Collaborative filtering applications therefore need a more effective way to form the nearest neighbor user set.
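As a rough illustration of neighbor formation over the whole dataset, the sketch below compares the target user with every other user and keeps the k most similar. The user names, ratings, and the choice of cosine similarity are assumptions made for the example. Each query touches every user, which is the cost that grows with the size of the system.

```python
import heapq
from math import sqrt

def cosine(r1, r2):
    """Cosine similarity of two sparse rating dicts (item -> rating)."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    n1 = sqrt(sum(v * v for v in r1.values()))
    n2 = sqrt(sum(v * v for v in r2.values()))
    return dot / (n1 * n2)

def nearest_neighbors(ratings, target, k):
    """Return the k (similarity, user) pairs most similar to the target user."""
    sims = [(cosine(ratings[target], r), u)
            for u, r in ratings.items() if u != target]  # one pass over all users
    return heapq.nlargest(k, sims)

# Hypothetical rating data.
ratings = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 5, "b": 3, "c": 5},
    "u3": {"a": 1, "b": 5},
    "u4": {"d": 4},
}

neighbors = nearest_neighbors(ratings, "u1", k=2)
print([u for _, u in neighbors])
```

Here k is fixed by hand; choosing a good value of k is itself part of the problem described above.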

2. Solutions to the sparsity problem in collaborative filtering

2.1 Content-Based Collaborative Filtering

Content-based recommendation extracts feature attributes from the content of items and recommends on that basis. It is a continuation and development of information filtering technology.

In a content-based recommendation system, items are defined by their feature attributes. For example, text recommendation systems such as the newsgroup filtering system NewsWeeder use the words of the text as features. The fundamental idea is that a user will prefer items similar to those he has purchased before. Historical information is used to capture the relationships between items; for example, purchasing one item often leads to purchasing another item or group of items. The method therefore uses the user-item matrix to analyze the similarity between items and compute the top-N recommended items.

The content-based recommendation system learns users' interests from the characteristics of the items they have rated. Schafer, Konstan, and Riedl call this the item-to-item correlation method. Because this method does not need to identify neighbor users, the recommendation algorithm is much faster.
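A minimal sketch of the idea, assuming items are short texts and their words are the features. The item names, texts, and the bag-of-words cosine measure are illustrative choices, not the method of any particular system.

```python
from collections import Counter
from math import sqrt

def bag_of_words(text):
    """Word-count feature vector for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(items, liked, n):
    """Rank items the user has not seen by similarity to the items he liked."""
    profile = Counter()
    for name in liked:
        profile.update(bag_of_words(items[name]))  # user profile = words of liked items
    scored = [(cosine(profile, bag_of_words(text)), name)
              for name, text in items.items() if name not in liked]
    scored.sort(reverse=True)
    return [name for _, name in scored[:n]]

# Hypothetical items described only by their text.
items = {
    "post1": "nba playoffs basketball game score",
    "post2": "basketball trade rumors nba draft",
    "post3": "stock market interest rates",
    "post4": "python compiler release notes",
}

top = recommend(items, liked=["post1"], n=1)
print(top)
```

No other users are consulted at any point, which is why no neighbor search is needed.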

Content-based recommendation learns from historical information about item attributes. Its advantage is that recommendations are easier to verify and easier to explain. It can find items that match a user's known interests, but it cannot discover new kinds of content.

Content-based recommendations have the following shortcomings:

1) Incomplete information mining. Generally, content-based recommendation can analyze only a limited amount of content. In some fields, such as movies, music, or restaurants, item attributes do not capture certain hidden characteristics. Even for text documents, only limited information can be extracted from the content, while many other aspects affect the user's experience. For example, recommendation based on Web content completely ignores aesthetic quality, all multimedia information (including text embedded in images), and network factors such as loading time.

2) Limited range of recommended content. Like many other recommendation technologies, content-based recommendation suffers from the so-called over-specialization problem. When the system recommends only on the basis of user data or item descriptions, users receive only items similar to those they already know. This does not help uncover users' potential interests.

3) Lack of user feedback. This is a general issue for recommendation systems. Rating items is a burden for users, so the fewer ratings required, the better. In content-based recommendation, however, a user's own ratings are the only factor that affects future recommendation performance, so performance also drops when fewer ratings are given.

2.2 Item-based collaborative filtering algorithms

Item-based collaborative filtering predicts a user's rating of the target item from the user's ratings of similar items. It rests on the assumption that if most users rate certain items similarly, the current user will also rate those items similarly. An item-based collaborative filtering recommendation system uses statistical techniques to find the nearest neighbors of the target item. Because the current user's ratings of the nearest neighbors are similar to his rating of the target item, the rating of the target item can be predicted from his ratings of the nearest neighbors. The items with the highest predicted ratings are then returned to the user as the recommendation result [22].

Table 2 User rating data

        Star Wars   Titanic   Lord of the Rings   Langfang Dream
Jia     4           2         4                   4
B       3           5         3                   3
C       2           2         3                   2
Ding    5           1         5                   ?

The core of the item-based collaborative filtering recommendation algorithm is to produce the final recommendation from the nearest neighbors of the target item: the user's rating of the target item is approximated by the weighted average of the user's ratings of those nearest neighbors. For example, with the user rating data in Table 2, the algorithm needs to predict user Ding's rating of "Langfang Dream". The data shows that users rate "Langfang Dream" very similarly to "Star Wars", so "Star Wars" is the best neighbor of "Langfang Dream" and Ding's rating of "Star Wars" has the greatest influence on the predicted value. Next, users rate "Lord of the Rings" similarly to "Langfang Dream", so Ding's rating of "Lord of the Rings" also has a large influence on the predicted value. "Titanic" is not a good neighbor of "Langfang Dream" because users rate the two in conflicting ways, so Ding's rating of "Titanic" has little influence on the predicted value. In the actual prediction process, only the few neighbors most similar to the target item are retrieved, and the user's rating of the target item is then predicted according to their similarity.

Unlike the user-based collaborative filtering recommendation algorithm, the item-based algorithm selects the nearest neighbor set of the target item by computing the similarity between items, predicts the current user's rating of the target item from his ratings of those nearest neighbors, and then returns the items with the highest predicted ratings as the recommendation result. The item-based collaborative filtering recommendation algorithm can be divided into the following two phases:

1) Nearest neighbor query: search for the nearest neighbors of the target item.

2) Recommendation generation: predict the current user's rating of the target item from his ratings of its nearest neighbors, and produce the top-N item recommendations.
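The two phases can be sketched on the Table 2 data. This is a minimal illustration under stated assumptions: the four rating columns are taken to be, in order, "Star Wars", "Titanic", "Lord of the Rings", and "Langfang Dream" (inferred from the discussion of the table), cosine similarity stands in for whichever measure a real system uses, and the function names are hypothetical.

```python
from math import sqrt

# Assumed column order for Table 2; None marks the rating to be predicted.
ITEMS = ["Star Wars", "Titanic", "Lord of the Rings", "Langfang Dream"]
RATINGS = {
    "Jia":  [4, 2, 4, 4],
    "B":    [3, 5, 3, 3],
    "C":    [2, 2, 3, 2],
    "Ding": [5, 1, 5, None],
}

def item_similarity(ratings, a, b):
    """Cosine similarity of item columns a and b over users who rated both."""
    pairs = [(r[a], r[b]) for r in ratings.values()
             if r[a] is not None and r[b] is not None]
    dot = sum(x * y for x, y in pairs)
    na = sqrt(sum(x * x for x, _ in pairs))
    nb = sqrt(sum(y * y for _, y in pairs))
    return dot / (na * nb) if na and nb else 0.0

def nearest_items(ratings, user, target, k):
    """Phase 1: the k items most similar to the target that the user has rated."""
    candidates = [(item_similarity(ratings, target, j), j)
                  for j in range(len(ITEMS))
                  if j != target and ratings[user][j] is not None]
    return sorted(candidates, reverse=True)[:k]

def predict(ratings, user, target, k):
    """Phase 2: similarity-weighted average of the user's ratings of the neighbors."""
    neighbors = nearest_items(ratings, user, target, k)
    total = sum(s for s, _ in neighbors)
    if not total:
        return None
    return sum(s * ratings[user][j] for s, j in neighbors) / total

target = ITEMS.index("Langfang Dream")
neighbor_names = [ITEMS[j] for _, j in nearest_items(RATINGS, "Ding", target, k=2)]
prediction = predict(RATINGS, "Ding", target, k=2)
print(neighbor_names, prediction)
```

With k=2 the neighbors are "Star Wars" and "Lord of the Rings", both of which Ding rated 5, so the weighted average is 5 up to floating-point rounding. "Titanic" ranks last and is left out.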

2.3 Combined recommendation algorithms based on content recommendation and collaborative recommendation

Although traditional collaborative filtering produces remarkable results, cold start and sparsity still hurt its performance. The key issue is that collaborative filtering depends on many users helping one another: items are selected collaboratively only after users have rated them. Because nobody has yet rated a new item, it cannot take part in recommendation, and the recommendation system has no effect for it. The more ratings there are, the more effective the system becomes. For a new item, therefore, we need to work out how it can take part in recommendation from the start, so that the whole recommendation system develops in a virtuous circle.

Because it approaches the problem from a different angle, content-based recommendation can effectively address the cold start and sparsity problems of collaborative filtering. A simple example illustrates this. Suppose one user has rated an NBA web page on ESPN.com favorably, and another user has rated an NBA web page on cnnsi.com favorably. Collaborative filtering alone will not find any similarity between the two users. Content-based analysis, however, learns users' interests from the feature attributes of the items they rate, and so recognizes that the two pages are similar. Then, from the attributes of similar items and users' ratings of similar items, a user's rating of a new item can be preliminarily predicted. This addresses the cold start problem, and the sparsity problem can be addressed in the same way.

Based on the above analysis, some researchers have proposed a combined recommendation method that uses content-based recommendation to improve collaborative filtering: content-based recommendation makes up for the weakness of collaborative filtering on new items, and thereby addresses the cold start and sparsity problems. The combined algorithm first uses content-based analysis to examine the feature attributes of a new item and identify similar items, and predicts ratings of the new item from the ratings of those similar items. Traditional collaborative filtering is then used to compute the neighbor users within the scope of the similar items and give the final predicted rating, so that new items can take part in recommendation.
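The content-based first stage of such a combined method might look like the sketch below. It covers only the preliminary prediction for a new, unrated item; the subsequent collaborative filtering stage is omitted. The page names, feature sets, ratings, and the use of Jaccard similarity are all assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def predict_new_item(features, user_ratings, new_item, k=2):
    """Preliminary rating for an unrated item from the user's ratings of
    the k most content-similar items he has already rated."""
    sims = [(jaccard(features[new_item], features[i]), i) for i in user_ratings]
    top = [(s, i) for s, i in sorted(sims, reverse=True)[:k] if s > 0]
    total = sum(s for s, _ in top)
    if not top:
        return None  # nothing similar enough to base a prediction on
    return sum(s * user_ratings[i] for s, i in top) / total

# Hypothetical pages described by feature attributes.
features = {
    "espn_nba":     {"nba", "basketball", "scores"},
    "cnnsi_nba":    {"nba", "basketball", "news"},
    "garden_tips":  {"plants", "soil"},
    "new_nba_page": {"nba", "basketball", "playoffs"},  # no ratings from anyone yet
}
user_ratings = {"espn_nba": 5, "cnnsi_nba": 4, "garden_tips": 2}

preliminary = predict_new_item(features, user_ratings, "new_nba_page", k=2)
print(preliminary)
```

The new page gets a usable preliminary rating even though no user has rated it, which is what lets it enter the collaborative stage.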

2.4 Clustering-based collaborative filtering recommendation algorithm

With the further expansion of e-commerce systems, collaborative filtering recommendation algorithms face a huge challenge in meeting real-time requirements. In a system with tens of thousands of users and products, it is increasingly difficult to provide real-time recommendation services for all of those users [4].

GroupLens was the first automated collaborative filtering recommendation system built to handle large-scale datasets. It provides real-time news recommendation services to its users. In GroupLens, news must be manually classified into different newsgroups. Because each user is in a specific newsgroup at any given time, the nearest neighbor query is restricted to that newsgroup, and so are the recommendations returned to the user. This reduces the search space and effectively addresses the real-time challenge faced by collaborative filtering recommendation algorithms.

This solution has the following main shortcomings. First, manual classification of items by content is highly subjective and therefore inaccurate. Second, in many large e-commerce systems the number of items is huge, and manual classification is so time-consuming that it is unrealistic. Finally, this method can recommend items only within a limited range, and so loses the opportunity to give users other valuable recommendations.

One effective way to solve these problems is to classify items automatically based on their content information. This solution also has drawbacks. The main problem is that content information must be supplied for all items, and in many cases it cannot be obtained, for example for images and videos.

A clustering-based collaborative filtering recommendation algorithm provides another solution. The entire user space is divided into several clusters according to users' purchase habits and rating patterns, so that users within a cluster rate items as similarly as possible, while users in different clusters rate items as differently as possible. A virtual user is generated for each cluster from the ratings of the users in it; the virtual user represents the typical ratings of items in that cluster. The set of all virtual users becomes the new search space: the nearest neighbors of the current user are queried in the virtual user space and used to generate the recommendation results. The virtual user space is much smaller than the original user space, so the nearest neighbor query is much more efficient, which effectively improves the real-time response of the recommendation algorithm.

Cluster analysis has been studied in depth in the field of data mining. K-means is one of the simplest and most widely used clustering algorithms. The main steps for clustering the entire user space with the K-means algorithm are as follows:

1) Randomly select k users as seed nodes, and use their ratings of the items as the initial cluster centers.
2) For each remaining user, compute the similarity between the user and the k cluster centers, and assign the user to the cluster with the highest similarity.
3) Compute the average rating of all users in each newly formed cluster to obtain the new cluster center.
4) Repeat steps 2 and 3 until the clusters no longer change.
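The four steps above can be sketched as follows. Several details are assumptions made to keep the example short: every user is assumed to have rated every item (dense rating vectors), cosine similarity is used in step 2, and the optional `seeds` parameter (not part of the description above) lets the caller fix the initial users so the run is reproducible.

```python
import random
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans_users(ratings, k, seeds=None, max_iter=100, rng=None):
    """Cluster users by their rating vectors; returns (assignment, centers)."""
    users = sorted(ratings)
    rng = rng or random.Random()
    seeds = seeds or rng.sample(users, k)          # step 1: pick k seed users
    centers = [list(ratings[u]) for u in seeds]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each user to the most similar cluster center.
        new_assignment = {
            u: max(range(k), key=lambda c: cosine(ratings[u], centers[c]))
            for u in users
        }
        if new_assignment == assignment:           # step 4: stop when nothing changes
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean rating vector of its members.
        for c in range(k):
            members = [ratings[u] for u in users if assignment[u] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Hypothetical dense ratings: two users prefer the first two items, two the last two.
ratings = {
    "u1": [5, 5, 1, 1],
    "u2": [4, 5, 1, 2],
    "u3": [1, 1, 5, 4],
    "u4": [2, 1, 5, 5],
}

assignment, centers = kmeans_users(ratings, k=2, seeds=["u1", "u3"])
print(assignment, centers)
```

With random seeds the result depends on the draw; real rating data is also sparse, so a practical version would need a similarity measure that handles missing ratings.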

After the clusters are generated, the clustering-based collaborative filtering recommendation algorithm can be divided into the following two steps:

1) Generation of the virtual user set: a cluster center is generated for each cluster. The cluster center minimizes the sum of distances to the users in the cluster, and represents the typical ratings of items in that cluster. All cluster centers together form the virtual user set.

2) Recommendation generation: search the virtual user set for the nearest neighbors of the current user, using any of the usual similarity measures, and then generate the recommendation result from the nearest neighbors' ratings of the items. The nearest neighbor search and recommendation generation are similar to those of the collaborative filtering recommendation algorithm.
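A sketch of the recommendation step, assuming the virtual user set is already available as a list of cluster-center rating vectors. The center values, the current user's ratings, and the use of a single nearest virtual user with cosine similarity are illustrative assumptions.

```python
from math import sqrt

def cosine_on_rated(user, center):
    """Cosine similarity computed only over items the user has rated (None = unrated)."""
    pairs = [(u, c) for u, c in zip(user, center) if u is not None]
    dot = sum(u * c for u, c in pairs)
    nu = sqrt(sum(u * u for u, _ in pairs))
    nc = sqrt(sum(c * c for _, c in pairs))
    return dot / (nu * nc) if nu and nc else 0.0

def recommend(user, virtual_users, n):
    """Find the nearest virtual user, then rank the user's unrated items
    by that virtual user's typical rating."""
    nearest = max(virtual_users, key=lambda c: cosine_on_rated(user, c))
    unrated = [i for i, r in enumerate(user) if r is None]
    return sorted(unrated, key=lambda i: nearest[i], reverse=True)[:n]

# Hypothetical virtual users (cluster centers) over four items.
virtual_users = [
    [4.5, 5.0, 1.0, 1.5],
    [1.5, 1.0, 5.0, 4.5],
]
current_user = [5, None, 1, None]   # has rated items 0 and 2 only

top_items = recommend(current_user, virtual_users, n=1)
print(top_items)
```

The search compares the current user with one vector per cluster rather than one per user, which is where the speed-up comes from.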
