Author: Sina Weibo @Kokoko. This article was translated by Sea of the 36 Big Data translation group. Reprinting requires the consent of this site, the original author, and the translator; reprints that do not credit the translator and the source are not permitted.
How can the accuracy of a recommendation algorithm be improved? The main approaches are feature transformation, model selection, and data processing. Dimensionality reduction is an important part of feature processing.
This post describes how to use dimensionality reduction to improve user-based collaborative filtering, which recommends items by measuring the similarity between users. Each method is explained in the sections that follow.
1. Baseline
Every method in this post is evaluated by its RMSE (root mean square error). I set the baseline using three kinds of averages. First, the predicted rating for item j is the average of all user ratings for item j.
Nu is the number of users; rating(i, j) is user i's rating for item j.
Second, the predicted rating for user i is the average of that user's ratings over all items.
Nm is the number of items (films); rating(i, j) is user i's rating for item j.
Third, the predicted rating of user i for item j is "user i's mean rating over all items" + "the mean rating of item j over all users" - "the overall mean rating".
The RMSE scores of the three methods are as follows:
As you can see, the third method considers both the item and the user's preferences, so it performs better than the other two.
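The three baselines and the RMSE used to score them can be sketched as follows. This is a minimal illustration, not the author's code: the dictionary layout, the function names (`item_mean`, `user_mean`, `combined`, `rmse`), and the tiny ratings table are all assumptions made for demonstration.

```python
import math

# Toy data, invented for illustration: ratings[user][item] = rating.
ratings = {
    "u1": {"a": 5.0, "b": 3.0},
    "u2": {"a": 4.0, "b": 2.0, "c": 1.0},
    "u3": {"b": 4.0, "c": 2.0},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def overall_mean(ratings):
    """Mean of every rating in the data set."""
    return mean(r for items in ratings.values() for r in items.values())

def item_mean(ratings, item):
    """Baseline 1: average of all user ratings for the item."""
    return mean(items[item] for items in ratings.values() if item in items)

def user_mean(ratings, user):
    """Baseline 2: average of the user's ratings over all items."""
    return mean(ratings[user].values())

def combined(ratings, user, item):
    """Baseline 3: user mean + item mean - overall mean."""
    return (user_mean(ratings, user)
            + item_mean(ratings, item)
            - overall_mean(ratings))

def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(mean((p - a) ** 2 for p, a in zip(predicted, actual)))
```

In this toy table, `combined(ratings, "u1", "c")` predicts a rating for an item that user u1 has not rated yet by combining u1's own average with the average for item c.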
2. "Simple method": user-based collaborative filtering
This "simple method" uses Pearson correlation similarity (PCS) to find similar users and uses the mean of their ratings as the predicted rating for an item.
The first approach takes the users most similar to the user I want to predict for and averages the ratings those users gave the item.
But even its best result is no better than the best baseline. So I modified the first method to consider only users whose correlation coefficient is greater than 0; this is the second method.
Sim(ui) is the set of users similar to user ui. We do not use clustering here, but count(Sim(ui)), the number of similar users, affects the result.
In the third approach, I again use the users most similar to the user I want to predict for, but I use each user's similarity as the weight of that user's rating for the item.
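The approaches above can be sketched in a few lines. This is a hedged illustration rather than the original implementation: the function names, the `top_n` and `weighted` parameters, and the choice to keep the "correlation greater than 0" filter in the weighted variant are assumptions.

```python
import math

def pearson(a, b):
    """Pearson correlation of two users over the items both have rated."""
    common = [i for i in a if i in b]
    if len(common) < 2:
        return 0.0
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - ma) * (b[i] - mb) for i in common)
    va = sum((a[i] - ma) ** 2 for i in common)
    vb = sum((b[i] - mb) ** 2 for i in common)
    if va == 0 or vb == 0:
        return 0.0  # correlation is undefined for a constant rater
    return cov / math.sqrt(va * vb)

def predict(ratings, user, item, top_n=50, weighted=True):
    """Predict a rating from the top_n most similar users who rated the item.

    Only users with a positive correlation are kept (assumption: the
    filter of the second method is applied in both variants).
    weighted=True weights each neighbour's rating by its similarity;
    weighted=False takes a plain average.
    """
    sims = []
    for other, items in ratings.items():
        if other != user and item in items:
            s = pearson(ratings[user], items)
            if s > 0:
                sims.append((s, items[item]))
    sims.sort(key=lambda pair: pair[0], reverse=True)
    neighbours = sims[:top_n]
    if not neighbours:
        return None  # no usable neighbour; a caller could fall back to a baseline
    if weighted:
        return sum(s * r for s, r in neighbours) / sum(s for s, _ in neighbours)
    return sum(r for _, r in neighbours) / len(neighbours)
```

Returning `None` when no neighbour qualifies keeps the sketch honest about cold-start cases; a real system would likely fall back to one of the baseline predictions.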
In the fourth approach, we normalize the ratings: "user i's rating for item j" - "the item's average rating" - "the user's average rating" + "the overall mean rating". We then use the normalized matrix to compute user similarity; the rest is the same as the second method. The prediction is computed from the normalized matrix and mapped back to the rating scale: "user i's normalized rating for item j" + "the item's average rating" + "the user's average rating" - "the overall mean rating". (This method is taken from "Big Data Mining".) The results of the four methods are as follows:
Overall, the four approaches get progressively better, and the fourth is the best: it beats the best baseline method. Looking at the chart "score vs. top N", we see a change from high bias to high variance, although the high variance does not seem to be a serious problem.
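The normalization used by the fourth approach, and the step that maps a normalized prediction back to the rating scale, might look like this. It is a sketch under assumptions: the names `normalize` and `denormalize` and the nested-dictionary layout are illustrative, and every user is assumed to have at least one rating.

```python
def normalize(ratings):
    """Return (normalized ratings, user means, item means, overall mean).

    normalized[u][i] = rating - item mean - user mean + overall mean
    """
    all_values = [r for items in ratings.values() for r in items.values()]
    mu = sum(all_values) / len(all_values)
    user_mean = {u: sum(items.values()) / len(items)
                 for u, items in ratings.items()}
    by_item = {}
    for items in ratings.values():
        for i, r in items.items():
            by_item.setdefault(i, []).append(r)
    item_mean = {i: sum(rs) / len(rs) for i, rs in by_item.items()}
    normalized = {
        u: {i: r - item_mean[i] - user_mean[u] + mu for i, r in items.items()}
        for u, items in ratings.items()
    }
    return normalized, user_mean, item_mean, mu

def denormalize(score, user, item, user_mean, item_mean, mu):
    """Map a normalized prediction back to the original rating scale."""
    return score + item_mean[item] + user_mean[user] - mu
```

Similarities and neighbour averages are then computed on `normalized`, and only the final prediction is passed through `denormalize`.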
3. K-means item clustering
We use the fourth method from the "simple method" section as the benchmark, cluster the items with k-means, and then use the cluster information to compute user similarity. There are two methods: the first uses the raw data <user, item, rating> to compute the predicted rating; the second uses the clustered data <user, cluster, mean rating> (the mean rating is the mean rating over the cluster) to compute the predicted rating.
The number of clusters, the number of similar users, the number of training iterations, and many other implementation details affect the results. I only vary the number of clusters and the number of similar users. In the results chart, the title of each subplot is the number of clusters, the x-axis is the number of similar users (top N), and the y-axis is the RMSE.
As we can see, accuracy improves as top N grows, then levels off and becomes slightly worse. Method 2 is smoother than method 1, but the best result still comes from method 1. Most results are no better than the best simple collaborative filtering method, but the best ones are, for example method 1 with topN = 50 and cluster number = 150, whose RMSE is 0.932186048.
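The second method's <user, cluster, mean rating> data can be built from the raw ratings once each item has a cluster label. The sketch below assumes the labels already exist (for example from a k-means run over the items, which is not shown); the function name `cluster_means` and the `item_cluster` mapping are hypothetical.

```python
def cluster_means(ratings, item_cluster):
    """Collapse <user, item, rating> into <user, cluster, mean rating>.

    item_cluster maps each item to its cluster label. For every user,
    the ratings of items in the same cluster are averaged.
    """
    result = {}
    for user, items in ratings.items():
        sums = {}
        for item, r in items.items():
            c = item_cluster[item]
            total, count = sums.get(c, (0.0, 0))
            sums[c] = (total + r, count + 1)
        result[user] = {c: total / count for c, (total, count) in sums.items()}
    return result
```

With far fewer clusters than items, each user's vector becomes short and dense, which is where the dimensionality reduction comes from.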
4. EM item clustering
We use the fourth method from the "simple method" section as the benchmark, cluster the items with the EM method, and then use the cluster information to compute user similarity. There are several choices to make:
(1) Use only the cluster with the maximum probability as the predicted cluster, or use the full cluster probability information;
(2) Use the <user, item, rating> matrix to compute the predicted rating, or use the cluster information <user, cluster, mean rating> to compute it;
(3) The EM algorithm has a number of parameters to choose from, such as the covariance type, which may be spherical, tied, diagonal, or full.
Because time is limited, we only cover two methods here. In the first, we use the cluster with the highest probability as the predicted cluster and use the original <user, item, rating> matrix to compute the predicted rating. In the second, we use the cluster probability information to find the top N similar users and use the <user, cluster, mean rating> matrix (the mean rating is the mean rating over the cluster) to compute the predicted rating. The results are as follows:
Again, for both methods accuracy improves as top N grows, then levels off and becomes slightly worse. Method 1 does well: its best scores beat the benchmark from the simple method, for example topN = 50 / cluster number = (RMSE: 0.925359902) and topN = 50 / cluster number = 150 (RMSE: 0.926167057). Method 2 is more stable, and some of its results are also very good, such as topN = 100 / cluster number = 150 (RMSE: 0.931907). As the number of clusters increases, accuracy improves, but so does the time cost.
5. Similarity calculation
In the previous sections, I only used Pearson correlation similarity (PCS) to find similar users, but there are many other measures to try, such as:
(1) Cosine similarity, which measures the similarity between two vectors in an inner product space as the cosine of the angle between them. We can treat a user's ratings as a vector.
(2) Euclidean distance, which treats the data as points; the distance between data points x and y is the length of the line segment connecting x and y.
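Both measures are short to write down. The sketch below assumes each user is an equal-length list of numbers; the conversion from Euclidean distance to a similarity, 1 / (1 + d), is a common convention chosen here for illustration and is not stated in the original post.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0 or ny == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (nx * ny)

def euclidean_distance(x, y):
    """Length of the line segment connecting points x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def euclidean_similarity(x, y):
    """Assumed conversion: identical points score 1, distant points near 0."""
    return 1.0 / (1.0 + euclidean_distance(x, y))
```

Cosine similarity ignores vector length and compares direction only, while Euclidean distance is sensitive to how many items a user has rated and how large the ratings are.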
In the first step, we normalize the ratings (as in the fourth method of the simple method): user i's normalized rating for item j is user i's original rating for item j - the item's average rating - the user's average rating + the overall average rating.
In the second step, we can choose how each similar user's rating is weighted: we either use the similarity score as the weight of that similar user, or treat every weight as 1. We call the former SWM (similarity weight method) and the latter OWM ("1" weight method).
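One way to write down the two weighting schemes is given below. This is a hedged reconstruction from the description above, not a formula copied from the original post. The symbols are assumptions: Sim(i) is the set of top-N users similar to user i who rated item j, s_iu is the similarity between users i and u (assumed positive), and r-tilde is the normalized rating.

```latex
% SWM: each similar user's normalized rating is weighted by its similarity
\hat{r}_{ij} = \frac{\sum_{u \in \mathrm{Sim}(i)} s_{iu}\, \tilde{r}_{uj}}
                    {\sum_{u \in \mathrm{Sim}(i)} s_{iu}}

% OWM: every similar user gets weight 1, which is a plain average
\hat{r}_{ij} = \frac{1}{|\mathrm{Sim}(i)|} \sum_{u \in \mathrm{Sim}(i)} \tilde{r}_{uj}
```

In both cases the result is still on the normalized scale and is mapped back by adding the item's average rating and the user's average rating and subtracting the overall average rating.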
I compare the Pearson correlation similarity (PCS), cosine similarity, and Euclidean distance similarity methods, with the weight of each similar user set to 1 (the OWM method). In the chart of score vs. top N (the number of similar users used), we see that cosine similarity is the best and Euclidean distance is the worst.
Next, I use cosine similarity to compare the two weighting methods.
We see that SWM is more stable, but the best score comes from OWM with topN = 100 (the value is 0.924696436).
6. Dimensionality reduction methods
In this section, I cover direct dimensionality reduction methods:
(1) PCA (principal component analysis), a linear dimensionality reduction method that uses the singular value decomposition of the data and keeps only the most significant singular vectors to project the data into a lower-dimensional space.
(2) ICA (independent component analysis), which separates a multivariate signal into additive subcomponents that are maximally independent.
I used cosine as the similarity measure and tried both weighting methods, SWM and OWM. The results improve. For both PCA and ICA the best score is obtained with SWM: the best score for PCA is 0.917368073 (topN = 5, SWM), and the best score for ICA is 0.916354841 (topN = 5, SWM), which is the best of all the methods in this report.
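The PCA step followed by cosine similarity can be sketched with NumPy. This is an illustration under assumptions, not the author's code: the user-by-item matrix is assumed to be dense (for instance the normalized ratings with unrated entries set to 0), and the function names are hypothetical. ICA is not shown.

```python
import numpy as np

def pca_reduce(matrix, n_components):
    """Project the rows of matrix onto its top principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def cosine_matrix(rows):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = rows / norms
    return unit @ unit.T
```

A typical use would be `cosine_matrix(pca_reduce(user_item, k))`, where `user_item` and `k` are placeholders: the similarities are computed between users in the reduced space, and the top N neighbours are picked from each row.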
7. Conclusion
From the sections above, we can see that clustering and direct dimensionality reduction are effective in a small-scale collaborative filtering recommender. They can help with the problem of poor generalization. There are also many other techniques that affect the accuracy of a recommendation engine, such as the LDA topic model, and combining many methods using ensemble algorithms and logistic regression. I will cover these in later posts.
Code hosted at: HTTPS://GITHUB.COM/WANGKOBE88/MERCURY/TREE/MASTER/UBCF
Original article: http://www.wangke.me/?p=142 . Thanks to the original author @Kokoko for the great support.
End.
Originally from: http://www.36dsj.com/archives/26773
User-based collaborative filtering optimization using dimensionality reduction method