Http://yidianzixun.com/n/09vv1FRK?s=1
Excerpted in full from the Web page
1 Collective intelligence and collaborative filtering
1.1 What is collective intelligence (social computing)?
Collective intelligence is not unique to the Web 2.0 era, but in the Web 2.0 era we use collective intelligence to build more interesting applications and better user experiences on the Web. Collective intelligence gathers answers from the behavior and data of a large number of people, yielding statistical conclusions about the whole population that could never be drawn from any single individual; these are often a trend, or something the population has in common.
Wikipedia and Google are two typical Web 2.0 applications that make use of collective intelligence:
- Wikipedia is an encyclopedia built on knowledge management. Compared with traditional encyclopedias edited by domain experts, Wikipedia lets end users contribute knowledge, and as the number of participants grows, it becomes a comprehensive knowledge base of unparalleled coverage. Some question its authority, but look at it from another angle: when a book is published, the author may be authoritative, yet errors are inevitable, and revision after revision makes the content ever more complete. On Wikipedia, such revision and correction become something everyone can do: anyone who finds an error or a gap can contribute their ideas, and even if some of the information is wrong, it gets corrected by others quickly. Seen macroscopically, the whole system keeps improving along the trajectory of a virtuous circle, and that is the charm of collective intelligence.
- Google is currently the most popular search engine. Unlike Wikipedia, it does not require users to make explicit contributions, but look carefully at the core idea of Google's PageRank: it exploits the link relationships between Web pages, taking how many other pages link to the current page as a measure of that page's importance. If that is hard to picture, think of it as an election in which every Web page is both a candidate and a voter; after a certain number of iterations, PageRank arrives at a relatively stable score for each page. Google thus exploits the collective intelligence embedded in the links of all the Web pages on the Internet to find which pages are important. A minimal sketch of this iteration follows this list.
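Below is a minimal PageRank power-iteration sketch in MATLAB. The four-page link graph and the damping factor are illustrative assumptions, not Google's actual data or implementation:

```matlab
% Minimal PageRank power iteration over a made-up four-page link graph.
% A(i,j) = 1 means page j links to page i.
A = [0 1 1 0;
     1 0 0 1;
     1 1 0 1;
     0 0 1 0];
M = A ./ sum(A, 1);      % each page splits its "vote" among its out-links
d = 0.85;                % damping factor (assumed, commonly cited value)
n = size(A, 1);
r = ones(n, 1) / n;      % every page starts with an equal score
for k = 1:50             % iterate until the scores become relatively stable
    r = (1 - d) / n + d * (M * r);
end
disp(r')                 % the stable scores measure page importance
```

Each iteration lets every page pass its current score along its out-links, which is exactly the repeated "election" described above.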
1.2 What is collaborative filtering?
Collaborative filtering is a typical application of collective intelligence. To understand collaborative filtering (CF), first consider a simple question: if you want to watch a movie right now but don't know which one, what do you do? Most people ask the friends around them what good movies they have seen recently, and we generally prefer recommendations from friends whose tastes are similar to ours. This is the core idea of collaborative filtering.
Collaborative filtering generally finds, among a large crowd of users, the small fraction whose tastes are similar to yours. In collaborative filtering, these users are called neighbors, and the other things they like are organized into a ranked list and recommended to you. This, of course, raises two core questions:
- How do you determine whether a user's tastes are similar to yours?
- How do you organize your neighbors' preferences into a ranked list?
Compared with collective intelligence in general, collaborative filtering preserves the individual's characteristics, namely your taste preferences, to a certain extent, so it serves well as the core idea of personalized recommendation algorithms. As you can imagine, this recommendation strategy matters in Web 2.0's long tail: recommending mass-market hits to users in the long tail can hardly produce good results, and this goes back to one of the core issues of recommender systems: know your users, and only then can you give better recommendations.
2 Going deep into the core of collaborative filtering
Having introduced the basic ideas of collective intelligence and collaborative filtering as background, this section analyzes the principles of collaborative filtering and presents the various recommendation mechanisms based on it, their advantages and disadvantages, and their practical scenarios.
First, implementing collaborative filtering takes a few steps:
- Collect user preferences
- Find similar users or items
- Compute recommendations
2.1 Collecting User Preferences
To find patterns in users' behavior and preferences and make recommendations based on them, how we collect preference information becomes the most fundamental determinant of how well the system recommends. Users have many ways of providing their preferences to a system, and these can vary greatly between applications. Some examples:
Table 1: User behaviors and user preferences

| User behavior | Type | Features | Function |
| --- | --- | --- | --- |
| Rating | Explicit | Integer-quantized preference; possible values are [0, N], where N is generally 5 or 10 | Users' preferences can be obtained accurately from their ratings of items. |
| Voting | Explicit | Boolean-quantized preference; value is 0 or 1 | Users' preferences can be obtained fairly accurately from their votes on items. |
| Forwarding | Explicit | Boolean-quantized preference; value is 0 or 1 | Users' preferences can be obtained accurately from the items they forward. If forwarding happens inside the site, the recipient's preference can also be inferred (imprecisely). |
| Saving a bookmark | Explicit | Boolean-quantized preference; value is 0 or 1 | Users' preferences can be obtained accurately from the items they bookmark. |
| Tagging | Explicit | A set of words that must be analyzed to obtain a preference | Analyzing a user's tags reveals their understanding of an item and allows sentiment analysis: like or dislike. |
| Commenting | Explicit | A piece of text that requires text analysis to obtain a preference | Analyzing a user's comments reveals their sentiment: like or dislike. |
| Click stream (viewing) | Implicit | A set of user clicks; which items interest the user must be derived by analysis | Clicks reflect a user's attention to some degree, so they also reflect preferences to some degree. |
| Page dwell time | Implicit | A set of timing data; noisy, so it must be de-noised and analyzed to obtain a preference | Dwell time reflects a user's attention and preference to some degree, but the noise is large and it is hard to use well. |
| Purchase | Implicit | Boolean-quantized preference; value is 0 or 1 | A purchase states very clearly that the user is interested in the item. |
The user behaviors listed above are fairly general; the designer of a recommendation engine can add behaviors specific to their own application and use them to express users' preferences for items.
A typical application extracts more than one kind of user behavior. There are basically two ways to combine the different behaviors:
- Group different behaviors: generally into groups such as "view" and "buy", then compute separate user/item similarities per behavior, as in Dangdang's or Amazon's "people who bought this book also bought ..." and "people who viewed this book also viewed ...".
- Weight the behaviors according to how strongly each reflects user preference, producing an overall preference for each user-item pair. Generally, explicit feedback gets a larger weight than implicit feedback, but it is also sparser, since the number of users giving explicit feedback is small; and a purchase reflects preference more strongly than a "view", though this too varies by application.
After collecting the user behavior data, we also need some preprocessing, the core of which is noise reduction and normalization.
- Noise reduction: behavior data is generated as users use the application, so it contains a lot of noise and accidental operations. Filtering the noise out of the behavior data with classical data-mining techniques makes our analysis more accurate.
- Normalization: as mentioned above, we may need to weight different behaviors when computing a user's preference for an item. However, the values of different behaviors can differ wildly; a user's view counts, for example, are necessarily much larger than their purchase counts. To make the weighted sum of the overall preference accurate, the data for each behavior must be brought into the same value range, which requires normalization. The simplest normalization divides each kind of data by the maximum value in its class, guaranteeing the normalized values fall in [0, 1].
After preprocessing, and after grouping or weighting according to the application's chosen behavior-analysis method, we obtain a two-dimensional user preference matrix: one dimension is the list of users, the other is the list of items, and each value is the user's preference for the item, generally a floating-point value in [0, 1] or [-1, 1].
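As a concrete sketch of this pipeline, the snippet below normalizes two kinds of behavior data by their maxima and combines them with weights into the preference matrix just described. All counts and the 0.7/0.3 weights are made-up assumptions:

```matlab
% Illustrative preprocessing: view counts (implicit) and purchase flags
% (explicit) for 3 users x 4 items; all numbers are made up.
views = [12 0 3 7;  0 5 0 2;  9 1 4 0];
buys  = [ 1 0 0 1;  0 1 0 0;  1 0 0 0];
% Normalize each behavior into [0, 1] by dividing by its class maximum.
views_n = views / max(views(:));
buys_n  = buys  / max(buys(:));
% Weighted sum: a purchase is assumed to reflect preference more strongly
% than a view (the weights are arbitrary and application-dependent).
w_buy = 0.7; w_view = 0.3;
preference = w_buy * buys_n + w_view * views_n   % user-item matrix in [0, 1]
```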
2.2 Finding similar users or items
Once user behavior has been analyzed into user preferences, we can compute similar users and items from those preferences, and then recommend based on the similar users or items. These are the two most typical branches of CF: user-based CF and item-based CF. Both require computing similarity, so let us first look at the most basic similarity measures.
Computing similarity
Existing similarity computations are basically vector-based: in effect, they compute the distance between two vectors, and the closer the distance, the greater the similarity. In a recommendation scenario with a two-dimensional user-item preference matrix, we can take a user's preferences for all items as a vector and compute similarity between users, or take all users' preferences for one item as a vector and compute similarity between items. Several commonly used similarity measures follow (a MATLAB sketch of all four appears after the list):
- Euclidean distance
Originally the distance between two points in Euclidean space. Suppose x and y are two points in an n-dimensional space; the Euclidean distance between them is

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
As you can see, when n = 2 the Euclidean distance is simply the distance between two points on the plane.
When Euclidean distance is used to express similarity, the following conversion is generally applied, so that the smaller the distance, the greater the similarity:

sim(x, y) = \frac{1}{1 + d(x, y)}
- Pearson correlation coefficient
The Pearson correlation coefficient is generally used to measure how closely two interval-scaled variables are related; its value lies in [-1, +1]:

p(x, y) = \frac{\sum x_i y_i - n \bar{x} \bar{y}}{(n - 1)\, s_x s_y}

where s_x and s_y are the sample standard deviations of x and y.
- Cosine similarity
Cosine similarity is widely used to compute the similarity of document data:

sim(x, y) = \cos\theta = \frac{x \cdot y}{\|x\| \, \|y\|}
- Tanimoto coefficient
The Tanimoto coefficient, also known as the Jaccard coefficient, is an extension of cosine similarity and is likewise often used to compute the similarity of document data:

sim(x, y) = \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}
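A minimal MATLAB sketch of all four measures on two toy preference vectors; the vectors are made up, and degenerate cases such as zero-norm vectors are ignored for brevity:

```matlab
% Four similarity measures between toy preference vectors x and y.
x = [5 3 0 1]';
y = [4 0 0 1]';
d = sqrt(sum((x - y).^2));                    % Euclidean distance
sim_euclid = 1 / (1 + d);                     % smaller distance, larger similarity
c = corrcoef(x, y);                           % Pearson correlation matrix
sim_pearson = c(1, 2);                        % value in [-1, +1]
sim_cosine = (x' * y) / (norm(x) * norm(y));  % cosine similarity
sim_tanimoto = (x' * y) / (norm(x)^2 + norm(y)^2 - x' * y);  % Tanimoto (Jaccard)
```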
Computing similar neighbors
Having introduced similarity measures, let us look at how to select user or item neighbors based on similarity. The common principles for selecting neighbors fall into two categories (Figure 1 illustrates them on a set of points in a two-dimensional plane, and a code sketch of both rules follows the figure):
- A fixed number of neighbors: K-neighborhoods, or fixed-size neighborhoods
Regardless of how "near or far" neighbors actually are, take only the nearest K points as neighbors. In part A of Figure 1, suppose we compute the 5-neighborhood of point 1: ranked by distance, the nearest 5 points are points 2, 3, 4, 7 and 5. Clearly, this method handles outliers poorly: because a fixed number of neighbors must be taken, when there are not enough sufficiently similar points nearby it is forced to accept some rather dissimilar points as neighbors, which hurts neighbor similarity; in Figure 1, for example, points 1 and 5 are not really very similar.
- Neighbors within a similarity threshold: threshold-based neighborhoods
Unlike the fixed-count principle, threshold-based neighbor selection bounds how dissimilar a neighbor may be: centered on the current point, all points falling inside a region of radius K are taken as the current point's neighbors. The number of neighbors this method yields is indeterminate, but the similarity never suffers a large error. In part B of Figure 1, starting from point 1 and selecting the points within distance K, we get points 2, 3, 4 and 7. This method yields better neighbor similarity than the previous one, especially in its handling of outliers.

Figure 1: Illustration of similar-neighbor computation
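Given one row of a similarity matrix, both selection rules take only a few lines. The similarity values below are made up to mirror the situation in Figure 1:

```matlab
% Two neighbor-selection rules over one row of a similarity matrix.
% sim(i) is the (made-up) similarity of point 1 to point i; sim(1) = 0.
sim = [0 0.90 0.80 0.75 0.45 0.20 0.72 0.10];
% K-neighborhoods: always take the K most similar points, near or far.
K = 5;
[~, order] = sort(sim, 'descend');
knn = order(1:K)               % points 2, 3, 4, 7 and the weak outlier 5
% Threshold-based neighborhoods: take every point above a similarity floor.
threshold = 0.5;
tnn = find(sim > threshold)    % points 2, 3, 4, 7; the outlier is dropped
```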
2.3 Computing recommendations
Having computed the neighboring users or neighboring items, we now describe how to make recommendations to users based on that information. The overview article earlier in this series briefly noted that collaborative-filtering recommendation algorithms divide into user-based CF and item-based CF; below we dig into the two methods' computation, use cases, advantages, and disadvantages.
User-based CF (User CF)
The basic idea of user-based CF is quite simple: based on users' preferences for items, find neighboring users, then recommend what the neighbors like to the current user. Computationally, a user's preferences for all items form a vector used to compute similarity between users; after finding K neighbors, the neighbors' similarity weights and their preferences for items are used to predict the current user's preference for items not yet rated, yielding a ranked item list as the recommendation. Figure 2 shows an example: for user A, analysis of the users' historical preferences finds only one neighbor, user C, and item D, which user C likes, is recommended to user A.
Figure 2: How user-based CF works
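A minimal user-based CF sketch over a small made-up preference matrix, where 0 stands for "no preference expressed" and K = 2; all numbers are illustrative assumptions:

```matlab
% User-based CF: predict user 1's preference for its unrated items.
R = [4 0 3 5;                 % rows = users, columns = items (0 = unrated)
     4 2 3 5;
     1 5 4 0;
     4 1 3 4];
u = 1;
% Cosine similarity between user u and every user (eps avoids 0/0).
sims = zeros(size(R, 1), 1);
for v = 1:size(R, 1)
    sims(v) = (R(u,:) * R(v,:)') / (norm(R(u,:)) * norm(R(v,:)) + eps);
end
sims(u) = -Inf;               % exclude the user itself
[~, order] = sort(sims, 'descend');
neighbors = order(1:2);       % the K = 2 most similar users
% Predict each unrated item as the similarity-weighted mean of the
% neighbors' ratings, yielding a ranked list of candidate items.
unrated = find(R(u,:) == 0);
w = sims(neighbors);
pred = (w' * R(neighbors, unrated)) / sum(w)
```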
Item-based CF (Item CF)
The principle of item-based CF is similar to user-based CF, except that neighbors are computed among the items themselves rather than from the users' point of view: based on users' preferences for items, find similar items, then recommend similar items to a user according to their historical preferences. Computationally, all users' preferences for one item form a vector used to compute similarity between items; having obtained similar items, we predict the current user's preference for items they have not yet rated from their historical preferences, yielding a ranked item list as the recommendation. Figure 3 shows an example: for item A, the historical preferences of all users show that those who like item A also like item C, so items A and C are similar; since user C likes item A, we can infer that user C probably also likes item C.
Figure 3: How item-based CF works
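The item-based variant transposes the viewpoint: item columns become the vectors. A minimal sketch on the same kind of made-up matrix follows; a full MATLAB implementation on real data appears in section 3:

```matlab
% Item-based CF: score user u's unrated items from item-item similarity.
R = [4 0 3 5; 4 2 3 5; 1 5 4 0; 4 1 3 4];   % rows = users, columns = items
u = 1;
n_items = size(R, 2);
S = zeros(n_items);                          % item-item cosine similarity
for i = 1:n_items
    for j = 1:n_items
        S(i,j) = (R(:,i)' * R(:,j)) / (norm(R(:,i)) * norm(R(:,j)) + eps);
    end
end
% An unrated item's predicted preference is the similarity-weighted mean
% of the user's own historical ratings.
rated   = find(R(u,:) ~= 0);
unrated = find(R(u,:) == 0);
for m = unrated
    pred = (S(m, rated) * R(u, rated)') / sum(S(m, rated));
    fprintf('predicted preference of user %d for item %d: %.2f\n', u, m, pred);
end
```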
User CF vs. Item CF
The preceding sections covered the basic principles of User CF and Item CF; now let us look at their pros and cons and their applicable scenarios from several angles:
- Computational complexity
Item CF and User CF are the two most basic collaborative-filtering recommendation algorithms. User CF was introduced long ago; Item CF became popular after Amazon's papers and patents were published (around 2001), and the consensus formed that Item CF beats User CF on performance and complexity. One main reason is that for an online site the number of users is often far larger than the number of items, and item data is relatively stable, so computing item similarity is both cheaper and less in need of frequent updates. But we tend to forget that this holds only for e-commerce sites selling goods; for news, blog, or other micro-content recommendation systems the situation is usually the opposite: the number of items is huge and frequently updated. So purely from the complexity angle, each algorithm has the advantage in different systems, and the designer of a recommendation engine must choose the more suitable algorithm based on the characteristics of their own application.
- Applicable scenarios
On non-social-network sites, the internal connections among content items are an important recommendation principle, more effective than recommending by similar users. For instance, on a book-buying site, when you browse a book the recommendation engine shows you recommendations related to that book, and this recommendation matters far more than the site-wide recommendations on the home page. In such cases, Item CF recommendations become an important means of guiding users as they browse. Item CF recommendations are also easy to explain: on a non-social-network site, explaining a book recommendation by saying that someone with similar interests has read it carries little weight, because the user may not know that person at all; but explaining that the book is similar to one you read before is likely to strike the user as reasonable, and the recommendation is more likely to be adopted.
Conversely, on today's popular social-network sites, User CF is the better choice; User CF plus social-network information can increase users' confidence in the explanations for recommendations.
- Recommendation diversity and accuracy
Researchers of recommendation engines have used User CF and Item CF to compute recommendations on the same data set and found that only 50% of the two recommendation lists are the same, with the other 50% completely different, yet the two algorithms achieve similar accuracy. It is fair to say the two algorithms are strongly complementary.
There are two ways to measure the diversity of recommendations:
The first measures diversity from a single user's perspective: given a user, does the system produce a diverse recommendation list? That is, compare the pairwise similarity among the items on the recommended list. It is not hard to see that under this measure Item CF's diversity is clearly worse than User CF's, because Item CF recommends exactly what is most similar to what you have seen before.
The second measure considers the diversity of the system as a whole, also known as coverage: can the recommender provide rich choices across all its users? On this metric, Item CF's diversity is far better than User CF's, because User CF always tends to recommend what is popular, while, seen from the other side, Item CF recommendations have very good novelty and are good at recommending long-tail items. So although Item CF's accuracy is slightly lower than User CF's in most cases, once diversity is taken into account, Item CF is much better than User CF.
If you are still puzzled about recommendation diversity, another example will show how User CF and Item CF differ. Suppose every user's interests are broad, spanning several domains, but each user also has one main domain that gets more attention than the rest. Given a user who likes three domains A, B, and C, with A the main one, what do User CF and Item CF tend to recommend? User CF will recommend the more popular things across the three domains A, B, and C; Item CF will recommend essentially only things from domain A. We see that because User CF recommends only what is hot, its ability to recommend long-tail items is weak; and although Item CF recommends only a single domain to the user, its limited recommendation list may contain a fair number of less-popular long-tail items, yet its recommendations for this one user clearly lack diversity. For the system as a whole, though, different users have different main interests, so system-level coverage will be better.
From this analysis it is clear that both recommendation approaches are reasonable in their own way, yet neither is the best choice, so accuracy is lost either way. In fact, the best choice for such a system would be, if 30 items are to be recommended, neither to pick the 10 hottest from each domain nor to recommend 30 items from a single domain, but, for example, to recommend 15 from domain A and the remaining 15 from B and C. Combining User CF and Item CF is therefore the best choice. The basic principle: when using Item CF leaves individual users' recommendations insufficiently diverse, add User CF to increase individual diversity and thereby improve accuracy; and when using User CF leaves the system's overall diversity insufficient, add Item CF to increase overall diversity, likewise improving the recommendation accuracy.
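Both notions of diversity can be measured directly. In the sketch below, the recommendation lists and the item-item similarity matrix S are made-up placeholders standing in for a real system's output:

```matlab
% Measuring the two kinds of recommendation diversity.
S = rand(6); S = (S + S') / 2;            % placeholder item-item similarity
rec_lists = {[1 2 3], [2 4 5], [1 5 6]};  % per-user recommendation lists
n_items = size(S, 1);
% (1) Individual diversity: 1 minus the mean pairwise similarity of one list.
list = rec_lists{1};
pairs = nchoosek(list, 2);
intra_sim = mean(S(sub2ind(size(S), pairs(:,1), pairs(:,2))));
individual_diversity = 1 - intra_sim
% (2) System diversity (coverage): fraction of the catalog ever recommended.
coverage = numel(unique([rec_lists{:}])) / n_items
```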
- Users' adaptability to the recommendation algorithm
Most of us weigh which algorithm is better from the perspective of the recommendation engine, but we should also consider the end users of the engine: how well do the application's users fit the recommendation algorithm?
User CF rests on the assumption that a user will like the things that users with the same preferences like. If a user has no users with the same preferences around them, the User CF algorithm will work very poorly for them; so a user's fit for the User CF algorithm is proportional to how many users share common preferences with them.
Item CF also has a basic assumption: a user will like things similar to what they liked before. We can therefore compute the self-similarity of the set of items a user likes. A large self-similarity means that the things this user likes resemble one another, so the user fits the basic assumption of Item CF well, and their fit for Item CF is naturally better. Conversely, a small self-similarity means the user's preferences do not match the basic assumption of Item CF, and the chance that Item CF produces good recommendations for such a user is very low.
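The self-similarity test suggested above takes only a few lines. The similarity matrix and the "liked" set below are made-up placeholders:

```matlab
% How well does a user fit the Item CF assumption? Compute the mean
% pairwise similarity among the items the user likes.
S = [1.0 0.8 0.2;                 % made-up item-item similarity matrix
     0.8 1.0 0.1;
     0.2 0.1 1.0];
liked = [1 2];                    % items this user likes
pairs = nchoosek(liked, 2);
self_sim = mean(S(sub2ind(size(S), pairs(:,1), pairs(:,2))))
% Large self_sim: the favorites resemble one another, so Item CF should
% suit this user; small self_sim: Item CF is unlikely to recommend well.
```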
3 A KNN-based collaborative filtering recommendation algorithm in MATLAB
The neighborhood model is often called the k-nearest-neighbor model, or KNN for short. KNN models can produce accurate recommendations and give reasonable explanations for them; they were the earliest models used in CF recommender systems and remain the most popular kind to this day.
PS: The formulas and figures below are taken from the blogger's own CSDN blog.
To predict a user's rating for an item, a KNN model generally proceeds in the following three steps:
1. Compute similarities
This step computes the similarity between each pair of items. Widely used similarity measures include:
Pearson correlation:

s_{mn} = \frac{\sum_{u \in P_{mn}} (r_{um} - \bar{r}_m)(r_{un} - \bar{r}_n)}{\sqrt{\sum_{u \in P_{mn}} (r_{um} - \bar{r}_m)^2} \, \sqrt{\sum_{u \in P_{mn}} (r_{un} - \bar{r}_n)^2}}
where \bar{r}_m and \bar{r}_n are the mean ratings of movies m and n, and P_{mn} is the set of users who rated both movie m and movie n, i.e. P_{mn} = P_m \cap P_n.

Cosine:

s_{mn} = \frac{\sum_{u \in P_{mn}} (r_{um} - \bar{r}_u)(r_{un} - \bar{r}_u)}{\sqrt{\sum_{u \in P_{mn}} (r_{um} - \bar{r}_u)^2} \, \sqrt{\sum_{u \in P_{mn}} (r_{un} - \bar{r}_u)^2}}

where \bar{r}_u is the mean of user u's ratings.

2. Select neighbors
To predict user u's rating for movie m, we first select from P_u (the movies user u has rated) a fixed number of movies with the highest similarity to movie m; these form the neighborhood of the pair (u, m), written N(m; u).

3. Generate the prediction
User u's rating for movie m is predicted as a weighted average over the ratings of the neighbors N(m; u) obtained in the previous step:

\hat{r}_{um} = b_{um} + \frac{\sum_{n \in N(m;u)} s_{mn} (r_{un} - b_{un})}{\sum_{n \in N(m;u)} s_{mn}}

where b_{un} is user u's baseline predicted rating for movie n. The baseline model here can be any model that produces a predicted rating.

Following this process, a MATLAB simulation yields RMSE = 1.0776 with 10 neighbors.

Figure: the influence of the number of neighbors (0 to 20) on the RMSE.

```matlab
%% Load the training data
load('g:\matlab\collaborative filtering recommendation\dataset\movielens\u1.base');
%% Data preprocessing
% Keep the first three columns: user id, movie id, rating.
[m, n] = size(u1);
test = zeros(m, 3);
for i = 1:3
    test(:, i) = u1(:, i);
end
%% Build the rating matrix
number_user   = max(test(:, 1));
number_movies = max(test(:, 2));
score_matrix  = zeros(number_user, number_movies);   % rating matrix, 943 x 1682
for i = 1:m
    score_matrix(test(i,1), test(i,2)) = test(i,3);
end
%% Build the similarity matrix
sim_matrix = zeros(number_movies, number_movies);    % similarity matrix, 1682 x 1682
tic;
for i = 1:number_movies-1
    for j = i+1:number_movies
        sim_matrix(i,j) = similarity_ab(score_matrix, i, j);
    end
end
toc;
%% Neighbor selection
neibor_num = 10;                          % neighborhood size
sim_matrix = sim_matrix' + sim_matrix;    % complete the symmetric matrix
% Find all entries whose similarity is (numerically) 1; they may be
% erroneous values and should not be considered when picking neighbors.
% Why not find(sim_matrix == 1)? Floating-point error would make some
% of the 1s impossible to find exactly; you can try it.
value_1_index = find(sim_matrix >= 0.9999);
sim_matrix(value_1_index) = 0;            % replace all similarity-1 values by 0
% Commented-out exploratory code from the original post:
% [neibor_sim_matrix_temp, neibor_matrix_temp] = sort(sim_matrix, 2, 'descend');
% neibor_sim_matrix = zeros(number_movies, neibor_num);
% for i = 1:neibor_num
%     neibor_sim_matrix(:, i) = neibor_sim_matrix_temp(:, i); % each neighbor's similarity
%     neibor_matrix(:, i)     = neibor_matrix_temp(:, i);     % the neighbors themselves
% end
%% Load the test set
load('g:\matlab\collaborative filtering recommendation\dataset\movielens\u1.test');
%% Make predictions
[m, n] = size(u1);
test = zeros(m, 3);
for i = 1:3
    test(:, i) = u1(:, i);
end
predict_score = zeros(m, 1);
for j = 1:m
    p_u = find(score_matrix(test(j,1), :) ~= 0);  % movies this user has rated
    [~, num] = size(p_u);                         % how many movies the user rated
    % ---- neighbor selection ----
    neibor_num = 10;
    p_u_sim = sim_matrix(test(j,2), p_u);
    [temp, index] = sort(p_u_sim, 2, 'descend');
    [~, num1] = size(index);
    if num1 >= neibor_num
        neibor = p_u(index(1:neibor_num));
    else
        neibor = p_u(index);
        neibor_num = num1;
    end
    sum_score  = sum(score_matrix(test(j,1), :), 2); % user's total rating over all movies
    aver_score = sum_score / num;                    % user's average rating
    sum1 = 0;
    sum2 = 0;
    for i = 1:neibor_num
        sum1 = sum1 + sim_matrix(test(j,2), neibor(i)) * (score_matrix(test(j,1), neibor(i)) - aver_score);
        sum2 = sum2 + sim_matrix(test(j,2), neibor(i));
    end
    if sum2 == 0
        predict_score(j,1) = round(aver_score);      % avoid a zero denominator
    else
        predict_score(j,1) = round(aver_score + sum1/sum2);
    end
    % Clamp predictions to the valid rating range [1, 5].
    if predict_score(j,1) > 5
        predict_score(j,1) = 5;
    elseif predict_score(j,1) < 1
        predict_score(j,1) = 1;
    end
end
%% Compute the RMSE
Eval = zeros(m, 3);
Eval(:,1) = test(:,3);
Eval(:,2) = predict_score(:,1);
Eval(:,3) = abs(test(:,3) - predict_score(:,1));
rmse = sqrt(Eval(:,3)' * Eval(:,3) / m);
```
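The helper similarity_ab called when building the similarity matrix is not included in the original post. Below is a minimal reconstruction consistent with the Pearson formula of step 1; the name and signature come from the call site, and the body is an assumption:

```matlab
function sim = similarity_ab(score_matrix, a, b)
% Pearson similarity between movies (columns) a and b of the rating
% matrix, computed over the users who rated both. Reconstructed; the
% original post does not show this function. Save as similarity_ab.m.
    both = find(score_matrix(:, a) ~= 0 & score_matrix(:, b) ~= 0);
    if numel(both) < 2
        sim = 0;                   % not enough common raters
        return;
    end
    ra = score_matrix(both, a) - mean(score_matrix(both, a));
    rb = score_matrix(both, b) - mean(score_matrix(both, b));
    denom = sqrt(sum(ra.^2)) * sqrt(sum(rb.^2));
    if denom == 0
        sim = 0;                   % constant ratings: correlation undefined
    else
        sim = (ra' * rb) / denom;
    end
end
```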
The disadvantages of collaborative filtering are: (1) users' ratings of items are very sparse, so similarities between users computed from ratings may be inaccurate (the sparsity problem); (2) as users and items grow in number, the system's performance degrades (the scalability problem); (3) an item that no user has ever rated cannot be recommended (the first-rater, or cold-start, problem).

4 Summary
One of the core ideas of Web 2.0 is "collective intelligence". Recommendation strategies based on collaborative filtering build on mass behavior to provide each user with personalized recommendations, helping users find the information they need faster and more accurately. Looking at the applications, today's more successful recommendation engines, such as Amazon's and Douban's, use collaborative filtering without needing rigorous modeling of items or users and without requiring item descriptions to be machine-understandable; it is a domain-independent recommendation method. At the same time, the recommendations this method computes are open, sharing other people's experience and supporting users well in discovering latent interests. Recommendation strategies based on collaborative filtering also come in different branches with different practical scenarios and recommendation effects; users can choose the appropriate variant, or a combination of variants, according to the actual situation of their own application to get better recommendations.