Transferred from: http://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy2/index.html
The first article in this series gave an overview of recommendation engines. The following articles introduce the algorithms behind them in depth and help readers implement them efficiently. Among today's recommendation techniques and algorithms, the most widely recognized and adopted is recommendation based on collaborative filtering. Its simple model, low data dependency, easy data collection, and good recommendation quality have made it the "best" recommendation algorithm in the public eye. This article takes a close look at how collaborative filtering works and presents an efficient implementation of a collaborative filtering recommendation algorithm based on Apache Mahout. Apache Mahout is a relatively new ASF open source project that grew out of Lucene and is built on top of Hadoop; it focuses on efficient implementations of classic machine learning algorithms over massive data sets.
Collective intelligence and collaborative filtering
What is collective intelligence
Collective intelligence is not unique to the Web 2.0 era, but in the Web 2.0 era people use it in Web applications to build more interesting applications or deliver a better user experience. Collective intelligence means gathering answers from the behavior and data of a large number of people to reach statistical conclusions about the whole population that cannot be drawn from any single individual. Such conclusions are usually a trend or something the population has in common.
Wikipedia and Google are two typical Web 2.0 applications that use collective intelligence:
- Wikipedia: an encyclopedia for knowledge management. Unlike traditional encyclopedias edited by domain experts, Wikipedia lets end users contribute knowledge, and as the number of participants has grown it has become an unmatched, comprehensive knowledge base covering every field. Some people question its authority, but looking at the problem from another angle may resolve the doubt. When a book is published, its author may be authoritative, yet some errors are inevitable, and the content only improves through one revised edition after another. On Wikipedia, such revisions and corrections are something everyone can make: anyone who finds an error or a gap can contribute, and even if some information is wrong, others will correct it quickly. Seen from a macro perspective, the whole system keeps improving in a virtuous circle, and that is the appeal of collective intelligence.
- Google: currently the most popular search engine. Unlike Wikipedia, it does not require users to contribute explicitly. Consider the core idea behind Google's PageRank, though: it exploits the relationships between Web pages, using the number of other pages that link to a page as a measure of that page's importance. If that is hard to picture, think of it as an election in which every Web page is both a voter and a candidate, and PageRank arrives at a relatively stable score after a number of iterations. In effect, Google uses the collective intelligence embodied in the links on all Web pages on the Internet to find out which pages are important.
What is collaborative filtering
Collaborative filtering is a typical way of using collective intelligence. To understand what collaborative filtering (CF) is, start with a simple question: if you want to watch a movie but don't know which one, what do you do? Most people ask the friends around them what good movies they can recommend, and we generally prefer recommendations from friends whose tastes are similar to ours. This is the core idea of collaborative filtering.
Collaborative filtering generally finds, among a large number of users, the small fraction whose tastes are similar to yours. In collaborative filtering these users are called neighbors. The other things they like are then organized into a ranked list and recommended to you. This raises two core questions:
- How do you determine if a user has similar tastes to you?
- How do you organize your neighbors' preferences into a ranked list?
Compared with collective intelligence in general, collaborative filtering preserves some individual characteristics, namely your own taste preferences, so it is better suited as the idea behind personalized recommendation algorithms. As you can imagine, this strategy matters in the long tail of Web 2.0: recommending mass-market hits to people in the long tail is unlikely to work well. This brings us back to one of the core problems of a recommender system: know your users, and then give them better recommendations.
The core of collaborative filtering in depth
The previous section introduced the basic ideas of collective intelligence and collaborative filtering as background. This section analyzes how collaborative filtering works and introduces the recommendation mechanisms built on it, along with their advantages, disadvantages, and practical scenarios.
Implementing collaborative filtering takes the following steps:
- Collect user preferences
- Find similar users or items
- Compute recommendations
Collect User Preferences
To discover patterns in users' behavior and preferences and make recommendations based on them, the system first has to collect preference information, and how it does so is the most fundamental factor in the quality of its recommendations. Users can provide their preferences to the system in many ways, and these may vary widely between applications. Table 1 lists some examples:
Table 1. User behavior and user preferences
User behavior | Type | Characteristics | Role
Rating | Explicit | Integer-quantized preference in the range [0, n]; n is typically 5 or 10 | The user's rating of an item gives the preference precisely.
Vote | Explicit | Boolean preference, with a value of 0 or 1 | The user's vote on an item gives the preference fairly precisely.
Forward | Explicit | Boolean preference, with a value of 0 or 1 | Forwarding an item gives the user's preference precisely. If the forward happens within the site, the recipient's preference can also be inferred (imprecisely).
Save bookmark | Explicit | Boolean preference, with a value of 0 or 1 | Bookmarking an item gives the user's preference precisely.
Tag | Explicit | A few words that must be analyzed to derive a preference | Analyzing a user's tags reveals how the user understands the item and can also reveal the user's sentiment: like or dislike.
Comment | Explicit | A piece of text that must be analyzed to derive a preference | Analyzing a user's comments reveals the user's sentiment: like or dislike.
Click stream (views) | Implicit | A set of user clicks on items of interest that must be analyzed to derive a preference | Clicks reflect the user's attention to some extent, and so also reflect preference to some extent.
Page dwell time | Implicit | A set of timing data; noisy, so it must be denoised and analyzed to derive a preference | Dwell time reflects attention and preference to some extent, but the noise is too high for it to be used easily.
Purchase | Implicit | Boolean preference, with a value of 0 or 1 | A purchase clearly indicates that the user is interested in the item.
The user behaviors listed above are fairly general. Designers of a recommendation engine can add behaviors specific to their own application and use them to express users' preferences for items.
Most applications capture more than one kind of user behavior. There are basically two ways to combine them:
- Group the behaviors: typically into "view", "buy", and so on, and then compute separate user/item similarities for each group. This is what Dangdang and Amazon do with "Customers who bought this book also bought ..." and "Customers who viewed this book also viewed ...".
- Weight the behaviors according to how strongly each reflects user preference, and combine them into an overall preference of each user for each item. In general, explicit feedback gets a larger weight than implicit feedback but is sparser, because only a minority of users give explicit feedback. Likewise, a purchase reflects preference more strongly than a view, although this varies by application.
After collecting user behavior data, we also need to preprocess it. The core steps are noise reduction and normalization.
- Noise reduction: user behavior data is generated while users use the application, so it may contain a lot of noise and accidental actions. Classic data mining algorithms can filter this noise out of the behavior data, which makes the analysis more accurate.
- Normalization: as mentioned above, computing a user's preference for an item may require weighting data from different behaviors. The value ranges of different behaviors can differ greatly, though; for example, view counts are bound to be much larger than purchase counts. To make the weighted sum a more accurate overall preference, the data for each behavior must be brought into the same range, which is what normalization does. The simplest normalization divides every value of a given kind by the maximum value of that kind, so that the normalized data falls in the range [0, 1].
After preprocessing, the behaviors are grouped or weighted as the application's analysis requires. The result is a two-dimensional user preference matrix: one dimension is the list of users, the other is the list of items, and each value is a user's preference for an item, generally a floating-point value in [0, 1] or [-1, 1].
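The max-based normalization and the weighted combination described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the behavior names, the raw counts, and the weights 0.3 and 0.7 are made up, and the function names `normalize_by_max` and `build_preferences` are hypothetical, not part of any library.

```python
from collections import defaultdict


def normalize_by_max(values):
    """Scale every value into [0, 1] by dividing by the largest value of its kind."""
    peak = max(values.values(), default=0)
    if peak <= 0:
        return {key: 0.0 for key in values}
    return {key: value / peak for key, value in values.items()}


def build_preferences(behaviors, weights):
    """Combine per-behavior data into one {user: {item: preference}} matrix.

    behaviors: {behavior_name: {(user, item): raw_value}}
    weights:   {behavior_name: weight}; assumed here to sum to 1 so that the
               combined preference stays in [0, 1].
    """
    prefs = defaultdict(dict)
    for name, raw in behaviors.items():
        for (user, item), value in normalize_by_max(raw).items():
            prefs[user][item] = prefs[user].get(item, 0.0) + weights[name] * value
    return dict(prefs)


# Hypothetical data: view counts are much larger than purchase flags.
behaviors = {
    "view": {("alice", "book"): 8, ("alice", "pen"): 2, ("bob", "book"): 4},
    "buy": {("alice", "book"): 1, ("bob", "pen"): 1},
}
weights = {"view": 0.3, "buy": 0.7}  # purchases weighted above views (assumption)
prefs = build_preferences(behaviors, weights)
```

Each behavior is normalized separately before weighting, so a large view count cannot drown out a purchase. The nested dictionary is a sparse form of the user-item matrix: missing entries mean that no preference was observed.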
Find similar users or items
Once user preferences have been derived from user behavior, we can compute similar users and similar items from those preferences and then recommend based on them. These are the two classic branches of CF: user-based CF and item-based CF. Both need to compute similarity, so let's first look at the most basic ways of doing that.
Computing similarity
The basic existing methods for computing similarity are all vector based: they compute the distance between two vectors, and the closer the vectors, the greater the similarity. In a recommendation scenario with a two-dimensional user-item preference matrix, we can treat one user's preferences for all items as a vector and compute the similarity between users, or treat all users' preferences for one item as a vector and compute the similarity between items. Several commonly used similarity measures are described below:
- Euclidean distance
Euclidean distance was originally used to compute the distance between two points in Euclidean space. If x and y are two points in n-dimensional space, the Euclidean distance between them is:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
As you can see, when n = 2 the Euclidean distance is simply the distance between two points in the plane.
When Euclidean distance is used to express similarity, it is generally converted with the following formula, so that the smaller the distance, the greater the similarity:
$$sim(x, y) = \frac{1}{1 + d(x, y)}$$
- Pearson correlation coefficient
The Pearson correlation coefficient is generally used to measure how closely two interval-scale variables are linearly related. Its value lies in [-1, +1].
$$p(x, y) = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{(n - 1)\, s_x s_y}$$
Here $\bar{x}$ and $\bar{y}$ are the sample means, and $s_x$ and $s_y$ are the sample standard deviations of x and y.
- Cosine similarity
Cosine similarity is widely used to compute the similarity of document data:
$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_{i} x_i y_i}{\sqrt{\sum_{i} x_i^2} \, \sqrt{\sum_{i} y_i^2}}$$
- Tanimoto coefficient
The Tanimoto coefficient, also known as the Jaccard coefficient, is an extension of cosine similarity and is also often used to compute the similarity of document data:
$$T(x, y) = \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}$$
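The four similarity measures above can be written directly from their definitions. The following is a minimal Python sketch, assuming both vectors are dense lists of equal length. The function names are illustrative, and returning 0.0 for degenerate inputs (a constant or all-zero vector) is a convention chosen here, not something the measures themselves prescribe.

```python
import math


def euclidean_similarity(x, y):
    """Convert Euclidean distance d into a similarity 1 / (1 + d)."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + distance)


def pearson_correlation(x, y):
    """Pearson correlation in [-1, 1]; 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return covariance / math.sqrt(spread_x * spread_y)


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def tanimoto_coefficient(x, y):
    """Tanimoto coefficient: x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    denominator = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denominator if denominator else 0.0
```

For 0/1 vectors the Tanimoto coefficient equals the size of the intersection divided by the size of the union, which is why it is also called the Jaccard coefficient.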
Computing similar neighbors
Having covered how to compute similarity, let's look at how to find user or item neighbors based on it. The common principles for selecting neighbors fall into two categories. Figure 1 illustrates them with a set of points in a two-dimensional plane.
- Fixed number of neighbors: K-neighborhoods or fixed-size neighborhoods
Regardless of how near or far the neighbors are, only the nearest K are taken. In Figure 1 (A), suppose we want the 5 nearest neighbors of point 1. Based on the distances between points, we take the 5 nearest points: point 2, point 3, point 4, point 7, and point 5. Clearly this method handles isolated points poorly: because a fixed number of neighbors must be taken, when there are not enough similar points nearby it is forced to accept some dissimilar points as neighbors, which lowers the overall similarity of the neighborhood. In Figure 1, for example, point 1 and point 5 are not very similar.
- Neighbors based on a similarity threshold: threshold-based neighborhoods
Instead of fixing the number of neighbors, this approach caps how far away a neighbor may be: every point that falls within the region of radius K centered on the current point is a neighbor. The number of neighbors is not fixed, but their similarity never falls far short. In Figure 1 (B), starting from point 1 and taking the neighbors within distance K, we get point 2, point 3, point 4, and point 7. The neighbors found this way are more similar than those found by the previous method, particularly around isolated points.
Figure 1: Similar neighbor calculations
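The two neighbor-selection rules can be sketched as follows. The similarity values for point 1 are invented to mirror the situation the text describes for Figure 1, and the function names are hypothetical. The threshold rule is expressed here as a minimum similarity, which is equivalent to a maximum distance.

```python
def k_nearest_neighbors(similarities, k):
    """Fixed-size neighborhood: the k most similar candidates, however dissimilar."""
    ranked = sorted(similarities.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def threshold_neighbors(similarities, min_similarity):
    """Threshold neighborhood: every candidate at least min_similarity similar."""
    ranked = sorted(similarities.items(), key=lambda pair: pair[1], reverse=True)
    return [(name, sim) for name, sim in ranked if sim >= min_similarity]


# Made-up similarities between point 1 and the other points.
similarity_to_point_1 = {
    "point 2": 0.90, "point 3": 0.85, "point 4": 0.80,
    "point 5": 0.30, "point 6": 0.10, "point 7": 0.75,
}

fixed = k_nearest_neighbors(similarity_to_point_1, k=5)
by_threshold = threshold_neighbors(similarity_to_point_1, min_similarity=0.7)
```

With these numbers the fixed-size rule is forced to include point 5 even though its similarity is only 0.30, while the threshold rule stops at point 7 and returns four neighbors.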
Computing recommendations
With neighboring users and neighboring items computed, this section describes how to make recommendations for users based on that information. The overview article earlier in this series briefly explained that collaborative filtering recommendation algorithms divide into user-based CF and item-based CF. Below we look more deeply at how each is computed, where it is used, and its advantages and disadvantages.
User-based CF (User CF)
The basic idea of user-based CF is quite simple: use users' preferences for items to find neighboring users, then recommend what the neighbors like to the current user. Computationally, each user's preferences for all items form a vector that is used to compute the similarity between users. Once the K neighbors are found, the current user's preference for items he has not yet rated is predicted from the neighbors' similarity weights and their preferences for those items, and the resulting ranked list of items is the recommendation. Figure 2 shows an example: for user A, the historical preferences yield only one neighbor, user C, so item D, which user C likes, is recommended to user A.
Figure 2: Basic principle of user-based CF
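The user-based procedure can be sketched in Python as below. This is a simplified illustration, not the Mahout implementation: it assumes cosine similarity over sparse preference dictionaries, predicts each unseen item as the similarity-weighted average of the neighbors' preferences, and uses made-up 0/1 preferences chosen to be consistent with the Figure 2 description (user C is user A's only neighbor and likes item D).

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse preference dicts (missing means 0)."""
    dot = sum(u[i] * v[i] for i in u if i in v)
    norm = math.sqrt(sum(a * a for a in u.values())) * math.sqrt(sum(b * b for b in v.values()))
    return dot / norm if norm else 0.0


def recommend_user_based(prefs, target, k=1):
    """Rank items the target has not rated, using the k most similar users."""
    neighbors = sorted(
        ((cosine(prefs[target], prefs[user]), user) for user in prefs if user != target),
        reverse=True,
    )[:k]
    totals, weight_sums = {}, {}
    for similarity, user in neighbors:
        if similarity <= 0:
            continue  # ignore users with nothing in common
        for item, pref in prefs[user].items():
            if item in prefs[target]:
                continue  # only predict items the target has no preference for
            totals[item] = totals.get(item, 0.0) + similarity * pref
            weight_sums[item] = weight_sums.get(item, 0.0) + similarity
    scores = {item: totals[item] / weight_sums[item] for item in totals}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Assumed preferences, consistent with the Figure 2 description.
prefs = {
    "user_a": {"item_a": 1.0, "item_c": 1.0},
    "user_b": {"item_b": 1.0},
    "user_c": {"item_a": 1.0, "item_c": 1.0, "item_d": 1.0},
}
recommendations = recommend_user_based(prefs, "user_a", k=1)
```

Here user B shares no items with user A, so the single neighbor is user C, and the only item recommended to user A is item D.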
Item-based CF (Item CF)
The principle of item-based CF is similar to that of user-based CF, except that neighbors are computed between items rather than between users. Users' preferences for items are used to find similar items, and similar items are then recommended to a user based on his historical preferences. Computationally, all users' preferences for an item form a vector that is used to compute the similarity between items. Once an item's similar items are known, the current user's preference for items he has not yet rated is predicted from his historical preferences, and the resulting ranked list of items is the recommendation. Figure 3 shows an example: according to all users' historical preferences, users who like item A also like item C, so item A and item C are similar. User C likes item A, so we can infer that user C may also like item C.
Figure 3: Basic principle of item-based CF
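A matching sketch for the item-based variant follows. Again this is an illustration under assumptions rather than a reference implementation: items are compared by cosine similarity over their user vectors, each candidate is scored by a similarity-weighted sum over the items the user already likes (one of several possible scoring rules), and the 0/1 preferences are invented to fit the Figure 3 description.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[i] * v[i] for i in u if i in v)
    norm = math.sqrt(sum(a * a for a in u.values())) * math.sqrt(sum(b * b for b in v.values()))
    return dot / norm if norm else 0.0


def item_vectors(prefs):
    """Transpose {user: {item: pref}} into {item: {user: pref}}."""
    vectors = {}
    for user, items in prefs.items():
        for item, pref in items.items():
            vectors.setdefault(item, {})[user] = pref
    return vectors


def recommend_item_based(prefs, target):
    """Rank unseen items by their similarity to the items the target already likes."""
    vectors = item_vectors(prefs)
    liked = prefs[target]
    scores = {}
    for candidate, candidate_vector in vectors.items():
        if candidate in liked:
            continue
        score = sum(cosine(candidate_vector, vectors[item]) * pref
                    for item, pref in liked.items())
        if score > 0:
            scores[candidate] = score
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Assumed preferences, consistent with the Figure 3 description.
prefs = {
    "user_a": {"item_a": 1.0, "item_c": 1.0},
    "user_b": {"item_a": 1.0, "item_b": 1.0, "item_c": 1.0},
    "user_c": {"item_a": 1.0},
}
recommendations = recommend_item_based(prefs, "user_c")
```

Item C is liked by two of the three users who like item A, while item B is liked by only one, so item C ranks first for user C.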
User CF vs. Item CF
Having covered the basic principles of User CF and Item CF, let's compare their strengths, weaknesses, and suitable scenarios from several angles:
- Computational complexity
Item CF and User CF are the two most basic collaborative filtering recommendation algorithms. User CF was proposed long ago, while Item CF became popular after Amazon published its paper and patent (around 2001). Item CF is widely thought to beat User CF in both performance and complexity. One main reason is that an online site usually has far more users than items, and item data is relatively stable, so computing item similarity takes less computation and does not need frequent updates. But people tend to forget that this holds only for e-commerce sites that sell goods. For recommendation systems for news, blogs, or micro-content, the situation is often the opposite: the number of items is huge and the items change frequently. So from the standpoint of complexity alone, each algorithm has the advantage in different systems, and the designer of a recommendation engine needs to choose the one that better fits the application.
- Applicable scenarios
On non-social-network sites, the intrinsic relationships between pieces of content are an important basis for recommendation, more effective than recommendation based on similar users. On a book-selling site, for example, when you look at a book the recommendation engine recommends related books, and these recommendations matter far more than the general recommendations on the user's home page. In this situation Item CF recommendations become an important means of guiding users as they browse. Item CF recommendations are also easy to explain. If a non-social-network site recommends a book and explains that someone with similar interests has read it, the user is hard to convince, because he probably doesn't know that person at all. But if the explanation is that the book is similar to one he has read before, the user may find it reasonable and accept the recommendation.
By contrast, on today's popular social networking sites User CF is the better choice: User CF combined with social network information makes users more willing to trust the explanation of a recommendation.
- Recommendation diversity and accuracy
Researchers studying recommendation engines have run User CF and Item CF on the same data set and found that only 50% of the items in the recommendation lists are the same, while the other 50% are completely different. Yet the two algorithms have similar precision, so they can be said to complement each other well.
There are two ways to measure the diversity of recommendations:
The first measures diversity from the perspective of a single user: given a user, are the recommendations the system gives him varied? That is, compare the pairwise similarity between the items in the recommendation list. By this measure Item CF is clearly less diverse than User CF, because what Item CF recommends is whatever is most similar to what the user has seen before.
The second considers the diversity of the system as a whole, also called coverage: whether a recommender system can offer a rich range of choices across all users. By this measure Item CF is far more diverse than User CF, because User CF always tends to recommend popular items. Put another way, Item CF recommendations have good novelty and are good at recommending long-tail items. So although Item CF is slightly less accurate than User CF in most cases, Item CF is much better than User CF once diversity is taken into account.
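Both diversity measures described above are easy to compute once an item similarity function is available. The sketch below is one plausible formulation, not a standard taken from the source: per-user diversity is taken as 1 minus the average pairwise similarity of the recommended items, and coverage as the share of the catalog that appears in at least one user's list. The toy `same_domain` similarity and all item names are assumptions.

```python
from itertools import combinations


def intra_list_diversity(items, similarity):
    """1 minus the mean pairwise similarity of one user's recommendation list."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(similarity(a, b) for a, b in pairs) / len(pairs)


def coverage(recommendations, catalog):
    """Fraction of catalog items recommended to at least one user."""
    recommended = set()
    for items in recommendations.values():
        recommended.update(items)
    return len(recommended & set(catalog)) / len(catalog)


def same_domain(a, b):
    """Toy similarity: items are similar only if their names share a first letter."""
    return 1.0 if a[0] == b[0] else 0.0


catalog = ["a1", "a2", "b1", "c1"]
recommendations = {"u1": ["a1", "a2"], "u2": ["a1", "b1"]}

diversity_mixed = intra_list_diversity(["a1", "a2", "b1"], same_domain)
diversity_single_domain = intra_list_diversity(["a1", "a2"], same_domain)
system_coverage = coverage(recommendations, catalog)
```

A list drawn from a single domain scores 0 on the first measure, while the system as a whole can still reach high coverage if different users receive lists from different domains.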
If the diversity of recommendations is still unclear, another example shows how User CF and Item CF differ. Assume that every user has broad interests and likes things from several domains, but also has one main domain that he follows more closely than the others. Take a user who likes three domains A, B, and C, where A is his main domain, and consider what User CF and Item CF tend to recommend. User CF recommends the more popular things in all three domains A, B, and C. Item CF recommends almost exclusively things from domain A. So because User CF recommends only popular items, it is weak at recommending long-tail items. Item CF recommends only domain A, so the user's limited recommendation list may contain a certain number of unpopular long-tail items, but the recommendations for this one user clearly lack diversity. For the system as a whole, however, different users have different main interests, so system coverage is better.
The analysis above makes it clear that both kinds of recommendation are reasonable, but neither is the best choice, so both lose some accuracy. The best choice for such a system, if it recommends 30 items to the user, is neither to pick the 10 most popular from each domain nor to recommend 30 items from domain A, but, for example, to recommend 15 items from domain A and choose the remaining 15 from B and C. Combining User CF and Item CF is therefore the best choice. The basic principle: when Item CF leaves an individual user's recommendations insufficiently diverse, add User CF to increase individual diversity and thereby improve accuracy; when User CF leaves the system's overall diversity insufficient, add Item CF to increase overall diversity, which likewise improves accuracy.
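One very simple way to combine the two lists, in the spirit of the "15 from domain A, 15 from B and C" example, is to reserve a share of the slots for Item CF and fill the rest from User CF. The sketch below is only one possible blending rule; the function name `blend`, the 50% share, and the item names are assumptions, and real systems may instead combine scores.

```python
def blend(item_cf_ranked, user_cf_ranked, total, item_cf_share=0.5):
    """Take a share of the slots from Item CF, then fill up from User CF."""
    quota = int(total * item_cf_share)
    result = list(item_cf_ranked[:quota])
    # Fill the remaining slots from User CF, skipping duplicates.
    for item in user_cf_ranked:
        if len(result) >= total:
            break
        if item not in result:
            result.append(item)
    # If User CF ran out, top up with the rest of the Item CF list.
    for item in item_cf_ranked[quota:]:
        if len(result) >= total:
            break
        if item not in result:
            result.append(item)
    return result


# Hypothetical ranked lists: Item CF stays in domain A, User CF spans A, B, and C.
item_cf_ranked = ["a1", "a2", "a3", "a4"]
user_cf_ranked = ["a1", "b1", "c1", "b2"]
mixed = blend(item_cf_ranked, user_cf_ranked, total=4)
```

With these inputs the blended list keeps the two strongest domain A items from Item CF and adds one item each from domains B and C.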
- User fitness for the recommendation algorithm
We mostly ask which algorithm is better from the recommendation engine's point of view, but we should really also consider how well the end users of the recommendation engine, the application's users, fit the recommendation algorithm.
User CF rests on the assumption that a user will like what users with the same preferences like. If a user has no peers with the same preferences, User CF works poorly for him, so a user's fitness for User CF is proportional to how many users share his preferences.
Item CF also rests on a basic assumption: a user will like things similar to what he liked before. So we can compute the self-similarity of the items a user likes. High self-similarity means the things the user likes are similar to each other, so he matches Item CF's basic assumption well and naturally fits Item CF well. Conversely, low self-similarity means the user's preferences do not match Item CF's basic assumption, and the chance of Item CF making good recommendations for such a user is low.
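The two fitness notions can be made concrete with a small sketch. The definitions below are assumptions chosen for illustration: fitness for User CF is counted as the number of other users whose liked-item sets overlap the target's by at least a Jaccard threshold, and fitness for Item CF is the mean pairwise similarity (the self-similarity) of the items the user likes. The threshold 0.5, the toy item similarity, and the data are all made up.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two item sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def user_cf_fitness(likes, target, min_similarity=0.5):
    """Number of other users who share enough of the target's preferences."""
    return sum(
        1 for user, items in likes.items()
        if user != target and jaccard(likes[target], items) >= min_similarity
    )


def item_cf_fitness(liked_items, item_similarity):
    """Self-similarity: mean pairwise similarity of the items a user likes."""
    pairs = list(combinations(liked_items, 2))
    if not pairs:
        return 0.0
    return sum(item_similarity(a, b) for a, b in pairs) / len(pairs)


def same_domain(a, b):
    """Toy item similarity: 1.0 if the names share a first letter, else 0.0."""
    return 1.0 if a[0] == b[0] else 0.0


likes = {
    "ann": {"a1", "a2", "a3"},
    "ben": {"a1", "a2"},
    "cy": {"b1"},
}
ann_user_fit = user_cf_fitness(likes, "ann")
focused_fit = item_cf_fitness(["a1", "a2", "a3"], same_domain)
scattered_fit = item_cf_fitness(["a1", "b1", "c1"], same_domain)
```

A user with many close peers is a good candidate for User CF, and a user whose liked items have high self-similarity is a good candidate for Item CF. An application could use such scores to decide which algorithm to favor for each user.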
With the introduction above, you should now have a solid understanding of the methods, principles, characteristics, and suitable scenarios of collaborative filtering recommendation. Next we move on to practice and focus on how to implement a collaborative filtering recommendation algorithm based on Apache Mahout.
Exploring the secrets inside the recommendation engine, Part 2: An in-depth look at recommendation engine algorithms, collaborative filtering