Recommendation systems are widely used on all kinds of websites: product recommendations on e-commerce sites, article recommendations on blogs, and the many applications that help people discover music and movies. But how do you build a recommendation system for a site from scratch? I searched for an answer for a long time without finding a satisfactory one. During that time I also joined a recommendation-system mailing list where many experts gather, including product and technology people from Dangdang, Joyo Amazon, Douban and other industry leaders in recommendation. Even so, after many days of immersion I still could not form a product framework or product roadmap with content recommendation as its ultimate goal. This continued until I bought Programming Collective Intelligence. I have now organized some of my reading notes on the book, hoping to work out an actionable recommendation-system product framework suited to content-oriented websites.
---------- Main text ----------
We know that the lowest-tech way to get recommendations is to ask friends. We also know that some of them have better taste than others, and we learn this over time by observing whether they usually like the same things we do. But as the number of choices grows, asking a small group of people becomes impractical, because they may not know all the options. That is why people developed a set of techniques called collaborative filtering. In practice, the leading recommendation systems we have come across, including Netflix, Douban and Amazon, are all built on collaborative filtering. Collaborative filtering comes in several kinds: user-based, item-based, and model-based.
So what exactly is collaborative filtering, and what does it require the product designer to do? (To keep things simple, this article focuses on user-based collaborative filtering.)
A user-based collaborative filtering algorithm usually works by searching a large group of people to find a smaller group whose tastes are similar to yours. The algorithm looks at the other things these people like and combines them into a ranked list of recommendations. So as a product designer, you need to understand that your site has to do the following things in order:
1. Collecting preferences
Collecting preferences means finding a way to represent different people and what they like. For example, Douban asks users to rate each movie from 1 to 5 stars, which reflects how much each reviewer (yourself included) likes a given film. If you are designing a shopping site, you could use 1 to indicate that someone has bought an item in the past and 0 to indicate that they never have. For a site where people vote on news stories, you could use -1, 0 and 1 to represent "dislike", "no vote" and "like". Regardless of how preferences are expressed, what you have to do is give your users a way to express them and map what they express to numbers, so that you end up with a suitable data set.
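As a rough illustration of this mapping, here is a minimal Python sketch. The function names (`encode_votes`, `encode_purchases`), the users "alice" and "bob" and their data are all made up for the example. Only the two critics' star ratings are taken from the scores quoted later in this article.

```python
# Assumed label-to-number mapping for a news voting site.
VOTE_TO_NUMBER = {"dislike": -1, "no vote": 0, "like": 1}

def encode_votes(raw_votes):
    """Map {user: {story: vote_label}} to {user: {story: number}}."""
    return {
        user: {story: VOTE_TO_NUMBER[label] for story, label in votes.items()}
        for user, votes in raw_votes.items()
    }

def encode_purchases(purchases, catalog):
    """Map {user: set_of_bought_items} to {user: {item: 1 or 0}}."""
    return {
        user: {item: (1 if item in bought else 0) for item in catalog}
        for user, bought in purchases.items()
    }

# Star ratings are already numeric, so they can be stored directly
# as a nested dictionary: person -> film -> score.
ratings = {
    "Toby": {"Snakes": 4.5, "Dupree": 1.0},
    "LaSalle": {"Snakes": 4.0, "Dupree": 2.0},
}

# Hypothetical users and items, invented for this example.
votes = encode_votes({"alice": {"story-1": "like", "story-2": "dislike"}})
bought = encode_purchases({"bob": {"book"}}, catalog=["book", "lamp"])
print(votes)
print(bought)
```

Whatever the source of the preferences, the result has the same shape (person, item, number), which is what the later steps rely on.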
2. Finding similar users
Once we have a data set of people's preferences, we need a way to determine how similar people are in taste. To do this, we compare each person with everyone else and compute a similarity score. There are several ways to do so: Euclidean distance, the Pearson correlation coefficient, cosine similarity, adjusted cosine similarity, the Jaccard coefficient, Manhattan distance, and so on. Keep in mind that each similarity measure has its own strengths. Choose one, or combine several, according to the specific application.
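To give a feel for some of the measures named above, here is a small Python sketch of Manhattan distance, cosine similarity and the Jaccard coefficient. It is only one possible reading of each measure: I assume the distance and the cosine are computed over the items both users rated, and that Jaccard compares the sets of items each user rated at all. The users `a` and `b` and their scores are invented.

```python
from math import sqrt

def shared_items(a, b):
    """Items that appear in both rating dictionaries."""
    return [item for item in a if item in b]

def manhattan_distance(a, b):
    """Sum of absolute score differences over shared items (smaller = closer)."""
    return sum(abs(a[i] - b[i]) for i in shared_items(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between the two score vectors over shared items."""
    common = shared_items(a, b)
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def jaccard_coefficient(a, b):
    """Overlap of the sets of rated items: |intersection| / |union|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Hypothetical users and scores.
a = {"x": 1.0, "y": 2.0}
b = {"x": 2.0, "y": 4.0, "z": 5.0}
print(manhattan_distance(a, b))
print(cosine_similarity(a, b))
print(jaccard_coefficient(a, b))
```

Note that the first function is a distance while the other two are similarities, so they sort in opposite directions.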
Here are two simple examples:
Euclidean distance (Euclidean Distance Score): Take the items that people have rated in common as axes, plot each person on the chart, and examine how far apart they are. In the figure, the X and Y axes represent the films Dupree and Snakes respectively, and each critic is placed in this preference space (the first quadrant) according to the scores he or she gave the two films.
It is easy to see that Toby gave Snakes and Dupree scores of 4.5 and 1.0, while LaSalle gave them 4.0 and 2.0. With Euclidean distance, the more similar two people's preferences are, the shorter the distance between them in the preference space. Calculating that distance only takes middle-school geometry: take the difference on each coordinate, square each difference, add the squares, and take the square root of the sum. It is worth mentioning that this method also works when more than two items have been scored. You can therefore write a function that computes the similarity between two users, provided that the two have rated at least some items in common.
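The calculation just described can be sketched in a few lines of Python. The function names are my own, and the conversion of a distance into a similarity with 1 / (1 + distance) is an assumption (one common convention, so that identical tastes score 1.0), not something required by the method. The two critics' scores are the ones quoted above.

```python
from math import sqrt

def euclidean_distance(prefs, person1, person2):
    """Distance over the items both people have rated; None if no overlap."""
    common = [item for item in prefs[person1] if item in prefs[person2]]
    if not common:
        return None
    return sqrt(sum((prefs[person1][i] - prefs[person2][i]) ** 2 for i in common))

def euclidean_similarity(prefs, person1, person2):
    """Assumed conversion: 1 / (1 + distance); 0.0 when nothing is shared."""
    d = euclidean_distance(prefs, person1, person2)
    return 0.0 if d is None else 1.0 / (1.0 + d)

prefs = {
    "Toby": {"Snakes": 4.5, "Dupree": 1.0},
    "LaSalle": {"Snakes": 4.0, "Dupree": 2.0},
}

# Differences are 0.5 and 1.0, so the distance is sqrt(0.25 + 1.0) = sqrt(1.25).
print(euclidean_distance(prefs, "Toby", "LaSalle"))
print(euclidean_similarity(prefs, "Toby", "LaSalle"))
```

Because the sum runs over every shared item, the same function works unchanged when the two people have rated more than two films in common.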
Pearson correlation (Pearson Correlation Score): It judges similarity by how well two sets of data fit a straight line. It tends to give better results when the data is not well normalized, for example when some critics' scores consistently deviate from the average.
The following figure shows the scores Mick LaSalle and Gene Seymour gave to 5 films (unlike the chart above, the X and Y axes here correspond to the two critics). The dotted line is called the best-fit line, because it is drawn as close as possible to all the points on the chart. If the two critics had given identical scores to every film, the line would be a diagonal passing through every point.
The following figure shows an example with a higher correlation coefficient: Lisa Rose and Jack Matthews rate these films more similarly (the points lie closer to the best-fit line).
The Pearson method corrects for "grade inflation". In the figure above, although Jack tends to give higher scores than Lisa, the line still fits well because the two have relatively similar preferences. That is, if one person consistently gives higher scores than the other and the difference between them stays consistent, the two can still correlate well. The Euclidean distance score described earlier, by contrast, would conclude that the two are not close because one critic is consistently harsher than the other, even if their tastes are very similar. Whether this behavior is what we want depends on the specific application.
The Pearson correlation algorithm first finds the items that both reviewers have rated. It then computes the sum and the sum of squares of each reviewer's scores, as well as the sum of the products of their scores. Finally, the correlation coefficient is calculated from these quantities.
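For reference, one standard computational form of the Pearson coefficient is built from exactly these quantities. The symbols are my own labels and may differ from the book's notation: X_i and Y_i are the two reviewers' scores for the i-th film they both rated, and N is the number of such films.

```latex
r = \frac{\sum_i X_i Y_i - \dfrac{\sum_i X_i \, \sum_i Y_i}{N}}
         {\sqrt{\left(\sum_i X_i^2 - \dfrac{\left(\sum_i X_i\right)^2}{N}\right)
                \left(\sum_i Y_i^2 - \dfrac{\left(\sum_i Y_i\right)^2}{N}\right)}}
```

The result lies between -1 and 1. A value of 1 means the two reviewers' scores fall exactly on a rising straight line.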
PS: I can follow the formula, but I have not managed to understand its mathematical derivation, to my shame -_-
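The procedure described above (shared items, sums, sums of squares, sum of products) can be sketched in Python as follows. The function name and the choice to return 0.0 when there is no overlap or no variation are my own assumptions. The two critics "strict" and "generous" are invented to illustrate grade inflation: the second always scores exactly one point higher than the first.

```python
from math import sqrt

def pearson_similarity(prefs, person1, person2):
    """Pearson correlation over the items both people rated (range -1 to 1)."""
    common = [item for item in prefs[person1] if item in prefs[person2]]
    n = len(common)
    if n == 0:
        return 0.0  # assumed fallback: no shared items, no evidence
    x = [prefs[person1][i] for i in common]
    y = [prefs[person2][i] for i in common]

    sum_x, sum_y = sum(x), sum(y)                # sums of the scores
    sum_x2 = sum(v * v for v in x)               # sums of squares
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))    # sum of products

    numerator = sum_xy - sum_x * sum_y / n
    var_x = sum_x2 - sum_x ** 2 / n
    var_y = sum_y2 - sum_y ** 2 / n
    if var_x <= 0 or var_y <= 0:
        return 0.0  # assumed fallback: one person gave identical scores
    return numerator / sqrt(var_x * var_y)

# Hypothetical scores: "generous" is always one point above "strict".
prefs = {
    "strict": {"A": 2.0, "B": 3.0, "C": 4.0, "D": 2.5},
    "generous": {"A": 3.0, "B": 4.0, "C": 5.0, "D": 3.5},
    "opposite": {"A": 4.0, "B": 3.0, "C": 2.0, "D": 3.5},
}
print(pearson_similarity(prefs, "strict", "generous"))
print(pearson_similarity(prefs, "strict", "opposite"))
```

The first pair correlates perfectly despite the constant one-point gap, which a distance-based score would penalize. The second pair ranks the films in exactly the reverse order and comes out strongly negative.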
3. Ranking the critics
Once you understand the previous step, this one is simple. You just score everyone against a designated person and find the closest matches, which are called the nearest neighbors. Returning to the example above, our goal is to find critics whose taste is similar to yours, so all you have to do is take yourself as the reference point, compute everyone else's similarity to you, sort the results, and keep the top few. Suppose you are Toby: after this step you get a list of your nearest neighbors, which tells you that Lisa, Mick and Claudia are probably the 3 people whose taste is closest to yours.
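A minimal sketch of this ranking step in Python might look like the following. The names `similarity` and `top_matches` are my own, the similarity measure is an assumed 1 / (1 + Euclidean distance) that could be swapped for any other, and the people and scores are invented for the example.

```python
from math import sqrt

def similarity(a, b):
    """Assumed measure: 1 / (1 + Euclidean distance) over shared items."""
    common = [item for item in a if item in b]
    if not common:
        return 0.0
    return 1.0 / (1.0 + sqrt(sum((a[i] - b[i]) ** 2 for i in common)))

def top_matches(prefs, person, n=3, sim=similarity):
    """Return the n (score, name) pairs most similar to `person`, best first."""
    scores = [
        (sim(prefs[person], prefs[other]), other)
        for other in prefs
        if other != person  # do not compare the person with themselves
    ]
    scores.sort(reverse=True)
    return scores[:n]

# Hypothetical data.
prefs = {
    "me": {"f1": 4.5, "f2": 1.0},
    "near": {"f1": 4.0, "f2": 1.5},
    "mid": {"f1": 3.0, "f2": 3.0},
    "far": {"f1": 1.0, "f2": 5.0},
}
neighbors = top_matches(prefs, "me", n=2)
print(neighbors)
```

Passing the similarity function as a parameter keeps this step independent of which measure was chosen in the previous one.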
4. Recommending items
It is nice to find a critic whose taste matches yours, but our ultimate goal is a list of recommended films (the "ultimate goal" of content recommendation mentioned at the beginning). The simple way, of course, is to find the person whose taste is closest to yours and pick, from the films he likes, one that you have not yet seen. But this is rather arbitrary and crude, because that person may not have reviewed some films that we would actually like. We might also be recommended a film that this one person happens to love, even though the data shows that all the other critics rate it poorly.
To solve these problems, we score each film with a weighted rating across all the critics:
In the figure above, the Critic column lists the critics compared with Toby, and the Similarity column gives each critic's similarity coefficient with Toby. Night, Lady and Luck are the film names, and these columns hold the score each critic gave the film. The S.x columns give the similarity coefficient multiplied by the score. In this way, critics who are close to us contribute more to the overall score than critics who are not.
Some people will ask why we do not use the Total row directly and need Total/Sim.Sum instead. The reason is that a film reviewed by more people would otherwise have a larger total, so we divide by Sim.Sum, the sum of the similarities of all the critics who reviewed that film. For example, the Total for Night is 12.89 and 5 people rated it, while for Lady the Total is 8.38 with 4 raters. A larger Total therefore does not by itself mean that Night is better than Lady. The final score is Total/Sim.Sum.
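The weighted score can be sketched in Python as below. This is a simplified version under my own assumptions: the similarities to the target user are passed in precomputed (like the Similarity column), and the function name, critics, films and numbers are invented. The example is chosen so that the raw Total and Total/Sim.Sum disagree.

```python
def recommend(similarities, ratings, already_seen=()):
    """Rank unseen films by Total / Sim.Sum.

    similarities: {critic: similarity to the target user}
    ratings:      {critic: {film: score}}
    """
    totals = {}    # film -> sum of similarity * score   ("Total")
    sim_sums = {}  # film -> sum of similarities of its raters ("Sim.Sum")
    for critic, sim in similarities.items():
        if sim <= 0:
            continue  # ignore critics with no positive similarity
        for film, score in ratings.get(critic, {}).items():
            if film in already_seen:
                continue
            totals[film] = totals.get(film, 0.0) + sim * score
            sim_sums[film] = sim_sums.get(film, 0.0) + sim
    ranking = [(totals[film] / sim_sums[film], film) for film in totals]
    ranking.sort(reverse=True)
    return ranking, totals

# Hypothetical data: three equally similar critics.
similarities = {"c1": 0.5, "c2": 0.5, "c3": 0.5}
ratings = {
    "c1": {"Popular": 3.0, "Niche": 5.0},
    "c2": {"Popular": 3.0},
    "c3": {"Popular": 3.0},
}
ranking, totals = recommend(similarities, ratings)
print(totals)   # "Popular" has the larger Total because three critics rated it
print(ranking)  # after dividing by Sim.Sum, "Niche" ranks first
```

Here "Popular" wins on Total only because more people rated it. Once each Total is divided by its Sim.Sum, the score becomes a similarity-weighted average and "Niche" comes out ahead.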
Well, now that we have a ranked list of films, you can decide whether you want to watch any of them. In fact, sometimes recommending nothing is itself a kind of recommendation.
Finally, I drew a flowchart with Lovely Charts (which, alas, does not support Chinese). If you want to design a recommendation system, you should by now have a rough idea of what to do: collect preferences, find similar users, rank the critics, and recommend items.