Exploring recommendation engines: a brief review of collaborative filtering algorithms

Source: Internet
Author: User

Experts in mathematics, statistics, data mining, and recommender systems: your comments are welcome.

I. Understanding mathematical expectation

Long ago, France had two great mathematicians, one named Blaise Pascal and the other Fermat. Pascal knew two gamblers, who asked him a question. Before gambling they had agreed that whoever first won 5 games would take all the money. After a long time, A had won 4 games and B had won 3; it was very late and they did not want to continue. So how should the money be divided? Should it be split into 7 shares, with the winner of 4 games taking 4 shares and the winner of 3 games taking 3? Or, because the agreement was for 5 games and neither reached it, should each take half? Neither method is correct. The correct answer is that the player with 4 wins takes 3/4 of the money and the player with 3 wins takes 1/4.

Why? Suppose they play one more game: A has a 1/2 chance of winning his 5th game, and B has a 1/2 chance of winning his 4th. If A wins his 5th game, all the money is his. If B wins his 4th game, the score is 4 to 4, and in the following game A and B each have a 1/2 chance of winning their 5th. So, if play continued until someone won 5 games, A's probability of taking all the money is 1/2 + 1/2 × 1/2 = 3/4, and B's is of course 1/4.
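The 3/4 versus 1/4 split can be checked with a small recursion. This is a minimal sketch, assuming every game is a fair 50/50 contest; the function name `win_probability` is made up for illustration.

```python
from fractions import Fraction
from functools import lru_cache

HALF = Fraction(1, 2)  # assumption: each game is a fair coin flip


@lru_cache(maxsize=None)
def win_probability(a_needs: int, b_needs: int) -> Fraction:
    """Probability that A reaches the target first, given the wins each still needs."""
    if a_needs == 0:
        return Fraction(1)
    if b_needs == 0:
        return Fraction(0)
    # Either A wins the next game or B does, each with probability 1/2.
    return (HALF * win_probability(a_needs - 1, b_needs)
            + HALF * win_probability(a_needs, b_needs - 1))


# A has 4 of 5 wins (needs 1 more); B has 3 of 5 wins (needs 2 more).
share_a = win_probability(1, 2)
share_b = 1 - share_a
print(share_a, share_b)
```

The same function handles any interrupted score, for example `win_probability(2, 3)` when A needs two more wins and B needs three.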

This is where mathematical expectation comes from.

Why am I writing about this? Because I found that in many collaborative filtering formulas the numerator is a sum of products of weights and values and the denominator is the sum of the weights (in other words, a weighted average). Looked at separately, once the weights are normalized so that they sum to 1 they behave like probabilities, and the formula becomes a sum of probability times value.

Expectation (expected value) is the sum of the products of each probability and its value.
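To make the link between a weighted average and an expectation concrete, here is a small sketch. The ratings and weights are hypothetical numbers chosen only for illustration.

```python
def weighted_average(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def expectation(values, probabilities):
    """Sum of probability * value; the probabilities are assumed to sum to 1."""
    return sum(p * v for v, p in zip(values, probabilities))


values = [5, 3, 4]         # hypothetical ratings
weights = [2.0, 1.0, 1.0]  # hypothetical similarity weights

# Normalizing the weights turns them into numbers that sum to 1.
total = sum(weights)
probabilities = [w / total for w in weights]

average = weighted_average(values, weights)
expected = expectation(values, probabilities)
print(average, expected)
```

Both calls give the same number, which is the point: dividing by the sum of the weights is the same as normalizing them first.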

Common misunderstandings:
1. Mathematical expectation is the result a party most wants, or believes most realistic. This confuses it with the everyday meaning of the word. For example: a free throw has a hit rate of 0.6, scoring 1 point for a hit and 0 for a miss; for a single throw, the "expected" score is then taken to be 1 or 0 (1 if I hope to score, 0 if I am sure I will miss). Resolution: the probabilities do affect the expectation, and this reading ignores them. ("Expectation" is short for "mathematical expectation"; I will keep using the full term.)
2. The outcome with the greatest probability. With options A, B, C, and D, we generally guess B or C as the correct answer. With 0.1 for a and 0.9 for b, must you choose b?
3. The average of the outcomes that may occur.

I had always thought it was the third. Why? The definition says that expectation reflects the average level of the values of a discrete random variable. Perhaps my understanding of the third was wrong: perhaps I was thinking of a plain mean, not one weighted by probability.

Worked examples
1. A bookstore plans to order a new edition of a book. Based on past experience, sales of the new book are predicted to be 40, 100, or 120 copies with probabilities 0.2, 0.7, and 0.1. The purchase price is 6 yuan per copy and the selling price is 8 yuan; copies that do not sell can only be disposed of as remainders at 5 yuan each. Help the shopkeeper decide how many copies it is most reasonable to order.

Answer: I could have done this in high school, but now I had no idea. After looking up the answer on Baidu, I realized I had overlooked an important implicit condition: the probabilities of 40, 100, and 120 sum to 1. What does this tell us? That these three are the only possible sales figures; selling 41, 101, 119, and so on cannot happen.
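The bookstore problem can be worked through as an expected-profit comparison. This is a sketch under two assumptions not stated in the problem: the shop only considers ordering 40, 100, or 120 copies, and "most reasonable" means highest expected profit.

```python
demand_probs = {40: 0.2, 100: 0.7, 120: 0.1}  # sales figures and probabilities from the problem
COST, PRICE, SALVAGE = 6, 8, 5                # yuan per copy


def profit(order: int, demand: int) -> int:
    """Profit for one scenario: 2 yuan gained per copy sold, 1 yuan lost per copy left over."""
    sold = min(order, demand)
    unsold = order - sold
    return sold * (PRICE - COST) - unsold * (COST - SALVAGE)


def expected_profit(order: int) -> float:
    """Probability-weighted profit over the three possible sales figures."""
    return sum(p * profit(order, demand) for demand, p in demand_probs.items())


# Assumption: candidate order sizes are exactly the three possible demands.
results = {order: expected_profit(order) for order in sorted(demand_probs)}
best_order = max(results, key=results.get)

for order, value in results.items():
    print(f"order {order}: expected profit {value:.1f} yuan")
print("best order:", best_order)
```

Under these assumptions the expected profits are 80, 164, and 150 yuan, so ordering 100 copies comes out ahead.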

After reading the answer, I felt that the first interpretation was the right reading of expectation after all, which seemed to contradict what I said above.

2. Roll a die: the expected number of points is 1/6 × (6 × (1 + 6)/2) = 3.5. Why does a fractional value appear?

After thinking for a long while, I still had not resolved my question.

I went to a Q&A community to look it up, and found the following question and the related replies.

Mathematical expectation is also translated as "expectation", and in some fields, such as asset pricing theory, the mathematical "expectation" is almost equal to a person's psychological "expectation" about asset prices. But look at this example: toss a fair coin, scoring +1 for heads and -1 for tails. The mathematical "expectation" is 0, but everyone knows the result can only be +1 or -1, never 0, so naturally nobody "expects" a result of 0 points.
In short, who can give mathematical expectation an intuitively acceptable explanation?

1. In short, the center of the doughnut is not in the doughnut.
2. It can be understood as the outcomes of the experiment weighted by their probabilities. Over a large number of experiments, the average of the results gets closer and closer to the expectation.
3. First, expectation is built on probability; it is an expectation about the unknown. The original poster should distinguish between the actual result and the result you expect. Take the discrete case as an example.

You first know the value X_{i} in each state i, as well as its probability p_{i}. Then you can work out the expectation. In most cases, the probability is approximated by the frequency, which is the number of times the event occurred divided by the total number of experiments. This definition already hides the condition of a large sample. Thus, expectation is the result you expect after many experiments, not the result of the next experiment or of any single experiment.

4. Only when the sample is as large as the population does the frequency equal the probability; that is, expectation is a value defined under probability and has nothing to do with any single trial.
5. I think @he Jingyu's answer is not entirely accurate, and its first sentence misleads the asker. Expectation is defined by the probability density function (I am on my phone, so I will not give the formula); it describes a characteristic of the distribution that a variable follows. The mean is a characteristic of a sample: given a set of samples from an unknown distribution, the mean can still be calculated. More extremely, even if the samples do not share the same distribution, you can still compute the mean.
Why are these two concepts often confused?
6. Having read the answers above, none of them gets to the essence of the question. Things like averages are something a middle school student understands, so the asker certainly understands them too. Since I am on my phone, I can only discuss this briefly. The mathematical expectation itself is not an average; it is itself a random variable. From a statistical perspective, it is an unbiased estimate based on the sample. From a probabilistic perspective, you need to define the set of events, the sigma-field, the measure, and so on. The expectation, as a random variable, is the best approximation of each random variable on the generated sigma-field.
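The point that expectation describes the long-run average rather than any single outcome can be seen in a small simulation of the die example. This is a sketch only; the seed and sample sizes are arbitrary choices.

```python
import random
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Exact expectation: each face has probability 1/6.
exact = sum(Fraction(1, 6) * face for face in faces)

rng = random.Random(0)  # fixed seed, chosen arbitrarily so runs are repeatable


def sample_mean(n: int) -> float:
    """Average of n simulated rolls of a fair die."""
    return sum(rng.choice(faces) for _ in range(n)) / n


for n in (10, 1000, 100000):
    print(n, sample_mean(n))

mean_large = sample_mean(100000)
print("exact expectation:", exact)
```

No single roll can ever show 3.5, yet the sample mean settles near it as the number of rolls grows, which matches reply 2 above.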

My feeling is that this is a high-end community: everyone gives their own opinion instead of CTRL + C and CTRL + V.

Is my question naive? I don't think so. When I was studying PCA and LDA and reading up on matrices, I found through Google an article on CSDN, by an MVP, called "Understanding Matrix", which treats a matrix as a transformation. It has three parts and took 2 years from start to finish. (PS: after reading it I still do not understand; mainly, I still do not know the engineering significance of variance.) Aren't the two situations very similar?

II. Understanding the Pearson correlation coefficient

What I want to know is why the value of the Pearson formula ranges from -1 to 1 (how can you see or understand that just by looking at it)? Only by understanding this can we propose our own correlation coefficient formula.

This reminds me of the U-I plot in high school physics: the teacher told us to draw the straight line closest to the points, and at university I learned that this is least squares. So the Pearson correlation coefficient of the points should have some connection with the slope of the least-squares line, or one could serve as a measure of the other. Math experts, please comment.
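Both questions have standard answers that are easy to check numerically. The coefficient is a dot product of two centered vectors divided by the product of their lengths, so the Cauchy-Schwarz inequality keeps it between -1 and 1; and the least-squares slope equals the coefficient multiplied by the ratio of the spread of y to the spread of x. A minimal sketch, with hypothetical U-I readings:

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def spread(xs):
    """Length of the centered vector (square root of the sum of squared deviations)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs))


def covariance_sum(xs, ys):
    """Dot product of the two centered vectors."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))


def pearson(xs, ys):
    return covariance_sum(xs, ys) / (spread(xs) * spread(ys))


def least_squares_slope(xs, ys):
    """Slope b of the least-squares line y = a + b * x."""
    return covariance_sum(xs, ys) / spread(xs) ** 2


currents = [0.1, 0.2, 0.3, 0.4, 0.5]  # hypothetical readings
voltages = [1.1, 1.9, 3.2, 3.9, 5.1]  # hypothetical readings

r = pearson(currents, voltages)
slope = least_squares_slope(currents, voltages)
print(r, slope, r * spread(voltages) / spread(currents))
```

So the slope carries the units and scale of the data, while the coefficient is the same quantity with the scale divided out.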

The correlation coefficient is defined only when the standard deviations of both variables are nonzero. The Pearson correlation coefficient applies when:

(1) The two variables have a linear relationship and are both continuous data.

(2) The populations of the two variables are normal, or nearly normal unimodal distributions.

(3) The observations of the two variables come in pairs, and each pair of observations is independent of the others.

As can be seen from the above, user-based collaborative filtering does not clearly fit the conditions for the Pearson correlation coefficient; at the least, the second condition is uncertain (whether the total number of purchases per item is normally distributed).

In MATLAB, the corr(X, Y) function is used.

Theorem: the necessary and sufficient condition for |ρ_{XY}| = 1 is that there exist constants a and b such that P{Y = a + bX} = 1. As this shows, the Pearson coefficient measures the linear correlation between two sets of data.

Note that the correlation coefficient has an obvious disadvantage: how close it is to 1 depends on the number of data pairs n, which easily gives a false impression. When n is small, the correlation coefficient fluctuates greatly, and the absolute value of some sample correlation coefficients easily comes close to 1; when n is large, the absolute value tends to be smaller. In particular, when n = 2 the absolute value of the correlation coefficient is always 1. Therefore, when the sample size n is small, it is inappropriate to conclude from the correlation coefficient alone that x and y have a close linear relationship.
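The n = 2 case is easy to verify: two points always lie exactly on a line, so the coefficient can only be +1 or -1. A small sketch with randomly generated pairs (the seed is arbitrary):

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den


rng = random.Random(1)  # fixed seed, chosen arbitrarily

# Five experiments, each with only n = 2 unrelated random observations.
two_point_values = []
for _ in range(5):
    xs = [rng.random(), rng.random()]
    ys = [rng.random(), rng.random()]
    two_point_values.append(pearson(xs, ys))

print([round(v, 6) for v in two_point_values])
```

Every value printed is 1 or -1 up to rounding, even though the data are pure noise, which is why a high coefficient on very few co-rated items should not be trusted.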

III. User-based collaborative filtering algorithm

If you are not very familiar with collaborative filtering, please see this earlier article by the author (hyperlink).

The user-based collaborative filtering algorithm first uses the user's historical behavior to find other users who are similar to the new user, and then predicts the items the new user may like from those similar users' ratings of other items. Given the user rating matrix R, the algorithm needs to define a similarity function s: U × U → ℝ to calculate the similarity between users, and then computes the recommendations from the rating data and the similarity matrix.

In collaborative filtering, an important step is choosing an appropriate similarity measure. The two commonly used measures are the Pearson correlation coefficient and cosine similarity.
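As a rough illustration of how the two measures differ, the sketch below computes both for two hypothetical users who rated the same three items. It relies on a standard identity: the Pearson coefficient is the cosine similarity of the mean-centered vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_similarity(a, b):
    """Pearson coefficient, computed as the cosine of the mean-centered vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


# Hypothetical ratings of the same three items by two users.
user_x = [5, 4, 2]
user_y = [3, 4, 5]

cos_xy = cosine_similarity(user_x, user_y)
pearson_xy = pearson_similarity(user_x, user_y)
print(round(cos_xy, 3), round(pearson_xy, 3))
```

Cosine similarity is high because all ratings are positive numbers, while Pearson is strongly negative because one user's ratings fall where the other's rise. Which behavior is preferable depends on whether each user's rating scale should be factored out.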

Can these two methods be used in the item-based collaborative filtering algorithm? I think not. Why? Think about where you would get two vectors describing an item. By itself an item has at most a sales count; to get a vector, it has to be tied to specific users. That is my feeling, but I have never seen anyone use it that way, haha.

Another important step is to calculate user u's predicted score for items the user has not yet rated. First, based on the similarities calculated in the previous step, find the neighbor set N ⊆ U of user u, where N is the neighbor set and U is the user set. Then, in combination with the user rating data set, predict user u's score for item i. The formula is as follows:
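For reference, a widely used form of this prediction is written out here. It is an assumption based on the standard mean-centered formulation and on the verbal description in this section, not a copy of the source's figure; in particular, the absolute value in the denominator is assumed. Here p_{u,i} is the predicted score, \bar{r}_u is the mean of user u's known scores, r_{u',i} is neighbor u''s score for item i, and s(u, u') is the similarity.

```latex
p_{u,i} = \bar{r}_u + \frac{\sum_{u' \in N} s(u, u')\,\bigl(r_{u',i} - \bar{r}_{u'}\bigr)}{\sum_{u' \in N} \lvert s(u, u') \rvert}
```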

My understanding of the above formula: user u's score for the not-yet-purchased item j is the average of user u's scores over the item set other than j, {N - j}, plus the average, weighted by the similarity between user u and each of the top-K users u', of the difference between user u''s score for j and user u''s mean score. (Then the question arises: what if the user has not scored the other items in N either? The author of that article uses 0, but my intuition is that there should be other effective approaches. Also, step one below, finding user C's neighbors, simply does not use the unknown values, hehe, which is not really rigorous.)

I find the sentence above very cumbersome; six months from now I would not be confident of writing the formula from it, haha.

The figure is not very clear, but that does not matter: just check which symbols carry the prime, and combine it with the example below. After reading this formula I understood what was wrong in my own article: for the final ranking of recommended items, I directly summed the similarities between the user to be recommended to and those top-K users who had bought the item, and then sorted by that sum.

where s(u, u') represents the similarity of user u and user u'.

Suppose the following e-commerce rating data set is given, and we want to predict user C's rating for item 4.

Users     Item 1   Item 2   Item 3   Item 4
User A    4        ?        3        5
User B    ?        5        4        ?
User C    5        4        2        ?
User D    2        4        ?        3
User E    3        4        5        ?

In the table, ? indicates that the rating is unknown. Following the steps of the user-based collaborative filtering algorithm, user C's rating for item 4 is calculated as shown below.

(1) Find the neighbor of User C

As the data set shows, only user A and user D have rated item 4, so there are only 2 candidate neighbors: user A and user D. User A's average rating is 4, user C's average rating is 3.667, and user D's average rating is 3. According to the Pearson correlation coefficient formula, the similarity of user C and user A is:

Similarly, s(C, D) = -0.515.

(2) Predict user C's rating for item 4

Based on the above rating prediction formula, user C's rating for item 4 is calculated as follows:
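The whole worked example can be sketched in Python. Several things here are assumptions rather than the source's exact method: the similarity is the Pearson coefficient summed over co-rated items and centered on each user's overall mean (this convention gives about -0.514 for s(C, D), in line with the -0.515 quoted above up to rounding), and the prediction uses the common mean-centered form with absolute similarities in the denominator.

```python
import math

# Ratings from the table above; unknown ratings are simply absent.
ratings = {
    "A": {1: 4, 3: 3, 4: 5},
    "B": {2: 5, 3: 4},
    "C": {1: 5, 2: 4, 3: 2},
    "D": {1: 2, 2: 4, 4: 3},
    "E": {1: 3, 2: 4, 3: 5},
}


def user_mean(user):
    """Mean of all ratings the user has given."""
    values = ratings[user].values()
    return sum(values) / len(values)


def pearson(u, v):
    """Pearson similarity over co-rated items, centered on each user's overall mean."""
    common = ratings[u].keys() & ratings[v].keys()
    mean_u, mean_v = user_mean(u), user_mean(v)
    num = sum((ratings[u][i] - mean_u) * (ratings[v][i] - mean_v) for i in common)
    den_u = math.sqrt(sum((ratings[u][i] - mean_u) ** 2 for i in common))
    den_v = math.sqrt(sum((ratings[v][i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0  # assumption: treat an undefined similarity as 0
    return num / (den_u * den_v)


def predict(user, item):
    """Assumed form: user's mean plus similarity-weighted neighbor deviations."""
    neighbors = [v for v in ratings if v != user and item in ratings[v]]
    sims = {v: pearson(user, v) for v in neighbors}
    norm = sum(abs(s) for s in sims.values())
    if norm == 0:
        return user_mean(user)  # assumption: fall back to the user's own mean
    offset = sum(s * (ratings[v][item] - user_mean(v)) for v, s in sims.items())
    return user_mean(user) + offset / norm


s_ca = pearson("C", "A")
s_cd = pearson("C", "D")
prediction = predict("C", 4)
print(round(s_ca, 3), round(s_cd, 3), round(prediction, 3))
```

Under these assumptions s(C, A) is about 0.781 and the predicted rating is about 4.27. User D contributes nothing to the offset because D's rating of item 4 equals D's own mean.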

And so on, you can calculate other unknown scores.

IV. Item-based collaborative filtering algorithm

The item-based collaborative filtering algorithm is another common algorithm. Unlike the user-based algorithm, it calculates the similarity between items in order to predict user ratings. This means the similarities between items can be calculated in advance, which improves performance. The item-based algorithm predicts the rating of a target item from the user rating data and the precomputed item similarity matrix.

As with the user-based algorithm, the similarity between items must be calculated first. The Pearson correlation coefficient or cosine similarity can again be used; here we give a similarity calculation common in e-commerce systems, which computes the similarity between items based on conditional probability. The formula is as follows:

I do not fully understand this formula, nor fully accept it, because I do not know whether its value lies between 0 and 1. Dear readers, do you agree?

where s(i, j) represents the similarity between items i and j, Freq(ij) represents the frequency with which i and j occur together, Freq(i) represents the frequency with which i occurs, Freq(j) represents the frequency with which j occurs, and the damping factor is mainly used to balance popular and unpopular items, for example best-selling goods in e-commerce.
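One form of conditional-probability similarity that appears in the item-based recommendation literature divides the co-occurrence count by Freq(i) and by Freq(j) raised to a damping exponent. The sketch below assumes that form; the parameter name `alpha`, its default value, and the counts are all hypothetical, and the source's own formula may differ.

```python
def conditional_similarity(freq_ij, freq_i, freq_j, alpha=0.5):
    """Assumed form: Freq(ij) / (Freq(i) * Freq(j) ** alpha).

    alpha = 0 gives the plain conditional frequency of j given i;
    larger alpha penalizes very popular items j more strongly.
    """
    if freq_i == 0 or freq_j == 0:
        return 0.0  # assumption: no data means no similarity
    return freq_ij / (freq_i * freq_j ** alpha)


# Hypothetical counts: item i seen 50 times, popular item j seen 1000 times,
# and the two seen together 30 times.
plain = conditional_similarity(30, 50, 1000, alpha=0.0)
damped = conditional_similarity(30, 50, 1000, alpha=0.5)
print(plain, damped)
```

On the question of range: if the frequencies are raw counts, then Freq(ij) cannot exceed Freq(i) and Freq(j) is at least 1, so under this assumed form with alpha ≥ 0 the value stays between 0 and 1. With relative frequencies instead of counts that argument no longer holds.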

Next, using the item similarity matrix calculated above together with the user's existing ratings, the unknown rating is predicted. The prediction formula is as follows:

V. Practical application

Only the item-based case is covered here; the user-based case has already been introduced in detail above.

In an e-commerce recommendation system, computing commodity similarity plays a very important role. It can be used for many specific recommendations, such as recommending the top N most similar products based directly on the product currently being viewed. It can also be applied to personalized recommendation, so as to recommend products tailored to each user. E-commerce sites collect a large volume of user logs, such as user click logs.

Based on the item-based collaborative filtering algorithm, the author proposes an incremental product similarity calculation solution.

The specific calculation steps are as follows.

1) Get the user click behavior data and filter out noise, such as records lacking product information. This yields the user session ID (SessionID), product ID, browse time, and other information, as shown in Table 5-1. Because A4's browse time differs greatly from those of A1, A2, and A3, it is filtered out (the threshold here is defined as 1800 seconds), as shown in Table 5-2.

Table 5-1 User click behavior log
User session ID   Browsed item   Browse time   Item pairs
                  A1             20:12         (A1, A2), (A1, A3)
                  A2             20:13         (A2, A1), (A2, A3)
                  A3             20:15         (A3, A1), (A3, A2)
                  A4             23:30

Table 5-2 Filtered user click behavior log
Browsed item   Browse time   Item pairs
A1             20:12         (A1, A2), (A1, A3)
A2             20:13         (A2, A1), (A2, A3)
A3             20:15         (A3, A1), (A3, A2)

2) First, calculate the number of common clicks between any two items. Then calculate the commodity similarity using the conditional-probability-based method. The commodity similarity formula is as follows.

where s(i, j) represents the similarity between items i and j, Freq(ij) represents the frequency with which i and j occur together, Freq(i) represents the frequency with which i occurs, and Freq(j) represents the frequency with which j occurs.

3) Combine the result with the commodity similarity data calculated on the previous day, compare the two (a voting judgment), and select the larger similarity as the new commodity similarity, thereby achieving incremental commodity similarity calculation.
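The three steps can be sketched end to end as follows. This is an illustration under stated assumptions, not the author's implementation: a click is dropped when it comes more than 1800 seconds after the previous kept click, the similarity is taken as Freq(ij) / (Freq(i) × Freq(j)), and yesterday's similarity values are made-up numbers.

```python
from collections import Counter
from itertools import permutations

SESSION_GAP_SECONDS = 1800  # threshold given in step 1


def to_seconds(hhmm):
    hours, minutes = hhmm.split(":")
    return int(hours) * 3600 + int(minutes) * 60


def filter_clicks(clicks):
    """Assumed rule: keep a click only if it follows the previous kept click within the gap."""
    kept = []
    for item, time in sorted(clicks, key=lambda click: to_seconds(click[1])):
        if not kept or to_seconds(time) - to_seconds(kept[-1][1]) <= SESSION_GAP_SECONDS:
            kept.append((item, time))
    return kept


def count_frequencies(sessions):
    """Count item occurrences and ordered item pairs per filtered session."""
    item_freq, pair_freq = Counter(), Counter()
    for clicks in sessions:
        items = {item for item, _ in filter_clicks(clicks)}
        item_freq.update(items)
        pair_freq.update(permutations(sorted(items), 2))
    return item_freq, pair_freq


def similarities(item_freq, pair_freq):
    """Assumed form: Freq(ij) / (Freq(i) * Freq(j))."""
    return {(i, j): n / (item_freq[i] * item_freq[j]) for (i, j), n in pair_freq.items()}


def merge_incremental(previous, today):
    """Step 3: keep the larger similarity for each pair."""
    merged = dict(previous)
    for pair, value in today.items():
        merged[pair] = max(value, merged.get(pair, 0.0))
    return merged


# One session with the clicks from Table 5-1.
session = [("A1", "20:12"), ("A2", "20:13"), ("A3", "20:15"), ("A4", "23:30")]
item_freq, pair_freq = count_frequencies([session])
today = similarities(item_freq, pair_freq)

yesterday = {("A1", "A2"): 0.4, ("A1", "A5"): 0.3}  # hypothetical earlier results
merged = merge_incremental(yesterday, today)
print(sorted(merged.items()))
```

Taking the maximum means a pair's similarity never decreases from day to day; whether that is desirable depends on how quickly stale co-click patterns should fade.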

VI. References

1. Community Awareness

2. Data mining and data operation Combat: ideas, methods, techniques and applications

3. Baidu

4. Red and Black Alliance
