Collaborative filtering
Preference information, such as ratings, is easy to collect wherever there is a user-item relationship. Collaborative filtering is the family of methods that recommends items to users by exploiting this scattered preference data and the latent relationships behind it.
The effectiveness of this kind of filtering rests on two assumptions:
- Users' preferences are similar, i.e. users can be grouped; the more distinct the groups, the more accurate the recommendations.
- Items are related to one another, i.e. someone who likes one item is likely to like a related item as well.
How well each assumption holds varies with the setting, and an application needs to be tuned accordingly. For literary works on Douban, a user's preferences correlate more strongly with the user's own profile, whereas on an e-commerce site the intrinsic links between products have the more noticeable effect on purchasing behaviour. In recommendation work these two directions are called user-based and item-based collaborative filtering; this article covers the user-based variant.
An example: recommendations from movie ratings
The main content of this article is recommending items based on the similarity of user preferences. The dataset is the MovieLens movie-rating data collected by the GroupLens research group from the late 1990s to the early 2000s: roughly 1 million ratings on a five-point scale, given by about 6,000 users to about 4,000 movies. The data package can be downloaded from the Internet and contains three tables: users, movies and ratings. Since the subject here is user preference, only the ratings file is used; the other two files hold the user and movie meta information respectively.
The analysis uses pandas in an IPython environment, so numpy and matplotlib are effectively loaded by default. The prompts in the code below do not look like IPython only because the IDLE-style >>> prompt reads better on a blog.
Data preparation
First, read the rating data from ratings.dat into a DataFrame:
>>> import pandas as pd
>>> from pandas import Series, DataFrame
>>> rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
>>> ratings = pd.read_table(r'ratings.dat', sep='::', header=None, names=rnames)
>>> ratings[:3]
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968

[3 rows x 4 columns]
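As an aside (not used anywhere below), the two metadata files can be read the same way; the column names here follow the MovieLens 1M README:

>>> unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
>>> users = pd.read_table(r'users.dat', sep='::', header=None, names=unames)
>>> mnames = ['movie_id', 'title', 'genres']
>>> movies = pd.read_table(r'movies.dat', sep='::', header=None, names=mnames)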
The only three columns of the ratings table we need are user_id, movie_id and rating, so we pivot them into a table data with user as rows, movie as columns and rating as values. (Swapping user and movie would arguably be the more sensible layout, since corr() works on columns and user-user correlations would then not require transposing, but rerunning everything was too much trouble, so it stays as is.)
>>> data = ratings.pivot(index='user_id', columns='movie_id', values='rating')
>>> data[:5]
movie_id   1   2   3   4   5   6 ...
user_id
1           5 NaN NaN NaN NaN NaN ...
2         NaN NaN NaN NaN NaN NaN ...
3         NaN NaN NaN NaN NaN NaN ...
4         NaN NaN NaN NaN NaN NaN ...
5         NaN NaN NaN NaN NaN   2 ...
You can see the table is quite sparse, with a fill rate of around 5%. The first step towards recommendations is to compute the correlation coefficients between users, and a DataFrame conveniently provides a corr(method='pearson', min_periods=1) method that computes correlation coefficients between all pairs of columns. The method defaults to the Pearson correlation coefficient, which suits us fine. The only open question is the min_periods parameter, which sets the minimum number of overlapping samples required before a pair of columns is computed at all; below that threshold the coefficient is left out. The choice of this value trades off against the accuracy of the correlation estimates, so it has to be determined first.
The correlation coefficient measures the linear relationship between two variables and lies in [-1, 1]: -1 means perfectly negatively correlated, 0 uncorrelated, and 1 perfectly positively correlated. As a rule of thumb, 0~0.1 is considered weak or no correlation, 0.1~0.4 correlated, and 0.4~1 strongly correlated.
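As a toy illustration (my own made-up numbers, not MovieLens data) of what min_periods does: when two columns share fewer non-NaN observations than min_periods, their coefficient is reported as NaN instead of an unreliable number.

>>> demo = DataFrame({'a': [1, 2, 3, 4, np.nan],
...                   'b': [2, np.nan, 3, 5, 4]})
>>> demo.corr()                  # the 3 overlapping rows are enough by default
          a         b
a  1.000000  0.928571
b  0.928571  1.000000
>>> demo.corr(min_periods=4)     # only 3 overlapping rows, so a-b is dropped
     a    b
a  1.0  NaN
b  NaN  1.0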
Determining the min_periods parameter
The basic approach to choosing such a parameter is to compare, for different values of min_periods, the standard deviation of the resulting correlation coefficients: the smaller the better. At the same time, our sample space is very sparse, and setting min_periods too high would shrink the result set too far, so only a compromise value can be chosen.
Here we estimate that standard deviation as follows: pick the pair of users with the most overlapping ratings in data, and treat the standard deviation of the correlation coefficient between them as an estimate of the overall one. On that premise, we compute the correlation coefficient of these two users at various sample sizes and watch how its standard deviation changes.
First, find the pair of users with the most overlapping ratings. We build a new user-by-user square DataFrame foo and fill it with the number of overlapping ratings for every pair of users:
>>> foo = DataFrame(np.empty((len(data.index), len(data.index)), dtype=int),
...                 index=data.index, columns=data.index)
>>> for i in foo.index:
...     for j in foo.columns:
...         foo.ix[i, j] = data.ix[i][data.ix[j].notnull()].dropna().count()
This code is extremely time-consuming, because the innermost statement executes len(data.index)**2, i.e. tens of millions of times (half of them redundant, since foo is a symmetric matrix); another reason is Python's GIL, which keeps it on a single CPU core. After it had been running for an hour I could not resist estimating the total time, found it would be more than three hours, and decisively hit Ctrl+C. From the partially filled foo, the code below shows that the row and column holding the maximum value correspond to users 424 and 4169, whose ratings overlap on 998 movies (a much faster vectorized alternative is sketched after this code block):
>>> for i in foo.index:
...     foo.ix[i, i] = 0            # set the diagonal to 0 first
>>> ser = Series(np.zeros(len(foo.index)))
>>> for i in foo.index:
...     ser[i] = foo[i].max()       # maximum of each row
>>> ser.idxmax()                    # index label of the maximum value in ser
4169
>>> ser[4169]                       # the maximum value itself
998
>>> foo[foo == 998][4169].dropna()  # get the other user_id
user_id
424    998
Name: 4169, dtype: float64
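As an aside (my addition, not part of the original run): the overlap counts can also be obtained without the hours-long double loop, since they are just a matrix product of the 'rated or not' masks. This needs a few hundred MB of memory, and because it works on the complete matrix it may well surface a pair with even more overlaps than the 424/4169 pair found in the interrupted run above:

>>> rated = data.notnull().astype(float)    # 1.0 where a user rated a movie, 0.0 otherwise
>>> overlap = rated.dot(rated.T)            # (i, j) = number of movies rated by both user i and user j
>>> for i in overlap.index:
...     overlap.ix[i, i] = 0                # ignore self-overlap on the diagonal
>>> i, j = np.unravel_index(overlap.values.argmax(), overlap.shape)
>>> overlap.index[i], overlap.columns[j], overlap.values[i, j]   # most-overlapping pair and its count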
We pull out the rating data of users 424 and 4169 into a table named test and find that the correlation between the two is 0.456, which is quite good. We also draw bar charts of their rating distributions:
>>> data.ix[4169].corr(data.ix[424])
0.45663851303413217
>>> test = data.reindex([424, 4169], columns=data.ix[4169][data.ix[424].notnull()].dropna().index)
>>> test
movie_id   2   6  ...
424        4   4  ...
4169       3   4  ...

>>> test.ix[424].value_counts(sort=False).plot(kind='bar')
>>> test.ix[4169].value_counts(sort=False).plot(kind='bar')
For the correlation coefficient of these two users, we then draw random samples of 10, 20, 50, 100, 200, 500 and 998 of their overlapping ratings, 20 samples at each size, and summarize the results:
>>> periods_test = DataFrame(np.zeros((20, 7)), columns=[10, 20, 50, 100, 200, 500, 998])
>>> for i in periods_test.index:
...     for j in periods_test.columns:
...         sample = test.reindex(columns=np.random.permutation(test.columns)[:j])
...         periods_test.ix[i, j] = sample.iloc[0].corr(sample.iloc[1])
>>> periods_test[:5]
         10        20        50        100       200       500       998
0 -0.306719  0.709073  0.504374  0.376921  0.477140  0.426938  0.456639
1  0.386658  0.607569  0.434761  0.471930  0.437222  0.430765  0.456639
2  0.507415  0.585808  0.440619  0.634782  0.490574  0.436799  0.456639
3  0.628112  0.628281  0.452331  0.380073  0.472045  0.444222  0.456639
4  0.792533  0.641503  0.444989  0.499253  0.426420  0.441292  0.456639

[5 rows x 7 columns]
>>> periods_test.describe()      # the 998 column is omitted here
              10         20         50        100        200        500
count  20.000000  20.000000  20.000000  20.000000  20.000000  20.000000
mean    0.346810   0.464726   0.458866   0.450155   0.467559        ...
std     0.398553   0.181743   0.103820   0.093663   0.036439   0.029758
min    -0.444302   0.087370   0.192391   0.242112   0.412291   0.399875
25%     0.174531   0.320941   0.434744   0.375643   0.439228   0.435290
50%     0.487157   0.525217   0.476653   0.468850   0.472562   0.443772
75%     0.638685   0.616643   0.519827   0.500825   0.487389   0.465787
max     0.850963   0.709073   0.592040   0.634782   0.546001   0.513486

[8 rows x 6 columns]
Judging from the std row, the ideal min_periods value is around 200. Some may feel 200 is too large and makes the algorithm useless for new users; it has to be said, though, that computing a correlation coefficient with large random error and then basing recommendations on it is not meaningful either.
Testing the algorithm
To validate the recommendation algorithm under min_periods=200, it is best to run a test first. The method: randomly select 1000 users with more than 200 ratings each; from each one, randomly extract a single rating, save it into an array, and delete that rating from the data table. Then, based on this pruned table, compute the expected value of each of the 1000 extracted ratings, and finally correlate those predictions with the array of real ratings to see how well it does.
>>> check_size = 1000
>>> check = {}
>>> check_data = data.copy()        # work on a copy so the original data is not tampered with
>>> check_data = check_data.ix[check_data.count(axis=1) > 200]   # keep only users with more than 200 ratings
>>> for user in np.random.permutation(check_data.index):
...     movie = np.random.permutation(check_data.ix[user].dropna().index)[0]
...     check[(user, movie)] = check_data.ix[user, movie]
...     check_data.ix[user, movie] = np.nan
...     check_size -= 1
...     if not check_size:
...         break
>>> corr = check_data.T.corr(min_periods=200)
>>> corr_clean = corr.dropna(how='all')
>>> corr_clean = corr_clean.dropna(axis=1, how='all')   # drop all-NaN rows and columns
>>> check_ser = Series(check)       # the 1000 real ratings that were extracted
>>> check_ser[:5]
(15, 593)     4
(23, 555)     3
...
dtype: float64
Next, based on corr_clean, we compute the expected rating for each (user, movie) pair in check_ser. The method: take the ratings that other users gave the movie, keep only users whose correlation coefficient with this user exceeds 0.1, and form their weighted average with the correlation coefficients as weights:
>>> result = Series(np.nan, index=check_ser.index)
>>> for user, movie in result.index:    # this loop looks messy; it just computes a weighted average
...     prediction = []
...     if user in corr_clean.index:
...         corr_set = corr_clean[user][corr_clean[user] > 0.1].dropna()   # only users with correlation > 0.1
...     else:
...         continue
...     for other in corr_set.index:
...         if not np.isnan(data.ix[other, movie]) and other != user:      # note: bool(np.nan) == True
...             prediction.append((data.ix[other, movie], corr_set[other]))
...     if prediction:
...         result[(user, movie)] = sum([value * weight for value, weight in prediction]) / sum([pair[1] for pair in prediction])
>>> result.dropna(inplace=True)
>>> len(result)     # of the 1000 sampled ratings, some users were also filtered out by min_periods=200
862
>>> result[:5]
(23, 555)     3.967617
...
dtype: float64
>>> result.corr(check_ser.reindex(result.index))
0.436227437429696
>>> (result - check_ser.reindex(result.index)).abs().describe()   # absolute difference between predicted and actual ratings
count    862.000000
mean       0.785337
std        0.605865
min        0.000000
25%        0.290384
50%        0.686033
75%        1.132256
max        3.629720
dtype: float64
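For readability, here is an equivalent way to compute a single prediction, using pandas indexing instead of the inner loop. This is a sketch of mine, not the original implementation (.ix as elsewhere in this article; newer pandas would use .loc); it assumes data, corr_clean and check_ser exist as above, and result2 is just a hypothetical name:

def predict(user, movie):
    # sketch: correlation-weighted average of other users' ratings of `movie`
    if user not in corr_clean.index:
        return np.nan
    weights = corr_clean[user].drop(user)             # correlations with all other users
    weights = weights[weights > 0.1].dropna()         # keep only those above 0.1
    ratings = data.ix[weights.index, movie].dropna()  # those users' ratings of this movie
    if len(ratings) == 0:
        return np.nan
    weights = weights[ratings.index]                  # align weights with available ratings
    return (ratings * weights).sum() / weights.sum()

result2 = Series({key: predict(*key) for key in check_ser.index}).dropna()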
A correlation coefficient of 0.436 on a sample of 862 is a fairly good result. If the users with 200 or fewer ratings had not been filtered out at the start, computing corr would obviously take much longer, and the sample surviving into result would be much smaller, around 200+; because of that smaller sample, though, the correlation coefficient would rise to about 0.5~0.6.
In addition, the statistics of the absolute difference between the predicted and the actual ratings also look reasonably good.
Making the recommendations
Once the tests above, especially the weighted-average part, are done, implementing the recommendation itself holds nothing new.
First, rebuild the corr table on the original data, which was never pruned:
>>> corr = data.T.corr(min_periods=200)
>>> corr_clean = corr.dropna(how='all')
>>> corr_clean = corr_clean.dropna(axis=1, how='all')
We randomly pick a user from corr_clean and build a recommendation list for them:
>>> lucky = np.random.permutation(corr_clean.index)[0]
>>> gift = data.ix[lucky]
>>> gift = gift[gift.isnull()]   # gift is now an all-NaN series: the movies lucky has not rated
The final task is to fill this gift:
>>> corr_lucky = corr_clean[lucky].drop(lucky)   # lucky's correlations with other users, excluding lucky itself
>>> corr_lucky = corr_lucky[corr_lucky > 0.1].dropna()   # keep only users with correlation > 0.1
>>> for movie in gift.index:                     # iterate over all movies lucky has never rated
...     prediction = []
...     for other in corr_lucky.index:           # iterate over all users correlated with lucky above 0.1
...         if not np.isnan(data.ix[other, movie]):
...             prediction.append((data.ix[other, movie], corr_clean[lucky][other]))
...     if prediction:
...         gift[movie] = sum([value * weight for value, weight in prediction]) / sum([pair[1] for pair in prediction])
>>> gift.dropna().order(ascending=False)         # sort the non-empty elements of gift in descending order
movie_id
3245    5.000000
2930    5.000000
2830    5.000000
2569    5.000000
1795    5.000000
981     5.000000
696     5.000000
682     5.000000
666     5.000000
572     5.000000
1420    5.000000
3338    4.845331
669     4.660464
214     4.655798
3410    4.624088
...
2833    1
2777    1
2039    1
1773    1
1720    1
1692    1
1538    1
1430    1
1311    1
1164    1
843     1
660     1
634     1
591     1
Name: 3945, Length: 2991, dtype: float64
Supplements
The examples above are prototype code and leave plenty of room for optimization. For example, the rows and columns of data could be swapped; the square foo used to determine min_periods only needs half its entries computed; some for loops and their bodies could be replaced by array object methods (the library versions are much faster than hand-written ones); there may even be a few bugs. Also, this dataset is not particularly large; going up an order of magnitude would require further optimizing the compute-intensive parts (such as corr), using multiple processes, or Cython/C code (or simply better hardware).
Although collaborative filtering is a relatively easy recommendation method to implement, in some situations it is not as good as recommending from meta information. Two common problems collaborative filtering runs into are:
- The sparsity problem: too few ratings lead to inaccurate correlation coefficients.
- The cold start problem: an item with too few ratings never 'earns' its way into the recommendation list.
Both are caused by sample sizes that are too small (the example above likewise required at least 200 valid overlapping ratings). So when recommending to new users or recommending new items, it may be better to fall back on more generic methods: for example, recommend films with higher average scores to new users, and recommend a new film to people who liked similar films, say by the same director or with the same actors. The latter approach requires maintaining a classification of items, which can be built either from item meta information or through clustering.
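For example, a crude fallback list for new users could simply be the best-liked movies among those rated often enough. A sketch reusing the data table from above; the threshold of 100 ratings is an arbitrary choice of mine, and .order is the old pandas spelling (newer versions call it sort_values):

>>> means = data.mean()                   # average rating of each movie (column-wise)
>>> counts = data.count()                 # number of ratings each movie received
>>> means[counts > 100].order(ascending=False)[:10]   # ten frequently rated, well-liked movies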