Collaborative Filtering Tutorial Using Python


Collaborative Filtering

Preference information, such as ratings, is easy to collect within the user-item relationship. Recommending items to users based on the possible associations hidden behind these scattered preferences is what is known as collaborative filtering.

The effectiveness of this kind of filtering rests on two assumptions:

User preferences are similar, that is, users can be grouped. The more distinct these groups are, the higher the recommendation accuracy.
There are associations between items, that is, anyone who likes one item is likely to also like a related item.

How well these two assumptions hold varies across environments, and the algorithm needs to be adjusted accordingly. For example, on Douban, preferences for literary and artistic works are strongly tied to the user's personal taste, while on e-commerce websites the internal relationships between products have a more significant impact on purchasing behavior. Applied to recommendation, these two directions are called user-based and item-based collaborative filtering. This article is user-based.
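
As a rough illustration of the user-based idea (a toy example added here for clarity, not part of the original walkthrough), consider a tiny hand-made rating table: users who rated the same movies similarly get a high correlation coefficient, and a missing rating can then be predicted from those similar users.

>>> import numpy as np
>>> import pandas as pd
>>> toy = pd.DataFrame({'movie_a': [5, 4, 1, np.nan],
...                     'movie_b': [4, 5, 2, 1],
...                     'movie_c': [1, 2, 5, 4]},
...                    index=['u1', 'u2', 'u3', 'u4'])
>>> toy.T.corr()    # transpose so that users become columns, then correlate users pairwise
>>> # u1 and u2 come out strongly positively correlated with each other and negatively
>>> # correlated with u3 and u4, so a user-based recommender would predict movie_a for u4
>>> # mainly from u3's rating of it (which is 1).
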
Movie Rating Recommendation Example

This article focuses on recommending items based on user preferences. The dataset is a set of movie ratings collected by GroupLens Research from MovieLens users between the late 1990s and the early 2000s: roughly 1 million ratings given by about 6,000 users to about 4,000 movies. The data package can be downloaded from the Internet and contains three tables: users, movies, and ratings. Since the topic of this article is user-based recommendation, only the ratings file is used; the other two files contain the metadata of users and movies respectively.

The data analysis package used in this article is pandas, and the environment is IPython, so NumPy and matplotlib are available by default. The prompts in the code below do not look like IPython's because the IDLE-style format is easier to read on the blog.
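
If you are not working inside a pylab-style IPython session, you will also need the explicit imports below; the rest of the code in this article assumes np and the plotting machinery are already available (a small addition for completeness):

>>> import numpy as np
>>> import matplotlib.pyplot as plt    # plt.show() is needed to display the bar charts outside IPython
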
Data normalization

First, read the rating data from ratings.dat into a DataFrame:

>>> import pandas as pd
>>> from pandas import Series, DataFrame
>>> rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
>>> ratings = pd.read_table(r'ratings.dat', sep='::', header=None, names=rnames)
>>> ratings[:3]
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968

[3 rows x 4 columns]

Of the ratings table we only need the user_id, movie_id, and rating columns, so we pivot them into a table whose rows are users, whose columns are movies, and whose cell values are the ratings. (It would actually be more sensible to swap the roles of user and movie, but re-running everything would be too much trouble, so it is left unchanged here.)
 

>>> data = ratings.pivot(index='user_id', columns='movie_id', values='rating')
>>> data[:5]
movie_id   1   2   3   4   5   6
user_id
1           5 NaN NaN NaN NaN NaN ...
2         NaN NaN NaN NaN NaN NaN ...
3         NaN NaN NaN NaN NaN NaN ...
4         NaN NaN NaN NaN NaN NaN ...
5         NaN NaN NaN NaN NaN   2 ...

We can see that this table is quite sparse; the fill rate is only about 5%. The first step toward recommendation is to compute the correlation coefficients between users. DataFrame offers a very convenient method, corr(method='pearson', min_periods=1), which computes the correlation coefficient between every pair of columns. The default Pearson correlation coefficient is fine for our purpose, so we will use it. The only question is the min_periods parameter, which sets the minimum number of overlapping samples required before a correlation coefficient is computed; column pairs with fewer overlapping values are not computed and come back as NaN. The choice of this value trades off coverage against the accuracy of the computed coefficients, so we need to determine it first.
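
To make the role of min_periods concrete, here is a small hypothetical illustration (the toy DataFrame and its column names are made up for this example): when two columns overlap on fewer non-null samples than min_periods, corr() simply returns NaN for that pair.

>>> demo = DataFrame({'x': [1, 2, 3, 4, np.nan, np.nan],
...                   'y': [2, 4, 6, np.nan, 1, 2]})
>>> demo.corr(min_periods=3)    # x and y overlap on 3 samples, so a coefficient is returned
>>> demo.corr(min_periods=4)    # overlap (3) < min_periods (4), so the x/y entry is NaN
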

The correlation coefficient measures the linear relationship between two variables and ranges over [-1, 1]: -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation. Roughly speaking, 0 to 0.1 is usually considered uncorrelated or weakly correlated, 0.1 to 0.4 moderately correlated, and 0.4 to 1 strongly correlated.
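
For reference, the Pearson correlation coefficient between two users' rating vectors $x$ and $y$ over their co-rated items is

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\;\sqrt{\sum_i (y_i - \bar{y})^2}} $$

which is exactly what corr(method='pearson') computes column by column.
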

Determining the min_periods Parameter

The basic approach to choosing such a parameter is to compute the standard deviation of the correlation coefficients for different values of min_periods: the smaller, the better. But we also have to keep in mind that our sample space is very sparse; if min_periods is set too high, the result set becomes too small. So we can only pick a compromise value.

Here we estimate the standard deviation of the rating data as follows: pick the pair of users with the most overlapping ratings in data, and use the standard deviation of the correlation coefficients computed between them as a stand-in for the overall standard deviation. Under that premise, we compute their correlation coefficient at various sample sizes and watch how the standard deviation behaves.

First, we need to find the pair of users with the most overlapping ratings. We create a new user-by-user square matrix foo and then fill in, pair by pair, the number of ratings each pair of users has in common:
 

>>> foo = DataFrame(np.empty((len(data.index), len(data.index)), dtype=int), index=data.index, columns=data.index)
>>> for i in foo.index:
        for j in foo.columns:
            foo.ix[i, j] = data.ix[i][data.ix[j].notnull()].dropna().count()

This code is extremely time-consuming, because the last statement has to execute roughly 6000*6000 = 36 million times (about half of which is repeated work, since the square matrix foo is symmetric). Another reason is Python's GIL, which keeps it on a single CPU core. After an hour of running I could not resist estimating the total time, found it would take more than three hours, and hit Ctrl+C. In the half-computed foo, the row and column corresponding to the largest value I found were 424 and 4169; these two users share 998 overlapping ratings:


 

>>> for i in foo.index:
        foo.ix[i, i] = 0    # first set the diagonal to 0

>>> ser = Series(np.zeros(len(foo.index)))
>>> for i in foo.index:
        ser[i] = foo[i].max()    # maximum value in each row

>>> ser.idxmax()    # the row where the maximum of ser is located
4169

>>> ser[4169]    # the maximum value itself
998

>>> foo[foo == 998][4169].dropna()    # obtain the other user_id
424    4169
Name: user_id, dtype: float64
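
As an aside, the same overlap counts can be obtained in seconds without the double loop above: the number of movies co-rated by every pair of users is just the matrix product of the notnull mask with its transpose. A minimal sketch of that idea (not part of the original run):

>>> mask = data.notnull().astype(int)     # 1 where a rating exists, 0 otherwise
>>> overlap = mask.dot(mask.T)            # entry (i, j) = number of movies rated by both user i and user j
>>> np.fill_diagonal(overlap.values, 0)   # ignore each user's overlap with himself
>>> overlap.max().max()                   # should recover the 998 found above
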

We take out the rating data of users 424 and 4169 separately and put them in a table named test. We also compute the correlation coefficient between the two users, which turns out to be 0.456, not bad. In addition, we can use bar charts to get a feel for how their ratings are distributed:

>>> data.ix[4169].corr(data.ix[424])
0.45663851303413217
>>> test = data.reindex([424, 4169], columns=data.ix[4169][data.ix[424].notnull()].dropna().index)
>>> test
movie_id  2   6   10  11  12  17 ...
424        4   4   4   4   1   5 ...
4169       3   4   4   4   2   5 ...

>>> test.ix[424].value_counts(sort=False).plot(kind='bar')
>>> test.ix[4169].value_counts(sort=False).plot(kind='bar')

To see how the correlation coefficient between these two users behaves at different sample sizes, we randomly draw 10, 20, 50, 100, 200, 500, and 998 of their overlapping ratings, 20 times for each size, and look at the statistics:

>>> periods_test = DataFrame(np.zeros((20, 7)), columns=[10, 20, 50, 100, 200, 500, 998])
>>> for i in periods_test.index:
        for j in periods_test.columns:
            sample = test.reindex(columns=np.random.permutation(test.columns)[:j])
            periods_test.ix[i, j] = sample.iloc[0].corr(sample.iloc[1])

>>> periods_test[:5]
        10        20        50        100       200       500       998
0  0.306719  0.709073  0.504374  0.376921  0.477140  0.426938  0.456639
1  0.386658  0.607569  0.434761  0.471930  0.437222  0.430765  0.456639
2  0.507415  0.585808  0.440619  0.634782  0.490574  0.436799  0.456639
3  0.628112  0.628281  0.452331  0.380073  0.472045  0.444222  0.456639
4  0.792533  0.641503  0.444989  0.499253  0.426420  0.441292  0.456639

[5 rows x 7 columns]

>>> periods_test.describe()    # the 998 column is omitted here
              10         20         50        100        200        500
count  20.000000  20.000000  20.000000  20.000000  20.000000  20.000000
mean    0.346810   0.464726   0.458866   0.450155   0.467559   0.452448
std     0.398553   0.181743   0.103820   0.093663   0.036439   0.029758
min    -0.444302  -0.087370   0.192391   0.242112   0.412291   0.399875
25%     0.174531   0.320941   0.434744   0.375643   0.439228   0.435290
50%     0.487157   0.525217   0.476653   0.468850   0.472562   0.443772
75%     0.638685   0.616643   0.519827   0.500825   0.487389   0.465787
max     0.850963   0.709073   0.592040   0.634782   0.546001   0.513486

[8 rows x 7 columns]

Judging from the std row, the ideal value of min_periods is around 200. Some may feel 200 is too large and makes the recommendation algorithm meaningless for new users; but computing a correlation coefficient with a large error and then making unreliable recommendations would be even more meaningless.
Algorithm Test

To verify how reliable the recommendation algorithm is under min_periods=200, we first run a test. The procedure: from the users who have rated more than 200 movies, randomly pick 1000; from each of them randomly extract one rating, store it in an array, and then delete that rating from the data table. Next, use the pruned data table to compute predictions for the 1000 extracted ratings, and compare them with the true ratings to see how well they agree.
 

>>> check_size = 1000
>>> check = {}
>>> check_data = data.copy()    # work on a copy of data for the test, to avoid tampering with the original
>>> check_data = check_data.ix[check_data.count(axis=1) > 200]    # keep only users with more than 200 ratings
>>> for user in np.random.permutation(check_data.index):
        movie = np.random.permutation(check_data.ix[user].dropna().index)[0]
        check[(user, movie)] = check_data.ix[user, movie]
        check_data.ix[user, movie] = np.nan
        check_size -= 1
        if not check_size:
            break

>>> corr = check_data.T.corr(min_periods=200)
>>> corr_clean = corr.dropna(how='all')
>>> corr_clean = corr_clean.dropna(axis=1, how='all')    # drop rows and columns that are entirely empty
>>> check_ser = Series(check)    # the 1000 real ratings that were extracted
>>> check_ser[:5]
(15, 593)     4
(23, 555)     3
(33, 3363)    4
(36, 2355)    5
(53, 3605)    4
dtype: float64

Next, based on corr_clean, we compute a predicted score for each of the 1000 user-movie pairs in check_ser. The method: take the weighted average of the ratings given to that movie by other users whose correlation coefficient with the user is greater than 0.1, with the correlation coefficients as the weights.
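
Written as a formula (this is just a restatement of the weighted average described above), the prediction for user $u$ and movie $m$ is

$$ \hat{r}_{u,m} = \frac{\sum_{o \in S_u} \operatorname{corr}(u, o)\, r_{o,m}}{\sum_{o \in S_u} \operatorname{corr}(u, o)} $$

where $S_u$ is the set of users whose correlation with $u$ exceeds 0.1 and who have rated movie $m$. The loop below implements exactly this:
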
 

>>> result = Series(np.nan, index=check_ser.index)
>>> for user, movie in result.index:    # this loop looks messy, but all it does is a weighted average
        prediction = []
        if user in corr_clean.index:
            corr_set = corr_clean[user][corr_clean[user] > 0.1].dropna()    # only users with correlation > 0.1
        else:
            continue
        for other in corr_set.index:
            if not np.isnan(data.ix[other, movie]) and other != user:    # note that bool(np.nan) == True
                prediction.append((data.ix[other, movie], corr_set[other]))
        if prediction:
            result[(user, movie)] = sum([value * weight for value, weight in prediction]) / sum([pair[1] for pair in prediction])

>>> result.dropna(inplace=True)
>>> len(result)    # some of the 1000 randomly chosen users were also filtered out by min_periods=200
862
>>> result[:5]
(23, 555)     3.967617
(33, 3363)    4.073205
(36, 2355)    3.903497
(53, 3605)    2.948003
(62, 1488)    2.606582
dtype: float64

>>> result.corr(check_ser.reindex(result.index))
0.436227437429696
>>> (result - check_ser.reindex(result.index)).abs().describe()    # absolute difference between predicted and actual ratings
count    862.000000
mean       0.785337
std        0.605865
min        0.000000
25%        0.290384
50%        0.686033
75%        1.132256
max        3.629720
dtype: float64

Over these 862 samples, the correlation coefficient between predictions and true ratings reaches 0.436, which is a fairly good result. If users with fewer than 200 ratings are not filtered out at the beginning, computing corr takes noticeably longer and the result contains only 200-odd samples; the correlation coefficient, however, rises to about 0.5 to 0.6, partly because the sample is so small.

In addition, the statistics of the absolute difference between predicted and actual ratings also look reasonable.
Recommendation

Having finished the test above, and in particular the weighted averaging, the actual recommendation step involves nothing new.

First, rebuild a corr table on the original, unpruned data:
 

>>> corr = data.T.corr(min_periods=200)
>>> corr_clean = corr.dropna(how='all')
>>> corr_clean = corr_clean.dropna(axis=1, how='all')

From corr_clean we randomly pick a user and build a recommendation list for them:
 

>>> lucky = np.random.permutation(corr_clean.index)[0]
>>> gift = data.ix[lucky]
>>> gift = gift[gift.isnull()]    # gift is an all-empty Series: the movies lucky has not rated

The final task is to fill in gift:
 

>>> corr_lucky = corr_clean[lucky].drop(lucky)    # correlation coefficients between lucky and the other users, excluding lucky itself
>>> corr_lucky = corr_lucky[corr_lucky > 0.1].dropna()    # keep only users whose correlation coefficient is greater than 0.1
>>> for movie in gift.index:    # iterate over all the movies lucky has not rated
        prediction = []
        for other in corr_lucky.index:    # iterate over all users whose correlation with lucky is greater than 0.1
            if not np.isnan(data.ix[other, movie]):
                prediction.append((data.ix[other, movie], corr_clean[lucky][other]))
        if prediction:
            gift[movie] = sum([value * weight for value, weight in prediction]) / sum([pair[1] for pair in prediction])

>>> gift.dropna().order(ascending=False)    # sort the non-empty elements of gift in descending order
movie_id
3245    5.000000
2930    5.000000
2830    5.000000
2569    5.000000
1795    5.000000
981     5.000000
696     5.000000
682     5.000000
666     5.000000
572     5.000000
1420    5.000000
3338    4.845331
669     4.660464
214     4.655798
3410    4.624088
...
2833    1
2777    1
2039    1
1773    1
1720    1
1692    1
1538    1
1430    1
1311    1
1164    1
843     1
660     1
634     1
591     1
Name: 156, Length: 3945, dtype: float64

 
Supplement

The examples above are all prototype code and leave a lot of room for optimization: for instance, the row/column orientation of data; the square matrix foo used to determine min_periods only needs half of its entries computed; some of the for loops and the corresponding operations could be replaced by array methods (the vectorized versions are much faster than hand-written loops). There are probably even a few bugs. Moreover, this dataset is not particularly large; if it grew by an order of magnitude, the computation-intensive parts (such as corr) would need further optimization, for example using multiple processes, Cython/C code, or simply better hardware.

Although collaborative filtering is a relatively easy-to-use recommendation method, in some situations it is not as effective as recommending based on meta-information. Two common problems of collaborative filtering are:

  1. Sparsity: when a user has rated too few items, the computed correlation coefficients are inaccurate.
  2. Cold start: when an item has received too few ratings, it never earns the "right" to enter the recommendation list.

Both problems boil down to sample sizes that are too small (in the example above, at least 200 overlapping ratings were required). So for new users and new items it may be better to fall back on more general methods: for example, recommend movies with high average scores to new users, and recommend new movies to people who liked similar films (for instance, films by the same director or with the same actors). The latter requires maintaining an item classification table, which can be built from item meta-information or obtained through clustering.
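
As an illustration of the "high average score" fallback mentioned above (a small sketch added here; the threshold of 20 ratings is an arbitrary choice), the per-movie averages are easy to obtain from the same data table:

>>> movie_means = data.mean()                     # column-wise mean rating of each movie
>>> movie_counts = data.count()                   # number of ratings each movie has received
>>> popular = movie_means[movie_counts >= 20]     # ignore movies with very few ratings
>>> popular.order(ascending=False)[:10]           # ten highest-rated movies to show a cold-start user
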
