The Basic Principle and Implementation of the Collaborative Filtering Algorithm


It is well known that collaborative filtering is one of the most commonly used algorithms in recommender systems. Taking movie recommendation as an example, this article briefly discusses the basic principle and finally gives a Python implementation.


1. Problem definition


Suppose we have a two-dimensional table that records each user's rating of the movies they have seen, for example (this is the test data used in the code of Section 5; "?" marks a missing rating):

         Movie 1   Movie 2   Movie 3   Movie 4   Movie 5
User 1      5         5         ?         0         0
User 2      5         ?         4         0         0
User 3      0         ?         0         5         5
User 4      0         0         ?         4         ?

The table records the ratings of 4 users on 5 movies, some of which are missing. Our goal is to fill in the missing values from the information in the table; in practical terms, to estimate a user's rating of a movie from the user's preferences and the movie's features. The problem therefore splits into two sub-problems: first, how to measure user preferences; second, how to determine the features of a movie. We discuss these two aspects separately below.


2. Known movie features


Assume that each movie has two features x1 and x2, where x1 measures the degree to which the movie is romantic and x2 the degree to which it is funny. If a movie's feature vector is (1, 0), it is a pure romance; if it is (0, 1), the movie is a pure comedy; if it is (0.6, 0.5), the movie is a romantic comedy.

Suppose we have obtained, through manual tagging, the feature data of every movie; we can then use these data to learn each user's preferences. Since the movie features x are two-dimensional, the user preference vector θ is also two-dimensional. If we learn that Alice's preference vector is θ = (4.5, 1), and the movie "Cute Puppies of Love" has features x = (0.9, 0), then Alice's predicted rating for it is θᵀx = 4.5 × 0.9 + 1 × 0 = 4.05. This prediction is reasonable: Alice's preferences indicate that she likes romantic movies, and the features of "Cute Puppies of Love" indicate that it is a romance, so the predicted score is high. A mathematical description is given below.
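As a quick check of this example in code (the numbers come directly from the text above):

    import numpy as np

    theta_alice = np.array([4.5, 1.0])  # Alice's learned preference vector
    x_movie = np.array([0.9, 0.0])      # features of "Cute Puppies of Love"
    print(theta_alice @ x_movie)        # prints 4.05, the predicted rating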

For each user j, the optimization objective is:

\min_{\theta^{(j)}} \; \frac{1}{2 m_j} \sum_{i : r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2 m_j} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2

Here θ^(j) is the preference vector of the j-th user, x^(i) is the feature vector of the i-th movie, y^(i,j) is the rating of the i-th movie by the j-th user, r(i,j) = 1 indicates that user j has rated movie i (the value is not missing), and m_j is the number of movies rated by user j. The second term is a regularization term that prevents overfitting (it has appeared in many earlier algorithms, so we do not repeat the explanation here). Since both terms contain m_j, the objective can equivalently be written as:

\min_{\theta^{(j)}} \; \frac{1}{2} \sum_{i : r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2
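A minimal sketch of this per-user cost in NumPy, assuming ratings is one user's row of the table with np.nan marking missing entries and X stacks the movie feature vectors row by row (user_cost is an illustrative name; the m_j factor is dropped, as in the simplified form above):

    import numpy as np

    def user_cost(theta, X, ratings, lam):
        # regularized squared error over the movies this user has rated
        rated = ~np.isnan(ratings)                  # the entries with r(i,j) = 1
        errors = X[rated] @ theta - ratings[rated]  # prediction minus true rating
        return 0.5 * np.sum(errors ** 2) + 0.5 * lam * np.sum(theta ** 2)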


Then θ^(j) is updated by gradient descent:

\theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i : r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda\, \theta_k^{(j)} \right)

The resulting θ^(j) is the preference vector of the j-th user.
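A minimal sketch of this update loop for a single user, under the same assumptions as the cost sketch above (the step size alpha and iteration count are illustrative choices, not taken from the original):

    def fit_user(theta, X, ratings, lam=1.0, alpha=0.01, iters=500):
        # gradient descent on one user's preference vector theta
        rated = ~np.isnan(ratings)
        for _ in range(iters):
            errors = X[rated] @ theta - ratings[rated]
            grad = X[rated].T @ errors + lam * theta  # gradient of the cost above
            theta = theta - alpha * grad
        return theta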


3. Known user preferences


Conversely, if the user preference vectors θ are known, we can learn the feature vector x of each movie. For each movie i, the optimization objective is:

\min_{x^{(i)}} \; \frac{1}{2} \sum_{j : r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2

Next, x^(i) is updated by gradient descent:

x_k^{(i)} := x_k^{(i)} - \alpha \left( \sum_{j : r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)} + \lambda\, x_k^{(i)} \right)

The resulting x^(i) is the feature vector of the i-th movie.


4. Combining the two


If neither the movie features nor the user preferences are known, the two objective functions above can be combined and minimized jointly:

J\big(x^{(1)}, \ldots, x^{(n_m)}, \theta^{(1)}, \ldots, \theta^{(n_u)}\big) = \frac{1}{2} \sum_{(i,j) : r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2

This yields the following collaborative filtering algorithm:

1. Initialize x^(1), ..., x^(n_m) and θ^(1), ..., θ^(n_u) to small random values.
2. Minimize J with gradient descent, updating every x_k^(i) and θ_k^(j) simultaneously.
3. For a user with preferences θ and a movie with learned features x, predict the rating as θᵀx.
In the end we obtain both the user preferences and the movie features. With the trained parameters we can predict the missing values and then recommend the highest-scoring unrated movies to each user, as sketched below.
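A sketch of this recommendation step, assuming user_matrix and product_matrix are the trained matrices produced by the code in Section 5 and ori_data is the original table with np.nan for missing values (recommend and top_k are illustrative names, not from the original):

    import numpy as np

    def recommend(ori_data, user_matrix, product_matrix, top_k=2):
        # for each user, return the column indices of the top-k unrated movies
        scores = user_matrix @ product_matrix.T        # predicted ratings
        scores[~np.isnan(ori_data)] = -np.inf          # exclude already-rated movies
        return np.argsort(-scores, axis=1)[:, :top_k]  # best predictions first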


5. Code implementation


import numpy as np


# Initialize the parameters.
# Input: number of features, number of users, number of products.
# Output: initial user feature matrix, initial product feature matrix.
def initialize_parameters(num_features, num_users, num_products):
    user_matrix = np.random.rand(num_users, num_features)
    product_matrix = np.random.rand(num_products, num_features)
    return user_matrix, product_matrix


# Compute the current cost.
# Input: table with missing values, current user matrix, current product matrix,
# penalty factor lambdaa.
# Output: current cost.
def get_cost(ori_data, user_matrix, product_matrix, lambdaa):
    nan_index = np.isnan(ori_data)                        # record the missing entries
    ori_data[nan_index] = 0                               # fill missing values with 0
    predict_data = np.dot(user_matrix, product_matrix.T)  # predicted ratings
    temp = predict_data - ori_data                        # difference of the two matrices
    temp[nan_index] = 0                                   # missing values do not count toward the cost
    cost = 0.5 * np.sum(temp ** 2) \
        + 0.5 * lambdaa * (np.sum(user_matrix ** 2) + np.sum(product_matrix ** 2))
    ori_data[nan_index] = np.nan                          # restore the original data
    return cost


# Partial derivatives with respect to the user features.
# Input: table with missing values, current user matrix, current product matrix,
# penalty factor lambdaa.
# Output: user feature derivative matrix.
def get_user_derivatives(ori_data, user_matrix, product_matrix, lambdaa=1):
    nan_index = np.isnan(ori_data)                        # record the missing entries
    ori_data[nan_index] = 0                               # fill missing values with 0
    predict_data = np.dot(user_matrix, product_matrix.T)  # predicted ratings
    temp = predict_data - ori_data
    temp[nan_index] = 0                                   # missing values contribute no gradient
    ori_data[nan_index] = np.nan                          # restore the original data
    num_user, feature_user = user_matrix.shape
    user_derivatives = np.zeros((num_user, feature_user))
    for i in range(num_user):
        for j in range(feature_user):
            user_derivatives[i][j] = np.dot(temp[i], product_matrix[:, j]) \
                + lambdaa * user_matrix[i][j]
    return user_derivatives


# Partial derivatives with respect to the product features.
# Input: table with missing values, current user matrix, current product matrix,
# penalty factor lambdaa.
# Output: product feature derivative matrix.
def get_product_derivatives(ori_data, user_matrix, product_matrix, lambdaa=1):
    nan_index = np.isnan(ori_data)
    ori_data[nan_index] = 0
    predict_data = np.dot(user_matrix, product_matrix.T)
    temp = predict_data - ori_data
    temp[nan_index] = 0
    ori_data[nan_index] = np.nan
    num_product, feature_product = product_matrix.shape
    product_derivatives = np.zeros((num_product, feature_product))
    for i in range(num_product):
        for j in range(feature_product):
            product_derivatives[i][j] = np.dot(temp[:, i], user_matrix[:, j]) \
                + lambdaa * product_matrix[i][j]
    return product_derivatives


# Learn the parameters from a table that contains missing values.
# Input: table with missing values, initial user matrix, initial product matrix,
# number of iterations, learning rate, penalty factor lambdaa.
# Output: trained user feature matrix, trained product feature matrix.
def cf(ori_data, user_matrix, product_matrix, iterate_num=500, learning_rate=0.01, lambdaa=1):
    for i in range(iterate_num):
        cost = get_cost(ori_data, user_matrix, product_matrix, lambdaa)
        user_derivatives = get_user_derivatives(ori_data, user_matrix, product_matrix, lambdaa)
        product_derivatives = get_product_derivatives(ori_data, user_matrix, product_matrix, lambdaa)
        user_matrix = user_matrix - learning_rate * user_derivatives
        product_matrix = product_matrix - learning_rate * product_derivatives
        print(i, 'th cost:', cost)
    return user_matrix, product_matrix


# Predict the full table from the learned parameters.
def evaluate_score(user_matrix, product_matrix):
    return np.dot(user_matrix, product_matrix.T)


if __name__ == '__main__':
    ori_data = np.array([[5, 5, np.nan, 0, 0],
                         [5, np.nan, 4, 0, 0],
                         [0, np.nan, 0, 5, 5],
                         [0, 0, np.nan, 4, np.nan]])
    # user_matrix = np.array([[5, 0.1], [5, 0.1], [0.1, 5], [0.1, 5]])
    # product_matrix = np.array([[0.9, 0.1], [1.0, 0.01], [0.99, 0.01], [0.1, 1.0], [0.1, 0.9]])
    user_matrix, product_matrix = initialize_parameters(2, ori_data.shape[0], ori_data.shape[1])
    user_matrix, product_matrix = cf(ori_data, user_matrix, product_matrix,
                                     iterate_num=100, learning_rate=0.1, lambdaa=0)
    score = evaluate_score(user_matrix, product_matrix)
    print(score)


6. Experimental results


Using the example in the problem definition as a test, after 100 iterations, the final predicted data is as follows:

[[ 4.99997719e+00  4.99998362e+00  3.99998939e+00  2.70274725e-03  2.70273945e-03]
 [ 4.99997832e+00  4.99998474e+00  3.99999029e+00  2.70274079e-03  2.70273299e-03]
 [-1.55079571e-07 -9.98770235e-08 -7.57981453e-08  5.00001356e+00  5.00000909e+00]
 [-1.18565589e-07 -7.44035873e-08 -5.62400915e-08  4.00000695e+00  4.00000337e+00]]

The predicted values are very close to the original data, so the algorithm works well. One problem surfaced during testing: with manually chosen initial parameters, the cost sometimes failed to decrease. I suspect that, due to the nature of the algorithm, the cost function suddenly grew much larger during the first few iterations, so the iteration could not continue (comments from readers with a better understanding are welcome). In the experiments, random initialization of the parameters works directly.


Improved code (using RMSProp in place of plain gradient descent):
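For reference, RMSProp keeps an exponentially weighted average s of the squared gradients g² and scales each gradient-descent step by its square root (β corresponds to weight_average_para in the code below; the small constant ε guards against division by zero):

s := \beta\, s + (1 - \beta)\, g^2

\theta := \theta - \alpha\, \frac{g}{\sqrt{s} + \varepsilon}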

import numpy as np
import pandas as pd


# Initialize the parameters (same as above).
def initialize_parameters(num_features, num_users, num_products):
    user_matrix = np.random.rand(num_users, num_features)
    product_matrix = np.random.rand(num_products, num_features)
    return user_matrix, product_matrix


# Compute the current cost (same as above).
def get_cost(ori_data, user_matrix, product_matrix, lambdaa):
    nan_index = np.isnan(ori_data)                        # record the missing entries
    ori_data[nan_index] = 0                               # fill missing values with 0
    predict_data = np.dot(user_matrix, product_matrix.T)  # predicted ratings
    temp = predict_data - ori_data
    temp[nan_index] = 0                                   # missing values do not count toward the cost
    cost = 0.5 * np.sum(temp ** 2) \
        + 0.5 * lambdaa * (np.sum(user_matrix ** 2) + np.sum(product_matrix ** 2))
    ori_data[nan_index] = np.nan                          # restore the original data
    return cost


# Partial derivatives with respect to the user features, scaled as in RMSProp.
# Input: table with missing values, user matrix, product matrix, running weighted
# average of squared gradients, penalty factor lambdaa, weight_average_para.
# Output: scaled user derivative matrix, updated weighted-average matrix.
def get_user_derivatives(ori_data, user_matrix, product_matrix,
                         weight_average_matrix, lambdaa=1, weight_average_para=0):
    nan_index = np.isnan(ori_data)
    ori_data[nan_index] = 0
    predict_data = np.dot(user_matrix, product_matrix.T)
    temp = predict_data - ori_data
    temp[nan_index] = 0
    ori_data[nan_index] = np.nan
    num_user, feature_user = user_matrix.shape
    user_derivatives = np.zeros((num_user, feature_user))
    for i in range(num_user):
        for j in range(feature_user):
            user_derivatives[i][j] = np.dot(temp[i], product_matrix[:, j]) \
                + lambdaa * user_matrix[i][j]
    # exponentially weighted average of the squared gradients
    weight_average_matrix = weight_average_para * weight_average_matrix \
        + (1 - weight_average_para) * (user_derivatives ** 2)
    # scale the gradient; the 1e-8 is added here to avoid division by zero
    user_derivatives = user_derivatives / (weight_average_matrix ** 0.5 + 1e-8)
    return user_derivatives, weight_average_matrix


# Partial derivatives with respect to the product features, scaled as in RMSProp.
def get_product_derivatives(ori_data, user_matrix, product_matrix,
                            weight_average_matrix, lambdaa=1, weight_average_para=0):
    nan_index = np.isnan(ori_data)
    ori_data[nan_index] = 0
    predict_data = np.dot(user_matrix, product_matrix.T)
    temp = predict_data - ori_data
    temp[nan_index] = 0
    ori_data[nan_index] = np.nan
    num_product, feature_product = product_matrix.shape
    product_derivatives = np.zeros((num_product, feature_product))
    for i in range(num_product):
        for j in range(feature_product):
            product_derivatives[i][j] = np.dot(temp[:, i], user_matrix[:, j]) \
                + lambdaa * product_matrix[i][j]
    weight_average_matrix = weight_average_para * weight_average_matrix \
        + (1 - weight_average_para) * (product_derivatives ** 2)
    product_derivatives = product_derivatives / (weight_average_matrix ** 0.5 + 1e-8)
    return product_derivatives, weight_average_matrix


# Learn the parameters from a table that contains missing values.
def cf(ori_data, user_matrix, product_matrix, iterate_num=500,
       learning_rate=0.01, lambdaa=1, weight_average_para=0.5):
    user_weight_average_matrix = np.zeros(user_matrix.shape)        # running average, user side
    product_weight_average_matrix = np.zeros(product_matrix.shape)  # running average, product side
    for i in range(iterate_num):
        cost = get_cost(ori_data, user_matrix, product_matrix, lambdaa)
        user_derivatives, user_weight_average_matrix = get_user_derivatives(
            ori_data, user_matrix, product_matrix,
            user_weight_average_matrix, lambdaa, weight_average_para)
        product_derivatives, product_weight_average_matrix = get_product_derivatives(
            ori_data, user_matrix, product_matrix,
            product_weight_average_matrix, lambdaa, weight_average_para)
        user_matrix = user_matrix - learning_rate * user_derivatives
        product_matrix = product_matrix - learning_rate * product_derivatives
        print(i, 'th cost:', cost)
    return user_matrix, product_matrix


# Predict the full table from the learned parameters.
def evaluate_score(user_matrix, product_matrix):
    return np.dot(user_matrix, product_matrix.T)


if __name__ == '__main__':
    ori_data = pd.read_csv('cf_data.csv')
    columns = ori_data.columns
    ori_data = np.array(ori_data)
    # ori_data = np.array([[5, 5, np.nan, 0, 0], [5, np.nan, 4, 0, 0],
    #                      [0, np.nan, 0, 5, 5], [0, 0, np.nan, 4, np.nan]])
    user_matrix, product_matrix = initialize_parameters(20, ori_data.shape[0], ori_data.shape[1])
    user_matrix, product_matrix = cf(ori_data, user_matrix, product_matrix,
                                     iterate_num=100, learning_rate=0.01, lambdaa=0)
    score = evaluate_score(user_matrix, product_matrix)
    predict_cf_data = pd.DataFrame(score, columns=columns)
    predict_cf_data.to_csv('predict_cf_data.csv', index=False)

