Build your own recommender system with Python
Today, many websites use recommender systems to personalize your experience, telling you what to buy, what to eat, and even who you should befriend. Although tastes differ from person to person, they generally follow patterns: people tend to like things that are similar to other things they like, and tend to have tastes similar to those of the people around them. Recommender systems try to capture these patterns to help predict what else you might like.
To help users select products efficiently, e-commerce, social media, video, and online news platforms have been actively deploying their own recommender systems, which is a winning strategy.
The two most common types of recommender systems are content-based filtering and collaborative filtering (CF). Collaborative filtering produces recommendations from users' ratings of items; in other words, it uses the wisdom of the crowd to recommend items. In contrast, content-based recommender systems focus on the attributes of the items and recommend items based on the similarity between them.
In general, collaborative filtering is the workhorse of recommender engines. The algorithm has a very interesting property: it can do feature learning on its own, which means that it can start to learn for itself which features to use. CF can be divided into memory-based collaborative filtering and model-based collaborative filtering. In this tutorial, you will implement model-based CF using singular value decomposition (SVD) and memory-based CF by computing cosine similarity.
We will use the MovieLens dataset, one of the most common datasets used to implement and test recommender engines. It contains 100,000 movie ratings from 943 users on a selection of 1682 movies. Extract this dataset (MovieLens 100K, the ml-100k folder) into your notebook directory.
import numpy as np
import pandas as pd
The full dataset is stored in the u.data file. A brief description of the dataset is in the README file that ships with it.
header = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('ml-100k/u.data', sep='\t', names=header)
Take a look at the first two rows in the data set. Next, let's count the number of users and movies in it.
n_users = df.user_id.unique().shape[0]
n_items = df.item_id.unique().shape[0]
print('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items))
Number of users = 943 | Number of movies = 1682
You can use the scikit-learn library to split the dataset into training and test sets. The train_test_split function (in sklearn.model_selection; older scikit-learn releases exposed it as cross_validation.train_test_split) shuffles the data and splits it into two parts according to the fraction reserved for testing, which here is 0.25.
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.25)
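To see what the split does without downloading anything, here is a minimal sketch on a made-up ratings table. The names toy_df, toy_train and toy_test, the random ratings, and the fixed random_state are assumptions for illustration only and are not part of the tutorial's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the ratings table: 20 made-up ratings (not MovieLens data).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "user_id": rng.integers(1, 6, size=20),
    "item_id": rng.integers(1, 9, size=20),
    "rating": rng.integers(1, 6, size=20),
})

# Shuffle the rows and hold out 25% of them for testing.
# random_state is fixed only to make this sketch reproducible.
toy_train, toy_test = train_test_split(toy_df, test_size=0.25, random_state=42)

print(len(toy_train), len(toy_test))
```

Every rating ends up in exactly one of the two parts, so the test set can later be used to check predictions that were made from the training set alone.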
Memory-Based Collaborative Filtering
Memory-based collaborative filtering approaches can be divided into two main groups: user-item filtering and item-item filtering. User-item filtering takes a particular user, finds users who are similar to that user based on the similarity of their ratings, and recommends items that those similar users liked. In contrast, item-item filtering takes an item, finds users who liked that item, and finds other items that those users or similar users also liked. It takes items as input and outputs other items as recommendations.
- Item-item collaborative filtering: "Users who liked this item also liked ..."
- User-item collaborative filtering: "Users who are similar to you also liked ..."
In both cases, you create a user-item matrix built from the entire dataset. Because you have split the data into training and test sets, you need to create two [943 x 1682] matrices. The training matrix contains 75% of the ratings, and the test matrix contains the remaining 25%.
User-item matrix example:
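As a small illustration of what such a matrix looks like, the sketch below builds one from a handful of invented ratings. The triples, the sizes and the name toy_matrix are hypothetical; rows stand for users, columns for items, and 0 marks an item the user has not rated.

```python
import numpy as np

# Made-up ratings as (user_id, item_id, rating) triples; IDs start at 1.
toy_ratings = [
    (1, 1, 5), (1, 3, 3),
    (2, 1, 4), (2, 2, 2),
    (3, 2, 1), (3, 3, 5), (3, 4, 4),
]
n_toy_users, n_toy_items = 3, 4

# Rows are users, columns are items, 0 means "not rated".
toy_matrix = np.zeros((n_toy_users, n_toy_items))
for user_id, item_id, rating in toy_ratings:
    toy_matrix[user_id - 1, item_id - 1] = rating

print(toy_matrix)
```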
After you have built the user-item matrix, you calculate the similarities and create a similarity matrix.
In item-item collaborative filtering, the similarity between two items is measured by observing all the users who have rated both items.
In user-item collaborative filtering, the similarity between two users is measured by observing all the items that both users have rated.
Cosine similarity is commonly used as a distance measure in recommender systems: the ratings are treated as vectors in n-dimensional space, and the similarity is calculated from the angle between these vectors.
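The cosine similarity of users k and a can be calculated with the following formula: take the dot product of the user vectors u_k and u_a and divide it by the product of the Euclidean lengths of the two vectors. Here x_{k,m} is the rating that user k gave to item m.
$$ s_u^{cos}(u_k, u_a) = \frac{u_k \cdot u_a}{\lVert u_k \rVert \, \lVert u_a \rVert} = \frac{\sum_m x_{k,m} x_{a,m}}{\sqrt{\sum_m x_{k,m}^2 \sum_m x_{a,m}^2}} $$
The similarity between items m and b can be calculated in the same way, using the item vectors i_m and i_b:
$$ s_i^{cos}(i_m, i_b) = \frac{i_m \cdot i_b}{\lVert i_m \rVert \, \lVert i_b \rVert} = \frac{\sum_a x_{a,m} x_{a,b}}{\sqrt{\sum_a x_{a,m}^2 \sum_a x_{a,b}^2}} $$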
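The dot-product-over-lengths recipe can be sketched directly in NumPy. The helper name cosine_similarity_rows and the toy matrix are assumptions made for this example; the tutorial itself relies on scikit-learn for this step.

```python
import numpy as np

def cosine_similarity_rows(matrix):
    """Cosine similarity between every pair of rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit_rows = matrix / norms
    return unit_rows @ unit_rows.T

# Made-up user-item matrix: 3 users (rows) by 4 items (columns).
toy = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
])

user_sim = cosine_similarity_rows(toy)    # user-user, shape (3, 3)
item_sim = cosine_similarity_rows(toy.T)  # item-item, shape (4, 4)
print(np.round(user_sim, 3))
```

Passing the matrix as it is compares users (rows), while passing its transpose compares items (columns). Because all ratings are non-negative, every similarity lies between 0 and 1.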
The first step is to build the user-item matrices. Since you have both training and test data, you need to create two of them.
# Create two user-item matrices, one for training and another for testing
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line[1] - 1, line[2] - 1] = line[3]

test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1] - 1, line[2] - 1] = line[3]
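You can use scikit-learn's pairwise_distances function to calculate the cosine similarity. Note that with metric='cosine' the function returns the cosine distance, which is 1 minus the cosine similarity. Because the ratings are all non-negative, the output ranges from 0 to 1.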
from sklearn.metrics.pairwise import pairwise_distances
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')
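A quick check on a toy matrix shows how the values returned by pairwise_distances relate to cosine similarity: with metric='cosine' the function gives the cosine distance, so subtracting it from 1 recovers the similarity. The toy matrix and variable names below are made up for this sketch.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Made-up user-item matrix: 3 users (rows) by 4 items (columns).
toy = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
])

cosine_distance = pairwise_distances(toy, metric="cosine")
cosine_similarity = 1.0 - cosine_distance

# Comparing a user with themself gives distance 0 and similarity 1.
print(np.round(np.diag(cosine_distance), 6))
print(np.round(np.diag(cosine_similarity), 6))
```

Whether a matrix holds distances or similarities matters when it is used as a set of weights, because the two orderings are reversed: the most similar pair has the smallest distance.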
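The next step is to make predictions. You have already built the similarity matrices user_similarity and item_similarity, so you can make a prediction for user-based CF with the following formula, where s_u^{cos}(u_k, u_a) is the similarity between users k and a, x_{a,m} is the rating user a gave to item m, and \bar{x}_k is the average rating of user k:
$$ \hat{x}_{k,m} = \bar{x}_k + \frac{\sum_{u_a} s_u^{cos}(u_k, u_a) \, (x_{a,m} - \bar{x}_a)}{\sum_{u_a} \lvert s_u^{cos}(u_k, u_a) \rvert} $$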
You can think of the similarity between users k and a as a weight that is multiplied by the ratings of the similar user a (corrected for that user's average rating). You need to normalize by the sum of the similarities so that the ratings stay between 1 and 5, and, as a final step, add the average rating of the user you are predicting for.
The idea here is that some users tend to give consistently high or consistently low ratings to all movies. For these users, the relative differences between their ratings matter more than the absolute values. For example, suppose user k gives 4 stars to his favorite movies and 3 stars to all other good movies. Now suppose another user t gives 5 stars to the movies he or she likes and 3 stars to the movies he or she fell asleep over. These two users may have very similar tastes but treat the rating scale differently.
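The effect of correcting for each user's average can be sketched with two hypothetical users in the spirit of the example above. The four ratings per user are invented for illustration.

```python
import numpy as np

# Hypothetical ratings by two users for the same four movies
# (liked, so-so, liked, so-so). User k never goes above 4 stars,
# while user t uses 5 for liked films and 3 for the rest.
user_k = np.array([4.0, 3.0, 4.0, 3.0])
user_t = np.array([5.0, 3.0, 5.0, 3.0])

# Subtract each user's own average rating.
k_centered = user_k - user_k.mean()
t_centered = user_t - user_t.mean()

# Cosine similarity of the mean-centered vectors.
centered_sim = k_centered @ t_centered / (
    np.linalg.norm(k_centered) * np.linalg.norm(t_centered)
)
print(k_centered, t_centered, centered_sim)
```

After centering, both users show the same pattern of above-average and below-average ratings, even though their raw star values differ.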
When making a prediction for item-based CF, you do not need to correct for the users' average ratings, because the query user's own ratings are used to make the predictions.
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        # You use np.newaxis so that mean_user_rating has same format as ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred
item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')
Evaluation
There are many evaluation metrics, but one of the most popular for evaluating prediction accuracy is Root Mean Squared Error (RMSE).
You can use the mean_squared_error (MSE) function from sklearn; the RMSE is just the square root of the MSE. To read more about the different evaluation metrics you can check this article.
Since you only want to consider predicted ratings for entries that appear in the test dataset, you filter out all other elements of the prediction matrix with prediction[ground_truth.nonzero()].
from sklearn.metrics import mean_squared_error
from math import sqrt

def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))
User-based CF RMSE: 3.1236202241
Item-based CF RMSE: 3.44983070639
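To make the nonzero() filtering step concrete, here is a small NumPy-only sketch with invented numbers. The arrays toy_prediction and toy_ground_truth are hypothetical, and a 0 in the ground truth means that no test rating exists for that cell.

```python
import numpy as np

# Hypothetical predictions and held-out ratings; 0 means "no test rating".
toy_prediction = np.array([[3.5, 2.0, 4.0],
                           [1.0, 4.5, 3.0]])
toy_ground_truth = np.array([[4.0, 0.0, 5.0],
                             [0.0, 4.0, 0.0]])

mask = toy_ground_truth.nonzero()  # indices of the rated cells only
errors = toy_prediction[mask] - toy_ground_truth[mask]
toy_rmse = np.sqrt(np.mean(errors ** 2))
print(toy_rmse)
```

Only the three rated cells contribute to the error; predictions for cells without a test rating are ignored rather than being compared against 0.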
Memory-based algorithms are easy to implement and produce reasonable prediction quality. The drawback of memory-based CF is that it does not scale to real-world scenarios and does not address the well-known cold-start problem, which arises when a new user or a new item enters the system. Model-based CF methods are scalable and can deal with a higher level of sparsity than memory-based methods, but they also suffer when new users or items without any ratings enter the system. I would like to thank Ethan Rosenthal for his blog post about memory-based collaborative filtering.
Model-Based Collaborative Filtering
Model-based collaborative filtering is based on matrix factorization (MF), which has received a lot of attention, mainly as an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used in recommender systems because it deals with scalability and sparsity better than memory-based CF. The goal of MF is to learn the latent preferences of users and the latent attributes of items from known ratings (that is, to learn features that describe the characteristics of the ratings), and then to predict the unknown ratings through the dot product of the latent features of users and items.
When you have a very sparse matrix with many dimensions, matrix factorization lets you restructure the user-item matrix into a low-rank structure: you represent the matrix as the product of two low-rank matrices whose rows contain the latent vectors.
You fit these matrices so that their product approximates the original matrix as closely as possible, and that product fills in the entries that are missing in the original matrix.
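The shape of a low-rank factorization can be sketched as follows. The factor matrices P and Q are filled with random numbers purely to show the structure (in a real model they would be learned from the known ratings), and the sizes and names are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_toy_users, n_toy_items, k = 6, 5, 2  # k latent features (hypothetical sizes)

# One latent vector per user (rows of P) and per item (rows of Q).
# Random values here; a fitted model would learn them from ratings.
P = rng.normal(size=(n_toy_users, k))
Q = rng.normal(size=(n_toy_items, k))

# The full score matrix is the product of the two low-rank factors.
X_hat = P @ Q.T

# A single predicted score is the dot product of the two latent vectors.
score_user2_item4 = P[2] @ Q[4]
print(X_hat.shape, np.linalg.matrix_rank(X_hat))
```

The product has one entry for every user-item pair, including pairs with no observed rating, yet it is described by only (6 + 5) x 2 numbers instead of 6 x 5.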
Let's calculate the sparsity level of the MovieLens dataset:
sparsity = round(1.0 - len(df) / float(n_users * n_items), 3)
print('The sparsity level of MovieLens100K is ' + str(sparsity * 100) + '%')
The sparsity level of MovieLens100K is 93.7%
To give an example of the learned latent preferences of users and items: suppose that for the MovieLens dataset you have the following information: (user ID, age, location, gender, movie ID, director, actor, language, year, rating). By applying matrix factorization, the model learns that the important user features are age group (under 10, 10-18, 18-30, 30-90), location, and gender, and that the most important movie features are decade, director, and actor. If you look back at the information you have stored, there is no feature such as decade, but the model can learn it on its own. The important point is that the CF model only needs the data (user ID, movie ID, rating) to learn these latent features. If little data is available, the model-based CF model will predict poorly, because it is more difficult to learn the latent features.
Models that use both ratings and content features are called hybrid recommender systems, which combine collaborative filtering and content-based models. Hybrid recommender systems typically show higher accuracy than collaborative filtering or content-based models on their own, and they handle the cold-start problem better: when a user or item has no ratings, the metadata of that user or item can still be used to make a prediction. Hybrid recommender systems will be covered in the next article.
SVD
A well-known matrix factorization method is singular value decomposition (SVD). Collaborative filtering can be formulated as approximating a matrix X by using singular value decomposition.
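A minimal sketch of this idea, assuming a small dense matrix with zeros for missing ratings and using NumPy's np.linalg.svd. The helper svd_predict, the toy matrix and the choice k=2 are hypothetical and only illustrate a rank-k approximation.

```python
import numpy as np

def svd_predict(ratings, k):
    """Rank-k approximation of `ratings` via truncated SVD."""
    u, s, vt = np.linalg.svd(ratings, full_matrices=False)
    # Keep only the k largest singular values and their vectors.
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# Made-up user-item matrix: 3 users (rows) by 4 items (columns).
toy = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
])

X_pred = svd_predict(toy, k=2)
print(np.round(X_pred, 2))
```

Keeping fewer singular values gives a coarser approximation, and the approximated values in the previously empty cells serve as the predicted ratings. For a matrix the size of MovieLens, a sparse solver would likely be preferable to a full dense decomposition.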
This article is translated from: Http://online.cambridgecoding.com/notebooks/eWReNYcAfB/implementing-your-own-recommender-systems-in-python-2
Hara Agnes Jóhannsdóttir