In the article on applying matrix factorization to collaborative filtering recommendation algorithms, we summarized the principles of matrix factorization in recommendation. Here we look at the matrix factorization recommendation algorithm in Spark from a practical point of view.
1. Overview of the Spark recommendation algorithm
In Spark MLlib, the only recommendation algorithm implemented is collaborative filtering based on matrix factorization, and the algorithm used is FunkSVD, which factorizes the rating matrix $M$ of $m$ users and $n$ items into two low-dimensional matrices: $$M_{m \times n}=P_{m \times k}^TQ_{k \times n}$$
where $k$ is the dimension of the low-rank factorization, generally much smaller than $m$ and $n$. If you are not familiar with the FunkSVD algorithm, you can first review the article on its principles.
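For reference, the objective that the ALS solver optimizes can be written in the usual regularized FunkSVD form (the exact weighting Spark uses internally may differ in detail). With $K$ the set of observed ratings, $p_i$ the $i$-th column of $P$, $q_j$ the $j$-th column of $Q$, and $\lambda$ the regularization coefficient:
$$\min_{P,Q}\sum_{(i,j) \in K}\left(m_{ij}-p_i^Tq_j\right)^2+\lambda\left(\sum_i\|p_i\|_2^2+\sum_j\|q_j\|_2^2\right)$$
This $\lambda$ is the lambda_ parameter described in section 3 below.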
2. Introduction to the Spark recommendation algorithm class library
In Spark MLlib, the FunkSVD implementation provides Python, Java, Scala, and R interfaces. Since our previous practice has all been based on Python, this article also introduces and uses the MLlib Python interface.
The Python interface of the Spark MLlib recommendation algorithm lives in the pyspark.mllib.recommendation package, which contains three classes: Rating, MatrixFactorizationModel, and ALS. Although there are three classes, the only algorithm behind them is FunkSVD. The purpose of these three classes is described below.
The Rating class is simple: it just encapsulates the three values user, item, and rating. In other words, a Rating object holds only a (user, item, rating) triple and has no function interface.
ALS is responsible for training our FunkSVD model. It is called ALS because Spark uses alternating least squares (ALS) to optimize the objective function of the FunkSVD matrix factorization. The ALS class has two functions. The first, train, trains a model directly from our rating matrix. The second, trainImplicit, is slightly more complicated: it trains the model from implicit feedback data and, compared with train, takes an extra parameter that specifies the implicit feedback confidence threshold. For example, we can convert the rating matrix into implicit feedback data and turn the rating values into confidence weights according to some feedback rule. Because such rules usually depend on the specific problem and data, only ordinary rating matrix factorization is discussed in the rest of this article.
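Even though implicit feedback is not used later, here is a minimal, hypothetical sketch of what a trainImplicit call can look like. It assumes sc is an existing SparkContext, the weights in the toy data stand for implicit signals such as click counts, and the alpha value is purely illustrative:
from pyspark.mllib.recommendation import ALS, Rating
# Toy implicit feedback: Rating(user, item, weight), where the weight is a confidence derived from implicit signals such as clicks or views
implicit_data = sc.parallelize([Rating(1, 10, 5.0), Rating(1, 20, 1.0), Rating(2, 10, 3.0), Rating(2, 30, 2.0)])
# trainImplicit takes the extra confidence parameter alpha
implicit_model = ALS.trainImplicit(ratings=implicit_data, rank=5, iterations=5, lambda_=0.01, alpha=40.0)
print implicit_model.predict(1, 30)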
The MatrixFactorizationModel class is the model we obtain by training with ALS, and we use it to make predictions. Common predictions include: the rating a given user would give a given item, the N items a user is most likely to like, the N users most likely to like an item, the top N items for every user, and the top N users for every item.
The use of these classes is illustrated with a concrete example later.
3. Key class parameters for the Spark recommendation algorithm
Here we summarize the important parameters for training an ALS model.
1) ratings: the RDD corresponding to the rating matrix. This input is required. For implicit feedback it is the implicit feedback matrix corresponding to the rating matrix.
2) rank: the dimension of the low-rank factorization of the matrix, i.e. the dimension $k$ in $P_{m \times k}^TQ_{k \times n}$. This value affects the quality of the factorization; the larger it is, the longer the algorithm runs and the more memory it may consume. It usually needs to be tuned, and values between 10 and 200 are common.
3) iterations: the maximum number of iterations when solving the matrix factorization with alternating least squares. This value depends on the dimensions of the rating matrix and how sparse it is. In general it does not need to be large, say 5 to 20 iterations. The default value is 5.
4) lambda_: in the Python interface this parameter is written lambda_ because lambda is a reserved word in Python. It is the regularization coefficient of the FunkSVD factorization and is mainly used to control the degree of fitting and improve the generalization ability of the model. The larger the value, the stronger the regularization penalty. Large recommender systems generally need to tune this parameter to find a suitable value.
5) alpha: this parameter is only useful with implicit feedback, i.e. trainImplicit. It specifies the implicit feedback confidence threshold: the larger the value, the more strongly we assume there is no association between a user and an item the user has not rated. It generally needs to be tuned to find a suitable value.
As the description above shows, using the ALS algorithm is quite simple. What matters is tuning the factorization dimension rank and the regularization hyperparameter lambda_. For implicit feedback, the confidence threshold alpha also needs to be tuned. A tiny, self-contained sketch that maps these parameters onto a training call follows below.
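To make the parameter names concrete, here is a small toy sketch (it assumes sc is an existing SparkContext, and the parameter values are only illustrative, not tuned):
from pyspark.mllib.recommendation import ALS, Rating
# A toy explicit-feedback rating RDD of (user, item, rating) triples
toy_ratings = sc.parallelize([Rating(1, 10, 4.0), Rating(1, 20, 1.0), Rating(2, 10, 5.0), Rating(2, 30, 3.0)])
# ratings, rank, iterations and lambda_ map directly onto the parameters described above
toy_model = ALS.train(ratings=toy_ratings, rank=5, iterations=5, lambda_=0.01)
print toy_model.predict(1, 30)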
4. Examples of Spark recommendation algorithms
Let's walk through a concrete example of using the Spark matrix factorization recommendation algorithm.
Here we use the MovieLens 100K dataset; the download link is here.
After extracting the data, we only use the rating data in the u.data file. Each row of this dataset has 4 columns, corresponding to user ID, item ID, rating, and timestamp. Because my machine is rather low-powered, the example below uses only the first 100 rows of data, so if you use the full dataset your predictions will differ from mine.
First make sure you have Hadoop and Spark (version no lower than 1.6) installed and the environment variables set. Generally we work in an IPython notebook (Jupyter notebook), so it is best to set up a notebook-based Spark environment; it also works without one, but then you need to set the environment variables every time before running.
If you do not have a notebook-based Spark environment, you need to run the following code first. Of course, if you have already set one up, the code below does not need to be run.
import os
import sys
# The following directories are the Spark installation directory and the Java installation directory on your own machine
os.environ['SPARK_HOME'] = "C:/tools/spark-1.6.1-bin-hadoop2.6/"
sys.path.append("C:/tools/spark-1.6.1-bin-hadoop2.6/bin")
sys.path.append("C:/tools/spark-1.6.1-bin-hadoop2.6/python")
sys.path.append("C:/tools/spark-1.6.1-bin-hadoop2.6/python/pyspark")
sys.path.append("C:/tools/spark-1.6.1-bin-hadoop2.6/python/lib")
sys.path.append("C:/tools/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip")
sys.path.append("C:/tools/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip")
sys.path.append("C:/Program Files (x86)/java/jdk1.8.0_102")
from pyspark import SparkContext
from pyspark import SparkConf
sc = SparkContext("local", "testing")
Before running the algorithm, it is recommended to print the Spark context as shown below. If the memory address can be printed normally, the Spark runtime environment is ready.
print sc
For example, my output is:
<pyspark.context.SparkContext object at 0x07352950>
First we read the u.data file into memory and try to print the first line of data to verify that the read succeeded. Note that when copying the code, the data directory should be changed to your own u.data directory. The code is as follows:
# The directory below should be the u.data directory after decompression
user_data = sc.textFile("c:/temp/ml-100k/u.data")
user_data.first()
The output is as follows:
u'196\t242\t3\t881250949'
You can see that the fields are separated by \t. We need to split each line into an array and keep only the first three columns, dropping the timestamp column. The code is as follows:
rates = user_data.map(lambda x: x.split("\t")[0:3])
print rates.first()
The output is as follows:
[u'196', u'242', u'3']
Now we have an RDD holding the rating arrays, but the data are still strings, while Spark needs an RDD of Rating objects. So next we convert the RDD's data type; the code is as follows:
from pyspark.mllib.recommendation import Rating
rates_data = rates.map(lambda x: Rating(int(x[0]), int(x[1]), int(x[2])))
print rates_data.first()
The output is as follows:
Rating(user=196, product=242, rating=3.0)
Now that our data is an RDD of Rating objects, we can finally train on the collated data. The code is as follows: we set the matrix factorization dimension (rank) to 20, the maximum number of iterations to 5, and the regularization coefficient to 0.02. In a real application we would choose a suitable factorization dimension and regularization coefficient by cross-validation; here they are simplified for the sake of the example (a minimal sketch of such a parameter search is shown right after the training code).
from pyspark.mllib.recommendation import ALS
from pyspark.mllib.recommendation import MatrixFactorizationModel
sc.setCheckpointDir('checkpoint/')
ALS.checkpointInterval = 2
model = ALS.train(ratings=rates_data, rank=20, iterations=5, lambda_=0.02)
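As an aside, here is the minimal hold-out search mentioned above for choosing rank and lambda_, using only the RDD API already shown in this article. The candidate values and variable names are illustrative, not a recommended configuration:
# Split the rating RDD into a training part and a validation part
train_rdd, valid_rdd = rates_data.randomSplit([0.8, 0.2], seed=42)
valid_pairs = valid_rdd.map(lambda r: (r.user, r.product))
truth = valid_rdd.map(lambda r: ((r.user, r.product), r.rating))
best_params, best_mse = None, float('inf')
for rank in [10, 20, 50]:
    for reg in [0.01, 0.02, 0.1]:
        m = ALS.train(ratings=train_rdd, rank=rank, iterations=5, lambda_=reg)
        preds = m.predictAll(valid_pairs).map(lambda r: ((r.user, r.product), r.rating))
        # Mean squared error on the held-out ratings
        mse = truth.join(preds).map(lambda kv: (kv[1][0] - kv[1][1]) ** 2).mean()
        if mse < best_mse:
            best_params, best_mse = (rank, reg), mse
print best_params, best_mse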
After training the model, we can finally make predictions with our recommendation system.
Let's start with the simplest kind of prediction, for example predicting user 38's rating of item 20. The code is as follows:
print model.predict(38, 20)
The output is as follows:
0.311633491603
As you can see, the predicted score is not high.
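Incidentally, if you need predicted scores for many user-item pairs at once, the model also provides predictAll, which takes an RDD of (user, item) pairs. A small sketch with a few arbitrary example pairs:
pairs = sc.parallelize([(38, 20), (38, 95), (115, 20)])
print model.predictAll(pairs).collect()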
Now let's predict the 10 items user 38 is most likely to like. The code is as follows:
print model.recommendProducts(38, 10)
The output is as follows:
[Rating (user=38, product=95, rating=4.995227969811873), Rating (user=38, product=304, rating=2.5159673379104484), Rating (user=38, product=1014, rating=2.165428673820349), Rating (user=38, product=322, rating=1.7002266119079879), Rating (user=38, product=111, rating=1.2057528774266673), Rating (user=38, product=196, rating=1.0612630766055788), Rating (user=38, product=23, rating=1.0590775012913558), Rating (user=38, product=327, rating=1.0335651317559753), Rating (user=38, product=98, rating=0.9677333686628911), Rating (user=38, product=181, rating=0.8536682271006641)]
These are the 10 items user 38 may like, listed with their predicted scores from high to low.
Next, let's predict the 10 users to whom item 20 is most worth recommending. The code is as follows:
print model.recommendUsers(20, 10)
The output is as follows:
[Rating (user=115, product=20, rating=2.9892138653406635), Rating (user=25, product=20, rating=1.7558472892444517), Rating (user=7, product=20, rating=1.523935609195585), Rating (user=286, product=20, rating=1.3746309116764184), Rating (user=222, product=20, rating=1.313891405211581), Rating (user=135, product=20, rating=1.254412853860262), Rating (user=186, product=20, rating=1.2194811581542384), Rating (user=72, product=20, rating=1.1651855319930426), Rating (user=241, product=20, rating=1.0863391992741023), Rating (user=160, product=20, rating=1.072353288848142)]
Now let's look at the three items most worth recommending to each user. The code is as follows:
print model.recommendProductsForUsers(3).collect()
Because the output is very long, it is not reproduced here.
And the three users each item is most worth recommending to; the code is as follows:
print model.recommendUsersForProducts(3).collect()
Again, because the output is very long, it is not reproduced here.
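Finally, if you want to reuse the trained model later without retraining it, MatrixFactorizationModel supports saving to and loading from a directory. A brief sketch, where the path is just an example:
# Persist the trained factor matrices, then load them back into a new model object
model.save(sc, "C:/temp/als_model")
same_model = MatrixFactorizationModel.load(sc, "C:/temp/als_model")
print same_model.predict(38, 20)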
I hope the example above helps you use the Spark matrix factorization recommendation algorithm.
(Reprints are welcome; please credit the source. Questions and discussion are welcome: [email protected])