1. Alternating Least Squares
ALS (Alternating Least Squares) is a collaborative filtering recommendation algorithm based on the least squares method. In the user-item rating matrix, u denotes a user and v a product; each user rates some items, but no user rates every item. For example, if user u6 has not rated product v3, inferring that missing rating is the task of the machine learning algorithm.
Since not every user rates every product, the rating matrix is sparse, and ALS assumes it is low-rank: an m×n matrix A can be approximated by the product of two matrices of sizes m×k and k×n, where k << m, n.
A(m×n) ≈ U(m×k) × V(k×n)
This assumption is reasonable because users and products can be described by a small number of low-dimensional hidden (latent) features. For example, if we know someone likes carbonated drinks, we can infer that he probably likes Pepsi, Coca-Cola, and Fanta without listing each drink explicitly; "carbonated drinks" acts as a hidden feature here. In the formula above, U(m×k) describes each user's preference for the hidden features, and V(k×n) describes the extent to which each product exhibits them. The learning task is to find U and V. The inner product u_i^T v_j is user i's predicted preference for product j, and the Frobenius norm is used to quantify the error of reconstructing A from U and V. Because many entries of the matrix are blank (the user did not rate the product), we do not compute the error over unknown entries, only over the set R of observed (user, product) pairs.
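Putting the pieces above together, the objective can be written as follows (a reconstruction consistent with the explicit-feedback formulation; λ denotes the regularization parameter):

```latex
\min_{U,V} \sum_{(i,j)\in R} \left( a_{ij} - u_i^{T} v_j \right)^2
  + \lambda \left( \sum_i \lVert u_i \rVert^2 + \sum_j \lVert v_j \rVert^2 \right)
```

The sum runs only over the observed set R, and the λ term penalizes large factor vectors to prevent overfitting the sparse ratings.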
This turns the collaborative recommendation problem into an optimization problem. In the objective function, U and V are coupled to each other, which is why alternating least squares is needed: first fix an initial value U(0), which reduces the problem to an ordinary least-squares problem, and compute V(0) from U(0); then compute U(1) from V(0), and so on, iterating until a fixed number of iterations is reached or the process converges. Although convergence to the global optimum cannot be guaranteed, in practice this has little effect.
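The alternating iteration just described can be sketched in a few lines of plain Scala. This is a hypothetical rank-1 (k=1) toy, not MLlib's distributed implementation: the ratings, lambda, and object name are all illustrative, but each update is exactly the closed-form regularized least-squares solution for one side while the other side is held fixed.

```scala
// Toy rank-1 ALS over a small observed-ratings set (illustrative data).
object AlsSketch {
  // observed (user, item, rating) triples -- the set R in the text
  val ratings = Seq((0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0), (2, 1, 2.0))
  val numUsers = 3
  val numItems = 2
  val lambda = 0.01 // regularization parameter

  def run(iters: Int): (Array[Double], Array[Double]) = {
    var u = Array.fill(numUsers)(1.0) // user factors, the initial U(0)
    var v = Array.fill(numItems)(1.0) // item factors
    for (_ <- 1 to iters) {
      // fix V, solve the regularized least-squares update for each u_i
      u = Array.tabulate(numUsers) { i =>
        val obs = ratings.filter(_._1 == i)
        obs.map { case (_, j, a) => a * v(j) }.sum /
          (obs.map { case (_, j, _) => v(j) * v(j) }.sum + lambda)
      }
      // fix U, solve for each v_j symmetrically
      v = Array.tabulate(numItems) { j =>
        val obs = ratings.filter(_._2 == j)
        obs.map { case (i, _, a) => a * u(i) }.sum /
          (obs.map { case (i, _, _) => u(i) * u(i) }.sum + lambda)
      }
    }
    (u, v)
  }
}
```

After a handful of iterations, u(i) * v(j) approximates the observed rating a_ij for every pair in R; each half-step only solves an ordinary least-squares problem, which is the point of the alternation.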
2. MLlib's ALS Implementation
MLlib's ALS uses a blocked data-partitioning structure: U is decomposed into u1, u2, ..., um and V into v1, v2, ..., vn, and the related parts of U and V are stored in the same partition, which reduces the cost of data exchange between partitions. For example, when computing V from U, suppose U is stored in partitions p1, p2, ... and V in partitions q1, q2, .... Different u's must be sent to different q partitions; the block that records this routing relationship is called an OutBlock. Conversely, on the receiving side, the block that records which u's are needed to compute each v is called an InBlock.
For example, suppose ratings a12, a13, and a15 are observed, u1 is stored in partition p1, v2 and v3 in partition q2, and v5 in partition q3. Then u1 must be sent from p1 to both q2 and q3; this routing information is stored in an OutBlock. Likewise, computing v2 from a12 and a32 requires u1 and u3; this information is stored in an InBlock.
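The OutBlock idea can be sketched with plain Scala collections. Everything here is illustrative (the modulo partitioners, the partition counts P and Q, and the object name are all assumptions, not MLlib's actual code); the sketch just shows what kind of routing table an OutBlock holds: for each user in a user partition, the set of item partitions that need that user's factor vector.

```scala
// Hypothetical sketch of an OutBlock routing table (not MLlib's real code).
object BlockSketch {
  val P = 2 // number of user partitions (illustrative)
  val Q = 2 // number of item partitions (illustrative)

  // observed (user u, item v) pairs: a12, a13, a15, a32 from the example
  val ratings = Seq((1, 2), (1, 3), (1, 5), (3, 2))

  // OutBlock for user partition p:
  //   user -> set of item partitions its factor vector must be sent to
  def outBlock(p: Int): Map[Int, Set[Int]] =
    ratings.filter { case (u, _) => u % P == p }
      .groupBy(_._1)
      .map { case (u, pairs) => u -> pairs.map { case (_, v) => v % Q }.toSet }
}
```

With these partitioners, user 1's factor must travel to the partitions holding items 2, 3, and 5; precomputing that once per iteration is what saves repeated shuffle bookkeeping.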
The code is as follows:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

/**
  * Created by Administrator on 2017/7/19.
  */
object ALSTest01 {

  def main(args: Array[String]) = {
    // set up the running environment
    val conf = new SparkConf().setAppName("ALS 01")
      .setMaster("spark://master:7077")
      .setJars(Seq("E:\\Intellij\\Projects\\MachineLearning\\MachineLearning.jar"))
    val sc = new SparkContext(conf)
    Logger.getRootLogger.setLevel(Level.WARN)

    // read and parse the sample data
    val dataRDD = sc.textFile("hdfs://master:9000/ml/data/test.data")
    val ratingRDD = dataRDD.map(line => line.split(',') match {
      case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
    })

    // split into training and test sets
    val dataParts = ratingRDD.randomSplit(Array(0.8, 0.2))
    val trainingRDD = dataParts(0)
    val testRDD = dataParts(1)

    // build and train the ALS model
    val rank = 10
    val numIterations = 10
    val alsModel = ALS.train(trainingRDD, rank, numIterations, 0.01)

    // predict
    val userProduct = trainingRDD.map {
      case Rating(user, product, rate) => (user, product)
    }
    val predictions = alsModel.predict(userProduct).map {
      case Rating(user, product, rate) => ((user, product), rate)
    }
    val ratesAndPredictions = trainingRDD.map {
      case Rating(user, product, rate) => ((user, product), rate)
    }.join(predictions)

    val MSE = ratesAndPredictions.map {
      case ((user, product), (r1, r2)) =>
        val err = r1 - r2
        err * err
    }.mean()
    println("Mean Squared Error = " + MSE)

    println("user" + "\t" + "product" + "\t" + "rate" + "\t" + "prediction")
    ratesAndPredictions.collect.foreach(rating => {
      println(rating._1._1 + "\t" + rating._1._2 + "\t" + rating._2._1 + "\t" + rating._2._2)
    })
  }
}
The four parameters of the ALS.train() function are the training data set, the number of latent features (rank), the number of iterations, and the regularization parameter.
Running result:
As the output shows, the predicted ratings are very close to the actual ones.
Spark Machine Learning (10): ALS Alternating Least Squares Algorithm