Collaborative filtering algorithm: R / MapReduce / Spark MLlib multi-language implementations

Source: Internet
Author: User
Tags: spark, mllib



Download the MovieLens user movie rating dataset:

http://grouplens.org/datasets/movielens/


1) Item-based: non-personalized; every user sees the same recommendations.

2) User-based: personalized; different users see different recommendations.

After analyzing user behavior to obtain user preferences, we can compute similar users and similar items from those preferences, and then recommend based on either. These are the two branches of collaborative filtering: user-based and item-based.

When computing the similarity between users, each user's preferences for all items form one vector; when computing the similarity between items, all users' preferences for a single item form one vector. Once the similarities are computed, the next step is to select the similar neighbors.
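As a concrete illustration of these two kinds of vectors (toy numbers, not from the article), cosine similarity can be computed over rows for user-based CF and over columns for item-based CF:

```python
# Toy example: user-user and item-item cosine similarity from a small
# ratings matrix, rows = users, cols = items (0 = unrated).
import numpy as np

ratings = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [1.0, 0.0, 0.0, 4.0],
])

def cosine_sim(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(np.dot(a, b) / denom)

# user-based: a user's preferences over all items is one vector (a row)
user_sim_01 = cosine_sim(ratings[0], ratings[1])

# item-based: all users' preferences for one item is one vector (a column)
item_sim_03 = cosine_sim(ratings[:, 0], ratings[:, 3])

print(user_sim_01, item_sim_03)
```

Users 0 and 1 rate the same items similarly, so their similarity is high; items 0 and 3 receive opposite ratings from the two user groups, so their similarity is lower.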


3) Model-based (ModelCF)

Model-based approaches can be divided into:

1) Nearest-neighbor model: distance-based collaborative filtering

2) Latent factor model (SVD): based on matrix factorization

3) Graph model: social network graph model
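A minimal sketch of the latent factor (SVD) idea, assuming a small toy ratings matrix (illustration only, not the article's code): a low-rank reconstruction of the rating matrix is used to estimate unobserved scores.

```python
# Latent-factor sketch: approximate a ratings matrix with a rank-k
# truncated SVD and read predicted scores off the reconstruction.
import numpy as np

R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [1.0, 0.0, 0.0, 4.0],
])

k = 2  # number of latent factors
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation

# predicted score of user 0 for item 2 (unrated in R)
pred = R_hat[0, 2]
print(pred)
```

Real systems factor only the observed entries (as ALS does below); plain SVD treats the zeros as actual ratings, which is why this is only a sketch of the idea.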




Applicable scenarios


      For an online retail site, the number of users is usually far larger than the number of items, and the item catalogue is relatively stable, so computing item similarity is cheap and does not need frequent updates. This only holds for e-commerce-type sites, however; for news, blogs, and similar sites the situation is usually the opposite: the number of items is huge and changes frequently, so item similarities are expensive to keep up to date.



R implementation of item-based collaborative filtering

    
  

# load the plyr package
library(plyr)

# read the dataset (u.data is tab-separated)
train <- read.table(file = "C:/users/administrator/desktop/u.data", sep = "\t")
train <- train[1:3]
names(train) <- c("user", "item", "pref")

# list of unique users
usersUnique <- function() {
  users <- unique(train$user)
  users[order(users)]
}

# list of unique items
itemsUnique <- function() {
  items <- unique(train$item)
  items[order(items)]
}

# user list
users <- usersUnique()
# item list
items <- itemsUnique()

# index into the item list
index <- function(x) which(items %in% x)
data <- ddply(train, .(user, item, pref), summarize, idx = index(item))

# co-occurrence matrix
cooccurrence <- function(data) {
  n <- length(items)
  co <- matrix(rep(0, n * n), nrow = n)
  for (u in users) {
    idx <- index(data$item[which(data$user == u)])
    m <- merge(idx, idx)
    for (i in 1:nrow(m)) {
      co[m$x[i], m$y[i]] <- co[m$x[i], m$y[i]] + 1
    }
  }
  return(co)
}

# recommendation algorithm
recommend <- function(udata = udata, co = coMatrix, num = 0) {
  n <- length(items)

  # preference vector over all items
  pref <- rep(0, n)
  pref[udata$idx] <- udata$pref

  # user rating matrix
  userx <- matrix(pref, nrow = n)

  # co-occurrence matrix * rating matrix
  r <- co %*% userx

  # sort the recommendations;
  # set the score of items the user has already rated to 0
  r[udata$idx] <- 0
  idx <- order(r, decreasing = TRUE)
  topn <- data.frame(user = rep(udata$user[1], length(idx)),
                     item = items[idx], val = r[idx])
  topn <- topn[which(topn$val > 0), ]

  # keep only the top num results
  if (num > 0) {
    topn <- head(topn, num)
  }

  # return the result
  return(topn)
}

# build the co-occurrence matrix
co <- cooccurrence(data)

# compute recommendations for every user
recommendation <- data.frame()
for (i in 1:length(users)) {
  udata <- data[which(data$user == users[i]), ]
  recommendation <- rbind(recommendation, recommend(udata, co, 0))
}
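The same co-occurrence logic as the R implementation above can be sketched in Python (hypothetical toy triples instead of the MovieLens file): count how often pairs of items are rated by the same user, then multiply the co-occurrence matrix by a user's rating vector.

```python
# Co-occurrence recommendation sketch: co[a, b] counts users who rated
# both item a and item b; co @ pref scores unrated items for a user.
import numpy as np

# (user, item, pref) triples -- hypothetical toy data
train = [(1, 101, 5.0), (1, 102, 3.0), (1, 103, 2.5),
         (2, 101, 2.0), (2, 102, 2.5), (2, 103, 5.0), (2, 104, 2.0),
         (3, 101, 2.0), (3, 104, 4.0), (3, 105, 4.5)]

items = sorted({i for _, i, _ in train})
users = sorted({u for u, _, _ in train})
idx = {item: j for j, item in enumerate(items)}

n = len(items)
co = np.zeros((n, n))
for u in users:
    rated = [idx[i] for uu, i, _ in train if uu == u]
    for a in rated:
        for b in rated:
            co[a, b] += 1

def recommend(user, num=0):
    pref = np.zeros(n)
    rated = []
    for uu, i, p in train:
        if uu == user:
            pref[idx[i]] = p
            rated.append(idx[i])
    r = co @ pref            # co-occurrence matrix * rating vector
    r[rated] = 0             # never re-recommend already-rated items
    order = np.argsort(-r)
    top = [(items[j], float(r[j])) for j in order if r[j] > 0]
    return top[:num] if num > 0 else top

print(recommend(1))  # -> [(104, 15.5), (105, 5.0)]
```

For user 1, items 104 and 105 are the only unrated items with nonzero co-occurrence scores, so they come back ranked by score.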



MapReduce implementation

Reference article:

http://www.cnblogs.com/anny-1980/articles/3519555.html


Code download

https://github.com/bsspirit/maven_hadoop_template/releases/tag/recommend





Spark ALS Implementation

Spark MLlib implements collaborative filtering with matrix factorization (ALS), not with user-based or item-based neighborhood methods.
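Conceptually, ALS factors the rating matrix as R ≈ U·Vᵀ by alternately solving regularized least-squares problems for the user factors and the item factors. A minimal NumPy sketch of that idea (toy data; not MLlib's actual implementation):

```python
# Alternating least squares on the observed entries of a toy rating
# matrix: fix V and solve for each user row, then fix U and solve for
# each item row, repeating until the factors converge.
import numpy as np

rng = np.random.default_rng(0)
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [1.0, 0.0, 0.0, 4.0],
])
mask = R > 0                       # observed entries only
n_users, n_items = R.shape
rank, lam, iters = 2, 0.1, 20      # rank, lambda, numIter as in the example

U = rng.normal(size=(n_users, rank))
V = rng.normal(size=(n_items, rank))

for _ in range(iters):
    for u in range(n_users):       # fix V, solve for user u's factors
        J = mask[u]
        A = V[J].T @ V[J] + lam * np.eye(rank)
        U[u] = np.linalg.solve(A, V[J].T @ R[u, J])
    for i in range(n_items):       # fix U, solve for item i's factors
        J = mask[:, i]
        A = U[J].T @ U[J] + lam * np.eye(rank)
        V[i] = np.linalg.solve(A, U[J].T @ R[J, i])

pred = U @ V.T
err = np.sqrt(((pred[mask] - R[mask]) ** 2).mean())  # RMSE on observed cells
print(err)
```

Each subproblem is an ordinary ridge regression, which is why ALS parallelizes well: MLlib distributes exactly these per-user and per-item solves across the cluster.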


Reference article:

http://www.mamicode.com/info-detail-865258.html



import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd._
import scala.io.Source

object MovieLensALS {
  def main(args: Array[String]) {
    // set up the runtime environment
    val sparkConf = new SparkConf().setAppName("MovieLensALS").setMaster("local[5]")
    val sc = new SparkContext(sparkConf)

    // load the user's own ratings (file personalRatings.txt)
    val myRatings = loadRatings(args(1))
    val myRatingsRDD = sc.parallelize(myRatings, 1)

    // sample data directory
    val movieLensHomeDir = args(0)

    // load the sample ratings; the last column (timestamp) mod 10 is the key: (Int, Rating)
    val ratings = sc.textFile(movieLensHomeDir + "/ratings.dat").map { line =>
      val fields = line.split("::")
      // format: (timestamp % 10, Rating(userId, movieId, rating))
      (fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
    }

    // load the movie catalogue (movie id -> movie title)
    val movies = sc.textFile(movieLensHomeDir + "/movies.dat").map { line =>
      val fields = line.split("::")
      // format: (movieId, movieName)
      (fields(0).toInt, fields(1))
    }.collect().toMap

    // count ratings, users, and rated movies
    val numRatings = ratings.count()
    val numUsers = ratings.map(_._2.user).distinct().count()
    val numMovies = ratings.map(_._2.product).distinct().count()
    println("Got " + numRatings + " ratings from " + numUsers + " users on " + numMovies + " movies")

    // split the ratings by key into training (60%, plus the personal ratings),
    // validation (20%), and test (20%); cache because the data is reused
    val numPartitions = 4
    val training = ratings.filter(x => x._1 < 6).values
      .union(myRatingsRDD).repartition(numPartitions).persist()
    val validation = ratings.filter(x => x._1 >= 6 && x._1 < 8).values
      .repartition(numPartitions).persist()
    val test = ratings.filter(x => x._1 >= 8).values.persist()

    val numTraining = training.count()
    val numValidation = validation.count()
    val numTest = test.count()
    println("Training: " + numTraining + " validation: " + numValidation + " test: " + numTest)

    // train models under different parameters, validate on the validation set,
    // and keep the model with the best parameters
    val ranks = List(8, 12)
    val lambdas = List(0.1, 10.0)
    val numIters = List(10, 20)
    var bestModel: Option[MatrixFactorizationModel] = None
    var bestValidationRmse = Double.MaxValue
    var bestRank = 0
    var bestLambda = -1.0
    var bestNumIter = -1
    for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
      val model = ALS.train(training, rank, numIter, lambda)
      val validationRmse = computeRmse(model, validation, numValidation)
      println("RMSE (validation) = " + validationRmse + " for the model trained with rank = "
        + rank + ", lambda = " + lambda + ", and numIter = " + numIter + ".")
      if (validationRmse < bestValidationRmse) {
        bestModel = Some(model)
        bestValidationRmse = validationRmse
        bestRank = rank
        bestLambda = lambda
        bestNumIter = numIter
      }
    }

    // predict the test set with the best model and compute the RMSE
    // against the actual ratings
    val testRmse = computeRmse(bestModel.get, test, numTest)
    println("The best model was trained with rank = " + bestRank + " and lambda = " + bestLambda
      + ", and numIter = " + bestNumIter + ", and its RMSE on the test set is " + testRmse + ".")

    // create a naive baseline and compare it with the best model
    val meanRating = training.union(validation).map(_.rating).mean()
    val baselineRmse = math.sqrt(
      test.map(x => (meanRating - x.rating) * (meanRating - x.rating)).reduce(_ + _) / numTest)
    val improvement = (baselineRmse - testRmse) / baselineRmse * 100
    println("The best model improves the baseline by " + "%1.2f".format(improvement) + "%.")

    // recommend the top 10 movies, excluding movies the user has already rated
    val myRatedMovieIds = myRatings.map(_.product).toSet
    val candidates = sc.parallelize(movies.keys.filter(!myRatedMovieIds.contains(_)).toSeq)
    val recommendations = bestModel.get
      .predict(candidates.map((0, _)))
      .collect()
      .sortBy(-_.rating)
      .take(10)

    var i = 1
    println("Movies recommended for you:")
    recommendations.foreach { r =>
      println("%2d".format(i) + ": " + movies(r.product))
      i += 1
    }

    sc.stop()
  }

  /** Compute the RMSE between the predicted and the actual ratings **/
  def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
    val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
    val predictionsAndRatings = predictions.map { x => ((x.user, x.product), x.rating) }
      .join(data.map(x => ((x.user, x.product), x.rating))).values
    math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
  }

  /** Load the user ratings file personalRatings.txt **/
  def loadRatings(path: String): Seq[Rating] = {
    val lines = Source.fromFile(path).getLines()
    val ratings = lines.map { line =>
      val fields = line.split("::")
      Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
    }.filter(_.rating > 0.0)
    if (ratings.isEmpty) {
      sys.error("No ratings provided.")
    } else {
      ratings.toSeq
    }
  }
}
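The computeRmse step above, restated as a short Python sketch with hypothetical numbers: join predictions with actual ratings on the (user, item) key and take the root of the mean squared difference.

```python
# RMSE sketch: dicts keyed by (user, item) stand in for the joined RDDs.
import math

actual = {(1, 101): 5.0, (1, 102): 3.0, (2, 101): 2.0}
predicted = {(1, 101): 4.5, (1, 102): 3.5, (2, 101): 2.0}

def rmse(pred, truth):
    keys = pred.keys() & truth.keys()  # inner join on (user, item)
    return math.sqrt(sum((pred[k] - truth[k]) ** 2 for k in keys) / len(keys))

print(rmse(predicted, actual))
```

Two predictions are off by 0.5 and one is exact, so the RMSE is sqrt(0.5 / 3) ≈ 0.408.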


Reference article:

http://blog.csdn.net/acdreamers/article/details/44672305

http://www.cnblogs.com/technology/p/4467895.html

http://blog.fens.me/rhadoop-mapreduce-rmr/





This article is from the "not what Daniel (QQ: 934033381)" blog; please keep this source: http://tianxingzhe.blog.51cto.com/3390077/1710048

