Spark user-based collaborative filtering algorithm, with pitfalls and job submission


Following on from the previous post:
http://blog.csdn.net/wangqi880/article/details/52875524
Yes, the firewall on every machine has to be shut down, or the Spark cluster will not start up.
Last time the distributed Spark cluster was set up, so today I'll write a simple case to run on it. I'm going to write up a few recommendation topics on Spark. There are four of them: (1) user-based collaborative filtering, (2) item-based collaborative filtering, (3) model-based collaborative filtering, and (4) association-rule recommendation (FP-growth). Only the core code is written out.

Implementation of a user-based collaborative filtering algorithm on Spark

1 The user-based collaborative filtering algorithm

1.1 Meaning

It statistically searches for users similar to the target user, and predicts the target user's rating of a given item from those similar users' ratings, generally choosing the top-N most similar users to build the recommendation result.
From this sentence, we can see that the user-based recommendation algorithm mainly has 3 jobs to do: (1) measure user similarity, (2) find the nearest neighbours, (3) predict the score.
For the details, search Baidu.
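Written out, the prediction rule implemented later in this post is roughly:

$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N(u)} \mathrm{sim}(u,v)\, r_{v,i}}{\sum_{v \in N(u)} \mathrm{sim}(u,v)}$$

where N(u) is the set of similar users considered. Note that the usual textbook variant weights the deviations $r_{v,i} - \bar{r}_v$ rather than the raw ratings; the code below weights raw ratings.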

1.2 Similarity distance

The cosine distance is used directly here. It measures the angle between two rating vectors, so scaling a vector without changing its direction leaves the similarity unchanged. The formula is as follows:

cos(A, B) = (A · B) / (||A|| ||B||)
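To make the measure concrete, here is a minimal standalone Scala sketch (an illustrative helper of my own; the MLlib code below uses columnSimilarities() instead):

// Minimal sketch: cosine similarity between two dense rating vectors.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same length")
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}

// Example with users 1 and 3 from the sample data below:
// cosineSimilarity(Array(5.0, 1.0, 5.0, 1.0), Array(1.0, 5.0, 1.0, 5.0)) ≈ 0.3846,
// which matches the 0.3846153846153847 the program prints later.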

1.3 The sample data is as follows:

1,1,5.0
1,2,1.0
1,3,5.0
1,4,1.0
2,1,5.0
2,2,1.0
2,3,5.0
2,4,1.0
3,1,1.0
3,2,5.0
3,3,1.0
3,4,5.0
4,1,1.0
4,2,5.0
4,3,1.0
2 The Spark code is as follows:
package org.wq.scala.ml

import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.{SparkConf, SparkContext}

object UserBaseTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UserBaseModel").setMaster("local")
      .set("spark.sql.warehouse.dir", "E:/ideaWorkspace/ScalaSparkMl/spark-warehouse")
    val sc = new SparkContext(conf)

    // test.data is a user_item_rating sample; user and item are Ints
    val data = sc.textFile("data/mllib/test.data")
    val parseData = data.map(_.split(",") match {
      case Array(user, item, rate) =>
        MatrixEntry(user.toLong - 1, item.toLong - 1, rate.toDouble)
    })

    // The CoordinateMatrix holds the user_item_rating sample
    println("ratings:")
    val ratings = new CoordinateMatrix(parseData)
    ratings.entries.collect().map(x => println(x.i + "," + x.j + "," + x.value))

    // Convert the CoordinateMatrix into a RowMatrix to compute user-user cosine
    // similarity; rows represent users and columns represent items.
    // The RowMatrix method columnSimilarities() computes similarities between
    // columns, and the matrix is user_item_rating, so it must be transposed to
    // item_user_rating first to get the users' similarity.
    // After toRowMatrix() the rows are no longer ordered from small to large,
    // but the similarities come out OK.
    val matrix = ratings.transpose().toRowMatrix()
    println("results after toRowMatrix:")
    matrix.rows.collect().map(x => {
      x.toArray.map(t => print(t + ","))
      println("")
    })

    val similarities = matrix.columnSimilarities()
    println("similarity:")
    similarities.entries.collect().map(x => println(x.i + "," + x.j + "," + x.value))

    // Predict user 1's rating of item 1: user 1's average rating plus the
    // weighted average of the other users' ratings, with similarity as the weight.
    // Note: val ratingOfUser1 = ratings.toRowMatrix().rows.collect()(3).toArray
    // cannot be used here, because after toRowMatrix() the row subscript no
    // longer identifies the user -- toRowMatrix() seems to have a problem.
    val ratingOfUser1 = ratings.entries.filter(_.i == 0).map(x => (x.j, x.value))
      .sortBy(_._1).collect().map(_._2).toList.toArray
    val avgRatingOfUser1 = ratingOfUser1.sum / ratingOfUser1.size
    // println(avgRatingOfUser1)

    // Weighted average of the other users' ratings of item 1. matrix is
    // item_user_rating, so one row holds every user's rating of one item;
    // drop(1) means deleting user 1's own rating. matrix.rows.collect()(n)
    // cannot be used as the user's subscript.
    val ratingsToItem1 = matrix.rows.collect()(0).toArray.drop(1)
    // ratingsToItem1.map(x => print(x))

    // Weights: _.i == 0 selects the first user; sortBy(_.j) orders by the other
    // user's subscript so the weights line up with ratingsToItem1. The greater
    // the value, the higher the similarity.
    val weights = similarities.entries.filter(_.i == 0).sortBy(_.j).map(_.value).collect()
    // val weights = similarities.entries.filter(_.i == 0).sortBy(_.value, false).map(_.value).collect()

    // (0 to 2) runs from 0 to 2 with the default step of 1; here it takes the
    // most similar users to predict the rating. In a real setting this top-N
    // would be far too small.
    // sum(weight * user rating) / sum(weights)
    var weightedR = (0 to 2).map(t => weights(t) * ratingsToItem1(t)).sum / weights.sum

    // prediction = the average + the weighted average over similar users
    println("rating of user1 to item1 is " + avgRatingOfUser1)
    println("rating of user1 to item1 is " + weightedR)
    println("rating of user1 to item1 is " + (avgRatingOfUser1 + weightedR))
  }
}

The code has comments, so it should all be understandable. The main steps are computing the similarities and then computing user 1's rating of item 1. The calculation here is: the user's average rating + the weighted mean of the top-N users' ratings, where the weight is the similarity.
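The prediction step can also be written as a small standalone helper (an illustrative sketch of my own, not the code above; note that it normalises by the sum of the selected weights, whereas the code above divides by the sum of all the weights):

// Sketch of the prediction rule: user average + similarity-weighted mean of
// the top-N similar users' ratings of the target item.
def predict(userAvg: Double, simAndRating: Seq[(Double, Double)], topN: Int): Double = {
  val top = simAndRating.sortBy { case (sim, _) => -sim }.take(topN)
  val weightedMean = top.map { case (sim, r) => sim * r }.sum / top.map(_._1).sum
  userAvg + weightedMean
}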

3 Pitfalls

1 I tested roughly 3 million records (around 200,000 users, roughly 500 items) in a single-machine Windows environment with 16 GB of RAM and about 2 GB configured for the JVM. It ran for an hour without producing a result, which is far too slow (though this also depends on the configuration), so I stopped it.
2 The toRowMatrix() conversion in the middle is the nauseating method. After using it, the order of the matrix's row subscripts changes, and I don't know how to tell which row is which: the subscripts and the user numbers are different. You can try it yourself:

// The result of the following program is OK.
// user_item_rating
val ratings = new CoordinateMatrix(parseData)
ratings.entries.collect().map(x => {
  println("ratings=>" + x.i + "->" + x.j + "->" + x.value)
})

Running it gives the same result as the original sample:

0,0,5.0
0,1,1.0
0,2,5.0
0,3,1.0
1,0,5.0
1,1,1.0
1,2,5.0
1,3,1.0
2,0,1.0
2,1,5.0
2,2,1.0
2,3,5.0
3,0,1.0
3,1,5.0
3,2,1.0

But after the following transpose().toRowMatrix() conversion of the row matrix:

// Results after transpose().toRowMatrix()
ratings.transpose().toRowMatrix().rows.collect().map(x => {
  x.toArray.map(t => print(t + ","))
  println()
})

the rows come back in a different order:

5.0,5.0,1.0,1.0,
1.0,1.0,5.0,5.0,
1.0,1.0,5.0,0.0,
5.0,5.0,1.0,1.0,

When traversing the matrix with map, you cannot look it up by user ID, which is nauseating: user 2's ratings and user 3's ratings come back swapped, and you only discover this by manually comparing the two outputs.
But traversal is the only way to walk the matrix, so how do I know which user a record belongs to?
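One possible workaround, based on the MLlib API (a sketch I have not tried at scale): convert to an IndexedRowMatrix instead of a RowMatrix, since each IndexedRow carries its original row index and can therefore be mapped back to a user:

// Sketch: toIndexedRowMatrix() keeps the original row index with each row,
// so the user a row belongs to can still be identified.
val indexed = ratings.toIndexedRowMatrix()
indexed.rows.collect().sortBy(_.index).foreach { row =>
  println("user " + row.index + " -> " + row.vector.toArray.mkString(","))
}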

That said, the similarities I calculated by hand are the same as what the program produces, so the similarity itself should be OK. I'd still be grateful if an expert could explain this.

// Similarities calculated by the program
2 - 3 - 0.7205766921228921
0 - 1 - 1.0000000000000002
1 - 2 - 0.3846153846153847
0 - 3 - 0.4003203845127179
1 - 3 - 0.4003203845127179
0 - 2 - 0.3846153846153847
4 Submitting the jar to the Spark cluster

4.1 Packaging method

I'm using IDEA: press Ctrl+Alt+Shift+S to open Project Structure and build the jar artifact from there.


4.2 Running the jar and precautions

Use rz to upload it to CentOS; an SSH tool or any other tool is fine, whatever you like.
Note that the data file must be available on every node.
My directory structure is as follows (the same on all three machines):
The jar directory: /home/jar/

The data directory for the jar: /home/jar/data

Once the jar and the data are in place, make sure the Spark cluster is running, then enter the command to run our jar.

spark-submit --class org.wq.scala.ml.UserBase --master spark://master:7077 /home/jar/UserBaseSpark.jar /home/jar/data/test.data

It runs successfully.

4.3 Precautions

1 Make sure your data file is available on every node, or errors are reported:

2 When submitting the job, make sure the memory you request does not exceed the memory configured in spark-env.sh; otherwise the warning below about insufficient resources is reported, the program hangs, and it cannot run to completion:
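For example, the memory can be capped explicitly on the command line (the values here are illustrative and must stay within the limits set in spark-env.sh):

spark-submit --class org.wq.scala.ml.UserBase \
  --master spark://master:7077 \
  --driver-memory 1g \
  --executor-memory 1g \
  /home/jar/UserBaseSpark.jar /home/jar/data/test.data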

If any expert understands the toRowMatrix() question above, please help resolve it.
When I have time I will also study the source code.
The next article will cover item-based collaborative filtering.
If you want to do real recommendation on Spark, I recommend using the model-based approach and association rules.
