The Spark version tested in this article is 1.3.1.
This article builds a simple, small movie recommendation system on a Spark cluster, as groundwork for the full project and a way to accumulate knowledge.
The workflow of the entire system is described as follows:
1. A movie site has a considerable number of movie resources and users. Each user rates individual films, and aggregating these ratings yields a huge amount of user-movie-rating data.
2. I watched a few movies on the site and rated them (0-5 points).
3. Based on my ratings of those films, the site's recommendation system predicts which movies in the site's library suit me and recommends them to me.
4. Based on my viewing habits and each user's personal information, it predicts which users in the site's user library have tastes similar to mine and recommends them to me as people to get to know.
There are 4 data sets to use:
test.dat (my rating data), in the following format:
My user ID (0) :: movie ID :: my rating for this movie :: timestamp of the rating
users.dat (user data), in the following format:
User ID :: gender :: age :: occupation code :: zip-code
movies.dat (movie data), in the following format:
Movie ID :: movie name :: movie genre
ratings.dat (user-movie-rating data), in the following format:
User ID :: movie ID :: the user's rating for the movie :: timestamp of the rating
(This data set does not contain my rating data, i.e. the records with user ID 0)
Recommendation system data set
It contains roughly 6000+ users, 3800+ movies, and more than 1 million rating records.
For the exact data formats, see the README in the full data set, which describes them in detail.
After downloading the data set, take care to check whether any lines have missing fields; if so, delete them, because they will cause an exception when the data is read.
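To make the "::"-separated format concrete, here is a minimal sketch, not part of the system itself, of parsing ratings.dat-style records and counting malformed lines before they cause trouble (the helper name and path handling are only illustrative):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Illustrative helper: parse "::"-separated rating records and drop lines with missing fields
def loadRatingLines(sc: SparkContext, path: String): RDD[(Int, Int, Double, Long)] = {
  val lines = sc.textFile(path) // e.g. the data directory's ratings.dat
  val malformed = lines.filter(_.split("::").length != 4) // such lines would throw during parsing
  println("malformed lines: " + malformed.count())
  lines.filter(_.split("::").length == 4).map { line =>
    val f = line.split("::")
    (f(0).toInt, f(1).toInt, f(2).toDouble, f(3).toLong) // user ID, movie ID, rating, timestamp
  }
}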
Before starting, it is best to get the overall idea clear; the coding then goes much more smoothly ~
In this system we are going to use the ALS algorithm to do collaborative filtering.
The algorithm needs a training data set to build a model.
So, first of all, let's be clear about:
1. What data does the ALS algorithm use for training?
2. What kind of data should the trained model make predictions on?
3. What does the predicted data look like?
The training data set is obviously ratings.dat, because it is the user-movie-rating data.
However, ratings.dat alone is not enough. Why?
Because this system's function is very simple: it recommends movies for only one user (me, user ID 0). But ratings.dat does not include my rating data, and without my ratings, how can the algorithm recommend movies according to my preferences?
So the training data should be ratings.dat + test.dat.
The ALS algorithm trains a model on this data.
The model can then predict ratings for the movies in the movie list that I have not yet seen, and the 10 movies with the highest predicted ratings are picked out as recommendations.
So, the answers are:
1. The training data set is ratings.dat + test.dat.
2. The data to predict on is movies.dat minus the movies I have already seen.
3. The model's prediction result is a list of movies, each with a rating (the rating predicted for me).
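A minimal sketch of that data flow, assuming the Rating RDDs have already been loaded from ratings.dat and test.dat and the movie IDs from movies.dat (the names and the rank/iterations/lambda values here are only illustrative; the full program follows below):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Illustrative outline: train on ratings.dat + test.dat, predict on the unseen movies, keep the top 10
def recommendTop10(sc: SparkContext,
                   ratings: RDD[Rating],   // from ratings.dat
                   myRatings: RDD[Rating], // from test.dat (user ID 0)
                   movieIds: Seq[Int]): Array[Rating] = {
  val model = ALS.train(ratings.union(myRatings), 8, 5, 0.1) // rank, iterations, lambda
  val seen = myRatings.map(_.product).collect().toSet        // movies I have already rated
  val candidates = sc.parallelize(movieIds.filterNot(seen.contains)) // movies.dat minus the ones I've seen
  model.predict(candidates.map((0, _)))                      // predicted ratings for user 0
    .collect().sortBy(-_.rating).take(10)                    // the 10 highest predictions
}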
Of course, what is described above is the system's main task. There are also some side tasks, such as computing the error (the RMSE) and printing results; let's let the code do the talking ~
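Concretely, the "error" used below to compare models is the root-mean-square error (RMSE) between predicted and actual ratings. A minimal sketch of the computation, where predsAndActuals is a hypothetical RDD of (prediction, actual) pairs:

import org.apache.spark.rdd.RDD

// RMSE = sqrt( sum over all pairs of (prediction - actual)^2 / n )
def rmse(predsAndActuals: RDD[(Double, Double)]): Double = {
  val n = predsAndActuals.count()
  math.sqrt(predsAndActuals.map { case (p, a) => (p - a) * (p - a) }.reduce(_ + _) / n)
}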
For the basic use of the collaborative filtering algorithm in MLlib, please read this first:
Spark (11) – MLlib API programming: linear regression, KMeans, and collaborative filtering demos
Without further ado, here is the code:
To make the format and meaning of the data easier to follow, variable/constant names follow this convention:
dataName_DataType
For example, in the code below ratingsTrain_KV is the ratings training data held as key-value pairs, and movies_Map is the movie data held as a Map.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

object MoviesRecommond {

  def main(args: Array[String]) {
    if (args.length < 3) {
      System.err.println("Usage: MoviesRecommond <master> <data dir> <my ratings file>")
      System.exit(1)
    }

    // Mask the logs: the results are printed on the console, so turn off Spark's log output to make them easier to read
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // Create the entry-point objects
    val conf = new SparkConf().setMaster(args(0)).setAppName("CollaborativeFiltering")
    val sc = new SparkContext(conf)

    // Full ratings training data, tuple format
    val ratingsList_Tuple = sc.textFile(args(1) + "/ratings.dat").map { lines =>
      val fields = lines.split("::")
      // The timestamp column is taken modulo 10, so every rating record carries a digit 0-9 in this column; why? see below
      (fields(0).toInt, fields(1).toInt, fields(2).toDouble, fields(3).toLong % 10)
    }

    // Full ratings training data in simulated key-value form: the key is a digit 0-9, the value is a Rating
    val ratingsTrain_KV = ratingsList_Tuple.map(x => (x._4, Rating(x._1, x._2, x._3)))

    // Print how many rating records, users and movies we got from ratings.dat
    println("got " + ratingsTrain_KV.count()
      + " ratings from " + ratingsTrain_KV.map(_._2.user).distinct().count()
      + " users on " + ratingsTrain_KV.map(_._2.product).distinct().count() + " movies")

    // My rating data, RDD[Rating] format
    val myRatedData_Rating = sc.textFile(args(2)).map { lines =>
      val fields = lines.split("::")
      Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
    }

    // From the full training data take 80% as the training set, 20% as the validation set and 20% as the test set;
    // taking the timestamp modulo 10 above was done precisely to carve out these three parts
    // Set the number of partitions
    val numPartitions = 3
    // Records whose key is less than 8 are used as training data
    val trainingData_Rating = ratingsTrain_KV.filter(_._1 < 8)
      .values // the original set is a pseudo key-value pair and training only needs RDD[Rating], so take the values
      .union(myRatedData_Rating) // use union to add my ratings to the training set, as the basis of the training
      .repartition(numPartitions)
      .cache()
    // Format and meaning similar to the above; this is the validation data and does not need my ratings, so there is no union
    val validateData_Rating = ratingsTrain_KV.filter(x => x._1 >= 6 && x._1 < 8)
      .values
      .repartition(numPartitions)
      .cache()
    val testData_Rating = ratingsTrain_KV.filter(_._1 >= 8)
      .values
      .cache()

    // Print how many records the training, validation and test sets contain
    println("training data's num: " + trainingData_Rating.count()
      + " validate data's num: " + validateData_Rating.count()
      + " test data's num: " + testData_Rating.count())

    // Start training models and pick the best one according to the error
    val ranks = List(8, 12)
    val lambdas = List(0.1, 10.0)
    val iters = List(5, 7) // the iteration count depends on your cluster hardware; my machine can only manage 7 iterations before memory overflows
    var bestModel: MatrixFactorizationModel = null
    var bestValidateRnse = Double.MaxValue
    var bestRank = 0
    var bestLambda = -1.0
    var bestIter = -1
    // A three-level nested loop produces 8 (rank, lambda, iter) combinations; each combination produces a model,
    // the error of the 8 models is computed and the smallest one is kept as the best model
    for (rank <- ranks; lam <- lambdas; iter <- iters) {
      val model = ALS.train(trainingData_Rating, rank, iter, lam)
      // rnse is the function that computes the error, defined at the bottom
      val validateRnse = rnse(model, validateData_Rating, validateData_Rating.count())
      println("validation = " + validateRnse + " for the model trained with rank = " + rank
        + " lambda = " + lam + " and numIter = " + iter)
      if (validateRnse < bestValidateRnse) {
        bestModel = model
        bestValidateRnse = validateRnse
        bestRank = rank
        bestLambda = lam
        bestIter = iter
      }
    }

    // Apply the best model to the test data set
    val testDataRnse = rnse(bestModel, testData_Rating, testData_Rating.count())
    println("the best model was trained with rank = " + bestRank + " and lambda = " + bestLambda
      + " and numIter = " + bestIter + " and its rnse on the test data is " + testDataRnse)

    // Compute how much the model improves on the naive baseline
    val meanRating = trainingData_Rating.union(validateData_Rating).map(_.rating).mean()
    val baselineRnse = math.sqrt(testData_Rating.map(x => (meanRating - x.rating) * (meanRating - x.rating)).mean())
    val improvement = (baselineRnse - testDataRnse) / baselineRnse * 100
    println("the best model improves the baseline by " + "%2.2f".format(improvement) + "%")

    // Full movie list, tuple format
    val movieList_Tuple = sc.textFile(args(1) + "/movies.dat").map { lines =>
      val fields = lines.split("::")
      (fields(0).toInt, fields(1), fields(2))
    }

    // Movie names, Map type: the key is the id, the value is the name
    val movies_Map = movieList_Tuple.map(x => (x._1, x._2)).collect().toMap

    // Movie genres, Map type: the key is the id, the value is the genre
    val moviesType_Map = movieList_Tuple.map(x => (x._1, x._3)).collect().toMap

    var i = 1
    println("movies recommond for you:")
    // Get the ids of the movies I have already seen
    val myRatedMovieIds = myRatedData_Rating.map(_.product).collect().toSet
    // Filter those movies out of the movie list; the remaining list is handed to the model to predict the rating I might give each movie
    val recommondList = sc.parallelize(movies_Map.keys.filter(id => !myRatedMovieIds.contains(id)).toSeq)
    // Sort the prediction results from high to low and output the 10 highest-rated records
    bestModel.predict(recommondList.map((0, _)))
      .collect()
      .sortBy(-_.rating)
      .take(10)
      .foreach { r =>
        println("%2d".format(i) + "----------> : \nmovie name --> " + movies_Map(r.product)
          + " \nmovie type --> " + moviesType_Map(r.product))
        i += 1
      }

    // Work out the people who may share my interests
    println("interested in these people:")
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Convert the movies, users and ratings data into DataFrames to run Spark SQL operations on them
    val movies = movieList_Tuple
      .map(m => Movies(m._1, m._2, m._3))
      .toDF()
    val ratings = ratingsList_Tuple
      .map(r => Ratings(r._1, r._2, r._3.toInt))
      .toDF()
    val users = sc.textFile(args(1) + "/users.dat").map { lines =>
      val fields = lines.split("::")
      Users(fields(0).toInt, fields(2).toInt, fields(3).toInt)
    }.toDF()

    ratings.filter('rating >= 5) // keep the records in the ratings list whose rating is 5
      .join(movies, ratings("movieId") === movies("id")) // join with the movies DataFrame
      // keep the records whose rating is 5 and whose genre is Drama (it should really be filtered by the genres
      // in my own rating data; Drama stands in here because of the data-format limits)
      .filter(movies("mType") === "Drama")
      .join(users, ratings("userId") === users("id")) // then join with the users DataFrame
      .filter(users("age") === 18) // keep the records with age = 18 (consistent with my information)
      .filter(users("occupation") === 18) // keep the records with occupation type = 18 (consistent with my information)
      .select(users("id")) // keep only the user id; the result is the set of users whose personal information is similar to mine and who like the same kind of movies
      .take(10)
      .foreach(println)
  }

  // Error calculation (called the variance in this article; it is the root-mean-square error)
  def rnse(model: MatrixFactorizationModel, predictionData: RDD[Rating], n: Long): Double = {
    // Use the given model to predict on the validation data set
    val prediction = model.predict(predictionData.map(x => (x.user, x.product)))
    // Join the predictions with the original ratings, then compute and return the error
    val predictionAndOldRatings = prediction.map(x => ((x.user, x.product), x.rating))
      .join(predictionData.map(x => ((x.user, x.product), x.rating)))
      .values
    math.sqrt(predictionAndOldRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
  }

  // Case classes, used for the Spark SQL implicit conversions
  case class Ratings(userId: Int, movieId: Int, rating: Int)
  case class Movies(id: Int, name: String, mType: String)
  case class Users(id: Int, age: Int, occupation: Int)
}
The results of running the system on the Spark cluster are as follows:
For some basic operations on Spark SQL, see:
Spark (9) – SparkSQL API programming
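For readers who skip that article, here is a minimal, self-contained sketch of the Spark 1.3 DataFrame operations used above (toDF, filter, join, select); the case classes and data are made up purely for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameSketch {
  // Made-up schemas, standing in for the Movies/Ratings/Users case classes above
  case class Person(id: Int, age: Int)
  case class Score(personId: Int, score: Int)

  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("DataFrameSketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // brings in toDF() for RDDs of case classes

    val people = sc.parallelize(Seq(Person(1, 18), Person(2, 30))).toDF()
    val scores = sc.parallelize(Seq(Score(1, 5), Score(2, 3))).toDF()

    scores.filter(scores("score") >= 5)                  // keep rows whose score is at least 5
      .join(people, scores("personId") === people("id")) // join the two DataFrames on the id columns
      .filter(people("age") === 18)                      // keep the 18-year-olds
      .select(people("id"))                              // project only the id column
      .show()
  }
}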
If there are any deficiencies or errors in this article, please point them out ~
If you have any questions, please get in touch to discuss ~
A movie recommendation system based on Spark MLlib and SparkSQL