Preferences cannot be measured directly.
Compared with other machine learning algorithms, a recommendation engine's output is intuitive and easy to understand.
The next three chapters describe the main machine learning algorithms in Spark. This chapter revolves around recommendation engines, using music recommendation as the example. We first introduce the practical application of Spark and MLlib, and then some basic ideas of machine learning.
3.1 Data sets
The relationship between a user and an artist is inferred implicitly from other actions, such as playing a song or album, rather than from explicit ratings or likes. This is called implicit feedback data. Home TV on-demand is similar: users generally do not actively rate content.
The data set is described at http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html, which may require a proxy to access. The download itself, http://www.iro.umontreal.ca/~lisa/datasets/profiledata_06-may-2005.tar.gz, seems reachable without one. A Baidu Netdisk mirror is at http://pan.baidu.com/s/1BQ4ILG.
3.2 Alternating least squares recommendation algorithm
We are looking for a learning algorithm that does not require user or artist attribute information. This type of algorithm is called collaborative filtering. Judging that two users may have similar preferences because they are the same age is not collaborative filtering. Conversely, judging that two users may both like a particular song because they have played many of the same songs is collaborative filtering.
The latent-factor model explains the observable interactions between large numbers of users and products through a relatively small number of unobserved, underlying causes.
This example uses a matrix factorization model: the problem is simplified to the product of a user-feature matrix and a feature-artist matrix, and this product is a complete estimate of the whole dense user-artist interaction matrix.
The alternating least squares (ALS) algorithm is used to solve this matrix factorization; it relies on QR decomposition.
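To make the factorization concrete, here is a tiny sketch of my own (not from the book, with made-up numbers and k = 2 features) showing that a predicted user-artist preference is just the dot product of one row of the user-feature matrix with one column of the feature-artist matrix:
Java:
// Latent-factor illustration with k = 2 made-up features:
// prediction = userRow . artistColumn
double[] userFeatures   = {0.8, 0.1};  // one row of the user-feature matrix
double[] artistFeatures = {0.9, 0.2};  // one column of the feature-artist matrix
double prediction = 0.0;
for (int k = 0; k < userFeatures.length; k++) {
    prediction += userFeatures[k] * artistFeatures[k];
}
// ALS alternates: fix the artist factors and solve for the user factors by
// least squares (via QR decomposition), then fix the user factors and solve
// for the artist factors, and repeat.
System.out.println(prediction);  // 0.74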
3.3 Preparing data
If you are running locally rather than on a cluster, specify the parameter --driver-memory 6g when starting spark-shell to ensure there is enough memory.
The first step in building a model is to understand the data and to parse or transform it so that it can be analyzed in Spark.
Spark MLlib's ALS implementation has a small drawback: it requires user and product IDs to be numeric, and in fact nonnegative 32-bit integers, which means IDs greater than Integer.MAX_VALUE (2147483647) are illegal. Let's first check whether the data set meets this requirement:
Scala:
scala> val rawUserArtistData = sc.textFile("D:/workspace/analysiswithspark/src/main/java/advanced/chapter3/profiledata_06-may-2005/user_artist_data.txt")
rawUserArtistData: org.apache.spark.rdd.RDD[String] = D:/workspace/analysiswithspark/src/main/java/advanced/chapter3/profiledata_06-may-2005/user_artist_data.txt MapPartitionsRDD[1] at textFile at <console>:...

scala> rawUserArtistData.map(_.split(' ')(0).toDouble).stats()
res0: org.apache.spark.util.StatCounter = (count: 24296858, mean: 1947573.265353, stdev: 496000.544975, max: 2443548.000000, min: 90.000000)

scala> rawUserArtistData.map(_.split(' ')(1).toDouble).stats()
res1: org.apache.spark.util.StatCounter = (count: 24296858, mean: 1718704.093757, stdev: 2539389.040171, max: 10794401.000000, min: 1.000000)
Java:
// Initialize SparkConf
SparkConf sc = new SparkConf().setMaster("local").setAppName("RecommendingMusic");
System.setProperty("hadoop.home.dir", "d:/tools/hadoop-2.6.4");
JavaSparkContext jsc = new JavaSparkContext(sc);

// Read in user-artist play data
JavaRDD<String> rawUserArtistData = jsc.textFile("src/main/java/advanced/chapter3/profiledata_06-may-2005/user_artist_data.txt");

// Display data statistics
System.out.println(rawUserArtistData.mapToDouble(line -> Double.parseDouble(line.split(" ")[0])).stats());
System.out.println(rawUserArtistData.mapToDouble(line -> Double.parseDouble(line.split(" ")[1])).stats());
The maximum user and artist IDs are 2443548 and 10794401, so there is no need to transform these IDs.
Next we parse the artist IDs and artist names. Because a small number of lines in the file are malformed (some are missing the tab, some contain stray newlines), we cannot process them directly with map. Instead we use flatMap, which expands each input into a collection of zero or more results and flattens them into one larger RDD. The following Scala program was not run; it is pasted here for reference. Read the artist ID-artist name data and reject the bad lines:
Scala:
val rawArtistData = sc.textFile("hdfs:///user/ds/artist_data.txt")
val artistByID = rawArtistData.flatMap { line =>
  val (id, name) = line.span(_ != '\t')
  if (name.isEmpty) {
    None
  } else {
    try {
      Some((id.toInt, name.trim))
    } catch {
      case e: NumberFormatException => None
    }
  }
}
Java:
// Read artist ID - artist name data
JavaRDD<String> rawArtistData = jsc.textFile("src/main/java/advanced/chapter3/profiledata_06-may-2005/artist_data.txt");
JavaPairRDD<Integer, String> artistByID = rawArtistData.flatMapToPair(line -> {
    List<Tuple2<Integer, String>> results = new ArrayList<>();
    String[] lineSplit = line.split("\\t", 2);
    if (lineSplit.length == 2) {
        Integer id;
        try {
            id = Integer.parseInt(lineSplit[0]);
        } catch (NumberFormatException e) {
            id = null;
        }
        if (!lineSplit[1].isEmpty() && id != null) {
            results.add(new Tuple2<Integer, String>(id, lineSplit[1]));
        }
    }
    return results;
});
Map misspelled or nonstandard artist IDs to the ID of the artist's canonical name:
Scala:
val rawArtistAlias = sc.textFile("hdfs:///user/ds/artist_alias.txt")
val artistAlias = rawArtistAlias.flatMap { line =>
  val tokens = line.split('\t')
  if (tokens(0).isEmpty) {
    None
  } else {
    Some((tokens(0).toInt, tokens(1).toInt))
  }
}.collectAsMap()
Java:
// Map misspelled or nonstandard artist IDs to the canonical artist ID
JavaRDD<String> rawArtistAlias = jsc.textFile("src/main/java/advanced/chapter3/profiledata_06-may-2005/artist_alias.txt");
Map<Integer, Integer> artistAlias = rawArtistAlias.flatMapToPair(line -> {
    List<Tuple2<Integer, Integer>> results = new ArrayList<>();
    String[] lineSplit = line.split("\\t", 2);
    if (lineSplit.length == 2 && !lineSplit[0].isEmpty()) {
        results.add(new Tuple2<Integer, Integer>(Integer.parseInt(lineSplit[0]), Integer.parseInt(lineSplit[1])));
    }
    return results;
}).collectAsMap();
The first entry in artist_alias.txt is "1092764 1000311". Get the artist names for IDs 1092764 and 1000311:
Java:
artistByID.lookup(1092764).forEach(System.out::println);
artistByID.lookup(1000311).forEach(System.out::println);
The output is:
Winwood, Steve
Steve Winwood
The examples in the book are:
Scala:
artistByID.lookup(6803336).head
artistByID.lookup(1000010).head
Java:
artistByID.lookup(1000010).forEach(System.out::println);
artistByID.lookup(6803336).forEach(System.out::println);
The output is:
Aerosmith (Unplugged)
Aerosmith
3.4 Building a first model
We need to do two conversions: first, convert all artist IDs to canonical IDs; second, convert the data into Rating objects, the ALS implementation's abstraction of "user-product-value" data, where "product" means "an item recommended to people". Now complete both tasks:
Scala:
import org.apache.spark.mllib.recommendation._
val bArtistAlias = sc.broadcast(artistAlias)
val trainData = rawUserArtistData.map { line =>
  val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
  val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
  Rating(userID, finalArtistID, count)
}.cache()
Java:
// Data set conversion
Broadcast<Map<Integer, Integer>> bArtistAlias = jsc.broadcast(artistAlias);

JavaRDD<Rating> trainData = rawUserArtistData.map(line -> {
    List<Integer> list = Arrays.asList(line.split(" ")).stream().map(x -> Integer.parseInt(x)).collect(Collectors.toList());
    // Use the canonical artist ID if an alias exists
    Integer finalArtistID = bArtistAlias.getValue().getOrDefault(list.get(1), list.get(1));
    return new Rating(list.get(0), finalArtistID, list.get(2));
}).cache();
A broadcast variable is used here so that each executor caches the data as a raw Java object, rather than deserializing a copy for every task, and the data can be cached across multiple jobs and stages.
Build the Model:
Scala:
val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)
Java:
MatrixFactorizationModel model = org.apache.spark.mllib.recommendation.ALS.trainImplicit(JavaRDD.toRDD(trainData), 10, 5, 0.01, 1.0);
Building the model takes a long time. On my i5 laptop the full data set would have taken an estimated three or four days, so for the actual computation I used only the first 98 users' data, 14903 rows in total. Accordingly, when printing the artists a user has played in section 3.5, the IDs used are those present in this reduced data set.
To view a feature vector:
Scala:
model.userFeatures.mapValues(_.mkString(", ")).first()
Java:
model.userFeatures().toJavaRDD().foreach(f -> System.out.println(f._1().toString() + " " + f._2()[0] + " " + f._2().toString()));
3.5 Checking recommendation results individually
Get the artists a given user has played:
Scala:
val rawArtistsForUser = rawUserArtistData.map(_.split(' ')).
  filter { case Array(user, _, _) => user.toInt == 2093760 }
val existingProducts = rawArtistsForUser.map { case Array(_, artist, _) => artist.toInt }.collect().toSet
artistByID.filter { case (id, name) => existingProducts.contains(id) }.values.collect().foreach(println)
Java:
JavaRDD<String[]> rawArtistsForUser = rawUserArtistData.map(x -> x.split(" ")).filter(f -> Integer.parseInt(f[0]) == 1000029);
List<Integer> existingProducts = rawArtistsForUser.map(f -> Integer.parseInt(f[1])).collect();
artistByID.filter(f -> existingProducts.contains(f._1())).values().collect().forEach(System.out::println);
We can make 5 recommendations for this user:
Scala:
val recommendations = model.recommendProducts(2093760, 5)
recommendations.foreach(println)
Java:
Rating[] recommendations = model.recommendProducts(1000029, 5);
Arrays.asList(recommendations).stream().forEach(System.out::println);
The results are as follows:
Rating(1000029,1001365,506.30319635520425)
Rating(1000029,4531,453.6082026572616)
Rating(1000029,4468,137.14313260781685)
Rating(1000029,599,130.16330043654924)
Rating(1000029,1003352,128.75804355555215)
The book says the last value of each line is a fuzzy value between 0 and 1, and that the larger the value, the better the recommendation. But that is not what my run returned.
This is what the Spark 1.6.2 Java API documentation says:
Rating objects, each of which contains the given user ID, a product ID, and a "score" in the rating field. Each represents one recommended product, and they are sorted by score, decreasing. The first returned is the one predicted to be most strongly recommended to the user. The score is an opaque value that indicates how strongly recommended the product is.
So the last field should simply be read as a score.
Once you have the IDs of the recommended artists, you can find the artists' names in a similar way:
Scala:
val recommendedProductIDs = recommendations.map(_.product).toSet
artistByID.filter { case (id, name) => recommendedProductIDs.contains(id) }.values.collect().foreach(println)
Java:
List<Integer> recommendedProductIds = Arrays.asList(recommendations).stream().map(y -> y.product()).collect(Collectors.toList());
artistByID.filter(f -> recommendedProductIds.contains(f._1())).values().collect().forEach(System.out::println);
Output Result:
Barenaked Ladies
Da Vinci's Notebook
Rage
They Might Be Giants
"Weird Al" Yankovic
As the book notes, the recommendations do not seem very good.
3.8 Selecting hyperparameters
I did not try the code for computing the AUC in this part.
ALS.trainImplicit() takes the following parameters (a sketch of trying several settings follows the list):
rank
The number of latent factors in the model, that is, the number of columns of the user-feature and feature-product matrices; in general, it is also the rank of those matrices.
iterations
The number of iterations the factorization runs; more iterations take more time, but may produce a better factorization.
lambda
A standard overfitting parameter; the larger the value, the less likely overfitting, but too large a value reduces the accuracy of the factorization. A somewhat larger lambda seems to work slightly better.
alpha
Controls the relative weight of observed versus unobserved "user-product" interactions in the factorization. The default of 40 comes from the original ALS paper; it means the model puts more emphasis on what the user did listen to than on what the user did not.
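Since I did not run the evaluation code, here is only a minimal sketch of my own (not the book's) of how several combinations of these parameters could be tried. The candidate values are illustrative, and computeAUC() is a hypothetical placeholder for a real evaluation on held-out data, not an actual API:
Java:
// Hypothetical hyperparameter sweep; computeAUC() is a placeholder, not a real API.
for (int rank : new int[]{10, 50}) {
    for (double lambda : new double[]{1.0, 0.0001}) {
        for (double alpha : new double[]{1.0, 40.0}) {
            MatrixFactorizationModel candidate = org.apache.spark.mllib.recommendation.ALS
                .trainImplicit(JavaRDD.toRDD(trainData), rank, 10, lambda, alpha);
            // double auc = computeAUC(candidate, cvData);  // evaluate on held-out data
            System.out.println("rank=" + rank + " lambda=" + lambda + " alpha=" + alpha);
        }
    }
}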
3.9 Generating recommendations
This model can generate recommendations for all users. It could be used in a batch process that recomputes the model and recommendations for all users every hour or even less, with the interval depending on data size and cluster speed.
However, Spark MLlib's ALS implementation does not currently support recommending to all users in a single call. It can recommend to one user at a time, launching a short distributed job for each; this suits rapid recomputation for small groups of users. Here we make recommendations for several users from the data and print the results:
Scala:
val someUsers = allData.map(_.user).distinct().take(100)
val someRecommendations = someUsers.map(userID => model.recommendProducts(userID, 5))
someRecommendations.map(
  recs => recs.head.user + " -> " + recs.map(_.product).mkString(", ")
).foreach(println)
Java:
// Make 5 recommendations for the user with ID 1000029
Rating[] recommendations = model.recommendProducts(1000029, 5);
Arrays.asList(recommendations).stream().forEach(System.out::println);
The entire process can also be used to recommend users to artists:
Scala:
rawUserArtistData.map { line =>
  ...
  val userID = tokens(1).toInt
  val artistID = tokens(0).toInt
  ...
}
Java:
In the data set conversion, change "return new Rating(list.get(0), finalArtistID, list.get(2));" to "return new Rating(finalArtistID, list.get(0), list.get(2));".
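For clarity, a sketch of the swapped conversion, reusing the variable names from section 3.4:
Java:
// Swap user and artist so that the model recommends users to artists
JavaRDD<Rating> swappedData = rawUserArtistData.map(line -> {
    List<Integer> list = Arrays.asList(line.split(" ")).stream().map(x -> Integer.parseInt(x)).collect(Collectors.toList());
    Integer finalArtistID = bArtistAlias.getValue().getOrDefault(list.get(1), list.get(1));
    // the artist now plays the "user" role and the user the "product" role
    return new Rating(finalArtistID, list.get(0), list.get(2));
}).cache();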
3.10 Summary
For non-implicit (explicit) data, MLlib also supports a variant of ALS that is used in the same way, except that the model is built with the method ALS.train(). It is appropriate for rating data rather than play counts; for example, when the data set consists of users' ratings of artists on a scale of 1 to 5. In the Rating objects returned by the various recommend methods, the rating field is then the estimated score.
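A minimal sketch of this explicit-feedback variant, assuming a hypothetical JavaRDD<Rating> of 1-to-5 ratings called ratingData (not a variable from this chapter):
Java:
// Explicit-feedback ALS: same call shape, but ALS.train() instead of trainImplicit(),
// and no alpha parameter (rank 10, 5 iterations, lambda 0.01)
MatrixFactorizationModel explicitModel = org.apache.spark.mllib.recommendation.ALS.train(JavaRDD.toRDD(ratingData), 10, 5, 0.01);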
If you need to compute recommendations on demand, you can use Oryx 2 (https://github.com/OryxProject/oryx), which uses libraries such as MLlib but accesses the in-memory model data in an efficient manner.