Machine Learning with Spark study notes (training a recommendation model on the 100,000-rating movie dataset)

Source: Internet
Author: User

We now train the model. Training takes the following parameters:

rank: The number of latent factors in the ALS model. In general, more factors give better results, but the rank directly affects memory usage. It is usually set between 10 and 200.

iterations: The number of iterations to run. Each iteration reduces the reconstruction error of the ALS model. ALS usually converges to a good result after only a few iterations, so a large number is not required in most cases (10 is typical).

lambda: The regularization parameter of the model, which controls overfitting. The larger the value, the stronger the regularization.
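For intuition about where rank and lambda enter the model, here is a common textbook form of the regularized matrix factorization objective. The notation is my own assumption and not from the original notes: r_{ui} is an observed rating, x_u and y_i are the user and item factor vectors (each of length rank), and K is the set of observed (user, item) pairs. Actual implementations, including MLlib's, may weight the regularization term differently.

```latex
\min_{X,\,Y} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right),
\qquad x_u,\, y_i \in \mathbb{R}^{\text{rank}}
```

ALS alternates between fixing Y and solving for X, then fixing X and solving for Y. Each such step is a least-squares problem, which is why every iteration reduces the reconstruction error.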

We will train the model with 50 factors, 8 iterations, and a regularization parameter of 0.01:

val model = ALS.train(ratings, 50, 8, 0.01)

Note: The original book uses 10 iterations. On my machine, 10 iterations caused a heap memory overflow, so I changed it to 8 after debugging.


This returns a MatrixFactorizationModel object, which holds the user and item factors as RDDs of (id, factor) pairs. They are called userFeatures and productFeatures.

println(model.userFeatures.count)
println(model.productFeatures.count)



The MatrixFactorizationModel class has a handy predict method, which predicts the rating for a given combination of user and item.

val predictedRating = model.predict(789, 123)

The user ID chosen here is 789, and we compute this user's predicted rating for movie 123. The results are as follows:

The results you get may not be the same as mine, because the ALS model is randomly initialized.

The predict method can also take an RDD of (user, item) pairs and return a prediction for each pair. To generate personalized recommendations for a single user, MatrixFactorizationModel provides a convenient method, recommendProducts. It takes two arguments, user and num: user is the user ID, and num is the number of items to recommend.

Now we generate 10 movie recommendations for user 789:

val userID = 789
val K = 10
val topKRecs = model.recommendProducts(userID, K)
println(topKRecs.mkString("\n"))

The results are as follows:

Next we load the movie titles so that we can look them up by movie ID:

val movies = sc.textFile("F:\\ScalaWorkSpace\\data\\ml-100k\\u.item")
// Keep the first two pipe-delimited fields of each line: (movie ID, title)
val titles = movies.map(line => line.split("\\|").take(2)).map(array => (array(0).toInt, array(1))).collectAsMap()
println(titles(123))
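To see what the split-and-take step does to a single line, here is a minimal sketch in plain Scala, without Spark. The object name, the helper parseLine, and the sample lines are all made up for illustration. They only mimic the pipe-delimited layout that the code above assumes and are not real contents of u.item.

```scala
object TitleLookupSketch {
  // Split a pipe-delimited line and keep the first two fields: (movie ID, title).
  def parseLine(line: String): (Int, String) = {
    val fields = line.split("\\|").take(2)
    (fields(0).toInt, fields(1))
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines for illustration only; not real file contents.
    val sampleLines = Seq(
      "1|Example Movie A (1995)|01-Jan-1995||http://example.com/a|0|0|1",
      "2|Example Movie B (1996)|01-Jan-1996||http://example.com/b|0|1|0"
    )
    // Build an in-memory ID -> title map, similar in spirit to collectAsMap().
    val titles: Map[Int, String] = sampleLines.map(parseLine).toMap
    println(titles(2))
  }
}
```

The pipe has to be escaped as "\\|" because split takes a regular expression, in which an unescaped | means alternation.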

The results are as follows:

Let's see how many movies user 789 has rated:

val moviesForUser = ratings.keyBy(_.user).lookup(789)
println(moviesForUser.size)

The results are as follows:

We can see that user 789 has rated 33 movies.


Next we take the 10 movies this user rated highest, sorting on the rating field of the Rating object, and look up each movie's title by its ID:

moviesForUser.sortBy(-_.rating).take(10).map(rating => (titles(rating.product), rating.rating)).foreach(println)
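The same sort-then-take pattern can be tried on a small in-memory collection. This is only a sketch under stated assumptions: the Rating case class below is a local stand-in with the same field names as the one used above, and the object name, the topRated helper, the titles, and the ratings are made up.

```scala
object TopRatedSketch {
  // Local stand-in with the same fields as the Rating objects used above.
  case class Rating(user: Int, product: Int, rating: Double)

  // Sort by descending rating, keep the first k, and attach the titles.
  def topRated(ratings: Seq[Rating], titles: Map[Int, String], k: Int): Seq[(String, Double)] =
    ratings.sortBy(-_.rating).take(k).map(r => (titles(r.product), r.rating))

  def main(args: Array[String]): Unit = {
    // Made-up data for illustration only.
    val titles = Map(1 -> "Movie A", 2 -> "Movie B", 3 -> "Movie C")
    val ratings = Seq(Rating(789, 1, 3.0), Rating(789, 2, 5.0), Rating(789, 3, 4.0))
    topRated(ratings, titles, 2).foreach(println)
  }
}
```

Negating the rating inside sortBy turns the default ascending sort into a descending one.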

The results are as follows:

Then we'll see which 10 movies are recommended for this user:

topKRecs.map(rating => (titles(rating.product), rating.rating)).foreach(println)

The results are as follows:

Find similar Movies

We measure similarity by the cosine of the angle between two vectors. A value of 1 means the two vectors point in exactly the same direction, 0 means they are unrelated, and -1 means they are completely opposite. First, we write a method that computes the cosine similarity of two vectors:

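import org.jblas.DoubleMatrix

// Cosine similarity: dot product divided by the product of the two L2 norms
def cosineSimilarity(vec1: DoubleMatrix, vec2: DoubleMatrix): Double = {
  vec1.dot(vec2) / (vec1.norm2() * vec2.norm2())
}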
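As a rough illustration of the same formula without the DoubleMatrix dependency, the sketch below computes cosine similarity on plain arrays and runs the three cases described above: same direction, orthogonal, and opposite. The object name and the sample vectors are assumptions made for this example.

```scala
object CosineSketch {
  // Cosine similarity: dot product divided by the product of the two L2 norms.
  def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    println(cosineSimilarity(Array(3.0, 4.0), Array(3.0, 4.0)))   // same direction
    println(cosineSimilarity(Array(1.0, 0.0), Array(0.0, 1.0)))   // orthogonal
    println(cosineSimilarity(Array(3.0, 4.0), Array(-3.0, -4.0))) // opposite
  }
}
```

Because of floating-point rounding, results can differ very slightly from exactly 1, 0, or -1, and an all-zero vector gives NaN because of the division by zero.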

To check that it is correct, pick a movie and verify that its similarity with itself is 1:

val itemId = 567
val itemFactor = model.productFeatures.lookup(itemId).head
val itemVector = new DoubleMatrix(itemFactor)
println(cosineSimilarity(itemVector, itemVector))


We can see that the result is 1!

Next we compute the similarity of every movie to it:

val sims = model.productFeatures.map { case (id, factor) =>
  val factorVector = new DoubleMatrix(factor)
  val sim = cosineSimilarity(factorVector, itemVector)
  (id, sim)
}

Then take the 10 movies with the highest similarity:

val sortedSims = sims.top(K)(Ordering.by[(Int, Double), Double] {
  case (id, similarity) => similarity
})
println(sortedSims.take(10).mkString("\n"))

The results are as follows:

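Now let's look at the movie titles. We take the top K + 1 results and skip the first one, because the most similar movie is the movie itself: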

val sortedSims2 = sims.top(K + 1)(Ordering.by[(Int, Double), Double] {
  case (id, similarity) => similarity
})
println(sortedSims2.slice(1, 11).map { case (id, sim) => (titles(id), sim) }.mkString("\n"))
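The reason for asking for K + 1 items and then slicing from index 1 can be seen on a small in-memory list. This is a minimal sketch with made-up similarity values. The object name and the helper topSimilarExcludingSelf are hypothetical, and the helper uses slice(1, k + 1) rather than the hard-coded slice(1, 11) above. It assumes the query item is present in the list and ranks first.

```scala
object SimilarItemsSketch {
  // Sort by descending similarity, then drop the single best match,
  // which is assumed to be the query item itself, and keep the next k.
  def topSimilarExcludingSelf(sims: Seq[(Int, Double)], k: Int): Seq[(Int, Double)] =
    sims.sortBy { case (_, similarity) => -similarity }.slice(1, k + 1)

  def main(args: Array[String]): Unit = {
    // Made-up similarities; item 567 is the query item, so its similarity is 1.0.
    val sims = Seq((10, 0.42), (567, 1.0), (20, 0.91), (30, 0.77))
    println(topSimilarExcludingSelf(sims, 2).mkString("\n"))
  }
}
```

If another item happened to tie with the query item at similarity 1, dropping the first element would not be guaranteed to remove the query item. Filtering on the item ID would be a more robust alternative.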

The results are as follows:
