A Summary of the Netflix Prize

Source: Internet
Author: User
Tags: baseline, square root

PS: This content is translated from a Quora answer.


I'll try to summarize the main ideas here. Matrix factorization and ensemble methods may be the most widely discussed algorithms from the Netflix Prize, but a lot of other insights were used as well.

Normalization of Global Effects

Let's say Alice gives "Inception" 4 stars. We can think of this rating as being made up of the following parts:

    • A baseline rating (for example, the average rating over all users and movies might be 3.1 stars)
    • A user-specific effect (for example, Alice tends to rate lower than the average user, so her ratings are 0.5 stars lower than we would normally expect)
    • A movie-specific effect (for example, Inception is a great movie, so its ratings are 0.7 stars higher than we would normally expect)
    • The unpredictable interaction between Alice and Inception, which accounts for the rest of the rating (for example, Alice really loves Inception because of its combination of Leonardo DiCaprio and neuroscience, so she gives it an extra 0.7 stars).

In other words, we decompose the 4 stars as 4 = 3.1 (baseline) - 0.5 (Alice's user effect) + 0.7 (Inception's movie effect) + 0.7 (the interaction effect). So instead of having our models predict the 4 stars directly, we can first remove the effect of these baseline predictors (the first three parts) and then predict only the remaining 0.7. (I think you can also think of this as a simple form of boosting.)

More generally, other examples of baseline predictors are:

    • One factor: allow Alice's rating to depend (linearly) on the (square root of the) number of days since her first rating. (For example, have you noticed that you become a harsher critic over time?)
    • One factor: allow Alice's rating to depend on the number of days since the movie's first rating by anyone. (If you're one of the first people to watch it, maybe it's because you're a die-hard fan who couldn't wait to see it on DVD, so you'll tend to rate it higher.)
    • One factor: allow Alice's rating to depend on the total number of people who have rated Inception. (Maybe Alice is a hipster who dislikes whatever is popular.)
    • One factor: allow Alice's rating to depend on the movie's overall rating.
    • Plus a bunch of others.
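
As a rough illustration of the first three parts of this decomposition, here is a minimal sketch in Python. The toy data, shrinkage constant, and function names are all made up for illustration; this is not the prize-winning code.

    import numpy as np

    # Toy ratings: (user, movie, stars) triples -- purely illustrative data.
    ratings = [("alice", "inception", 4.0), ("alice", "titanic", 3.0),
               ("bob", "titanic", 5.0), ("bob", "avatar", 4.5)]

    global_mean = np.mean([r for _, _, r in ratings])   # the 3.1-style baseline

    def shrunk_bias(keyed_residuals, lam=5.0):
        """Average residual per key, shrunk toward 0 when a key has few ratings."""
        sums, counts = {}, {}
        for key, res in keyed_residuals:
            sums[key] = sums.get(key, 0.0) + res
            counts[key] = counts.get(key, 0) + 1
        return {key: sums[key] / (lam + counts[key]) for key in sums}

    movie_bias = shrunk_bias((m, r - global_mean) for _, m, r in ratings)
    user_bias = shrunk_bias((u, r - global_mean - movie_bias[m]) for u, m, r in ratings)

    def baseline(user, movie):
        return global_mean + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)

    # The residual r - baseline(user, movie) is the "interaction" part (the final
    # 0.7 in the example) that the fancier models below then try to predict.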

In fact, modeling these biases turned out to be quite important: in their final paper, Bell and Koren write:

"Out of the numerous new algorithmic contributions, I would like to highlight one: those humble baseline predictors (or biases), which capture the main effects in the data. While the literature mostly concentrates on the more sophisticated algorithmic aspects, we have learned that an accurate treatment of main effects is probably at least as significant as coming up with modeling breakthroughs."
PS: Why is removing these biases useful? Here is a more concrete example. Suppose we know that Bob and Alice like the same kinds of movies, and we want to predict Bob's rating of Inception. Instead of simply predicting the same 4 stars that Alice gave, if we know that Bob usually rates 0.3 stars above average, we can remove Alice's bias and add Bob's: 4 + 0.5 + 0.3 = 4.8.

Neighborhood Models

Let's move on to some slightly more complex models. As suggested in the section above, a standard approach in collaborative filtering is the neighborhood model. Neighborhood models work roughly as follows: to predict Alice's rating of Titanic, there are two things you can do.

    • Item-based approach: find the set of items similar to Titanic that Alice has also rated, and take a weighted average of Alice's ratings of them.
    • User-based approach: find the set of users similar to Alice who have also rated Titanic, and take the mean of their ratings of Titanic.

Taking the item-based approach as an example, the two main questions are:

    • How do we find the set of similar items?
    • How do we weight the ratings of these similar items?

The standard approach is to pick some similarity measure (such as correlation or the Jaccard index) to define the similarity between pairs of movies, use the K most similar movies under that measure (where K is perhaps chosen by cross-validation), and then compute a weighted average using the same similarity measure. But this approach has a number of problems:

    • Neighbors aren't independent, so using a standard similarity measure to define the weighted average double-counts information. For example, suppose you ask five friends where to eat tonight. Because three of them went to Mexico last week and are sick of burritos, they strongly advise against going to a taqueria (a Mexican fast-food restaurant). Their advice is much more biased than that of five friends who don't know each other at all. (By analogy, all three Lord of the Rings movies are likely neighbors of Harry Potter.)
    • Different movies should probably use different numbers of neighbors. Some movies can be predicted well by a single neighbor (for example, Harry Potter 2 can be predicted well by Harry Potter 1 alone), some movies need many neighbors, and some movies may have no good neighbors at all, in which case you should ignore the neighborhood algorithm entirely and let other rating models handle them.
So another approach is the following:

    • You can still use correlation or cosine similarity to choose the set of similar items.
    • But instead of using the similarity measure to define the interpolation weights in the average, you essentially learn the weights by performing a (sparse) linear regression that minimizes the squared error between an item's ratings and a linear combination of its neighbors' ratings, as sketched below. (A slightly more complicated user-based approach is also useful alongside this item-based neighborhood model.)
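
Here is a minimal sketch of this second variant: choose neighbors with a similarity measure, then learn the interpolation weights by a (ridge-regularized) least-squares fit. Everything here is illustrative; a real implementation would work on baseline-removed residuals and enforce sparsity in the weights.

    import numpy as np

    def fit_interpolation_weights(R, target_item, neighbor_items, ridge=1.0):
        """R: user x item matrix with np.nan for missing ratings (toy dense layout).
        Learn weights so that a linear combination of the neighbors' ratings
        approximates the target item's ratings in the least-squares sense."""
        cols = [target_item] + list(neighbor_items)
        complete = ~np.isnan(R[:, cols]).any(axis=1)   # users who rated all of these
        X = R[complete][:, list(neighbor_items)]       # neighbor ratings
        y = R[complete][:, target_item]                # matching target ratings
        A = X.T @ X + ridge * np.eye(X.shape[1])       # ridge keeps the weights stable
        return np.linalg.solve(A, X.T @ y)

    def predict(R, user, neighbor_items, weights):
        x = R[user, list(neighbor_items)]
        return float(np.nansum(weights * x))           # skip neighbors the user hasn't rated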

Implicit Data

On top of the neighborhood model, you can also let implicit data influence the predictions. A small fact such as "this user rated lots of science-fiction movies but never rated a western" suggests that the user likes sci-fi more than westerns, even before we look at any rating values. So, within a framework similar to the neighborhood rating model, we can learn offsets associated with Inception's neighbor movies. Whenever we want to predict Bob's rating of Inception, we check whether Bob rated each of Inception's neighbors at all. If he did, we add in the corresponding offset; if not, we add nothing (so Bob's rating is implicitly penalized by the missing offset).
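
A tiny sketch of the idea, in the spirit of the neighborhood-with-implicit-feedback models; in practice the offsets are learned jointly with the rest of the model, and the normalization and names here are only illustrative.

    import math

    def predict_with_implicit(baseline_score, neighbor_offsets, items_user_touched):
        """Add a learned offset for each neighbor movie the user has rated (or merely
        viewed) at all, regardless of the rating value.  Untouched neighbors add
        nothing, which implicitly penalizes the prediction."""
        touched = [j for j in neighbor_offsets if j in items_user_touched]
        norm = math.sqrt(max(len(touched), 1))         # dampen users who touch everything
        return baseline_score + sum(neighbor_offsets[j] for j in touched) / norm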

Matrix Factorization

Complementing the neighborhood approach to collaborative filtering is the matrix factorization approach. Whereas neighborhood methods are very local in the ratings they use (if you liked Harry Potter 1 then you'll like Harry Potter 2), factorization methods take a more global view (we know you like fantasy movies, and Harry Potter has strong fantasy elements, so you'll like Harry Potter): they decompose users and movies into sets of latent factors (which can be thought of as attributes like "fantasy" or "violence"). In fact, matrix factorization methods were probably the most important class of techniques for winning the Netflix Prize. In a 2008 paper, Bell and Koren write:

"It seems that models based on matrix factorization were found to be the most accurate (and thus the most popular), as is evident from recent publications and discussions on the Netflix Prize forum. We definitely agree, and would add that these matrix factorization models also offer the important flexibility needed for modeling temporal effects and the binary view. Nonetheless, neighborhood models, which have dominated most of the collaborative filtering literature, are still expected to be popular due to their practical characteristics: they can handle new users and ratings without retraining, and they offer direct explanations for their recommendations."

The typical way to find these factorizations is to perform something like a singular value decomposition on the (sparse) ratings matrix, using stochastic gradient descent and regularizing the factor weights (perhaps also constraining the weights to be nonnegative to get a kind of nonnegative matrix factorization). (Note that this "SVD" differs from the standard SVD taught in linear algebra, because not every user has rated every movie, so the ratings matrix has many missing entries that we don't simply treat as 0.)
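
A minimal sketch of this kind of stochastic-gradient factorization; the rank, learning rate, and regularization strength are illustrative values, not the winners' settings.

    import numpy as np

    def sgd_factorize(ratings, n_users, n_items, k=20, lr=0.01, reg=0.05, epochs=30):
        """ratings: list of (user_index, item_index, rating) for observed entries only.
        Returns latent factor matrices P (users) and Q (items)."""
        rng = np.random.default_rng(0)
        P = 0.1 * rng.standard_normal((n_users, k))
        Q = 0.1 * rng.standard_normal((n_items, k))
        for _ in range(epochs):
            for u, i, r in ratings:
                pu, qi = P[u].copy(), Q[i].copy()
                err = r - pu @ qi                      # error on this observed rating
                P[u] += lr * (err * qi - reg * pu)     # regularized gradient steps
                Q[i] += lr * (err * pu - reg * qi)
                # (clipping P and Q at zero here would give an NMF-like variant)
        return P, Q

    # Predicted rating for user u and item i (before adding baselines): P[u] @ Q[i]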

Some of the SVD-inspired methods used in the Netflix Prize include:

    • Standard SVD: once you have both users and movies represented as latent-factor vectors, you can dot Alice's vector with Inception's vector to get the corresponding predicted rating.
    • Asymmetric SVD: instead of each user having their own latent-factor vector, a user is represented by the set of items they have rated (or otherwise provided implicit feedback on). Alice is then expressed as a (possibly weighted) sum of the factor vectors of the items she has rated, and her predicted rating of Titanic is the dot product of that sum with Titanic's latent vector. From a practical point of view, this model has the extra advantage that it needs no per-user parameters, so as soon as a user generates any feedback (which can be simply viewing an item, not necessarily rating it), recommendations can be produced without retraining the model to include new user factors.
    • SVD++: represent users by both their own latent factors and the set of items they have rated; a combination of the two approaches above.
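
To make the differences concrete, here is a sketch of how the user representation varies across these three variants. Y is a second set of item factors used for the implicit representation; all names are illustrative.

    import numpy as np

    def user_vec_standard(P, u):
        """Standard SVD: every user has an explicit factor vector."""
        return P[u]

    def user_vec_asymmetric(Y, rated_items):
        """Asymmetric SVD: the user is a normalized sum of the factors of the items
        they have rated or viewed -- no per-user parameters, so new feedback can be
        used without retraining."""
        return Y[list(rated_items)].sum(axis=0) / np.sqrt(max(len(rated_items), 1))

    def user_vec_svdpp(P, Y, u, rated_items):
        """SVD++: combine both representations."""
        return P[u] + user_vec_asymmetric(Y, rated_items)

    # In each case the predicted rating is user_vec @ Q[i], plus the baseline terms.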

Regression

Some regression models were also used in the predictions. The models are fairly standard, I think, so I won't spend long on them here. Basically, just as with the neighborhood models, we can take a user-centric or a movie-centric approach to regression:

    • User-centric: we train a regression model for each user, using all of that user's ratings as the data set. The response is the user's rating of a movie, and the predictors are attributes of the movie (which can be derived from, say, PCA, MDS, or SVD).
    • Movie-centric: similarly, we can learn a regression for each movie, using all the users who rated that movie as the data set, as sketched below.
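
A small sketch of the movie-centric variant (the user-centric one is symmetric). The per-user features are assumed to come from something like the SVD factors above, and scikit-learn's Ridge is only a stand-in for whichever regularized regressor was actually used.

    import numpy as np
    from sklearn.linear_model import Ridge

    def fit_movie_model(user_features, movie_ratings):
        """One regression per movie: each row of user_features describes a user who
        rated this movie (e.g. their latent factors); movie_ratings holds that
        user's rating of it."""
        return Ridge(alpha=1.0).fit(np.asarray(user_features), np.asarray(movie_ratings))

    def predict_rating(model, user_feature_vector):
        return float(model.predict(np.asarray([user_feature_vector]))[0])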

Restricted Boltzmann Machines

Restricted Boltzmann machines provide another kind of latent-variable model that can be used. There is a paper describing how to apply them to the Netflix Prize (and if that paper is hard to read, gentler introductions to RBMs exist).

Temporal Effects

Many of the models incorporate temporal effects. For example, when describing the baseline predictors above, we used a few time-dependent predictors that allow a user's rating to depend (linearly) on the time since their first rating and on the time since the movie's first rating. We can also get finer-grained temporal effects: bin movies by time into, say, months, and allow the movie bias to change within each bin. (For example, perhaps in May 2006 Time magazine nominated Titanic as the best movie ever made, which caused a spike in its ratings.) In the matrix factorization approach, user factors were also allowed to be time-dependent. You can also give more weight to a user's recent behavior.
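
A sketch of the time-binned movie bias; the bin width and shrinkage constant are illustrative.

    from collections import defaultdict

    def time_binned_movie_bias(residuals, bin_days=30, lam=10.0):
        """residuals: (movie, day, residual) triples, where residual is the rating
        minus the static baseline and day counts days from the start of the data.
        Learns one extra bias per (movie, time bin), shrunk toward zero."""
        sums, counts = defaultdict(float), defaultdict(int)
        for movie, day, res in residuals:
            key = (movie, day // bin_days)
            sums[key] += res
            counts[key] += 1
        return {key: sums[key] / (lam + counts[key]) for key in sums}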

Regularization

Regularization was also applied throughout many of the learned models to prevent overfitting to the data set. Ridge regression was used heavily in the factorization models to penalize large weights, and lasso regression (although less effective) was useful as well. Many other parameters, such as the baseline predictors and the similarity and interpolation weights in the neighborhood models, were also estimated using very standard shrinkage techniques.
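
As one concrete example of such shrinkage (this is the standard form; lambda is a constant chosen by cross-validation), a movie bias can be estimated as

    b_i = \frac{\sum_{u \in R(i)} (r_{ui} - \mu)}{\lambda + |R(i)|}

where mu is the global mean and R(i) is the set of users who rated movie i; a movie with only a few ratings has its bias pulled toward zero rather than trusted outright.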

Ensemble Methods

Finally, how do we combine all these different algorithms into a single rating that exploits the strengths of each model? (Note that, as mentioned above, many of these models were not trained directly on the raw ratings but on the residuals of other models.) In the paper describing the final solution, the winners describe using a GBDT (gradient boosted decision tree) model to combine more than 500 models; earlier solutions used linear regression to combine the predictions.

Briefly, GBDTs fit a series of decision trees to the data sequentially, where each tree is asked to predict the errors made by the previous trees, and each is often trained on slightly different versions of the data. For a longer description of a similar technique, see the question linked from the original answer.
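
A hand-rolled sketch of that residual-fitting loop, using scikit-learn's DecisionTreeRegressor as the base learner. The tree depth, learning rate, and number of trees are illustrative, and the winners also varied the data seen by each tree.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_gbdt(X, y, n_trees=100, lr=0.1, max_depth=3):
        """X: one row per (user, movie) pair, e.g. the individual models' predictions
        plus meta-features such as rating counts.  y: the true ratings."""
        trees, pred = [], np.zeros(len(y), dtype=float)
        for _ in range(n_trees):
            residual = np.asarray(y) - pred            # what the ensemble still gets wrong
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
            trees.append(tree)
            pred += lr * tree.predict(X)               # shrink each tree's contribution
        return trees

    def gbdt_predict(trees, X, lr=0.1):
        return lr * sum(tree.predict(X) for tree in trees)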

Because GBDTs can naturally apply different methods to different slices of the data, we can add some predictors that help the trees find useful clusterings:

    • The number of movies each user rated
    • The number of users who rated each movie
    • The latent-factor vectors of users and movies
    • The hidden units of a restricted Boltzmann machine

For example, when Bell and Koren used an earlier blending method, they found that RBMs were more useful when a movie or user had few ratings, while matrix factorization methods were more useful when a movie or user had many ratings. (A chart from 2007 illustrating this blending effect accompanied the original answer.)


PS: I should stress, however, that such a huge number of models is not necessary to do well. A plot in the original answer shows RMSE as a function of the number of methods used: fewer than 50 methods would have been enough to reach the winning score (RMSE = 0.8712); blending the best three methods gives an RMSE below 0.8800, which was already good enough for the top 10; and even the single best model alone reaches the leaderboard at 0.8890. So while a lot of models were used to win the contest, a practical system can get most of the benefit from a few carefully selected ones.

Finally, it has been a while since I went through the Netflix Prize papers, and my memory and notes are sketchy, so corrections and suggestions are welcome.


For more discussion and exchange on machine learning and pattern recognition, please follow this blog and the Sina Weibo account songzi_tea.

