Original link: Netflix Recommendations: Beyond the 5 Stars (Part 1, Part 2)
Original authors: Xavier Amatriain and Justin Basilico
Translator: Big Kui
Overview
Netflix is a company that offers online video streaming and DVD rental services, and is also the sponsor of the famous Netflix Prize. Readers who want to learn more about Netflix may enjoy the following articles and news stories:
Netflix: From Traditional DVD Rental to a Spectacular Pivot to Streaming
The Minority That Fans Love: Behind Netflix's Success Are High Salaries, the Highest Standards, High Turnover, Stock Options, Unlimited Vacation, a Culture of Fear, and Cheap Boxed Lunches
In this blog post, the authors lift the veil on Netflix's most valuable asset: its recommendation system. The post is in two parts. In the first part, the authors describe the Netflix Prize's contribution to smart recommendations, the main modules of Netflix's recommendation service, and how that service meets the business needs of the site. The second part describes the data and models used by the system and discusses how offline machine learning experiments are combined with online A/B testing.
Part I
The Netflix Prize and the Recommendation System

In 2006 we launched the Netflix Prize, a machine learning and data mining competition built around the problem of movie rating prediction. The purpose of the competition was to find better ways to recommend titles to our users, which is the core task of our business model. We offered a $1 million award to the team that could improve the accuracy of our Cinematch system by 10%. To make the problem easy to measure and quantify, the evaluation metric we chose was the root mean squared error (RMSE) between the predicted rating and the true rating. The challenge was to beat Cinematch's RMSE of 0.9525 and push it down to 0.8572 or lower.

A year after the competition started, the Korbell team won the first Progress Prize with an 8.43% improvement. They spent more than 2,000 hours of effort blending 107 different algorithms, and they provided us with the source code. We analyzed the two most effective methods: matrix factorization (commonly called SVD) and Restricted Boltzmann Machines (RBM). SVD alone achieved an RMSE of 0.8914, RBM 0.8990, and a linear blend of the two reached 0.88. To apply these methods in our production system we had to overcome some limitations: the competition data set contained 100 million ratings while the live system has 5 billion, and the original designs did not account for users continuously generating new ratings. We eventually overcame these challenges, shipped both methods in the real product, and they have been running as part of the system ever since.

Readers who followed the competition may also be curious about the Grand Prize awarded two years later. The work of those last two years was indeed impressive, blending hundreds of predictive models to finally break the 0.8572 barrier. We evaluated some of the newest algorithms offline, but unfortunately the algorithms that won on the competition data set did not perform as well in the online system. Given the cost of implementing and deploying them, we ultimately did not put them into our production environment. At the same time, our focus had shifted from improving Netflix's personalized experience toward new areas, which we explain next.
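Before moving on, here is a minimal sketch (not Netflix's actual code) of the RMSE metric described above and of how two sets of predictions, such as those from an SVD model and an RBM, might be linearly blended; the arrays and the 60/40 blend weight are made up for the example.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error between predicted and true ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Toy data: true ratings and the predictions of two hypothetical models.
true_ratings = np.array([4, 3, 5, 2, 4], dtype=float)
svd_preds    = np.array([3.8, 3.2, 4.6, 2.5, 3.9])
rbm_preds    = np.array([4.2, 2.7, 4.4, 2.9, 4.3])

# A simple linear blend; in practice the weight would be tuned on held-out data.
blend = 0.6 * svd_preds + 0.4 * rbm_preds

print("SVD RMSE:  ", rmse(svd_preds, true_ratings))
print("RBM RMSE:  ", rmse(rbm_preds, true_ratings))
print("Blend RMSE:", rmse(blend, true_ratings))
```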
From US DVD Rental to Global Video Streaming
The focus of our recommendation algorithms has changed over the past few years because Netflix's business has changed. One year after the Netflix Prize, we launched our streaming service, which changed not only the way users interact with the system but also the data sources available to the recommendation algorithms. For the DVD rental business, the goal is to help users pick movies that will arrive in their mailbox over the next few days. Several days pass between choosing a movie and watching it, so feedback reaches the system slowly, and because the cost of swapping an unsatisfying title is relatively high, users tend to choose carefully. Streaming users, by contrast, can watch a movie immediately after choosing it, and may even watch several in a short period of time; we also know whether a user watched the whole movie or only part of it.

Another big change is that the service expanded from a single website to hundreds of different devices: integrations with the Roku player and the Xbox shipped in 2008, the Netflix streaming service landed on the iPhone a year later, and today Netflix can be found on all kinds of Android devices and on the latest Apple TV. Two years ago we launched in Canada, in 2011 we launched our service in 43 Latin American countries, and recently we arrived in the UK and Ireland. Today Netflix is available in 47 countries with a total of 23 million subscribers. Last quarter those users watched 2 billion hours of video on hundreds of different devices. Every day 2 million movies and TV shows are viewed and 4 million ratings are added. We have brought personalization to all of these new scenarios, and today 75% of video viewing comes from some form of recommendation. These results come from a continuously optimized user experience, and user satisfaction has improved significantly through algorithmic optimization. Below we describe some of the techniques and algorithms behind the recommendation system.
Recommendations Everywhere
After several years of iteration, we have found that maximizing the personalization of the recommender system creates great value for Netflix subscribers. The personalized home page is organized as rows of videos, each row with a theme that reveals what links the videos in that row together. Most of the personalization lies in how rows are built: which rows to show, which videos to put in each row, and how to order them. Take the Top 10 row as an example: it means "we suspect you are most likely to like these ten titles." Of course, when we say "you" we also include your household. It is worth noting that Netflix personalization is done at the household level, and different members of a household may have different interests. That is why, when we choose the ten videos in that row, we try to include recommendations for "Dad," "Mom," "the kids," or the whole family. Even when there is only one person in the household, we still want to cover that person's different interests and moods. For this reason, the goal of our system is not only accuracy but also the diversity of the results.
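To make the diversity goal concrete, here is a small sketch of one common way to trade relevance against diversity when filling a row, a greedy maximal-marginal-relevance style re-ranking; the scores, the item-similarity function, and the 0.7 trade-off weight are illustrative assumptions, not Netflix's actual method.

```python
def diversify_row(candidates, scores, similarity, k=10, trade_off=0.7):
    """Greedily pick k items, balancing relevance against similarity
    to items already chosen (an MMR-style re-ranking)."""
    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < k:
        def marginal_gain(item):
            max_sim = max((similarity(item, c) for c in chosen), default=0.0)
            return trade_off * scores[item] - (1 - trade_off) * max_sim
        best = max(remaining, key=marginal_gain)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy usage: three comedies and one documentary with made-up scores.
scores = {"comedy_a": 0.9, "comedy_b": 0.88, "comedy_c": 0.85, "documentary": 0.7}
genre = {"comedy_a": "comedy", "comedy_b": "comedy",
         "comedy_c": "comedy", "documentary": "documentary"}
sim = lambda a, b: 1.0 if genre[a] == genre[b] else 0.0

print(diversify_row(list(scores), scores, sim, k=3))
# The documentary is promoted above two of the comedies despite its lower score.
```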
Another important element of Netflix personalization is awareness. We want users to know how we are adapting to their tastes. This not only builds trust in the system, it also encourages users to give us more feedback that helps our recommendations improve. Another way to build trust in a personalized system is to explain why we recommend a particular movie or show: not because it serves our business needs, but because of the information we have from the user (ratings, viewing history, reviews the user has written, and so on).
There are also friend-based recommendations: we recently released Facebook Connect in 46 of the 47 countries where we operate; the exception is the United States, because of the VPPA (Video Privacy Protection Act, 1988). By knowing what friends are watching, we not only gain another data source for our algorithms, we can also generate new rows of recommendations with the theme of the user's social circle.
One of the most impressive parts of our service is the genre rows. These range from broad categories such as "Comedies" to very long-tail niches such as "Time Travel Dramas." Presenting each row involves three decisions: which genre to pick, which videos within that genre to show, and how to order them. Users pay a lot of attention to this module: when we placed long-tail categories higher on the page, we measured a significant increase in user dwell time. Novelty and diversity are also factors we consider when choosing the videos.
The recommendations in each row are based partly on implicit feedback (recent viewing, ratings, and other interactions) and partly on explicit feedback, which we collect by inviting users to take a taste-preference survey.
Similarity-based recommendation is another aspect of our personalized service. Similarity is a very broad concept: it can be computed between movies or between users, and it can be based on ratings, video metadata, and so on. These similarity computations are also used by other modules. Similarity-based recommendations appear in many scenarios, for example when a user searches for a movie or adds one to their list, and they can also be used to generate "dynamic genre" rows based on a video the user watched recently.
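As a rough illustration of one common way to compute item similarity (not necessarily the method Netflix uses), the sketch below builds an item-item cosine similarity from a small user-by-video rating matrix; the matrix values are made up.

```python
import numpy as np

# Toy user-by-video rating matrix; rows are users, columns are videos, 0 = unrated.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def item_cosine_similarity(R):
    """Cosine similarity between the columns (videos) of a rating matrix."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0           # guard against division by zero
    normalized = R / norms
    return normalized.T @ normalized

sim = item_cosine_similarity(ratings)
most_similar = int(np.argsort(-sim[0])[1])   # index 0 is video 0 itself
print(most_similar)                          # -> 1
```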
All of the scenarios above, including the Top 10 row, genre rows, and similarity-based recommendations, rely on a ranking algorithm, which is a key step in producing effective recommendations. The goal of the ranking system is to surface, for each scenario, the videos the user is most interested in. We decompose the ranking system into several parts: scoring, ranking, and filtering. Our business goal is to maximize user satisfaction and monthly subscription retention, which in practice amounts to maximizing the amount of video users watch. So we recommend to users the videos with the highest scores.
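A minimal sketch of that score, rank, and filter decomposition might look like the following; the scoring function and the already-watched filter are placeholders for illustration.

```python
def recommend(user, candidates, score_fn, already_watched, k=10):
    """Score candidates, filter out titles the user has seen, and rank."""
    # Scoring: assign each candidate a personalized score.
    scored = {video: score_fn(user, video) for video in candidates}
    # Filtering: drop videos the user has already watched.
    eligible = {v: s for v, s in scored.items() if v not in already_watched}
    # Ranking: highest score first, keep the top k.
    return sorted(eligible, key=eligible.get, reverse=True)[:k]

# Toy usage with a made-up scoring function.
score_fn = lambda user, video: {"a": 0.9, "b": 0.4, "c": 0.7}[video]
print(recommend("user_1", ["a", "b", "c"], score_fn, already_watched={"a"}, k=2))
# -> ['c', 'b']
```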
We now know that the rating prediction task of the Netflix Prize is just one of many components of an effective recommender system. We also have to take into account user context, video popularity, novelty, diversity, user interests, and explainability. To balance all of these elements, we need to choose the right algorithms. In the next part we discuss the ranking problem in detail, along with our data, our models, and the innovations we have made to meet these needs.
Part II
In the first part we described the various components of the Netflix recommendation system in detail. We also explained how the system has evolved over time, starting with the Netflix Prize. The $1 million we paid out returned far more than algorithmic innovation: it raised the value of our brand and attracted top talent to join us. Rating prediction is only one feature of our world-class recommender system; in the sections that follow we cover a broader range of personalization techniques and discuss our models, our data, and our innovative approach in this area.
Ranking Algorithms

The purpose of a recommender system is to present the user with a set of attractive items to choose from. This is done by selecting candidate items and ranking them by how interesting they are likely to be to the user. The most common way to show recommendations is as some kind of ordered list; at Netflix, each list is a row of videos. We therefore need a suitable ranking method that combines many kinds of information to build a personalized list for each user.

The most obvious ranking method is to order items by popularity. The reason for choosing popularity as the baseline is equally obvious: users tend to watch what everyone else likes. However, popularity-based recommendation is the opposite of personalization; it produces the same results for every user. Our goal is therefore to find personalized ranking algorithms that beat the popularity baseline and match the different tastes of different users. Since we want to recommend the videos a user is most likely to watch, the most natural approach is to replace popularity with the predicted value of the user's rating for each video. But this has a problem too: a user may give high ratings to niche films while actually preferring to watch popular movies that they would rate less highly. The best approach is therefore to take both the popularity of the video and the user's predicted rating into account.
There are many ways to design a ranking system, including pointwise scoring, pairwise optimization, and global (listwise) optimization. For example, we can design a simple scoring function as a linear weighting of the video's popularity and the user's predicted rating: f(u,v) = w1 * p(v) + w2 * r(u,v) + b, where u is the user, v is the video, p is the popularity function, and r is the predicted rating. This scoring function can be pictured in the two-dimensional space spanned by popularity and predicted rating. Once the scoring function is defined, we can take a set of videos and order them from highest score to lowest.

You may be wondering how we choose the values of w1 and w2, in other words, how we decide whether popularity matters more or the predicted rating matters more. There are at least two ways to answer this. We could simply pick some candidate values for w1 and w2 and run them through online A/B tests; this is time-consuming, but the cost is acceptable. The other approach is machine learning: select positive and negative examples from historical data, design an objective function, and let a learning algorithm fit the weights w1 and w2 automatically. This formulation is known as "learning to rank" and is widely used in search engines and ad targeting. There is one important difference in the ranking task of a recommender system, however: personalization. We do not want a single global w1 and w2; we want values tailored to each user. As you might expect, beyond popularity and predicted rating we have tried many other features in the Netflix ranking system; some did not help, while others significantly improved ranking accuracy. Adding features and optimizing the machine learning objective function is how we keep improving ranking performance.
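As a concrete sketch of the linear scoring function above, with made-up weights and candidate data chosen only for illustration:

```python
def score(popularity, predicted_rating, w1=0.4, w2=0.6, b=0.0):
    """Linear score f(u, v) = w1 * p(v) + w2 * r(u, v) + b."""
    return w1 * popularity + w2 * predicted_rating + b

# Toy candidates: (video, normalized popularity, predicted rating on a 0-1 scale).
candidates = [
    ("blockbuster",  0.95, 0.60),
    ("niche_drama",  0.20, 0.90),
    ("new_release",  0.70, 0.75),
]

ranked = sorted(candidates, key=lambda c: score(c[1], c[2]), reverse=True)
for video, p, r in ranked:
    print(video, round(score(p, r), 3))
```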
Many classification algorithms can be used in a ranking system, for example logistic regression, support vector machines, neural networks, decision trees, and Gradient Boosted Decision Trees (GBDT). In recent years many algorithms have also been developed specifically for learning to rank, such as RankSVM and RankBoost. For a given ranking problem, finding the best algorithm is not easy. In general, the simpler your features are, the simpler the model can be. One thing to keep in mind is that sometimes a feature appears useless simply because the model you chose cannot exploit it, and a good model may perform poorly in the system because the features you feed it do not suit it.
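To illustrate the learning-to-rank idea in its simplest pointwise form (a sketch, not Netflix's actual model), the example below fits a logistic regression on hypothetical (popularity, predicted rating) features labeled by whether the user played the title, then ranks candidates by the model's play probability; it assumes scikit-learn is available.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [popularity, predicted_rating] per impression,
# labeled 1 if the user played the title and 0 otherwise.
X = np.array([
    [0.9, 0.5], [0.8, 0.4], [0.3, 0.9], [0.2, 0.8],
    [0.7, 0.7], [0.1, 0.3], [0.6, 0.2], [0.4, 0.6],
])
y = np.array([1, 1, 1, 0, 1, 0, 0, 1])

model = LogisticRegression().fit(X, y)
print("learned weights (w1, w2):", model.coef_[0])

# Rank unseen candidates by predicted play probability.
candidates = np.array([[0.95, 0.40], [0.30, 0.95], [0.60, 0.70]])
probs = model.predict_proba(candidates)[:, 1]
order = np.argsort(-probs)
print("ranking (best first):", order, "with probabilities", probs[order])
```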
Data and Models

Good data and the right models are the foundation of the ranking algorithms that build a complete personalized experience for our users. Fortunately, at Netflix we have a lot of relevant data, and many talented engineers who can turn data and features into products. These are the main data sources used by our recommender system:
- We have billions of user ratings, growing by millions per day.
- We use video popularity as our algorithmic baseline, and there are many ways to compute it: over different time windows, such as the last hour, day, or week, or within different groups of users, for example by region (see the sketch after this list).
- Our system records millions of plays per day, each with rich context such as watch duration, time of day, and device type.
- Our users add millions of videos to their lists every day.
- Each video carries metadata: actors, director, genre, rating, and reviews.
- Video presentation data: we know when, where, and in what position a recommendation was shown, so we can infer how these factors affect the user's choice. We can also observe the details of how the user interacts with the page: scrolling, mouse hovering, clicks, and dwell time.
- Social data has recently become another source: we can know what videos a user's friends are watching.
- Our users issue millions of search requests per day.
- All of the data above comes from our own systems, but we can also use external data such as box office results and critics' reviews.
- And that is not all: demographics, location, language, and temporal data can also be used to predict user interest.
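As an example of the popularity computations mentioned above (a simplified sketch over invented play-log records, not Netflix's pipeline), one might count plays per video within a time window and per region like this:

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical play log: (video_id, region, timestamp).
plays = [
    ("video_a", "US", datetime(2012, 4, 1, 20, 15)),
    ("video_a", "UK", datetime(2012, 4, 1, 21, 5)),
    ("video_b", "US", datetime(2012, 4, 1, 22, 40)),
    ("video_a", "US", datetime(2012, 3, 25, 19, 0)),
]

def popularity(plays, since, region=None):
    """Play counts per video within a time window, optionally limited to one region."""
    counts = Counter()
    for video, reg, ts in plays:
        if ts >= since and (region is None or reg == region):
            counts[video] += 1
    return counts

now = datetime(2012, 4, 2)
print(popularity(plays, since=now - timedelta(days=1)))                # last day, all regions
print(popularity(plays, since=now - timedelta(days=7), region="US"))   # last week, US only
```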
With the data in hand, which models should we choose? We have found that with this much high-quality data a single model is not enough; we have to do model selection, training, and testing. We use many kinds of machine learning algorithms, from unsupervised methods such as clustering to supervised classification. For readers interested in machine learning for recommendation, here is an incomplete list of the methods we use:
- Linear Regression
- Logistic Regression
- Elastic Nets
- Singular Value Decomposition (SVD)
- Restricted Boltzmann Machines (RBM)
- Markov Chains
- Latent Dirichlet Allocation (LDA)
- Association Rules
- Gradient Boosted Decision Trees (GBDT)
- Random Forests
- Clustering methods, from simple k-means to graphical models such as Affinity Propagation
- Matrix Factorization
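As one small, self-contained illustration of the last item on the list, here is a matrix factorization fit by stochastic gradient descent on a toy rating matrix; the learning rate, regularization, and factor count are arbitrary choices for the example, not production settings.

```python
import numpy as np

# Toy observed ratings: (user, item, rating).
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 4), (2, 2, 5)]
n_users, n_items, n_factors = 3, 3, 2

rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((n_users, n_factors))   # user factors
V = 0.1 * rng.standard_normal((n_items, n_factors))   # item factors

lr, reg = 0.05, 0.02
for epoch in range(200):
    for u, i, r in ratings:
        err = r - U[u] @ V[i]                   # prediction error on this rating
        U[u] += lr * (err * V[i] - reg * U[u])  # gradient step on user factors
        V[i] += lr * (err * U[u] - reg * V[i])  # gradient step on item factors

# Predict an unobserved rating, e.g. user 0 on item 2.
print(round(float(U[0] @ V[2]), 2))
```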
Consumer Data Science

Abundant data, metrics, and experimental results allow us to develop our product in a data-driven way. From the very beginning this approach has been part of Netflix's DNA; we call it Consumer Data Science. Broadly speaking, its goal is to serve our members better through continuous innovation. The only real failure is failing to innovate. As Thomas Watson Sr., founder of IBM, said: "If you want to increase your success rate, double your failure rate."

Our culture of innovation requires that we be able to test ideas quickly and efficiently through experiments; only once an experiment is complete do we understand why an idea succeeded or failed. This lets us focus on improving the user experience instead of wasting time on ideas that do not work. How do we put this into practice? Unlike traditional scientific research, we validate ideas with online split tests (A/B testing, also called bucket testing):
1. Form a hypothesis
- The algorithm/feature/design X to be tested will increase video play time and user dwell time.
2. Design the experiment
- Develop a solution or a prototype. The final implementation may be up to twice as effective as the prototype, but not ten times.
- Consider dependencies on other systems, operational issues, and the significance of the test.
3. Run the test
4. Let the data speak for itself
When we run an A/B test, we record metrics along many dimensions, but the ones we trust most are hours of video played and user dwell time. Each test typically covers thousands of users and is split into 2 to 20 cells in order to test different facets of the idea. We usually run multiple A/B tests in parallel, which lets us try bold ideas, validate several ideas at once, and, most importantly, let data drive our decisions. For a detailed introduction to our A/B testing, see our technical blog and the answers our Chief Product Officer, Neil Hunt, has given on Quora.
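A minimal sketch of how one might compare a test cell against control on a metric such as hours watched follows; the sample data is fabricated, and a real analysis would also account for sample size, variance, and multiple comparisons.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical hours watched per user over the test period.
control = rng.normal(loc=1.00, scale=0.5, size=5000)   # current algorithm
cell_b  = rng.normal(loc=1.04, scale=0.5, size=5000)   # new algorithm

t_stat, p_value = stats.ttest_ind(cell_b, control)
lift = cell_b.mean() / control.mean() - 1.0

print(f"lift: {lift:.2%}, p-value: {p_value:.4f}")
# A small p-value suggests the observed lift is unlikely to be pure noise.
```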
An interesting challenge we face is how to integrate our machine learning algorithms into this data-driven A/B testing culture. Our answer is to combine offline testing with online testing: offline experiments let us optimize and validate an algorithm before it goes to an online test. To measure offline performance we use many metrics from the machine learning literature: ranking metrics such as NDCG (normalized discounted cumulative gain), mean reciprocal rank, and fraction of concordant pairs; classification metrics such as accuracy, precision, recall, and F-score; the RMSE made famous by the Netflix Prize; and less common metrics such as diversity. We track how these offline metrics correlate with online results and find that the trends do not always agree, so offline metrics serve only as a reference for the final decision.
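For reference, here is a small sketch of NDCG, one of the offline ranking metrics mentioned above; the relevance scores in the example are invented.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances, k=None):
    """DCG normalized by the DCG of the ideal (perfectly sorted) ranking."""
    k = k or len(ranked_relevances)
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of the items our ranker placed at positions 1..5 (toy values).
print(round(ndcg([3, 2, 3, 0, 1], k=5), 3))   # close to 1.0 means a near-ideal ordering
```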
Once offline testing validates a hypothesis, we are ready to design and launch an A/B test so that the hypothesis can be further validated against real user feedback. If it passes that step, we roll it into the main system and serve it to all users. This is the full innovation cycle.
A striking example of this innovation cycle is what we call the Top 10 Marathon. This was a highly focused, high-intensity ten-week effort to quickly test dozens of algorithm ideas for improving the Top 10 row. Different teams and individuals were invited to contribute ideas and implement them in code. Every week roughly six different algorithms were pushed into online A/B tests, and offline and online metrics were evaluated continuously. The algorithms that performed well became part of our recommendation system.
Conclusion

The Netflix Prize framed the recommendation task as rating prediction, but ratings are just one of the many data sources available to the system, and rating prediction is only one part of our solution. Over the past few years we have redefined the task as maximizing the probability that users choose a video, watch it, enjoy our service, and come back. More data leads to better results, but to get there we must keep optimizing our methods, evaluate them sensibly, and iterate quickly. Building a leading personalization platform takes more than our own research; the system still has plenty of room to improve. At Netflix we are passionate about choosing and watching movies and shows, and we turn that passion into strong intuitions about how to improve the system: deeper analysis of the data, better features, better models and baselines, and a clear view of where the current system falls short. We use data mining and experimentation to validate those intuitions and to prioritize them. As in any scientific endeavor, luck matters, but as the saying goes, chance favors the prepared mind. In the end it is our users who judge the recommendation system; after all, our goal is to improve their experience on Netflix.