These notes summarize the general idea from the sources listed below; many details of the paper itself still deserve a careful read.
1. Introduction
The three big challenges in YouTube video recommendation:
(1) Scale: the video corpus and user base number in the hundreds of millions, so algorithms that work on small problems often fail here.
(2) Freshness: many new videos are uploaded every second, so the system must account for users' real-time behavior and recommend newly uploaded videos, balancing fresh videos against well-established ones (exploration vs. exploitation).
(3) Noise: user history is sparse and driven by many hidden factors that are hard to observe, and the video data itself is unstructured, so the recommendation algorithm needs good robustness.
2. System Overview
This is the familiar recall-then-ranking pipeline. The candidate set fed to ranking can include, besides the DNN-generated candidates, candidates obtained by other methods (these sources are collectively called recalls).
Both the recall and ranking stages are trained as DNNs. The recall stage narrows hundreds of millions of videos down to a few hundred candidates; the ranking stage scores those hundreds of videos and displays the top ones to the user.
3. Recall
Li Hongyi's (Hung-yi Lee's) course introduced the idea that matrix factorization can be viewed as a simple neural network; MF is also one of the earliest techniques applied to recommendation. The DNN used here can be seen as a nonlinear generalization of MF.
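Concretely (this is standard MF, not specific to the paper): the score for a user-item pair is an inner product of two learned vectors, i.e., one embedding lookup per side followed by a dot product; the DNN replaces the fixed dot product with learned nonlinear layers.

```latex
\hat{r}_{ui} = p_u^\top q_i
```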
The recall stage is cast as an extreme multi-class classification problem: each video is one class. Given a user U and context C, the task is to predict which video w_t from the corpus V will be watched at time t.
Each video j in V gets an embedding vector v_j, and each (user, context) pair gets an embedding vector u, so the task reduces to a multi-class softmax.
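Written out as in the paper, the probability that user U in context C watches video i at time t is a softmax over inner products:

```latex
P(w_t = i \mid U, C) = \frac{e^{v_i^\top u}}{\sum_{j \in V} e^{v_j^\top u}}
```

where u ∈ R^N is the (user, context) embedding and the v_j ∈ R^N are the video embeddings.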
Here, "embedding" means mapping a discrete variable to a dense vector.
Each video's embedding can be obtained by running its text through an embedding tool (e.g., word2vec; TensorFlow also ships with one), while the user embedding is learned during training.
The goal of the DNN is to learn the user embedding u, as a function of the user's history and context, such that the softmax classifier accurately predicts the next video.
Recall network structure:
Main feature processing (a NumPy sketch follows this list):
(1) Watched videos: embed each video id, then average the embeddings.
(2) Search queries: embed each token, then average the embeddings.
(3) User profile features: geographic location, device, gender, age, login state, etc.; continuous and discrete features are normalized to [0,1] and concatenated with the watch vector and search vector.
(4) "Example age": the age of the training example (roughly, how long after upload the log was collected); at serving time this feature is set to zero or slightly negative.
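A minimal NumPy sketch of the input assembly above. The table sizes, id lists, and profile values are toy placeholders; in the real system the embedding tables are trained jointly with the DNN:

```python
import numpy as np

EMB_DIM = 256  # embedding width; the paper uses 256-d video embeddings

def average_embeddings(ids, table):
    """Average a variable-length list of embedding rows into one fixed vector."""
    if not ids:
        return np.zeros(table.shape[1])
    return table[ids].mean(axis=0)

# Toy lookup tables; in the real system these are trained jointly with the DNN.
rng = np.random.default_rng(0)
video_emb = rng.normal(size=(1000, EMB_DIM))  # one row per video id
token_emb = rng.normal(size=(500, EMB_DIM))   # one row per search token

watch_vector = average_embeddings([3, 17, 42], video_emb)   # (1) watched videos
search_vector = average_embeddings([7, 99], token_emb)      # (2) query tokens

# (3) profile/context features, each normalized to [0, 1]
profile = np.array([0.2, 1.0, 0.0, 0.31, 1.0])  # geo, device, gender, age, logged-in

# (4) example age; set to ~0 at serving time so the model predicts at the
# "end" of the training window
example_age = np.array([0.5])

dnn_input = np.concatenate([watch_vector, search_vector, profile, example_age])
print(dnn_input.shape)  # (518,) = 256 + 256 + 5 + 1
```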
Sample and context selection (points (2) and (4) are sketched after the list):
(1) Training examples are drawn not only from impressions and clicks on the recommendation surface but also from other (non-recommendation) surfaces, so new user interests are captured quickly.
(2) A fixed number of training examples is generated per user, which treats users fairly and prevents highly active users from dominating the loss function.
(3) Counter-intuitively, sequence information is discarded: each search token is simply embedded and the embeddings are averaged, so the most recent search does not dominate the next recommendations.
(4) Asymmetric co-watch probabilities: users consume videos in an asymmetric pattern; early in a viewing sequence the range is broad, and it gradually narrows later on. Most collaborative filtering algorithms build training examples from a user's full watch history (predicting a randomly held-out watch). The authors' improvement is to predict the next watch using only the watches that precede it, never future ones.
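A minimal sketch of points (2) and (4): build (history, next-watch) pairs using only past actions as input, then cap the number of examples per user. The cap value and helper name are made up for illustration:

```python
import random

MAX_EXAMPLES_PER_USER = 50  # illustrative cap; the paper fixes examples per user

def make_examples(watch_sequence):
    """Build (history, next-watch) pairs from one user's chronological watches.

    Only actions *before* the label are used as input, never the full
    (future-inclusive) history that a randomly held-out watch would see.
    """
    examples = [(watch_sequence[:t], watch_sequence[t])
                for t in range(1, len(watch_sequence))]
    if len(examples) > MAX_EXAMPLES_PER_USER:
        examples = random.sample(examples, MAX_EXAMPLES_PER_USER)  # per-user cap
    return examples

print(make_examples(["v1", "v2", "v3", "v4"]))
# [(['v1'], 'v2'), (['v1', 'v2'], 'v3'), (['v1', 'v2', 'v3'], 'v4')]
```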
At serving time, the trained DNN produces the user & context embedding u for the request, which is scored by inner product against the vectors v_j of every video in the corpus to retrieve the top few hundred candidates. A full inner-product scan over the whole corpus is too expensive, so the paper uses a fast approximate nearest-neighbor (hashing-based) search.
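A brute-force version of the serving-time lookup, for clarity only; the paper replaces the full scan with an approximate nearest-neighbor search:

```python
import numpy as np

def top_n_candidates(user_vector, video_matrix, n=100):
    """Score every video by inner product with u and keep the n best.

    This is the brute-force O(|V|) version; production systems replace it
    with an approximate nearest-neighbor / hashing index to avoid the scan.
    """
    scores = video_matrix @ user_vector          # one dot product per video
    top = np.argpartition(-scores, n)[:n]        # unordered top-n indices
    return top[np.argsort(-scores[top])]         # sorted by score, best first

rng = np.random.default_rng(0)
videos = rng.normal(size=(100_000, 256))  # toy stand-in for the video corpus
u = rng.normal(size=256)                  # user/context embedding from the DNN
print(top_n_candidates(u, videos, n=5))
```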
4. Ranking
The ranking DNN has a similar architecture to the recall DNN, with more features added to the input.
Categorical features, such as video IDs, are embedded. For features with very high cardinality, only the top-N most frequently clicked items get real embeddings; the remaining ids map to the zero vector. Multivalent features such as "past clicks" are handled as in the recall stage, by averaging the embeddings. Also notable: features drawn from the same ID space share one underlying embedding table (e.g., "past watched video ids" and the "seed/impression video id"), which greatly speeds up training, although the input layer still keeps a separate slot for each feature (the sketch below illustrates this sharing).
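A sketch of the ID-feature handling: a top-N vocabulary with out-of-vocabulary ids pinned to a zero row, averaging for multivalent features, and one shared table feeding several input slots. Sizes and ids are illustrative:

```python
import numpy as np

VOCAB_SIZE = 10_000  # only the top-N most frequently clicked ids get embeddings
EMB_DIM = 32
OOV = 0              # everything outside the vocabulary maps to a zero row

# One shared table for every feature drawn from the video-id space
# ("past watched ids", "seed/impression video id", ...).
shared_video_table = np.random.randn(VOCAB_SIZE, EMB_DIM)
shared_video_table[OOV] = 0.0

def embed_id(vocab, video_id):
    """Single-valued id feature; out-of-vocabulary ids get the zero vector."""
    return shared_video_table[vocab.get(video_id, OOV)]

def embed_multivalent(vocab, video_ids):
    """Multivalent features like 'past clicks' are averaged, as in recall."""
    if not video_ids:
        return np.zeros(EMB_DIM)
    return np.mean([embed_id(vocab, v) for v in video_ids], axis=0)

vocab = {"vid_a": 1, "vid_b": 2}  # toy top-N vocabulary
seed = embed_id(vocab, "vid_a")                               # one input slot
past = embed_multivalent(vocab, ["vid_a", "vid_b", "vid_x"])  # another slot
# Both slots index the same table, so a gradient through either feature
# updates the one shared embedding, yet the concatenated input layer
# still reserves a separate slot per feature.
x = np.concatenate([seed, past])
print(x.shape)  # (64,)
```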
NNs are sensitive to the scale of their inputs (unlike tree models, which are largely scale-invariant), so continuous features are normalized via an integral over their cumulative distribution, sketched below.
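A sketch of that integral normalization, approximated with the feature's empirical CDF over training data; the paper additionally feeds powers of the normalized value so the network can easily learn super- and sub-linear responses:

```python
import numpy as np

def cdf_normalize(train_values, x):
    """Map x to its empirical CDF value in [0, 1].

    Approximates the paper's integral normalization over the feature's
    distribution using the sorted training values.
    """
    train_values = np.sort(train_values)
    rank = np.searchsorted(train_values, x, side="right")
    return rank / len(train_values)

train = np.random.default_rng(0).exponential(size=10_000)  # a skewed toy feature
x_tilde = cdf_normalize(train, 2.5)
# Simple powers of the normalized value are fed alongside it:
features = [x_tilde, x_tilde ** 2, np.sqrt(x_tilde)]
print(features)
```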
Unlike the usual CTR objective, the ranking task here models the user's expected watch time. Clicked impressions are positive examples weighted by their watch time; unclicked impressions are negative examples with weight 1. During training, the final layer of the network is a weighted logistic regression; once its parameters are trained, the expected watch time is obtained at serving time from the exponential of the logit.
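Why the exponential of the logit gives watch time (sketch, assuming the click probability P is small): with these weights, the odds learned by the logistic regression are the total positive watch time over the number of negatives,

```latex
\text{odds} = \frac{\sum_i T_i}{N - k} \approx \mathbb{E}[T]\,(1 + P) \approx \mathbb{E}[T],
\qquad \hat{T} = e^{Wx + b}
```

where N is the number of impressions, k the number of clicks, and T_i the watch time of the i-th clicked impression.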
Reference:
https://zhuanlan.zhihu.com/p/25343518
http://www.jianshu.com/p/19ef129fdde2
http://blog.csdn.net/xiongjiezk/article/details/73445835
https://www.zhihu.com/question/20829671
http://www.jianshu.com/p/c5b8268d273b