1. Introduction:
YouTube's recommendation challenges:
Scale: many algorithms that work well on small datasets are useless at YouTube's scale;
Freshness: the system must be responsive to newly uploaded videos;
Noise: explicit user feedback is sparse and implicit feedback is noisy; metadata is poorly structured
2. System overview: skipped in these notes.
3. Candidate Generation:
The previous model was based on matrix factorization; the first layers of the YouTube DL model use a neural network to mimic this factorization, so the model can be viewed as a nonlinear generalization of factorization techniques.
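A tiny sketch of that contrast, with hypothetical random weights and sizes: factorization scores a user-video pair with a fixed dot product, while the network learns a nonlinear scoring function of the same embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32                                            # embedding size (illustrative)
u = rng.normal(size=D)                            # user factor / embedding
v = rng.normal(size=D)                            # video factor / embedding

# Matrix factorization scores with a fixed bilinear form (dot product):
mf_score = u @ v

# The network instead learns a nonlinear function of the same embeddings;
# one hidden ReLU layer shown here:
W1 = 0.1 * rng.normal(size=(64, 2 * D))
w2 = 0.1 * rng.normal(size=64)
h = np.maximum(W1 @ np.concatenate([u, v]), 0.0)  # ReLU
nn_score = w2 @ h
```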
3.1. Recommendation as extreme multiclass classification:
Training a softmax over millions of videos requires candidate sampling with importance-weight corrections (in the spirit of NCE). The paper reports that hierarchical softmax did not match the accuracy of sampled softmax; their explanation is that traversing tree nodes grouping unrelated videos hurts performance. A minimal sketch of the sampled loss follows.
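This NumPy sketch shows the candidate-sampling idea only: score the true class plus a handful of sampled negatives instead of the full catalog. The uniform negative sampler and all sizes are hypothetical; the paper's actual sampler and its correction term are exactly the open question noted at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
num_videos, D, num_neg = 100_000, 32, 100          # millions in production

video_emb = 0.01 * rng.normal(size=(num_videos, D))
user_vec = rng.normal(size=D)
label = 42                                         # class id of the watched video

# Score only the true class plus a sample of negatives, not all videos:
neg = rng.choice(num_videos, size=num_neg, replace=False)
neg = neg[neg != label]                            # drop an accidental collision
classes = np.concatenate([[label], neg])
logits = video_emb[classes] @ user_vec             # (1 + num_neg,) logits

# Cross-entropy over the sampled classes approximates the full softmax;
# a real trainer would also correct the logits for the sampling distribution.
m = logits.max()
loss = -(logits[0] - (m + np.log(np.exp(logits - m).sum())))
```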
At serving time, scoring every video online is too expensive; instead a nearest-neighbor search over the video embeddings retrieves the top candidates for the user vector (sketched below).
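A brute-force stand-in for that lookup, assuming dot product as the similarity; the production system uses an approximate search index, whose implementation is another open question below.

```python
import numpy as np

# Brute-force stand-in for the nearest-neighbor service: score the user
# vector against all video embeddings by dot product and return the N best.
def top_n_candidates(user_vec, video_emb, n=10):
    scores = video_emb @ user_vec
    top = np.argpartition(-scores, n)[:n]          # unordered top n
    return top[np.argsort(-scores[top])]           # sorted best-first
```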
3.2. Architecture:
The user's watch history and search history are each mapped to embeddings (word2vec-style) and used as input, followed by several fully connected layers with ReLU activations.
Note: a user watches many videos, each with its own embedding vector; averaging them into a single user vector gave the best results.
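A sketch of that input construction; the embedding tables, weights, and dimensions are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32
watch_emb = rng.normal(size=(50, D))   # embeddings of the last 50 watched videos
search_emb = rng.normal(size=(20, D))  # embeddings of recent search tokens

# Average each variable-length bag into a fixed-width vector, concatenate,
# then pass through fully connected ReLU layers (the "tower"):
x = np.concatenate([watch_emb.mean(axis=0), search_emb.mean(axis=0)])
W1 = 0.05 * rng.normal(size=(256, x.size))
W2 = 0.05 * rng.normal(size=(D, 256))
user_vec = W2 @ np.maximum(W1 @ x, 0.0)   # last-layer output = user vector
```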
3.3. Various signals:
1) Demographic features act as a prior, so recommendations for brand-new users are still reasonable;
2) Users prefer fresh videos, even when relevance is somewhat lower;
but the system is biased toward recommending older videos, because it is trained on historical data;
a video's popularity is highly non-stationary, while the model tends to predict the average watch likelihood over the training window;
therefore the age of each training example ("example age", capturing upload-time/freshness effects) is fed to the model as a feature (see the sketch below).
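A sketch of how such a feature could be computed, assuming Unix timestamps; the paper's exact definition is not spelled out in these notes, but it does set the feature to zero (or slightly negative) at serving time so the model predicts for "now".

```python
import numpy as np

# "Example age" sketch: each training example carries its age relative to
# the end of the training window; at serving time the feature is zeroed.
def example_age(event_time, training_end):
    return training_end - np.asarray(event_time, dtype=float)

train_feature = example_age([1.59e9, 1.60e9], training_end=1.61e9)
serve_feature = 0.0
```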
3.4. Label and context selection:
1) Training examples come from all watches (including those on embedded players elsewhere), not only watches produced by the recommender; otherwise new content is hard to surface;
2) A fixed number of training examples is generated per user, preventing a small set of highly active users from dominating the loss;
3) Much CF work treats a user's behavior pairs as symmetric (hold out a random item and predict it from the rest), but video watching is asymmetric in time, so the model predicts the next watch from only the watches that preceded it (sketched after this list).
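A sketch of that asymmetric example generation; the function name and max_context window are hypothetical.

```python
# Build (context, label) pairs that respect time order: the label is a
# watch, and its context contains only the watches that preceded it,
# never a randomly held-out watch as in symmetric CF setups.
def make_examples(watch_sequence, max_context=50):
    examples = []
    for i in range(1, len(watch_sequence)):
        context = watch_sequence[max(0, i - max_context):i]
        examples.append((context, watch_sequence[i]))
    return examples

make_examples(["v1", "v2", "v3"])  # [(["v1"], "v2"), (["v1", "v2"], "v3")]
```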
3.5. Experiment
Adding more features and making the network deeper both improved holdout performance.
4. Ranking
Purpose of ranking:
1) Calibrate the candidate scores using impression-level signals: candidate generation optimizes for relevance, but whether the user actually clicks depends on more factors (e.g. the thumbnail);
2) Merge candidates from different sources into one ordered list.
The training target is expected watch time, modeled with weighted logistic regression; ranking by predicted CTR instead would encourage low-quality clickbait videos.
4.1. Feature representation
Features are either continuous (numerical) or categorical;
categorical features are either univalent (a single value, e.g. the video being scored) or multivalent (e.g. the last N watched video IDs);
continuous features describing past user-item interactions generalize well, because they transfer across items;
which candidate source nominated a video, and the score that source assigned, are also important features;
impression-frequency features lose some information but are very important (an impression that was shown and not clicked should be demoted on the next page load; frequency also reflects item quality);
categorical features are represented with learned embeddings (shared across features drawing on the same ID space);
Neural networks are sensitive to feature scaling (decision trees are not), so continuous features must be normalized; the paper normalizes each value via its cumulative distribution, mapping it into [0, 1);
additionally feeding powers of the normalized value (e.g. x² and √x) improved offline performance (see the sketch below).
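A sketch of the CDF normalization plus power features, using an assumed (hypothetical) training distribution; a production system would compute the quantiles over logged data.

```python
import numpy as np

rng = np.random.default_rng(0)
train_values = rng.lognormal(size=10_000)      # a skewed continuous feature

# Normalize a raw value to its empirical CDF, scaling it into [0, 1):
sorted_train = np.sort(train_values)
def cdf_normalize(x):
    return np.searchsorted(sorted_train, x) / len(sorted_train)

x_tilde = cdf_normalize(3.7)
# Also feed super- and sub-linear powers of the normalized value:
inputs = [x_tilde, x_tilde ** 2, np.sqrt(x_tilde)]
```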
4.2. Modeling expected watch time
The objective is weighted logistic regression: positive (clicked) impressions are weighted by their watch time, while negatives get unit weight. The odds learned this way are roughly (sum of watch times) / (number of negatives) ≈ E[T](1 + P) ≈ E[T] when the click probability P is small, so e^(logit) serves as an estimate of expected watch time.
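A sketch of the weighted loss and the serving-time score under that reading; the helper names are hypothetical, and question 3 below remains about how this is wired up in production.

```python
import numpy as np

# Weighted logistic regression sketch: clicked impressions are weighted
# by their watch time, unclicked impressions by 1 (unit weight).
def weighted_lr_loss(logits, clicked, watch_time):
    p = 1.0 / (1.0 + np.exp(-logits))
    w = np.where(clicked, watch_time, 1.0)
    ll = np.where(clicked, np.log(p), np.log(1.0 - p))
    return -(w * ll).sum() / w.sum()

# With this weighting the learned odds approximate expected watch time,
# so serving scores with exp(logit) rather than the sigmoid probability:
def serving_score(logits):
    return np.exp(logits)
```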
Open questions:
1. What exactly is the importance-weighting correction applied to the sampled classes?
2. How is the nearest-neighbor search service implemented?
3. How is the weighted logistic regression used in practice at serving time?
"Paper reading-rec" <<deep neural NETWORKS for YOUTUBE recommendations>> read