An Analysis of the Toutiao.com Algorithm Principles
I. System Concept
Described formally, a recommendation system is a function that fits a user's satisfaction with a piece of content. This function takes three inputs: the first is content features, the second is user features, and the third is environmental features.
In the recommendation model, click-through rate, reading time, likes, comments, and shares can all be quantified, so the model can fit these targets directly, and the online lift shows whether a change works well. However, a large recommendation system that serves a large number of users cannot be evaluated by metrics alone; factors beyond data metrics also matter.
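As a minimal sketch of this formulation, the snippet below scores one (content, user, environment) triple with a logistic function. The feature names, weights, and the `predict_satisfaction` helper are hypothetical and only illustrate the shape of the function, not Toutiao's actual model.

```python
import math

def predict_satisfaction(content, user, env, weights, bias=0.0):
    """Return a score in (0, 1) from three groups of features.

    Each group is a dict of feature name -> numeric value. All names
    and weights are invented for illustration.
    """
    features = {}
    for prefix, group in (("content", content), ("user", user), ("env", env)):
        for name, value in group.items():
            features[f"{prefix}:{name}"] = value
    # Linear combination of the features, squashed by a sigmoid.
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights; in practice they would be learned from click logs.
weights = {"content:topic_match": 1.5, "user:likes_tech": 0.8, "env:is_evening": 0.3}
score = predict_satisfaction(
    content={"topic_match": 1.0},
    user={"likes_tech": 1.0},
    env={"is_evening": 0.0},
    weights=weights,
)
```

In a real system the target could be any of the quantifiable signals mentioned above, such as a click or the reading time.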
1. Typical recommendation algorithms: collaborative filtering, Logistic Regression, DNN, Factorization Machine, and GBDT
2. Typical recommendation features:
1) Relevance features: keyword matching, classification matching, topic matching, and source matching;
2) environmental features: geographical location and time;
3) Heat features: Global heat, classification heat, topic heat, and keyword heat;
4) Collaborative features: users with similar clicks, users with similar interest categories, users with similar interest topics, and users with similar interest words;
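The relevance features and the user-similarity features in this list can be sketched as simple set operations. This is an illustration under assumptions: the dictionary layout, the field names, and the choice of Jaccard similarity for comparing click histories are not taken from the source.

```python
def relevance_features(article, user):
    """Match an article's tags against a user's interest tags."""
    return {
        "keyword_match": len(set(article["keywords"]) & set(user["keywords"])),
        "category_match": int(article["category"] in user["categories"]),
        "topic_match": int(article["topic"] in user["topics"]),
        "source_match": int(article["source"] in user["sources"]),
    }

def click_similarity(clicks_a, clicks_b):
    """Jaccard similarity between two users' sets of clicked article ids."""
    a, b = set(clicks_a), set(clicks_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical article and user records.
article = {
    "keywords": ["xiaomi", "phone"],
    "category": "tech",
    "topic": "smartphones",
    "source": "example_news",
}
user = {
    "keywords": ["xiaomi", "dota"],
    "categories": {"tech", "games"},
    "topics": {"smartphones"},
    "sources": {"other_news"},
}
features = relevance_features(article, user)
similarity = click_similarity(["a1", "a2", "a3"], ["a2", "a3", "a4"])
```

Environmental and heat features would typically be looked up (location, hour of day, click counts) rather than computed by matching, so they are left out of the sketch.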
3. Data dependencies of the recommendation system:
1) the feature extraction of the Recommendation model requires various tags on the user side and the content side;
2) The recall policy requires various user-side and content-side tags;
3) content analysis and user tag mining are the foundation for building a recommendation system.
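To show why recall depends on tags from both sides, here is a minimal sketch of tag-based recall through an inverted index. It assumes content tags are plain strings and user tags carry an interest weight; the function names and data are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(articles):
    """Map each content-side tag to the ids of the articles carrying it."""
    index = defaultdict(list)
    for article_id, tags in articles.items():
        for tag in tags:
            index[tag].append(article_id)
    return index

def recall(user_tags, index, limit=100):
    """Return candidate article ids ranked by summed user-tag weight."""
    scores = defaultdict(float)
    for tag, weight in user_tags.items():
        for article_id in index.get(tag, []):
            scores[article_id] += weight
    # Highest score first; ties broken by article id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article_id for article_id, _ in ranked[:limit]]

# Hypothetical content-side tags and user-side interest weights.
articles = {
    "a1": ["Internet", "Xiaomi"],
    "a2": ["Dota"],
    "a3": ["Sports", "Bayern Munich"],
}
index = build_inverted_index(articles)
candidates = recall({"Xiaomi": 0.9, "Dota": 0.4}, index)
```

Without tags on either side the lookup has nothing to join on, which is the dependency the list above describes.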
II. Content Analysis
Content Analysis includes text analysis, image analysis, and video analysis.
1. Text analysis plays an important role in the recommendation system, notably in user interest modeling. Its applications are as follows:
1) User interest modeling (user profile): for example, tagging users who like reading [Internet] articles with [Internet] and users who like [Xiaomi] news with [Xiaomi];
2) Content recommendation: [Meizu] content is recommended to users who follow [Meizu], and [Dota] content to users who follow [Dota];
3) Channel content generation: [Bundesliga] content goes to the [Bundesliga channel], and [slimming] content goes to the [slimming channel];
2. The text features that the recommendation system mainly extracts include:
1) Semantic tag features, which explicitly tag the article. These tags are defined by people; each tag has a clear meaning, and the tag system is predefined;
2) Implicit semantic features, mainly topic features and keyword features. A topic feature describes a probability distribution over words and has no explicit meaning. Keyword features are described using a unified set of features, but the keywords themselves do not form a fixed set;
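As one possible illustration of keyword features, the sketch below ranks the words of each document by TF-IDF. TF-IDF is an assumption chosen for the example; the source does not say how keywords are actually extracted, and the tokenized documents are invented.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=2):
    """Return the top_k keywords of each document, ranked by TF-IDF.

    docs is a list of token lists; tokenization is assumed to be done.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    result = []
    for tokens in docs:
        term_freq = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Highest score first; ties broken alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result.append([term for term, _ in ranked[:top_k]])
    return result

docs = [
    ["bayern", "munich", "wins", "match"],
    ["xiaomi", "phone", "launch", "match"],
]
keywords = tfidf_keywords(docs, top_k=2)
```

Unlike the predefined semantic tags, the keyword vocabulary here is whatever appears in the corpus, which matches the point that keywords have no fixed set.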
3. The unique value of text features for recommendation:
1) A recommendation engine cannot work without text features;
2) Collaborative features cannot solve the cold-start problem for articles;
3) The finer the granularity of the text features, the stronger the cold-start capability. For example: [Bayern Munich] versus [Sports];
4. Semantic tags:
1) Classification: user profile, channel content filtering, recommendation recall, and recommendation features;
2) Concept: channel content filtering, tag search, and recommendation recall (like);
3) Entity: channel content filtering, tag search, and recommendation recall (like);
5. Why the tags are layered:
1) Different levels have different granularities and different requirements;
2) The classification system requires full coverage: every article should fit some category, but the accuracy requirement is not high;
3) The entity system does not require full coverage, as long as it covers popular figures, institutions, works, and products in each field;
4) The concept system handles semantics that are precise but abstract, and it does not require full coverage either;
III. User Tags
Content analysis and user tags are the two cornerstones of the recommendation system. Content analysis involves more machine learning; by comparison, user tagging is the greater engineering challenge.
1. User tag overview:
1) Interest features: categories and topics of interest, keywords of interest, sources of interest, interest-based user clusters, and various vertical interest features (vehicle models, sports teams, and stocks of interest);
2) Identity features: gender, age, and resident location;
3) Behavioral features: for example, watching videos only at night;
2. Data processing strategies:
1) Noise filtering: clicks with a short dwell time are filtered out to remove clickbait;
2) Hot-content penalty: the user's actions on very popular articles (such as PGONE news) are down-weighted. In theory, the more widely a piece of content spreads, the lower the confidence that an action on it reflects the user's interest;
3) Time decay: as user actions accumulate, the weights of old features decay over time, and features contributed by new actions carry more weight;
4) Impression penalty: if an article shown to a user is not clicked, the weights of its features (category, keyword, and source) are penalized;
5) Global background: the average per-user click ratio for a given feature is taken into account;
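The first four strategies can be sketched as one update over a user's recent events. Everything specific here is an assumption made for illustration: the event fields, the exponential half-life decay, the `1 - popularity` down-weighting, and the constants. Global-background normalization is omitted.

```python
def update_interest(weights, events, now, half_life_days=30.0,
                    min_dwell_seconds=5.0, impression_penalty=0.1):
    """Return updated tag weights after applying a list of events.

    Each event is a dict with: tags, clicked, dwell_seconds,
    popularity (0..1), and timestamp (in days). All hypothetical.
    """
    updated = dict(weights)
    for event in events:
        # Time decay: older events contribute less.
        age_days = now - event["timestamp"]
        decay = 0.5 ** (age_days / half_life_days)
        if not event["clicked"]:
            # Impression penalty: shown but not clicked.
            delta = -impression_penalty * decay
        elif event["dwell_seconds"] < min_dwell_seconds:
            # Noise filtering: a short-dwell click is ignored.
            continue
        else:
            # Hot-content penalty: popular articles count for less.
            delta = (1.0 - event["popularity"]) * decay
        for tag in event["tags"]:
            updated[tag] = updated.get(tag, 0.0) + delta
    return updated

events = [
    {"tags": ["Sports"], "clicked": True, "dwell_seconds": 2.0,
     "popularity": 0.1, "timestamp": 10.0},
    {"tags": ["Xiaomi"], "clicked": True, "dwell_seconds": 60.0,
     "popularity": 0.5, "timestamp": 10.0},
    {"tags": ["Dota"], "clicked": False, "dwell_seconds": 0.0,
     "popularity": 0.2, "timestamp": 10.0},
]
profile = update_interest({}, events, now=10.0)
```

In this example the short [Sports] click leaves no trace, the [Xiaomi] click is halved by the article's popularity, and the unclicked [Dota] impression produces a small negative weight.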
IV. Evaluation and Analysis
1. Factors that may affect recommendation performance:
1) Changes in the candidate content set;
2) Improvements and additions to the recall modules;
3) Additional recommendation features;
4) Improvements to the recommendation system architecture;
5) Optimization of algorithm parameters;
6) Changes to rules and policies;
2. A good evaluation system should follow these principles:
1) Consider both short-term and long-term indicators;
2) Consider both user indicators and ecosystem indicators;
3) Pay attention to the impact of synergy effects;
V. Content Security
1. Risky content recognition technology:
1) Pornography detection model: trained with a deep learning algorithm (ResNet) on a sample set of tens of millions of images;
2) Vulgar-content model: analyzes text and images together;
3) Abuse model: cleans up comments in the product by identifying inappropriate ones;