Today, algorithmic distribution is already standard in almost every kind of software: information platforms, search engines, browsers, social apps, and so on. At the same time, algorithms are starting to be questioned, challenged, and misunderstood.
▲ Mainstream platforms' recommendation algorithms (humorous version)
Toutiao (Today's Headlines) launched the first version of its recommendation algorithm in September 2012, and the system has since undergone four major overhauls.
Toutiao asked its senior algorithm architect, Dr. Cao Huanhuan, to publicly explain the principles behind its algorithm, in order to promote industry-wide discussion of algorithms and build consensus around them: by making the algorithm transparent, Toutiao hopes to dispel misunderstandings and gradually push the whole industry to make algorithms better serve society.
▲ Understand Toutiao's recommendation algorithm in 3 minutes
This talk covers the principles of Toutiao's recommendation system in five parts:
System Overview
Content analysis
User tags
Evaluation and analysis
Content security
System Overview
Formally, a recommendation system fits a function that estimates a user's satisfaction with a piece of content.
This function takes input variables from three dimensions:
Content features. Toutiao is now a comprehensive content platform covering articles, videos, UGC short videos, Q&A, Weitoutiao (micro headlines), and more; each content type has its own characteristics, and we must consider how to extract features from each type for recommendation.
User features, including various interest tags, occupation, age, gender, and many implicit user interests carved out by models.
Environment features. These are characteristic of recommendation in the mobile internet era: users move between scenarios such as the office, the commute, and travel, and their information preferences shift accordingly.
Combining these three dimensions, the model estimates whether a given piece of content is suitable for a given user in a given scenario.
This raises another question: how do we incorporate goals that cannot be measured directly?
Quantifiable goals such as click-through rate, reading time, likes, comments, and shares can be fitted directly by the model, and online metrics show whether the model is doing well.
However, a large recommendation system serving huge numbers of users cannot be fully evaluated by those metrics alone; factors beyond the data metrics matter too.
For example, frequency control for ads and special content: a Q&A card is a special content form whose recommendation goal is not purely to be browsed, but also to attract users to answer and contribute content to the community. How such content mixes with ordinary content, and how its frequency is controlled, all require careful thought.
In addition, out of concern for the content ecosystem and for social responsibility, the platform suppresses vulgar content, clickbait, and low-quality content, and pins, boosts, or force-inserts important news while demoting content from low-quality accounts. The algorithm cannot accomplish all of this by itself; further content intervention is required.
Below I will briefly describe how to implement an algorithm for these goals.
The formula mentioned earlier, y = F(Xi, Xu, Xc), where Xi, Xu, and Xc are the content, user, and environment features, is a very classical supervised learning problem, and there are many ways to implement it:
for example, the traditional collaborative filtering model, the supervised Logistic Regression model, deep learning models, Factorization Machines, GBDT, and so on.
An excellent industrial-grade recommendation system needs a very flexible algorithm experimentation platform that supports many algorithm combinations, including changes to model structure, because no single model architecture suits every recommendation scenario.
Combining LR with DNN is popular now, and a few years earlier Facebook combined the LR and GBDT algorithms; a sketch of that combination follows.
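To make the LR-plus-GBDT idea concrete, here is a minimal sketch in the style of Facebook's published approach: a GBDT learns feature crossings, each sample's leaf indices are one-hot encoded, and an LR then fits the click-through rate on top. The data, hyperparameters, and scikit-learn stack here are illustrative assumptions, not Toutiao's actual implementation.

```python
# Sketch: GBDT leaves as one-hot features for LR (Facebook-style hybrid).
# Synthetic data; constants are illustrative only.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))                   # dense input features
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)   # synthetic click labels

# Stage 1: the GBDT learns nonlinear feature combinations.
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3)
gbdt.fit(X, y)

# Each sample's leaf index in every tree becomes a categorical feature.
leaves = gbdt.apply(X)[:, :, 0]                     # (n_samples, n_trees)
X_leaves = OneHotEncoder().fit_transform(leaves)

# Stage 2: LR estimates click-through rate on the one-hot leaf features.
lr = LogisticRegression(max_iter=1000).fit(X_leaves, y)
ctr = lr.predict_proba(X_leaves)[:, 1]              # estimated click probability
```

The appeal of the split is that the tree model automates feature crossing offline, while the linear model stays cheap to serve and easy to update frequently.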
Several Toutiao products run on this same robust recommendation system, but the model architecture is adjusted to each business scenario.
With the model in hand, let's look at typical recommendation features. Four classes of features play an important role in recommendation:
Correlation features, which evaluate whether the content's attributes match the user. Explicit matches include keyword match, category match, source match, and topic match. FM-style models also yield implicit matches, derived from the distance between the user vector and the content vector (see the sketch after this list).
Environment features, including location and time. These serve both as bias features and as building blocks for matching features.
Popularity features, including global popularity, category popularity, topic popularity, keyword popularity, and so on. Content popularity information is very effective in a large recommender system, especially for cold-starting new users.
Collaborative features, which partly address the worry that algorithms make recommendations ever narrower. Collaborative features do not rely solely on a user's own history.
Instead, they use behavior to analyze similarity between users, such as click similarity, interest-category similarity, topic similarity, interest-word similarity, and even vector similarity, thereby extending the model's ability to explore.
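To illustrate the implicit match in correlation features: in an FM-style model, a learned user vector and content vector are compared directly, and their inner product or distance becomes a feature. The vectors below are made-up toy values, not learned embeddings.

```python
# Sketch: implicit user-content match from embedding vectors (FM-style).
import numpy as np

user_vec = np.array([0.3, -1.2, 0.8, 0.05])      # toy learned user embedding
item_vec = np.array([0.4, -0.9, 0.6, -0.10])     # toy learned content embedding

dot = float(user_vec @ item_vec)                  # larger = stronger match
cosine = dot / (np.linalg.norm(user_vec) * np.linalg.norm(item_vec))
distance = float(np.linalg.norm(user_vec - item_vec))  # smaller = closer
print(f"dot={dot:.3f}  cosine={cosine:.3f}  distance={distance:.3f}")
```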
For model training, most of Toutiao's recommendation products use real-time training. Real-time training saves resources and gives fast feedback, which matters greatly for information-flow products.
User behavior signals must be captured quickly by the model and reflected in the next refresh's recommendations. We currently process sample data in real time on a Storm cluster, covering clicks, impressions, favorites, shares, and other action types.
The model parameter server is a high-performance system developed in-house. Toutiao's data scale has grown too fast for comparable open-source systems to satisfy our stability and performance needs, while our self-developed system carries many targeted low-level optimizations and complete operations tooling, making it better suited to our business scenarios.
At present, Toutiao's recommendation model is fairly large even by global standards, containing tens of billions of raw features and billions of vector features.
The overall training flow: online servers log real-time features and feed them into a Kafka queue; a Storm cluster consumes the Kafka data; labels from client callbacks are joined in to construct training samples; model parameters are then updated online from the latest samples; and finally the serving model is refreshed.
The main delay in this pipeline is the user-action feedback delay, since a user does not necessarily see an article immediately after it is recommended; setting that part aside, the whole system runs almost in real time. A minimal sketch of the loop follows.
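Here is a minimal sketch of that online training loop, with the Kafka/Storm plumbing replaced by a plain Python generator and the parameter server by an in-process dictionary; the feature names and constants are invented for illustration.

```python
# Sketch: online logistic-regression training over a stream of labeled samples.
import math
import random

weights = {}                 # stand-in for the parameter server: feature -> weight
LEARNING_RATE = 0.05         # illustrative constant

def predict(features):
    """Logistic prediction over sparse binary features."""
    z = sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-z))

def update(features, label):
    """One online SGD step on a single (features, click-label) sample."""
    gradient = predict(features) - label
    for f in features:
        weights[f] = weights.get(f, 0.0) - LEARNING_RATE * gradient

def sample_stream():
    """Stand-in for the Kafka/Storm pipeline that joins impressions with clicks."""
    while True:
        clicked = random.random() < 0.3
        features = ["category:tech", "hour:9"] + (["kw:AI"] if clicked else [])
        yield features, int(clicked)

for i, (feats, label) in enumerate(sample_stream()):
    update(feats, label)     # the model is fresh right after each event
    if i >= 10_000:
        break
```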
But because Toutiao's content pool is enormous, with tens of millions of short videos alone, the model cannot possibly score every piece of content for every request.
So we need recall strategies that, on each recommendation, filter a candidate set of thousands out of the massive content library. The most important requirement for recall is extreme performance: the overall timeout generally must not exceed 50 milliseconds.
There are many kinds of recall strategy; we mainly use the idea of inverted indexes. Inverted indexes are maintained offline, keyed by category, topic, entity, source, and so on, with postings ordered by popularity, freshness, user actions, and so on.
Online recall can then quickly truncate candidates from the inverted indexes according to the user's interest tags, efficiently sifting a much smaller subset out of the large content library, as the sketch below illustrates.
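A toy sketch of inverted-index recall under the assumptions above: posting lists are pre-sorted offline (best-first by popularity, freshness, and so on), and online recall simply truncates the top of each list matching the user's interest tags. The tags and item ids are made up.

```python
# Sketch: truncating recall from offline-built inverted indexes.
from itertools import islice

# Built offline: tag -> item ids, pre-sorted best-first.
inverted_index = {
    "category:technology": [101, 205, 33, 78, 412],
    "topic:smartphones":   [205, 99, 101, 560],
    "source:tech_blog_a":  [33, 610, 205],
}

def recall(user_tags, per_key=2, limit=1000):
    """Pull the top items from each posting list matching the user's tags."""
    seen, candidates = set(), []
    for tag in user_tags:
        for item in islice(inverted_index.get(tag, []), per_key):
            if item not in seen:
                seen.add(item)
                candidates.append(item)
            if len(candidates) >= limit:
                return candidates
    return candidates

print(recall(["category:technology", "topic:smartphones"]))
# -> [101, 205, 99]: a deduplicated, truncated candidate set
```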
Content analysis
Content analysis includes text analysis, image analysis, and video analysis. Toutiao started with news and articles, so today we will mainly discuss text analysis.
One of the most important roles of text analysis in a recommender system is modeling user interest. Without content and text tags, there can be no user interest tags.
For example, only if an article carries the tag "Internet" can we learn, when a user reads it, that the user has the "Internet" interest tag; the same goes for keywords.
On the other hand, text tags can directly serve as recommendation features: Meizu-related content, say, can be recommended to users who follow Meizu. That is user-tag matching.
If the main feed's recommendations are unsatisfactory for a while and become too narrow, users will turn to specific channels (technology, sports, entertainment, military, and so on) to read, and after they return to the main feed, the recommendations improve.
Because the whole model is interconnected, sub-channels have a smaller exploration space and satisfy user needs more easily. Improving recommendation accuracy through a single channel's feedback alone is harder, so doing the sub-channels well is very important, and that in turn requires good content analysis.
▲ An actual text case from Toutiao. As you can see, this article carries text features such as categories, keywords, topics, and entity words.
Of course, a recommendation system can work without text features: the earliest applications of recommendation, at Amazon and even back in the Walmart era, as well as Netflix's video recommendations, used collaborative filtering directly with no text features.
But for information products, most content is consumed the same day it appears. Without text features, cold-starting new content is very difficult, and collaborative features cannot solve the article cold-start problem.
The text features Toutiao extracts fall mainly into the following categories. First are explicit semantic tags, in which articles are labeled with human-defined semantic tags. Each tag has a clear meaning, and the tag system is predefined.
There are also implicit semantic features, mainly topic features and keyword features. A topic feature describes a probability distribution over words and carries no single definite meaning, while keyword features are based on a unified feature description and have no fixed set.
Text similarity features are also very important. One of the most frequent pieces of user feedback Toutiao has ever received is: why do you keep recommending duplicate content? The difficulty is that everyone defines "duplicate" differently.
For example, some people feel that having read an article about Real Madrid and Barcelona yesterday, another article about the two teams today counts as a repeat.
But for a serious fan, especially a Barça fan, no amount of coverage is too much. Solving this requires judging the similarity of articles by topic, text, and subject, and building online deduplication strategies on top of those judgments; a sketch follows.
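One simple way to operationalize such similarity judgments, sketched here with TF-IDF cosine similarity and a tunable threshold. Toutiao's production strategy also weighs topics and subjects; where to set the threshold is exactly the audience-dependent question above. The texts are made up.

```python
# Sketch: near-duplicate detection with TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Real Madrid beat Barcelona in El Clasico on Sunday",
    "Barcelona lose El Clasico to Real Madrid by one goal",
    "The national swimming championship opened this morning",
]
sim = cosine_similarity(TfidfVectorizer().fit_transform(docs))

DUP_THRESHOLD = 0.3   # stricter for casual readers, looser for die-hard fans
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if sim[i, j] > DUP_THRESHOLD:
            print(f"docs {i} and {j} look like near-duplicates ({sim[i, j]:.2f})")
```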
Similarly, there are spatiotemporal features, which analyze the geographic relevance and timeliness of content. For example, news of driving restrictions in Wuhan may mean nothing to users in Beijing.
Finally, quality-related features judge whether the content is vulgar or pornographic, whether it is an advertorial, or whether it is "chicken soup for the soul".
▲ The features and usage scenarios of Toutiao's semantic tags. Tags sit at different levels and carry different requirements.
The goal of classification is full coverage: we want every article and every video to have a category. The entity system, by contrast, demands precision: for the same name or the same content it must clearly distinguish which person or thing is meant, though it need not cover everything.
The concept system handles semantics that are more precise and abstract. This was our initial division of labor; in practice we found that classification and concepts can share the same technology, so we later unified them under one technical architecture.
At present, implicit semantic features already help recommendation a great deal, while semantic tags require continual annotation: new concepts keep emerging, and the labeling must keep iterating.
Given that semantic tags are far harder and more resource-intensive to build than implicit semantic features, why do we need them at all?
Some product needs demand them: channels, for instance, require a clearly defined classification scheme and an easy-to-understand text tag system. The quality of semantic tagging is also the touchstone of a company's NLP capability.
Toutiao's online classification uses a typical hierarchical text classification algorithm.
At the top is Root; the first layer below it holds big categories such as technology, sports, finance, and entertainment.
Sports then subdivides into football, basketball, table tennis, tennis, athletics, swimming, and so on; football subdivides into international football and Chinese football; Chinese football further subdivides into China League One, the Chinese Super League, the national team, and so on.
Compared with a single flat classifier, the hierarchical approach better mitigates data skew. There are some exceptions: to raise recall, you can see that we connected some "flying lines", shortcut edges that skip across levels.
The architecture is shared, but each meta-classifier can be heterogeneous depending on the difficulty of its problem: an SVM alone works very well for some categories, some need an SVM combined with a CNN, and others a CNN combined with an RNN. A minimal sketch follows.
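A minimal sketch of the hierarchical scheme under these assumptions: one classifier per non-leaf node routes a document down the taxonomy. In production each node could be heterogeneous (SVM, CNN, RNN); here every node is a tiny TF-IDF plus linear-SVM pipeline trained on made-up data.

```python
# Sketch: hierarchical text classification by routing through per-node classifiers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_node(texts, labels):
    """Train one meta-classifier for a single taxonomy node."""
    return make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, labels)

# Tiny stand-in corpora; real training data is vastly larger.
node_clf = {
    "Root": train_node(
        ["the team won the match", "new chip boosts phone speed"],
        ["sports", "technology"]),
    "sports": train_node(
        ["striker scored a goal", "guard hit a three pointer"],
        ["football", "basketball"]),
}

def classify(text, node="Root", path=()):
    """Walk the taxonomy from Root until reaching a node with no classifier."""
    label = node_clf[node].predict([text])[0]
    path = path + (label,)
    return classify(text, label, path) if label in node_clf else path

print(classify("the striker scored a late goal"))   # expected: ('sports', 'football')
```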
▲ A case of the entity-word recognition algorithm. Candidates are selected from word segmentation results and part-of-speech tagging; along the way we may need to splice words together according to the knowledge base, since some entities are combinations of several words, to determine which combination maps to the description of an entity.
If the result maps to multiple entities, we disambiguate using word vectors, topic distributions, and even the entities' own frequencies, finally computing a relevance model; a toy disambiguation sketch follows.
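A toy sketch of that disambiguation step: when a mention maps to several knowledge-base entities, blend context similarity (via word vectors) with the entity's own prior frequency and keep the best candidate. The embeddings, priors, and blending weight here are all invented.

```python
# Sketch: entity disambiguation by context similarity plus prior frequency.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Candidate entities for the mention "apple": (toy embedding, prior frequency).
candidates = {
    "Apple Inc.":    (np.array([0.9, 0.1, 0.0]), 0.7),
    "apple (fruit)": (np.array([0.0, 0.2, 0.9]), 0.3),
}
context_vec = np.array([0.8, 0.3, 0.1])   # e.g. averaged word vectors of the article

def disambiguate(context, cands, alpha=0.8):
    """Score = alpha * context similarity + (1 - alpha) * prior frequency."""
    return max(cands, key=lambda e: alpha * cos(context, cands[e][0])
                                    + (1 - alpha) * cands[e][1])

print(disambiguate(context_vec, candidates))   # -> Apple Inc.
```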
User tags
Content analysis and user tagging are the two cornerstones of the recommendation system. Content analysis involves more machine learning; by comparison, user tagging poses greater engineering challenges.
Toutiao's commonly used user tags include the categories and topics a user is interested in, keywords, sources, interest-based user clusters, and various vertical interest features (car models, sports teams, stocks, and so on), plus gender, age, location, and other information.
Gender comes from the third-party social accounts users log in with. Age is usually predicted by a model from signals such as the device model and the distribution of reading times.
The resident location comes from location information users have authorized us to access; applying traditional clustering to that data yields resident points.
Resident points combined with other information let us infer a user's work location, business-travel locations, and tourist destinations. These user tags are all very helpful for recommendation.
Of course, the simplest user tags are the tags of the content a user has browsed. But using them well involves some data processing strategies (a combined sketch follows the list), mainly including:
Noise filtering. Clicks with very short dwell times are filtered out to defeat clickbait.
Penalizing hot content. A user's actions on extremely popular articles (such as the earlier PG One news) are down-weighted. In theory, the wider a piece of content spreads, the lower the confidence that interacting with it reflects genuine interest.
Time decay. User interests drift, so the policy leans toward recent behavior. As user actions accumulate, old feature weights decay over time while newer actions contribute more weight.
Penalizing non-clicks. If an article recommended to a user is not clicked, the weights of its related features (category, keywords, source) are penalized.
Of course, we also consider the global context, such as whether related content is being pushed too often, as well as the related "close" and "dislike" signals.
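A combined sketch of the strategies above, folding dwell-time noise filtering, a popularity penalty, exponential time decay, and a non-click penalty into one toy user profile. All constants are illustrative, not Toutiao's production values.

```python
# Sketch: maintaining user interest tag weights with the listed strategies.
import math
import time

DECAY_HALF_LIFE = 7 * 24 * 3600      # illustrative: interests halve in a week
MIN_DWELL_SECONDS = 5                # shorter clicks are treated as noise

tag_weights = {}                     # user profile: tag -> interest weight
last_update = {}

def decay(tag, now):
    """Exponentially decay a tag's weight for the time since its last update."""
    dt = now - last_update.get(tag, now)
    tag_weights[tag] = tag_weights.get(tag, 0.0) * 0.5 ** (dt / DECAY_HALF_LIFE)
    last_update[tag] = now

def on_click(tags, dwell_seconds, article_popularity, now=None):
    now = now or time.time()
    if dwell_seconds < MIN_DWELL_SECONDS:          # noise filter: likely clickbait
        return
    gain = 1.0 / math.log(2 + article_popularity)  # hot-content penalty
    for tag in tags:
        decay(tag, now)
        tag_weights[tag] += gain

def on_skip(tags, now=None):
    """Recommended but not clicked: penalize the related feature tags."""
    now = now or time.time()
    for tag in tags:
        decay(tag, now)
        tag_weights[tag] -= 0.2

on_click(["category:tech", "kw:AI"], dwell_seconds=40, article_popularity=10)
on_skip(["category:gossip"])
print(tag_weights)
```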
Mining user tags is generally straightforward; the real challenges are the engineering issues just mentioned. The first version of Toutiao's user tagging ran on a batch computing framework: the flow was simple, and each day a Hadoop batch job recomputed results from every user's action data over the past two months.
The problem was that as users grew rapidly, interest model types and other batch tasks kept multiplying, and the computation involved became too large.
By 2014, the Hadoop batch job that updated millions of users' tags could barely finish within the day.
Contention for cluster compute resources easily affected other jobs, the pressure of bulk writes to the distributed storage system kept rising, and the latency of user interest tag updates grew higher and higher.
Facing these challenges, at the end of 2014 Toutiao launched a streaming user-tag system built on Storm.
After the switch to streaming, tags are updated whenever a user action arrives; the CPU cost is relatively small, saving 80% of CPU time and greatly reducing compute resource overhead.
At the same time, just a few dozen machines can support daily interest-model updates for tens of millions of users, and feature updates are very fast, essentially quasi-real-time. The system has been in use ever since launch.
Of course, we also found that not every user tag needs a streaming system. Information such as gender, age, and resident location does not require real-time recomputation and still keeps its daily update.
Evaluation and analysis
Having described the overall architecture of the recommendation system, how do we evaluate its effectiveness? There is a saying I find very wise: "You cannot optimize what you cannot evaluate." The same holds for recommender systems.
In fact, many factors affect recommendation quality: changes to the candidate set, improvements to or additions of recall modules, new recommendation features, improvements to the model structure, tuning of algorithm parameters, and so on.
The significance of evaluation is that many optimizations may ultimately have a negative effect; shipping an "improvement" does not guarantee the effect actually improves.
Comprehensively evaluating a recommendation system requires a complete evaluation framework, a powerful experimentation platform, and easy-to-use empirical analysis tools.
A complete framework means not measuring by a single indicator: you cannot look only at click-through rate or only at dwell time; comprehensive evaluation is needed.
Over the past few years we have kept trying to see whether a single formula can synthesize as many metrics as possible, but we are still exploring. At present, launch decisions are made by an evaluation committee of senior business staff after in-depth discussion.
When a company's algorithms underperform, it is often not that its engineers lack ability, but that it lacks a powerful experimentation platform and convenient experiment analysis tools that can intelligently assess the statistical confidence of data metrics.
A good evaluation framework should be built on several principles. First, account for both short-term and long-term indicators.
When I was previously responsible for e-commerce at another company, I observed that many strategy adjustments felt fresh to users in the short term but brought no help in the long run.
Second, account for both user indicators and ecosystem indicators. As a platform for content creation and distribution, Toutiao must balance the value and creative dignity of content creators with its obligation to satisfy users; advertisers' interests must be considered as well. It is a process of multi-party negotiation and balance.
In addition, pay attention to synergy effects: strict traffic isolation is hard to achieve in experiments, so beware of externalities.
The most direct advantage of a powerful experimentation platform: when many experiments run online at once, the platform can allocate traffic automatically without human coordination, and reclaim traffic the moment an experiment ends, greatly improving management efficiency.
This helps the company cut analysis costs and speeds up algorithm iteration, so that the whole system's algorithm optimization can move forward quickly.
▲ The basic principle of Toutiao's A/B testing system. First we bucket users offline, then distribute experiment traffic online, tagging bucketed users and routing them into experiment groups.
For example, to open an experiment on 10% of traffic with two 5% experiment groups: one 5% group is the baseline, running the same strategy as the live product, and the other runs the new strategy (a bucketing sketch follows).
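A minimal sketch of the offline bucketing step under stated assumptions: hash each user id into 100 stable buckets, then map bucket ranges to groups, here a 10% experiment split into a 5% baseline and a 5% treatment. The salt, bucket count, and split are assumptions, not the real system's parameters.

```python
# Sketch: stable hash-based user bucketing for A/B traffic allocation.
import hashlib

def bucket(user_id: str, salt: str = "exp_layer_1", n: int = 100) -> int:
    """Deterministic bucket in [0, n): the same user always lands in the same bucket."""
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n

def assign(user_id: str) -> str:
    b = bucket(user_id)
    if b < 5:
        return "baseline"     # 5%: the strategy currently live
    if b < 10:
        return "treatment"    # 5%: the new strategy under test
    return "holdout"          # remaining 90%: untouched traffic

print(assign("user_42"))
```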
User actions are collected during the experiment, essentially in quasi-real-time, with results viewable every hour. But because hourly data fluctuates, results are usually read at day granularity. After collection come log processing, distributed statistics, and database writes, all very convenient.
In this system, engineers only need to set the traffic requirements, the experiment duration, any special filter conditions, and a custom experiment group ID.
The system then generates automatically: comparisons of the experimental data, the confidence of the experimental data, a summary of the experimental conclusions, and suggestions for optimizing the experiment. A sketch of one plausible confidence computation follows.
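As one plausible form of the "confidence" the platform reports, here is a standard two-proportion z-test comparing baseline and treatment click-through rates; the counts are invented, and the real system's statistics may well differ.

```python
# Sketch: significance of a CTR difference between two experiment groups.
import math

def two_proportion_z(clicks_a, shows_a, clicks_b, shows_b):
    """Return (z, two-sided p-value) for the CTR difference between groups."""
    p_a, p_b = clicks_a / shows_a, clicks_b / shows_b
    p = (clicks_a + clicks_b) / (shows_a + shows_b)          # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / shows_a + 1 / shows_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = two_proportion_z(clicks_a=4_800, shows_a=100_000,
                        clicks_b=5_100, shows_b=100_000)
print(f"z={z:.2f}, p={p:.4f}")   # a small p-value: the lift is unlikely to be noise
```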
Of course, the experimentation platform alone is far from enough. An online experiment platform can only infer changes in user experience from changes in data metrics, but data metrics and user experience are not the same thing, and many metrics cannot be fully quantified. Many improvements still have to be assessed through manual analysis, and major improvements require human evaluation as a second confirmation.
Content security
Finally, some of Toutiao's measures on content security. Toutiao is now the largest content creation and distribution platform in the country, and it must pay ever more attention to its social responsibility and its responsibility as an industry leader. If even 1% of recommendations have problems, the impact is huge.
As a result, Toutiao has placed content security in the company's highest-priority queue since its founding; from the very beginning it maintained a dedicated moderation team responsible for content security.
At that time the company's entire development staff across client, backend, and algorithms numbered fewer than 40 people, which shows how seriously Toutiao took content moderation.
Today, Toutiao's content comes mainly from two sources:
PGC platforms with mature content production capabilities;
UGC user content, such as Q&A, user comments, and Weitoutiao.
Both kinds of content go through a unified moderation mechanism. PGC content, which is relatively low in volume, is risk-screened directly; if no problem is found, it is recommended at scale.
UGC content must first be filtered by a risk model; anything flagged goes through a second round of risk review. Only content that passes is actually recommended.
After that, if a piece of content receives more than a threshold amount of negative feedback, such as comments or reports, it re-enters the review process, and is taken down outright if a problem is found.
The overall mechanism is relatively mature, and as an industry leader, Toutiao has always held itself to the highest standards on content security.
The content-recognition technology shared here consists mainly of a pornography model, a vulgarity model, and an abuse model. Toutiao's vulgarity model is trained with deep learning on a very large sample library, analyzing images and text at the same time.
This class of model emphasizes recall, even at some sacrifice of precision. The abuse model's sample library likewise exceeds a million items, with recall above 95% and precision above 80%. For users who frequently post foul language or inappropriate comments, we have penalty mechanisms.
Recognizing broadly low-quality content covers many situations, such as fake news, paid smear pieces, clickbait titles, and generally poor content. Such content is very hard for machines to understand and requires a large amount of feedback, including sample information from other sources.
At present the precision and recall of the low-quality model are not especially high, so it must be combined with human review and the threshold raised accordingly. The final recall has reached 95%, but there is still plenty of work to do here; a sketch of that threshold choice follows.
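A sketch of the recall-over-precision trade-off just described: choose the highest score threshold that still meets a recall target, then send everything flagged to human review. The labels and scores below are synthetic.

```python
# Sketch: picking a decision threshold to hit a recall target.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=5_000)                        # 1 = problematic
scores = np.clip(0.4 * y_true + rng.normal(0.3, 0.2, 5_000), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, scores)
TARGET_RECALL = 0.95
ok = np.where(recall[:-1] >= TARGET_RECALL)[0]
i = ok[-1]             # highest threshold that still reaches the recall target
print(f"threshold={thresholds[i]:.2f}  "
      f"precision={precision[i]:.2f}  recall={recall[i]:.2f}")
```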
Hang Li of Toutiao's AI Lab is also building a joint research project with the University of Michigan to establish a rumor-recognition platform.
The above is a share of the principles behind Toutiao's recommendation system. We hope to receive more suggestions going forward to help us do this work better.
Source: Toutiao's official account