Before introducing Weibo's recommendation algorithms, let's first talk about recommendation systems and recommendation algorithms in general. A few questions come up: what scenarios do recommendation systems apply to? What problems do they solve, and what value do they provide? How is their effect measured?
Recommendation systems were born long ago, but they only gained widespread attention with the rise of social networks, represented by Facebook, and the boom of e-commerce, represented by Taobao. The era of "choice" had arrived: information and items became so abundant that users, like tiny points in a vast universe, were left at a loss. Recommendation systems exploded in popularity and moved ever closer to users:
Information updates so quickly that users need the wisdom of the crowd to grasp current hot topics. Information has inflated to the point where personalized information acquisition is costly and filtering for useful information is inefficient. And in many cases, users' personalized needs are hard to articulate, such as "tonight I need to find a nearby restaurant whose prices and dishes match my taste."
Recommendation systems have many application scenarios. Their main job is to find the right items for a user (connecting and ranking) and to find reasonable explanations for the recommended results. Solving these problems is the value of the system: establishing connections, promoting flow and spread, and accelerating survival of the fittest.
The recommendation algorithm is the method and means by which the recommendation system achieves its goals. Only when algorithms are combined with the product and built efficiently and stably can they deliver their maximum effect.
Now for Weibo recommendation. Weibo's own product design means that even without a recommendation system, a large user relationship network would still form and information would still spread quickly. A simple way to measure the value of something is to compare what happens when you keep it versus remove it. Weibo needs a healthy user relationship network to protect the quality of users' feed streams, and it needs high-quality information to spread quickly while low-quality information is suppressed. The role of Weibo recommendation is to accelerate this process and, in specific situations, to control the flow of information: it is both an accelerator and a controller.
Finally, back to Weibo's recommendation algorithms. Everything above is meant to give a better understanding of them. Our job is to use various data and tools to address the goals and problems that need solving, abstracting them into a series of mathematical problems.
Next, we comb through the methods and techniques we use and introduce them one by one.
Foundation and association algorithms
The main role of these algorithms is to mine the basic resources needed for recommendation, solve common technical problems in recommendation, and complete the data analysis needed to guide the recommendation business.
The algorithms and techniques commonly used in this section are as follows:
Word segmentation and core-word extraction
This is the foundation of Weibo content recommendation: it transforms microblog content into structured vectors, covering word segmentation, part-of-speech tagging, core-word/entity-word extraction, semantic dependency analysis, and so on.
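As a toy illustration of core-word extraction (not Weibo's actual pipeline; the corpus and scoring here are made up, and real word segmentation would be done by a Chinese tokenizer), the core words of a post can be picked by a simple TF-IDF score:

```python
import math
from collections import Counter

# Toy corpus of already-segmented posts (the words are placeholders;
# a real pipeline would tokenize Chinese text first).
corpus = [
    ["weibo", "recommend", "algorithm", "user"],
    ["user", "feed", "stream", "quality"],
    ["algorithm", "model", "training", "feature"],
]

def idf(corpus):
    """Inverse document frequency over the whole corpus."""
    n = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(n / df[w]) for w in df}

def core_words(doc, idf_table, k=2):
    """Score each word by TF * IDF and keep the top-k as 'core words'."""
    tf = Counter(doc)
    scored = {w: tf[w] * idf_table.get(w, 0.0) for w in tf}
    return sorted(scored, key=scored.get, reverse=True)[:k]

idf_table = idf(corpus)
print(core_words(corpus[0], idf_table))
```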
Classification and spam detection
Analysis of candidate microblog content for recommendation, including content classification and recognition of marketing/advertising/pornographic microblogs.
Content classification is implemented with a decision-tree classification model, organized into a 3-level taxonomy with 148 categories; marketing/advertising/pornography recognition uses a hybrid of naive Bayes and maximum entropy models.
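The naive Bayes half of that hybrid can be sketched as follows; the tiny training set and word features are invented for illustration, and the real system would also blend in a maximum entropy model:

```python
import math
from collections import Counter

# Tiny hand-made training set: 1 = spam/marketing, 0 = normal.
train = [
    (["buy", "cheap", "followers", "now"], 1),
    (["discount", "sale", "buy", "now"], 1),
    (["great", "movie", "last", "night"], 0),
    (["weather", "is", "nice", "today"], 0),
]

class NaiveBayes:
    def fit(self, data):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter()
        self.vocab = set()
        for words, y in data:
            self.class_counts[y] += 1
            self.word_counts[y].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        # Log-space scoring with add-one (Laplace) smoothing.
        best, best_score = None, float("-inf")
        n = sum(self.class_counts.values())
        v = len(self.vocab)
        for y in (0, 1):
            score = math.log(self.class_counts[y] / n)
            total = sum(self.word_counts[y].values())
            for w in words:
                score += math.log((self.word_counts[y][w] + 1) / (total + v))
            if score > best_score:
                best, best_score = y, score
        return best

clf = NaiveBayes().fit(train)
print(clf.predict(["buy", "now"]))  # likely flagged as spam
```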
Clustering Technology
Mainly used for hot-topic mining and for aggregating related resources for content recommendation. The WVT algorithm (Word Vector Topic), developed in-house at Weibo, is designed according to the characteristics and propagation patterns of Weibo content.
Propagation models and user influence analysis
Research on how microblogs propagate and analysis of users' network influence (including depth influence, breadth influence, and domain influence).
Main recommendation algorithms
1. Graph-based recommendation algorithm
Weibo is characterized by user-contributed content, social propagation, and explosive information spread. We call our approach a graph-based recommendation algorithm, rather than the industry's common memory-based algorithms, mainly because:
Our algorithm design is rooted in the social network. The core idea is to start from the social network, incorporate the information propagation model, and comprehensively use all types of data to give users the best recommendations. In many cases we are just a key link in the propagation chain: we add the necessary recommendation regulation and change the channel through which information spreads, and the subsequent transmission then spreads naturally along the original network. Feed-stream recommendation (which we call "trends") is our most important product, and its results must incorporate user relationships.
From a macro, graph-level point of view, our goal is to build a higher-value user relationship network that promotes the rapid spread of high-quality information and improves feed quality. The key work is mining key nodes, plus content recommendation and user recommendation oriented toward those key nodes.
The algorithms in this part are summarized in the following table:
The difficulty here is how to quantify and select the graph's "edges". Recommendation is based on a comprehensive score computed over multiple "edges" and "nodes", fused with the results of network mining and analysis.
Research and development on these algorithms has produced the following data products:
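To make the graph idea concrete, here is a minimal friend-of-friend sketch; the graph, edge weights, and scoring rule are all hypothetical stand-ins for the multi-edge scoring described above:

```python
from collections import defaultdict

# Hypothetical weighted follow graph: each edge weight stands in for a
# combined score over several "edges" (follow, interaction, repost).
graph = {
    "alice": {"bob": 1.0, "carol": 0.5},
    "bob":   {"dave": 0.8, "carol": 0.3},
    "carol": {"dave": 0.9},
    "dave":  {},
}

def recommend_users(user, graph, k=3):
    """Score friend-of-friend candidates by summed two-hop path weights."""
    scores = defaultdict(float)
    followed = set(graph.get(user, {}))
    for friend, w1 in graph.get(user, {}).items():
        for fof, w2 in graph.get(friend, {}).items():
            if fof != user and fof not in followed:
                scores[fof] += w1 * w2
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend_users("alice", graph))
```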
2. Content-based recommendation algorithm
Content-based is the most commonly used and most fundamental recommendation algorithm in Weibo recommendation. Its main technical steps are structural analysis of content and correlation computation over the candidate set.
Related recommendation on the microblog body page is where content-based is most widely used; let's take it as an example.
Many of the points in content analysis have been described above, so we focus on two places:
Content quality analysis. This mainly combines microblog exposure gains with content informativeness and readability. Exposure gain draws on aggregate user behavior. Informativeness is comparatively simple to compute: essentially the IDF of the microblog's keywords. For readability, we built a small classification model, trained with news corpora (good readability) and colloquial corpora (poor readability) as samples; by extracting words and collocation information, it estimates the probability that a new microblog is readable.
Word extension. The effect of content-based depends on the depth of content analysis. Microblog content is short, so relatively little key information can be extracted, and related-content operations easily hit a trade-off between recall and precision caused by data sparsity. We introduced word2vec to improve word expansion, and on that basis carried out word clustering, raising recommendation recall and precision simultaneously.
The technical core of correlation computation is vector quantization and distance metrics. We usually use either "TF weighting + cosine distance" or "topic probability + KL divergence".
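A minimal sketch of the "TF weighting + cosine" option (the toy posts here are placeholders for segmented microblog content):

```python
import math
from collections import Counter

def tf_vector(words):
    """Sparse term-frequency vector as a Counter."""
    return Counter(words)

def cosine(a, b):
    """Cosine similarity between two sparse TF vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

post_a = tf_vector(["world", "cup", "final", "goal"])
post_b = tf_vector(["world", "cup", "match", "goal"])
post_c = tf_vector(["stock", "market", "crash"])

print(cosine(post_a, post_b))  # high: shared topic
print(cosine(post_a, post_c))  # no overlap
```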
3. Model-based recommendation algorithm
As China's largest social media product, Weibo has a huge number of users and a huge volume of information, which poses two challenges for recommendation:
Source fusion and ranking
An immensely rich candidate pool means more options, so our recommendation results are produced in two layers: multiple recommendation algorithms make the primary selection, then the sources are fused and ranked. To make the results more objective and accurate, we introduce machine learning models to learn the rules behind users' collective behavior.
Dynamic content classification and semantic correlation
Weibo's UGC content-production model, and the speed with which information spreads and refreshes, mean that training a static classification model is obsolete. We need a good clustering model to aggregate recent information into classes, then establish semantic correlations and complete the recommendation.
Model-based algorithms address the problems above. Below are our two most important pieces of machine learning work:
3.1 CTR/RPM (relationships formed per thousand recommendations) estimation model. The basic algorithm is logistic regression; below is the overall architecture diagram of our CTR estimation model:
This work includes sample selection, data cleaning, feature extraction and selection, model training, online estimation, and ranking. It is worth emphasizing that data cleaning and noise elimination before model training are critical: data quality is the upper bound on algorithm effectiveness, and we have paid the price here before.
Logistic regression is a binary-classification probabilistic model. Its optimization objective is to maximize the product of the probabilities that each sample is classified correctly, i.e. the likelihood. We use the Vowpal Wabbit machine learning platform, originally developed at Yahoo, to solve for the model's feature weights.
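A bare-bones sketch of that training objective, assuming made-up samples with a bias term plus one relevance feature; maximizing the product of correct-class probabilities is done, as usual, via SGD on the log-likelihood:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(samples, dim, epochs=200, lr=0.5):
    """SGD on the log-likelihood; equivalent to maximizing the
    product of per-sample correct-class probabilities."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in samples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            g = y - p  # gradient of the log-likelihood w.r.t. the logit
            w = [wi + lr * g * xi for wi, xi in zip(w, x)]
    return w

# Toy CTR data: features = [bias, relevance_score]; label = clicked.
samples = [
    ([1.0, 0.9], 1), ([1.0, 0.8], 1),
    ([1.0, 0.2], 0), ([1.0, 0.1], 0),
]
w = train_lr(samples, dim=2)
print(round(sigmoid(w[0] + w[1] * 0.85), 2))  # estimated CTR, high
```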
3.2 LFM (Latent Factor Model): LDA, matrix factorization (SVD++, SVDFeature)
LDA was the focus of our 2014 work; it now produces good output and has been applied in live recommendation products. LDA itself is a beautiful and rigorous mathematical model. Here is an example of one of our LDA topics, for reference only.
As for matrix factorization, we made attempts in 2013; the results were not particularly good, and we did not continue investing in it.
Latent semantic models are the single models with the highest recommendation precision. The difficulty is that at large data scale, computational efficiency becomes the bottleneck. We have done some work here, and colleagues will introduce it separately.
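For reference, the core of SVD-style matrix factorization can be sketched with plain SGD; the ratings and hyperparameters below are invented, and this is not the production setup:

```python
import random

def factorize(ratings, n_users, n_items, k=2, epochs=300, lr=0.05, reg=0.02):
    """Plain SGD matrix factorization: approximate r[u][i] by dot(P[u], Q[i])."""
    random.seed(0)
    P = [[random.random() * 0.1 for _ in range(k)] for _ in range(n_users)]
    Q = [[random.random() * 0.1 for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            e = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (e * qi - reg * pu)
                Q[i][f] += lr * (e * pu - reg * qi)
    return P, Q

# (user, item, interaction strength) triples, hypothetical.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1), (2, 1, 5), (2, 2, 2)]
P, Q = factorize(ratings, n_users=3, n_items=3)
pred = sum(P[0][f] * Q[0][f] for f in range(2))
print(round(pred, 1))  # close to the observed value 5
```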
Hybrid techniques
As the saying goes, three cobblers together match a Zhuge Liang: every method has its limitations, and having different algorithms complement each other to combine their value is an extremely effective approach. Weibo recommendation mainly uses the following hybrid techniques:
Time-series mixing:
That is, using different recommendation algorithms in different phases of the recommendation process. Take body-page related recommendation as an example: in the early phase of a microblog's exposure, content-based + CTR prediction generates the results; once enough trustworthy user clicks have accumulated, user-based collaborative filtering takes over, as shown in the following illustration:
In this way, content-based nicely solves the cold-start problem while user-based CF is given full play, achieving a 1+1>2 effect.
Layered model blending:
In many cases a single model cannot produce the desired result, and a layered combination often works better. Layered blending means "the output of the upper-layer model serves as a feature value of the lower-layer model, and the models are trained together to complete the recommendation task." For example, for CTR on the right side of the home page, we use a layered logistic regression model to handle the natural decay of features, differences in sample size, and the effect bias caused by exposure position.
Waterfall mixing:
This hybrid technique is simple: when the recommendation candidate pool is very rich, filter it layer by layer. Fast, low-discrimination algorithms go first to cull the large candidate set; slow, high-discrimination algorithms go last, computing over the small remaining set. This mix is used extensively in Weibo recommendation: various lightweight algorithms perform a rough selection over the candidate set, then CTR estimation does the fine-grained ranking.
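A minimal sketch of the waterfall idea, with invented candidates, a hypothetical cheap filter, and a stand-in for the heavy CTR model:

```python
def cheap_filter(item):
    # Fast, low-discrimination check: e.g. drop very low-quality items.
    return item["quality"] > 0.2

def expensive_score(item):
    # Stand-in for a heavy model such as the CTR estimator.
    return item["quality"] * item["relevance"]

def waterfall(candidates, k=2):
    """Cheap filter first on the full set, slow precise model last."""
    survivors = [c for c in candidates if cheap_filter(c)]
    survivors.sort(key=expensive_score, reverse=True)
    return survivors[:k]

candidates = [
    {"id": 1, "quality": 0.9, "relevance": 0.8},
    {"id": 2, "quality": 0.1, "relevance": 0.9},  # culled by the cheap filter
    {"id": 3, "quality": 0.6, "relevance": 0.9},
    {"id": 4, "quality": 0.5, "relevance": 0.3},
]
print([c["id"] for c in waterfall(candidates)])
```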
Cross mixing:
Sub-techniques from one recommendation algorithm can often be used inside another. For instance, the distance-computation methods accumulated in content-based correlation calculation apply very well to quantization in collaborative filtering. As a practical example, we successfully applied the vector-computation methods developed for LDA to user recommendation.
Online and offline
The characteristics of Weibo data (massive, diverse, with static and dynamic data mixed together) mean that most recommendation products need a combination of online and offline computation. From the standpoint of system and algorithm design, this is a "heavy" versus "light" problem, and decomposing and recombining the computation is the key: time-insensitive heavy computation goes to the offline side, while time-sensitive, light and fast computation goes to the online side. A few of our common approaches are shown below:
Online computation requires simple, reliable algorithms that produce results quickly.
Semi-finished products take the following three forms:
1. Decompose part of the computation offline: for example, compute the user similarities for user-based CF offline, then read them from the database online to complete the user-based recommendation.
2. Mine high-quality candidate sets offline: for example, the content candidate set for the body page; online, the data is fetched through an index, then the recommendation is produced through correlation and CTR-estimated ranking.
3. Precompute highly similar recommendation result sets: for example, compute offline which users have highly similar followers, then respond to user behavior online in real time, supplementing the recommendations with users similar to those the user has just followed.
Static recommendation results are items with little time sensitivity; for example, 95% of our user-recommendation results come from offline computation. The machine learning model case is a temporal decomposition of the computation: model training is completed offline, and the model is called online to rank items; of course, online learning or real-time feature values can also keep the model updated in real time. When the model computes online, attention must also be paid to filling in missing feature values, to keep the offline and online environments consistent.
In addition, some recommendation results are computed entirely online, such as the topic recommendations on the right side. Users differ very little on topics, so this is basically a ranked-list requirement, but popular topics can still be designed carefully: we adopted a dynamic exposure model, using each item's (click gains - exposure costs) over the last period to control its exposure probability in the next period. This achieved very good results, with CTR and diverted traffic both improving more than threefold.
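One possible reading of that dynamic exposure model, with invented statistics and a hypothetical exploration floor:

```python
def exposure_probs(stats, floor=0.05):
    """Turn each item's recent (click gains - exposure costs) into an
    exposure probability for the next time window; a small floor keeps
    every item getting some exploratory exposure."""
    gains = {i: max(s["clicks"] - s["cost"] * s["shows"], 0.0) + floor
             for i, s in stats.items()}
    total = sum(gains.values())
    return {i: g / total for i, g in gains.items()}

# Hypothetical last-period stats: clicks, exposures, per-exposure cost.
stats = {
    "topic_a": {"clicks": 30.0, "shows": 100, "cost": 0.1},
    "topic_b": {"clicks": 5.0,  "shows": 100, "cost": 0.1},
    "topic_c": {"clicks": 12.0, "shows": 100, "cost": 0.1},
}
probs = exposure_probs(stats)
print(max(probs, key=probs.get))  # highest net gain gets most exposure
```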
Different types of recommendation results are paired with different recommendation reasons, which requires a variety of front-end presentation experiments and offline log analysis.
Effect Evaluation
How we measure an algorithm's effect determines the direction of everyone's effort. For different types of recommendation, it is best to use different measurement systems according to the product's positioning and goals. Actual effect evaluation divides into three levels: user satisfaction, product-level metrics (such as CTR), and algorithm-level metrics. Correspondingly, our evaluation divides into three kinds: manual evaluation, online A/B testing, and offline algorithm evaluation.
Product metrics should be formulated according to the product's expected goals and should reflect user satisfaction.
For offline algorithm evaluation, the key is to find a set of reasonable algorithm metrics that fit the product-level metrics: offline evaluation always happens before going live, and the better the fit, the better the offline optimization results convert into online product metrics.
The diagram below shows the architecture of our offline algorithm effect evaluation:
Commonly used offline metrics include RMSE, recall, AUC, inter-user diversity, intra-user diversity, novelty, and so on. Different products use different combinations of metrics: in user recommendation, "inter-user diversity" is very important, whereas for hot topics a greater degree of overlap between users' results is acceptable.
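These offline metrics are standard; here is a self-contained sketch of RMSE, recall@k, and AUC (all example data invented):

```python
import math

def rmse(preds, truths):
    """Root mean squared error between predictions and ground truth."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, truths)) / len(preds))

def recall_at_k(recommended, relevant, k):
    """Fraction of relevant items that appear in the top-k recommendations."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

print(rmse([3.0, 4.0], [3.0, 2.0]))            # sqrt(4/2)
print(recall_at_k(["a", "b", "c"], {"a", "d"}, k=3))
print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # perfect ranking
```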