Written up front: word has it that next week will be xxxxxxxx, which scared me into hurriedly digging up some advertising-related material to read.
I had heard of the GBDT+LR model before, and of the DNN+LR model as well, but had never tried either of them in practice.
The Application of Deep Learning in Ranking on the Meituan-Dianping Recommendation Platform | Original, 2017-07-28, Pan Hui, Meituan-Dianping Technical Team
As the largest local-life service platform in China, Meituan-Dianping spans food, hotels, travel, leisure, entertainment, and other domains; it is committed to helping everyone eat better and live better, and it has hundreds of millions of users and rich user-behavior data. With the rapid development of the business, the numbers of users and merchants on the platform keep growing. Against this background, optimizing the recommendation algorithm lets us serve users content they find interesting and helps them find what they want more quickly and easily. Our goal is to recommend items that match each user's interests and behavior, and to build a recommendation system that is highly accurate, rich, and pleasant to use. To achieve this goal, we keep introducing new algorithms and new technologies into the existing framework.
1. Introduction
Since the 2012 ImageNet competition, deep learning has become the most closely watched technology in machine learning and artificial intelligence. Before deep learning emerged, people used algorithms such as SIFT and HOG to extract discriminative features and then combined them with classifiers such as SVM for image recognition. However, the expressiveness of such hand-crafted features is limited, so the best result at the time still had an error rate above 26%. The first appearance of convolutional neural networks (CNNs) cut the error rate from 26% to 15%, and a Microsoft team later published a paper showing that deep learning could reduce the error rate on the ImageNet 2012 dataset to 4.94%.
In the following years, deep learning made remarkable progress in many application areas, such as speech recognition, image recognition, and natural language processing. Seeing its potential, the major Internet companies all invested resources in research and applications, because people realized that in the big-data era, more complex and more powerful deep models can profoundly uncover the rich information hidden in massive data and make more accurate predictions about future or unknown events.
As an Internet company committed to staying at the technological frontier, we have also explored the field of deep learning: in natural language we apply it to text analysis, semantic matching, and search-engine ranking models; in computer vision we apply it to character recognition, image classification, image-quality ranking, and so on. This article describes how the author's team, drawing on the Wide & Deep Learning idea proposed by Google in 2016 and on the characteristics of our own business, thought through and gained practical experience with the Dianping recommendation ranking system.
2. Introduction to the Dianping recommendation system
Unlike most recommendation systems, the Dianping scenario is complicated by the diversity of its businesses, which makes it difficult to capture a user's interests or real-time intent accurately. Moreover, our recommendation scenarios change with the user's interests, location, environment, and time. The recommendation system mainly faces the following challenges:
Diversity of business forms: besides recommending merchants, we also make real-time judgments based on the scenario and introduce different business forms, such as group deals, hotels, attractions, and free-meal promotions.
Diversity of consumption scenarios: users may consume at home (takeout), in store (group-deal orders, flash deals), or while traveling (hotel booking), and so on.
To address these issues, we built a complete recommendation framework of our own, covering machine-learning-based multi-strategy recall and ranking, as well as a recommendation engine that spans offline computation over large data volumes to high-concurrency online serving. The recommendation strategy is divided into two stages, recall and ranking: recall is responsible for generating the candidate set, and ranking is responsible for personalizing the order of the results produced by the various algorithm strategies.
Recall layer: based on real-time judgments of user behavior, scenario, and so on, we recall different candidate sets through multiple recall strategies, and then fuse them. The candidate-set fusion-and-filtering layer has two roles: it improves the coverage and accuracy of the recommendation strategies, and it also performs filtering, applying manually defined rules from product and operations perspectives to remove ineligible items. Some of the recall strategies we have used:
Model-based collaborative filtering: a set of latent factors links users and items. Each user and each item is represented by a vector, and user u's preference for item i is obtained as the inner product of the two vectors, $\hat r_{ui} = p_u^{\top} q_i$. The key of the algorithm is to estimate the latent-factor vectors of users and items from the known user-behavior data.
Item-based collaborative filtering: we first train word2vec to obtain a latent-space vector for each item, then use cosine similarity to compute the similarity between each item the user has interacted with and each candidate item i they have not, and finally recall the top-N results (a minimal sketch follows this list).
Query-based: a strategy triggered by inferring the user's intent from the real-time information carried in the query (such as geolocation, in-store Wi-Fi, keyword search, navigation search, etc.).
Location-based: a mobile device's position changes constantly, and different locations reflect different user scenarios that specific businesses can leverage. During candidate recall we also trigger corresponding strategies based on the user's current location, workplace, and place of residence.
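As an illustration of the item-based recall above, here is a minimal sketch, assuming the item vectors have already been trained with word2vec; the function name and array shapes are illustrative, not the production code:

```python
import numpy as np

def recall_top_n(user_item_vecs: np.ndarray,
                 candidate_vecs: np.ndarray,
                 n: int = 10) -> np.ndarray:
    """Recall the top-n candidates by cosine similarity between the items
    a user has interacted with and the candidate items they have not."""
    # Normalize rows so that a plain dot product equals cosine similarity.
    u = user_item_vecs / np.linalg.norm(user_item_vecs, axis=1, keepdims=True)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ u.T                  # (num_candidates, num_user_items)
    scores = sims.max(axis=1)       # best match against the user's history
    return np.argsort(-scores)[:n]  # indices of the top-n candidates
```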
Ranking layer: each recall strategy returns some results, which then need to be ordered in a unified way. The Dianping recommendation ranking framework can be broadly divided into three blocks:
Offline computing layer: mainly contains the algorithm suite and the algorithm engine, responsible for data integration, feature extraction, model training, and offline evaluation.
Near-line real-time data stream: mainly subscribes to the various real-time user behavior streams, performs behavior prediction, and uses various data-processing tools to clean the raw logs into formatted data, landing it in different types of storage systems for downstream algorithms and models to use.
Online real-time scoring: extracts the corresponding features according to the user's scenario, and uses various machine-learning algorithms to fuse, score, and re-rank the results of multi-strategy recall.
The overall recommendation flow is shown in the figure below:
Looking at the overall framework: each time a user issues a request, the system writes the data of the current request to the logs; various data-processing tools then clean and format the raw logs and land them in different types of storage systems. At training time, we use feature engineering to select training and test sample sets from the processed data, train the offline models, and evaluate them. We try a variety of machine-learning algorithms and assess their performance with offline metrics such as AUC, NDCG, and Precision. Once an offline model has been trained and evaluated, if it shows a clear improvement on the test set, it goes into an online A/B test. We also maintain reports along multiple dimensions to provide data support for the models.
3. Applying deep learning in the Dianping recommendation ranking system
For the candidate sets produced by the different recall strategies, deciding an item's position solely from the historical performance of its algorithm would be somewhat crude; likewise, within each algorithm, the order of different items is determined by only one or a few factors. Such methods are suitable only for the first, coarse ranking step; the final order needs to be determined by a machine-learning approach, using an appropriate ranking model and combining all the relevant factors.
3.1 The existing ranking framework
So far, the recommendation ranking system has tried a variety of linear, nonlinear, and hybrid machine-learning methods, such as logistic regression, GBDT, and GBDT+LR. We found that, compared with the linear model, traditional nonlinear models such as GBDT do not reliably improve CTR prediction in online A/B tests. Linear models such as logistic regression, on the other hand, have weak nonlinear expressiveness and cannot capture the nonlinearity of real-life scenarios, and they tend to over-memorize historical data. The figure below shows the linear model, through memorization, placing some previously clicked deals at the front of the list:
We can see that the system recommends several distant merchants in the very top positions: these merchants have been clicked by the user before, their own click-through rates are high, and so the system easily recommends them again. But such recommendations do not take the current scenario into account and do not surface novel items to the user. Solving this requires considering more, and more complex, features, such as combination features that replace the simple "distance" feature. Defining and combining features by hand is an expensive process that relies heavily on human experience.
Deep neural networks, working on low-dimensional dense features, can learn relationships between items and features that have never co-occurred before, and they greatly reduce the need for feature engineering compared to linear models, which is what attracted us to explore them.
In practice, following the Wide & Deep Learning model proposed by Google in 2016 and combining it with the needs and characteristics of our own business, we fused a linear-model component with a deep neural network to form a wide & deep learning framework that achieves both memorization and generalization in a single model. The following sections discuss how we did sample screening, feature processing, and the deep-learning implementation.
3.2 Sample screening
Data and features are the two most important aspects of machine learning, because they determine the upper bound of the whole model. Dianping recommendation covers multiple businesses (takeout, merchants, group deals, hotel & travel, etc.) and multiple scenarios (user in store, user at home, requests from another city, etc.), so our sample set is more diverse than that of most other products. Our goal is to predict the user's click behavior: clicked impressions are positive samples and unclicked impressions are negative samples, and during training, samples that led to a purchase are given extra weight. Furthermore, to prevent over- and under-fitting, we keep the proportion of positive samples at roughly 10%. Finally, we clean the training samples to remove noisy ones (pairs of samples whose feature values are nearly or exactly identical yet carry opposite labels).
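A minimal sketch of this screening logic, assuming a pandas DataFrame with `clicked` and `purchased` columns; the column names and the purchase weight are illustrative assumptions, not the team's actual pipeline:

```python
import numpy as np
import pandas as pd

def build_training_set(df: pd.DataFrame,
                       pos_ratio: float = 0.10,
                       purchase_weight: float = 2.0,
                       seed: int = 42) -> pd.DataFrame:
    """Clicked impressions are positives, unclicked ones negatives.
    Negatives are downsampled so positives make up ~pos_ratio of the
    set, and samples that led to a purchase get extra weight."""
    pos = df[df["clicked"] == 1]
    neg = df[df["clicked"] == 0]
    n_neg = int(len(pos) * (1 - pos_ratio) / pos_ratio)
    neg = neg.sample(n=min(n_neg, len(neg)), random_state=seed)
    out = pd.concat([pos, neg]).sample(frac=1.0, random_state=seed)  # shuffle
    out["weight"] = np.where(out["purchased"] == 1, purchase_weight, 1.0)
    return out
```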
At the same time, the recommendation business, as the core module of the app's home page, has very high requirements for novelty and diversity. When implementing the Dianping recommendation system, the first step is to identify the data of the application scenario. Meituan-Dianping's data can be divided into the following categories:
User profile: gender, resident city, price preference, item preference, etc.
Item profile: covers various item types such as merchants, takeout, and group deals. Merchant features include the merchant's price level, number of positive reviews, location, and so on; takeout features include the average price, delivery time, sales volume, etc.; group-deal features include the deal's applicable headcount and its visit-to-purchase conversion rate.
Scenario profile: the user's current location, the time, nearby business districts, contextual scene information based on the user, and so on.
3.3 Feature processing in deep learning
Another core part of machine learning is feature engineering, including data preprocessing, feature extraction, feature selection, and so on.
Feature extraction: the process of constructing new features from raw data. Methods include computing various simple statistics, principal component analysis, and unsupervised clustering. Once the construction method is decided, it can be turned into an automated data-processing pipeline, but the core of feature construction remains manual work.
Feature selection: choosing the few useful features from the many available. Features irrelevant to the learning goal and redundant features need to be removed, and if computing resources are insufficient or the model's complexity is limited, some less important features must also be dropped. Commonly used feature-selection approaches include filter, wrapper, and embedded methods.
Feature selection is costly, and feature construction even more so. In the early days of the recommendation business we did not feel this pain acutely; however, as the business developed, the demands on the CTR model kept growing, and the heavy investment in feature engineering no longer satisfied our needs, so we went looking for a new solution.
Deep learning can automatically combine and transform low-order input features into higher-order features, which also pushed us to explore it. The advantage of "automatic feature extraction" plays out differently in different fields. In image processing, pixels can be fed in as low-order features, and the high-order features produced automatically by convolutional layers work very well. In natural language processing, by contrast, some semantics come not from the data but from people's prior knowledge, and features constructed with prior knowledge help a great deal.
We therefore hoped to use deep learning to save the huge investment in feature engineering, letting the CTR model and the auxiliary models automate feature construction and feature selection while always staying aligned with the business objectives. Below are some of the feature-processing methods we use in deep learning:
3.3.1 Combination features
For feature processing, we used the methods common in industry, such as normalization, standardization, and discretization. It is worth noting, however, that we introduced many combination features into the model training, because combinations of different features are very effective and highly interpretable. For example, we combine "whether the merchant is in the user's resident city", "whether the user is in their resident city", and "the distance between the merchant and the user's current position", and then discretize the result. Through combination features, we can capture the inherent relationships among discrete features and add more nonlinear expressiveness to the linear model. The combination feature is defined as:
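Following the notation of the Wide & Deep paper, a combination feature can be written as the cross-product transformation

$$\phi_k(\mathbf{x}) = \prod_{i=1}^{d} x_i^{c_{ki}}, \qquad c_{ki} \in \{0, 1\},$$

where $c_{ki} = 1$ if the $i$-th (binarized) feature belongs to the $k$-th cross and 0 otherwise; the combined feature is 1 only when all of its constituent features are 1.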
3.3.2 Normalization
Normalization processes the feature matrix row by row; its purpose is to give sample vectors a common standard when similarity is computed by dot product or other kernel functions, that is, to turn them into "unit vectors". In practical engineering, we have used two normalization methods:
Min-max:
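In standard form, with min and max taken per feature:

$$x' = \frac{x - \min}{\max - \min}$$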
where min is the minimum value of the feature and max is its maximum value.
Cumulative distribution function (CDF): the CDF gives, for a value x, the probability that the random variable takes a value less than or equal to x. The formula is:
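$$F_X(x) = P(X \le x)$$

In practice, mapping each continuous feature through its empirical CDF transforms it into a value in (0, 1).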
In our offline experiments, processing continuous features with the CDF gave an offline AUC less than 0.1% higher than min-max. We suspect that some continuous features do not follow the uniform distribution on (0, 1) that the CDF mapping assumes, so in those cases the CDF is less intuitive than min-max; online, we therefore use the min-max method.
3.3.3 Fast Aggregation
To make the model converge faster and give the network better expressiveness, we derive a super-linear and a sub-linear version of each original continuous feature, i.e., for each feature x we derive two sub-features:
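A natural reading of "super-linear" and "sub-linear" (the exact exponents are our assumption here) is

$$x_{\text{super}} = x^2, \qquad x_{\text{sub}} = \sqrt{x}.$$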
Experiments show that introducing these two sub-features for each continuous variable improves AUC; however, considering the cost of online computation, we have not yet added these two sub-features in online experiments.
3.4 Selection of the optimizer (Optimizer)
In deep learning, choosing the right optimizer not only speeds up the whole training process but also helps avoid getting trapped in saddle points during training. Below, based on our own experience, we offer some observations on the common optimizers.
3.4.1 Stochastic Gradient Descent (SGD)
SGD is a common optimization method that computes the gradient on a mini-batch at each iteration and then updates the parameters. The formula is:
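The standard mini-batch update, with learning rate $\eta$ and loss $J$:

$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} J\big(\theta;\, x^{(i:i+n)},\, y^{(i:i+n)}\big)$$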
Its drawbacks are that the loss oscillates rather severely and that it easily converges to a local minimum.
3.4.2 Momentum
To overcome SGD's severe oscillation, Momentum brings the physical concept of momentum into SGD, accumulating momentum in place of the raw gradient. That is:
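The standard momentum update, with momentum coefficient $\gamma$ (typically around 0.9):

$$v_t = \gamma v_{t-1} + \eta \, \nabla_{\theta} J(\theta), \qquad \theta \leftarrow \theta - v_t$$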
Compared to SGD, Momentum is like rolling down a hillside: with no resistance, momentum keeps growing; when resistance is met, the speed decreases. In training, this means updates speed up along dimensions where the gradient direction stays the same and slow down along dimensions where the gradient direction keeps changing, which accelerates convergence and reduces oscillation.
3.4.3 Adagrad
Compared to SGD, Adagrad effectively applies a constraint to the learning rate, namely:
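The standard Adagrad update, where $g_t$ is the gradient at step $t$ and $G_t = \sum_{\tau=1}^{t} g_\tau \odot g_\tau$ accumulates its element-wise squares:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$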
Adagrad's advantage is that in the early stages of training, while $G_t$ is small, the constraint enlarges the effective step and speeds up training. In the later stages, as $G_t$ grows, the denominator grows with it, the updates shrink, and training can end prematurely.
3.4.4 Adam
Adam combines Momentum and Adagrad: it uses momentum to speed up training while also constraining the learning rate. It dynamically adjusts the learning rate of each parameter using first- and second-moment estimates of the gradient. Adam's advantage is that, after bias correction, the learning rate at each iteration stays within a definite range, which keeps the parameter updates stable. The formulas are:
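The standard Adam equations, with decay rates $\beta_1$ and $\beta_2$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first- and second-moment estimates.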
Summary
In our practice, Adam, which combines Adagrad's handling of sparse gradients with Momentum's handling of non-stationary objectives, outperformed the other optimizers. We also note that many papers use SGD or Adagrad as the optimization function; but in practice, SGD needs more training time and can get trapped in saddle points, which limits its performance on much real data.
3.5 Selection of loss function
Deep learning offers a number of loss functions to choose from, such as mean squared error (MSE), mean absolute error (MAE), and cross entropy. In both theory and practice, we found that cross entropy has a clear advantage over the squared-error loss for our sigmoid-output model. The main reason is that when deep learning updates w and b by backpropagation, the derivative of the sigmoid activation falls into the left or right saturation region for most input values, which makes parameter updates very slow. The derivation is as follows:
The generic MSE cost is defined as:
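For a single sample, the standard form is

$$C = \frac{(y - a)^2}{2}$$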
where y is the expected output and a is the neuron's actual output, $a = \sigma(wx + b)$. Because of the way gradients propagate backward in deep learning, the correction formulas for the weight w and bias b are defined as:
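Differentiating the cost above gives the standard gradients (with $z = wx + b$):

$$\frac{\partial C}{\partial w} = (a - y)\,\sigma'(z)\,x, \qquad \frac{\partial C}{\partial b} = (a - y)\,\sigma'(z)$$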
Because of the shape of the sigmoid function, $\sigma'(z)$ is close to zero (saturates) for most values of z.
The formula for cross entropy is:
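For a single sample with true distribution $y$ and predicted distribution $a$, the standard form is

$$H(y, a) = -\sum_i y_i \ln a_i$$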
If there are multiple samples, the average cross entropy over the whole sample set is:
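$$L = -\frac{1}{n} \sum_{k=1}^{n} \sum_{i} y_{ki} \ln a_{ki}$$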
where n is the number of samples (indexed by k above) and i indexes the categories. When used for logistic (binary) classification, the formula above simplifies to:
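$$C = -\frac{1}{n} \sum_{x} \big[\, y \ln a + (1 - y) \ln(1 - a) \,\big]$$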
Compared with the squared loss, the cross-entropy function has a very nice property:
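Differentiating the binary cross entropy with $a = \sigma(z)$ gives the standard result:

$$\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_{x} x_j \big(\sigma(z) - y\big), \qquad \frac{\partial C}{\partial b} = \frac{1}{n} \sum_{x} \big(\sigma(z) - y\big)$$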
As you can see, there is no $\sigma'(z)$ term, so updates of w and b are not affected by saturation: when the error is large, the weights update quickly; when the error is small, they update slowly.
3.6 The wide & deep model framework
At the beginning of our experiments, we compared only a standalone five-layer DNN with the linear model. Comparing offline/online AUC, we found that the plain DNN gave no significant improvement in CTR. The standalone DNN also has intrinsic bottlenecks: for example, for an inactive user with few interactions with items, the feature vectors are very sparse, and a deep model may over-generalize in this situation, recommending items that are too weakly related to the user. We therefore combined the wide linear model with the deep model, including some combination features, to better capture the joint item-feature-label relationship. Our hope was that the wide linear part of the wide & deep model would use cross features to effectively memorize the interactions among sparse features, while the deep neural network part would improve the model's generalization by mining interactions between features. The figure below shows our wide & deep learning model framework:
In the offline phase, we use Keras, with Theano and TensorFlow backends, as the model engine. Before training, we clean and weight the sample data. For features, we normalize continuous features with the min-max method; for cross features, we work with the business side to distill several crosses that are significant in our business scenarios. In the model, we use Adam as the optimizer and cross entropy as the loss function. One difference from the Wide & Deep paper is that during training we feed the combination features, as part of the input layer, into both the deep component and the wide component. All input data then passes through 3 ReLU layers in the deep part and is finally scored through a sigmoid layer. Our wide & deep model is trained on more than 70 million training samples and evaluated offline on more than 30 million test samples. Our batch size is set to 50,000 and the number of epochs to 20.
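A minimal Keras sketch of this setup, written with today's tf.keras API; the feature widths are illustrative assumptions, the 256->128->64 tower matches the structure reported in Section 4, and this is a reconstruction, not the team's actual code:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative feature widths; the article does not state them.
N_DEEP, N_WIDE = 200, 1000

deep_in = keras.Input(shape=(N_DEEP,), name="deep_features")
wide_in = keras.Input(shape=(N_WIDE,), name="wide_features")  # incl. cross features

# The deep tower: 3 ReLU layers.
x = layers.Dense(256, activation="relu")(deep_in)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)

# Joint wide & deep: concatenate the deep tower with the wide (linear) part,
# then score everything through a single sigmoid unit.
merged = layers.concatenate([x, wide_in])
out = layers.Dense(1, activation="sigmoid")(merged)

model = keras.Model(inputs=[deep_in, wide_in], outputs=out)
model.compile(optimizer="adam",              # Adam, as described above
              loss="binary_crossentropy",    # cross-entropy loss
              metrics=[keras.metrics.AUC()])

# Training as described: weighted samples, batch size 50000, 20 epochs.
# model.fit([deep_X, wide_X], y, sample_weight=w, batch_size=50000, epochs=20)
```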
4. Offline/online results of deep learning
In the experimental stage, we ran a series of comparisons among deep learning, wide & deep learning, and logistic regression, and put the best wide & deep model online in an A/B test against the existing base model. The results show that the wide & deep model performs better both offline and online. The concrete conclusions are as follows:
As the hidden layers get wider, the offline training metrics improve gradually. However, considering the performance of real-time online prediction, we currently use the 256->128->64 structure.
Below is a comparison of the offline experimental results of the wide & deep model with combination features against the base model:
In terms of online effects, the wide & deep model solves the earlier problem of recalling distant, previously clicked deals, and at the same time it recommends some novel items based on the current scenario.
5. Summary
Ranking is a classic machine-learning problem, and achieving both memorization and generalization in one model is a challenge for recommender systems. Memorization can be defined as reproducing historical data in the recommendations, while generalization, based on the transitivity of data correlations, explores items that have rarely or never occurred in the past. The wide linear part of the wide & deep model uses cross features to effectively memorize interactions among sparse features, while the deep neural network improves the model's generalization by mining interactions between features. Online experiments show that the wide & deep model brings a clear improvement in CTR. At the same time, we are exploring a series of further model evolutions:
Incorporate RNNs into the existing framework. The existing wide & deep model simply blends a DNN with a linear model and does not model changes over time. The chronological order of samples matters for recommendation ranking too: for example, when a user has just browsed several hotels and attractions in another city and then issues another request for that city, the food around those attractions should be surfaced.
Introduce reinforcement learning, so that the model can recommend content dynamically according to the user's scenario.
Fuse deep learning with logistic regression so as to combine the advantages of both, laying a solid foundation for the design and optimization of the next CTR prediction model.
6. References
H. Cheng, L. Koc, J. Harmsen, et al. Wide & Deep Learning for Recommender Systems. ACM. https://static.googleusercontent.com/media/research.google.com/zh-cn//pubs/archive/45530.pdf
P. Covington, J. Adams, E. Sargin. Deep Neural Networks for YouTube Recommendations. RecSys '16: Proceedings of the 10th ACM Conference on Recommender Systems. https://arxiv.org/pdf/1606.07792.pdf
H. Wang, N. Wang, D. Yeung. Collaborative Deep Learning for Recommender Systems.
7. About the author
Pan Hui, senior algorithm engineer. After completing his Ph.D., he joined Microsoft in 2015, working mainly on natural language processing. He joined Meituan-Dianping in December 2016 and is now responsible for the Dianping recommendation ranking business, committed to using big data and machine-learning technology to solve business problems and improve the user experience.
Search & Recommendation Technology Center: responsible for building Dianping's basic search framework and general recommendation platform, optimizing the end-to-end user experience of search lists, improving the accuracy and novelty of the recommendation slots through big data and AI technology, and building an intelligent technology platform that supports the smart-feature needs of the Dianping side of the business. Our mission is to use search and recommendation technology to effectively connect people, merchants, and services, helping users discover information and content accurately and efficiently, optimizing the user experience, expanding user demand, and promoting business development.
The Meituan-Dianping search & recommendation team is recruiting machine-learning and data-mining talent of all kinds, based in Shanghai. We analyze big data with cutting-edge technologies such as machine learning, data mining, and real-time stream analysis, mine all kinds of user features, and provide users with the best personalized search/recommendation experience. You will model users from multiple angles and continuously optimize machine-learning models to improve the user experience of Dianping search. Recommendations and applications are welcome at [email protected].