I recently read Xavier Amatriain's "Lessons Learned from Building ML Systems" and "More Lessons Learned from Building Real-Life Machine Learning Systems" (Quora). Both left a deep impression and resonated strongly with me. So in this post I combine the essence of those talks with the pitfalls my teammates and I have run into in our own work, along with some of the solutions we found. I hope it helps you avoid the pits we once fell into.
Before my current job I worked on e-commerce recommendation; now I work on recruitment recommendation, which covers recommendations for both candidates and recruiters. The two businesses are very different, but the work is much the same and comes down to four aspects: data, features, models, and evaluation. I will share our experience along these four aspects. The data section covers commonly used data types, the negative effect of abnormal data on model training and how to avoid it, and the effect of imbalanced positive and negative samples on training together with some remedies. The feature section introduces principles of feature selection and some basic feature processing methods. The model section is mainly about the trend toward multi-model fusion. The evaluation section covers the three offline metrics I use most often, plus model debugging. Some examples come from e-commerce and others from recruitment recommendation.
Data
Explicit data & implicit data
Data can be categorized in many ways, and "explicit data" versus "implicit data" is one of them. Explicit data directly reflects a user's preferences, such as movie ratings, likes, and dislikes. Implicit data reflects preferences less directly, such as browsing, adding to favorites, or adding to the shopping cart. Sometimes the preference a user states directly is not their real preference; in other words, "words can deceive, but long-term behavior does not." Table 1 lists the differences. In practice, implicit data is clearly what we use most.
Table 1. Comparison between explicit and implicit data
Junk in, junk out
Let's look at the data first. We do big data processing, so data is our foundation. There is a popular saying in statistics, "junk in, junk out": if your input data is junk, you can only expect junk as output. We should therefore always consider how anomalous data affects the results. Here are two examples.
The "bought together" module, also known as "partner portfolio" or "best partner", is shown in Figure 1. As the name suggests, it recommends other items that go well with the current item and could be bought together with it. The algorithm behind this module is association rules. Mathematically, "bought together" is the conditional probability P(X|Y), where Y is the current product and X is another product suitable for buying with it: given that the user wants to buy Y, P(X|Y) is the probability that they also buy X. Sorting the candidate items by this probability from high to low and taking the top N gives the result shown to the user. The module is computed from users' order data. As we all know, fake orders placed to inflate sales ("order brushing") and hoarding are widespread. These behaviors are the "anomalies" in our order data, and if left untreated they will inevitably affect the recommendations.
Figure 1. The "partner combination" module for a book
Let's start with order brushing. It is usually understood as inflating sales volume, which at worst distorts the bestseller list. But when analyzing the data we found a more sophisticated variant, "co-purchase brushing": repeatedly buying the book being promoted together with a bestseller. Unlike plain sales brushing, this does not require a large number of fake orders, yet the book gets many chances to appear in front of users, because it shows up in the bestseller's "bought together" module. For example, suppose bestseller A has 1000 orders, of which 100 contain the cheating book B, 50 contain book C, and 35 contain book D. Then P(B|A) is the largest, so B is placed in the first recommendation slot. That is definitely not the result we want.
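The ranking in this example can be reproduced with a minimal sketch. The function name `rank_bought_together` and the toy order list are hypothetical and only mirror the numbers in the text; a real module would read order data from storage and would likely use a proper association rule library.

```python
from collections import Counter


def rank_bought_together(orders, current_item, top_n=3):
    """Rank other items by the estimated P(X | current_item)."""
    orders_with_current = 0
    co_counts = Counter()
    for basket in orders:
        items = set(basket)
        if current_item not in items:
            continue
        orders_with_current += 1
        # Count every other item that shares an order with current_item.
        for other in items - {current_item}:
            co_counts[other] += 1
    if orders_with_current == 0:
        return []
    # Sort by count (descending), then by name for a stable order.
    ranked = sorted(co_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(item, count / orders_with_current) for item, count in ranked[:top_n]]


# Toy data (assumed): 1000 orders contain A; 100 also contain B, 50 C, 35 D.
orders = [["A", "B"]] * 100 + [["A", "C"]] * 50 + [["A", "D"]] * 35 + [["A"]] * 815
ranking = rank_bought_together(orders, "A")
print(ranking)
```

With these counts the brushed book B comes out on top with an estimated probability of 0.1, ahead of C and D, which is exactly how a modest number of fake co-purchases captures the first slot.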
Having discussed order brushing, let's look at hoarding. The shopping festivals invented by e-commerce companies have produced wave after wave of compulsive shoppers. They often fill their shopping carts ahead of time, perhaps with mother-and-baby products, menswear, children's books, literature, social science titles, and so on, and then place a single order on the day of the festival. If these orders are used to compute association rules, items that are not actually related become associated, which also hurts our recommendations.
Of course, these are only two kinds of anomaly; different business scenarios have others that affect model computation. In resume recommendation, for example, there is bulk viewing of resumes (for the module shown in Figure 2), and in position recommendation there are batch applications. Such batch behavior affects what the model learns.
So how do we deal with data that makes the learned results unsatisfactory? We usually use one of two methods. The first works on the data itself: find a way to remove the "anomalies" that harm the results, for example by removing cheating orders and the huge orders placed on Double 11 and similar festivals. The second is to down-weight the anomalous data during computation so that its influence on the result is attenuated as much as possible. For example, when computing association rules we count pairwise co-occurrences of items, and we can decay each count appropriately according to the number of items in the order.
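The second method can be sketched as follows. The decay function `1 / log2(n)` and the optional `max_items` cutoff are assumptions picked for illustration; the text only says the count should decay with the number of items in the order, not which function to use.

```python
import math
from collections import defaultdict
from itertools import combinations


def weighted_cooccurrence(orders, max_items=None):
    """Pairwise co-occurrence counts in which large orders contribute less.

    Each pair in an order with n distinct items adds 1 / log2(n) instead of 1,
    so a two-item order counts fully and a hoarding order counts much less.
    Orders with more than max_items distinct items (if given) are dropped.
    """
    counts = defaultdict(float)
    for basket in orders:
        items = sorted(set(basket))
        n = len(items)
        if n < 2 or (max_items is not None and n > max_items):
            continue
        weight = 1.0 / math.log2(n)
        for a, b in combinations(items, 2):
            counts[(a, b)] += weight
    return dict(counts)


small_order = ["A", "B"]
hoarding_order = ["A", "B", "C", "D", "E", "F", "G", "H"]
counts = weighted_cooccurrence([small_order, hoarding_order])
print(counts[("A", "B")], counts[("C", "D")])
```

Here the pair (A, B) gets full weight from the small order plus one third from the eight-item order, while pairs that only co-occur in the large order get one third. Passing `max_items` combines this with the first method by discarding very large orders outright.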
Figure 2. The "Your peers also viewed the following resumes" module on the resume details page
Imbalanced samples
The problem of imbalanced positive and negative samples should be familiar to everyone. In e-commerce recommendation, the items a user clicks are always far fewer than the items displayed. In recruitment recommendation, the positions a user clicks are likewise a small fraction of those shown, and the same holds for most other Internet applications. As a result, positive samples are far fewer than negative samples.
Ideally, we would like the ratio of positive to negative samples to be 1:1 during training. Why? Take an extreme example: the ratio is 1:99 and the model's goal is to minimize the misclassification rate. If you were the "model", what would you do with a new sample? You would label it negative without thinking, because predicting everything as negative already guarantees 99% accuracy. When the ratio is 1:1, the model cannot take such a shortcut. It has to "calm down" and learn which features drive clicks and which fail to arouse the user's desire to click.
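The 1:99 thought experiment is easy to check numerically. This is a small sketch with made-up labels; `recall` is added here (it is not discussed in the text) only to show what the 99% accuracy hides.

```python
def accuracy(labels, predictions):
    """Fraction of samples whose prediction matches the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def recall(labels, predictions):
    """Fraction of true positives (label 1) that were predicted as 1."""
    on_positives = [p for y, p in zip(labels, predictions) if y == 1]
    return sum(on_positives) / len(on_positives)


labels = [1] * 1 + [0] * 99            # positive:negative = 1:99
always_negative = [0] * len(labels)     # the "lazy" model

print(accuracy(labels, always_negative))  # 0.99
print(recall(labels, always_negative))    # 0.0
```

The lazy model reaches 99% accuracy while finding none of the positive samples, which is why accuracy alone is a poor training signal on imbalanced data.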
Reality falls short of the ideal, so how should we handle sample imbalance in actual work? There are two main methods. The first starts from the cost function. Taking the SVM cost function as an example, we can penalize errors on positive and negative samples to different degrees, as in Equation 1. This lets the model "know" that we are "biased" toward the positive samples, which in turn affects the learned weights w.
Equation 1. Addressing sample imbalance through the cost function
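For reference, one standard way to write a class-weighted soft-margin SVM is given below. The symbols C_+ and C_- (separate penalty coefficients for positive and negative samples) and the slack variables ξ_i are assumptions chosen here and may differ from the notation of the formula this caption refers to.

```latex
\min_{w,\,b,\,\xi}\quad
  \frac{1}{2}\lVert w \rVert^{2}
  + C_{+} \sum_{i:\,y_i = +1} \xi_i
  + C_{-} \sum_{i:\,y_i = -1} \xi_i
\qquad \text{s.t.}\quad
  y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i,\quad \xi_i \ge 0 .
```

Setting C_+ = C_- recovers the ordinary soft-margin SVM. A common convention is to give the rarer class the larger coefficient, so that errors on the few positive samples cost more than errors on the many negative ones.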
In practice we rely heavily on open source tools, where changing the cost function is not very convenient. The other, more direct approach starts from the samples and manually adjusts the ratio of positive to negative samples, mainly through downsampling and oversampling. Downsampling is applied to the class with more samples, and oversampling to the class with fewer. Because negative samples far outnumber positive samples in most cases, I describe downsampling of negative samples and oversampling of positive samples. Note, however, that manually adjusting the samples is not free. The possible costs of the two methods are analyzed below.
Negative downsampling, as the name implies, keeps only a sample of the negative samples. It should be done in moderation: downsampling all the way to a 1:1 ratio is going too far. Downsampling can be random, or it can be targeted by conditions derived from the business. Discarding negative samples may cause the final model to deviate from the true result, as shown in Figure 3. So we need to weigh the bias introduced by sampling against the shortcomings of training on the original samples, and choose whichever is better for the business.
Figure 3. Influence of sampling on the result
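Random downsampling of negative samples can be sketched as below. The function names and the `keep_rate` parameter are hypothetical. The `recalibrate` helper is not part of the text above: it follows the re-calibration formula q = p / (p + (1 - p) / w) described in the Facebook paper listed as [4], where w is the fraction of negatives kept, and is one way to reduce the deviation that downsampling introduces into predicted probabilities.

```python
import random


def undersample_negatives(samples, keep_rate, seed=0):
    """Keep every positive sample; keep each negative with probability keep_rate.

    samples is a list of (features, label) pairs with label 1 or 0.
    """
    rng = random.Random(seed)
    return [(x, y) for x, y in samples if y == 1 or rng.random() < keep_rate]


def recalibrate(p, keep_rate):
    """Map a probability predicted on downsampled data back to the original data."""
    return p / (p + (1.0 - p) / keep_rate)


samples = [(i, 1) for i in range(10)] + [(i, 0) for i in range(990)]
sampled = undersample_negatives(samples, keep_rate=0.1)
n_pos = sum(1 for _, y in sampled if y == 1)
n_neg = sum(1 for _, y in sampled if y == 0)
print(n_pos, n_neg)
print(recalibrate(0.5, keep_rate=0.1))
```

A model trained on the downsampled set sees roughly ten times fewer negatives, so its raw probabilities are too high; a prediction of 0.5 in the downsampled space corresponds to about 0.09 in the original space under this correction.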
Having covered downsampling of negative samples, let's look at oversampling and its possible consequences. Oversampling can also be called "interpolation": new positive samples are added on top of the existing ones according to some principle. The simplest way is to replicate some of the positive samples several times. This is not arbitrary copying; it is usually guided by the business. In recruitment recommendation, for example, if a position was clicked and the user then applied for it, that sample can be strengthened appropriately. This may cause the problem shown in Figure 4: because some positive samples are strengthened, the model fits them too closely, which leads to overfitting.
Figure 4. Effect of simple replication of positive samples on training results
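Business-guided replication of positive samples might look like the sketch below. The sample fields `label` and `applied` and the choice of 3 copies are hypothetical; the text only says that a clicked position the user also applied for can be strengthened appropriately.

```python
def copies_by_action(sample):
    """How many times a sample should appear in the training set (assumed rule)."""
    if sample["label"] != 1:
        return 1            # negatives are left untouched
    if sample.get("applied"):
        return 3            # clicked and applied: strengthen the sample
    return 1                # clicked only


def replicate_positives(samples, copies_for=copies_by_action):
    """Return a new list in which each sample is repeated copies_for(sample) times."""
    out = []
    for sample in samples:
        out.extend([sample] * copies_for(sample))
    return out


samples = [
    {"id": "p1", "label": 1, "applied": True},
    {"id": "p2", "label": 1, "applied": False},
    {"id": "n1", "label": 0},
]
training_set = replicate_positives(samples)
print([s["id"] for s in training_set])  # ['p1', 'p1', 'p1', 'p2', 'n1']
```

Because the copies are exact duplicates, the model sees the same point several times, which is the source of the overfitting risk described above.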
A somewhat cleverer method is to interpolate a number of "pseudo positive samples". For purely numerical features you can consider median interpolation, nearest-neighbor interpolation, and similar methods. In my experience the interpolation is usually business-driven, for example by relabeling negative samples that are highly similar to positive samples as positive. This method also has a problem, shown in Figure 5. It still deviates from the ideal model, but it is much better than simple replication, and overfitting is relatively mitigated.
Figure 5. Effect of adding pseudo positive samples on training results
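For purely numerical features, nearest-neighbor interpolation can be sketched as follows. This is a simplified illustration in the spirit of SMOTE, not the method used by the author: the function name is hypothetical, each new point lies on the segment between a randomly chosen positive sample and its single nearest positive neighbor, and at least two positive samples are required.

```python
import random


def interpolate_positives(positives, n_new, seed=0):
    """Create n_new pseudo positive samples between existing positive samples.

    positives is a list of equal-length numeric tuples (at least two of them).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(positives)
        # Nearest other positive sample by squared Euclidean distance.
        neighbor = min(
            (p for p in positives if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        t = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


positives = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
pseudo = interpolate_positives(positives, n_new=5)
print(pseudo)
```

Unlike plain replication, the new points are not exact copies, which is one reason overfitting tends to be less severe with this kind of oversampling.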
To sum up, most real-world samples are imbalanced, sometimes severely. Models trained directly on such samples mostly do not give the results we expect, so we often need to resample. Resampling inevitably introduces bias, because it changes the sample distribution so that it no longer matches the real situation online. But as long as the benefit of sampling clearly outweighs the error it introduces, sampling is still worthwhile.
Features
Features are a very important part of model training. Feature selection is generally tightly coupled to the business, and its quality directly affects the accuracy of the model's predictions. So what principles should guide feature selection? In his talk, Xavier Amatriain mentions that good features should have the following four qualities:
Reusable: a good feature should not be used only once
Transformable: the more flexible, the better
Interpretable: the chosen feature is meaningful to the business, otherwise it is just noise
Reliable: easy to monitor and debug
In addition, the features in a feature set should be as independent of one another as possible, or at least only weakly correlated.
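One simple way to check the independence requirement for numerical features is to look at pairwise Pearson correlation. This is a minimal sketch: the feature names, the toy values, and the 0.9 threshold are assumptions, and Pearson correlation only detects linear dependence.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant feature carries no linear relationship
    return cov / (std_x * std_y)


def highly_correlated_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for feature pairs with |r| >= threshold."""
    names = sorted(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


features = {
    "price": [1, 2, 3, 4],
    "price_cents": [100, 200, 300, 400],   # redundant with price
    "clicks": [3, 1, 4, 1],
}
redundant = highly_correlated_pairs(features)
print(redundant)
```

In this toy data only `price` and `price_cents` are flagged, so one of the two could be dropped without losing information.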
There are two common kinds of features: numerical features and text features. Numerical features involve basic operations such as outlier handling, discretization, and normalization. Text features involve word segmentation and keyword extraction. Raw features can be used directly for training, but in many cases we transform them according to the application, which is what we call "feature engineering."
Take bioinformatics as an example (I did bioinformatics research as a student). A feature of a sample is usually a gene, and the number of genes is enormous, so the "curse of dimensionality" is common in this field. Such applications generally apply dimensionality reduction, both to reduce the number of dimensions and to extract a feature set that matters more to the model, is more expressive, and is relatively independent. PCA and LDA are commonly used.
In other scenarios there are too few raw features and the samples are not linearly separable. Here we may need to raise the dimensionality so that the samples become linearly separable in a higher-dimensional space. The kernel function in SVM is an example: it maps samples that are not linearly separable in low dimensions into a high-dimensional space where they are.
In yet other applications the raw features are not expressive enough and need to be transformed to become more expressive. In deep learning on images, for example, pixels are processed into edges, which are abstracted into object parts and finally into object models. In practice, choose according to the specific business.
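The three basic operations on numerical features mentioned at the start of this passage (outlier handling, normalization, discretization) can be sketched as below. The function names, the quantile-based clipping rule, and the bucket boundaries are assumptions for illustration; real pipelines usually fit these parameters on training data only.

```python
import bisect


def clip_outliers(values, lower_q=0.01, upper_q=0.99):
    """Clamp values to the range between the given lower and upper quantiles."""
    ordered = sorted(values)
    lo = ordered[int(lower_q * (len(ordered) - 1))]
    hi = ordered[int(upper_q * (len(ordered) - 1))]
    return [min(max(v, lo), hi) for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, boundaries):
    """Map each value to a bucket index: the number of boundaries <= value."""
    return [bisect.bisect_right(boundaries, v) for v in values]


prices = [10, 12, 11, 13, 12, 5000]           # 5000 is an obvious outlier
clipped = clip_outliers(prices, lower_q=0.0, upper_q=0.8)
print(clipped)                                 # [10, 12, 11, 13, 12, 13]
print(min_max_normalize(clipped))
print(discretize(clipped, boundaries=[11, 13]))  # [0, 1, 1, 2, 1, 2]
```

Clipping before normalizing matters here: without it, the single value 5000 would squeeze all the other prices into a tiny range near 0.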
Another important part of feature processing is appropriate human intervention. Proper manual intervention gives leverage over the results of model training. The model does not understand the business; people do. All the model can do is learn from the cost function and the samples and find the best fit for the current samples. Machine learning practitioners should therefore intervene in and "guide" the features where needed, for example by using hard rules to discard noisy features. Judging what is noise depends on understanding the business. When I worked on e-commerce recommendation we did exactly this. Recommendation quality had hit a bottleneck, so the whole team took the word segmentation results for product titles, category information, and so on, removed low-frequency words and overly frequent words, and manually filtered the remaining words once more. It nearly blinded us. But after filtering, we not only had more features, their quality was also better, and the model improved significantly.
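The word filtering described here can be sketched roughly as follows. The thresholds `min_df` and `max_df_ratio`, the function name, and the toy titles are hypothetical; the manual review step is represented only by a blocklist that a person would have to compile.

```python
from collections import Counter


def filter_vocabulary(tokenized_titles, min_df=2, max_df_ratio=0.8, manual_blocklist=()):
    """Keep words that are neither too rare nor too common, minus a manual blocklist.

    tokenized_titles is a list of token lists, one per product title.
    """
    n_docs = len(tokenized_titles)
    doc_freq = Counter()
    for tokens in tokenized_titles:
        doc_freq.update(set(tokens))   # count each word once per title
    blocked = set(manual_blocklist)
    return {
        word
        for word, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df_ratio and word not in blocked
    }


titles = [
    ["new", "python", "book"],
    ["new", "java", "book"],
    ["new", "python", "course"],
    ["new", "free", "shipping", "book"],
    ["new", "java", "course"],
]
vocabulary = filter_vocabulary(titles, manual_blocklist=("course",))
print(sorted(vocabulary))  # ['book', 'java', 'python']
```

In this toy data "new" is dropped for appearing in every title, "free" and "shipping" for appearing only once, and "course" because the manual review rejected it.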
Of course, someone will say, "Don't worry about the features. Throw them all at the model and let training find the good ones." That idea is too young, too naive. Model training is just a tool. It is not Aladdin's lamp, able to grant every wish, and it is not a cow that turns whatever grass you feed it into milk. Only if you give the model high-quality input can it return a good result.
Model
A model is learned from three elements: training samples, an objective function, and evaluation metrics. Its effectiveness depends directly on how these three are designed. Different objective functions give rise to different machine learning algorithms.
When it comes to models, we have to mention overfitting and underfitting. Overfitting means fitting the training samples too closely, so that generalization is weak and predictions on new data deviate substantially. Underfitting means the model is not expressive enough to "describe" the samples well.
Figure 6. Underfitting, good fit, and overfitting
To make sure the model fits the training samples well without losing the ability to generalize, we do cross-validation: part of the samples is used for training and the rest for testing. We also generally use regularization to prevent overfitting. The common choices are L1 regularization, also known as Lasso, and L2 regularization, also known as Ridge. Taking least squares as the example, the cost functions with L1 and L2 regularization are shown in Equation 2:
Equation 2. Cost functions with L1 and L2 regularization
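For reference, the standard textbook forms of these cost functions can be written as follows. The symbols (n samples with feature vectors x_i and targets y_i, regularization strength λ, and radius C) are assumptions chosen here and may differ from the notation of the formula this caption refers to.

```latex
\begin{align*}
\text{L1 (Lasso):}\quad
  J_{1}(w) &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w^{\top}x_i\right)^{2}
              + \lambda \lVert w \rVert_{1}, \\
\text{L2 (Ridge):}\quad
  J_{2}(w) &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w^{\top}x_i\right)^{2}
              + \lambda \lVert w \rVert_{2}^{2}, \\
\text{constrained form:}\quad
  &\min_{w}\ \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w^{\top}x_i\right)^{2}
  \quad \text{s.t.}\quad \lVert w \rVert_{1} \le C
  \ \ \text{or}\ \ \lVert w \rVert_{2} \le C .
\end{align*}
```

The penalized and constrained forms are two views of the same idea: for each λ there is a corresponding radius C that gives the same solution, which is why the constraint can be pictured as a norm ball.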
In other words, we restrict the model space to a norm ball in w. For ease of understanding, this part on regularization follows the explanation in ref [3], which I find very good. Consider only the two-dimensional case. We draw the contours of the objective function on the (w1, w2) plane, and the constraint becomes a norm ball of radius C on that plane. The first point where a contour touches the norm ball is the optimal solution. The difference between the L1-ball and the L2-ball is that the L1-ball has a "corner" where it meets each axis, and unless the contours of the objective function are positioned just right, they will mostly touch the ball at a corner first. Corners are sparse positions; the intersection in the figure, for instance, has w1 = 0. In higher dimensions (imagine what a three-dimensional L1-ball looks like), many edges besides the corners are also likely to be the first point of contact, and they produce sparsity too. The L2-ball has no such property: it has no corners, so the probability that the first contact occurs at a sparse position is very small. This explains intuitively why L1 regularization produces sparsity and L2 regularization does not.
Figure 7. Illustration of L1 and L2 regularization
In one sentence: L1 tends to keep a small number of features and set the weights of the others to 0, while L2 keeps more features with weights that are all close to 0. Lasso is therefore very useful for feature selection, whereas Ridge is simply a regularizer.
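The contrast can be seen in the simplest possible setting: a single weight whose unregularized optimum is z. This sketch assumes the one-dimensional objectives written in the docstrings, for which the minimizers have closed forms (soft thresholding for L1, proportional shrinkage for L2); the example values are made up.

```python
def lasso_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * abs(w): soft thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small weights are set exactly to zero


def ridge_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + 0.5 * lam * w**2: proportional shrinkage."""
    return z / (1.0 + lam)


unregularized = [3.0, 0.5, -0.2, -2.0]
lasso = [lasso_shrink(z, 1.0) for z in unregularized]
ridge = [ridge_shrink(z, 1.0) for z in unregularized]
print(lasso)  # [2.0, 0.0, 0.0, -1.0]
print(ridge)  # [1.5, 0.25, -0.1, -1.0]
```

L1 zeroes out the two small weights entirely, while L2 makes every weight smaller but leaves none at exactly zero.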
One more thing about models: the fusion of multiple models. "Everything is an ensemble." As the proverb goes, three cobblers together can match Zhuge Liang: by combining the strengths of different algorithms and avoiding their weaknesses, we can build a very powerful model. Many Internet companies adopt model fusion once a single model hits a bottleneck, and GBDT+LR is a popular choice. Fundamentally, fusing models amounts to feeding the output of one model into another model as input, so that the first model plays the role of feature transformation.
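The "first model as feature transformation" idea behind GBDT+LR can be illustrated with a toy sketch. Everything here is a stand-in: the two hand-written functions play the role of trained boosted trees, the feature names `price` and `past_clicks` are hypothetical, and the LR weights are made up rather than learned.

```python
import math


def tree_price(sample):
    """Stand-in for a trained tree with 3 leaves; returns the leaf index."""
    if sample["price"] < 20:
        return 0
    return 1 if sample["price"] < 100 else 2


def tree_history(sample):
    """Stand-in for a trained tree with 2 leaves; returns the leaf index."""
    return 0 if sample["past_clicks"] == 0 else 1


TREES = [(tree_price, 3), (tree_history, 2)]   # (tree, number of leaves)


def leaves_to_features(sample, trees=TREES):
    """One-hot encode the leaf each tree assigns to the sample and concatenate."""
    features = []
    for tree, n_leaves in trees:
        one_hot = [0] * n_leaves
        one_hot[tree(sample)] = 1
        features.extend(one_hot)
    return features


def lr_predict(features, weights, bias):
    """Logistic regression on top of the tree-derived features."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


sample = {"price": 50, "past_clicks": 3}
x = leaves_to_features(sample)
print(x)  # [0, 1, 0, 0, 1]
print(lr_predict(x, weights=[-1.0, 0.5, 0.2, -0.8, 1.1], bias=-0.3))
```

Each tree turns raw values into a categorical "which leaf" signal, and the linear model then only has to learn one weight per leaf.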
Evaluation
A model generally needs to be evaluated after training. Besides offline metrics, we need a debug tool that simulates the online environment and visualizes the model's results.
The offline metrics I commonly use are AUC, calibration, and normalized entropy (NE). Table 2 gives the calculation for each. Everyone is familiar with AUC, so I will not say more about it here. Calibration is the ratio between predicted clicks and actual clicks; the closer it is to 1, the more accurate the prediction. Calibration complements AUC, because AUC only cares about the ranking, not about the predicted values themselves.
The numerator of NE is the cross entropy, which is also the cost function of LR (y is the sample label, taking the value 1 or -1; p_i is the predicted click probability of sample i). The denominator is the information entropy of the original samples (p is the probability, or more precisely the frequency, of positive samples), that is, the uncertainty of the original samples. So NE is the ratio of the uncertainty that remains with the model's help to the uncertainty without the model, and 1 - NE is the relative information gain: the share of uncertainty the model eliminates for us. Put more plainly, without a model the uncertainty about whether a sample is positive or negative is large and we cannot easily tell. With a model we get a predicted click-through rate that makes the judgment easier, so the uncertainty drops. In other words, the model brings us more information, which is the main purpose of training it.
So why use three offline metrics? At first I used only AUC, but we later found that even when AUC looked good, the online results were not necessarily satisfactory. We therefore added two more metrics. Evaluating several metrics together is more reliable and, to a certain extent, safeguards performance after launch.
Table 2. Calculation of the three evaluation metrics
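The three metrics can be computed as in the sketch below. The definitions of calibration (expected clicks divided by observed clicks) and normalized entropy follow the Facebook paper listed as [4]; the function names and toy data are assumptions, labels are 1 or -1 as in the text, and the pairwise AUC is written for clarity rather than speed.

```python
import math


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration(labels, scores):
    """Expected number of clicks divided by the observed number of clicks."""
    return sum(scores) / sum(1 for y in labels if y == 1)


def normalized_entropy(labels, scores):
    """Average log loss divided by the entropy of the background click rate."""
    n = len(labels)
    log_loss = -sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(labels, scores)
    ) / n
    ctr = sum(1 for y in labels if y == 1) / n
    background = -(ctr * math.log(ctr) + (1.0 - ctr) * math.log(1.0 - ctr))
    return log_loss / background


labels = [1, -1, 1, -1]
scores = [0.9, 0.2, 0.6, 0.7]
print(auc(labels, scores))                 # 0.75
print(calibration(labels, scores))         # about 1.2: the model over-predicts
print(normalized_entropy(labels, scores))
print(normalized_entropy(labels, [0.5] * 4))  # a model that only knows the base rate
```

A model that always predicts the background click rate has NE equal to 1, so values below 1 mean the model removes some uncertainty, and 1 - NE is the relative information gain.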
Besides the metrics, we also need to simulate the online environment and inspect the model's output by hand. Are the positions recommended by the new model better than those from the current one? What features does the top-ranked resume have, what is the weight of each feature, which features put it in first place, and are those weights reasonable? How many training samples contain these features, and is the number so small that the weights are unreliable? We need a comprehensive and clear understanding of the model's results so that we can adjust training accordingly. We cannot blindly trust the results of model training.
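For a linear model such as LR, the questions above can be answered with a small inspection helper. This is a hypothetical sketch: the function name, the feature names, the weights, and the training counts are all made up, and a real debug tool would pull them from the trained model and the training set.

```python
def explain_score(feature_values, weights, training_counts, top_k=5):
    """List the features that contribute most to one item's linear score."""
    rows = []
    for name, value in feature_values.items():
        weight = weights.get(name, 0.0)
        rows.append({
            "feature": name,
            "value": value,
            "weight": weight,
            "contribution": weight * value,
            # How many training samples contained this feature.
            "training_samples": training_counts.get(name, 0),
        })
    rows.sort(key=lambda row: abs(row["contribution"]), reverse=True)
    return rows[:top_k]


# Made-up numbers for the top-ranked resume.
feature_values = {"title_match": 1.0, "years_experience": 5.0, "rare_skill": 1.0}
weights = {"title_match": 0.8, "years_experience": 0.1, "rare_skill": 2.5}
training_counts = {"title_match": 12000, "years_experience": 30000, "rare_skill": 3}

report = explain_score(feature_values, weights, training_counts)
for row in report:
    print(row)
```

In this made-up case the resume ranks first mainly because of `rare_skill`, whose large weight was learned from only 3 training samples, which is exactly the kind of unreliable weight the manual inspection is meant to catch.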
To sum up: in practice, implicit data is used more; data processing should pay special attention to cleaning anomalies; appropriate resampling (oversampling or downsampling) adjusts the ratio of positive to negative samples and helps the algorithm train a truly meaningful model; when a single model hits a performance bottleneck, fusing multiple models can help; multi-metric evaluation gives us more confidence in the model; and simulating the online situation offline is necessary.
These are some of the problems my teammates and I have encountered, together with the corresponding solutions. My knowledge is limited, so corrections and discussion are welcome. Contact:
Risingsun123@163.com
References
[1] 10 Lessons Learned from Building ML Systems, Netflix, Xavier Amatriain
[2] Lessons Learned from Building Real-Life Machine Learning Systems, Quora, Xavier Amatriain
[3] http://blog.csdn.net/zouxy09
[4] Practical Lessons from Predicting Clicks on Ads at Facebook, Facebook
Thank you for reading ~