http://www.zhihu.com/question/31989952
Discretization of continuous features: under what circumstances does discretizing continuous features give better results?
Q: In CTR prediction, LR is generally used, and the features used are discrete. Why must discrete features be used? What are the benefits of doing this?
A:
In industry, continuous values are rarely fed directly as features into a logistic regression model. Instead, continuous features are discretized into a series of 0/1 features and then given to the logistic regression model. This has the following advantages:
0. Discrete features are easy to add and remove, which makes rapid model iteration easy. (When discrete features are added or removed, the model structure does not need to change; only retraining is required. Compared with Bayesian inference methods or tree-model methods, iteration is faster.)
1. Inner product multiplication with sparse vectors is fast, the results are convenient to store, and the representation scales easily.
2. Discretized features are very robust to abnormal data. For example, take the feature "age > 30 is 1, otherwise 0". Without discretization, an abnormal data point such as "age = 300" greatly disturbs the model; after discretization, age 300 corresponds to at most one weight, and if the bucket "age-300" never appears in the training data, its weight in the LR model stays 0, so even if "age-300" shows up in the test data it has no impact on the prediction. More concretely: take a feature A. Used as a continuous feature, the LR model gives A a single weight w. Discretized, A expands into features A-1, A-2, A-3, ..., each with its own weight. If A-4 never appears in the training samples, the trained model has no weight for A-4, so if A-4 appears in a test sample it simply has no effect; it is ineffective but harmless. With a continuous feature, however, y = w*a, where a is the feature and w its weight in the LR model. If a is age with a normal range of [0, 100] and a test sample has a = 300, then a is clearly an outlier, but w*a still produces a value, and a very large one, so the outlier can have a very large effect on the final result. (A small sketch of this appears after this list.)
3. Logistic regression is a generalized linear model with limited expressive power. Discretizing a single variable into N variables, each with its own weight, is equivalent to introducing nonlinearity into the model, enhancing its expressive power and improving the fit. In the LR model, if feature a is used as a continuous feature with weight w_a, then y = w_a * a, and the derivative of y with respect to a is the constant w_a, i.e. a acts purely linearly. If instead a is discretized by interval into a_1, a_2, a_3, then y = w_1*a_1 + w_2*a_2 + w_3*a_3, so y as a function of a becomes a piecewise function, and the marginal effect of a on y changes from interval to interval; in other words, nonlinearity has been introduced.
4. After discretization, features can be crossed. If feature A is discretized into M values and feature B into N values, crossing them yields M*N variables, further introducing nonlinearity and enhancing expressive power (see the crossing sketch after this list).
5. After feature discretization, the model is more stable. For example, if user age is discretized with 20-30 as one interval, a user does not become a completely different person just because their age grows by one year. Of course, samples adjacent to an interval boundary behave in exactly the opposite way, so how to divide the intervals is a craft in itself: when discretizing by interval, choosing the intervals well is very important.
6. After feature discretization, the logistic regression model is simplified and the risk of overfitting is reduced. (With a continuous feature, one feature corresponds to one weight; if that weight is large, the model depends heavily on the feature, and a small change in the feature causes a big change in the final result, which is dangerous: on a new sample, over-sensitivity to this feature can easily produce a wrong classification, i.e. poor generalization and easy overfitting. With discrete features, one feature becomes many and one weight becomes many, so the influence the original continuous feature had on the model is spread out and weakened, reducing the risk of overfitting.)
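As a rough illustration of points 2 and 3, here is a minimal sketch in plain Python/NumPy (the bucket boundaries and weights below are made up for illustration, not learned from data) showing how bucketizing age into 0/1 features caps the influence of an outlier such as age = 300 and makes the score piecewise in age rather than linear:

    import numpy as np

    # Hypothetical age buckets: [0,20), [20,30), [30,50), [50,100).
    # A value outside every bucket (e.g. age = 300) activates no feature at all.
    BUCKETS = [(0, 20), (20, 30), (30, 50), (50, 100)]

    def discretize_age(age):
        # One-hot encode age into the buckets above; unseen/outlier values -> all zeros.
        return np.array([1.0 if lo <= age < hi else 0.0 for lo, hi in BUCKETS])

    # Made-up weights: one per bucket (what an LR model would learn) vs. one continuous weight.
    w_buckets = np.array([-0.5, 0.2, 0.4, 0.1])
    w_continuous = 0.05

    for age in [25, 45, 300]:
        score_discrete = float(w_buckets @ discretize_age(age))  # bounded, piecewise in age
        score_continuous = w_continuous * age                    # grows without bound
        print(age, round(score_discrete, 2), round(score_continuous, 2))

    # age = 300 contributes 0 to the discrete score (no bucket fires),
    # but 15.0 to the continuous score, dominating the prediction.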
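And a sketch of the feature crossing in point 4, again with hypothetical bucket names: discretizing age into M = 3 buckets and gender into N = 2 values yields M*N = 6 crossed indicator features, each of which would get its own weight in an LR model:

    from itertools import product

    age_buckets = ["age_0_20", "age_20_30", "age_30_50"]  # M = 3 (hypothetical buckets)
    genders = ["gender_m", "gender_f"]                     # N = 2

    # M*N crossed feature names; in an LR model each would get its own weight.
    crossed = [a + "&" + g for a, g in product(age_buckets, genders)]
    print(crossed)  # 6 crossed indicator features

    # For a single sample, exactly one crossed feature fires:
    def cross(active_age_bucket, active_gender):
        return active_age_bucket + "&" + active_gender

    print(cross("age_20_30", "gender_f"))  # 'age_20_30&gender_f'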
Li Yu once said: whether a model uses discrete features or continuous features is really a trade-off between "massive discrete features + simple model" and "a small number of continuous features + complex model". You can discretize and use a linear model, or keep continuous features and use deep learning. It comes down to whether you put the effort into the features or into the model. Generally speaking, the former is easier, can be worked on in parallel by N people, and has a track record of success; the latter looks very promising at the moment, but how far it can go remains to be seen.
http://www.zhihu.com/question/28641663/answer/41653367
In machine learning, what are the engineering methods for feature selection?
Feature selection is an important problem in feature engineering (another important one is feature extraction). As is often said: data and features determine the upper limit of machine learning, and models and algorithms only approximate that limit. Therefore feature engineering, and feature selection in particular, occupies a very important position in machine learning. How well machine learning works is determined jointly by the data and the model; for example, if the data itself is not separable, then SVM and other classification algorithms cannot separate it completely correctly. A dataset has intrinsic characteristics, and those characteristics determine the upper limit of what machine learning can achieve. So a machine learning algorithm may work well on dataset A but poorly on dataset B, which is normal, because the intrinsic characteristics of A and B differ. I once saw others use GBDT to extract features and improve their classification results, but when I used GBDT-extracted features the results did not improve, because my dataset's characteristics were different. The characteristics of the dataset determine the maximum effect the algorithm can achieve. (A sketch of this GBDT feature-extraction trick appears below.)
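For reference, here is a minimal sketch of the GBDT feature-extraction trick mentioned above, assuming scikit-learn and synthetic data: the leaf index each sample falls into in every tree is treated as a new categorical feature, one-hot encoded, and fed to a downstream LR model. Whether this helps depends, as noted, on the dataset.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Train a GBDT, then use the leaf index each sample falls into in every tree
    # as a new categorical feature.
    gbdt = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
    leaves_train = gbdt.apply(X_train)[:, :, 0]  # shape (n_samples, n_estimators)
    leaves_test = gbdt.apply(X_test)[:, :, 0]

    # One-hot encode the leaf indices and feed them to a downstream LR model.
    enc = OneHotEncoder(handle_unknown="ignore").fit(leaves_train)
    lr = LogisticRegression(max_iter=1000).fit(enc.transform(leaves_train), y_train)
    print("LR on GBDT leaf features, test accuracy:", lr.score(enc.transform(leaves_test), y_test))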
Typically, feature selection means choosing the feature set that gives the best performance for the given model and algorithm. Commonly used engineering methods include:
1. Compute the correlation between each feature and the response variable. Methods used in engineering include the Pearson coefficient and the mutual information coefficient. The Pearson coefficient can only measure linear correlation, while the mutual information coefficient can capture all kinds of dependence well but is relatively expensive to compute. Fortunately many toolkits contain such tools (such as sklearn's MINE), so features can be ranked by correlation and selected. (Intuitively this is related to the derivative of the output with respect to the input: if a feature affects the output to a large extent, that feature is more important.) A minimal sketch of this ranking appears after this list.
2. Build a model for each single feature, rank features by the model's accuracy, and select features accordingly. There is also a paper in JMLR '03 on a feature selection method based on decision trees, which is essentially equivalent. Once the target features are selected, they are used to train the final model.
3. Select features via an L1 regularization term. L1 regularization has the property of producing sparse solutions, so it naturally performs feature selection. Note, however, that a feature not selected by L1 is not necessarily unimportant: of two highly correlated features, only one may be retained. If you want to determine which features are actually important, you should cross-check with an L2 regularization method. (See the second sketch after this list.)
4. Train a pre-selected model that can score features: RandomForest and logistic regression can both score features; the final model is then trained on the features selected by those scores (also illustrated in the second sketch after this list).
5. Select features after feature combination: for example, combine user IDs with other user features to obtain a much larger feature set and then select from it. This practice is common in recommendation and advertising systems and is the main source of the so-called hundreds-of-millions or even billion-scale features: because data for a single user is relatively sparse, combined features let the model be both global and personalized. This topic deserves to be expanded on separately.
6. Feature selection through deep learning: this is becoming a common approach as deep learning gains popularity, especially in the field of computer vision, because deep learning can learn features automatically, which is also why deep learning is called unsupervised feature learning. After selecting the features of a particular neural layer from a deep learning model, they can be used to train the final target model.
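A minimal sketch of the correlation ranking in item 1, using NumPy for the Pearson coefficient and scikit-learn's mutual_info_classif as one available mutual-information estimator, on synthetic data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import mutual_info_classif

    X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)

    # Rank features by absolute Pearson correlation with the response...
    pearson = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    # ...and by mutual information, which also captures nonlinear dependence.
    mi = mutual_info_classif(X, y, random_state=0)

    print("top 3 by Pearson:     ", np.argsort(pearson)[::-1][:3])
    print("top 3 by mutual info: ", np.argsort(mi)[::-1][:3])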
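And a sketch of items 3 and 4: L1-regularized logistic regression keeps a sparse set of coefficients (coefficients driven to zero correspond to dropped features), and a RandomForest scores features via impurity-based importances. Hyperparameters here are placeholders, not recommendations.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)

    # L1-regularized LR: coefficients driven to zero correspond to dropped features.
    l1_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    print("features kept by L1:", np.flatnonzero(l1_lr.coef_[0]))

    # RandomForest feature scoring: rank features by impurity-based importance.
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    print("features ranked by RF importance:", np.argsort(rf.feature_importances_)[::-1])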
http://www.zhihu.com/question/34271604
Why are feature combinations used in the ad LR model?
In industry, the LR model is very popular, mainly because LR is a log-linear model: it is simple to implement, easy to parallelize, convenient to scale up, fast to iterate on, the learned feature weights are quite interpretable, and the predicted output lies between 0 and 1, fitting a probability model. (As an example of interpretability: if the weight of the crossed feature A-B is relatively large, where A is a user and B is an item, you can conclude that A is interested in B.) However, a linear model cannot accurately characterize nonlinear relationships; feature combination adds nonlinear expressiveness and enhances the model's expressive power. In addition, in ad LR, the basic features can be viewed as modeling global behavior, while the finer-grained combined features model personalization. In large-scale discrete LR, modeling only globally would ignore individual users, while building a separate model per user would both fit poorly (the data for each user is too sparse) and cause the number of models to explode; basic features plus combined features give both the global and the personalized view. For example, suppose the feature vector covers users a, b, c and items e, f, g. The basic features a, b, c, e, f, g each get a weight, corresponding to a bias for each user and item. If a prefers e and b prefers f, then the combined features a-e and b-f are the users' individual modeling: the weight of a-e reflects a's preference for e, and the weight of b-f reflects b's preference for f. (A minimal sketch of such crosses follows.)
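A minimal sketch of basic plus crossed features for an ad/recommendation LR model, using hypothetical user and item IDs and a simple dict-based feature indexer (in practice, feature hashing is common at billion-feature scale):

    from itertools import count

    index = {}           # feature name -> column index in the weight vector
    next_id = count()

    def feat(name):
        # Assign (or look up) a column index for a named 0/1 feature.
        if name not in index:
            index[name] = next(next_id)
        return index[name]

    def featurize(user, item):
        # Basic features (global) plus a crossed feature (personalized) for one impression.
        return [feat("user=" + user),                     # user bias
                feat("item=" + item),                     # item bias
                feat("user=" + user + "&item=" + item)]   # user-item cross, its own weight

    print(featurize("a", "e"))   # e.g. [0, 1, 2]
    print(featurize("b", "f"))   # new indices for b, f, and the b-f cross
    print(len(index), "distinct features so far")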
Discretization of continuous features for better results, and engineering methods for feature selection