Feature discretization and feature selection in machine learning

Source: Internet
Author: User

Discretization of continuous features: under what circumstances can discretizing a continuous feature achieve better results?

Q: In CTR estimation, logistic regression (LR) is generally used, and the features fed to it are discrete. Why must discrete features be used? What are the advantages of doing so?

A:

In industry, continuous values are rarely fed directly into a logistic regression model as features. Instead, each continuous feature is discretized into a series of 0/1 features before being given to the model. This has the following advantages:

0. Adding and removing discrete features is easy, which allows fast model iteration. (When discrete features are added or removed, the model structure needs no adjustment, only retraining; iteration is faster than with Bayesian inference or tree-based methods.)

1. Inner products of sparse vectors are fast to compute, the results are compact to store, and the representation is easy to scale.

2. Discretized features are robust to abnormal data. For example, define a feature that is 1 if age > 30 and 0 otherwise. Without discretization, an abnormal record such as "age = 300" would disturb the model greatly: as a continuous feature, a gets a single weight w in the LR model, and the logit contains the term w*a. If a is age with a normal range of [0, 100] and a test sample has a = 300, then a is clearly an anomaly, yet w*a is still computed and is very large, so the outlier heavily distorts the final result. With discretization, a is expanded by interval into features a_1, a_2, a_3, ..., each with its own weight. If a bucket such as a_4 never appears in the training data, the trained model has no weight for it (effectively 0), so a test sample falling into a_4 contributes nothing to the prediction: the outlier is rendered harmless. A minimal sketch follows.
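Here, the bucket boundaries and weights are made up purely for illustration (none of these numbers come from the original answer): the outlier falls outside every trained bucket and contributes zero to the logit, while the raw-valued feature lets it dominate.

    import numpy as np

    bins = [0, 18, 30, 45, 60, 100]                  # illustrative bucket edges for "age"
    w_buckets = np.array([0.1, 0.4, 0.3, 0.2, 0.1])  # one learned weight per bucket
    w_raw = 0.05                                     # weight if age were a raw feature

    def logit_discretized(age):
        """Contribution of the one-hot age feature to the LR logit."""
        idx = np.searchsorted(bins, age, side="right") - 1
        if idx < 0 or idx >= len(w_buckets):         # value outside all trained buckets
            return 0.0
        return w_buckets[idx]

    def logit_raw(age):
        """Contribution of the raw age feature to the LR logit."""
        return w_raw * age

    print(logit_discretized(25), logit_raw(25))      # 0.4   1.25
    print(logit_discretized(300), logit_raw(300))    # 0.0  15.0  <- outlier dominates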

3. Logistic regression is a generalized linear model, so its expressive power is limited. When a single variable is discretized into N indicator variables, each gets its own weight, which is equivalent to introducing nonlinearity into the model, enhancing its expressive power and improving the fit. In the LR model, feature a as a continuous feature has weight w_a, and the logit y = w_a * a is linear in a: the derivative of y with respect to a is the constant w_a. If a is discretized by interval into indicators a_1, a_2, a_3, then y = w_1*a_1 + w_2*a_2 + w_3*a_3, which as a function of the raw value a is a step (piecewise-constant) function; the effect of a on y now changes from interval to interval, so nonlinearity has been introduced (a sketch follows).
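To make this concrete, a sketch on synthetic data (the target, bin count, and seed are arbitrary choices): plain LR on the raw value cannot fit a non-monotone target, while the same LR on one-hot bins from sklearn's KBinsDiscretizer fits it almost perfectly.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import KBinsDiscretizer

    rng = np.random.default_rng(0)
    a = rng.uniform(0, 100, size=2000).reshape(-1, 1)
    # The label is positive only in the middle range, i.e., non-monotone in a.
    y = ((a[:, 0] > 30) & (a[:, 0] < 60)).astype(int)

    raw_lr = LogisticRegression().fit(a, y)

    disc = KBinsDiscretizer(n_bins=10, encode="onehot", strategy="uniform")
    a_bins = disc.fit_transform(a)
    binned_lr = LogisticRegression().fit(a_bins, y)

    print("raw accuracy:   ", raw_lr.score(a, y))          # near the majority baseline (~0.7)
    print("binned accuracy:", binned_lr.score(a_bins, y))  # close to 1.0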

4. After discretization, feature crosses become possible: if feature a is discretized into m values and feature b into n values, crossing them yields m*n variables, further introducing nonlinearity and enhancing expressive power (a tiny sketch follows).
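For instance, a sketch of crossing two discretized features (the bucket names are illustrative):

    from itertools import product

    age_buckets = ["18-24", "25-34", "35-44"]   # m = 3 values
    genders = ["male", "female"]                # n = 2 values
    cross = [f"age={a}&gender={g}" for a, g in product(age_buckets, genders)]
    print(len(cross))   # 6 = m * n crossed indicator features
    print(cross[0])     # age=18-24&gender=male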

5. After discretization the model is more stable. For example, if user age is discretized with 20-30 as one interval, a user does not become a completely different person just by turning one year older. Of course, samples adjacent to an interval boundary get exactly the opposite treatment, so how to split the intervals is an art in itself; choosing the discretization boundaries well is very important.

6. After feature discretization, the model is simplified and the risk of overfitting is reduced. (With a continuous feature, one feature corresponds to one weight; if that weight is large, the model depends heavily on the feature, and a small change in it can cause a large change in the output. Such a model is dangerous: on new samples it is likely to be overly sensitive to that feature and produce wrong classifications, i.e., it generalizes poorly and overfits easily. After discretization, one feature becomes many and one weight becomes many, so the influence of the original continuous feature is spread out and weakened, reducing the risk of overfitting.)

Li Yu once said: whether a model uses discrete or continuous features is really a trade-off between "a mass of discrete features + a simple model" and "a small number of continuous features + a complex model". The former discretizes features for a linear model; the latter feeds continuous features into deep learning. It comes down to putting the effort into the features or into the model. Generally speaking, the former is easier, can be worked on by many people in parallel, and has a proven track record; the latter looks very promising at the moment, but how far it can go remains to be seen.

http://www.zhihu.com/question/28641663/answer/41653367

In machine learning, what are the commonly used engineering methods for feature selection?

Feature selection is an important problem in feature engineering (another important one is feature extraction). It is often said that data and features determine the upper limit of machine learning, while models and algorithms only approach that limit; feature engineering, and feature selection in particular, therefore plays a very important role. Doing machine learning well depends on the data and the model together: for example, if the data itself is not separable, then SVM and other classification algorithms cannot separate it completely correctly. The intrinsic characteristics of a dataset determine the upper limit of what machine learning can achieve on it, which is why an algorithm may work well on dataset A but poorly on dataset B: the two datasets' intrinsic characteristics differ. For example, I once saw others use GBDT to extract features and improve their classification results, but when I used GBDT features myself the results did not improve, because the datasets' characteristics were not the same. The characteristics of the dataset thus bound the effect any algorithm can achieve.


Generally speaking, feature selection means choosing the feature set on which the target model and algorithm perform best. The commonly used engineering methods are as follows:
1. Compute the correlation between each feature and the response variable: the commonly used measures are the Pearson coefficient and the mutual information coefficient. The Pearson coefficient only measures linear correlation, while mutual information captures correlations of all kinds but is more expensive to compute; fortunately, many toolkits include it (e.g., MINE). After computing the correlations, rank them and select features (first sketch after this list). (Intuitively, this relates to the derivative of the output with respect to each input: a feature that affects the output to a large extent is more important.)
2. Build a model on each single feature and rank features by each model's accuracy, then select; as I recall, a JMLR '03 paper on a decision-tree-based feature selection method is essentially equivalent to this. After the target features are selected, use them to train the final model (second sketch after the list).
3. Select features through the L1 regularization term: L1 regularization yields sparse solutions, so it naturally performs feature selection. Note, however, that a feature not selected by L1 is not necessarily unimportant, since of two highly correlated features only one may be retained; to decide whether such a feature matters, cross-check with L2 regularization (third sketch after the list).
4. Train a pre-selection model that can score features: random forest and logistic regression can both score features; select features by their scores and then train the final model (fourth sketch after the list).
5. Select features after feature combination: for example, combine the user ID with user features to obtain a larger feature set and then select from it. This approach is common in recommendation and advertising systems; it is the main source of the so-called billion-scale (or even larger) feature sets, because user data is relatively sparse, and combination features capture both the global model and the personalized model. There will be another opportunity to discuss this topic.
6. Feature selection through deep learning: this approach has become popular along with deep learning itself, especially in computer vision, because deep networks can learn features automatically; this is also why deep learning is called unsupervised feature learning. After taking the features of some hidden layer from a deep model, they can be used to train the final target model (last sketch after the list).
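First, a sketch of method 1 on synthetic data (the data is made up for illustration): scipy's pearsonr and sklearn's mutual_info_regression rank the features; the nonlinear feature is nearly invisible to the Pearson coefficient but is caught by mutual information.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.feature_selection import mutual_info_regression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    # Feature 0 acts linearly, feature 1 nonlinearly, feature 2 is pure noise.
    y = X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=500)

    pearson = [pearsonr(X[:, j], y)[0] for j in range(X.shape[1])]
    mi = mutual_info_regression(X, y, random_state=0)
    print("Pearson:", np.round(pearson, 2))  # large for feature 0 only
    print("MI:     ", np.round(mi, 2))       # large for features 0 and 1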
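Next, a sketch of method 2, again on synthetic data: fit one single-feature model per feature and rank the features by cross-validated accuracy.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                               random_state=0)
    scores = [cross_val_score(LogisticRegression(), X[:, [j]], y, cv=5).mean()
              for j in range(X.shape[1])]
    print("features ranked by single-feature accuracy:", np.argsort(scores)[::-1])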
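A sketch of method 3: L1-regularized logistic regression zeroes out some coefficients, and the surviving non-zero coefficients are the selected features (the strength C = 0.1 is an arbitrary illustrative choice).

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                               n_redundant=2, random_state=0)
    l1_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    selected = np.flatnonzero(l1_lr.coef_[0])
    # Of two highly correlated features, L1 may keep only one (see the caveat above).
    print("features kept by L1:", selected)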
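A sketch of method 4, using random forest importances as the feature scores:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                               random_state=0)
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    top = np.argsort(rf.feature_importances_)[::-1][:5]
    print("top 5 features by importance:", top)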
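Finally, a rough sketch of method 6 under a strong simplification: a tiny sklearn MLP stands in for a deep model, and its hidden-layer activations, recomputed by a manual forward pass through the learned coefs_ and intercepts_, serve as learned features for the final model.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                        max_iter=500, random_state=0).fit(X, y)
    # Forward pass to the hidden layer: relu(X W + b) is the learned feature set.
    hidden = np.maximum(0, X @ mlp.coefs_[0] + mlp.intercepts_[0])
    final = LogisticRegression().fit(hidden, y)
    print("accuracy on learned features:", final.score(hidden, y))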

http://www.zhihu.com/question/34271604

In an ad LR model, why are feature combinations needed?

In industry the LR model is very popular, mainly because LR is a log-linear model: it is simple to implement, easy to parallelize, convenient to scale up, fast to iterate, its features are easy to interpret, and its predicted output lies between 0 and 1, fitting a probability model. (As an example of interpretability: if the weight of the feature A-B is relatively large, where A denotes a user and B an item, one can conclude that A is quite interested in B.) However, a linear model cannot accurately describe nonlinear relationships; adding combination features introduces nonlinear expressiveness and strengthens the model. Furthermore, in ad LR the basic features can be seen as modeling globally, while combination features are finer-grained and model individual users: in large-scale discrete LR, modeling only globally is biased for some users, while building one model per user leaves too little data per model and an exploding number of models. Basic features + combination features therefore cover both the global and the personalized aspects. For example, suppose the feature vector contains users A, B, C and items E, F, G. The basic features a, b, c, e, f, g each carry a weight, corresponding to per-user and per-item bias weights. But if A prefers E and B prefers F, then the combination features A-E and B-F model those individual preferences: the weight of A-E represents how much A likes E, and the weight of B-F how much B likes F (a small sketch follows).
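A minimal sketch of "basic + combination" features for an ad LR model, using the common hashing trick to index a sparse weight vector. The feature names and dimension are illustrative, and a production system would use a stable hash such as MurmurHash rather than Python's process-salted hash().

    def feature_indices(user, item, dim=2**20):
        base = [f"user={user}", f"item={item}"]   # global per-user / per-item features
        cross = [f"user={user}&item={item}"]      # personalized combination feature
        # hash() is salted per Python process; shown here only for brevity.
        return [hash(f) % dim for f in base + cross]

    # The LR logit for this impression is the sum of the weights at these indices.
    print(feature_indices("A", "E"))  # three indices: user A, item E, cross A-E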
