Feature Engineering
Before introducing feature engineering, let's look at two figures.
Figure 1: the basic data mining workflow.
Figure 2: a common procedure for feature engineering.
Feature engineering (Feature Engineering) is the most time-consuming and important step in developing a data mining model. Below is a brief introduction to some of the methods the author has summarized from model development work.
The features, which we often call variables or independent variables, generally fall into three categories: continuous, unordered categorical (discrete), and ordered categorical (discrete).
Feature engineering (Feature Engineering) consists of two parts: feature processing (Feature Processing) and feature selection (Feature Selection).
Feature processing (Feature Processing)
Continuous features
No special processing
Apart from normalization (centering to mean zero and scaling to unit variance), continuous features usually need no special handling; you can feed them into the model directly. Whether to normalize depends on the type of model.
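As a minimal sketch, this is how such standardization might look with scikit-learn's StandardScaler; the feature values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy continuous features (age, income); the values are invented.
X = np.array([[25, 3000.0],
              [32, 5200.0],
              [47, 8100.0],
              [51, 6400.0]])

# Center each column to mean 0 and scale it to unit variance,
# then feed the result straight into the model.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0))  # approximately 0 per column
print(X_scaled.std(axis=0))   # approximately 1 per column
```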
Discretization
When developing certain special models (e.g., credit scorecards), we sometimes need to discretize continuous features such as age and income.
The key to discretization is how to choose the split points within a range. Several commonly used discretization methods are described below:
Equal-distance discretization (equal-width grouping)
As the name suggests, the split points are chosen at equally spaced values.
Equal-frequency discretization (equal-depth grouping)
The split points are chosen so that each segment contains approximately the same number of sample points.
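A small sketch of both groupings with pandas: pd.cut picks equally spaced split points, while pd.qcut equalizes the number of points per bin. The age values are invented for illustration.

```python
import pandas as pd

# Hypothetical ages; the column content is made up for illustration.
age = pd.Series([22, 25, 31, 38, 42, 47, 55, 63, 70, 80])

# Equal-distance (equal-width) grouping: split points are equally spaced.
equal_width = pd.cut(age, bins=4)

# Equal-depth (equal-frequency) grouping: each bin holds roughly
# the same number of sample points.
equal_depth = pd.qcut(age, q=4)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```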
Discretization by decision tree (optimal grouping)
Decision tree discretization also works on one continuous feature at a time. The principle is as follows:
Train a decision tree model on this single feature and the target value y, then take the feature's split points in the fitted model as the discretization points.
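One possible sketch of this idea with scikit-learn: fit a shallow decision tree on the single feature and read the split thresholds out of the fitted tree. The data and the depth limit are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical single continuous feature and binary target.
X = np.array([[21], [25], [30], [34], [41], [45], [52], [60], [66], [73]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1])

# Train a shallow tree on this one feature and the target value y.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Internal nodes carry the split thresholds; leaf nodes are marked
# with feature index -2 and are skipped.
thresholds = sorted(t for t, f in zip(tree.tree_.threshold, tree.tree_.feature)
                    if f != -2)
print(thresholds)  # use these values as the discretization points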
Other discretization methods
Besides the decision tree method, optimal grouping can also be done with chi-square binning, which is quite common in scorecard development.
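For illustration, here is one rough sketch of chi-square binning (in the ChiMerge spirit): start from fine equal-frequency bins and repeatedly merge the adjacent pair whose class distributions differ least. The data, the initial bin count, and the stopping rule are all assumptions, not a fixed recipe.

```python
import numpy as np
import pandas as pd

def chi2_pair(a, b):
    """Chi-square statistic for the 2 x n_classes table of two adjacent bins."""
    table = np.array([a, b], dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    mask = expected > 0
    return (((table - expected) ** 2)[mask] / expected[mask]).sum()

# Hypothetical data: a continuous feature and a binary target.
rng = np.random.default_rng(0)
x = rng.normal(40, 12, 500)
y = (x + rng.normal(0, 10, 500) > 45).astype(int)

# Start from fine equal-frequency bins.
binned = pd.qcut(x, q=20, duplicates="drop")
crosstab = pd.crosstab(binned, y)
intervals = list(crosstab.index)
counts = [row.astype(float) for row in crosstab.values]

# Repeatedly merge the adjacent pair of bins with the smallest
# chi-square statistic until the desired number of bins remains.
target_bins = 5
while len(counts) > target_bins:
    stats = [chi2_pair(counts[i], counts[i + 1]) for i in range(len(counts) - 1)]
    i = int(np.argmin(stats))
    counts[i] = counts[i] + counts[i + 1]
    intervals[i] = pd.Interval(intervals[i].left, intervals[i + 1].right)
    del counts[i + 1], intervals[i + 1]

print(intervals)  # the merged bin boundaries
```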
Function transformation
Sometimes our model assumes that it behaves better when the independent or dependent variable follows a particular distribution (e.g., the normal distribution). In that case we need to apply a nonlinear function transformation to the feature or the dependent variable. The method is simple to apply, but remember to normalize the newly added features. When a feature is transformed, both the transformed feature and the original feature should be included in training.
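A minimal sketch of such a transformation, assuming a log transform is used to pull a right-skewed feature toward the normal distribution; the column name and values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature (e.g., income); the values are invented.
df = pd.DataFrame({"income": [2200.0, 3100.0, 4800.0, 9500.0, 32000.0]})

# Add the transformed feature alongside the original one, so the model
# is trained with both; log1p pulls the skewed distribution toward normal.
df["income_log"] = np.log1p(df["income"])

# Remember to normalize the newly added feature as well.
df["income_log"] = (df["income_log"] - df["income_log"].mean()) / df["income_log"].std()
print(df)
```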
Numerical encoding of unordered categorical (discrete) features
One-hot encoding
One-hot encoding converts a variable with K values into K dummy variables. Advantages: simple, and guaranteed to be free of collinearity. Disadvantage: too sparse (a sparse matrix).
The common way to avoid a sparse matrix is to reduce the dimensionality: when a categorical variable takes many values, merge them down as far as reasonably possible; drop what can be dropped, but do not force it.
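A small sketch with pandas, assuming rare values are first merged into an "other" bucket to keep the dummy matrix from getting too sparse; the column name, the values, and the rarity threshold are invented.

```python
import pandas as pd

# Hypothetical unordered categorical feature; the values are illustrative.
df = pd.DataFrame({"city": ["beijing", "shanghai", "shenzhen", "beijing", "hangzhou"]})

# To limit sparsity, merge rare values into an "other" bucket first.
counts = df["city"].value_counts()
rare = counts[counts < 2].index
df["city"] = df["city"].replace(dict.fromkeys(rare, "other"))

# One-hot encoding: a variable with K values becomes K dummy columns.
dummies = pd.get_dummies(df["city"], prefix="city")
print(dummies)
```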
Not all unordered variables need numerical encoding; tree models such as decision trees and random forests may not need it, depending on the situation.
Numerical encoding of ordered categorical (discrete) features
Label encoding
Label encoding takes the K values of a variable and converts them, in order, into the numbers 1, 2, 3, ..., K. For example, a person's status has three values: bad, normal, and good, with bad < normal < good. Then bad, normal, and good can be converted to 1, 2, and 3 respectively.
This method has a significant limitation: it is not suitable for models that predict specific numerical values, such as linear regression; it is only suitable for classification. Even for classification there are models it does not fit well, and the result may be less accurate than one-hot encoding.
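A minimal sketch, assuming a plain dictionary mapping is used to encode the levels; the values follow the bad < normal < good example above.

```python
import pandas as pd

# Hypothetical ordered categorical feature with bad < normal < good.
status = pd.Series(["bad", "good", "normal", "bad"])

# Map each level to a number that respects the order.
order = {"bad": 1, "normal": 2, "good": 3}
status_encoded = status.map(order)
print(status_encoded.tolist())  # [1, 3, 2, 1]
```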
Generating dummy variables
As with one-hot encoding, the K values of a variable generate K dummy variables, but note that the ordering relationship between the values must not be lost. The general representation is as follows:
| Status value | Vector representation |
| --- | --- |
| bad | (1, 0, 0) |
| normal | (1, 1, 0) |
| good | (1, 1, 1) |
This representation cleverly expresses the ordering relationship between the values.
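A small sketch of this encoding in pandas; the cumulative comparison below is one way to reproduce the table above, and the column names are invented.

```python
import pandas as pd

# Hypothetical ordered feature; the dummies must preserve bad < normal < good.
status = pd.Series(["bad", "good", "normal"])
levels = ["bad", "normal", "good"]

# Each value becomes K dummies: position j is 1 if the value's rank
# reaches level j, reproducing the (1,0,0) / (1,1,0) / (1,1,1) table above.
rank = status.map({lvl: i for i, lvl in enumerate(levels)})
dummies = pd.DataFrame({f"ge_{lvl}": (rank >= i).astype(int)
                        for i, lvl in enumerate(levels)})
print(dummies)
```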
Feature selection (Feature Selection)
The purpose of feature selection
Feature selection generally has two goals: to remove independent variables with little or no correlation, so that the model's prediction or classification performance reaches the desired level, and to improve the interpretability of the model.
Selection by feature type
Here we only introduce the filter method.
Continuous features: Pearson correlation coefficient.
Ordered features: convert to numbers and use the Spearman correlation coefficient, or use the methods for unordered features.
Unordered features: one-way analysis of variance (for regression), the chi-square test, or the IV value (for binary classification).
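A rough sketch of some of these filter statistics using scipy; the data and the contingency-table counts are invented for illustration, and the IV calculation is omitted here.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, chi2_contingency

rng = np.random.default_rng(0)

# Continuous feature vs. continuous target: Pearson correlation.
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
print(pearsonr(x, y))

# Ordered feature (already converted to numbers): Spearman correlation.
print(spearmanr(x, y))

# Unordered feature vs. class label: chi-square test on the contingency table.
table = np.array([[30, 10], [15, 45]])  # hypothetical counts
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)
```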
These selection methods come from experience and are not comprehensive; they will be supplemented as other methods come up. Readers are also welcome to leave a message with additions.
Model evaluation (Evaluation)
Model evaluation generally has two purposes:
1. To check the feature engineering work: whether the selected features help improve the model's performance.
2. To check the parameter tuning work: by adjusting the model's parameters, find the best parameters so that the model's classification or prediction performance is at its best.
Regression prediction problems
For regression problems with a continuous target variable, the usual way to evaluate the model is the R^2 value: the larger R^2 is, the better the model's predictions.
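A minimal sketch with scikit-learn, assuming a linear regression on invented data; r2_score reports the R^2 value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=100)

model = LinearRegression().fit(X, y)
print(r2_score(y, model.predict(X)))  # closer to 1 means a better fit
```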
Classification prediction problems (follow-up)
For classification problems with a discrete target variable, the common ways to evaluate the model are as follows:
Cross-validation: the higher the model's cross-validated prediction accuracy, the better. But pay attention to overfitting when using decision trees or random forests.
ROC curve and AUC value: observe the model's ROC curve and AUC; the larger the AUC, the better.
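A small sketch of both checks with scikit-learn, using an invented dataset: cross-validated accuracy first, then the AUC on a held-out split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical classification data.
X, y = make_classification(n_samples=500, random_state=0)

# Cross-validated accuracy: out-of-fold scores help spot overfitting.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean())

# AUC on a held-out set: the larger, the better the ranking quality.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf.fit(X_tr, y_tr)
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```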
Concluding remarks (follow-up additions)
Feature engineering is extremely tedious and extremely important work. When feature engineering is done well, not only is later model tuning much easier (sometimes no tuning is needed at all), but the model is also more stable and easier to interpret. Remember the first figure at the beginning of the article? If feature engineering is not done well, no amount of parameter tuning during model evaluation will reach the desired effect, and you may have to loop through the dotted box until your model achieves it. Feature engineering is very time-consuming and energy-consuming; it is often said on the web that feature engineering takes 80% of the time spent in model development. In my opinion, it accounts for at least 95%.
Therefore, feature engineering needs to be done patiently and logically, with the goal of improving the model's performance and interpretability.