Feature engineering is among the most time- and effort-consuming parts of data analysis. It is not a single well-defined step like choosing an algorithm or a model; it also involves engineering experience and trade-offs, so there is no single universal method. This article summarizes some common approaches, focusing on feature selection. The next two articles will cover feature expression and feature preprocessing.
1. Sources of Features
During data analysis, features generally come from two sources. One is the set of feature data that the business side has already compiled, from which we need to pick out the features suitable for our problem; the other is advanced features that we construct from those business features. We discuss the two cases separately.
2. Selecting Appropriate Features
First, let's look at how to pick the features we need when the business side has already compiled many kinds of feature data. At this point there may be hundreds or even thousands of candidate features. Which ones do we actually need?
The first step is to find domain experts who understand the business and ask them for suggestions. For example, if we need to solve a classification problem about drug efficacy, we should first find experts in the field and ask them which factors (features) affect the efficacy of the drug, including both those with a large impact and those with a small impact. These features form our first candidate feature set.
This candidate set can still be very large. Before we try dimensionality reduction, it is worth using feature selection methods to pick out the more important features. These methods do not rely on domain knowledge; they are purely statistical.
The simplest method is variance filtering. A feature with a larger variance can be considered more useful; if the variance is small, say less than 1, the feature may contribute little to our algorithm. In the most extreme case, a feature with variance 0 takes the same value on every sample, has no effect on model training, and can be discarded directly. In practice we specify a variance threshold and filter out the features whose variance falls below it. The VarianceThreshold class in sklearn makes this easy.
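As a minimal sketch of how this might look (the toy matrix and the threshold of 0.05 are made up purely for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy feature matrix: the first column is constant (variance 0),
# the second has very small variance, the third varies a lot.
X = np.array([[1.0, 0.01, 10.0],
              [1.0, 0.02, 55.0],
              [1.0, 0.01, 2.0],
              [1.0, 0.03, 90.0]])

# Drop every feature whose variance is below the chosen threshold.
selector = VarianceThreshold(threshold=0.05)
X_reduced = selector.fit_transform(X)

print(selector.variances_)      # variance of each original feature
print(selector.get_support())   # boolean mask of the kept features
print(X_reduced.shape)          # only the high-variance column remains
```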
There are many feature selection methods, and they generally fall into three types. The first type, the filter method, is relatively simple: it scores each feature according to its divergence or its correlation with the output, then selects features by setting a score threshold or the number of features to keep. The variance filtering described above is a filter method. The second type is the wrapper method: based on an objective function, usually the prediction score of a model, it repeatedly selects some features or excludes some features. The third type, the embedded method, is a bit more involved: it first trains a machine learning model to obtain a weight for each feature, then selects features based on those weights. It resembles the filter method, except that the quality of a feature is judged through model training rather than directly from statistical indicators of the feature. Let's look at the three types in turn.
2.1 Filter Feature Selection
We have already seen how to filter features by their variance. Besides variance, there are other statistical indicators we can use.
The second is the correlation coefficient. This is mainly used for supervised learning problems with a continuous output. We compute the correlation coefficient between each feature and the output value on the training set, set a threshold, and keep the features whose correlation coefficient is large.
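A small sketch of correlation-based filtering on synthetic data (the column names, the generated data, and the 0.3 threshold are all illustrative choices, not from the original text):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.randn(200, 4), columns=["f0", "f1", "f2", "f3"])
# Continuous output that really depends only on f0 and f2.
y = 3 * X["f0"] - 2 * X["f2"] + 0.5 * rng.randn(200)

# Absolute Pearson correlation of each feature with the output value.
corr = X.apply(lambda col: col.corr(y)).abs()
print(corr)

threshold = 0.3
selected = corr[corr > threshold].index.tolist()
print("selected features:", selected)
```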
The third is hypothesis testing, such as the chi-square test. The chi-square test can measure the correlation between a feature's distribution and the output's distribution, and I personally find it better than the rough variance method. If you are not familiar with the chi-square test, refer to the article on chi-square test principles and applications. In sklearn, the chi2 function gives the chi-square value of each feature together with the corresponding p-value. We can set a chi-square threshold and keep the features with large chi-square values.
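A sketch with sklearn's chi2 scorer wrapped in SelectKBest (the Iris dataset and k=2 are just convenient illustrative choices; note that chi2 requires non-negative feature values):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Score every feature with the chi-square statistic and keep the best two.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)    # chi-square value of each feature
print(selector.pvalues_)   # corresponding p-values
print(X_new.shape)         # (150, 2)
```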
Besides the chi-square test, we can also use the F-test and the t-test. They are hypothesis tests as well, but the test statistic follows the F distribution or the t distribution instead of the chi-square distribution. In sklearn, the F-test functions f_classif and f_regression are used for classification and regression feature selection respectively.
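For the F-test side, a sketch using f_regression on synthetic regression data (the dataset sizes and k are arbitrary illustrative values):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data in which only 3 of the 10 features are informative.
X, y = make_regression(n_samples=300, n_features=10,
                       n_informative=3, random_state=42)

# f_regression scores each feature with a univariate F-test against y.
selector = SelectKBest(score_func=f_regression, k=3)
X_new = selector.fit_transform(X, y)

print(selector.scores_.round(1))
print(X_new.shape)   # (300, 3)
```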
The fourth is mutual information, which scores the relationship between each feature and the output from the information entropy perspective. We discussed mutual information (information gain) in the decision tree articles. The larger the mutual information, the stronger the correlation between the feature and the output. In sklearn, mutual_info_classif (classification) and mutual_info_regression (regression) compute the mutual information between the input features and the output.
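A sketch of mutual-information scoring for classification (again using Iris purely as a convenient example):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Estimated mutual information between each feature and the class label;
# larger values indicate a stronger dependence on the output.
mi = mutual_info_classif(X, y, random_state=0)
print(mi)

# Or keep the two features with the highest mutual information directly.
X_new = SelectKBest(score_func=mutual_info_classif, k=2).fit_transform(X, y)
print(X_new.shape)   # (150, 2)
```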
These are the main filter methods. In my experience, when you have no better idea, the chi-square test and mutual information are good defaults for a first round of feature selection.
2.2 Wrapper Feature Selection
The wrapper method is not as direct as the filter method. It chooses an objective function, usually the prediction score of a model, and uses it to screen features step by step.
The most common wrapper method is recursive feature elimination (RFE). RFE trains a machine learning model over multiple rounds; after each round, the features corresponding to several of the smallest weights are eliminated, and the next round is trained on the reduced feature set. In sklearn, the RFE class can be used for this.
Let's look at this idea through the classic SVM-RFE algorithm, which uses an SVM as the machine learning model inside RFE. In the first round, it trains on all n features and obtains the separating hyperplane $w \cdot x + b = 0$. RFE-SVM then finds the component $w_i$ of $w$ with the smallest $w_i^2$ and removes the corresponding feature. In the second round, we train the SVM again on the remaining n-1 features and the output values, and again remove the feature whose $w_i^2$ is smallest. This continues until the number of remaining features meets our requirement.
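A sketch of this procedure with sklearn's RFE class and a linear SVM as the base model (the breast-cancer dataset, the scaling step, and the choice of 10 remaining features are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scale so the SVM converges cleanly

# Linear SVM as the base model; RFE removes one feature per round
# (the one with the smallest squared coefficient) until 10 remain.
svc = LinearSVC(C=1.0, dual=False, max_iter=5000)
rfe = RFE(estimator=svc, n_features_to_select=10, step=1)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the 10 selected features
print(rfe.ranking_)   # rank 1 = kept; larger ranks were eliminated earlier
```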
2.3 Embedded Feature Selection
The embedded method also uses machine learning to select features. Unlike RFE, however, it does not repeatedly remove features and retrain; it trains once on the full feature set, obtains a weight for each feature, and selects from those weights. In sklearn, the SelectFromModel class is used for this.
The most common embedded selection uses L1 and L2 regularization. In the earlier article on learning Ridge regression with scikit-learn and pandas, we saw that the larger the regularization penalty, the smaller the model coefficients become. When the penalty grows large enough, some feature coefficients become 0; as it keeps growing, eventually all coefficients tend to 0. The key observation is that some coefficients are driven to 0 much earlier than others, and those features can be filtered out: in other words, we keep the features with large coefficients. The usual base learner for selecting features with L1 and L2 regularization is logistic regression.
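A sketch of L1-based embedded selection with SelectFromModel (the dataset and C=0.1 are illustrative; a smaller C means a stronger penalty and fewer surviving features):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# The L1 penalty drives the coefficients of weak features to exactly zero,
# and SelectFromModel keeps the features whose coefficient survives.
l1_logreg = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_logreg)
X_new = selector.fit_transform(X, y)

print(selector.get_support())   # mask of features with non-zero coefficients
print(X_new.shape)
```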
You can also use a decision tree or GBDT. Can every machine learning method serve as the base learner of the embedded method? No. In general, only algorithms that expose feature coefficients (coef_) or feature importances (feature_importances_) can be used as the base learner.
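And a tree-based sketch in the same spirit, using GBDT's feature importances (the threshold="median" choice is just one reasonable option):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# GBDT exposes feature_importances_, so it can serve as the base learner.
gbdt = GradientBoostingClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(gbdt, threshold="median")
X_new = selector.fit_transform(X, y)

print(selector.estimator_.feature_importances_.round(3))
print(X_new.shape)   # roughly half of the 30 features are kept
```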
3. Searching for Advanced Features
Once we have the existing features, we can construct more advanced features as needed. For example, from a vehicle's distance-traveled feature and time-interval feature we can derive the second-level feature of average speed; from the speed feature we can derive the third-level feature of acceleration; from the acceleration feature we can derive the fourth-level feature of the rate of change of acceleration, and so on. In other words, advanced features can be built up indefinitely.
In algorithm competitions such as Kaggle, besides ensemble learning algorithms, the main weapon of the high-scoring teams is advanced features. Finding advanced features is therefore one of the necessary steps of model optimization. Of course, when building a model for the first time, we can skip advanced features, establish a baseline model, and then look for advanced features to optimize it.
The most common ways of constructing advanced features are the following (a small sketch follows the list):
Sum of several features: suppose you want a weekly-sales feature based on daily sales; you can sum the sales of the last seven days.
Difference of several features: suppose you already have weekly sales and monthly sales; subtracting the last week's sales from the monthly sales gives the sales of the first three weeks of the month.
Product of several features: given the product price and the number of units sold, their product gives a sales revenue feature.
Quotient of several features: given each user's total spending and the number of items they bought, dividing the two gives the user's average spend per item.
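As a sketch of these four combinations with pandas (every column name and number below is made up purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "daily_sales":  [12, 15, 9, 20, 18, 22, 14],    # last 7 days, one row per day
    "price":        [3.5, 3.5, 3.5, 4.0, 4.0, 4.0, 4.0],
    "items_bought": [4, 5, 3, 5, 6, 5, 4],
})

# Sum of several features: weekly sales from daily sales.
weekly_sales = df["daily_sales"].sum()

# Product of several features: revenue = price * units sold.
df["revenue"] = df["price"] * df["daily_sales"]

# Quotient of several features: average spend per item.
df["avg_spend_per_item"] = df["revenue"] / df["items_bought"]

# Difference of several features: monthly sales minus the last week's sales
# gives the sales of the earlier three weeks (monthly_sales assumed known).
monthly_sales = 400
first_three_weeks_sales = monthly_sales - weekly_sales

print(weekly_sales, first_three_weeks_sales)
print(df[["revenue", "avg_spend_per_item"]].head())
```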
Of course, there are far more ways of constructing advanced features than these. They should be designed according to your business and model needs, not by combining features at random, which easily leads to a feature explosion without actually improving the model. In my experience, for clustering it is better to use as few advanced features as possible, while for classification and regression more advanced features tend to help.
4. Summary of Feature Selection
Feature selection is the first step of feature engineering, and it largely determines the upper bound of what our machine learning algorithms can achieve. The principle is to include every feature that might be useful, but not to pile on too many features.
Source: Feature Engineering-Feature Selection