Common Feature Selection Algorithms

Source: Internet
Author: User
Tags: random seed, SVM

Feature selection is a very important direction in the field of machine learning.

It serves two main purposes:

(1) Reducing the number of features (dimensionality reduction) makes the model generalize better and reduces overfitting.

(2) It improves our understanding of the features and their values.

Several common methods of feature selection

1. Remove features whose values barely vary

Examine the variance of each feature across the samples; given a threshold, discard the features whose variance falls below that threshold.
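As a concrete illustration (not from the original article), here is a minimal sketch using scikit-learn's VarianceThreshold; the toy data and the threshold are made up purely for illustration.

from sklearn.feature_selection import VarianceThreshold
import numpy as np

# Toy data: the second column barely varies, so it should be discarded
X = np.array([[0.0, 1.0, 3.2],
              [1.1, 1.0, 2.9],
              [0.9, 1.1, 3.1],
              [1.0, 1.0, 3.0]])

# Keep only features whose variance exceeds the chosen threshold
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print(selector.variances_)     # per-feature variances
print(selector.get_support())  # boolean mask of the kept features
print(X_reduced.shape)         # (4, 2): one feature was dropped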

2. Univariate feature selection

The idea behind univariate feature selection is to measure the relationship between each individual feature and the target variable, compute a score for every feature, and keep the top-ranked features. A classic approach is the chi-square test.
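For example, a minimal sketch of chi-square-based univariate selection with scikit-learn's SelectKBest on the iris data; the dataset and the choice of k=2 are arbitrary and not from the original article.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Score every feature against the class label with the chi-square test
# and keep the two highest-scoring features
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)  # chi-square score of each feature
print(X_new.shape)       # (150, 2)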

(1) Pearson correlation coefficient. Disadvantage: it is only sensitive to linear relationships.

(2) Distance correlation coefficient

Comparison: first, the Pearson correlation coefficient is fast to compute, which matters when working with large-scale data. Second, the Pearson correlation coefficient takes values in [-1, 1], whereas the distance correlation coefficient lies in [0, 1]. This lets the Pearson coefficient express a richer relationship: the sign indicates whether the relationship is positive or negative, and the absolute value indicates its strength. Of course, the Pearson correlation is only meaningful when the relationship between the two variables is roughly monotonic.
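The linearity limitation mentioned in (1) is easy to check numerically. The following sketch (synthetic data; scipy is assumed to be available) shows a Pearson correlation close to 1 for a noisy linear relationship but close to 0 for a purely quadratic one, even though the latter is fully deterministic.

import numpy as np
from scipy.stats import pearsonr

np.random.seed(0)
x = np.random.uniform(-1, 1, 10000)

# Noisy linear relationship: Pearson correlation is close to 1
print(pearsonr(x, x + np.random.normal(0, 0.1, 10000))[0])

# Quadratic relationship: y is completely determined by x,
# yet the Pearson correlation is close to 0
print(pearsonr(x, x ** 2)[0])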

(3) Model-based feature ranking

The idea of this approach is to use the machine learning algorithm you intend to apply to build a predictive model between each individual feature and the response variable, and to score each feature by how well its model predicts.

In the case of a linear relationship, the Pearson correlation coefficient is equivalent to the standardized regression coefficient in linear regression.
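A possible sketch of this idea, assuming a random forest as the model and cross-validated R^2 as the score, on the Boston housing data that the later examples in this article also use; the hyperparameters below are arbitrary.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.datasets import load_boston  # removed in recent scikit-learn releases; needs an older version
import numpy as np

boston = load_boston()
X, Y = boston["data"], boston["target"]
names = boston["feature_names"]

rf = RandomForestRegressor(n_estimators=20, max_depth=4)
scores = []
# Fit a model on each feature alone and record its cross-validated R^2
for i in range(X.shape[1]):
    score = cross_val_score(rf, X[:, i:i + 1], Y, scoring="r2",
                            cv=ShuffleSplit(n_splits=3, test_size=0.3))
    scores.append((round(np.mean(score), 3), names[i]))

print(sorted(scores, reverse=True))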

3. Linear models and regularization

Univariate feature selection measures the relationship between each feature and the response variable independently; another mainstream family of feature selection methods is based on machine learning models themselves. Methods in which feature scoring is coupled to the learned model are often called wrapper methods, while methods that score features independently of any model are called filter methods.

Multicollinearity: when several features are correlated with one another, the model becomes unstable, and small changes in the data can lead to large changes in the model (the changes in the model are essentially changes in its coefficients, or parameters, which can be thought of as the weight vector W). This makes the model's predictions hard to trust.

(1) Regularized models

Regularization adds extra constraints or penalty terms to an existing model (its loss function) in order to prevent overfitting and improve generalization.

L1 regularization (Lasso):

L1 regularization adds the L1 norm of the coefficient vector, i.e. the sum of the absolute values of the coefficients multiplied by a constant alpha, to the loss function. Because this penalty pushes many coefficients exactly to zero, it produces sparse models, which makes it directly useful for feature selection: the features whose coefficients remain non-zero are the selected ones. Note, however, that when features are correlated, which of them receives the non-zero coefficient can change with small perturbations of the data, so L1-based selection is not very stable.
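As an illustration (not necessarily the code the original article had in mind), here is a minimal sketch of L1-based selection with scikit-learn's Lasso on the Boston data; the alpha value is arbitrary, and the features are standardized so that the coefficient magnitudes are comparable.

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston  # removed in recent scikit-learn releases; needs an older version

boston = load_boston()
# Standardize the features so that coefficient sizes can be compared
X = StandardScaler().fit_transform(boston["data"])
Y = boston["target"]
names = boston["feature_names"]

lasso = Lasso(alpha=0.3)
lasso.fit(X, Y)

# Many coefficients are driven exactly to zero; the non-zero ones are the selected features
print(sorted(zip(map(lambda c: round(abs(c), 4), lasso.coef_), names), reverse=True))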

L2 regularization (Ridge):

L2 regularization adds the L2 norm of the coefficient vector to the loss function. Because the L2 penalty uses the squares of the coefficients, it behaves quite differently from L1; the most obvious difference is that L2 regularization spreads the coefficient values out more evenly. For correlated features this means they tend to receive similar coefficients. Take y = X1 + X2 as an example, and suppose X1 and X2 are strongly correlated. With L1 regularization, the penalty is the same, 2*alpha, whether the model is y = X1 + X2 or y = 2*X1. With L2 regularization, however, the penalty for the first model is 2*alpha while the penalty for the second is 4*alpha. In other words, for a fixed sum of coefficients the penalty is smallest when the coefficients are equal, so L2 pushes correlated features toward equal coefficients.

It follows that L2 regularization yields a stable model for feature selection: unlike with L1 regularization, the coefficients do not fluctuate with small changes in the data. L2 regularization therefore offers different value than L1 regularization and is more useful for understanding the features: the coefficient corresponding to every informative feature tends to be non-zero.

from sklearn.linear_model import LinearRegression, Ridge
import numpy as np

size = 100

def pretty_print_linear(coefs):
    # Format the coefficient vector as a readable linear expression
    return " + ".join("%s * X%s" % (round(coef, 3), i + 1) for i, coef in enumerate(coefs))

# We run the method with 10 different random seeds
for i in range(10):
    print("Random seed %s" % i)
    np.random.seed(seed=i)
    X_seed = np.random.normal(0, 1, size)
    # Three strongly correlated features (noisy copies of the same signal)
    X1 = X_seed + np.random.normal(0, .1, size)
    X2 = X_seed + np.random.normal(0, .1, size)
    X3 = X_seed + np.random.normal(0, .1, size)
    Y = X1 + X2 + X3 + np.random.normal(0, 1, size)
    X = np.array([X1, X2, X3]).T

    lr = LinearRegression()
    lr.fit(X, Y)
    print("Linear model:", pretty_print_linear(lr.coef_))

    ridge = Ridge(alpha=10)
    ridge.fit(X, Y)
    print("Ridge model:", pretty_print_linear(ridge.coef_))

4. Random forests

Random forests are accurate, robust, and easy to use, which makes them one of the most popular machine learning algorithms. Random forests provide two methods for feature selection: mean decrease impurity and mean decrease accuracy.

(1) Mean decrease impurity

Here the feature score is in fact the Gini importance. When using an impurity-based method, keep two things in mind: 1) the method is biased, favoring variables with more categories; 2) when several features are correlated, any one of them can serve as the indicator (they are all "good" features), but once one of them is selected, the importance of the others drops sharply, because the impurity they could have removed has already been removed by the feature selected first, and the remaining ones can no longer reduce it by as much. As a result, only the feature selected first gets a high importance score, while the other correlated features often appear much less important. This can be misleading when interpreting the data: it leads to the erroneous conclusion that the first-selected feature is important and the rest are not, when in fact all of them are similarly related to the response variable (this is very similar to the behavior of Lasso).

Note that this instability in the scores of correlated features is not specific to random forests; most model-based feature selection methods have the same problem.
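A minimal sketch of impurity-based scoring through the feature_importances_ attribute of scikit-learn's RandomForestRegressor, again on the Boston data; the forest size is arbitrary.

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston  # removed in recent scikit-learn releases; needs an older version

boston = load_boston()
X, Y = boston["data"], boston["target"]
names = boston["feature_names"]

rf = RandomForestRegressor(n_estimators=100)
rf.fit(X, Y)

# feature_importances_ holds the normalized mean decrease in impurity per feature
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names), reverse=True))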

(2) Mean decrease accuracy

Another common feature selection method is to measure each feature's effect on the model's accuracy directly. The main idea is to randomly permute the values of each feature in turn and to measure how much the permutation reduces the model's accuracy. Clearly, for unimportant variables the permutation has little effect on accuracy, whereas for important variables it lowers the accuracy noticeably.

from sklearn.model_selection import ShuffleSplit  # replaces the old sklearn.cross_validation module
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import load_boston  # removed in recent scikit-learn releases; needs an older version
from collections import defaultdict
import numpy as np

boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]

rf = RandomForestRegressor()
scores = defaultdict(list)

# Cross-validate the scores on a number of different random splits of the data
for train_idx, test_idx in ShuffleSplit(n_splits=100, test_size=0.3).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    rf.fit(X_train, Y_train)
    acc = r2_score(Y_test, rf.predict(X_test))
    # Shuffle each feature column in turn and record the relative drop in R^2
    for i in range(X.shape[1]):
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])
        shuff_acc = r2_score(Y_test, rf.predict(X_t))
        scores[names[i]].append((acc - shuff_acc) / acc)

print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat) for feat, score in scores.items()], reverse=True))

5. Two top-level feature selection approaches

(1) Stability selection

Stability selection is a relatively new method that combines subsampling with a selection algorithm; the selection algorithm can be regression, an SVM, or another similar method. Its main idea is to run the feature selection algorithm on different subsets of the data and different subsets of the features, repeat this many times, and finally aggregate the results, for example by computing how often a feature was judged important (the number of times it was selected as important divided by the number of subsets it was tested on).

from sklearn.linear_model import RandomizedLasso  # removed in recent scikit-learn releases; needs an older version
from sklearn.datasets import load_boston

# Using the Boston housing data.
# Data gets scaled automatically by sklearn's implementation
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]

rlasso = RandomizedLasso(alpha=0.025)
rlasso.fit(X, Y)

print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rlasso.scores_), names), reverse=True))

(2) Recursive feature elimination

The main idea of recursive feature elimination is to build a model (such as an SVM or a regression model) repeatedly, select the best (or worst) feature each time (for example, based on its coefficient), set the selected feature aside, and repeat the process on the remaining features until all features have been traversed. The order in which features are eliminated in this process is the feature ranking. It is therefore a greedy algorithm for finding the optimal subset of features.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston  # removed in recent scikit-learn releases; needs an older version

boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]

# Use linear regression as the model
lr = LinearRegression()
# Rank all features, i.e. continue the elimination until the last one
rfe = RFE(lr, n_features_to_select=1)
rfe.fit(X, Y)

print("Features sorted by their rank:")
print(sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names)))

Summary:

(1) For understanding the data, its structure, and its characteristics, univariate feature selection is a very good choice. It can also be used to rank features to improve a model, but it cannot detect redundancy: when a subset of features is strongly correlated with one another, univariate selection has difficulty accounting for that redundancy when choosing the best features.

(2) Regularized linear models are a very powerful tool for feature understanding and feature selection. L1 regularization produces sparse models, which is useful for selecting subsets of features. L2 regularization is more stable than L1 because useful features tend to keep non-zero coefficients, so L2 regularization is well suited to understanding the data. Since the relationship between the response variable and the features is often nonlinear, basis expansion can be used to map the features into a more suitable space, after which a simple linear model can be applied.

(3) Random forests are a very popular feature selection method: they are easy to use and generally require no elaborate feature engineering or parameter tuning, and many toolkits already provide the mean decrease impurity scores. Their two main problems are: 1) important features may receive very low scores (the correlated-feature problem), and 2) they are biased toward variables with more categories. Even so, this approach is well worth trying in applications.

(4) Feature selection is very useful in many machine learning and data mining scenarios. When using it, be clear about your goal and then find the method suited to your task. When selecting the best features to improve model performance, cross-validation can be used to verify whether one method is better than another. When using feature selection to understand the data, pay attention to the stability of the feature selection model: a model with poor stability can easily lead to wrong conclusions. It helps to subsample the data and run the feature selection algorithm on the subsets; if the results are consistent across subsets, the conclusions drawn from this dataset can be trusted, and the selection results can be used to understand the data.
