1. Background
1.1 Problems
In practical machine learning applications, the number of features can be large; some features may be irrelevant, and there may be correlations among features. This easily leads to the following consequences:
(1) The more features there are, the more time it takes to analyze them and train the model, and the more complex the model becomes.
(2) The more features there are, the more likely the "curse of dimensionality" becomes, and the model's ability to generalize degrades.
(3) More features easily cause the feature-sparsity problem that often appears in machine learning, degrading model performance.
(4) The model may become ill-posed, i.e., small changes in the samples cause large fluctuations in the solved parameters.
Feature selection can eliminate features that are irrelevant, redundant, or without discriminative power, thereby reducing the number of features, cutting training or running time, and improving model accuracy.
1.2 How to do feature selection
Feature selection means choosing a subset of features from the full feature set so that the model becomes better and more robust. Ideally, we would select the optimal subset of all features, i.e., the one that performs best on the current training and test data under some evaluation criterion.
From this perspective, feature selection can be viewed as one of three problems:
(1) Select a fixed number of features from the original feature set so that the classifier's error rate is minimized; this is an unconstrained combinatorial optimization problem.
(2) For a given allowable error rate, find the feature subset with the smallest dimensionality; this is a constrained optimization problem.
(3) Find a compromise between the error rate and the dimensionality of the feature subset.
All three problems above are NP-hard. When the feature dimensionality is small, exact solution is feasible, but when the dimensionality is large the complexity becomes enormous, so exact methods are hard to use in practice. Because computing the optimal solution is too expensive, we need algorithms that can find a good suboptimal solution within a given time budget. The process of finding such suboptimal solutions is described below.
2. General process of feature selection
The general process of feature selection is shown in Figure 3. First, a feature subset is generated from the full feature set; the evaluation function then scores the subset, and the result is compared with the stopping criterion. If the stopping criterion is not yet satisfied, the next feature subset is generated and feature selection continues; otherwise the search stops. The selected feature subset is then usually validated for effectiveness.
In summary, the feature selection process generally includes four parts: the subset generation procedure, the evaluation function, the stopping criterion, and the validation procedure.
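The four-part loop just described can be sketched as follows. The generator and evaluator here are hypothetical stand-ins (the toy evaluator rewards overlap with a made-up "relevant" set and penalizes subset size), purely to make the control flow concrete:

```python
from itertools import combinations

def select_features(all_features, evaluate, generate_next, threshold):
    """Generate subsets, evaluate each, stop when the criterion is met."""
    best_subset, best_score = None, float("-inf")
    for subset in generate_next(all_features):
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
        if best_score >= threshold:  # stopping criterion reached
            return best_subset, best_score
    return best_subset, best_score

# Toy demo: the evaluator rewards overlap with a hypothetical "relevant"
# set and penalizes subset size; the generator enumerates all subsets.
relevant = {"a", "c"}

def evaluate(s):
    return len(s & relevant) - 0.1 * len(s)

def generate_all(features):
    for r in range(1, len(features) + 1):
        for combo in combinations(sorted(features), r):
            yield set(combo)

best, score = select_features({"a", "b", "c"}, evaluate, generate_all, 1.8)
```

Swapping in a different generator (heuristic, random) or evaluator (filter, wrapper) changes the method without changing this outer loop.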
Feature subset generation (Generation Procedure)
A subset-generation method provides candidate feature subsets to the evaluation function. By search strategy, feature selection methods can be divided into exhaustive, heuristic, and random approaches.
The methods above do not change the original features themselves; other methods remove correlation by transforming the feature space, e.g., PCA, the Fourier transform, and the wavelet transform.
Evaluation function (Evaluation Function)
The evaluation function is the criterion for judging the quality of a feature subset. Its design differs across application scenarios; for example, some judge the effect on the final model partly by whether the data distribution is uniform. Every evaluation function has strengths and weaknesses, so it must be chosen according to the actual situation. By evaluation criterion, methods can be divided into the filter model, the wrapper model, and hybrid models. The filter model treats feature selection as a preprocessing step, using intrinsic characteristics of the data to evaluate the selected subset, independently of any learning algorithm. The wrapper model uses the results of a subsequent learning algorithm as part of the evaluation criterion. Depending on whether the evaluation function is tied to the classification method used, feature selection criteria can be divided into independent criteria and dependent (correlation) measures.
Filters measure the quality of a feature subset by analyzing its internal characteristics. Filters are generally used for preprocessing, independently of the choice of classifier. The filter principle is shown in Figure 1:
Figure 1. Filter principle (Ricardo Gutierrez-Osuna, 2008)
The wrapper essentially wraps a classifier: the sample set is projected onto the selected feature subset, and the classification accuracy serves as the criterion for measuring the subset's quality. The wrapper principle is shown in Figure 2.
Figure 2. Wrapper principle (Ricardo Gutierrez-Osuna, 2008)
Stopping criterion (Stopping Criterion)
The stopping criterion is tied to the evaluation function: the search stops when the evaluation function reaches a certain threshold. For example, with an independent criterion one might stop when the average distance between samples is maximized, while with a dependent measure the classifier's precision and recall can serve as the criterion.
Validation process (Validation Procedure)
The validity of the selected feature subset is verified on a held-out test data set. It is best to use validation measures unrelated to the earlier selection method, which reduces the coupling between them.
Figure 3. The process of feature selection (M. Dash and H. Liu, 1997)
The different methods used in each of these stages can be treated as interchangeable components and combined freely. For example, one can use a heuristic subset-generation method together with a dependent measure as the evaluation function.
3. Feature subset generation process
The generation procedure is the process of searching the feature subspace. Search algorithms fall into three categories: complete search, heuristic search, and random search, as shown in Figure 4.
Figure 4 Feature subset search process classification
Of course, these methods are not mutually exclusive; several can be combined so that they complement one another. Common search algorithms are briefly introduced below.
3.1 Complete search (Complete)
Complete search divides into two categories: exhaustive search and non-exhaustive search. Complete search considers the interactions among features and can therefore better find the optimal subset.
A. Breadth-first search (Breadth First Search)
Algorithm description: breadth-first traversal of the feature subspace.
STEP1: Put the root node into the queue.
STEP2: Remove the first node from the queue and test whether it is the target.
Substep: If it is the target, end the search and return the result.
Substep: Otherwise, add all of its not-yet-tested direct child nodes to the queue.
STEP3: If the queue is empty, all feature subsets have been checked; end the search and return "target not found".
STEP4: Repeat STEP2.
Algorithm evaluation: this enumerates all feature combinations, i.e., exhaustive search; its time complexity is O(2^n), so it is of little practical use.
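A minimal sketch of the STEP1-STEP4 loop above, where the "root" is the empty subset, each child adds one feature, and `is_target` is a hypothetical goal test supplied by the caller:

```python
from collections import deque

def bfs_feature_search(features, is_target):
    """Breadth-first traversal of the feature-subset space."""
    features = sorted(features)
    queue = deque([()])           # STEP1: root node (empty subset)
    while queue:                  # STEP3: empty queue => target not found
        subset = queue.popleft()  # STEP2: take the first node
        if is_target(set(subset)):
            return set(subset)
        # Enqueue the not-yet-tested direct children: each child extends
        # the subset by one feature that sorts after the last one added.
        last = features.index(subset[-1]) if subset else -1
        for f in features[last + 1:]:
            queue.append(subset + (f,))
    return None

# Toy goal test: search for one specific subset.
found = bfs_feature_search({"a", "b", "c"}, lambda s: s == {"b", "c"})
```

Generating children only from later-sorting features ensures every subset is enqueued exactly once, but the traversal still visits up to 2^n nodes.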
B. Branch and bound search (Branch and Bound)
Algorithm description: adds pruning bounds on top of exhaustive search. For example, branches that cannot possibly yield a better solution than the best one found so far are cut off.
C. Beam search (Beam Search)
Algorithm description: first select the N highest-scoring features as initial feature subsets and add them to a priority queue of bounded length. At each step, remove the highest-scoring subset from the queue, exhaustively generate every subset obtainable by adding one more feature to it, and add these new subsets to the queue.
D. Best First Search
Algorithm description: similar to beam search; the only difference is that the length of the priority queue is unbounded.
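A level-wise sketch of the beam-search idea above (removing the width limit moves it toward best-first search). The scoring function and toy data are made-up stand-ins for illustration:

```python
import heapq

def beam_search(features, score, beam_width=2, rounds=3):
    """Keep only the beam_width best subsets each round, expand each
    by one feature, and track the best subset seen overall."""
    beam = [frozenset()]
    best = (score(frozenset()), frozenset())
    for _ in range(rounds):
        # Expand every subset in the beam by one extra feature.
        candidates = {s | {f} for s in beam for f in features - s}
        if not candidates:
            break
        beam = heapq.nlargest(beam_width, candidates, key=score)
        top = max(beam, key=score)
        if score(top) > best[0]:
            best = (score(top), top)
    return set(best[1]), best[0]

# Hypothetical scorer: reward overlap with a "relevant" set, penalize size.
relevant = {"a", "c"}
score = lambda s: len(s & relevant) - 0.1 * len(s)
subset, val = beam_search({"a", "b", "c", "d"}, score)
```

With a narrow beam this trades optimality for speed; the width parameter controls that trade-off.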
3.2 Heuristic search (Heuristic)
Heuristic search mostly relies on greedy ideas. Some algorithms ignore correlations among features and consider only each feature's individual effect on the final result, even though real features may be correlated in many ways. Other algorithms improve on this, such as the plus-L minus-R selection algorithm and sequential floating selection.
A. Sequential forward selection (SFS, Sequential Forward Selection)
Algorithm description: the feature subset X starts empty; at each step, the feature x that makes the evaluation function J(X) optimal is added to X. In short, it is a simple greedy algorithm that each time picks the feature that most improves the evaluation function.
Algorithm evaluation: the drawback is that features can only be added, never removed. For example, suppose feature A is fully determined by features B and C, so A is redundant once B and C are present. If sequential forward selection first adds A and later adds B and C, the subset ends up containing the redundant feature A.
B. Sequential backward selection (SBS, Sequential Backward Selection)
Algorithm description: start from the full feature set O; at each step, remove the feature x whose removal makes the evaluation function optimal.
Algorithm evaluation: sequential backward selection is the opposite of sequential forward selection; its drawback is that features can only be removed, never added back.
Moreover, both SFS and SBS are greedy algorithms and easily get stuck in local optima.
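The greedy forward pass (SFS) can be sketched as below; SBS is the mirror image, removing instead of adding. The evaluation function J here is a made-up toy (overlap with a hypothetical relevant set minus a size penalty), only to show the loop:

```python
def sfs(features, J):
    """Sequential forward selection: greedily add the single feature
    that most improves the evaluation function J, until none helps."""
    selected = set()
    current = J(selected)
    while True:
        gains = {f: J(selected | {f}) for f in features - selected}
        if not gains:
            break
        best_f = max(gains, key=gains.get)
        if gains[best_f] <= current:  # no feature improves J: stop
            break
        selected.add(best_f)
        current = gains[best_f]
    return selected, current

# Hypothetical evaluation function for the demo.
relevant = {"a", "c"}
J = lambda s: len(s & relevant) - 0.1 * len(s)
subset, score = sfs({"a", "b", "c", "d"}, J)
```

Note how the loop never reconsiders an added feature; that is exactly the weakness the redundant-feature example above points out.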
C. Bidirectional search (BDS, Bidirectional Search)
Algorithm description: run sequential forward selection (SFS) from the empty set and sequential backward selection (SBS) from the full set simultaneously, stopping when both searches reach the same feature subset C.
The motivation for bidirectional search is illustrated in Figure 5: point O represents the start of the search and point A the search target. The grey circles represent the possible search range of a one-way search, and the green circles the range of a bidirectional search; it is easy to show that the green area must be smaller than the grey.
Figure 5. Bidirectional search
D. Plus-L minus-R selection (LRS, Plus-L Minus-R Selection)
There are two forms of the algorithm:
<1> Starting from the empty set, each round first adds L features and then removes R features so that the evaluation function is optimal. (L > R)
<2> Starting from the full set, each round first removes R features and then adds L features so that the evaluation function value is optimal. (L < R)
Algorithm evaluation: plus-L minus-R selection combines sequential forward selection and sequential backward selection; the choice of L and R is the key to the algorithm.
E. Sequential floating selection (Sequential Floating Selection)
Algorithm description: sequential floating selection developed from the plus-L minus-R selection algorithm. The difference is that in sequential floating selection L and R are not fixed but "floating", i.e., they change from round to round.
By search direction, sequential floating selection has the following two variants.
<1> Sequential floating forward selection (SFFS, Sequential Floating Forward Selection)
Algorithm description: starting from the empty set, each round first selects a subset x from the unselected features such that adding x makes the evaluation function optimal, then selects a subset z from the already-selected features such that removing z makes the evaluation function optimal.
<2> Sequential floating backward selection (SFBS, Sequential Floating Backward Selection)
Algorithm description: similar to SFFS, except that SFBS starts from the full set and each round first removes features and then adds features.
Algorithm evaluation: sequential floating selection combines the characteristics of sequential forward selection, sequential backward selection, and plus-L minus-R selection, and compensates for their shortcomings.
F. Decision tree method (DTM, Decision Tree Method)
Algorithm description: run C4.5 or another decision-tree algorithm on the training sample set until the tree is fully grown, then run a pruning algorithm on the tree. The features appearing on the branches of the final decision tree form the selected feature subset. The decision tree method generally uses information gain as the evaluation function.
3.3 Random algorithms (Random)
A. Random generation plus sequential selection (RGSS, Random Generation plus Sequential Selection)
Algorithm description: randomly generate a feature subset, then run the SFS and SBS algorithms starting from that subset.
Algorithm evaluation: can serve as a supplement to SFS and SBS for jumping out of local optima.
B. Simulated annealing (SA, Simulated Annealing)
Algorithm evaluation: simulated annealing overcomes, to some extent, the tendency of sequential search algorithms to get stuck in local optima, but if the region around the optimal solution is too small (the so-called "golf hole" landscape), simulated annealing struggles to find it.
C. Genetic algorithm (GA, Genetic Algorithm)
Algorithm description: first randomly generate a batch of feature subsets and score them with the evaluation function; then breed the next generation of subsets through crossover, mutation, and related operations, where subsets with higher scores have a higher probability of being selected to breed. After N generations of reproduction and survival of the fittest, the population may contain the feature subset with the highest evaluation-function value.
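A toy sketch of the genetic procedure above. Subsets are encoded as bit-strings; the fitness function, population size, and mutation rate below are all made-up illustration choices, not recommendations:

```python
import random

def ga_select(n_features, fitness, pop_size=12, generations=30, seed=0):
    """Evolve bit-string feature subsets: select the fitter half,
    breed children by one-point crossover, mutate occasionally."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # fitter subsets breed
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]           # one-point crossover
            if rng.random() < 0.2:              # occasional mutation
                child[rng.randrange(n_features)] ^= 1
            children.append(child)
        pop = parents + children                # elitism: parents survive
    return max(pop, key=fitness)

# Hypothetical fitness: reward features 0 and 2, penalize subset size.
fitness = lambda bits: 2 * (bits[0] + bits[2]) - sum(bits)
best = ga_select(4, fitness)
```

Keeping the parents in the next generation (elitism) makes the best score monotonically non-decreasing across generations.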
The common drawback of random algorithms is their dependence on random factors, which makes experimental results hard to reproduce.
3.4 Feature transformation methods
A. PCA
PCA (Principal Component Analysis) is a coordinate-transformation method that can remove redundant features.
In the transformation, the components with smaller eigenvalues are discarded, which denoises the data, removes correlation, and reduces the number of features.
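A minimal PCA sketch with NumPy, following the description above: centre the data, eigendecompose the covariance matrix, and keep the directions with the largest eigenvalues. The toy data is made up so that one column exactly duplicates the other:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components."""
    Xc = X - X.mean(axis=0)                 # centre the data
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order]           # project onto top components

# Toy data: the second column is a copy of the first (fully redundant),
# so a single component captures all of the variance.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
Z = pca(X, 1)
```

Dropping the component with the near-zero eigenvalue here is exactly the "remove small eigenvalues" step the text describes.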
B. Wavelet transform
The wavelet transform is another feature-space transformation method; compared with the Fourier transform, it adapts better to abrupt changes in the signal.
4. Evaluation function
The role of the evaluation function is to assess the quality of the feature subsets produced by the generation procedure.
4.1 Independent criteria
Independent criteria are typically used in filter-model feature selection algorithms: they evaluate the selected feature subset using intrinsic characteristics of the training data, independently of any specific learning algorithm. They usually include distance measures, information measures, correlation measures, and consistency measures.
This approach is recommended when building a general-purpose feature selection method, since it is independent of any specific machine learning algorithm and suits most downstream learners.
4.2 Dependent criteria (correlation measures)
Dependent criteria are typically used in wrapper-model feature selection algorithms: a learning algorithm is fixed first, and its performance serves as the evaluation criterion. For that particular learner, this usually finds a better feature subset than the filter model can, but it requires many invocations of the learning algorithm, is generally expensive, and the result may not suit other learning algorithms.
When building a specific pattern-classification system, we can adopt a dependent measure suited to our own situation, so that feature selection integrates well with our classification method; this usually finds a better subset.
In summary, the advantages, disadvantages, and applicable situations of the two kinds of evaluation functions are as follows:
Method           | Independent criteria                              | Dependent criteria
Advantages       | Universal; independent of any specific algorithm  | Can be near-optimal for the associated classification algorithm
Disadvantages    | Results are only moderately good                  | Not applicable to other algorithms
Where applicable | General-purpose feature selection                 | Feature selection tailored to a specific classifier
4.3 Common evaluation functions
A. Chi-square test (Chi-Square Test)
The basic idea of the chi-square test is to judge whether a theory is correct by observing the deviation between actual values and theoretical values. In practice, one first assumes that the two variables are independent (the "null hypothesis") and then observes the degree of deviation between the actual (observed) values and the theoretical values expected if the two really were independent. If the deviation is small enough, we attribute it to natural sampling error or imprecise measurement, accept the null hypothesis, and conclude the two are indeed independent. If the deviation is so large that it is unlikely to arise from chance or measurement error, we conclude that the two are actually related, i.e., we reject the null hypothesis and accept the alternative hypothesis.
With theoretical value E and actual values x, the degree of deviation is computed as:

chi^2 = sum_i (x_i - E)^2 / E
This formula is the deviation measure used in the chi-square test. Given the observed values of several samples x1, x2, ..., xi, ..., xn, substituting them into the formula yields the chi-square value. If this value exceeds the threshold (i.e., the deviation is very large), the null hypothesis is rejected; otherwise, the null hypothesis is accepted. [For details, see my separate article on the chi-square test.]
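A direct sketch of the statistic just described. For generality this version allows a per-cell expected count e_i rather than a single E (an assumption on my part); the counts are made up for illustration:

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum of squared deviations between observed
    and expected counts, each normalised by the expected count."""
    return sum((x - e) ** 2 / e for x, e in zip(observed, expected))

# Toy example: observed counts vs. the counts expected under the
# independence (null) hypothesis.
stat = chi_square([18, 22, 40], [20, 20, 40])
```

The resulting statistic is then compared against a threshold from the chi-square distribution for the appropriate degrees of freedom.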
B. Correlation
Using correlation to measure a feature subset rests on the assumption that a good subset should be highly correlated with the class label (high relevance) while its features should be weakly correlated with one another (low redundancy).
The linear correlation coefficient can be used to measure the linear correlation between vectors.
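A small sketch of the linear (Pearson) correlation coefficient; the sample vectors below are made up to show the two extremes:

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient between two vectors.
    |r| near 1 suggests strong linear relevance; near 0, none."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly linearly related
```

In the high-relevance/low-redundancy scheme above, one would want large |r| between each feature and the label and small |r| between pairs of selected features.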
C. Distance (Distance Metrics)
Feature selection using distance measures rests on the assumption that a good feature subset should make distances between samples of the same class as small as possible and distances between samples of different classes as large as possible. The Fisher discriminant is based on the same idea.
Common distance measures (similarity measures) include the Euclidean distance, the normalized Euclidean distance, the Mahalanobis distance, and so on.
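A tiny sketch of the distance criterion using plain Euclidean distance; the 2-D points are made up so that the within-class distance is visibly smaller than the between-class distance:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Under a good feature subset, same-class samples should be close and
# different-class samples far apart.
same_class = euclidean([1.0, 1.0], [1.5, 1.0])   # within-class distance
diff_class = euclidean([1.0, 1.0], [4.0, 5.0])   # between-class distance
```

A distance-based criterion would score a subset by how much `diff_class`-style distances exceed `same_class`-style ones across the whole sample set.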
D. Information gain (Information Gain)
Suppose the discrete variable Y takes values in {y1, y2, ..., ym}, and the probability of yi is pi. The information entropy of Y is defined as:

H(Y) = - sum_i pi * log2(pi)

Information entropy describes uncertainty and has the following property: the more uneven the distribution of Y, the smaller the entropy; the more even the distribution, the larger the entropy. In the extreme cases: if Y can take only one value, i.e., p1 = 1, then H(Y) attains its minimum value 0; conversely, if all values are equally likely, i.e., pi = 1/m, then H(Y) attains its maximum value log2(m).
For a feature t, compare the amount of information the system has when t is unknown (the entropy) with the amount remaining when t is fixed (the conditional entropy); the difference between the two is the information the feature brings to the system.
Conditional entropy: the uncertainty that remains in the system once the value of feature t is fixed.
If a feature X can take n values (x1, x2, ..., xn), compute the conditional entropy for each value and take their probability-weighted average:

H(Y|X) = sum_j P(X = xj) * H(Y | X = xj)

In text classification, the feature term T takes only two values: t (the term is present) and t̄ (the term is absent). So:

H(Y|T) = P(t) * H(Y|t) + P(t̄) * H(Y|t̄)

Finally, the information gain is:

IG(T) = H(Y) - H(Y|T)
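The entropy and information-gain computation above can be sketched directly. Here `labels` are class labels and `feature` is the per-sample presence indicator for a term; the data is made up so the feature perfectly predicts the label:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_i p_i * log2(p_i) over the label distribution."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature):
    """IG = H(Y) - H(Y|F): entropy minus probability-weighted
    conditional entropy over each value of the feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        subset = [y for y, f in zip(labels, feature) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

labels = ["pos", "pos", "neg", "neg"]
feature = [1, 1, 0, 0]          # presence perfectly predicts the label
gain = information_gain(labels, feature)
```

A perfectly predictive binary feature yields a gain equal to the full label entropy (here 1 bit); an uninformative feature yields a gain near 0.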
But the problem with information gain [for multi-class problems; it does not arise in binary classification] is that it can only assess a feature's contribution to the whole system, not to a specific class. This makes it suitable only for so-called "global" feature selection (all classes share one feature set) rather than "local" feature selection (each class gets its own feature set, since a word may be highly discriminative for one class yet unimportant for another).
Moreover, information entropy favors features with many distinct values, so an improved method is to use the information gain ratio instead.
E. Classifier error rate (Classifier Error Rate)
Using a specific classifier, classify the sample set with the given feature subset and use the classification accuracy to measure the quality of the subset.
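A sketch of this wrapper criterion with a deliberately tiny stand-in learner (a nearest-centroid classifier); the classifier choice and toy data are assumptions purely for illustration, in place of whatever learner the wrapper actually uses:

```python
def accuracy_with_features(X, y, feats):
    """Score a feature subset by the accuracy of a nearest-centroid
    classifier trained and evaluated on those features only."""
    def project(row):
        return [row[i] for i in feats]
    # One centroid per class, computed on the selected features.
    centroids = {}
    for label in set(y):
        rows = [project(r) for r, l in zip(X, y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    def predict(row):
        p = project(row)
        return min(centroids,
                   key=lambda l: sum((a - b) ** 2
                                     for a, b in zip(p, centroids[l])))
    return sum(predict(r) == l for r, l in zip(X, y)) / len(y)

# Toy data: feature 0 separates the classes; feature 1 is pure noise.
X = [[0.0, 9.0], [0.1, 1.0], [1.0, 8.0], [1.1, 2.0]]
y = [0, 0, 1, 1]
good = accuracy_with_features(X, y, [0])
```

In a real wrapper, this score would be computed on held-out data for every candidate subset, which is exactly why the wrapper model is expensive.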
Of the measures above, the first four (chi-square test, correlation, distance, information gain) belong to the filter family, while the classifier error rate belongs to the wrapper family. Filters are independent of any specific classification algorithm, so they generalize better across different classifiers and are computationally cheap. Wrappers, because they evaluate with a specific classification algorithm, may generalize poorly to other classifiers and are computationally more expensive.
5. Application example
Here is a practical example. The basic recipe is heuristic search (sequential addition) + a dependent criterion (chi-square ranking with a maximum-entropy model) + a precision/recall stopping criterion. The procedure is as follows.
STEP1: Compute the chi-square value of each feature.
STEP2: Take the top-N features by chi-square value.
STEP3: Train the model with them and compute precision and recall on the test set.
STEP4: If the metrics meet the target, stop; otherwise increase N and go to STEP2.
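The STEP1-STEP4 loop can be sketched as follows. `chi2_score`, `train_and_eval`, the step size, and the target threshold are all hypothetical stand-ins for the real statistics, model, and acceptance criterion:

```python
def select_by_chi2(features, chi2_score, train_and_eval, target, step=10):
    """Rank features by chi-square score, take the top N, train and
    evaluate, and grow N until the test metric clears the target."""
    ranked = sorted(features, key=chi2_score, reverse=True)  # STEP1
    n = step
    while n <= len(ranked):
        top_n = ranked[:n]                                   # STEP2
        metric = train_and_eval(top_n)                       # STEP3
        if metric >= target:                                 # STEP4
            return top_n, metric
        n += step
    return ranked, train_and_eval(ranked)

# Toy stand-ins: the "chi-square score" is just the feature id, and the
# "test metric" grows with the number of selected features.
feats = list(range(100))
chosen, m = select_by_chi2(feats, lambda f: f, lambda fs: len(fs) / 100, 0.3)
```

In practice `train_and_eval` would train the actual model on the training split and return precision/recall on the test split, so each loop iteration is one full training run.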
References
[1] Levin, Several feature selection algorithms in machine learning, postdoctoral dissertation.
[2]http://casparzhang.blog.163.com/blog/static/126626558201332701016809/
Original address: http://blog.csdn.net/iezengli/article/details/32686803
Overview of Feature selection in machine learning