Original: http://www.cnblogs.com/heaad/archive/2011/01/02/1924088.html
1 Overview
(1) What is feature selection
Feature selection, also called feature subset selection (FSS) or attribute selection, refers to picking a subset of the available features so that the model built on that subset performs better.
(2) Why do feature selection
In practical machine learning applications the number of features is often large; some of them may be irrelevant and some may depend on one another, which easily leads to the following consequences:
- The more features you have, the longer it takes to analyze the features and train the model.
- The more features there are, the more likely the "curse of dimensionality" becomes: the model grows more complex and its generalization ability declines.
Feature selection removes irrelevant and redundant features, thereby reducing the number of features, improving model accuracy, and reducing running time. Moreover, selecting the truly relevant features simplifies the model and makes it easier for researchers to understand how the data were generated.
2 Feature selection process
2.1 General process of feature selection
The general process of feature selection is shown in Figure 1. First, a subset is generated from the complete set of features; the evaluation function then scores that subset, and the score is compared with the stopping criterion. If the score satisfies the stopping criterion the search stops; otherwise the next feature subset is generated and the selection continues. The selected feature subset is usually also validated for effectiveness.
In summary, feature selection generally consists of four parts: the generation procedure, the evaluation function, the stopping criterion, and the validation procedure.
(1) Generation procedure (Generation Procedure)
The generation procedure searches the space of feature subsets and supplies candidate subsets to the evaluation function. Search strategies are described in section 2.2.
(2) Evaluation function (Evaluation Function)
The evaluation function is a criterion for measuring how good a feature subset is; it is discussed in section 2.3.
(3) Stopping criterion (Stopping Criterion)
The stopping criterion is tied to the evaluation function; it is usually a threshold, and the search stops once the evaluation value reaches it.
(4) Validation procedure (Validation Procedure)
The effectiveness of the selected feature subset is verified on a separate validation data set.
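To make the four parts concrete, the loop below is a minimal sketch (not from the original article); generate, evaluate, threshold, and validate are placeholder names for whichever generation procedure, evaluation function, stopping criterion, and validation step are chosen.

```python
def select_features(all_features, generate, evaluate, threshold, validate):
    """Skeleton of the four-part process: generation, evaluation, stopping, validation."""
    best_subset, best_score = None, float("-inf")
    while True:
        subset = generate(all_features, best_subset)   # generation procedure
        if subset is None:                             # search space exhausted
            break
        score = evaluate(subset)                       # evaluation function
        if score > best_score:
            best_subset, best_score = subset, score
        if best_score >= threshold:                    # stopping criterion
            break
    validate(best_subset)                              # validation procedure
    return best_subset
```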
Figure 1. The process of feature selection (M. Dash and H. Liu 1997)
2.2 Generation procedure
The generation procedure searches the feature subset space. Search algorithms fall into three categories: complete search, heuristic search, and random search, as shown in Figure 2.
Figure 2. Classification of generation-procedure algorithms (M. Dash and H. Liu 1997)
The following is a brief introduction to common search algorithms.
2.2.1 Complete search
Complete search is divided into two categories: exhaustive and non-exhaustive search.
(1) Breadth-first search (Breadth First Search)
Algorithm description: traverse the feature subset space breadth-first.
Algorithm evaluation: it enumerates all feature combinations, so it is an exhaustive search with time complexity O(2^n) and little practical value.
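As a concrete illustration of the O(2^n) cost, the sketch below enumerates every non-empty feature subset with itertools.combinations; the evaluate scoring function is an assumed placeholder.

```python
from itertools import combinations

def exhaustive_search(features, evaluate):
    """Enumerate all 2^n - 1 non-empty feature subsets and keep the best one."""
    best_subset, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```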
(2) Branch and bound search (Branch and Bound)
Algorithm description: exhaustive search with bounding added; for example, branches that cannot possibly lead to a better solution than the best one found so far are pruned.
(3) Beam search (Beam Search)
Algorithm description: first select the N highest-scoring features as initial feature subsets and add them to a priority queue of bounded maximum length; in each round, remove the highest-scoring subset from the queue, exhaustively generate all subsets obtained by adding one more feature to it, and add these subsets back into the queue.
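The queue-based description above can be realized in several ways; the sketch below is a common level-wise variant in which, at every subset size, only the beam_width best subsets are kept. The evaluate function and beam_width parameter are assumed placeholders.

```python
import heapq

def beam_search(features, evaluate, beam_width):
    """Level-wise beam search: keep only the beam_width best subsets at each step."""
    beam = heapq.nlargest(beam_width, [frozenset([f]) for f in features], key=evaluate)
    best = max(beam, key=evaluate)
    while beam:
        # Expand every subset in the beam by one extra feature.
        candidates = {s | {f} for s in beam for f in features if f not in s}
        if not candidates:                    # all subsets already contain every feature
            break
        beam = heapq.nlargest(beam_width, candidates, key=evaluate)
        best = max(beam + [best], key=evaluate)
    return best
```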
(4) Best-first search (Best First Search)
Algorithm description: similar to beam search, except that the length of the priority queue is unlimited.
2.2.2 Heuristic Search
(1) Sequential forward selection (SFS, Sequential Forward Selection)
Algorithm description: the feature subset X starts from the empty set; in each round the feature x whose addition makes the evaluation function J(X) optimal is selected and added to X. In short, it is a greedy algorithm that each time picks the feature giving the best evaluation value.
Algorithm evaluation: the disadvantage is that features can only be added, never removed. For example, suppose feature A is completely determined by features B and C, so A is redundant once B and C are present. If sequential forward selection first adds A and later adds B and C, the resulting subset contains the redundant feature A.
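A minimal sketch of SFS follows, assuming an evaluate function that scores a list of features; it stops when no single addition improves the score.

```python
def sfs(features, evaluate):
    """Sequential forward selection: greedily add the feature that most improves J(X)."""
    selected, remaining = [], list(features)
    best_score = float("-inf")
    while remaining:
        # Score each candidate feature when added to the current subset.
        score, f = max(((evaluate(selected + [f]), f) for f in remaining),
                       key=lambda t: t[0])
        if score <= best_score:          # stop once no addition improves the evaluation
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected, best_score
```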
(2) Sequential backward selection (SBS, Sequential Backward Selection)
Algorithm description: starting from the complete feature set O, in each round the feature x whose removal makes the evaluation function optimal is removed from O.
Algorithm evaluation: sequential backward selection is the mirror image of sequential forward selection; its disadvantage is that features can only be removed, never added back.
In addition, both SFS and SBS are greedy algorithms and easily get stuck in local optima.
(3) Bidirectional search (BDS, Bidirectional Search)
Algorithm description: run sequential forward selection (SFS) from the empty set and sequential backward selection (SBS) from the complete set at the same time, and stop when the two searches reach the same feature subset C.
The motivation for bidirectional search is illustrated below: point O represents the start of the search and point A represents the search target. The grey circle represents the possible search range of a one-way search, while the two green circles represent the search ranges of the bidirectional search; it is easy to see that the green area must be smaller than the grey one.
Figure 2. Bidirectional search
(4) Plus-L minus-R selection (LRS, Plus-L Minus-R Selection)
There are two forms of the algorithm:
<1> The algorithm starts from the empty set; in each round it first adds L features and then removes R features, so that the evaluation function value is optimal. (L > R)
<2> The algorithm starts from the complete set; in each round it first removes R features and then adds L features, so that the evaluation function value is optimal. (L < R)
Algorithm evaluation: the L-R selection algorithm combines sequential forward selection and sequential backward selection; the choice of L and R is the key to the algorithm.
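Below is a minimal sketch of form <1> (starting from the empty set), assuming an evaluate function over feature lists and a hypothetical target_size stopping parameter not mentioned in the article.

```python
def lrs(features, evaluate, L, R, target_size):
    """Plus-L minus-R selection, form <1>: grow from the empty set (requires L > R)."""
    assert L > R and target_size <= len(features)
    selected, remaining = [], list(features)
    while len(selected) < target_size and remaining:
        # Add the L features that each give the best evaluation value when added.
        for _ in range(min(L, len(remaining))):
            f = max(remaining, key=lambda x: evaluate(selected + [x]))
            selected.append(f)
            remaining.remove(f)
        if len(selected) >= target_size:
            break
        # Then remove the R features whose removal hurts the evaluation the least.
        for _ in range(R):
            f = max(selected, key=lambda x: evaluate([g for g in selected if g != x]))
            selected.remove(f)
            remaining.append(f)
    return selected
```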
(5) Sequential floating selection (Sequential Floating Selection)
Algorithm description: sequential floating selection grew out of the L-R selection algorithm; the difference is that its L and R are not fixed but "floating", i.e. they change as the search proceeds.
Depending on the search direction, sequential floating selection has the following two variants.
<1> Sequential floating forward selection (SFFS, Sequential Floating Forward Selection)
Algorithm description: starting from the empty set, each round first selects a subset x from the unselected features such that the evaluation function is optimal after adding x, and then selects a subset z from the selected features such that the evaluation function is optimal after removing z.
<2> Sequential floating backward selection (SFBS, Sequential Floating Backward Selection)
Algorithm description: similar to SFFS, except that SFBS starts from the complete set; each round it first removes features and then adds features.
Algorithm evaluation: sequential floating selection combines the characteristics of sequential forward selection, sequential backward selection and L-R selection, and compensates for their shortcomings.
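The sketch below outlines a common single-feature SFFS variant: add one feature per forward step, then conditionally remove features while doing so beats the best subset previously seen at the smaller size. The evaluate function and target_size parameter are assumed placeholders.

```python
def sffs(features, evaluate, target_size):
    """Sequential floating forward selection: add one feature, then conditionally backtrack."""
    assert 0 < target_size <= len(features)
    selected, best_score, best_subset = [], {}, {}   # best result seen per subset size
    while len(selected) < target_size:
        # Forward step: add the single feature giving the best evaluation value.
        remaining = [f for f in features if f not in selected]
        f = max(remaining, key=lambda x: evaluate(selected + [x]))
        selected = selected + [f]
        if evaluate(selected) > best_score.get(len(selected), float("-inf")):
            best_score[len(selected)] = evaluate(selected)
            best_subset[len(selected)] = list(selected)
        # Floating step: drop features while that beats the best known smaller subset.
        while len(selected) > 2:
            g = max(selected, key=lambda x: evaluate([h for h in selected if h != x]))
            reduced = [h for h in selected if h != g]
            if evaluate(reduced) > best_score.get(len(reduced), float("-inf")):
                selected = reduced
                best_score[len(selected)] = evaluate(selected)
                best_subset[len(selected)] = list(selected)
            else:
                break
    return best_subset[target_size]
```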
(6) Decision tree method (Decision Tree Method, DTM)
Algorithm description: run C4.5 or another decision tree induction algorithm on the training set and, once the tree has fully grown, run a pruning algorithm on it. The features that appear in the branches of the final decision tree form the selected feature subset. The decision tree method usually uses information gain as the evaluation function.
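A rough sketch using scikit-learn is shown below (the library is an assumption of this example, not named in the article). scikit-learn implements CART rather than C4.5, so DecisionTreeClassifier with the entropy criterion and cost-complexity pruning stands in for it; the features used in the pruned tree's splits are returned.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def dtm_select(X, y, feature_names):
    """Fit a pruned decision tree and keep the features actually used in its splits."""
    # ccp_alpha > 0 turns on cost-complexity pruning after the tree has fully grown.
    tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01, random_state=0)
    tree.fit(X, y)
    used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])  # negative values mark leaves
    return [feature_names[i] for i in used]
```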
2.2.3 Random search
(1) Random generation plus sequential selection (RGSS, Random Generation plus Sequential Selection)
Algorithm description: randomly generate a feature subset, then run SFS and SBS starting from that subset.
Algorithm evaluation: it can serve as a supplement to SFS and SBS for escaping local optima.
(2) Simulated annealing (SA, Simulated Annealing)
For the simulated annealing algorithm itself, see the author's separate plain-language post on simulated annealing.
Algorithm evaluation: simulated annealing overcomes, to some extent, the tendency of sequential search algorithms to get stuck in local optima, but if the region around the optimal solution is very small (the so-called "golf-hole" landscape), simulated annealing has difficulty finding it.
(3) Genetic algorithm (GA, Genetic Algorithms)
For genetic algorithms themselves, see the author's separate introductory post on genetic algorithms.
Algorithm description: first randomly generate a batch of feature subsets and score them with the evaluation function, then breed the next generation of subsets through crossover, mutation and similar operations, with higher-scoring subsets having a higher probability of being selected for breeding. After N generations of reproduction and survival of the fittest, the population may contain the feature subset with the highest evaluation value.
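A minimal sketch is given below, assuming each individual is a 0/1 mask over the features and evaluate scores such a mask; for simplicity it uses truncation selection (keeping the top half as parents) rather than strictly fitness-proportional selection, plus one-point crossover and bit-flip mutation.

```python
import random

def ga_select(n_features, evaluate, pop_size=20, generations=50, mutation_rate=0.02):
    """Genetic algorithm sketch: evolve 0/1 feature masks (assumes n_features >= 2)."""
    population = [[random.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=evaluate, reverse=True)
        parents = scored[: pop_size // 2]                  # fitter subsets get to breed
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_features)          # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < mutation_rate) for bit in child]  # mutation
            children.append(child)
        population = parents + children
    return max(population, key=evaluate)
```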
A common disadvantage of these random search algorithms is their dependence on random factors, which makes experimental results hard to reproduce.
2.3 Evaluation functions
The role of the evaluation function is to assess the quality of the feature subsets supplied by the generation procedure.
By working principle, evaluation functions fall into two major categories: filters and wrappers.
A filter measures the quality of a feature subset by analysing properties of the subset itself. Filters are generally used as a preprocessing step, independently of the choice of classifier. The filter principle is shown in Figure 3:
Figure 3. Filter principle (Ricardo Gutierrez-osuna 2008)
A wrapper is essentially a classifier: it classifies the sample set using the candidate feature subset and takes the classification accuracy as the criterion for measuring the quality of that subset. The wrapper principle is shown in Figure 4.
Figure 4. Wrapper principle (Ricardo Gutierrez-osuna 2008)
The following is a brief introduction to common evaluation functions.
(1) Correlation
Using correlation to evaluate a feature subset rests on this assumption: a good feature subset should be highly correlated with the class label, while the correlation among the features themselves should be low (low redundancy).
The linear correlation coefficient can be used to measure the linear correlation between vectors.
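A small sketch of this filter score, assuming NumPy, a numeric sample matrix X of shape (n_samples, n_features), and numeric class labels y: a good subset should show large values in feature_class_corr and small off-diagonal values in feature_feature_corr.

```python
import numpy as np

def correlation_scores(X, y):
    """Pearson correlation of each feature with the class label, and between features."""
    feature_class_corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                                   for j in range(X.shape[1])])
    feature_feature_corr = np.abs(np.corrcoef(X, rowvar=False))  # pairwise feature correlations
    return feature_class_corr, feature_feature_corr
```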
(2) Distance (Distance Metrics)
Distance-based feature selection rests on the assumption that a good feature subset should make the distances between samples of the same class as small as possible and the distances between samples of different classes as large as possible.
Common distance measures (similarity measures) include the Euclidean distance, the normalized Euclidean distance, and the Mahalanobis distance.
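One simple way to turn this idea into a score, sketched below assuming NumPy arrays and at least two classes, is the ratio of the mean distance between class centroids to the mean within-class spread (larger is better); this particular ratio is an illustrative choice, not a measure named in the article.

```python
import numpy as np

def distance_score(X, y):
    """Mean distance between class centroids divided by mean within-class spread."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    within = np.mean([np.linalg.norm(X[y == c] - X[y == c].mean(axis=0), axis=1).mean()
                      for c in classes])
    between = np.mean([np.linalg.norm(a - b)
                       for i, a in enumerate(centroids) for b in centroids[i + 1:]])
    return between / within
```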
(3) Information gain (Information Gain)
Suppose the discrete variable Y takes its values in {y1, y2, ..., ym}, and the probability that yi occurs is pi. The information entropy of Y is defined as:

$H(Y) = -\sum_{i=1}^{m} p_i \log_2 p_i$

Information entropy has the following property: the "purer" the distribution of Y, the smaller the entropy; the more "disordered" the distribution, the larger the entropy. In the extreme cases: if Y can take only one value, i.e. p1 = 1, then H(Y) attains its minimum value 0; conversely, if all values are equally likely, i.e. pi = 1/m, then H(Y) attains its maximum value log2(m).
After conditioning on another variable X, i.e. once X = xi is known, the conditional entropy (Conditional Entropy) of Y is expressed as:

$H(Y|X) = \sum_{i} P(X = x_i)\, H(Y|X = x_i)$

The information gain of Y from adding the condition X is defined as:

$IG(Y|X) = H(Y) - H(Y|X)$

Similarly, the information entropy H(C) of the class label C can be expressed as:

$H(C) = -\sum_{k} P(c_k) \log_2 P(c_k)$

Given a feature Fj, the conditional entropy H(C|Fj) of the class C is expressed as:

$H(C|F_j) = \sum_{v} P(F_j = v)\, H(C|F_j = v)$

The change in the entropy of C before and after selecting the feature Fj is called the information gain of C (Information Gain), with the formula:

$IG(C|F_j) = H(C) - H(C|F_j)$

Suppose there are two feature subsets A and B and the class variable is C. If IG(C|A) > IG(C|B), then feature subset A is considered to give a better classification result than B, so subset A is preferred.
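For a single discrete feature, the two formulas above translate directly into code; a small sketch assuming NumPy arrays:

```python
import numpy as np

def entropy(labels):
    """H(C) = -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """IG(C | F) = H(C) - H(C | F) for one discrete feature F."""
    values, counts = np.unique(feature_values, return_counts=True)
    cond_entropy = sum((c / len(labels)) * entropy(labels[feature_values == v])
                       for v, c in zip(values, counts))
    return entropy(labels) - cond_entropy

# Example: information_gain(np.array(["sunny", "rain", "sunny"]), np.array([0, 1, 1]))
```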
(4) Consistency
If sample 1 and sample 2 belong to different classes but take exactly the same values on features A and B, then the feature subset {A, B} should not be selected as the final feature set.
(5) Classifier error rate (Classifier Error Rate)
A specific classifier is used to classify the sample set with the given feature subset, and the classification accuracy serves as the measure of the subset's quality.
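A wrapper-style evaluation can be sketched with scikit-learn as below (the library and the k-NN classifier are illustrative assumptions, not choices made in the article); the returned score can be plugged in as the evaluate function in the search sketches above.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def wrapper_score(X, y, feature_idx):
    """Wrapper evaluation: cross-validated accuracy of a classifier on the chosen columns."""
    clf = KNeighborsClassifier(n_neighbors=5)      # any classifier can play this role
    return cross_val_score(clf, X[:, list(feature_idx)], y, cv=5).mean()
```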
Of the above five measures, correlation, distance, information gain and consistency belong to the filter category, while the classifier error rate belongs to the wrapper category.
Filters are independent of any specific classification algorithm, so they generalize better across different classifiers and require little computation. Wrappers, by contrast, apply a specific classification algorithm during evaluation, so they may generalize poorly to other classifiers and are computationally more expensive.
Resources
[1] M. Dash, H. Liu. Feature selection for classification. Intelligent Data Analysis 1 (1997) 131–156.
[2] Lei Yu, Huan Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution.
[3] Ricardo Gutierrez-Osuna. Introduction to Pattern Analysis (Lecture 11: Sequential Feature Selection).
Http://courses.cs.tamu.edu/rgutier/cpsc689_f08/l11.pdf
A survey of common algorithms for feature selection