Features in machine learning


(Overview mind map not reproduced here.) Corrections are welcome where anything below is wrong.

In machine learning, features are critical. Working with them involves feature extraction and feature selection. Both are ways of reducing dimensionality, but they differ:

Feature extraction: creating a set of new features from combinations of the existing features. In other words, each feature obtained by feature extraction is a mapping of the original features.

Feature selection: choosing a subset of all the features (the more informative ones). In other words, the result of feature selection is a subset of the original features.

Feature Extraction

In the best case there are domain experts who know which features to extract, but in the absence of such knowledge the following general dimensionality reduction methods can be useful (from Wikipedia):

    • Principal Component Analysis
    • Semidefinite Embedding
    • Multifactor Dimensionality Reduction
    • Multilinear Subspace Learning
    • Nonlinear Dimensionality Reduction
    • Isomap
    • Kernel PCA
    • Multilinear PCA
    • Latent Semantic Analysis
    • Partial Least Squares
    • Independent Component Analysis
    • Autoencoder

Feature extraction has two broad goals:

(1) Signal representation: the goal of the feature extraction mapping is to represent the samples accurately in a low-dimensional space. In other words, the extracted features should preserve the sample information accurately, so that the loss of information is very small. The typical method is PCA.

(2) Signal classification: the goal of the feature extraction mapping is to enhance the class-discriminatory information in a low-dimensional space. In other words, the extracted features should yield high classification accuracy, not lower than the accuracy obtained with the original features. In the linear case, the typical method is LDA.
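As a hedged illustration of these two goals (not part of the original text), the sketch below projects the same dataset with PCA and with LDA using scikit-learn; the dataset and parameter choices are only examples.

```python
# Sketch: PCA keeps the directions of largest variance (signal representation),
# while LDA keeps the directions that best separate the classes (signal classification).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

pca = PCA(n_components=2)                          # unsupervised: ignores the labels
X_pca = pca.fit_transform(X)

lda = LinearDiscriminantAnalysis(n_components=2)   # supervised: uses the labels
X_lda = lda.fit_transform(X, y)

print("PCA explained variance ratio:", pca.explained_variance_ratio_)
print("projected shapes:", X_pca.shape, X_lda.shape)
```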

Feature extraction has a wide range of applications in image processing.

Feature Selection

The main process of feature selection consists of four steps: (1) subset generation, (2) evaluation function, (3) stopping criterion, (4) validation.

(1) Subset generation

2.2.1 Full Search

The full search is divided into two categories: exhaustive search (exhaustive) and non-exhaustive search (non-exhaustive).

(1) Breadth-First Search

Algorithm Description: Breadth-first traversal of the feature subset space.

Algorithm evaluation: It enumerates all feature combinations, so it is an exhaustive search with time complexity O(2^n); its practicality is low.
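A minimal sketch of exhaustive search over all feature subsets, assuming a caller-supplied scoring function `evaluate` (the name is illustrative); the O(2^n) cost makes it feasible only for very small n.

```python
from itertools import combinations

def exhaustive_search(features, evaluate):
    """Enumerate every non-empty feature subset and return the best one
    according to `evaluate` (higher is better). Time complexity is O(2^n)."""
    best_subset, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```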

(2) Branch and Bound Search

Algorithm Description: Adds bounding to exhaustive search: if a branch cannot possibly lead to a better solution than the best one found so far, it is pruned.

(3) Beam Search

Algorithm Description: First select the N highest-scoring features as the initial feature subsets and add them to a priority queue of bounded length. At each step, remove the highest-scoring subset from the queue, enumerate all subsets obtained by adding one more feature to it, and add these subsets back to the queue.

(4) Best First Search

Algorithm Description: Similar to beam search; the only difference is that the length of the priority queue is not limited.

2.2.2 Heuristic Search

(1) Sequential Forward Selection (SFS)

Algorithm Description: The feature subset X starts as the empty set; at each step, the single feature x whose addition makes the evaluation function J(X) optimal is added to X. In short, it is a simple greedy algorithm that each time picks the feature that brings the evaluation function to its best value.

Algorithm evaluation: The disadvantage is that features can only be added, never removed. For example, suppose feature A is completely determined by features B and C, so A can be considered redundant once B and C are included. If sequential forward selection first adds A and then adds B and C, the resulting feature subset contains the redundant feature A.
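A minimal sketch of sequential forward selection, assuming an evaluation function J that scores a feature subset (higher is better) and an illustrative stopping rule on the subset size:

```python
def sequential_forward_selection(features, J, k):
    """Greedy SFS: start from the empty subset X and repeatedly add the single
    feature x that maximizes J(X + [x]), until |X| reaches k. Once added,
    a feature is never removed (the weakness noted above)."""
    X = []
    remaining = list(features)
    while remaining and len(X) < k:
        best_x = max(remaining, key=lambda x: J(X + [x]))
        X.append(best_x)
        remaining.remove(best_x)
    return X
```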

(2) Sequential Backward Selection (SBS)

Algorithm Description: Starting from the full feature set O, at each step the feature x whose removal makes the evaluation function optimal is removed from O.

Algorithm evaluation: Sequential backward selection is the opposite of sequential forward selection; its disadvantage is that features can only be removed, never added.

In addition, both SFS and SBS are greedy algorithms and easily get stuck in local optima.

(3) Bidirectional Search (BDS)

Algorithm Description: Use sequential forward selection (SFS) to search from the empty set and sequential backward selection (SBS) to search from the full set; stop when both searches reach the same feature subset C.

The motivation for bidirectional search is illustrated in Figure 2: point O represents the start of the search and point A the search target. The grey circle represents the possible search range of a one-way search, while the two green circles indicate the search range of a bidirectional search; it is easy to see that the green area must be smaller than the grey one.

Figure 2. Bidirectional search

(4) Plus-L Minus-R Selection (LRS)

There are two forms of the algorithm:

<1> The algorithm starts from the empty set; each round first adds L features and then removes R features, such that the evaluation function value is optimal. (L > R)

<2> The algorithm starts from the full set; each round first removes R features and then adds L features, such that the evaluation function value is optimal. (L < R)

Algorithm Evaluation: The LRS algorithm combines sequential forward selection and sequential backward selection; the choice of L and R is the key to the algorithm.

(5) Sequential Floating Selection

Algorithm Description: Sequential floating selection is developed from the LRS algorithm; the difference is that L and R are not fixed but "floating", i.e., they change as the search proceeds.

Depending on the search direction, sequential floating selection has the following two variants.

<1> Sequential Floating Forward Selection (SFFS)

Algorithm Description: Starting from the empty set, each round first selects a subset x from the unselected features such that adding x makes the evaluation function optimal, and then selects a subset z from the selected features such that removing z makes the evaluation function optimal.

<2> Sequential Floating Backward Selection (SFBS)

Algorithm Description: Similar to SFFS, except that SFBS starts from the full set; each round first removes features and then adds features.

Algorithm evaluation: Sequential floating selection combines the characteristics of sequential forward selection, sequential backward selection, and LRS, and compensates for their shortcomings.

(6) Decision Tree Method (DTM)

Algorithm Description: Run C4.5 or another decision tree generation algorithm on the training sample set and, once the decision tree is fully grown, run a pruning algorithm on the tree. The features appearing on the branches of the final decision tree form the selected feature subset. The decision tree method generally uses information gain as the evaluation function.
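A hedged sketch of the idea using scikit-learn's CART trees in place of C4.5 (scikit-learn does not implement C4.5; cost-complexity pruning via ccp_alpha stands in for the pruning step). The selected subset is the set of features that actually appear in the pruned tree.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow a tree with entropy (information gain) as the split criterion, then prune it.
tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01, random_state=0)
tree.fit(X, y)

# Features used in at least one split form the selected feature subset.
selected = np.where(tree.feature_importances_ > 0)[0]
print("selected feature indices:", selected)
```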

2.2.3 Stochastic algorithm

(1) Random Generation plus Sequential Selection (RGSS)

Algorithm Description: Randomly generate a feature subset and then run the SFS and SBS algorithms on it.

Algorithm evaluation: Can be used as a supplement to SFS and SBS to escape local optima.

(2) Simulated Annealing (SA)

For the simulated annealing algorithm itself, see the separate plain-language introduction to simulated annealing.

Algorithm evaluation: Simulated annealing to some extent overcomes the tendency of sequential search algorithms to get stuck in local optima, but if the region around the optimal solution is very small (the so-called "golf hole" landscape), simulated annealing has difficulty finding it.

(3) Genetic Algorithms (GA)

For genetic algorithms themselves, see the separate introduction to genetic algorithms.

Algorithm Description: First a batch of feature subsets is generated randomly and scored by the evaluation function; then the next generation of feature subsets is produced through crossover, mutation, and similar operations, with higher-scoring subsets more likely to be selected for reproduction. After N generations of reproduction and survival of the fittest, the population may contain the feature subset with the highest evaluation function value.

A common disadvantage of stochastic algorithms is their dependence on random factors, which makes experimental results hard to reproduce.

(2) Evaluation function

(1) Correlation ------------ Filter

Using correlation to measure a feature subset is based on the assumption that a good feature subset should be highly correlated with the class label, while the correlation among the features themselves should be low.

The linear correlation coefficient can be used to measure the linear correlation between vectors.
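A minimal sketch of correlation-based filter scoring, assuming numeric feature columns and a numeric class label; each feature is scored by the absolute Pearson correlation with the label:

```python
import numpy as np

def correlation_scores(X, y):
    """Score each column of X by |Pearson correlation| with the target y;
    higher scores suggest features more relevant to the class label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Usage (illustrative): keep the indices of the two highest-scoring features.
# top2 = np.argsort(-correlation_scores(X, y))[:2]
```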

(2) Distance Metrics ------------ Filter

Feature selection based on a distance measure assumes that a good feature subset should make the distances between samples of the same class as small as possible and the distances between samples of different classes as large as possible.

Common distance measures (similarity measures) include Euclidean distance, normalized Euclidean distance, Mahalanobis distance, and so on.

(3) Information Gain ------------ Filter

Assume the discrete variable Y takes values in {y_1, y_2, ..., y_m}, and the probability that y_i appears is p_i. The information entropy of Y is defined as:

H(Y) = -∑_{i=1}^{m} p_i · log2(p_i)

Information entropy has the following property: if the distribution of Y is "purer", the entropy is smaller; if the distribution of Y is more "disordered", the entropy is larger. In the extreme cases: if Y can take only one value, i.e., p_1 = 1, then H(Y) attains its minimum value 0; conversely, if all values are equally likely, each with probability 1/m, then H(Y) attains its maximum value log2(m).

Conditioning on another variable X, the conditional entropy of Y given X is:

H(Y|X) = ∑_i P(X = x_i) · H(Y|X = x_i), where H(Y|X = x_i) = -∑_j P(y_j|x_i) · log2 P(y_j|x_i)

The information gain of Y due to conditioning on X is defined as:

IG(Y|X) = H(Y) - H(Y|X)

Similarly, the information entropy H(C) of the class label C can be expressed as:

H(C) = -∑_i P(c_i) · log2 P(c_i)

The conditional entropy H(C|F_j) of the class label C given a feature F_j is:

H(C|F_j) = ∑_k P(F_j = f_k) · H(C|F_j = f_k)

The change in the entropy of C before and after selecting feature F_j is called the information gain of C, with the formula:

IG(C|F_j) = H(C) - H(C|F_j)

Suppose there are feature subsets A and B and the class variable is C. If IG(C|A) > IG(C|B), the classification result obtained by selecting feature subset A is considered better than that of B, so subset A is preferred.
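A minimal sketch of the entropy and information-gain formulas above for a single discrete feature; the function names are illustrative:

```python
import numpy as np

def entropy(labels):
    """H(C) = -sum_i p_i * log2(p_i) over the empirical label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature_values, labels):
    """IG(C|F) = H(C) - H(C|F), where H(C|F) is the average entropy of the
    labels within each value of the discrete feature F."""
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)
    h_c_given_f = sum(
        np.mean(feature_values == v) * entropy(labels[feature_values == v])
        for v in np.unique(feature_values)
    )
    return entropy(labels) - h_c_given_f
```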

(4) Consistency ------------- Filter

If sample 1 and sample 2 belong to different classes but have exactly the same values on features A and B, then the feature subset {A, B} should not be selected as the final feature set.

(5) Classifier Error Rate --------------- Wrapper

A specific classifier is used to classify the sample set with the given feature subset, and the classification accuracy is used to measure the quality of the feature subset.

Of the above five measures, correlation, distance, information gain, and consistency are filter measures, while the classifier error rate is a wrapper measure.

(3) Stopping criterion (4) Validation process

Feature selection methods fall into three main categories (from Wikipedia):

Filter Method

Idea: independent of the model. Each feature is scored with some feature metric (i.e., each feature is given a score indicating its importance), the features are ranked, and those with low scores are removed.

Main methods:

1. Chi-squared test (see the sketch after this list)

2. Information gain or information gain ratio

3. Correlation coefficient scores
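As a concrete filter example (a hedged sketch, with dataset and k chosen only for illustration), scikit-learn's SelectKBest can score features with the chi-squared test and keep the top k:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)   # chi2 requires non-negative feature values

# Score every feature with the chi-squared statistic and keep the two best.
selector = SelectKBest(score_func=chi2, k=2)
X_reduced = selector.fit_transform(X, y)

print("chi2 scores:", selector.scores_)
print("kept feature indices:", selector.get_support(indices=True))
```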

Advantages: computationally efficient and robust against overfitting.

Cons: tends to select redundant features because correlations between features are not considered; a particular feature may have poor classification ability on its own yet give good results when combined with certain other features.

Wrapper Method

Idea: treat the learner as a black box, evaluate feature sets by their predictive performance, and optimize the selection iteratively.

Feature sets are evaluated with the target learning algorithm.

If there are p features, there are 2^p possible feature combinations, each corresponding to a model. The idea of the wrapper method is to enumerate all possible cases and choose the best combination of features.

The problem with this approach is that every feature combination requires training a model, and training a model is expensive; if p is large, the approach is clearly infeasible. Two optimized methods are described below: forward search and backward search.

Forward search initially assumes that the set of selected features is empty; the algorithm greedily expands the set until its size reaches a threshold, which can be preset or obtained by cross-validation. A sketch of the algorithm is given below:
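A minimal sketch of forward search, assuming a scoring function `evaluate` (for example, cross-validated accuracy of the target learner on the given feature subset); the function name and stopping logic are illustrative:

```python
def forward_search(all_features, evaluate, max_features):
    """Wrapper-style forward search: start from the empty set F and greedily add
    the feature that most improves evaluate(F), stopping when |F| reaches
    max_features or no candidate improves the score."""
    F, best_score = [], float("-inf")
    candidates = set(all_features)
    while candidates and len(F) < max_features:
        score, feature = max(
            ((evaluate(F + [f]), f) for f in candidates), key=lambda t: t[0]
        )
        if score <= best_score:
            break  # no remaining feature improves the current subset
        F.append(feature)
        candidates.remove(feature)
        best_score = score
    return F, best_score
```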

The outer loop of the algorithm ends when F contains all features or the number of features in F reaches the threshold; the algorithm finally returns the optimal feature set found during the entire search.

Backward search initially assumes that the selected feature set F is the full set of features; the algorithm deletes one feature at a time until the size of F reaches the specified threshold or F becomes empty. When choosing which feature to delete, the criterion is the same as the one forward search uses when choosing a feature to add to F.

Subset selection is thus treated as a search optimization problem: generate different combinations, evaluate them, and compare them with one another.

Main methods: the recursive feature elimination (RFE) algorithm. Many optimization algorithms can be applied here, especially heuristic ones such as GA, PSO, DE, and ABC; see the separate introductions to the artificial bee colony algorithm (ABC) and particle swarm optimization (PSO).
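A hedged sketch of recursive feature elimination with scikit-learn; the estimator, synthetic dataset, and number of selected features are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# RFE repeatedly fits the estimator and drops the weakest features until 5 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)

print("selected feature indices:", rfe.get_support(indices=True))
print("feature ranking (1 = selected):", rfe.ranking_)
```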

Advantages: takes the correlation between features into account.

Cons: 1. When the observed data are limited, it is easy to overfit; 2. When the number of features is large, the computation time grows.

Embedded Method (Compromise)

Idea: combines the advantages of the filter and wrapper methods (lower time complexity while still considering feature combinations) by performing feature selection as part of model training.

Main methods: regularization. See the separate introduction to ridge regression, which adds a regularization term to basic linear regression. L1 regularization performs feature selection: it tends to keep relevant features and discard irrelevant ones by driving their weights to zero. For example, in text classification we no longer need an explicit feature selection step; instead, all features are fed into a model with L1 regularization, and feature selection is carried out during training.
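A hedged sketch of embedded selection via L1 regularization, using L1-penalized logistic regression in scikit-learn; the synthetic dataset and regularization strength C are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)

# The L1 penalty drives the weights of uninformative features to exactly zero,
# so feature selection happens during training itself.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

selected = np.where(clf.coef_[0] != 0)[0]
print("non-zero (selected) feature indices:", selected)
```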

Pros: combines the advantages of the previous two methods.

Cons: you must know in advance what constitutes a good choice.

