Feature Engineering: Feature Selection and Feature Learning


Feature selection and feature learning

In practical machine learning tasks, selecting a set of representative features for building the model is very important. Feature selection typically chooses a subset of features that are strongly correlated with the class label but weakly correlated with each other; a concrete feature selection algorithm is embodied by defining an appropriate subset evaluation function.
In the real world, data are often complex, redundant, and variable, so useful features must be discovered from the raw data. Manually selected features depend on human effort and domain knowledge and do not generalize well. We therefore need machines to learn and extract features automatically, so that feature engineering can proceed more quickly and effectively.

Feature Selection

The goal of feature selection is to find an optimal subset of features. Feature selection removes irrelevant and redundant features, thereby reducing the number of features, improving model accuracy, and shortening running time. It also yields a simpler model built from the truly relevant features, which helps us understand the process that generated the data.
The general process of feature selection is as follows:

[Figure: the general process of feature selection]

It comprises four steps: (1) subset generation: a candidate feature subset is produced according to some search strategy; (2) subset evaluation: the quality of the candidate subset is measured by an evaluation function; (3) stopping condition: determines when the feature selection algorithm terminates; (4) subset validation: verifies the validity of the finally selected feature subset.

Search strategies for Feature selection

Search strategies for feature selection fall into three categories: complete (exhaustive) search, heuristic search, and random search.
Feature selection is essentially a combinatorial optimization problem, and the most direct way to solve a combinatorial optimization problem is search. In theory we could enumerate every possible feature combination and output the subset that optimizes the evaluation criterion, but the search space for n features contains 2^n subsets, so the cost of exhaustive search grows exponentially with the feature dimension. In practice we often face hundreds or even tens of thousands of features, so exhaustive search, though simple, is rarely usable. The alternatives are heuristic search and random search, which trade off computational efficiency against the quality of the selected subset; striking a good balance between the two is the goal of many feature selection algorithms.

    • Complete search
      • Breadth-first search
        Traverses the feature subspace breadth-first, enumerating all combinations. This is exhaustive search and is of little practical use.
      • Branch and bound
        Adds bounding on top of exhaustive search, e.g. pruning branches that cannot possibly beat the current best solution.
        Others include beam search and best-first search.
    • Heuristic search
      • Sequential forward selection (SFS)
        Starts from the empty set and adds the single best feature at each step (a minimal sketch of SFS follows this list).
      • Sequential backward selection (SBS)
        Starts from the full set and removes the least useful feature at each step.
      • Plus-L minus-R selection (LRS)
        Starts from the empty set, adding L features and then removing R at each round (L > R), or starts from the full set, removing R features and then adding L at each round (L < R).
        Others include bidirectional search (BDS) and sequential floating selection.
    • Random search
      • Random generation plus sequential selection (RGSS)
        Randomly generates a feature subset and then runs SFS and SBS starting from it.
      • Simulated annealing (SA)
        Accepts a solution worse than the current one with a certain probability, and this probability decreases gradually over time.
      • Genetic algorithms (GA)
        Produce the next generation of feature subsets through crossover and mutation; subsets with higher evaluation scores are more likely to be selected for reproduction.
        A common drawback of random search algorithms is their dependence on random factors, which makes experimental results hard to reproduce.
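To make the heuristic strategy concrete, here is a minimal sketch of sequential forward selection. It assumes scikit-learn is available and uses cross-validated accuracy of a logistic regression as the subset evaluation function; the dataset, estimator, and the cap of five kept features are illustrative choices, not part of the original text.

```python
# Sequential forward selection (SFS) sketch: start from the empty set and
# greedily add the feature that most improves cross-validated accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
estimator = LogisticRegression(max_iter=5000)

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
while remaining and len(selected) < 5:          # cap of 5 features is illustrative
    # Subset evaluation: score each candidate extension of the current subset.
    score, j = max((cross_val_score(estimator, X[:, selected + [j]], y, cv=3).mean(), j)
                   for j in remaining)
    if score <= best_score:                     # stopping condition: no improvement
        break
    best_score = score
    selected.append(j)
    remaining.remove(j)

print("selected features:", selected, "CV accuracy:", round(best_score, 3))
```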
Feature selection algorithms

Feature selection is closely tied to the machine learning algorithm that follows it. Based on how the subset evaluation criterion relates to the subsequent learning algorithm, feature selection methods can be divided into three types: embedded, filter, and wrapper.


Embedded Feature Selection
In embedded feature selection, the feature selection algorithm is built into the learning algorithm itself as a component. The most typical examples are decision tree algorithms such as ID3, C4.5, and CART: at every recursive step of growing the tree, the algorithm must choose a feature to split the sample set into smaller subsets, and the feature is usually chosen by the purity of the resulting child nodes; the purer the child nodes, the better the split. The process of growing a decision tree is thus itself a process of feature selection.
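A minimal sketch of embedded selection, assuming scikit-learn: the impurity-based feature importances that a decision tree computes while growing are reused directly to discard weak features. The dataset and the "mean importance" threshold are illustrative.

```python
# Embedded feature selection sketch: the tree's own split criterion ranks features,
# and SelectFromModel keeps those above the chosen threshold.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

selector = SelectFromModel(tree, threshold="mean", prefit=True)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)
```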

Filter Feature Selection
The evaluation criterion of filter feature selection is derived from the intrinsic properties of the data set itself and is independent of any specific learning algorithm, so it generalizes well. Filter methods typically select features or feature subsets that are highly correlated with the class label, on the assumption that stronger correlation with the label leads to higher classification accuracy. The evaluation criteria fall into four groups: distance measures, information measures, dependency (correlation) measures, and consistency measures.
The advantages and disadvantages of filter feature selection are:

Advantages: the method is highly general, skips the step of training a classifier, and has low algorithmic complexity, so it suits large-scale data sets and can quickly remove large numbers of irrelevant features; this makes it well suited as a pre-filter.
Disadvantages: because the evaluation criterion is independent of the specific learning algorithm, the selected feature subset usually achieves lower classification accuracy than one chosen by a wrapper method.
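A minimal filter-style sketch, assuming scikit-learn: each feature is scored against the label with mutual information (an information measure) and the top k are kept, with no classifier involved. The dataset, scorer, and k are illustrative.

```python
# Filter feature selection sketch: score features against the label, keep the top k.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)
print("kept feature indices:", selector.get_support(indices=True))
X_reduced = selector.transform(X)
```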

Wrapper feature selection
Wrapper feature selection evaluates a feature subset by the performance of a learning algorithm: for each candidate subset, a classifier is trained and the subset is judged by that classifier's performance. Many learning algorithms can be used for this evaluation, such as decision trees, neural networks, Bayesian classifiers, nearest-neighbor methods, and support vector machines.
The advantages and disadvantages of wrapper feature selection are:

Advantages: compared with filter methods, the feature subsets found by wrapper methods usually give better classification performance.
Disadvantages: wrapper methods generalize poorly; when the learning algorithm changes, feature selection must be redone for the new algorithm. Because evaluating each candidate subset requires training and testing a classifier, the computational complexity is very high, and on large-scale data sets the running time becomes very long.
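A minimal wrapper-style sketch, assuming scikit-learn: recursive feature elimination (RFE) repeatedly trains the chosen learner and drops the weakest features, so the learner itself drives the selection. The estimator and target subset size are illustrative.

```python
# Wrapper feature selection sketch: recursive feature elimination around a classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)
print("kept feature indices:", rfe.get_support(indices=True))
```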

Effectiveness Analysis

Feature effectiveness is analyzed by computing a weight for each feature. Depending on whether a model is involved, the methods fall into two groups:
1. Model-related feature weights: train a model on all the features and inspect each feature's weight in that model. Since a model must be trained, these weights are tied to the particular model used, and different models measure weights differently; in a linear model, for example, the weights are the feature coefficients.
2. Model-independent feature weights: these mainly analyze the relationship between a feature and the label, independent of any particular model.
Model-independent feature weighting methods include (1) cross entropy, (2) information gain, (3) odds ratio, (4) mutual information, and (5) KL divergence.
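A minimal sketch contrasting the two kinds of weights, assuming scikit-learn: model-related weights are read off the coefficients of a fitted linear model, while model-independent weights score each feature against the label with mutual information and never fit a model. The dataset and scorers are illustrative.

```python
# Feature-weight sketch: model-related vs. model-independent weights.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Model-related: coefficients of a fitted linear model (depend on the model used).
model_weights = np.abs(LogisticRegression(max_iter=5000).fit(X_std, y).coef_[0])

# Model-independent: mutual information between each feature and the label.
mi_weights = mutual_info_classif(X_std, y, random_state=0)

print("top features by model weight:", np.argsort(model_weights)[::-1][:5])
print("top features by mutual info :", np.argsort(mi_weights)[::-1][:5])
```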

Feature Learning

Feature learning can be divided into supervised and unsupervised feature learning.
Supervised feature learning includes supervised dictionary learning, neural networks, and multilayer perceptrons; unsupervised feature learning includes unsupervised dictionary learning, principal component analysis, independent component analysis, autoencoders, matrix factorization, and various clustering algorithms.


Supervised feature learning

Supervised dictionary learning
Dictionary learning learns from the input data a dictionary, i.e. a set of representative elements, such that each data sample can be represented as a weighted sum of those elements. The dictionary elements and the weights are determined by minimizing the average reconstruction error together with an L1 regularization term, which keeps the weights sparse.
Supervised dictionary learning exploits the structure implicit in both the input data and the labels to optimize the dictionary elements.

Neural networks
Neural networks describe a family of learning algorithms built from multiple layers of interconnected nodes. They are inspired by the nervous system: nodes can be thought of as neurons and edges as synapses. Each edge carries a weight, and the network defines computation rules that pass data from the input layer to the output layer.
Multilayer neural networks can be used for feature learning because their hidden layers learn representations of the input.
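A minimal sketch of feature learning with a multilayer perceptron, assuming scikit-learn: after the network is trained for classification, the activations of its single ReLU hidden layer are reused as a learned representation of the input. The dataset and layer size are illustrative.

```python
# Neural-network feature learning sketch: reuse hidden-layer activations as features.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

mlp = MLPClassifier(hidden_layer_sizes=(32,), activation="relu",
                    max_iter=500, random_state=0).fit(X_std, y)

# Forward pass up to the hidden layer only: ReLU(X W1 + b1).
hidden_features = np.maximum(0, X_std @ mlp.coefs_[0] + mlp.intercepts_[0])
print("learned representation shape:", hidden_features.shape)   # (n_samples, 32)
```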

Unsupervised feature learning

The goal of unsupervised feature learning is to capture the structure underlying high-dimensional data and to extract low-dimensional features from it.
K-means Clustering
K-means clustering is a vector quantization method: given a set of vectors, the K-means algorithm partitions them into K subsets so that each vector belongs to the subset whose mean is nearest to it.
In feature learning, K-means can cluster unlabeled input data and then use the centroids of the clusters to generate new features.
The simplest encoding adds K binary features to each input sample, where the j-th feature is 1 if and only if the j-th centroid is the nearest one to that sample. Another approach uses the distances to the clusters as features, possibly after transforming them with a radial basis function; a sketch of both encodings follows.
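A minimal sketch of the two encodings just described, assuming scikit-learn: a one-hot feature per centroid (1 only for the nearest centroid) and, alternatively, the vector of distances to all K centroids. The dataset and K are illustrative.

```python
# K-means feature learning sketch: nearest-centroid one-hot codes and centroid distances.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# Encoding 1: K binary features; feature j is 1 iff centroid j is nearest to the sample.
one_hot = np.eye(10)[kmeans.predict(X)]

# Encoding 2: distances to the K centroids (optionally passed through an RBF).
distances = kmeans.transform(X)
print(one_hot.shape, distances.shape)   # (n_samples, 10) each
```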

Principal component analysis (PCA)
Principal component analysis is mainly used for dimensionality reduction. Given an unlabeled data set, PCA produces p singular vectors (with p much smaller than the dimension of the data) corresponding to the p largest singular values of the data matrix. These p singular vectors are the feature vectors learned from the input data; they represent the directions along which the data has the largest variance.
PCA is a linear feature learning method, because each of the p singular vectors is a linear function of the data matrix.
PCA has several limitations. First, it assumes that the directions of largest variance are the most interesting, which in many applications may not be true. Second, PCA relies on an orthogonal transformation of the original data and exploits only the first- and second-order moments, so it may not characterize the data distribution well. Finally, PCA reduces dimension effectively only when the input feature vectors are correlated.
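A minimal PCA sketch, assuming scikit-learn: the data are projected onto the p directions of largest variance, with p chosen far smaller than the input dimension (p = 2 here purely for illustration).

```python
# PCA feature learning sketch: keep the p directions of largest variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA(n_components=2)
X_low = pca.fit_transform(X)            # learned low-dimensional features
print(X_low.shape, pca.explained_variance_ratio_)
```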

Locally linear embedding (LLE)
Locally linear embedding is a nonlinear unsupervised learning method that produces a low-dimensional, neighborhood-preserving representation from unlabeled high-dimensional input.
The general idea of LLE is to reconstruct the original high-dimensional data with low-dimensional data that preserves the local geometric properties of the original data set. LLE has two main steps. The first step, neighborhood preservation, reconstructs each input point x_i as a weighted combination of its K nearest neighbors and finds the optimal weights by minimizing the average squared reconstruction error. The second step, dimension reduction, searches the low-dimensional space for vectors that minimize the representation error under the weights found in the first step.
Compared with PCA, LLE is more powerful at exploiting the implicit structure of the data.
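A minimal LLE sketch, assuming scikit-learn: a nonlinear, neighborhood-preserving embedding into a low-dimensional space. The number of neighbors and the output dimension are illustrative.

```python
# LLE sketch: nonlinear, neighborhood-preserving dimensionality reduction.
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding

X, _ = load_digits(return_X_y=True)
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
X_low = lle.fit_transform(X)
print(X_low.shape)
```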

Independent component analysis (ICA)
Independent component analysis is a technique that learns a data representation as a weighted sum of statistically independent, non-Gaussian components. The non-Gaussianity requirement is imposed because the weights cannot be uniquely determined when all components follow a Gaussian distribution.
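A minimal ICA sketch, assuming scikit-learn's FastICA: the data are decomposed into statistically independent, non-Gaussian components. The number of components is illustrative.

```python
# ICA sketch: estimate independent non-Gaussian components of the data.
from sklearn.datasets import load_digits
from sklearn.decomposition import FastICA

X, _ = load_digits(return_X_y=True)
ica = FastICA(n_components=10, random_state=0, max_iter=1000)
S = ica.fit_transform(X)                # per-sample independent-component activations
print(S.shape)
```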

Unsupervised dictionary learning
Unlike supervised dictionary learning, unsupervised dictionary learning does not use the labels; it optimizes the dictionary elements using only the latent structure of the data. A typical example is sparse coding, which learns basis functions (dictionary elements) for data representation from unlabeled data. Sparse coding can learn an overcomplete dictionary, in which the number of dictionary elements far exceeds the dimension of the input data. K-SVD is an algorithm for learning such a dictionary for sparse representation from unlabeled data.
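A minimal sparse-coding sketch, assuming scikit-learn's MiniBatchDictionaryLearning rather than K-SVD (which scikit-learn does not provide): 100 dictionary atoms are learned for 64-dimensional inputs, i.e. an overcomplete dictionary, and the sparse codes serve as learned features. The atom count and sparsity penalty are illustrative.

```python
# Unsupervised dictionary learning / sparse coding sketch (overcomplete dictionary).
from sklearn.datasets import load_digits
from sklearn.decomposition import MiniBatchDictionaryLearning

X, _ = load_digits(return_X_y=True)      # 64-dimensional samples
dico = MiniBatchDictionaryLearning(n_components=100, alpha=1.0, random_state=0)
codes = dico.fit_transform(X)            # sparse codes used as learned features
print(dico.components_.shape, codes.shape)   # (100, 64) dictionary, (n_samples, 100) codes
```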

Deep learning

The layered organization of the nervous system inspires deep learning architectures for feature learning, built by stacking simple learning modules.
The output of each intermediate layer of a deep learning system can be viewed as a representation of the original input data: each layer takes the representation produced by the layer below as input and produces a new representation as output for the layer above. The input at the bottom is the raw data, and the top layer outputs the final low-dimensional feature or representation.

Restricted Boltzmann machines (RBM)
Restricted Boltzmann machines are often used to build multilayer learning architectures. An RBM can be represented as an undirected bipartite graph containing a set of binary hidden variables, a set of visible variables, and edges connecting hidden nodes to visible nodes; it is a special case of the general Boltzmann machine in which there are no connections within a layer. Each edge of an RBM carries a weight, and together these weights define an energy function over the joint distribution of the visible and hidden nodes. Because of the bipartite topology of the RBM, the hidden variables are conditionally independent given the visible variables (and vice versa), which makes computation convenient.
An RBM can be viewed as a single layer of unsupervised feature learning, with the visible variables corresponding to the input data and the hidden variables corresponding to feature detectors. The weights are trained with the contrastive divergence algorithm by maximizing the probability of the visible variables.
In general, training an RBM in this way yields a non-sparse representation. Sparse RBM, a modified version, adds a regularization term to the data-likelihood objective that penalizes deviations of the expected hidden-variable activations from a small constant.
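A minimal RBM sketch, assuming scikit-learn's BernoulliRBM (which trains with a persistent variant of contrastive divergence): the inputs, scaled to [0, 1], act as the visible variables, and the hidden-unit activations serve as the learned feature detectors. The layer size and learning rate are illustrative.

```python
# RBM feature learning sketch: hidden-unit activations as learned features.
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import minmax_scale

X, _ = load_digits(return_X_y=True)
X01 = minmax_scale(X)                    # BernoulliRBM expects values in [0, 1]

rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
H = rbm.fit_transform(X01)               # hidden-unit activations
print(H.shape)
```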

Autoencoder
An autoencoder consists of an encoder and a decoder. The encoder takes the raw data as input and produces a feature or representation; the decoder takes the features extracted by the encoder as input and reconstructs the original input data as its output. Encoder and decoder are often composed of stacked RBMs. The parameters of the structure are trained layer by layer in a greedy fashion: after one layer of feature detectors has been learned, its outputs are fed to the layer above as visible variables for training that layer's RBM, and this is repeated until the stopping condition is met.
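The text describes encoders and decoders built from stacked RBMs; as a simpler stand-in, here is a minimal single-hidden-layer autoencoder sketched with scikit-learn's MLPRegressor, trained to reconstruct its own input, with the hidden activations used as the learned code. The layer size and dataset are illustrative.

```python
# Autoencoder sketch: a network trained to reproduce its input; the hidden layer is the code.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import minmax_scale

X, _ = load_digits(return_X_y=True)
X01 = minmax_scale(X)

ae = MLPRegressor(hidden_layer_sizes=(16,), activation="relu",
                  max_iter=2000, random_state=0).fit(X01, X01)   # target = input

# Encoder part: forward pass to the hidden layer, ReLU(X W1 + b1).
codes = np.maximum(0, X01 @ ae.coefs_[0] + ae.intercepts_[0])
print("code shape:", codes.shape, "reconstruction R^2:", round(ae.score(X01, X01), 3))
```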

Please credit the author, Jason Ding, and the original source when reprinting.
GitHub blog home page (http://jasonding1354.github.io/)
GitCafe blog home page (http://jasonding1354.gitcafe.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu home page (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search "jasonding1354" on Google to reach my blog home page



Text by jasonding (Jianshu author)
Original link: http://www.jianshu.com/p/ab697790090f
Copyright belongs to the author. Please contact the author for authorization to reprint, and credit the "Jianshu author".
