Introduction
In earlier study of machine learning techniques, little attention was paid to feature engineering (Feature Engineering). However, learning only the algorithm workflow does not mean you can actually apply the algorithms: on practical problems, people often do not know how to extract features to model.
Features are the raw material of a machine learning system, and their influence on the final model is beyond doubt.
The importance of feature engineering
Data features directly affect the predictive models you can use and the results you can achieve. The better the features you prepare and select, the better the results.
The factors that influence prediction results are model selection, the available data, and feature extraction.
High-quality features often describe the inherent structure of the data.
Most models can learn well from data with good structure, so even a model that is not optimal can achieve good results with high-quality features. The flexibility that good features provide lets you use simpler models that run faster and are easier to understand and maintain.
High-quality features can produce good predictions even without optimal model parameters, so you do not have to struggle to find the best model and the best parameters.
Definition of feature engineering
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive model, thereby improving accuracy on unseen data. It means generating, extracting, pruning, or combining features for the problem at hand, using domain knowledge or automated methods.
The following figure gives an overview of feature engineering:
Sub-problems of feature engineering
1. Features in machine learning (Feature)
In machine learning and pattern recognition, a feature is an individual, measurable property of an observed phenomenon. Choosing informative, discriminating, and independent features is a key step in pattern recognition, classification, and regression problems.
The initial set of raw features may be too large or contain redundant information, so an early step in many machine learning applications is to select a subset of features or to construct a new, reduced set of features. This makes learning easier and improves generalization and interpretability.
In tabular data, an observation or instance (a row of the table) is made up of different variables or attributes (the columns of the table), and an attribute can serve as a feature. Unlike the term attribute, however, a feature is an attribute that is useful and meaningful for analyzing and solving the problem.
In machine vision, an image is an observation, while a feature may be a line in the image. In natural language processing, a text is an observation, while a paragraph or a word frequency may be a feature. In speech recognition, an utterance is an observation, while a feature may be a single word or phoneme.
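As a small illustration of the text case, the sketch below turns one document (the observation) into word-frequency features. The function name, the vocabulary, and the sample sentence are all made up for this example, and splitting on whitespace is a deliberate simplification.

```python
from collections import Counter

def word_frequency_features(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    # Lowercase and split on whitespace; real tokenization is more involved.
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["feature", "model", "data"]
document = "Good data beats a clever model and good feature design beats raw data"

features = word_frequency_features(document, vocabulary)
print(features)  # [1, 1, 2]
```

Each document becomes a fixed-length numeric vector, which is the form most learning algorithms expect.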
2. The importance of features (Feature importance)
You can evaluate the usefulness of a feature objectively. Estimating feature importance is a preliminary step for feature selection: each feature is assigned a score according to its importance, the features are ranked by score, and those with the highest scores are selected for inclusion in the training dataset.
A feature may be important if it is highly correlated with the dependent variable (the thing being predicted). Correlation coefficients and other univariate methods, which consider each attribute independently, are common choices.
Some more complex predictive models evaluate and select feature importance internally while the model is being built, such as Multivariate Adaptive Regression Splines (MARS), random forests (Random Forest), and gradient boosted machines (Gradient Boosted Machines). These models determine variable importance during the model preparation phase.
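The score-and-rank idea can be sketched with a plain Pearson correlation coefficient. This is only one possible scoring method, and the column names and numbers below are invented for illustration; the absolute value is used so that strong negative correlations also rank highly.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_features(columns, target):
    """Score each column by |correlation with target|, highest first."""
    scores = {name: abs(pearson(values, target))
              for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical data: the target is exactly 3 * size, age is mostly noise.
columns = {
    "size": [50, 60, 80, 100, 120],
    "age": [30, 5, 20, 10, 25],
    "constant": [1, 1, 1, 1, 1],
}
target = [150, 180, 240, 300, 360]

ranking = rank_features(columns, target)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A univariate score like this ignores interactions between features, which is one reason model-based importance measures are also used.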
3. Feature extraction (Feature Extraction)
Some observations are far too voluminous in their raw state to be modeled directly. Image, audio, and text data, for example, would contain thousands of attributes if viewed as tabular data.
Feature extraction is the process of automatically reducing the dimensionality of such observations to a feature set small enough to be modeled.
For tabular data, projection methods such as Principal Component Analysis and clustering methods can be used. For image data, lines or edges can be extracted. Depending on the domain, image, video, and audio data can be processed with many of the same digital signal processing methods.
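To make the projection idea concrete, here is a minimal sketch of Principal Component Analysis restricted to the first component, written in plain Python. It assumes power iteration as the way to find the leading eigenvector, which is a simplification and not necessarily how a library would do it; all function names and the toy data are hypothetical.

```python
import math

def center(rows):
    """Subtract the column mean from every value."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]

def first_principal_component(cov, iterations=100):
    """Approximate the leading eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # assumed not orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance along v: nothing to extract
            break
        v = [x / norm for x in w]
    return v

# Toy data: two columns, the second is always twice the first.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0], [5.0, 10.0]]

centered = center(rows)
direction = first_principal_component(covariance_matrix(centered))
# Each two-column row is reduced to a single extracted feature.
scores = [sum(a * b for a, b in zip(row, direction)) for row in centered]
print(direction, scores)
```

Because the two columns are perfectly correlated here, one extracted feature keeps all of the variance; on real data some information is lost and the number of components becomes a tuning decision.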
4. Feature Selection (Feature Selection)
Different features affect model accuracy differently. Some features are irrelevant to the problem being solved and some carry redundant information; such features should be removed.
Feature selection is the process of automatically selecting the subset of features that are most important to the problem.
Feature selection algorithms may use a scoring method to rank features. Other methods search for feature subsets by trial and error, automatically creating and evaluating models to find the objectively most predictive subset. Still others perform feature selection as a side effect of the model: stepwise regression (Stepwise Regression), for example, is an algorithm that selects features automatically while the model is being built.
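The trial-and-error search over feature subsets can be sketched as follows. This is an exhaustive search with a leave-one-out 1-nearest-neighbour classifier as the evaluation model; both choices are assumptions made for the example (this is not stepwise regression), and the data is invented. Exhaustive search is only feasible for a handful of features.

```python
from itertools import combinations

def loo_accuracy(rows, labels, feature_idx):
    """Leave-one-out accuracy of 1-nearest-neighbour on the chosen columns."""
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # never compare a row with itself
            dist = sum((row[k] - other[k]) ** 2 for k in feature_idx)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(rows)

def best_subset(rows, labels, max_size):
    """Try every subset up to max_size; smaller subsets win ties."""
    best = (0.0, ())
    for size in range(1, max_size + 1):
        for subset in combinations(range(len(rows[0])), size):
            score = loo_accuracy(rows, labels, subset)
            if score > best[0]:
                best = (score, subset)
    return best

# Hypothetical data: column 0 is informative, column 1 is large-scale
# noise, and column 2 is a redundant copy of column 0.
rows = [
    [1.0, 90.0, 1.0], [1.2, 10.0, 1.2], [0.9, 50.0, 0.9],
    [3.0, 85.0, 3.0], [3.2, 15.0, 3.2], [2.9, 55.0, 2.9],
]
labels = [0, 0, 0, 1, 1, 1]

score, subset = best_subset(rows, labels, max_size=3)
print(score, subset)  # 1.0 (0,)
```

The search keeps only column 0: the noisy column hurts accuracy and the redundant copy adds nothing, which mirrors the two kinds of features this section says should be removed.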
5. Feature construction (Feature Construction)
Feature importance and feature selection tell you the objective utility of features, but the features themselves have to come from somewhere: you need to construct them manually.
Feature construction requires spending a lot of time with actual sample data, thinking about the structure of the data and about how to present the features to the predictive algorithm.
For tabular data, feature construction means mixing or combining features to create new ones, or decomposing or splitting features to create new ones. For textual data, it often means designing text indicators specific to the problem. For image data, it often means designing automatic filters that pick out the relevant structures.
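For the tabular case, the sketch below shows both directions: decomposing one attribute (a timestamp) into several features and combining two attributes (price and area) into one. The record layout and field names are hypothetical.

```python
from datetime import datetime

def construct_features(record):
    """Build new features by decomposing and combining raw attributes."""
    ts = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M")
    return {
        # Decomposition: one timestamp becomes two simpler features.
        "hour": ts.hour,
        "is_weekend": ts.weekday() >= 5,  # Monday is 0, Saturday is 5
        # Combination: two attributes become one ratio feature.
        "price_per_sqm": record["price"] / record["area_sqm"],
    }

# Hypothetical raw record, for illustration only.
record = {"timestamp": "2015-03-14 21:30", "price": 300000.0, "area_sqm": 75.0}

constructed = construct_features(record)
print(constructed)
```

Which decompositions and combinations are worth building depends on the problem; these three are just common patterns.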
6. Feature Learning (Feature learning)
Feature learning is the automatic identification and use of features in raw data.
Modern deep learning methods have had many successes in feature learning, for example autoencoders and restricted Boltzmann machines. They learn abstract (compressed) feature representations automatically, in an unsupervised or semi-supervised way, and the results support state-of-the-art performance in fields such as speech recognition, image classification, and object recognition.
Abstract feature representations can be obtained automatically, but you cannot easily understand what has been learned; you can only use the features as a black box. Nor can you easily work out how to create features that are similar to, or different from, those that perform well. This skill is very difficult, but it is also very attractive and very important.
Process of feature engineering
The data transformation process in machine learning:
- Select data: collect and consolidate data, and organize it into a single dataset
- Preprocess data: clean, format, and sample the data
- Transform data: this is where feature engineering happens
- Model data: build, evaluate, and tune models
The iterative process of feature engineering:
- Brainstorm features: analyze the problem in depth, examine the data, and study feature engineering approaches used on other problems that can be applied to your own
- Design features: extract features automatically, construct them manually, or combine the two
- Select features: use different feature importance scoring methods or feature selection methods
- Evaluate models: use the selected features to make predictions on test data and evaluate model accuracy
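One pass through this loop can be sketched as follows: two candidate feature designs are each evaluated on held-out test data, and the better one is kept. The nearest-centroid classifier, the function names, and the width/height data are all assumptions made for the example; any model and metric could take their place.

```python
def fit_centroids(rows, labels):
    """Average the feature vectors of each class (nearest-centroid model)."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    return {label: [sum(col) / len(members) for col in zip(*members)]
            for label, members in grouped.items()}

def predict(centroids, row):
    """Return the label whose centroid is closest to row."""
    return min(centroids,
               key=lambda label: sum((a - b) ** 2
                                     for a, b in zip(row, centroids[label])))

def holdout_accuracy(train, test, featurize):
    """Fit on the training set, then score predictions on the test set."""
    centroids = fit_centroids([featurize(w, h) for w, h, _ in train],
                              [label for _, _, label in train])
    hits = sum(predict(centroids, featurize(w, h)) == label
               for w, h, label in test)
    return hits / len(test)

# Hypothetical data: (width, height, label); "wide" means width/height > 1.5.
train = [(2, 1, "wide"), (4, 2, "wide"), (9, 4, "wide"), (20, 10, "wide"),
         (1, 1, "regular"), (3, 3, "regular"), (10, 9, "regular"),
         (20, 18, "regular")]
test = [(6, 3, "wide"), (40, 20, "wide"), (2, 2, "regular"),
        (30, 28, "regular")]

# Two candidate designs: the raw attributes and a constructed ratio.
candidates = {
    "raw": lambda w, h: [w, h],
    "ratio": lambda w, h: [w / h],
}
results = {name: holdout_accuracy(train, test, featurize)
           for name, featurize in candidates.items()}
print(results)  # {'raw': 0.5, 'ratio': 1.0}
```

On this toy data the constructed ratio lets a very simple model classify every test item correctly, while the raw attributes do not. In practice the loop repeats with new candidate features until the evaluation stops improving.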
Resources
Wikipedia: Feature learning
Review of data cleansing and feature processing in machine learning
Discover Feature Engineering, How to Engineer Features and How to Get Good at It
On feature engineering in recommender systems
When reprinting, please credit the author, Jason Ding, and cite the source.
GitHub blog home page (http://jasonding1354.github.io/)
GitCafe blog home page (http://jasonding1354.gitcafe.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu home page (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search Google for jasonding1354 to find my blog home page