1 What is feature engineering?

There is a saying widely circulated in the industry: data and features determine the upper limit of machine learning, while models and algorithms only approximate that limit. So what exactly is feature engineering? As the name implies, it is essentially an engineering activity whose goal is to extract as much useful information as possible from raw data, in the form of features, for use by algorithms and models. Summarizing, feature engineering can be broken down into several aspects, described below.
Feature processing is the core of feature engineering. sklearn provides a fairly complete set of feature-processing methods, including data preprocessing, feature selection, dimensionality reduction, and so on. Newcomers to sklearn are often first attracted by its rich and convenient library of algorithms and models, but the feature-processing facilities presented here are just as powerful.
This article uses the iris dataset from sklearn to illustrate feature processing. The iris dataset, compiled by Fisher in 1936, contains 4 features: Sepal.Length (sepal length), Sepal.Width (sepal width), Petal.Length (petal length), and Petal.Width (petal width); the feature values are positive floating-point numbers in centimeters. The target value is the iris species: Iris setosa, Iris versicolour, or Iris virginica. The code for importing the iris dataset is as follows:

    from sklearn.datasets import load_iris

    # Import the iris dataset
    iris = load_iris()
    # Feature matrix
    iris.data
    # Target vector
    iris.target
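As a quick sanity check, the shapes below are properties of the iris dataset (the print statements are just illustrative):

    # 150 samples x 4 features
    print(iris.data.shape)    # (150, 4)
    # One class label (0, 1, or 2) per sample
    print(iris.target.shape)  # (150,)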
2 Data preprocessing

Feature extraction yields unprocessed features, which may have the following problems:
1. Features are not on the same scale: that is, the features have different specifications and cannot be compared directly. Dimensionless processing solves this problem.
2. Information redundancy: for some quantitative features, the useful information lies in interval divisions. For example, for academic performance, if we only care about "pass" versus "fail", the quantitative exam scores need to be converted to "1" (pass) and "0" (fail). Binarization solves this problem.
3. Qualitative features cannot be used directly: some machine learning algorithms and models only accept quantitative features as input, so qualitative features must be converted to quantitative ones. The simplest way is to assign a quantitative value to each qualitative value, but this approach is too arbitrary and increases the parameter-tuning workload. Qualitative features are usually converted using dummy coding instead: assuming the feature has N distinct qualitative values, it is expanded into N features; when the original value is the i-th qualitative value, the i-th expanded feature is set to 1 and the others are set to 0. Dummy coding requires no extra tuning work, and for a linear model, dummy-coded features can achieve a nonlinear effect.
4. Missing values: missing values need to be filled in.
5. Low information utilization: different machine learning algorithms and models make different use of the information in the data. As mentioned above, for linear models, dummy coding of qualitative features can achieve a nonlinear effect. Similarly, polynomial expansion of quantitative variables, or other transformations, can also achieve a nonlinear effect (see the sketch after this list).
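As a concrete illustration of the last point, the sketch below uses sklearn's PolynomialFeatures class to expand the quantitative iris features into polynomial terms; this is a minimal example of the idea, not one of the preprocessing steps covered later in this section:

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import PolynomialFeatures

    iris = load_iris()
    # Degree-2 expansion adds the squares and pairwise products of the 4
    # original features, so a linear model fitted on the expanded matrix
    # can capture nonlinear relationships in the original features
    expanded = PolynomialFeatures(degree=2).fit_transform(iris.data)
    print(expanded.shape)  # (150, 15): 1 bias + 4 linear + 10 quadratic terms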
We use the preprocessing module in sklearn for data preprocessing, which covers solutions to all of the above problems.

2.1 Dimensionless processing
Dimensionless processing converts data of different scales to the same scale. Common methods are standardization and the interval scaling method. Standardization presupposes that the feature values follow a normal distribution; after standardization they are converted to a standard normal distribution. Interval scaling uses boundary-value information to scale the feature's range to a fixed interval, such as [0, 1].

2.1.1 Standardization
Standardization requires computing the mean and standard deviation of each feature; the formula is:

$x' = \frac{x - \bar{X}}{S}$

where $\bar{X}$ is the feature's mean and $S$ its standard deviation.
The code for standardizing the data using the StandardScaler class of the preprocessing module is as follows:

    from sklearn.preprocessing import StandardScaler

    # Standardization: the return value is the standardized data
    StandardScaler().fit_transform(iris.data)
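A minimal sanity check (illustrative): after standardization, each column should have mean approximately 0 and standard deviation approximately 1:

    scaled = StandardScaler().fit_transform(iris.data)
    print(scaled.mean(axis=0))  # all values close to 0
    print(scaled.std(axis=0))   # all values close to 1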
2.1.2 Interval scaling

There are several ways to do interval scaling; a common one uses the two extreme values (the maximum and minimum), with the formula:

$x' = \frac{x - Min}{Max - Min}$
The code for interval scaling of the data using the MinMaxScaler class of the preprocessing module is as follows:

    from sklearn.preprocessing import MinMaxScaler

    # Interval scaling: the return value is the data scaled to the [0, 1] interval
    MinMaxScaler().fit_transform(iris.data)
2.1.3 The difference between standardization and normalization
In simple terms, standardization processes the feature matrix column by column, converting the samples' feature values to the same scale via the z-score method. Normalization processes the feature matrix row by row; its purpose is to give sample vectors a uniform standard when similarity is computed by dot products or other kernel functions, i.e. to turn each sample into a "unit vector". The normalization formula for the L2 norm is as follows:

$x' = \frac{x}{\sqrt{\sum_{j=1}^{m} x_j^2}}$

where $x$ is a sample (row) vector with $m$ features.
The code for normalizing the data using the Normalizer class of the preprocessing module is as follows:

    from sklearn.preprocessing import Normalizer

    # Normalization: the return value is the row-normalized data
    Normalizer().fit_transform(iris.data)
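A minimal check of the "unit vector" claim (illustrative; uses NumPy): every row of the normalized matrix should have an L2 norm of 1:

    import numpy as np

    normalized = Normalizer().fit_transform(iris.data)
    # Each row is scaled by its own L2 norm, so all row norms are ~1.0
    print(np.linalg.norm(normalized, axis=1))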
2.2 Binarization of quantitative features

The core of binarizing a quantitative feature is setting a threshold: values greater than the threshold are assigned 1, and values less than or equal to the threshold are assigned 0. The formula is:

$x' = \begin{cases} 1, & x > \text{threshold} \\ 0, & x \le \text{threshold} \end{cases}$
The code for binarizing the data using the Binarizer class of the preprocessing module is as follows:

    from sklearn.preprocessing import Binarizer

    # Binarization: the threshold is set to 3; the return value is the binarized data
    Binarizer(threshold=3).fit_transform(iris.data)
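The formula can also be written out directly in NumPy; a minimal equivalent sketch (the variable name is illustrative):

    # Manual thresholding equivalent to Binarizer(threshold=3):
    # entries greater than 3 become 1, all others become 0
    manual = (iris.data > 3).astype(float)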
2.3 Dummy coding for qualitative features
Because the iris dataset's features are all quantitative, its target values are used here for dummy coding (in practice this is not needed). The code for dummy coding the data using the OneHotEncoder class of the preprocessing module is as follows:

    from sklearn.preprocessing import OneHotEncoder

    # Dummy coding of the iris dataset's target values; the return value is the dummy-coded data
    OneHotEncoder().fit_transform(iris.target.reshape((-1, 1)))
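Note that OneHotEncoder returns a SciPy sparse matrix by default (in recent sklearn versions this behavior is controlled by the sparse_output constructor parameter); a small illustrative snippet to inspect the result densely:

    encoded = OneHotEncoder().fit_transform(iris.target.reshape((-1, 1)))
    # One column per class; each row has a single 1 in its class's column
    print(encoded.toarray()[:5])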
2.4 Missing value imputation
Since the iris dataset contains no missing values, a new sample whose 4 features are all assigned NaN is added to the dataset to represent missing data. The code for imputing the missing values using the Imputer class of the preprocessing module is as follows:
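A minimal sketch of this step, using the Imputer class named above (note: Imputer was removed in scikit-learn 0.22; its modern equivalent is sklearn.impute.SimpleImputer, which can be substituted in the same way):

    from numpy import vstack, array, nan
    from sklearn.preprocessing import Imputer  # use sklearn.impute.SimpleImputer on sklearn >= 0.22

    # Missing-value imputation: the return value is the data with missing values filled in.
    # The missing_values parameter gives the representation of missing entries (default NaN),
    # and strategy gives the fill method (default "mean", i.e. the column mean).
    Imputer().fit_transform(vstack((array([nan, nan, nan, nan]), iris.data)))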