Summary of sample selection and feature processing for predictive classification in data mining


Feature engineering combines user features with machine learning algorithms in fields such as targeted advertising, forecasting, and risk control. Whether the task is supervised classification or unsupervised clustering, feature vectors must be built and the features preprocessed, and supervised training additionally requires sample selection. This article summarizes some of the techniques used in sample selection and feature handling.

Before training, samples must be selected, and class imbalance deserves attention. In click-through-rate (CTR) prediction for targeted advertising, for example, the number of non-clicks (negative samples) vastly exceeds the number of clicks (positive samples), so the final predictions lean heavily toward the majority negative class. Judged by overall accuracy the classifier can still look good, because accuracy ignores the gap between positive and negative samples: when the majority class dominates, overall accuracy tends to be high even though the classification of the minority class is poor.

There are two main ways to address the imbalance between positive and negative samples.

1. Sampling:

Up-sample (oversample) the positive class and down-sample (undersample) the negative class.

Up-sampling can simulate the distribution of the rare class and generate additional samples based on the rare-class samples already at hand.

The more common approach is down-sampling: remove noise and redundant samples, cluster the negative samples, and then draw a proportional share from each cluster, so that the negative class is reduced while disturbing its original distribution as little as possible.
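As a rough sketch of this cluster-then-sample idea (the clustering algorithm, sampling rate, and variable names below are illustrative assumptions, not a prescribed recipe):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_downsample(X_neg, rate=0.1, n_clusters=10, seed=0):
    """Down-sample negatives by drawing the same fraction from every cluster,
    so the reduced set roughly preserves the original distribution."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X_neg)
    rng = np.random.default_rng(seed)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        n_keep = max(1, int(len(idx) * rate))
        keep.extend(rng.choice(idx, size=n_keep, replace=False))
    return X_neg[np.array(keep)]

# Example: shrink 10,000 negative samples to roughly 1,000.
X_neg = np.random.rand(10000, 5)
X_neg_small = cluster_downsample(X_neg, rate=0.1)
```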

2. Algorithm-level optimization:

Partition the training set, train a model on each partition, and then combine the individual classifiers into an ensemble.

Cost-sensitive learning assigns different misclassification costs to different classes, for example imposing a larger penalty on misclassified minority-class samples.

In an SVM, this means giving the minority class a larger penalty factor, indicating that we place more weight on those samples.


Here i = 1, ..., p index the positive samples and j = p+1, ..., p+q index the negative samples.
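Written out in its usual textbook form (the original post does not show the formula itself, so this is the standard cost-sensitive soft-margin objective):

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C_{+}\sum_{i=1}^{p}\xi_i + C_{-}\sum_{j=p+1}^{p+q}\xi_j
\qquad \text{s.t.}\quad y_k\,(w\cdot x_k + b) \ge 1-\xi_k,\;\; \xi_k \ge 0,
$$

where setting $C_{+} > C_{-}$ makes misclassifying a (minority) positive sample more expensive than misclassifying a negative one.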

In AdaBoost, every training example is initially assigned an equal weight of 1/n; the algorithm then runs T rounds of training, and after each round the misclassified examples receive larger weights, so that subsequent rounds concentrate on the harder examples. With imbalanced samples, because misclassifying the positive class costs much more than misclassifying the negative class, the weights should not all be set the same: AdaBoost can be modified so that positive samples start with a relatively higher weight.
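A minimal sketch of that idea with scikit-learn's AdaBoostClassifier, passing larger initial weights for the positive samples (the 3:1 weight ratio and the toy data are arbitrary illustrations):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Imbalanced toy data: roughly 5% positives.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Start positive samples with a larger weight so the boosting rounds
# pay more attention to the minority class from the first iteration.
init_w = np.where(y == 1, 3.0, 1.0)
init_w = init_w / init_w.sum()

clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X, y, sample_weight=init_w)
```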

Of course, samples should be denoised before selection. There are many denoising techniques, such as outlier analysis and binning-based detection; the most intuitive case is samples with identical features but inconsistent labels, which deserves a separate chapter of its own.
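For the "identical features, inconsistent labels" case, a small pandas check (the toy data and column names are made up) can flag the contradictory samples:

```python
import pandas as pd

# Toy data: the last two rows share the same features but disagree on the label.
df = pd.DataFrame({
    "f1": [1, 2, 3, 3],
    "f2": [0, 1, 5, 5],
    "label": [0, 1, 1, 0],
})

feature_cols = ["f1", "f2"]
# Feature combinations with more than one distinct label are contradictory.
label_counts = df.groupby(feature_cols)["label"].nunique()
bad_keys = label_counts[label_counts > 1].index
noisy = df.set_index(feature_cols).loc[bad_keys].reset_index()
print(noisy)
```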

Besides sample selection, the other key factor in the success of a machine-learning model is the choice of features and their preprocessing.

Good feature selection improves both the accuracy and the generalization ability of the model: while minimizing empirical risk, it also keeps the model complexity down. Too many features make the model overly complex, prone to overfitting, and poor at generalizing. Occam's razor applies here: do not add entities unless necessary. Besides removing irrelevant features and avoiding interdependence between features, trimming the feature set also shortens the time spent on feature analysis and training.

Let's look at some methods of feature selection.

The first step is to understand the business, discuss it with the business side, and enumerate as completely as possible all the independent variables that might affect the dependent variable.

After the candidate independent variables are chosen, the features can be selected in the following ways: filter (consider the association between the independent variable and the target variable), wrapper (evaluate offline and online whether adding a feature helps), and embedded (use the learner's own built-in selection).

Filter methods mainly consider the association between the independent variable and the target variable.

Correlations between continuous variables can be evaluated with correlation coefficients, such as the Pearson correlation coefficient.

For categorical variables, hypothesis tests such as the chi-square test can be used.
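To make these two filter checks concrete, a small sketch with scipy (the synthetic data is only for illustration):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, chi2_contingency

rng = np.random.default_rng(0)

# Continuous feature vs. continuous target: Pearson correlation coefficient.
x = rng.normal(size=500)
y_cont = 2 * x + rng.normal(scale=0.5, size=500)
r, p_value = pearsonr(x, y_cont)
print(f"Pearson r = {r:.3f}, p = {p_value:.3g}")

# Categorical feature vs. categorical target: chi-square test of independence.
feature = rng.integers(0, 3, size=500)
target = (feature == 2).astype(int)
chi2, p, dof, _ = chi2_contingency(pd.crosstab(feature, target))
print(f"chi2 = {chi2:.1f}, p = {p:.3g}, dof = {dof}")
```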

For a continuous independent variable and a binary discrete dependent variable, WOE (weight of evidence) and IV (information value) can be used: the change in WOE across bins guides the choice of bin boundaries, and the IV value filters for independent variables with high predictive power.
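A rough sketch of computing WOE and IV for a continuous variable against a binary target (conventions differ between sources; here WOE is the log ratio of the per-bin event share to the non-event share, with a small smoothing constant to avoid division by zero):

```python
import numpy as np
import pandas as pd

def woe_iv(x, y, bins=5, eps=0.5):
    """Equal-frequency binning of x, then per-bin WOE and total IV for a binary target y (1 = event)."""
    df = pd.DataFrame({"x": x, "y": y})
    df["bin"] = pd.qcut(df["x"], q=bins, duplicates="drop")
    g = df.groupby("bin", observed=True)["y"].agg(events="sum", total="count")
    g["non_events"] = g["total"] - g["events"]
    dist_event = (g["events"] + eps) / (g["events"].sum() + eps * len(g))
    dist_non = (g["non_events"] + eps) / (g["non_events"].sum() + eps * len(g))
    g["woe"] = np.log(dist_event / dist_non)
    g["iv"] = (dist_event - dist_non) * g["woe"]
    return g, g["iv"].sum()

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = (rng.random(2000) < 1 / (1 + np.exp(-2 * x))).astype(int)  # target depends on x
table, iv = woe_iv(x, y)
print(table[["events", "non_events", "woe", "iv"]])
print("IV =", round(iv, 3))
```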

R squared: the percentage of the variation in one variable that can be explained by another variable.

Mutual information, information gain, and similar measures can of course also be used.

It is also necessary to avoid collinearity between the independent variables; collinearity means that a strong linear relationship exists among them.
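One common way to check for collinearity is the variance inflation factor (VIF); a sketch with statsmodels follows (the "above roughly 5-10 is suspect" threshold is a rule of thumb, not a hard rule):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=300)    # nearly collinear with x1
x3 = rng.normal(size=300)

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
X["const"] = 1.0                                   # intercept column, needed for sensible VIFs

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(3)],
    index=["x1", "x2", "x3"],
)
print(vif)   # x1 and x2 should show large VIFs; values above roughly 5-10 usually signal trouble
```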

Wrapper methods mainly consider, through offline and online evaluation, whether adding a feature helps: a model evaluation metric (AUC, MAE, MSE) is used to judge the effect of adding or removing a feature on the model, and there are usually forward and backward feature-selection procedures.
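A sketch of greedy forward selection using cross-validated AUC as the metric, via scikit-learn's SequentialFeatureSelector (the estimator, the number of features to keep, and the synthetic data are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)

# Forward selection: repeatedly add the single feature that most improves cross-validated AUC.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",        # "backward" starts from all features and removes them instead
    scoring="roc_auc",
    cv=5,
)
selector.fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
```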

Embedded methods let the classification learner itself select features automatically, such as the L1/L2 penalty terms of logistic regression, or a decision tree choosing split features by maximum information gain.
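For the embedded approach, a minimal sketch with an L1-penalized logistic regression, whose zeroed-out coefficients drop the weak features automatically (the regularization strength C=0.1 and the synthetic data are arbitrary illustrations):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)

# The L1 penalty drives weak feature weights to exactly zero,
# so the fitted model itself performs the feature selection.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)
print("Kept feature indices:", selector.get_support(indices=True))
```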

Feature preprocessing mainly includes the following:

1. Detection and handling of outliers and missing values
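A minimal example of one common recipe, IQR-based clipping plus median imputation (the 1.5 x IQR rule and the median fill are conventional choices, not requirements):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.2, 1.5, np.nan, 1.4, 30.0, 1.3])   # one missing value, one extreme value

# Flag outliers with the 1.5 * IQR rule, clip them, then fill the missing value with the median.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = s.clip(lower, upper).fillna(s.median())
print(cleaned)
```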

2. Normalization. When the value ranges of different independent variables are inconsistent, optimization becomes harder: the larger the gap between the ranges of two dimensions, the slower gradient descent proceeds, and it may never converge. Normalization is used to speed up convergence.

Common normalization formulas:

Min-max: x' = (x − min) / (max − min)

Z-score: z = (x − μ) / σ
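The two formulas above, written out with numpy (scikit-learn's MinMaxScaler and StandardScaler implement the same transformations for whole feature matrices):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])

# Min-max normalization: rescales the values into [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: zero mean, unit standard deviation.
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)   # [0.   0.25 0.5  1.  ]
print(x_zscore)
```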

3. Transforming the data distribution

When the original distribution of a continuous variable is severely asymmetric, it interferes with model fitting. Transforming the data toward a normal distribution, for example with a log, square-root, or exponential transform, improves the model's ability to fit.
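A quick illustration of how a log or square-root transform pulls in a right-skewed distribution (the lognormal toy data is only for demonstration):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)   # heavily right-skewed

print("skew before      :", round(skew(x), 2))
print("skew after log1p :", round(skew(np.log1p(x)), 2))
print("skew after sqrt  :", round(skew(np.sqrt(x)), 2))
```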

4. Discretization, feature crossing, and derived variables

The main benefits of discretization are:

On the one hand, it weakens the influence of extreme values and outliers;

On the other hand, it makes nonlinear relationships easier to analyze and describe and clarifies the relationship between the independent and dependent variables. Discretizing a feature introduces nonlinearity, improves the fit, and increases the expressive power of linear models such as logistic regression.

The main discretization approaches are:

Segmentation (binning), for which there are many schemes, such as equal-frequency and equal-interval binning.

Optimized discretization: the relationship between the independent and dependent variables is taken into account, and a cut point is placed where the target variable changes significantly. Common criteria are the chi-square statistic, information gain, the Gini coefficient, and WOE (for a binary target variable).
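A sketch of both flavors: equal-frequency binning with pandas, and a supervised ("optimized") binning that reuses the split thresholds of a shallow decision tree, so cut points land where the target changes most (the tree uses the Gini criterion by default; the depth and leaf-size settings here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = (x + rng.normal(scale=0.5, size=1000) > 0.5).astype(int)

# Equal-frequency binning: every bin holds roughly the same number of samples.
eq_freq_bins = pd.qcut(x, q=5)

# Supervised binning: fit a shallow tree on the single variable and reuse its split thresholds.
tree = DecisionTreeClassifier(max_leaf_nodes=5, min_samples_leaf=50, random_state=0)
tree.fit(x.reshape(-1, 1), y)
thresholds = sorted(t for t in tree.tree_.threshold if t != -2)   # -2 marks leaf nodes
supervised_bins = pd.cut(x, bins=[-np.inf] + thresholds + [np.inf])
print("tree-based cut points:", np.round(thresholds, 3))
```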

Derived variables are new, more business-meaningful variables generated by processing the raw data, better suited for subsequent modeling.

5. Regularization and dimensionality reduction

To improve the model's generalization ability and combat overfitting, regularization (penalization) and dimensionality reduction (reducing the dimension of the samples) are two common approaches. Structural risk minimization means minimizing the empirical risk (reducing the training error) while also limiting the model's complexity; regularization typically adds a regularization term to the loss function to penalize the feature weights and thereby reduce model complexity. In logistic regression, adding an L1 or L2 term to the loss function improves generalization: L1 corresponds to lasso regression and L2 to ridge regression, and penalizing the maximum-likelihood weights with L1 or L2 shrinks the weights of weak-signal features toward zero, or exactly to zero in the L1 case. There are many ways to reduce dimensionality, such as mutual information, the chi-square test, information gain, and topic models; for keyword selection in text, the most frequently occurring keywords in the sample set can also be chosen as the final feature set.
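To illustrate the keyword-selection side, a toy sketch with scikit-learn that keeps only the words most associated with the label by chi-square score (the documents, labels, and k=5 are made up for the example):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "cheap loans apply now", "win a free prize now", "free prize claim now",
    "meeting agenda attached", "quarterly report attached", "project meeting tomorrow",
]
labels = [1, 1, 1, 0, 0, 0]   # toy labels: 1 = promotional, 0 = work

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)                  # bag-of-words keyword counts

# Keep the 5 keywords with the highest chi-square association with the label.
selector = SelectKBest(chi2, k=5).fit(X, labels)
kept = [w for w, keep in zip(vectorizer.get_feature_names_out(), selector.get_support()) if keep]
print("kept keywords:", kept)
```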


