R Language caret Package Learning (I): Data Preprocessing


The caret package (Classification And REgression Training) is a collection of functions that streamline the process of creating predictive models. This series will cover data preprocessing, feature selection, sampling, model tuning, and more.

This article covers the data preprocessing part of the caret package. The main functions are: model.matrix(), dummyVars(), nearZeroVar(), findCorrelation(), findLinearCombos(), preProcess(), and classDist().

Creating dummy variables

There are two main functions for creating dummy variables: model.matrix and dummyVars.

model.matrix()

model.matrix(object, data = environment(object), contrasts.arg = NULL, xlev = NULL, ...)

The main parameter is object, a formula; data is the data frame it refers to.

Here, the formula can be understood as specifying the data to be expanded (in fact, only the factor columns are expanded; the other columns can be cbind-ed alongside so the expanded data is ready for the next step of the analysis). The result is returned as a matrix.

A factor with 3 levels is expanded into 2 columns; the remaining level is omitted (absorbed into the intercept), which avoids perfect multicollinearity. To remove the intercept and expand all levels, add `- 1` (or `0 +`) to the formula.
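As a minimal sketch (the data frame and column names here are illustrative, not from the original post), the intercept behavior can be reproduced in base R:

```r
# A 3-level factor; levels sort alphabetically to "high", "low", "mid".
df <- data.frame(a = factor(c("low", "mid", "high")), y = c(1, 2, 3))

# With an intercept: the first level ("high") is absorbed into the
# intercept, so the 3-level factor contributes only 2 dummy columns.
m1 <- model.matrix(~ a, data = df)
colnames(m1)   # "(Intercept)" "alow" "amid"

# Adding "- 1" (or "0 +") removes the intercept, so all 3 levels expand.
m2 <- model.matrix(~ a - 1, data = df)
colnames(m2)   # "ahigh" "alow" "amid"
```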

dummyVars()

dummyVars(formula, data, sep = ".", levelsOnly = FALSE, fullRank = FALSE, ...)

Its usage is similar to model.matrix; the main parameters are still formula and data.

There are two differences: 1. you need to call the predict function to obtain the matrix; 2. all factor levels are expanded (none is dropped).

In addition, dummyVars can also generate interaction expansions.

Of course, you can also pass sep = ";" to get names of the form a;1, and levelsOnly = TRUE changes a.1 to 1. But for an interaction, changing a.1:b.1 to 1:1 raises an error, which is reasonable: a name like 1:1 can no longer be identified. levelsOnly only works when the levels are unique and do not repeat across variables.

Note: because dummyVars expands without an intercept and every factor level gets its own dummy variable, the resulting design may not be usable with some models, such as lm().
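A minimal sketch of the dummyVars workflow, assuming the caret package is installed (the data frame and column names are illustrative):

```r
library(caret)

df <- data.frame(a = factor(c("x", "y", "z")),
                 b = factor(c("p", "q", "p")))

# Unlike model.matrix, dummyVars only builds the encoder;
# predict() is what actually produces the matrix.
dv <- dummyVars(~ a + b, data = df)
predict(dv, newdata = df)   # all levels expanded: a.x, a.y, a.z, b.p, b.q

# Interaction expansion:
dv2 <- dummyVars(~ a:b, data = df)

# fullRank = TRUE drops one level per factor, mimicking model.matrix:
dv3 <- dummyVars(~ a + b, data = df, fullRank = TRUE)
```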

Zero-variance and near-zero-variance feature variables

Reasons for identifying and removing near-zero-variance feature variables:

    • In some cases, the data-generating mechanism creates feature variables with only a single value (i.e., zero-variance variables). For most models (excluding tree-based models), this can break the model or make the fit unstable.
    • Similarly, a feature variable may contain a few unique values that occur with very low frequency. When such data is resampled with cross-validation or the bootstrap, the variable may become zero-variance in some resamples, or a few samples may have an outsized influence on the model.

Metrics for identifying such feature variables:

    • The frequency ratio: the frequency of the most common value divided by the frequency of the second most common value. It is close to 1 for a well-balanced feature variable and very large for highly unbalanced data.
    • The percentage of unique values: the number of unique values divided by the total number of samples, times 100. It approaches 0 as the granularity of the data increases.

If the frequency ratio is greater than a pre-set threshold and the unique-value percentage is less than another threshold, the feature variable can be considered near-zero-variance.

The caret package provides the function nearZeroVar():

nearZeroVar(x, freqCut = 95/5, uniqueCut = 10, saveMetrics = FALSE, names = FALSE, foreach = FALSE, allowParallel = TRUE)

Parameter explanation:

    • x is a numeric vector, matrix, or data frame; freqCut is the frequency-ratio threshold; uniqueCut is the unique-value-percentage threshold
    • saveMetrics is a logical value; when FALSE it returns the positions of the near-zero-variance columns, when TRUE it returns a data frame of the metrics for each column
    • names is a logical value; when FALSE it returns the column indices, when TRUE it returns the column names
    • foreach: whether to use the foreach package; if TRUE, less memory is used
    • allowParallel: whether to compute in parallel via the foreach package; if TRUE, more memory is used but execution takes less time

Explanation of the returned metrics:

    • freqRatio: the frequency ratio
    • percentUnique: the unique-value percentage
    • zeroVar: whether the column has only one value
    • nzv: whether the column is a near-zero-variance feature

Removing the near-zero-variance feature variables:
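A minimal sketch of flagging and dropping such columns, assuming caret is installed (the data and column names are illustrative):

```r
library(caret)

set.seed(1)
df <- data.frame(
  zv  = rep(1, 100),       # zero variance: a single value
  nzv = c(rep(0, 99), 1),  # near zero variance: freqRatio = 99, percentUnique = 2
  ok  = rnorm(100)         # a well-behaved predictor
)

nearZeroVar(df)                     # indices of the offending columns: 1 and 2
nearZeroVar(df, saveMetrics = TRUE) # freqRatio, percentUnique, zeroVar, nzv per column

filtered <- df[, -nearZeroVar(df)]  # keeps only "ok"
```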

Identifying correlated feature variables

Some models can handle correlated feature variables (such as partial least squares regression, PLS), while others benefit from reducing the correlation between variables.

findCorrelation(x, cutoff = 0.9, verbose = FALSE, names = FALSE, exact = ncol(x) < 100)

Parameter explanation:

    • x is a correlation matrix
    • cutoff is the threshold on the absolute correlation
    • verbose: whether to print details
    • names: whether to return column names; when FALSE, the column indices are returned
    • exact: a logical value, whether the average correlations are recomputed at each step. When the number of columns is large, the exact calculation removes fewer features but is slower
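A minimal sketch, assuming caret is installed; note that findCorrelation takes a correlation matrix, not the raw data (the variable names are illustrative):

```r
library(caret)

set.seed(1)
x1 <- rnorm(100)
df <- data.frame(x1 = x1,
                 x2 = x1 + rnorm(100, sd = 0.01),  # nearly identical to x1
                 x3 = rnorm(100))

corMat <- cor(df)                              # the input is a correlation matrix
drop   <- findCorrelation(corMat, cutoff = 0.9)
filtered <- df[, -drop]                        # one of x1/x2 is removed
```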

Linear dependencies

The findLinearCombos function uses QR decomposition to enumerate sets of linearly dependent columns (if any exist).

findLinearCombos(x) has a single parameter, x, a numeric matrix.

It returns a list with two components:

    • linearCombos: if linear dependencies exist, the column-index vectors of each dependent set
    • remove: the column indices that should be removed to eliminate the linear dependencies
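A minimal sketch with one exactly dependent column, assuming caret is installed (the matrix is illustrative):

```r
library(caret)

m <- cbind(a = c(1, 2, 3, 4),
           b = c(2, 1, 0, 1),
           c = c(1, 0, 2, 7))
m <- cbind(m, d = m[, "a"] + m[, "b"])  # column d = a + b, an exact linear combination

res <- findLinearCombos(m)
res$linearCombos   # the index sets of linearly dependent columns
res$remove         # the columns to drop (here column 4, "d")

cleaned <- m[, -res$remove]
```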

The preProcess function

The preProcess function can perform many operations on feature variables, including centering and scaling. preProcess estimates the required parameters for each operation, and predict.preProcess applies them to a specified dataset.

preProcess(x, method = c("center", "scale"), thresh = 0.95, pcaComp = NULL, na.remove = TRUE, k = 5, knnSummary = mean, outcome = NULL, fudge = 0.2, numUnique = 3, verbose = FALSE, freqCut = 95/5, uniqueCut = 10, cutoff = 0.9, ...)

x is a matrix or data frame. Non-numeric variables are allowed, but are ignored

method is a string vector of processing types. Some common ones:

    • center: centering; subtracts the mean of the predictor
    • scale: scaling; divides by the standard deviation of the predictor. (The related method range instead scales the data to [0, 1]; if a new sample's value is larger or smaller than the values in the training set, it will fall outside that range.)
    • BoxCox: a Box-Cox transformation of the predictors; simple and fast, and nearly as powerful as a general power transformation, but the values must be strictly positive
    • YeoJohnson: similar to the Box-Cox transformation, but the predictors may be zero or negative, whereas Box-Cox accepts only positive values
    • expoTrans: exponential transformations, which can also be used on positive or negative values
    • zv: identifies and removes numeric predictors that contain only a single value
    • nzv: equivalent to applying nearZeroVar to remove near-zero-variance predictors
    • corr: finds and filters out highly correlated predictors; see findCorrelation

These methods run in the following order: zero-variance filter, near-zero-variance filter, correlation filter, Box-Cox/Yeo-Johnson/exponential transformation, centering, scaling, range, imputation, PCA, ICA, then spatial sign.

thresh is the cumulative proportion of variance to retain for PCA.

In fact, the preProcess function is very powerful; it can also impute missing values (via k-nearest neighbors or bagged trees).

I have not studied this in depth, so it is not covered here.
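A minimal sketch of the estimate-then-apply pattern, assuming caret is installed (the data and variable names are illustrative):

```r
library(caret)

set.seed(2)
train <- data.frame(x1 = rnorm(50, mean = 10, sd = 3), x2 = runif(50))
test  <- data.frame(x1 = rnorm(10, mean = 10, sd = 3), x2 = runif(10))

# preProcess only estimates the means/sds from the training set...
pp <- preProcess(train, method = c("center", "scale"))

# ...and predict.preProcess applies them to any dataset.
trainT <- predict(pp, train)  # each column now has mean 0, sd 1
testT  <- predict(pp, test)   # transformed with the TRAINING set's means/sds
```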

Class distance calculation

The classDist function computes the class centroids and covariance matrices from the training set, and then determines the Mahalanobis distance from each sample to each class centroid.

classDist(x, y, groups = 5, pca = FALSE, keep = NULL, ...)

predict(object, newdata, trans = log, ...) — by default the distances are log-transformed, but this can be changed via the trans parameter of predict.classDist.

In the resulting plot, the points are symmetric about the diagonal, and the test samples are separated into two classes, a blue class and a red class.
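A minimal sketch on the built-in iris data, assuming caret is installed:

```r
library(caret)

data(iris)
set.seed(3)
idx <- createDataPartition(iris$Species, p = 0.75, list = FALSE)

train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]

# Centroids and covariance matrices are estimated from the training set...
cd <- classDist(train, iris$Species[idx])

# ...and predict() returns the (log) Mahalanobis distance of each test
# sample to each class centroid, one column per class.
dists <- predict(cd, test)
raw   <- predict(cd, test, trans = identity)  # undo the default log transform
```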

Reference:

http://topepo.github.io/caret/pre-processing.html (from which this article was translated)

