R Language caret Package Learning (I): Data Preprocessing


The caret package (Classification And REgression Training) is a collection of functions that streamline the process of creating predictive models. This series will cover data preprocessing, feature selection, sampling, model tuning, and more.

This article covers the data preprocessing part of the caret package. The main functions are: model.matrix(), dummyVars(), nearZeroVar(), findCorrelation(), findLinearCombos(), preProcess(), and classDist().

Creating dummy variables

There are two main functions for creating dummy variables: model.matrix and dummyVars.

model.matrix()

model.matrix(object, data = environment(object), contrasts.arg = NULL, xlev = NULL, ...)

The main parameter is object, a formula; data is the data frame it refers to.

Here, the formula can be understood as specifying the data to be expanded (in fact, only the factor columns are expanded; the other columns can be cbind-ed alongside so the expanded data is ready for the next step of the analysis). The result is returned as a matrix.

A factor with 3 levels is expanded into 2 columns; the remaining level is omitted (absorbed into the intercept), which avoids perfect multicollinearity. To remove the intercept and expand all levels, add `- 1` (or `0 +`) to the formula.
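As a minimal sketch (the data frame and column names here are illustrative, not from the original post), the intercept behavior can be reproduced in base R:

```r
# A 3-level factor; levels sort alphabetically to "high", "low", "mid".
df <- data.frame(a = factor(c("low", "mid", "high")), y = c(1, 2, 3))

# With an intercept: the first level ("high") is absorbed into the
# intercept, so the 3-level factor contributes only 2 dummy columns.
m1 <- model.matrix(~ a, data = df)
colnames(m1)   # "(Intercept)" "alow" "amid"

# Adding "- 1" (or "0 +") removes the intercept, so all 3 levels expand.
m2 <- model.matrix(~ a - 1, data = df)
colnames(m2)   # "ahigh" "alow" "amid"
```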

dummyVars()

dummyVars(formula, data, sep = ".", levelsOnly = FALSE, fullRank = FALSE, ...)

Its usage is similar to model.matrix; the main parameters are still formula and data.

There are two differences: 1. you need to call the predict function to obtain the matrix; 2. all factor levels are expanded (none is dropped).

In addition, dummyVars can also generate interaction expansions.

Of course, you can also pass sep = ";" to get names of the form a;1, and levelsOnly = TRUE changes a.1 to 1. But for an interaction, changing a.1:b.1 to 1:1 raises an error, which is reasonable: a name like 1:1 can no longer be identified. levelsOnly only works when the levels are unique and do not repeat across variables.

Note: because dummyVars expands without an intercept and every factor level gets its own dummy variable, the resulting design may not be usable with some models, such as lm().
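A minimal sketch of the dummyVars workflow, assuming the caret package is installed (the data frame and column names are illustrative):

```r
library(caret)

df <- data.frame(a = factor(c("x", "y", "z")),
                 b = factor(c("p", "q", "p")))

# Unlike model.matrix, dummyVars only builds the encoder;
# predict() is what actually produces the matrix.
dv <- dummyVars(~ a + b, data = df)
predict(dv, newdata = df)   # all levels expanded: a.x, a.y, a.z, b.p, b.q

# Interaction expansion:
dv2 <- dummyVars(~ a:b, data = df)

# fullRank = TRUE drops one level per factor, mimicking model.matrix:
dv3 <- dummyVars(~ a + b, data = df, fullRank = TRUE)
```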

Zero-variance and near-zero-variance feature variables

Reasons for identifying and removing near-zero-variance feature variables:

    • In some cases, the data-generating mechanism creates feature variables with only a single value (i.e., zero-variance variables). For most models (excluding tree-based models), this can break the model or make the fit unstable.
    • Similarly, a feature variable may contain a few unique values that occur with very low frequency. When such data is resampled with cross-validation or the bootstrap, the variable may become zero-variance in some resamples, or a few samples may have an outsized influence on the model.

Metrics for identifying such feature variables:

    • The frequency ratio: the frequency of the most common value divided by the frequency of the second most common value. It is close to 1 for a well-balanced feature variable and very large for highly unbalanced data.
    • The percentage of unique values: the number of unique values divided by the total number of samples, times 100. It approaches 0 as the granularity of the data increases.

If the frequency ratio is greater than a pre-set threshold and the unique-value percentage is less than another threshold, the feature variable can be considered near-zero-variance.

The caret package provides the function nearZeroVar():

nearZeroVar(x, freqCut = 95/5, uniqueCut = 10, saveMetrics = FALSE, names = FALSE, foreach = FALSE, allowParallel = TRUE)

Parameter explanation:

    • x is a numeric vector, matrix, or data frame; freqCut is the frequency-ratio threshold; uniqueCut is the unique-value-percentage threshold
    • saveMetrics is a logical value; when FALSE it returns the positions of the near-zero-variance columns, when TRUE it returns a data frame of the metrics for each column
    • names is a logical value; when FALSE it returns the column indices, when TRUE it returns the column names
    • foreach: whether to use the foreach package; if TRUE, less memory is used
    • allowParallel: whether to compute in parallel via the foreach package; if TRUE, more memory is used but execution takes less time

Explanation of the returned metrics:

    • freqRatio: the frequency ratio
    • percentUnique: the unique-value percentage
    • zeroVar: whether the column has only one value
    • nzv: whether the column is a near-zero-variance feature

Removing the near-zero-variance feature variables:
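A minimal sketch of flagging and dropping such columns, assuming caret is installed (the data and column names are illustrative):

```r
library(caret)

set.seed(1)
df <- data.frame(
  zv  = rep(1, 100),       # zero variance: a single value
  nzv = c(rep(0, 99), 1),  # near zero variance: freqRatio = 99, percentUnique = 2
  ok  = rnorm(100)         # a well-behaved predictor
)

nearZeroVar(df)                     # indices of the offending columns: 1 and 2
nearZeroVar(df, saveMetrics = TRUE) # freqRatio, percentUnique, zeroVar, nzv per column

filtered <- df[, -nearZeroVar(df)]  # keeps only "ok"
```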

Identifying correlated feature variables

Some models can handle correlated feature variables (such as partial least squares regression, PLS), while others benefit from reducing the correlation between variables.

findCorrelation(x, cutoff = 0.9, verbose = FALSE, names = FALSE, exact = ncol(x) < 100)

Parameter explanation:

    • x is a correlation matrix
    • cutoff is the threshold on the absolute correlation
    • verbose: whether to print details
    • names: whether to return column names; when FALSE, the column indices are returned
    • exact: a logical value, whether the average correlations are recomputed at each step. When the number of columns is large, the exact calculation removes fewer features but is slower
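A minimal sketch, assuming caret is installed; note that findCorrelation takes a correlation matrix, not the raw data (the variable names are illustrative):

```r
library(caret)

set.seed(1)
x1 <- rnorm(100)
df <- data.frame(x1 = x1,
                 x2 = x1 + rnorm(100, sd = 0.01),  # nearly identical to x1
                 x3 = rnorm(100))

corMat <- cor(df)                              # the input is a correlation matrix
drop   <- findCorrelation(corMat, cutoff = 0.9)
filtered <- df[, -drop]                        # one of x1/x2 is removed
```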

Linear dependencies

The findLinearCombos function uses QR decomposition to enumerate sets of linearly dependent columns (if any exist).

findLinearCombos(x) has a single parameter, x, a numeric matrix.

It returns a list with two components:

    • linearCombos: if linear dependencies exist, the column-index vectors of each dependent set
    • remove: the column indices that should be removed to eliminate the linear dependencies
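A minimal sketch with one exactly dependent column, assuming caret is installed (the matrix is illustrative):

```r
library(caret)

m <- cbind(a = c(1, 2, 3, 4),
           b = c(2, 1, 0, 1),
           c = c(1, 0, 2, 7))
m <- cbind(m, d = m[, "a"] + m[, "b"])  # column d = a + b, an exact linear combination

res <- findLinearCombos(m)
res$linearCombos   # the index sets of linearly dependent columns
res$remove         # the columns to drop (here column 4, "d")

cleaned <- m[, -res$remove]
```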

The preProcess function

The preProcess function can perform many operations on feature variables, including centering and scaling. preProcess estimates the required parameters for each operation, and predict.preProcess applies them to a specified dataset.

preProcess(x, method = c("center", "scale"), thresh = 0.95, pcaComp = NULL, na.remove = TRUE, k = 5, knnSummary = mean, outcome = NULL, fudge = 0.2, numUnique = 3, verbose = FALSE, freqCut = 95/5, uniqueCut = 10, cutoff = 0.9, ...)

x is a matrix or data frame. Non-numeric variables are allowed, but are ignored

method is a string vector of processing types. Some common ones:

    • center: centering; subtracts the mean of the predictor
    • scale: scaling; divides by the standard deviation of the predictor. (The related method range instead scales the data to [0, 1]; if a new sample's value is larger or smaller than the values in the training set, it will fall outside that range.)
    • BoxCox: a Box-Cox transformation of the predictors; simple and fast, and nearly as powerful as a general power transformation, but the values must be strictly positive
    • YeoJohnson: similar to the Box-Cox transformation, but the predictors may be zero or negative, whereas Box-Cox accepts only positive values
    • expoTrans: exponential transformations, which can also be used on positive or negative values
    • zv: identifies and removes numeric predictors that contain only a single value
    • nzv: equivalent to applying nearZeroVar to remove near-zero-variance predictors
    • corr: finds and filters out highly correlated predictors; see findCorrelation

These methods run in the following order: zero-variance filter, near-zero-variance filter, correlation filter, Box-Cox/Yeo-Johnson/exponential transformation, centering, scaling, range, imputation, PCA, ICA, then spatial sign.

thresh is the cumulative proportion of variance to retain for PCA.

In fact, the preProcess function is very powerful; it can also impute missing values (via k-nearest neighbors or bagged trees).

I have not studied this in depth, so it is not covered here.
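A minimal sketch of the estimate-then-apply pattern, assuming caret is installed (the data and variable names are illustrative):

```r
library(caret)

set.seed(2)
train <- data.frame(x1 = rnorm(50, mean = 10, sd = 3), x2 = runif(50))
test  <- data.frame(x1 = rnorm(10, mean = 10, sd = 3), x2 = runif(10))

# preProcess only estimates the means/sds from the training set...
pp <- preProcess(train, method = c("center", "scale"))

# ...and predict.preProcess applies them to any dataset.
trainT <- predict(pp, train)  # each column now has mean 0, sd 1
testT  <- predict(pp, test)   # transformed with the TRAINING set's means/sds
```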

Class distance calculation

The classDist function computes the class centroids and covariance matrices from the training set, and then determines the Mahalanobis distance from each sample to each class centroid.

classDist(x, y, groups = 5, pca = FALSE, keep = NULL, ...)

predict(object, newdata, trans = log, ...) — by default the distances are log-transformed, but this can be changed via the trans parameter of predict.classDist.

In the resulting plot, the points are symmetric about the diagonal, and the test samples are separated into two classes, a blue class and a red class.
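A minimal sketch on the built-in iris data, assuming caret is installed:

```r
library(caret)

data(iris)
set.seed(3)
idx <- createDataPartition(iris$Species, p = 0.75, list = FALSE)

train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]

# Centroids and covariance matrices are estimated from the training set...
cd <- classDist(train, iris$Species[idx])

# ...and predict() returns the (log) Mahalanobis distance of each test
# sample to each class centroid, one column per class.
dists <- predict(cd, test)
raw   <- predict(cd, test, trans = identity)  # undo the default log transform
```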

Reference:

http://topepo.github.io/caret/pre-processing.html (from which this article was translated)

