How to determine the best machine learning algorithm for a data set in R

Source: Internet
Author: User

Release time: 2016-02-25

Spot checking is a way to find the best algorithm for a given data set. In this article I introduce eight machine learning algorithms commonly used for spot checks, with R code for each that you can save and reuse in your next machine learning project.

The best algorithm for your data set

You cannot know which algorithm will work best on your data set before you model it. You have to find it through trial and error, and I call this process spot checking. The question is not "Which algorithm should I use on my data set?" but "Which algorithms should I spot check on my data set?"

Which algorithms should you spot check?

First, you can think about which algorithms might work for your data set.

Second, I recommend trying as diverse a mix of algorithms as possible and seeing which works best on your data set:

    • Try a mix of algorithm representations (such as instance-based models and tree models)
    • Try a mix of learning algorithms (such as different algorithms for learning the same type of representation)
    • Try a mix of model types (such as linear and nonlinear functions, or parametric and nonparametric models)
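
As a rough illustration of mixing model types, the sketch below spot checks one linear model and one tree model with hand-rolled 5-fold cross-validation. It is not taken from the original article: the built-in mtcars data, the candidates list and the cv_rmse name are assumptions chosen so that the example runs with base R and the rpart package alone.

```r
# Sketch only: mtcars and the helper names below are illustrative assumptions.
library(rpart)
data(mtcars)

set.seed(7)
k <- 5
# assign every row to one of k folds at random
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# one fitting function per candidate algorithm
candidates <- list(
  linear = function(train) lm(mpg ~ ., data = train),
  tree   = function(train) rpart(mpg ~ ., data = train)
)

# cross-validated RMSE for each candidate
cv_rmse <- sapply(candidates, function(fit_fun) {
  fold_mse <- sapply(seq_len(k), function(i) {
    train <- mtcars[folds != i, ]
    test  <- mtcars[folds == i, ]
    fit   <- fit_fun(train)
    mean((test$mpg - predict(fit, test))^2)
  })
  sqrt(mean(fold_mse))
})

# lower is better
print(sort(cv_rmse))
```

Adding another candidate only requires another entry in the list, which is the point of a spot check: the evaluation stays fixed while the algorithms vary.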

Let's take a concrete look at how to put these ideas into practice. In the next section we will see how to implement the corresponding machine learning algorithms in R.

How do I spot check algorithms in R?

There are hundreds of machine learning algorithms available in R. If your project requires high predictive accuracy and you have plenty of time, I recommend exploring as many different algorithms as you can. Normally we do not have much time for testing, so we need to know a handful of common and important algorithms.

In this section you will see a number of linear and nonlinear algorithms that are often used for spot checks in R. Ensemble algorithms such as boosting and bagging are not included. Each algorithm is presented from two perspectives:

    1. The package and function used to train the model and make predictions
    2. The caret wrapper for the algorithm

You need to know which package and function to use for a given algorithm, and you also need to know how to run these common algorithms through the caret package, so that you can evaluate their accuracy efficiently with caret's preprocessing, algorithm evaluation, and parameter tuning capabilities. Two standard data sets are used in this article:

    1. Regression: BHD (Boston Housing Dataset)
    2. Classification: PIDD (Pima Indians Diabetes Dataset)

All of the code below is complete, so you can save it and apply it to the next machine learning project.

Linear algorithms

These methods make strong assumptions about the functional form of the model. They are fast to train, but their results have higher bias.

The end result of such a model is usually easy to interpret, so if the results of the linear model are sufficiently accurate, then you do not need to adopt a more complex non-linear model.
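
To make the bias point concrete, the sketch below fits a straight line and a regression tree (rpart, covered later in this article) to synthetic data drawn from a sine curve. The simulated data and the mse() helper are assumptions for illustration only, and the errors are measured on the training data, so they show how well each model can follow the shape rather than how well it generalizes.

```r
# Sketch only: synthetic data, not one of the article's data sets.
library(rpart)

set.seed(7)
x <- seq(0, 2 * pi, length.out = 200)
d <- data.frame(x = x, y = sin(x) + rnorm(200, sd = 0.1))

fit.lm   <- lm(y ~ x, data = d)      # assumes a straight-line relationship
fit.tree <- rpart(y ~ x, data = d)   # makes no such assumption

# mean squared error on the data the models were fitted to
mse <- function(fit) mean((d$y - predict(fit, d))^2)
print(c(linear = mse(fit.lm), tree = mse(fit.tree)))
```

The straight line cannot follow the curve no matter how much data it sees, which is what high bias means here.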

Linear regression model

The lm() function in the stats package fits a linear regression model using least squares estimation.

    # load the library
    library(mlbench)
    # load data
    data(BostonHousing)
    # fit model
    fit <- lm(medv~., BostonHousing)
    # summarize the fit
    print(fit)
    # make predictions
    predictions <- predict(fit, BostonHousing)
    # summarize accuracy
    mse <- mean((BostonHousing$medv - predictions)^2)
    print(mse)

    # caret
    # load libraries
    library(caret)
    library(mlbench)
    # load dataset
    data(BostonHousing)
    # train
    set.seed(7)
    control <- trainControl(method="cv", number=5)
    fit.lm <- train(medv~., data=BostonHousing, method="lm", metric="RMSE", preProc=c("center", "scale"), trControl=control)
    # summarize fit
    print(fit.lm)

Logistic regression model

The glm() function in the stats package fits generalized linear models. It can be used to fit a logistic regression model for binary classification problems.
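
The listing for this section prints a confusion table but never reduces it to a single number. The sketch below shows one way the predicted probabilities might be thresholded and the table turned into an accuracy figure. It uses the built-in mtcars data (predicting the am column from wt) instead of the article's data set so that it runs with base R alone; the 0.5 cutoff mirrors the listing, and the variable names are illustrative.

```r
# Sketch only: mtcars stands in for the article's data set.
data(mtcars)

# logistic regression: transmission type (0/1) from car weight
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# predicted probability of class 1, then a hard label at the 0.5 cutoff
probabilities <- predict(fit, mtcars, type = "response")
predictions <- factor(ifelse(probabilities > 0.5, 1, 0), levels = c(0, 1))

# rows are predictions, columns are true labels
confusion <- table(predicted = predictions, actual = factor(mtcars$am, levels = c(0, 1)))
print(confusion)

# accuracy is the share of cases on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

Fixing the factor levels on both sides keeps the table square even if one class is never predicted, so the diagonal always lines up with correct predictions.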

    # load the library
    library(mlbench)
    # Load the dataset
    data(PimaIndiansDiabetes)
    # fit model
    fit <- glm(diabetes~., data=PimaIndiansDiabetes, family=binomial(link='logit'))
    # summarize the fit
    print(fit)
    # make predictions
    probabilities <- predict(fit, PimaIndiansDiabetes[,1:8], type='response')
    predictions <- ifelse(probabilities > 0.5,'pos','neg')
    # summarize accuracy
    table(predictions, PimaIndiansDiabetes$diabetes)

    # caret
    # load libraries
    library(caret)
    library(mlbench)
    # Load the dataset
    data(PimaIndiansDiabetes)
    # train
    set.seed(7)
    control <- trainControl(method="cv", number=5)
    fit.glm <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", metric="Accuracy", preProc=c("center", "scale"), trControl=control)
    # summarize fit
    print(fit.glm)

Linear discriminant analysis

The lda() function in the MASS package fits a linear discriminant analysis model.

    # load the libraries
    library(MASS)
    library(mlbench)
    # Load the dataset
    data(PimaIndiansDiabetes)
    # fit model
    fit <- lda(diabetes~., data=PimaIndiansDiabetes)
    # summarize the fit
    print(fit)
    # make predictions
    predictions <- predict(fit, PimaIndiansDiabetes[,1:8])$class
    # summarize accuracy
    table(predictions, PimaIndiansDiabetes$diabetes)

    # caret
    # load libraries
    library(caret)
    library(mlbench)
    # Load the dataset
    data(PimaIndiansDiabetes)
    # train
    set.seed(7)
    control <- trainControl(method="cv", number=5)
    fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", metric="Accuracy", preProc=c("center", "scale"), trControl=control)
    # summarize fit
    print(fit.lda)

Regularized regression

The glmnet() function in the glmnet package fits a regularized classification or regression model.
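
The listings in this section pass alpha and lambda to glmnet() without explaining them. As a reminder of what they control, the elastic net objective that glmnet minimizes for the Gaussian family can be written as follows, where N is the number of observations (the notation here follows the glmnet documentation rather than the original article):

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2
+\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2+\alpha\,\lVert\beta\rVert_1\right]
```

Setting alpha to 0 gives the ridge penalty and setting it to 1 gives the lasso, so the alpha=0.5 used below mixes the two in equal parts, while lambda sets the overall strength of the penalty.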

Classification Model:

    # load the libraries
    library(glmnet)
    library(mlbench)
    # load data
    data(PimaIndiansDiabetes)
    x <- as.matrix(PimaIndiansDiabetes[,1:8])
    y <- as.matrix(PimaIndiansDiabetes[,9])
    # fit model
    fit <- glmnet(x, y, family="binomial", alpha=0.5, lambda=0.001)
    # summarize the fit
    print(fit)
    # make predictions
    predictions <- predict(fit, x, type="class")
    # summarize accuracy
    table(predictions, PimaIndiansDiabetes$diabetes)

    # caret
    # load libraries
    library(caret)
    library(mlbench)
    library(glmnet)
    # Load the dataset
    data(PimaIndiansDiabetes)
    # train
    set.seed(7)
    control <- trainControl(method="cv", number=5)
    fit.glmnet <- train(diabetes~., data=PimaIndiansDiabetes, method="glmnet", metric="Accuracy", preProc=c("center", "scale"), trControl=control)
    # summarize fit
    print(fit.glmnet)

Regression model:

    # load the libraries
    library(glmnet)
    library(mlbench)
    # load data
    data(BostonHousing)
    BostonHousing$chas <- as.numeric(as.character(BostonHousing$chas))
    x <- as.matrix(BostonHousing[,1:13])
    y <- as.matrix(BostonHousing[,14])
    # fit model
    fit <- glmnet(x, y, family="gaussian", alpha=0.5, lambda=0.001)
    # summarize the fit
    print(fit)
    # make predictions
    predictions <- predict(fit, x, type="link")
    # summarize accuracy
    mse <- mean((y - predictions)^2)
    print(mse)

    # caret
    # load libraries
    library(caret)
    library(mlbench)
    library(glmnet)
    # Load the dataset
    data(BostonHousing)
    # train
    set.seed(7)
    control <- trainControl(method="cv", number=5)
    fit.glmnet <- train(medv~., data=BostonHousing, method="glmnet", metric="RMSE", preProc=c("center", "scale"), trControl=control)
    # summarize fit
    print(fit.glmnet)

Nonlinear algorithms

Nonlinear algorithms make fewer assumptions about the functional form of the model. Such models usually achieve higher accuracy, at the cost of higher variance.
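
The variance side of this trade-off shows up as a gap between training error and error on unseen data. The sketch below grows a deliberately unpruned regression tree on noisy synthetic data and compares the two errors. The simulated data, the 100/100 split and the mse() helper are assumptions for illustration; only rpart() and rpart.control() come from the rpart package used later in this article.

```r
# Sketch only: synthetic data and an intentionally overgrown tree.
library(rpart)

set.seed(7)
n <- 200
x <- runif(n, 0, 2 * pi)
d <- data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.3))

# hold out half of the rows for testing
train_idx <- sample(n, 100)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# minsplit=2 and cp=0 let the tree keep splitting until it memorizes the noise
deep <- rpart(y ~ x, data = train, control = rpart.control(minsplit = 2, cp = 0))

mse <- function(fit, data) mean((data$y - predict(fit, data))^2)
train_mse <- mse(deep, train)
test_mse  <- mse(deep, test)
print(c(train = train_mse, test = test_mse))
```

A training error far below the test error is the signature of high variance, and it is why the caret listings in this article score every model with cross-validation rather than on the training data.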

k-nearest neighbors

The knn3() function in the caret package does not build a model; it makes predictions directly from the training data. Use knn3() for classification and knnreg(), also in caret, for regression.
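
To show what predicting "directly from the training data" means, here is a minimal from-scratch kNN classifier in base R. It is a teaching sketch, not how caret implements knn3(): the knn_predict() function, its arguments and the use of the built-in iris data are all assumptions made for illustration.

```r
# Sketch only: knn_predict() is a hypothetical helper, not part of caret.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  train_x <- as.matrix(train_x)
  new_x <- as.matrix(new_x)
  apply(new_x, 1, function(row) {
    # Euclidean distance from this row to every training row
    distances <- sqrt(colSums((t(train_x) - row)^2))
    # indices of the k closest training rows
    nearest <- order(distances)[seq_len(k)]
    # majority vote among their labels (ties go to the first level)
    names(which.max(table(train_y[nearest])))
  })
}

data(iris)
predictions <- knn_predict(iris[, 1:4], iris$Species, iris[, 1:4], k = 3)
# summarize accuracy on the training data
print(table(predictions, iris$Species))
train_accuracy <- mean(predictions == iris$Species)
print(train_accuracy)
```

There is no fitting step at all: every prediction rescans the stored training rows, which is why kNN is cheap to "train" and comparatively slow to predict on large data sets.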

Classification Model:

    # knn direct classification
    # load the libraries
    library(caret)
    library(mlbench)
    # Load the dataset
    data(PimaIndiansDiabetes)
    # fit model
    fit <- knn3(diabetes~., data=PimaIndiansDiabetes, k=3)
    # summarize the fit
    print(fit)
    # make predictions
    predictions <- predict(fit, PimaIndiansDiabetes[,1:8], type="class")
    # summarize accuracy
    table(predictions, PimaIndiansDiabetes$diabetes)

    # caret
    # load libraries
    library(caret)
    library(mlbench)
    # Load the dataset
    data(PimaIndiansDiabetes)
    # train
    set.seed(7)
    control <- trainControl(method="cv", number=5)
    fit.knn <- train(diabetes~., data=PimaIndiansDiabetes, method="knn", metric="Accuracy", preProc=c("center", "scale"), trControl=control)
    # summarize fit
    print(fit.knn)

Regression model:

    # load the libraries
    library(caret)
    library(mlbench)
    # load data
    data(BostonHousing)
    BostonHousing$chas <- as.numeric(as.character(BostonHousing$chas))
    x <- as.matrix(BostonHousing[,1:13])
    y <- as.matrix(BostonHousing[,14])
    # fit model
    fit <- knnreg(x, y, k=3)
    # summarize the fit
    print(fit)
    # make predictions
    predictions <- predict(fit, x)
    # summarize accuracy
    mse <- mean((BostonHousing$medv - predictions)^2)
    print(mse)

    # caret
    # load libraries
    library(caret)
    library(mlbench)
    # Load the dataset
    data(BostonHousing)
    # train
    set.seed(7)
    control <- trainControl(method="cv", number=5)
    fit.knn <- train(medv~., data=BostonHousing, method="knn", metric="RMSE", preProc=c("center", "scale"), trControl=control)
    # summarize fit
    print(fit.knn)

Naive Bayes

The naiveBayes() function in the e1071 package fits a naive Bayes model for classification problems.

    # load the libraries
    library(e1071)
    library(mlbench)
    # Load the dataset
    data(PimaIndiansDiabetes)
    # fit model
    fit <- naiveBayes(diabetes~., data=PimaIndiansDiabetes)
    # summarize the fit
    print(fit)
    # make predictions
    predictions <- predict(fit, PimaIndiansDiabetes[,1:8])
    # summarize accuracy
    table(predictions, PimaIndiansDiabetes$diabetes)

    # caret
    # load libraries
    library(caret)
    library(mlbench)
    # Load the dataset
    data(PimaIndiansDiabetes)
    # train
    set.seed(7)
    control <- trainControl(method="cv", number=5)
    fit.nb <- train(diabetes~., data=PimaIndiansDiabetes, method="nb", metric="Accuracy", trControl=control)
    # summarize fit
    print(fit.nb)

Support vector machine

The ksvm() function in the kernlab package fits a support vector machine model for classification and regression problems.

Classification Model:

    # Classification Example:
    # load the libraries
    library(kernlab)
    library(mlbench)
    # Load the dataset
    data(PimaIndiansDiabetes)
    # fit model
    fit <- ksvm(diabetes~., data=PimaIndiansDiabetes, kernel="rbfdot")
    # summarize the fit
    print(fit)
    # make predictions
    predictions <- predict(fit, PimaIndiansDiabetes[,1:8], type="response")
    # summarize accuracy
    table(predictions, PimaIndiansDiabetes$diabetes)

    # caret
    # load libraries
    library(caret)
    library(mlbench)
    # Load the dataset
    data(PimaIndiansDiabetes)
    # train
    set.seed(7)
    control <- trainControl(method="cv", number=5)
    fit.svmRadial <- train(diabetes~., data=PimaIndiansDiabetes, method="svmRadial", metric="Accuracy", trControl=control)
    # summarize fit
    print(fit.svmRadial)

Regression model:

    # Regression Example:
    # load the libraries
    library(kernlab)
    library(mlbench)
    # load data
    data(BostonHousing)
    # fit model
    fit <- ksvm(medv~., BostonHousing, kernel="rbfdot")
    # summarize the fit
    print(fit)
    # make predictions
    predictions <- predict(fit, BostonHousing)
    # summarize accuracy
    mse <- mean((BostonHousing$medv - predictions)^2)
    print(mse)

    # caret
    # load libraries
    library(caret)
    library(mlbench)
    # Load the dataset
    data(BostonHousing)
    # train
    set.seed(7)
    control <- trainControl(method="cv", number=5)
    fit.svmRadial <- train(medv~., data=BostonHousing, method="svmRadial", metric="RMSE", trControl=control)
    # summarize fit
    print(fit.svmRadial)

Classification and regression trees

The rpart() function in the rpart package fits CART classification tree and regression tree models.

Classification Model:

    # load the libraries
    library(rpart)
    library(mlbench)
    # Load the dataset
    data(PimaIndiansDiabetes)
    # fit model
    fit <- rpart(diabetes~., data=PimaIndiansDiabetes)
    # summarize the fit
    print(fit)
    # make predictions
    predictions <- predict(fit, PimaIndiansDiabetes[,1:8], type="class")
    # summarize accuracy
    table(predictions, PimaIndiansDiabetes$diabetes)

    # caret
    # load libraries
    library(caret)
    library(mlbench)
    # Load the dataset
    data(PimaIndiansDiabetes)
    # train
    set.seed(7)
    control <- trainControl(method="cv", number=5)
    fit.rpart <- train(diabetes~., data=PimaIndiansDiabetes, method="rpart", metric="Accuracy", trControl=control)
    # summarize fit
    print(fit.rpart)

Regression model:

    # load the libraries
    library(rpart)
    library(mlbench)
    # load data
    data(BostonHousing)
    # fit model
    fit <- rpart(medv~., data=BostonHousing, control=rpart.control(minsplit=5))
    # summarize the fit
    print(fit)
    # make predictions
    predictions <- predict(fit, BostonHousing[,1:13])
    # summarize accuracy
    mse <- mean((BostonHousing$medv - predictions)^2)
    print(mse)

    # caret
    # load libraries
    library(caret)
    library(mlbench)
    # Load the dataset
    data(BostonHousing)
    # train
    set.seed(7)
    control <- trainControl(method="cv", number=2)
    fit.rpart <- train(medv~., data=BostonHousing, method="rpart", metric="RMSE", trControl=control)
    # summarize fit
    print(fit.rpart)

Other algorithms

R provides many other machine learning algorithms that are available through caret. I suggest you explore more of them and apply them to your next machine learning project.

Caret Model List: this page maps each machine learning algorithm in caret to its function and its corresponding package. You can also use it to learn how to build machine learning models with caret.

Summary

Eight commonly used machine learning algorithms are described in this article:

    1. Linear regression
    2. Logistic regression
    3. Linear discriminant analysis
    4. Regularized regression
    5. k-nearest neighbors
    6. Naive Bayes
    7. Support vector machine
    8. Classification and regression trees

From the sections above you can learn how to implement these algorithms with the corresponding R packages and functions, and how to run every one of them through the caret package. Finally, you can apply these algorithms to your own machine learning project.
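
The caret fits in this article are printed one at a time. One way to finish a spot check is to put their cross-validation results side by side with caret's resamples() function, as in the sketch below. This step is not part of the original article; it assumes the caret, mlbench, MASS and rpart packages are installed, and it refits two of the models from above so that the example is self-contained.

```r
# Sketch only: compares two of the article's caret models on the same folds.
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

control <- trainControl(method="cv", number=5)

# reset the seed before each fit so both models are scored on the same folds
set.seed(7)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", metric="Accuracy", preProc=c("center", "scale"), trControl=control)
set.seed(7)
fit.rpart <- train(diabetes~., data=PimaIndiansDiabetes, method="rpart", metric="Accuracy", trControl=control)

# collect the per-fold results and summarize them in one table
results <- resamples(list(LDA=fit.lda, CART=fit.rpart))
summary(results)
```

Any other caret fit from this article can be added to the list passed to resamples(), provided it was trained on the same data with the same seed and resampling settings.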

What is your next step?

Have you tried the algorithm code in this article?

    • Open R.
    • Enter the code above and run it.
    • Read the help documentation to learn more about each function.

This article is reproduced from Data Craftsman, translated by Fibears. The original article is "Spot Check Machine Learning Algorithms in R (algorithms to try on your next project)" by Jason Brownlee. When reprinting, please cite the translation link: http://datartisan.com/article/detail/86.html

