[R language for data mining] decision tree and random forest
1. Create a decision tree using the party package
This section uses the ctree() function in the party package to create a decision tree for the iris dataset. The attributes Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are used to predict the Species of each iris. Within this package, ctree() builds the decision tree and predict() makes predictions on another dataset.
Before the model is created, the iris dataset is divided into two subsets: a training set (70%) and a test set (30%). Setting a fixed random seed makes the random selection reproducible.
Observe the structure of the Iris dataset
str(iris)
Set the random seed to 1234
set.seed(1234)
Use the sample function to extract samples and divide the observed values in the dataset into two subsets.
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
The first part of the sample is the training set.
trainData <- iris[ind == 1, ]
The second part of the sample is the test set.
testData <- iris[ind == 2, ]
Load the party package, build a decision tree, and check the predictions. In the following code, Species is the target variable in the formula myFormula, and the other variables are independent variables. The ctree() function provides parameters such as MinSplit, MinBucket, MaxSurrogate, and MaxDepth to control the training of the decision tree; here we use the default settings, and you can consult the party package documentation for details on these parameters.
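As a minimal sketch (not part of the original walk-through), non-default control settings can be passed through ctree_control(); the parameter values below are illustrative assumptions only:
library(party)
# illustrative, non-default control settings (values are assumptions, not recommendations)
ctrl <- ctree_control(minsplit = 20, minbucket = 7, maxsurrogate = 2, maxdepth = 3)
iris_ctree_tuned <- ctree(Species ~ ., data = trainData, controls = ctrl)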
library(party)
The symbol '~' connects the left-hand and right-hand sides of a formula.
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
Create a decision tree
iris_ctree <- ctree(myFormula, data = trainData)
Check the predictions
table(predict(iris_ctree), trainData$Species)
The result is as follows:
As shown in the preceding table, all 40 observations of setosa are correctly predicted, while one versicolor observation is misclassified as virginica, and three virginica observations are misclassified as versicolor.
Print decision tree
print(iris_ctree)
plot(iris_ctree)
plot(iris_ctree, type = "simple")
In the first plot, the bar chart in each leaf node shows the probabilities that the observations fall into the three species. In the second plot, these probabilities are given as the y values in each leaf node. For example, the label of node 2 is "n = 40, y = (1, 0, 0)", which means the node contains 40 observations and all of them belong to the first species, setosa.
Next, we test the decision tree on the test set.
Test decision tree on the test set
testPred <- predict(iris_ctree, newdata = testData)
table(testPred, testData$Species)
The result is as follows:
The results show that the decision tree still confuses some versicolor and virginica observations. A limitation of the current version of ctree() is that it does not handle ambiguous attribute values well: such an instance may end up in either the left or the right subtree.
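As a small addition that is not in the original walk-through, the overall test-set accuracy can be computed directly from the predictions above:
# proportion of test observations classified correctly
mean(testPred == testData$Species)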
2. Create a decision tree using package rpart
This section uses the rpart package to create a decision tree on the bodyfat dataset. The rpart() function builds the decision tree, the subtree with the minimum prediction error is selected, and predict() is then used to make predictions on another dataset.
First, load the bodyfat dataset and view its attributes.
Data ("bodyfat", package = "TH. data ")
Dim (bodyfat)
Attributes (bodyfat)
Bodyfat [1: 5,]
As in Section 1, the dataset is divided into a training set and a test set, and a decision tree is built on the training set.
set.seed(1234)
ind <- sample(2, nrow(bodyfat), replace = TRUE, prob = c(0.7, 0.3))
bodyfat.train <- bodyfat[ind == 1, ]
bodyfat.test <- bodyfat[ind == 2, ]
library(rpart)
Compile the formula myFormula
myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth
Training decision tree
bodyfat_rpart <- rpart(myFormula, data = bodyfat.train,
                       control = rpart.control(minsplit = 10))
Draw decision tree
plot(bodyfat_rpart)
Add text labels
text(bodyfat_rpart, use.n = TRUE)
The result is as follows:
Select the subtree with the minimum cross-validated prediction error (xerror) to prune and optimize the model.
opt <- which.min(bodyfat_rpart$cptable[, "xerror"])
cp <- bodyfat_rpart$cptable[opt, "CP"]
bodyfat_prune <- prune(bodyfat_rpart, cp = cp)
plot(bodyfat_prune)
text(bodyfat_prune, use.n = TRUE)
The optimized decision tree is as follows:
Comparing the two trees shows that the split hipcirc < 99.5 is removed after pruning, presumably because that split does not reduce the cross-validated prediction error of the tree.
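As a brief aside (not part of the original code), rpart's printcp() prints the complexity parameter table that the pruning step above selects from:
# CP values, number of splits, and cross-validated error (xerror) of each subtree
printcp(bodyfat_rpart)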
Then the pruned decision tree is used for prediction, and the predictions are compared with the actual values. In the following code, the abline() function draws a diagonal line. For a good model, the predicted values should be close to the actual values, which means most of the points should fall on or near the diagonal.
Prediction based on test set
DEXfat_pred <- predict(bodyfat_prune, newdata = bodyfat.test)
Range of the observed DEXfat values, used as the plotting limits
xlim <- range(bodyfat$DEXfat)
plot(DEXfat_pred ~ DEXfat, data = bodyfat.test, xlab = "Observed",
     ylab = "Predicted", ylim = xlim, xlim = xlim)
abline(a = 0, b = 1)
The rendering result is as follows:
3. Random Forest
We use the randomForest package and the iris data to build a prediction model. The randomForest() function has two limitations: first, it cannot handle missing values, so users must impute them before calling the function; second, each categorical attribute can have at most 32 levels, and attributes with more than 32 levels must be transformed before randomForest() is used.
Alternatively, a random forest can be built with the cforest() function in the party package, which is not subject to the limit on the number of attribute levels; however, categorical attributes with many levels consume a lot of memory and time when the forest is built.
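As a hedged illustration of the first limitation (not from the original tutorial), missing values could be imputed with randomForest's na.roughfix() before training; the missing value introduced below is artificial:
library(randomForest)
irisNA <- iris
irisNA[1, "Sepal.Length"] <- NA    # artificially create a missing value
irisFixed <- na.roughfix(irisNA)   # impute numeric NAs with column medians, factor NAs with modes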
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
trainData <- iris[ind == 1, ]
testData <- iris[ind == 2, ]
library(randomForest)
The formula Species ~ . relates Species to all the other attributes.
rf <- randomForest(Species ~ ., data = trainData, ntree = 100, proximity = TRUE)
table(predict(rf), trainData$Species)
The result is as follows:
The results show that, as with the decision tree, some observations of the second and third classes are still misclassified. Calling print(rf) shows that the overall error rate is 2.88%, and plot(rf) plots the error rate against the number of trees.
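The corresponding calls are simply:
print(rf)   # model summary, including the confusion matrix and overall error rate
plot(rf)    # error rate as the number of trees grows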
Finally, the random forest built on the training set is tested on the test set, and the table() and margin() functions are used to check the predictions.
irisPred <- predict(rf, newdata = testData)
table(irisPred, testData$Species)
Plot the margin of each observation; positive margins indicate correct classification.
plot(margin(rf, testData$Species))
The result is as follows:
[R language for data mining] Regression analysis
1. Linear regression
Linear regression predicts future observations with a prediction function of the form:
y = c0 + c1*x1 + c2*x2 + ... + ck*xk
where x1, x2, ..., xk are the predictor variables (the factors that affect the prediction), y is the target variable to be predicted, and c0, c1, ..., ck are the model coefficients.
The data for the linear regression model is the Australian CPI (Consumer Price Index): the quarterly figures from 2008 to 2010 are used to build the model, which is then used to predict the quarterly CPI in 2011.
In the rep() call below, the first argument is the vector of years 2008:2010, and each = 4 repeats each year four times, once for each quarter.
year <- rep(2008:2010, each = 4)
quarter <- rep(1:4, 3)
cpi <- c(162.2, 164.6, 166.5, 166.0,
         166.2, 167.0, 168.6, 169.5,
         171.0, 172.1, 173.3, 174.0)
In the plot() call, xaxt = "n" suppresses the x-axis so that a custom axis can be drawn.
plot(cpi, xaxt = "n", ylab = "CPI", xlab = "")
Draw a horizontal axis
axis(1, at = 1:12, labels = paste(year, quarter, sep = "Q"), las = 3)
Next, examine the correlation between CPI and the other variables, year and quarter.
cor(year, cpi)
cor(quarter, cpi)
The output is as follows:
cor(year, cpi)
[1] 0.9096316
cor(quarter, cpi)
[1] 0.3738028
The output shows that CPI has a strong positive correlation with year, with a correlation coefficient close to 1, while its correlation with quarter is only about 0.37, a weak positive relationship.
Then a linear regression model is built with the lm() function, using year and quarter as predictors and CPI as the target.
Create model fit
fit <- lm(cpi ~ year + quarter)
fit
The output result is as follows:
Call:
lm(formula = cpi ~ year + quarter)

Coefficients:
(Intercept)         year      quarter
  -7644.488        3.888        1.167
Based on the above output, CPI can be calculated with the following model formula:
cpi = c0 + c1*year + c2*quarter
where c0, c1, and c2 are the fitted coefficients -7644.488, 3.888, and 1.167, respectively. Therefore, the CPI for 2011 can be calculated as follows:
(cpi2011 <- fit$coefficients[[1]] + fit$coefficients[[2]] * 2011 +
   fit$coefficients[[3]] * (1:4))
The predicted quarterly CPI values for 2011 are 174.4417, 175.6083, 176.7750, and 177.9417, respectively.
The specific parameters of the model can be viewed in the following code:
View model attributes
attributes(fit)
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values"
 [6] "assign"        "qr"            "df.residual"   "xlevels"       "call"
[11] "terms"         "model"

$class
[1] "lm"
Model parameters
fit$coefficients
The differences between the observed values and the values fitted by the linear model are known as residuals.
residuals(fit)
          1           2           3           4           5           6           7 
-0.57916667  0.65416667  1.38750000 -0.27916667 -0.46666667 -0.83333333 -0.40000000 
          8           9          10          11          12 
-0.66666667  0.44583333  0.37916667  0.41250000 -0.05416667 
Instead of plugging values into the model formula by hand, you can also use predict() to forecast future values.
Specify the year and quarters to predict
data2011 <- data.frame(year = 2011, quarter = 1:4)
cpi2011 <- predict(fit, newdata = data2011)
Set the style (colour and shape) of the points for the observed and predicted values on the scatter plot
style <- c(rep(1, 12), rep(2, 4))
plot(c(cpi, cpi2011), xaxt = "n", ylab = "CPI", xlab = "", pch = style, col = style)
The sep argument of paste() sets the separator between the year and the quarter.
axis(1, at = 1:16, las = 3,
     labels = c(paste(year, quarter, sep = "Q"), "2011Q1", "2011Q2", "2011Q3", "2011Q4"))
The prediction result is as follows:
The red triangles in the figure above are the predicted values.
2. Logistic regression
Logistic regression predicts the probability of an event occurring by fitting the data to a logistic curve. A logistic regression model can be written as:
logit(y) = c0 + c1*x1 + c2*x2 + ... + ck*xk
where x1, x2, ..., xk are the predictors and y is the target to be predicted. Letting logit(y) = ln(y / (1 - y)), the equation above can be converted into:
y = 1 / (1 + exp(-(c0 + c1*x1 + c2*x2 + ... + ck*xk)))
Using the glm() function with the response variable (the explained variable) set to follow a binomial distribution (family = binomial(link = "logit")) creates a logistic regression model; a minimal sketch appears after the links below. For more information about logistic regression models, see the following links:
· R Data Analysis Examples: Logit Regression
· Logistic Regression (with R)
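A minimal hedged sketch of such a model is shown below; the data frame df and its columns are hypothetical and only illustrate the glm() call, they are not part of the original tutorial:
# hypothetical binary outcome y and a single predictor x1
df <- data.frame(y  = c(0, 0, 1, 0, 1, 0, 1, 1, 0, 1),
                 x1 = c(1.0, 1.5, 1.8, 2.0, 2.2, 2.5, 2.8, 3.0, 3.2, 3.5))
logit.fit <- glm(y ~ x1, family = binomial(link = "logit"), data = df)
summary(logit.fit)
predict(logit.fit, type = "response")   # predicted probabilities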
3. Generalized linear model
The generalized linear model (GLM) is an extension of ordinary least squares (OLS) regression. The response variable (the dependent variable of the model) can be a count, a positive value, or categorical data, and its distribution must belong to the exponential family; in addition, a link function of the expected value of the response variable is linearly related to the predictors. Therefore, when building a GLM you must specify both the distribution and the link function. The available distributions include binomial, gaussian (normal), Gamma, and poisson.
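For example, a count response would use the poisson family with a log link; the small data frame below is hypothetical and purely illustrative, not from the tutorial:
# hypothetical count data for a Poisson GLM
d <- data.frame(counts = c(2, 3, 6, 7, 8, 9, 10, 12, 15),
                x = 1:9)
pois.fit <- glm(counts ~ x, family = poisson(link = "log"), data = d)
summary(pois.fit)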
A generalized linear model can be built with the glm() function. The data used here is the bodyfat dataset from the TH.data package.
Data ("bodyfat", package = "TH. data ")
MyFormula <-DEXfat ~ Age + waistcirc + hipcirc + elbowbreadth + kneebreadth
Assume the response variable follows a Gaussian (normal) distribution and use a log link function.
bodyfat.glm <- glm(myFormula, family = gaussian(link = "log"), data = bodyfat)
Predict on the scale of the response variable
pred <- predict(bodyfat.glm, type = "response")
plot(bodyfat$DEXfat, pred, xlab = "Observed Values", ylab = "Predicted Values")
abline(a = 0, b = 1)
The prediction result test is shown in the following figure:
As shown in the figure above, although there are a few outliers, most of the points fall on or near the diagonal, which means the model fits the data reasonably well.
4. Nonlinear regression
While a linear model fits a straight line that is closest to the data points, a nonlinear model fits a curve through the data. In R, a nonlinear regression model can be built with the nls() function; use ?nls to view the function documentation.
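A hedged sketch follows; the exponential model, the simulated data, and the starting values are assumptions chosen for illustration, not from the original tutorial:
# simulate data from an exponential curve with some noise
set.seed(1)
x <- 1:10
y <- 2 * exp(0.3 * x) + rnorm(10, sd = 0.5)
# fit y = a * exp(b * x) by nonlinear least squares
fit.nls <- nls(y ~ a * exp(b * x), start = list(a = 1, b = 0.1))
summary(fit.nls)
predict(fit.nls)   # fitted values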