Using the R Language for Data Mining (Part 3)

Source: Internet
Author: User
Tags: random seed

Decision Tree and Random Forest

I. Experiment Description

1. Environment Login

The environment logs in automatically without a password; the system user name is Shiyanlou and the password is Shiyanlou.

2. Introduction to the Environment

This experiment uses an Ubuntu Linux environment with a desktop. The experiment will use the following programs:

1. LX Terminal (lxterminal): a Linux command-line terminal. Opening it enters the bash environment, where you can run Linux commands.
2. GVim: a very useful editor. For the simplest usage, refer to the course "Vim Editor".
3. R: enter 'R' in the command-line terminal to start the interactive R environment. The code below runs in this interactive environment.

3. Using the Environment

Use the interactive R environment to enter the code and files required for the experiment, and use the LX Terminal (lxterminal) to run the required commands.

After completing the experiment, you can click "Experiment" at the top of the desktop to save the results and share them to Weibo to show friends your learning progress. The lab platform provides a back-end system that can verify that you have truly completed the experiment.

The experiment records page can be viewed on the My Home page. It contains each experiment and its notes, as well as the effective learning time of each experiment (the time spent operating the experiment desktop; if there is no activity, the system records it as idle time). These records are the proof of the authenticity of your studies.

II. Course Introduction

This lesson covers building predictive models with the packages 'party', 'rpart', and 'randomForest'. First, the 'party' package is used to build a decision tree and classify with it. Then the 'rpart' package is used to build another decision tree. Finally, an example uses the 'randomForest' package to train a random forest model.

III. Course Content

1. Building a Decision Tree with the 'party' Package

This section shows how to use the ctree() function in the 'party' package to build a decision tree for the 'iris' dataset. The attributes Sepal.Length (sepal length), Sepal.Width (sepal width), Petal.Length (petal length), and Petal.Width (petal width) are used to predict the Species of iris. In this package, ctree() builds the decision tree and predict() makes predictions on another dataset.

Before the model is built, the iris dataset is split into two subsets: a training set (70%) and a test set (30%). Setting a fixed random seed makes the random split reproducible.

# Observe the structure of the iris dataset
> str(iris)
# Set the random seed to 1234
> set.seed(1234)
# Use the sample function to split the observations into two subsets
> ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
# The first part of the sample is the training set
> trainData <- iris[ind==1,]
# The second part of the sample is the test set
> testData <- iris[ind==2,]

Load the package 'party', build a decision tree, and check the predictions. The function ctree() provides parameters such as minsplit, minbucket, maxsurrogate, and maxdepth to control the training of the decision tree. Below we use the default parameter settings to build the decision tree; the available settings can be found in the package documentation (?ctree). In the following code, Species in the formula myFormula is the target variable and the other variables are independent variables.

> library(party)
# The symbol '~' connects the left and right sides of a formula
> myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
# Build the decision tree
> iris_ctree <- ctree(myFormula, data=trainData)
# Check the predictions
> table(predict(iris_ctree), trainData$Species)

The results appear as follows:

As can be seen, all 40 observations of setosa are predicted correctly, while one versicolor observation is misclassified as virginica, and three virginica observations are misclassified as versicolor.
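To make this concrete, the training accuracy can be read directly off the confusion table. This is a minimal sketch and not part of the original walk-through; the variable name train.tab is introduced here only for illustration.

# A hedged sketch: overall training accuracy computed from the confusion table
> train.tab <- table(predict(iris_ctree), trainData$Species)
> sum(diag(train.tab)) / sum(train.tab)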

# Print the decision tree
> print(iris_ctree)
# Plot the decision tree (see Figure 4.1)
> plot(iris_ctree)
# Plot a simplified decision tree diagram (see Figure 4.2)
> plot(iris_ctree, type="simple")

Figure 4.1

Figure 4.2

In Figure 4.1, the bar plot at each leaf node shows the probability of an observation falling into each of the three species. In Figure 4.2, these probabilities are shown as the y values of each leaf node. For example, the label of node 2 is "n = 40, y = (1, 0, 0)", which means that this node contains 40 observations and all of them belong to the first class, setosa.
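If you would rather read these probabilities from the console than off the plot, predict() in 'party' accepts type = "prob". This is a small sketch added here for illustration, not part of the original code.

# A hedged sketch: class probabilities for the first few training observations
> head(predict(iris_ctree, type="prob"))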

Next, you need to test the decision tree with a test set.

# Test the decision tree on the test set
> testPred <- predict(iris_ctree, newdata = testData)
> table(testPred, testData$Species)

The results are as follows:

The results indicate that the decision tree still confuses versicolor and virginica on the test set. The current version of ctree() does not handle ambiguous or missing attribute values well: in such cases an observation may sometimes be assigned to the left subtree and sometimes to the right.
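As an aside, the control parameters mentioned earlier (minsplit, minbucket, maxdepth, and so on) can be passed to ctree() through ctree_control(). The following is a minimal sketch with illustrative values only, not a tuned model from the original course.

# A hedged sketch: build a tree with non-default controls; the values 20, 7, and 3 are illustrative
> iris_ctree2 <- ctree(myFormula, data=trainData,
+                      controls=ctree_control(minsplit=20, minbucket=7, maxdepth=3))
> table(predict(iris_ctree2, newdata=testData), testData$Species)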

2. Building a Decision Tree with the 'rpart' Package

This section uses the 'rpart' package to build a decision tree on the 'bodyfat' dataset. The function rpart() builds the decision tree, and the tree with the minimum prediction error can then be selected. The decision tree is then used to make predictions on another dataset with predict().

First, load the ' bodyfat ' dataset and look at some of its properties.

> Data ("bodyfat", package = "Mboost") > Dim (bodyfat) > Attributes (bodyfat) > Bodyfat[1:5,]

As in Section 1, the dataset is divided into a training set and a test set, and a decision tree is built on the training set.

> set.seed(1234)
> ind <- sample(2, nrow(bodyfat), replace=TRUE, prob=c(0.7, 0.3))
> bodyfat.train <- bodyfat[ind==1,]
> bodyfat.test <- bodyfat[ind==2,]
> library(rpart)
# Write the formula myFormula
> myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth
# Train the decision tree
> bodyfat_rpart <- rpart(myFormula, data = bodyfat.train,
+                        control = rpart.control(minsplit = 10))
# Plot the decision tree
> plot(bodyfat_rpart)
# Add text labels
> text(bodyfat_rpart, use.n=TRUE)

The results are as follows:

Figure 4.3

Select the complexity parameter that gives the minimum cross-validated prediction error and use it to prune (optimize) the tree.

> opt <- which.min(bodyfat_rpart$cptable[, "xerror"])
> cp <- bodyfat_rpart$cptable[opt, "CP"]
> bodyfat_prune <- prune(bodyfat_rpart, cp = cp)
> plot(bodyfat_prune)
> text(bodyfat_prune, use.n=TRUE)

The optimized decision tree is as follows:



Figure 4.4
Comparing Figure 4.3 with Figure 4.4, you will find that the optimization removes the hipcirc < 99.5 split, perhaps because that level of detail is unnecessary. It is worth considering whether a tree selected for minimum prediction error really needs such fine-grained splits.
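To see where the minimum-error choice comes from, the complexity parameter table that which.min() searched over can be printed. This sketch is an addition for inspection purposes, not part of the original code.

# A hedged sketch: inspect the complexity table; the 'xerror' column holds the
# cross-validated error for each subtree size
> printcp(bodyfat_rpart)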

After that, the optimized decision tree is used to make predictions, and the predicted values are compared with the actual values. In the following code, the function abline() draws a diagonal line. The predictions of a good model should be close to the real values, meaning that most points should fall on or near the diagonal.

# Predict on the test set
> DEXfat_pred <- predict(bodyfat_prune, newdata=bodyfat.test)
# The range of the DEXfat values, used for the axis limits
> xlim <- range(bodyfat$DEXfat)
> plot(DEXfat_pred ~ DEXfat, data=bodyfat.test, xlab="Observed",
+      ylab="Predicted", ylim=xlim, xlim=xlim)
> abline(a=0, b=1)

The drawing results are as follows:
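Beyond eyeballing the scatter plot, the fit can also be summarized numerically. This is a minimal sketch added for illustration, not part of the original walk-through.

# A hedged sketch: root mean squared error and correlation between predicted and observed values
> sqrt(mean((DEXfat_pred - bodyfat.test$DEXfat)^2))
> cor(DEXfat_pred, bodyfat.test$DEXfat)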

3. Random Forest

This section uses the 'randomForest' package with the iris data to build a predictive model. The randomForest() function has two limitations. First, it cannot handle missing values, so the user must impute them before calling the function. Second, each categorical attribute can have at most 32 levels; attributes with more than 32 levels must be transformed before they are passed to randomForest().
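As a hedged illustration of the first limitation, the randomForest package provides a simple imputation helper, na.roughfix(), which fills numeric NAs with column medians and factor NAs with the most frequent level. The sketch below injects a missing value only for demonstration; it is not part of the original course code.

# A hedged sketch: impute missing values before training (the NA here is artificial)
> library(randomForest)
> iris.na <- iris
> iris.na[1, "Sepal.Length"] <- NA
> iris.fixed <- na.roughfix(iris.na)
> sum(is.na(iris.fixed))   # should be 0 after imputation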

A random forest can also be built with the cforest() function from the 'party' package, which is not constrained by the maximum number of levels of a categorical attribute, although high-dimensional categorical attributes make it consume a lot of memory and time when building a random forest.
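A minimal sketch of that alternative, assuming a train/test split like the one in the next code block; the ntree and mtry values are illustrative only and not from the original course.

# A hedged sketch: a conditional-inference forest via party::cforest(); settings are illustrative
> library(party)
> cf <- cforest(Species ~ ., data=trainData,
+               controls=cforest_unbiased(ntree=50, mtry=2))
> table(predict(cf), trainData$Species)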

> ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
> trainData <- iris[ind==1,]
> testData <- iris[ind==2,]
> library(randomForest)
# 'Species ~ .' is the formula relating Species to all the other attributes
> rf <- randomForest(Species ~ ., data=trainData, ntree=100, proximity=TRUE)
> table(predict(rf), trainData$Species)

The results are as follows:

The result shows that errors remain even on the training data: the second and third classes (versicolor and virginica) are still occasionally misclassified. Entering print(rf) shows an error rate of 2.88%, and plot(rf) plots the error rates of the forest as trees are added.
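The two commands mentioned above, as they would be typed (a minimal sketch):

# Print the model summary, including the OOB error rate and the confusion matrix
> print(rf)
# Plot the error rates against the number of trees grown
> plot(rf)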

Finally, the random forest built on the training set is tested on the test set, and the prediction results are checked with the table() and margin() functions.

> irisPred <- predict(rf, newdata=testData)
> table(irisPred, testData$Species)
# Plot the margin of each observation; a positive margin means a correct classification
> plot(margin(rf, testData$Species))

The results appear as follows:
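For a single summary number, the test-set accuracy can be computed from the same confusion table. This is a small sketch added for illustration; the variable name test.tab is not in the original code.

# A hedged sketch: overall accuracy on the test set
> test.tab <- table(irisPred, testData$Species)
> sum(diag(test.tab)) / sum(test.tab)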

Exercise: think about the advantages and disadvantages of random forests and decision trees as classification methods.
