Cross-sectional data classification--based on R

Source: Internet
Author: User

Resources:

The book Complex Data Statistical Methods, the Internet, and R help files

Application: when the dependent variable is categorical and the independent variables include one or more categorical variables, or a categorical variable with many levels.

One.

(i) Introduction and examples

Data Source: http://archive.ics.uci.edu/ml/datasets/Cardiotocography

Independent variables: LB-FHR baseline (beats per minute)

AC-# of accelerations per second
FM-# of fetal movements per second
UC-# of uterine contractions per second
DL-# of light decelerations per second
DS-# of severe decelerations per second
DP-# of prolonged decelerations per second
Astv-percentage of time with abnormal short term variability
Mstv-mean value of short term variability
Altv-percentage of time with abnormal long term variability
Mltv-mean value of long term variability
Width-width of FHR Histogram
Min-minimum of FHR Histogram
Max-maximum of FHR Histogram
Nmax-# of Histogram peaks
Nzeros-# of histogram zeros
Mode-histogram mode
Mean-histogram Mean
Median-histogram Median
Variance-histogram Variance
Tendency-histogram Tendency
CLASS-FHR Pattern Class Code (1 to 10)

Dependent variable:

NSP-fetal state class code (N=normal; S=suspect; P=pathologic)

(ii) Generation of cross-validation datasets

1. The 10-fold cross-validation concept (Baidu Encyclopedia)

10-fold cross-validation is a common method for estimating the accuracy of an algorithm. The data set is divided into ten parts; in turn, nine of them are used as training data and the remaining one as test data. Each run yields an accuracy (or error rate), and the mean of the ten values is taken as the estimate of the algorithm's accuracy. It is usually advisable to repeat the whole procedure several times (for example, ten rounds of 10-fold cross-validation) and average the results to obtain a more stable estimate.
The choice of ten folds comes from extensive experiments with many data sets and many learning techniques, which suggest that ten is about the right number for obtaining the best error estimates; there is also some theoretical evidence for this. The question is not settled, however, and the debate continues; 5-fold and 20-fold cross-validation appear to give comparable results.

fold = function(Z = 10, w, D, seed = 7777) {
  n = nrow(w)
  d = 1:n                     # row indices
  dd = list()
  e = levels(w[, D])          # levels of the dependent variable (column D)
  T = length(e)
  set.seed(seed)
  for (i in 1:T) {            # stratify the folds by class level
    d0 = d[w[, D] == e[i]]    # rows belonging to level e[i]
    j = length(d0)
    ZT = rep(1:Z, ceiling(j / Z))[1:j]
    id = cbind(sample(ZT, length(ZT)), d0)  # random fold label for each row
    dd[[i]] = id
  }
  mm = list()
  for (i in 1:Z) {            # collect the rows assigned to each fold
    u = NULL
    for (j in 1:T) u = c(u, dd[[j]][dd[[j]][, 1] == i, 2])
    mm[[i]] = u
  }
  return(mm)                  # a list of Z vectors of row indices
}
# Read in the data
w = read.csv("CTG.NAOMIT.csv")
# Convert the last three dummy (categorical) variables to factors
F = 21:23     # column numbers of the three categorical variables (Tendency, CLASS, NSP, per the variable list above)
for (i in F) w[, i] = factor(w[, i])
D = 23        # position of the dependent variable (NSP)
Z = 10        # number of folds
n = nrow(w)   # number of rows
mm = fold(Z, w, D, 8888)
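To make the fold assignment concrete, here is a self-contained sketch of how such a list of index vectors is used to form one training/test split. It substitutes the built-in iris data and a plain (non-stratified) split so it runs without the CTG file; these stand-ins are assumptions, not part of the original code.

```r
# Sketch only: iris stands in for the CTG data frame w.
# Each element of mm holds the row indices of one fold.
w <- iris
Z <- 10
n <- nrow(w)
set.seed(8888)
mm <- split(sample(1:n), rep(1:Z, length.out = n))  # Z folds of row indices

i <- 1
test  <- w[ mm[[i]], ]   # fold i is held out as the test set
train <- w[-mm[[i]], ]   # the remaining Z - 1 folds form the training set
nrow(train) + nrow(test) == n   # every row is used exactly once
```

With the fold() function above, the split(...) line would instead be mm = fold(Z, w, D, 8888), which additionally stratifies the folds by the levels of column D.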

Two. Decision Tree Classification (Classification Tree)

library(rpart.plot)
(a = rpart(NSP ~ ., w))  # fit a decision tree to all the data and print it
rpart.plot(a, type = 2, extra = 4)
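The tree above is fit to all the data; combining it with the folds from section one gives a cross-validated error estimate. A hedged sketch, again using the built-in iris data and a plain split so it runs stand-alone (for the CTG data, substitute w, NSP ~ ., and the mm produced by fold()):

```r
library(rpart)  # rpart ships with R as a recommended package

w <- iris
Z <- 10
n <- nrow(w)
set.seed(8888)
mm <- split(sample(1:n), rep(1:Z, length.out = n))  # fold indices

err <- numeric(Z)
for (i in 1:Z) {
  a <- rpart(Species ~ ., data = w[-mm[[i]], ])         # train on Z - 1 folds
  pred <- predict(a, newdata = w[mm[[i]], ], type = "class")
  err[i] <- mean(pred != w[mm[[i]], "Species"])         # error rate on fold i
}
mean(err)  # average misclassification rate over the Z folds
```

Averaging the per-fold error rates is exactly the accuracy estimate described in the cross-validation section above.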

rpart.plot parameter explanation:

x:

An rpart object. The only required argument.

type:

Type of plot. Five possibilities:

0 The default. Draw a split label at each split and a node label at each leaf.

1 Label all nodes, not just leaves. Similar to text.rpart's all=TRUE.

2 Like 1 but draw the split labels below the node labels. Similar to the plots in the CART book.

3 Draw separate split labels for the left and right directions.

4 Like 3 but label all nodes, not just leaves. Similar to text.rpart's fancy=TRUE. See also clip.right.labs.

extra:

Display extra information at the nodes. Possible values:

0 No extra information (the default).

1 Display the number of observations that fall in the node (per class for class objects; prefixed by the number of events for Poisson and exp models). Similar to text.rpart's use.n=TRUE.

2 Class models: display the classification rate at the node, expressed as the number of correct classifications and the number of observations in the node. Poisson and exp models: display the number of events.

3 Class models: misclassification rate at the node, expressed as the number of incorrect classifications and the number of observations in the node.

4 Class models: probability per class of observations in the node (conditioned on the node; the sum across a node is 1).

5 Class models: like 4 but do not display the fitted class.

6 Class models: the probability of the second class only. Useful for binary responses.

7 Class models: like 6 but do not display the fitted class.

8 Class models: the probability of the fitted class.

9 Class models: the probabilities times the fraction of observations in the node (the probability relative to all observations; the sum across all leaves is 1).

branch:

Controls the shape of the branch lines. Specify a value between 0 (V-shaped branches) and 1 (square-shouldered branches). Default is if (fallen.leaves) 1 else .2.

[Figures omitted: the same tree drawn with branch = 0 and with branch = 1]
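The two branch settings can be compared with a minimal sketch (assumes the rpart.plot package is installed; a tree fit to the built-in iris data stands in for the CTG model):

```r
library(rpart.plot)                   # also loads rpart
a <- rpart(Species ~ ., data = iris)  # stand-in for the model a fit above
rpart.plot(a, branch = 0)             # V-shaped branch lines
rpart.plot(a, branch = 1)             # square-shouldered branch lines
```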

digits:

The number of significant digits in displayed numbers. Default 2.

rpart.plot(a, extra = 4, digits = 4)
