Abstract: The aim of this experiment is to learn and master the classification and regression tree (CART) algorithm. CART provides a general tree-growing framework that can be instantiated into a variety of different decision trees. The CART algorithm uses binary recursive partitioning to divide the current sample set into two subsets, so that every non-leaf node of the resulting decision tree has exactly two branches. The decision tree generated by the CART algorithm is therefore a binary tree with a simple structure. The program is implemented on the MATLAB platform; it realizes the creation, application, and approximate pruning of an unpruned complete binary tree, and the algorithm is also extended to multi-way trees.
First, technical discussion
1. Non-metric methods
Most of the pattern classification algorithms studied so far rely on some measure of distance (a distance metric) between samples or feature vectors. The most typical example is the nearest-neighbor classifier, for which the notion of distance is the fundamental idea. In a neural network, if two input vectors are sufficiently similar, their outputs are similar as well. In general, because the feature vectors in such problems consist of real-valued data, the concept of distance arises naturally.
In the real world there is another class of classification problems that uses "semantic data" (nominal data), also called nominal or categorical attributes. Such data are usually discrete and carry no concept of similarity, or even of ordering. A simple example is given below:
Consider classifying fish and marine mammals using information about their teeth. Some of these animals have fine, delicate structures used to sift tiny plankton from the sea (the baleen of baleen whales), others have rows of teeth (sharks); some marine animals, such as walruses, have long tusks, while others, such as squid, lack teeth altogether. There is no clear notion of similarity or distance between kinds of teeth: the "teeth" of baleen whales and the tusks of walruses are no more alike than those of sharks and squid. The question addressed in this experiment is how to classify patterns described not by real-valued vectors but by such non-metric (nonmetric) semantic attributes.
2. Decision Tree
Using a series of question-and-answer queries to judge and classify a pattern is a natural and intuitive approach. The question asked next depends on the answer to the previous question. This "questionnaire" approach is particularly effective for non-metric data, because the answers take the form "yes/no", "true/false", or "value of attribute", and no concept of distance is involved. Such problems can be expressed in the form of a decision tree. Structurally, the first node of the tree, called the root node, sits at the top and is connected by ordered branches to its successor nodes. The construction continues in this way until nodes with no successors, the leaf nodes, are reached; the result is a decision tree.
Classification starts at the root node, which asks for the value of a particular attribute of the pattern. The branches leaving the root node correspond to the different possible values of this attribute, and the pattern is directed to the corresponding child node according to the answer. It is important that the branches be mutually exclusive and that together they cover the entire space of possible values.
The child node reached is then treated as a new root node and the same kind of branching decision is made there. This process continues until a leaf node is reached. Each leaf node carries a category label, and the test sample is assigned the label of the leaf node at which it arrives.
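As a minimal sketch of this procedure, the following MATLAB fragment classifies one sample by walking a tree stored as nested structs. The function name classify_with_tree and the field names node.label (attribute index queried), node.value (threshold, or the class label at a leaf), node.left and node.right are assumptions chosen here only to mirror the structure used by the program later in this report.

% Minimal sketch: classify one sample x with a binary decision tree
% stored as nested structs (field names are an assumption mirroring
% the program given in the code section below).
function c = classify_with_tree(node, x)
    if isempty(node.left) && isempty(node.right)
        c = node.value;                        % leaf: value holds the class label
        return;
    end
    if x(node.label) > node.value              % query one attribute of the pattern
        c = classify_with_tree(node.right, x);
    else
        c = classify_with_tree(node.left, x);
    end
end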
One advantage of decision trees over other classifiers (such as neural networks) is that the semantic information embodied in the tree can be expressed directly as logical expressions. Another advantage is that classification is fast, since only a short sequence of simple queries is needed. Finally, the tree provides a natural mechanism for embedding the prior knowledge of human experts, which in practice is often very effective when the problem is simple and the training set is small.
3. Classification and regression tree (CART) algorithm
Combining the above concepts, we now discuss the question of constructing, or "growing", a decision tree from the training samples. Suppose we are given a training set D with category labels and a set of attributes describing the patterns. The task of the tree is to split the training samples step by step into smaller and smaller subsets. In the ideal case all samples in a subset carry the same category label; such a subset is called "pure", and the tree need not split it further. In general, however, the category labels in a subset are mixed, and we must perform one of two actions: either accept the imperfect decision and stop splitting, or select another attribute and grow the tree further. This is the recursive structure of the tree-growing process.
From the data-structure point of view, the same decision appears at every node: either the node is already a leaf (it carries a definite category label), or another attribute is used to split it further into child nodes. Classification and regression trees (CART) are the most general and widely used tree-growing method; CART provides a general framework that the user can instantiate into a variety of different decision trees.
3.1 The number of branches in the CART algorithm
A single decision made at a node is called a branch (split); it divides the training samples into subsets. The split at the root node divides the entire training set, and each subsequent decision splits one of its subsets. In general, the number of branches leaving a node is decided by the designer of the tree and may take different values within the same tree. The number of branches leaving a node is sometimes called its branching factor (branching ratio), denoted B. An important fact is that every multi-valued decision can be represented by a sequence of binary (yes/no) decisions; binary trees are therefore widely used because of this universality and their convenient structure.
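As an illustration (not taken from the original program), the MATLAB sketch below shows how a three-way query on the second attribute of the data used later in this report, whose values are the letters e, f, g, can be replaced by a cascade of two yes/no questions, so that a branching factor B = 3 becomes two binary splits. The comparison of character codes mirrors the report's preprocessing, where strings are converted to numbers with W + 0.

% A 3-way query "is attribute 2 equal to e, f or g?" expressed with two
% binary questions (values are compared as numeric character codes).
x = 'bfilm' + 0;           % one sample, converted to numeric codes
if x(2) <= ('e' + 0)       % first binary question: attribute 2 <= e ?
    branch = 1;            % value e
elseif x(2) <= ('f' + 0)   % second binary question: attribute 2 <= f ?
    branch = 2;            % value f
else
    branch = 3;            % value g
end
disp(branch)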
3.2 Query selection and node impurity in the CART algorithm
A key point in designing a decision tree is deciding which attribute should be tested or queried at each node. For numerical data, the classification boundaries obtained by the decision-tree method have a fairly intuitive geometric interpretation, whereas for non-numerical data the process of querying and splitting the data at a node has no direct geometric interpretation.
A basic principle in growing a tree is simplicity: the decision tree we hope to obtain should be simple and compact, with few nodes. Toward this goal, at each node we try to find a query T that makes the data reaching the successor nodes as "pure" as possible. For this we need to define an impurity measure. Let i(N) denote the impurity of node N; i(N) = 0 when all patterns at the node come from the same class, and i(N) should be large when the class labels are evenly distributed. The most popular measure is the entropy impurity (also called information impurity):

i(N) = − Σ_j P(ω_j) log2 P(ω_j)
where P(ω_j) is the fraction of patterns at node N that belong to class ω_j. By the properties of entropy, if all pattern samples come from the same class the impurity is zero; otherwise it is positive, and it reaches its maximum value when, and only when, all classes appear with equal probability. Several other common impurity definitions are as follows:

Variance impurity: based on the idea that the impurity should be zero when all samples at a node come from a single class, a polynomial form is used for the two-class case whose value is related to the variance of the class distribution:

i(N) = P(ω_1) P(ω_2)

Gini impurity: the generalization of the variance impurity to multi-class problems (it also equals the expected error rate at node N when the category label is chosen at random according to the class distribution at the node):

i(N) = Σ_{i≠j} P(ω_i) P(ω_j) = (1/2) [ 1 − Σ_j P(ω_j)² ]

When the class probabilities are equal, the Gini impurity is more strongly peaked than the entropy impurity.

Misclassification impurity: measures the minimum probability that a training sample at node N would be misclassified:

i(N) = 1 − max_j P(ω_j)

Of the impurity measures discussed here, this one has the sharpest peak at equal class probabilities. However, its derivative is discontinuous, which causes problems when searching for an optimal split over a continuous parameter space.
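All of these measures can be computed directly from the class proportions at a node. The following MATLAB sketch is illustrative only (the function name node_impurity and its interface are not part of the original program); it evaluates the four measures for a vector of class probabilities:

% Illustrative helper: impurity measures for a probability vector P
% (P sums to 1; the variance impurity is only defined for two classes).
function I = node_impurity(P)
    Pnz = P(P > 0);                          % drop zero entries to avoid log2(0)
    I.entropy  = -sum(Pnz .* log2(Pnz));     % entropy impurity
    I.gini     = 0.5 * (1 - sum(P.^2));      % Gini impurity
    I.misclass = 1 - max(P);                 % misclassification impurity
    if numel(P) == 2
        I.variance = P(1) * P(2);            % variance impurity (two-class case)
    end
end

% Example: a node whose three classes appear with proportions [0.5 0.25 0.25]
% I = node_impurity([0.5 0.25 0.25])  % entropy = 1.5 bits, gini = 0.3125, misclass = 0.5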
Given an impurity measure, another key question arises: for a tree that has grown to node N, which attribute query T should be used at that node, and which value s should be queried? A natural strategy is to choose the query that makes the impurity drop fastest. The impurity drop can be written as

Δi(N) = i(N) − P_L · i(N_L) − (1 − P_L) · i(N_R)
where N_L and N_R denote the left and right child nodes, i(N_L) and i(N_R) are their impurities, and P_L is the fraction of patterns at node N that go to N_L when query T is applied. If the entropy impurity is used, the impurity drop is exactly the information gain provided by the query. Since each query of a binary tree yields only a yes/no answer, the entropy reduction produced by a single split cannot exceed 1 bit.
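The search for the best split can be written compactly. The sketch below is illustrative only (best_binary_split and entropy_of are names introduced here, not part of the report's program); it scans every candidate threshold of every attribute and keeps the one with the largest entropy-impurity drop, which is essentially the criterion implemented by the training function in the code section.

% Illustrative exhaustive search for the split with the largest entropy-impurity drop.
% X: n-by-d numeric samples, y: n-by-1 class labels, Region: d-by-2 value ranges.
function [bestDim, bestThr, bestDrop] = best_binary_split(X, y, Region)
    classes = unique(y);
    iN = entropy_of(y, classes);               % entropy impurity i(N) of the node
    bestDrop = -Inf; bestDim = 0; bestThr = 0;
    for k = 1:size(X, 2)
        for thr = Region(k,1):Region(k,2)-1    % candidate thresholds of dimension k
            goRight = X(:,k) > thr;
            if all(goRight) || ~any(goRight)
                continue;                      % degenerate split, skip it
            end
            pL = mean(~goRight);               % P_L, fraction of samples going left
            drop = iN - pL * entropy_of(y(~goRight), classes) ...
                      - (1 - pL) * entropy_of(y(goRight), classes);
            if drop > bestDrop
                bestDrop = drop; bestDim = k; bestThr = thr;
            end
        end
    end
end

function h = entropy_of(y, classes)
    p = arrayfun(@(c) mean(y == c), classes);  % class proportions in this subset
    h = -sum(p(p > 0) .* log2(p(p > 0)));
end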
Second, the experimental steps
Training data (each sample has five nominal attributes, taking values in a–d, e–g, h–j, k–l, and m–n respectively):

Class ω1: aehkm, beilm, agiln, bghkm, agilm
Class ω2: bfilm, bfjln, beiln, cgjkn, cgjlm, dgjkm, bdilm
Class ω3: dehkn, aehkn, dehln, dfjln, afhkn, dejlm, cfjlm, dfhlm
Write a general program that generates a binary classification tree and train it with the data in the table above, branching with entropy impurity during training. Use the treeplot function to draw the resulting decision binary tree from the decision conditions obtained in the process (a minimal treeplot example is sketched after these steps).
Use the unpruned complete tree trained by the above program to classify the following patterns:
{a,e,i,l,n}, {d,e,j,k,n}, {b,f,j,k,m}, {c,d,j,l,n}
Using the unpruned complete tree trained by the above program, select leaf nodes to prune so that the increase in the entropy impurity of the pruned tree is as small as possible.
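The treeplot function mentioned in the first step takes a vector of parent indices: element k gives the index of node k's parent, with 0 marking the root. A minimal, self-contained example:

% Minimal treeplot example: a root with two children, each of which
% in turn has two leaf children (parent-index vector, 0 marks the root).
p = [0 1 1 2 2 3 3];
treeplot(p)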
Third, the experimental results
Classification with the unpruned tree:

Classification with the pruned tree:
Fourth, MATLAB code
Main script:
clear all; clc; close all;

%% Data preprocessing
% Training samples
w1 = ['aehkm'; 'beilm'; 'agiln'; 'bghkm'; 'agilm'];
w2 = ['bfilm'; 'bfjln'; 'beiln'; 'cgjkn'; 'cgjlm'; 'dgjkm'; 'bdilm'];
w3 = ['dehkn'; 'aehkn'; 'dehln'; 'dfjln'; 'afhkn'; 'dejlm'; 'cfjlm'; 'dfhlm'];
W = [w1; w2; w3];
C = [ones(5,1); 2*ones(7,1); 3*ones(8,1)];   % category labels
% Value range of each attribute dimension
Region = ['ad'; 'eg'; 'hj'; 'kl'; 'mn'];
% Test samples
T1 = 'aeiln'; T2 = 'dejkn'; T3 = 'bfjkm'; T4 = 'cdjlm';
% Convert the character matrices to matrices of natural numbers for later processing
W = W + 0;            % equivalent to abs(W)
Region = abs(Region);
global Tree;
global Flag;

%% Unpruned complete binary tree
% Build the binary tree and train it with the samples
Tree = Cart_MakeBinarySortTree(W, C, Region)
p = [0 1 2 3 4 4 3 2 8 9 10 10 9 8 1 15 15];   % parent-index vector for treeplot
treeplot(p)                                     % draw the binary tree
% Classify with the tree
w1 = Cart_UseBinarySortTree(Tree, T1)
w2 = Cart_UseBinarySortTree(Tree, T2)
w3 = Cart_UseBinarySortTree(Tree, T3)
w4 = Cart_UseBinarySortTree(Tree, T4)

%% Classification with pruning
w1 = Cart_PruningBinarySortTree(Tree, T1)
w2 = Cart_PruningBinarySortTree(Tree, T2)
w3 = Cart_PruningBinarySortTree(Tree, T3)
w4 = Cart_PruningBinarySortTree(Tree, T4)

%% Multi-way tree variants (not shown here)
% AnyTree  = Cart_MakeAnyTree(W, C, Region)
% MultiTree = Cart_MakeMultiTree(W, C, Region)
function Tree = Cart_MakeBinarySortTree(Train_Samples, TrainingTargets, Region)
% Recursively builds an unpruned complete binary tree using entropy impurity.
% Input:
%   Train_Samples:   n d-dimensional training samples, an (n x d) matrix
%   TrainingTargets: the corresponding class labels, an (n x 1) matrix
%   Region:          lower/upper bound of each feature dimension, a (d x 2) matrix
%                    (feature values are discrete natural numbers; left column small,
%                     right column large)
% Output: the root node of the tree.
% Node structure:
%   label: dimension used for the decision at this node (empty for a leaf)
%   value: threshold on that dimension (for a leaf, the class label)
%   left:  branch for values <= threshold (empty for a leaf)
%   right: branch for values >  threshold (empty for a leaf)
%   num:   number of training samples that reached this node

[n, Dim] = size(Train_Samples);
[t, m] = size(Region);
if Dim ~= t || m ~= 2
    disp('Parameter error, please check');
    return;
end

% If all samples share one class label, the current node is a leaf
if length(unique(TrainingTargets)) == 1
    Tree.label = [];
    Tree.value = TrainingTargets(1);
    Tree.right = [];                 % no left or right child nodes
    Tree.left  = [];
    Tree.num   = n;
    return;
end

% If only two samples from two classes remain, set them directly as the left and
% right leaves, using the dimension with the largest difference as the query
% (handled separately as an optimization of this degenerate case)
if length(TrainingTargets) == 2
    [M, p] = max(abs(Train_Samples(1,:) - Train_Samples(2,:)));
    Tree.label = p;
    Tree.value = (Train_Samples(1,p) + Train_Samples(2,p)) / 2;
    Tree.num   = n;
    BranchRight.right = []; BranchRight.left = [];
    BranchRight.label = []; BranchRight.num  = 1;
    BranchLeft.right  = []; BranchLeft.left  = [];
    BranchLeft.label  = []; BranchLeft.num   = 1;
    if Train_Samples(1,p) > Tree.value
        BranchRight.value = TrainingTargets(1);
        BranchLeft.value  = TrainingTargets(2);
    else
        BranchRight.value = TrainingTargets(2);
        BranchLeft.value  = TrainingTargets(1);
    end
    Tree.right = BranchRight;
    Tree.left  = BranchLeft;
    return;
end

% Determine the label of the node (the dimension used for the decision): choose the
% dimension and threshold that give the largest drop in entropy impurity, examining
% every candidate threshold of every dimension in turn.
DVP = zeros(Dim, 2);      % best impurity value and corresponding threshold per dimension
for k = 1:Dim
    EI  = -Inf * ones(Region(k,2) - Region(k,1) + 1, 1);
    iEI = 0;
    for m = Region(k,1):Region(k,2)
        iEI = iEI + 1;
        % Tentative split: samples going right are marked 1
        CpI = Train_Samples(:,k) > m;
        SumCpI = sum(CpI);
        if SumCpI == n || SumCpI == 0
            continue;     % everything goes to one side; try the next threshold
        end
        CpI = [not(CpI), CpI];
        EIt = zeros(2,1);
        % Class proportions of the left and right branches, then their entropy terms
        for j = 1:2
            Cpt = TrainingTargets(CpI(:,j));
            if length(unique(Cpt)) == 1
                Pw = 0;   % hist() misbehaves when all elements are identical
            else
                Pw = hist(Cpt, unique(Cpt));
                Pw = Pw / length(Cpt);        % class proportions
                Pw = Pw .* log2(Pw);
            end
            EIt(j) = sum(Pw);
        end
        Pr = length(Cpt) / n;                 % fraction of samples going right
        EI(iEI) = EIt(1) * (1 - Pr) + EIt(2) * Pr;
    end
    [MaxEI, p] = max(EI);
    nMaxEI = sum(EI == MaxEI);
    if nMaxEI > 1
        % If the maximum occurs several times, take the middle one
        % (slightly better than always taking the first maximum)
        t = find(EI == MaxEI);
        p = t(round(nMaxEI / 2));
    end
    DVP(k,1) = MaxEI;
    DVP(k,2) = Region(k,1) + p - 1;
end

% Update the node label and threshold
[MaxDV, p] = max(DVP(:,1));
nMaxDV = sum(DVP(:,1) == MaxDV);
if nMaxDV > 1
    % If several dimensions tie, prefer the one with the smaller value range
    % (slightly better than always taking the first maximum)
    t = find(DVP(:,1) == MaxDV);
    [D, p] = min(Region(t,2) - Region(t,1));
    p = t(p);
end
Tree.label = p;
Tree.value = DVP(p,2);

% Split the training samples into two subsets and build the left and right
% child nodes by recursive calls
CprI = Train_Samples(:,p) > DVP(p,2);
CplI = not(CprI);
Tree.num   = n;
Tree.right = Cart_MakeBinarySortTree(Train_Samples(CprI,:), TrainingTargets(CprI), Region);
Tree.left  = Cart_MakeBinarySortTree(Train_Samples(CplI,:), TrainingTargets(CplI), Region);
function w = Cart_UseBinarySortTree(Tree, TestSample)
% For the binary tree generated by Cart_MakeBinarySortTree(), return the class
% label w of the test sample.
% A node with neither a left nor a right child is a leaf: return its class label
if isempty(Tree.right) && isempty(Tree.left)
    w = Tree.value;
    return;
end
% Decision process at a non-leaf node
if TestSample(Tree.label) > Tree.value
    w = Cart_UseBinarySortTree(Tree.right, TestSample);
else
    w = Cart_UseBinarySortTree(Tree.left, TestSample);
end
function w = Cart_PruningBinarySortTree(Tree, Sample)
% For the binary tree generated by Cart_MakeBinarySortTree(), traverse the tree
% according to the sample Sample. If a leaf shares its parent with another branch
% and the sample counts of the two siblings satisfy the conditions below, the parent
% is treated as a leaf and the class is decided by majority. Because the program does
% not actually modify the binary tree in MATLAB, the tree is not really pruned; only
% the effect of pruning on classification is reproduced.
% When a leaf is reached, its class label is returned.
TempLeft  = Tree.left;
TempRight = Tree.right;
LeftEmpty  = isempty(TempLeft.right)  && isempty(TempLeft.left);
RightEmpty = isempty(TempRight.right) && isempty(TempRight.left);
if LeftEmpty && RightEmpty
    % Parent with two leaf children: prune and keep the majority class
    if TempLeft.num > TempRight.num
        w = TempLeft.value;
    else
        w = TempRight.value;
    end
    return;
elseif LeftEmpty || RightEmpty
    % Parent with one leaf and one subtree: compare sizes and keep the majority
    if LeftEmpty && (TempLeft.num > TempRight.num / 3)
        w = TempLeft.value;
        return;
    elseif RightEmpty && (TempRight.num > TempLeft.num / 3)
        w = TempRight.value;
        return;
    end
end
if Sample(Tree.label) > Tree.value
    if RightEmpty
        w = Tree.right.value;
        return;
    else
        w = Cart_PruningBinarySortTree(Tree.right, Sample);
    end
else
    if LeftEmpty
        w = Tree.left.value;
        return;
    else
        w = Cart_PruningBinarySortTree(Tree.left, Sample);
    end
end
Reference: Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification.