I. Overview of Decision Tree Classification Algorithms

A decision tree algorithm classifies data by its attributes (features), using the attributes as the basis for dividing the samples into different classes. Consider the following dataset: the first and second columns are attributes (features) and the last column is the class label, where 1 means "yes" and 0 means "no".

(Dataset)

The idea of the decision tree algorithm is to split the data according to its attributes. For the data above we can build the following decision tree model: the data is first partitioned on the first attribute, and the resulting partitions are then split on the second attribute.

(Decision tree model)

Decision tree algorithms include ID3, C4.5, and CART. Below we introduce the ID3 algorithm.

II. Overview of the ID3 Algorithm

The ID3 algorithm was first proposed by Quinlan. It is based on information theory, using information entropy and information gain as its measures, and applies them to induce a classification of the data.

First, ID3 must decide which feature to use as the criterion for splitting the dataset. At each step, ID3 selects the attribute with the largest information gain as the feature on which the current dataset is split (the concept of information gain is described below), and the dataset is divided repeatedly by selecting such features.

Second, ID3 must decide when to stop splitting. There are two cases. In the first case, all samples in a partition belong to the same class. For example, the leftmost "non-fish" leaf contains the data from the 5th and 6th rows of the dataset, and the rightmost "fish" leaf contains the 2nd and 3rd rows. In the second case, no attributes remain on which to split further. In either case the splitting ends. By iterating this process we obtain the decision tree model.

(Basic process of the ID3 algorithm)

III. Data Division in the ID3 Algorithm

ID3 is a classification algorithm that measures information entropy and information gain.

1. Entropy. Entropy measures the degree of disorder of the information: the greater the uncertainty of a variable, the larger its entropy. For a sample set $S$ with $k$ classes, the entropy can be expressed as

$$H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i,$$

where $p_i$ is the probability that the $i$-th class appears in the sample. For example, a partition containing 3 positive and 2 negative samples has entropy $-(3/5)\log_2(3/5) - (2/5)\log_2(2/5) \approx 0.971$.

2. Information gain. Information gain is the change in entropy before and after a split. It can be expressed by the following formula:

$$Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v),$$

where $A$ is an attribute of the samples, $Values(A)$ is the set of all values that attribute $A$ can take, and $S_v$ is the subset of samples for which attribute $A$ has the value $v$.

IV. Experiment Simulation

1. Data preprocessing. The following data is used as the example on which the ID3 algorithm is implemented.

(Data table)

2. Experiment results

(Original data) (Partition 1) (Partition 2) (Partition 3) (Final decision tree)

MATLAB code: main program
% Demo: decision tree
% ID3
% Import the data
% DATA = [, 1;, 1;, 0;, 0;, 0];
data = [0, 2, 0, 0, 0;
        0, 2, 0, 1, 0;
        1, 2, 0, 0, 1;
        2, 1, 0, 0, 1;
        2, 0, 1, 0, 1;
        2, 0, 1, 1, 0;
        1, 0, 1, 1, 1;
        0, 1, 0, 0, 0;
        0, 0, 1, 0, 1;
        2, 1, 1, 0, 1;
        0, 1, 1, 1, 1;
        1, 1, 0, 1, 1;
        1, 2, 1, 0, 1;
        2, 1, 0, 1, 0];  % 14 samples, 4 discrete features, last column is the class label
% Generate the decision tree
createtree(data);
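As a quick sanity check (my own addition, not part of the original post), the entropy of the full training set can be computed directly with the calentropy helper defined further below; with 9 samples of class 1 and 5 samples of class 0 in the data above, the value should come out near 0.940:

% Entropy of the whole 14-sample set:
% -(9/14)*log2(9/14) - (5/14)*log2(5/14) ~ 0.940
baseEntropy = calentropy(data)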
Generate Decision Tree
function createtree(data)
    [m, n] = size(data);
    disp('Original data:');
    disp(data);
    classlist = data(:, n);
    classone = 1;    % number of samples with the same class as the first row
    for i = 2:m
        if classlist(i, :) == classlist(1, :)
            classone = classone + 1;
        end
    end
    % All samples belong to the same class: stop splitting
    if classone == m
        disp('Final data:');
        disp(data);
        return;
    end
    % All features have been used up (only the label column remains): stop splitting
    if n == 1
        disp('Final data:');
        disp(data);
        return;
    end
    bestfeat = choosebestfeature(data);
    disp(['bestfeat: ', num2str(bestfeat)]);
    featvalues = unique(data(:, bestfeat));
    numoffeatvalue = length(featvalues);
    % Recurse on the subset obtained for each value of the best feature
    for i = 1:numoffeatvalue
        createtree(splitdata(data, bestfeat, featvalues(i, :)));
        disp('-------------------------');
    end
end
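Below is a minimal sketch (my own toy example, not from the original post) of how createtree behaves on a tiny hypothetical two-feature dataset. It shows the recursion and the pure-class stopping criterion from Section II:

% Hypothetical 4-sample set: the first column alone determines the class.
toy = [1, 1, 1;
       1, 0, 1;
       0, 1, 0;
       0, 0, 0];
createtree(toy);
% Expected trace: the 4x3 matrix, then 'bestfeat: 1', then each of the two
% pure partitions printed under 'Final data:'.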
Select the feature with the largest information gain
% Select the feature with the largest information gain
function [bestfeature] = choosebestfeature(data)
    [m, n] = size(data);                % size of the dataset
    numoffeatures = n - 1;              % number of features; the last column is the class label
    baseentropy = calentropy(data);     % entropy before splitting
    bestinfogain = 0;                   % initialize the information gain
    bestfeature = 0;                    % initialize the index of the best feature
    % Try every feature and keep the one with the largest gain
    for j = 1:numoffeatures
        featuretemp = unique(data(:, j));
        numf = length(featuretemp);     % number of distinct values of feature j
        newentropy = 0;                 % entropy after splitting on feature j
        for i = 1:numf
            subset = splitdata(data, j, featuretemp(i, :));
            [m_1, n_1] = size(subset);
            prob = m_1 / m;
            newentropy = newentropy + prob * calentropy(subset);
        end
        % Information gain of feature j
        infogain = baseentropy - newentropy;
        if infogain > bestinfogain
            bestinfogain = infogain;
            bestfeature = j;
        end
    end
end
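As an illustration (my own toy example, not from the original post), here is a small dataset on which choosebestfeature should pick column 1: that column predicts the label perfectly, while column 2 carries no information.

% Column 1 splits the classes perfectly (gain = 1 bit); column 2 gives gain 0.
toy = [0, 0, 0;
       0, 1, 0;
       1, 0, 1;
       1, 1, 1];
best = choosebestfeature(toy)
% Expected: best = 1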
Calculate entropy
function [entropy] = calentropy(data)
    [m, n] = size(data);
    % Get the class label column
    label = data(:, n);
    % Distinct labels
    label_deal = unique(label);
    numlabel = length(label_deal);
    prob = zeros(numlabel, 2);          % column 1: label value, column 2: count
    % Count how many samples carry each label
    for i = 1:numlabel
        prob(i, 1) = label_deal(i, :);
        for j = 1:m
            if label(j, :) == label_deal(i, :)
                prob(i, 2) = prob(i, 2) + 1;
            end
        end
    end
    % Turn the counts into probabilities and accumulate the entropy
    prob(:, 2) = prob(:, 2) ./ m;
    entropy = 0;
    for i = 1:numlabel
        entropy = entropy - prob(i, 2) * log2(prob(i, 2));
    end
end
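A quick hand check of calentropy (my own example): for a set with three samples of one class and two of another, the entropy formula from Section III gives about 0.971, and the function should return the same value. Only the last column of the input is used as the label:

labels = [1; 1; 1; 0; 0];          % 3 samples of class 1, 2 of class 0
h = calentropy(labels)
% Expected: -(3/5)*log2(3/5) - (2/5)*log2(2/5) ~ 0.9710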
Divide data
function [subset] = splitdata(data, axis, value)
    [m, n] = size(data);                % size of the data to be divided
    subset = data;
    subset(:, axis) = [];               % remove the splitting column
    k = 0;                              % number of rows removed so far
    for i = 1:m
        if data(i, axis) ~= value
            subset(i - k, :) = [];
            k = k + 1;
        end
    end
end
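A small usage example for splitdata (my own, not from the original post): rows whose value in the chosen column does not equal the requested value are removed, and the splitting column itself is dropped from the result.

d = [1, 0, 1;
     1, 1, 1;
     0, 0, 0];
s = splitdata(d, 1, 1)
% Expected: s = [0, 1; 1, 1]  (rows 1 and 2, with column 1 removed)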
http://blog.csdn.net/google19890102/article/details/28611225