I. Overview of Decision Tree Classification Algorithms

A decision tree algorithm classifies data by its attributes (features), using the attributes as the basis for dividing the samples into different classes. Consider the following dataset: the first and second columns are attributes (features) and the last column is the class label, where 1 means "yes" and 0 means "no".

(Dataset)

The idea of the decision tree algorithm is to split the data according to its attributes. For the data above we can build the following decision tree model: the data is first partitioned on the first attribute, and the resulting partitions are then split on the second attribute.

(Decision tree model)

Decision tree algorithms include ID3, C4.5, and CART. Below we introduce the ID3 algorithm.

II. Overview of the ID3 Algorithm

The ID3 algorithm was first proposed by Quinlan. It is based on information theory, using information entropy and information gain as its measures, and applies them to induce a classification of the data.

First, ID3 must decide which feature to use as the criterion for splitting the dataset. At each step, ID3 selects the attribute with the largest information gain as the feature on which the current dataset is split (the concept of information gain is described below), and the dataset is divided repeatedly by selecting such features.

Second, ID3 must decide when to stop splitting. There are two cases. In the first case, all samples in a partition belong to the same class. For example, the leftmost "non-fish" leaf contains the data from the 5th and 6th rows of the dataset, and the rightmost "fish" leaf contains the 2nd and 3rd rows. In the second case, no attributes remain on which to split further. In either case the splitting ends. By iterating this process we obtain the decision tree model.

(Basic process of the ID3 algorithm)

III. Data Division in the ID3 Algorithm

ID3 is a classification algorithm that measures information entropy and information gain.

1. Entropy. Entropy measures the degree of disorder of the information: the greater the uncertainty of a variable, the larger its entropy. For a sample set $S$ with $k$ classes, the entropy can be expressed as

$$H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i,$$

where $p_i$ is the probability that the $i$-th class appears in the sample. For example, a partition containing 3 positive and 2 negative samples has entropy $-(3/5)\log_2(3/5) - (2/5)\log_2(2/5) \approx 0.971$.

2. Information gain. Information gain is the change in entropy before and after a split. It can be expressed by the following formula:

$$Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v),$$

where $A$ is an attribute of the samples, $Values(A)$ is the set of all values that attribute $A$ can take, and $S_v$ is the subset of samples for which attribute $A$ has the value $v$.

IV. Experiment Simulation

1. Data preprocessing. The following data is used as the example on which the ID3 algorithm is implemented.

(Data table)

2. Experiment results

(Original data) (Partition 1) (Partition 2) (Partition 3) (Final decision tree)

MATLAB code: main program
% Demo: decision tree
% ID3
% Import the data
% DATA = [, 1;, 1;, 0;, 0;, 0];
data = [0, 2, 0, 0, 0;
        0, 2, 0, 1, 0;
        1, 2, 0, 0, 1;
        2, 1, 0, 0, 1;
        2, 0, 1, 0, 1;
        2, 0, 1, 1, 0;
        1, 0, 1, 1, 1;
        0, 1, 0, 0, 0;
        0, 0, 1, 0, 1;
        2, 1, 1, 0, 1;
        0, 1, 1, 1, 1;
        1, 1, 0, 1, 1;
        1, 2, 1, 0, 1;
        2, 1, 0, 1, 0];  % 14 samples, 4 discrete features, last column is the class label
% Generate the decision tree
createtree(data);
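As a quick sanity check (my own addition, not part of the original post), the entropy of the full training set can be computed directly with the calentropy helper defined further below; with 9 samples of class 1 and 5 samples of class 0 in the data above, the value should come out near 0.940:

% Entropy of the whole 14-sample set:
% -(9/14)*log2(9/14) - (5/14)*log2(5/14) ~ 0.940
baseEntropy = calentropy(data)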
Generate Decision Tree
function createtree(data)
    [m, n] = size(data);
    disp('Original data:');
    disp(data);
    classlist = data(:, n);
    classone = 1;    % number of samples with the same class as the first row
    for i = 2:m
        if classlist(i, :) == classlist(1, :)
            classone = classone + 1;
        end
    end
    % All samples belong to the same class: stop splitting
    if classone == m
        disp('Final data:');
        disp(data);
        return;
    end
    % All features have been used up (only the label column remains): stop splitting
    if n == 1
        disp('Final data:');
        disp(data);
        return;
    end
    bestfeat = choosebestfeature(data);
    disp(['bestfeat: ', num2str(bestfeat)]);
    featvalues = unique(data(:, bestfeat));
    numoffeatvalue = length(featvalues);
    % Recurse on the subset obtained for each value of the best feature
    for i = 1:numoffeatvalue
        createtree(splitdata(data, bestfeat, featvalues(i, :)));
        disp('-------------------------');
    end
end
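Below is a minimal sketch (my own toy example, not from the original post) of how createtree behaves on a tiny hypothetical two-feature dataset. It shows the recursion and the pure-class stopping criterion from Section II:

% Hypothetical 4-sample set: the first column alone determines the class.
toy = [1, 1, 1;
       1, 0, 1;
       0, 1, 0;
       0, 0, 0];
createtree(toy);
% Expected trace: the 4x3 matrix, then 'bestfeat: 1', then each of the two
% pure partitions printed under 'Final data:'.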
Select the feature with the largest information gain
% Select the feature with the largest information gain
function [bestfeature] = choosebestfeature(data)
    [m, n] = size(data);                % size of the dataset
    numoffeatures = n - 1;              % number of features; the last column is the class label
    baseentropy = calentropy(data);     % entropy before splitting
    bestinfogain = 0;                   % initialize the information gain
    bestfeature = 0;                    % initialize the index of the best feature
    % Try every feature and keep the one with the largest gain
    for j = 1:numoffeatures
        featuretemp = unique(data(:, j));
        numf = length(featuretemp);     % number of distinct values of feature j
        newentropy = 0;                 % entropy after splitting on feature j
        for i = 1:numf
            subset = splitdata(data, j, featuretemp(i, :));
            [m_1, n_1] = size(subset);
            prob = m_1 / m;
            newentropy = newentropy + prob * calentropy(subset);
        end
        % Information gain of feature j
        infogain = baseentropy - newentropy;
        if infogain > bestinfogain
            bestinfogain = infogain;
            bestfeature = j;
        end
    end
end
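As an illustration (my own toy example, not from the original post), here is a small dataset on which choosebestfeature should pick column 1: that column predicts the label perfectly, while column 2 carries no information.

% Column 1 splits the classes perfectly (gain = 1 bit); column 2 gives gain 0.
toy = [0, 0, 0;
       0, 1, 0;
       1, 0, 1;
       1, 1, 1];
best = choosebestfeature(toy)
% Expected: best = 1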
Calculate entropy
function [entropy] = calentropy(data)
    [m, n] = size(data);
    % Get the class label column
    label = data(:, n);
    % Distinct labels
    label_deal = unique(label);
    numlabel = length(label_deal);
    prob = zeros(numlabel, 2);          % column 1: label value, column 2: count
    % Count how many samples carry each label
    for i = 1:numlabel
        prob(i, 1) = label_deal(i, :);
        for j = 1:m
            if label(j, :) == label_deal(i, :)
                prob(i, 2) = prob(i, 2) + 1;
            end
        end
    end
    % Turn the counts into probabilities and accumulate the entropy
    prob(:, 2) = prob(:, 2) ./ m;
    entropy = 0;
    for i = 1:numlabel
        entropy = entropy - prob(i, 2) * log2(prob(i, 2));
    end
end
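A quick hand check of calentropy (my own example): for a set with three samples of one class and two of another, the entropy formula from Section III gives about 0.971, and the function should return the same value. Only the last column of the input is used as the label:

labels = [1; 1; 1; 0; 0];          % 3 samples of class 1, 2 of class 0
h = calentropy(labels)
% Expected: -(3/5)*log2(3/5) - (2/5)*log2(2/5) ~ 0.9710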
Divide data
function [subset] = splitdata(data, axis, value)
    [m, n] = size(data);                % size of the data to be divided
    subset = data;
    subset(:, axis) = [];               % remove the splitting column
    k = 0;                              % number of rows removed so far
    for i = 1:m
        if data(i, axis) ~= value
            subset(i - k, :) = [];
            k = k + 1;
        end
    end
end
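A small usage example for splitdata (my own, not from the original post): rows whose value in the chosen column does not equal the requested value are removed, and the splitting column itself is dropped from the result.

d = [1, 0, 1;
     1, 1, 1;
     0, 0, 0];
s = splitdata(d, 1, 1)
% Expected: s = [0, 1; 1, 1]  (rows 1 and 2, with column 1 removed)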
http://blog.csdn.net/google19890102/article/details/28611225