[Reprinted] Easy-to-learn machine learning algorithms: the ID3 decision tree algorithm

I. Overview of decision tree classification algorithms

A decision tree algorithm classifies data according to its attributes (features), using an attribute at each step as the basis for dividing the data into different classes. For example, in the following dataset the first and second columns are attributes (features) and the last column is the class label, where 1 means "yes" and 0 means "no".

(Figure: dataset)

The idea of a decision tree algorithm is to split the data by its attributes. For the data above we can obtain the decision tree model shown below: the data is first partitioned on the first attribute, and the remaining partition is then separated on the second attribute.

(Figure: decision tree model)

Decision tree algorithms include ID3, C4.5, and CART. Below we introduce the ID3 algorithm.

II. Overview of the ID3 algorithm

The ID3 algorithm was first proposed by Quinlan. It is grounded in information theory and uses information entropy and information gain as its measures, and on that basis it induces a classification of the data.

First, the ID3 algorithm has to decide which feature to use as the criterion for splitting the dataset. ID3 selects the attribute with the largest information gain as the feature on which the current dataset is split (the concept of information gain is described below), and the dataset is divided step by step by repeatedly selecting features in this way.

Second, the ID3 algorithm has to decide when the splitting ends. There are two cases. The first case is that all the data in a partition belong to the same class; for example, the leftmost "non-fish" leaf contains rows 5 and 6 of the dataset, and the rightmost "fish" leaf contains rows 2 and 3. The second case is that no attribute is left to split on. In either case the splitting ends. By iterating in this way, we obtain a decision tree model such as the one below.

(Figure: basic process of the ID3 algorithm)

III. Data splitting in the ID3 algorithm

The ID3 algorithm is a classification algorithm whose measures are information entropy and information gain.

1. Entropy

Entropy measures the degree of disorder of the information: the greater the uncertainty of a variable, the larger its entropy. The entropy of a sample set S can be written as

H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

where p_i is the probability (proportion) with which class i appears in the sample.

2. Information gain

Information gain is the change in entropy before and after a split. It can be expressed by the following formula:

Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)

Here A is an attribute of the samples, Values(A) is the set of all values the attribute A can take, v is one of those values, and S_v is the subset of samples for which attribute A takes the value v. (A worked example of both quantities follows the main program listing below.)

IV. Experiment simulation

1. Data preprocessing

The following data is used as the example on which the ID3 algorithm is implemented.

(Figure: data table)

2. Experiment results

(Figures: original data; partition 1; partition 2; partition 3; final decision tree)

MATLAB code

Main program
% Demo: ID3 decision tree
% Import the data
% Small "is it a fish?" example (the toy dataset discussed above):
% data = [1, 1, 1; 1, 1, 1; 1, 0, 0; 0, 1, 0; 0, 1, 0];
% Example dataset: columns 1-4 are attribute values,
% column 5 is the class label (1 = yes, 0 = no)
data = [0, 2, 0, 0, 0;
        0, 2, 0, 1, 0;
        1, 2, 0, 0, 1;
        2, 1, 0, 0, 1;
        2, 0, 1, 0, 1;
        2, 0, 1, 1, 0;
        1, 0, 1, 1, 1;
        0, 1, 0, 0, 0;
        0, 0, 1, 0, 1;
        2, 1, 1, 0, 1;
        0, 1, 1, 1, 1;
        1, 1, 0, 1, 1;
        1, 2, 1, 0, 1;
        2, 1, 0, 1, 0];
% Generate the decision tree
createtree(data);
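As a quick check of the entropy and information-gain formulas from section III, the helper functions defined below (calentropy and splitdata) can be called on the data matrix directly. This is a minimal sketch assuming the data matrix above, in which 9 of the 14 rows are labeled 1 and 5 are labeled 0; the values in the comments are what the formulas give for that data.

% Entropy of the full dataset: with 9 rows labeled 1 and 5 labeled 0,
% H(S) = -(9/14)*log2(9/14) - (5/14)*log2(5/14), about 0.940
baseentropy = calentropy(data);

% Information gain of the first attribute: split on column 1, weight each
% subset's entropy by its size, and subtract the result from H(S)
vals = unique(data(:, 1));
newentropy = 0;
for i = 1:length(vals)
    subset = splitdata(data, 1, vals(i));
    newentropy = newentropy + size(subset, 1) / size(data, 1) * calentropy(subset);
end
gain1 = baseentropy - newentropy; % about 0.247 for this data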

Generate the decision tree
function createtree(data)
[m, n] = size(data);
disp('original data:');
disp(data);
classlist = data(:, n);
classone = 1; % number of rows with the same class as the first row
for i = 2:m
    if classlist(i, :) == classlist(1, :)
        classone = classone + 1;
    end
end
% all rows belong to the same class
if classone == m
    disp('final data:');
    disp(data);
    return;
end
% all features have been used up
if n == 1
    disp('final data:');
    disp(data);
    return;
end
bestfeat = choosebestfeature(data);
disp(['bestfeat: ', num2str(bestfeat)]);
featvalues = unique(data(:, bestfeat));
numoffeatvalue = length(featvalues);
for i = 1:numoffeatvalue
    createtree(splitdata(data, bestfeat, featvalues(i, :)));
    disp('-------------------------');
end
end
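The createtree above only prints each partition; it does not return a tree object. As a minimal sketch of an alternative design (not part of the original post), the same recursion can return a nested struct instead; the function name createtree2 and the field names leaf, label, feature, children, value and subtree are illustrative assumptions.

function tree = createtree2(data)
% Sketch: build a nested struct instead of printing the partitions
n = size(data, 2);
classlist = data(:, n);
% stop when all rows share one class or no attribute columns remain
if all(classlist == classlist(1)) || n == 1
    tree = struct('leaf', true, 'label', mode(classlist));
    return;
end
bestfeat = choosebestfeature(data);
tree = struct('leaf', false, 'feature', bestfeat, 'children', {{}});
featvalues = unique(data(:, bestfeat));
for i = 1:length(featvalues)
    child.value = featvalues(i); % attribute value of this branch
    child.subtree = createtree2(splitdata(data, bestfeat, featvalues(i)));
    tree.children{end+1} = child;
end
end

Calling tree = createtree2(data) then gives tree.feature as the attribute chosen at the root and tree.children as its branches.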

Select the feature with the largest information gain
% Select the feature with the largest information gain
function [bestfeature] = choosebestfeature(data)
[m, n] = size(data); % size of the dataset
% number of candidate features
numoffeatures = n - 1; % the last column is the class label
% entropy before splitting
baseentropy = calentropy(data);
bestinfogain = 0; % initialize the best information gain
bestfeature = 0;  % initialize the index of the best feature
% find the best feature
for j = 1:numoffeatures
    featuretemp = unique(data(:, j));
    numf = length(featuretemp); % number of distinct values of this feature
    newentropy = 0; % weighted entropy after splitting
    for i = 1:numf
        subset = splitdata(data, j, featuretemp(i, :));
        [m_1, n_1] = size(subset);
        prob = m_1 ./ m;
        newentropy = newentropy + prob * calentropy(subset);
    end
    % information gain
    infogain = baseentropy - newentropy;
    if infogain > bestinfogain
        bestinfogain = infogain;
        bestfeature = j;
    end
end
end
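As a usage example (assuming the data matrix from the main program above), the first attribute has the largest information gain, so the root split is expected to be on column 1:

bestfeat = choosebestfeature(data); % expected: 1, with a gain of about 0.247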

Calculate entropy
function [entropy] = calentropy(data)
[m, n] = size(data);
% class label column
label = data(:, n);
% distinct labels
label_deal = unique(label);
numlabel = length(label_deal);
prob = zeros(numlabel, 2);
% count how often each label occurs
for i = 1:numlabel
    prob(i, 1) = label_deal(i, :);
    for j = 1:m
        if label(j, :) == label_deal(i, :)
            prob(i, 2) = prob(i, 2) + 1;
        end
    end
end
% compute the entropy
prob(:, 2) = prob(:, 2) ./ m;
entropy = 0;
for i = 1:numlabel
    entropy = entropy - prob(i, 2) * log2(prob(i, 2));
end
end
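Note that calentropy only sums over labels that actually occur in the data, so a pure partition yields a probability of 1 and an entropy of 0, and log2(0) is never evaluated. Two small checks (the matrices here are made-up examples; the last column is the label):

calentropy([1, 0; 2, 0]) % one class only -> 0
calentropy([1, 0; 2, 1]) % two equally likely classes -> 1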

Divide the data
function [subset] = splitdata(data, axis, value)
[m, n] = size(data); % size of the data to be divided
subset = data;
subset(:, axis) = []; % remove the splitting column
k = 0;
for i = 1:m
    if data(i, axis) ~= value
        subset(i - k, :) = []; % drop rows whose value does not match
        k = k + 1;
    end
end
end
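For example, splitting the small fish dataset (the commented-out matrix in the main program) on its first attribute with value 1 keeps rows 1 to 3 and removes the splitting column:

fish = [1, 1, 1; 1, 1, 1; 1, 0, 0; 0, 1, 0; 0, 1, 0];
splitdata(fish, 1, 1)
% ans =
%      1     1
%      1     1
%      0     0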
Original article: http://blog.csdn.net/google19890102/article/details/28611225
