1. Decide on the classification attribute (the class label).
2. For the current data table, create a node N.
3. If all records in the table belong to the same class, N is a leaf, and that class is marked on the leaf.
4. If there are no more attributes in the table to test, N is also a leaf, and the leaf is marked with the majority class (majority voting).
5. Otherwise, select the best attribute as the test attribute of node N, based on the average information expectation E (equivalently, the information gain).
6. For each value of the selected attribute: generate a branch from N, and collect the records taking that value into the branch's data table, dropping the column of the selected attribute. If the branch's data table is not empty, apply the algorithm above recursively.
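The steps above can be sketched as a short recursive function. This is a minimal illustration in Python rather than the article's MATLAB; the function name `id3` and the dict-based row format are assumptions made for the sketch:

```python
import math
from collections import Counter, defaultdict

def entropy(weighted_labels):
    """Entropy in bits of a {label: weight} distribution."""
    total = sum(weighted_labels.values())
    return -sum(w/total * math.log2(w/total) for w in weighted_labels.values() if w)

def id3(rows, attrs, label="class", weight="count"):
    """rows: list of dicts; attrs: attribute names still available to test.
    Returns either a class label (leaf) or (best_attr, {value: subtree})."""
    dist = Counter()
    for r in rows:
        dist[r[label]] += r[weight]
    # Step 3: all records share one class -> leaf marked with that class
    if len(dist) == 1:
        return next(iter(dist))
    # Step 4: no attributes left -> leaf marked by majority voting
    if not attrs:
        return dist.most_common(1)[0][0]
    # Step 5: pick the attribute with the largest information gain
    def gain(a):
        total = sum(dist.values())
        groups = defaultdict(Counter)
        for r in rows:
            groups[r[a]][r[label]] += r[weight]
        cond = sum(sum(g.values())/total * entropy(g) for g in groups.values())
        return entropy(dist) - cond
    best = max(attrs, key=gain)
    # Step 6: branch on each value of the chosen attribute and recurse
    branches = {}
    for v in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == v]
        branches[v] = id3(subset, [a for a in attrs if a != best], label, weight)
    return (best, branches)
```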
Original table:

| Count | Age | Income | Student | Credit | Class: buys computer |
|-------|-----|--------|---------|--------|----------------------|
| 64 | Young | High | No | Fair | Don't buy |
| 64 | Young | High | No | Excellent | Don't buy |
| 128 | Middle-aged | High | No | Fair | Buy |
| 60 | Old | Medium | No | Fair | Buy |
| 64 | Old | Low | Yes | Fair | Buy |
| 64 | Old | Low | Yes | Excellent | Don't buy |
| 64 | Middle-aged | Low | Yes | Excellent | Buy |
| 128 | Young | Medium | No | Fair | Don't buy |
| 64 | Young | Low | Yes | Fair | Buy |
| 132 | Old | Medium | Yes | Fair | Buy |
| 64 | Young | Medium | Yes | Excellent | Buy |
| 32 | Middle-aged | Medium | No | Excellent | Buy |
| 32 | Middle-aged | High | Yes | Fair | Buy |
| 63 | Old | Medium | No | Excellent | Don't buy |
| 1 | Old | Medium | No | Excellent | Buy |
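Before branching, we need the prior entropy of the class. A short check of this value, with the class counts transcribed from the table above (this sketch is in Python rather than the article's MATLAB):

```python
import math

# Weighted rows from the table: 641 "buy" vs. 383 "don't buy", out of 1024.
counts = [64, 64, 128, 60, 64, 64, 64, 128, 64, 132, 64, 32, 32, 63, 1]
buys   = ["no", "no", "yes", "yes", "yes", "no", "yes", "no",
          "yes", "yes", "yes", "yes", "yes", "no", "yes"]

total = sum(counts)                                         # 1024
buy = sum(c for c, b in zip(counts, buys) if b == "yes")    # 641
p1, p2 = buy / total, (total - buy) / total
H = -(p1 * math.log2(p1) + p2 * math.log2(p2))              # prior entropy, ~0.9537
```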
First, we calculate the mutual information (information gain) of the age attribute. Sorting the table by age gives (the last column is the row number in the original table):

| Count | Age | Income | Student | Credit | Class: buys computer | Row |
|-------|-----|--------|---------|--------|----------------------|-----|
| 60 | Old | Medium | No | Fair | Buy | 4 |
| 64 | Old | Low | Yes | Fair | Buy | 5 |
| 64 | Old | Low | Yes | Excellent | Don't buy | 6 |
| 132 | Old | Medium | Yes | Fair | Buy | 10 |
| 63 | Old | Medium | No | Excellent | Don't buy | 14 |
| 1 | Old | Medium | No | Excellent | Buy | 15 |
| 64 | Young | High | No | Fair | Don't buy | 1 |
| 64 | Young | High | No | Excellent | Don't buy | 2 |
| 128 | Young | Medium | No | Fair | Don't buy | 8 |
| 64 | Young | Low | Yes | Fair | Buy | 9 |
| 64 | Young | Medium | Yes | Excellent | Buy | 11 |
| 128 | Middle-aged | High | No | Fair | Buy | 3 |
| 64 | Middle-aged | Low | Yes | Excellent | Buy | 7 |
| 32 | Middle-aged | Medium | No | Excellent | Buy | 12 |
| 32 | Middle-aged | High | Yes | Fair | Buy | 13 |
MATLAB code:

```matlab
clear
clc
SM = [64 64 128 60 64 64 64 128 64 132 64 32 32 63 1];  % count of each row
% Class: buys computer / does not buy -- U1, U2
% Age A1: young, middle-aged, old
% Income A2: low, medium, high
% Student A3: yes, no
% Credit A4: fair, excellent

% Prior entropy (of the class)
M  = sum(SM);                             % total population: 1024
BM = SM(1)+SM(2)+SM(6)+SM(8)+SM(14);      % "don't buy" total: 383
MM = M - BM;                              % "buy" total: 641
pu1 = MM/M;
pu2 = BM/M;
hu  = -(pu1*log2(pu1) + pu2*log2(pu2));   % ~0.9537
%----------------------------------
% Posterior entropies (for A1): v1 = young, v2 = middle-aged, v3 = old
Q1 = SM(1)+SM(2)+SM(8)+SM(9)+SM(11);      % young total: 384
Z1 = SM(3)+SM(7)+SM(12)+SM(13);           % middle-aged total: 256
L1 = M - Q1 - Z1;                         % old total: 384
pv1 = Q1/M;
pv2 = Z1/M;
pv3 = L1/M;
% for the young
QM = SM(9)+SM(11);                        % young buyers: 128
BM = Q1 - QM;                             % young non-buyers: 256
pu1v1 = QM/Q1;
pu2v1 = BM/Q1;
huv1 = -(pu1v1*log2(pu1v1) + pu2v1*log2(pu2v1));
% for the middle-aged (all buy)
ZM = SM(3)+SM(7)+SM(12)+SM(13);
pu1v2 = 1;
pu2v2 = 0;
huv2 = -(pu1v2*log2(pu1v2) + pu2v2*log2(pu2v2 + eps));  % eps avoids log2(0)
% for the old
LM = SM(4)+SM(5)+SM(10)+SM(15);           % old buyers: 257
BM = L1 - LM;                             % old non-buyers: 127
pu1v3 = LM/L1;
pu2v3 = BM/L1;
huv3 = -(pu1v3*log2(pu1v3) + pu2v3*log2(pu2v3));
% Conditional entropy (for A1)
T1 = [pv1 pv2 pv3];
T  = [huv1 huv2 huv3];
disp('H(computer | age):');
huv = sum(T.*T1)                          % ~0.6877
% Mutual information, i.e. information gain (for A1)
Ia1 = hu - huv                            % ~0.2660
```
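As a cross-check of the MATLAB result, the same computation in Python, with the per-group totals read off the sorted table (384 young of whom 128 buy, 256 middle-aged who all buy, 384 old of whom 257 buy):

```python
import math

def h(p):
    """Binary entropy in bits; h(0) = h(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -(p*math.log2(p) + (1 - p)*math.log2(1 - p))

# (group size, number of buyers) for each age value:
groups = {"young": (384, 128), "middle-aged": (256, 256), "old": (384, 257)}
total = sum(n for n, _ in groups.values())                     # 1024

H_prior = h(641 / 1024)                                        # ~0.9537
H_cond  = sum(n/total * h(m/n) for n, m in groups.values())    # ~0.6877
gain    = H_prior - H_cond                                     # ~0.2660
```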
Next, we calculate the mutual information of the income attribute. Sorting the table by income gives:

| Count | Age | Income | Student | Credit | Class: buys computer | Row |
|-------|-----|--------|---------|--------|----------------------|-----|
| 64 | Old | Low | Yes | Fair | Buy | 5 |
| 64 | Old | Low | Yes | Excellent | Don't buy | 6 |
| 64 | Young | Low | Yes | Fair | Buy | 9 |
| 64 | Middle-aged | Low | Yes | Excellent | Buy | 7 |
| 64 | Young | High | No | Fair | Don't buy | 1 |
| 64 | Young | High | No | Excellent | Don't buy | 2 |
| 128 | Middle-aged | High | No | Fair | Buy | 3 |
| 32 | Middle-aged | High | Yes | Fair | Buy | 13 |
| 60 | Old | Medium | No | Fair | Buy | 4 |
| 132 | Old | Medium | Yes | Fair | Buy | 10 |
| 63 | Old | Medium | No | Excellent | Don't buy | 14 |
| 1 | Old | Medium | No | Excellent | Buy | 15 |
| 128 | Young | Medium | No | Fair | Don't buy | 8 |
| 64 | Young | Medium | Yes | Excellent | Buy | 11 |
| 32 | Middle-aged | Medium | No | Excellent | Buy | 12 |
The code is similar to the age case, and likewise for the student and credit attributes.

Finally, the results are:

1. Gain(age) = 0.9537 − 0.6877 = 0.2660
2. Gain(income) = 0.9537 − 0.9361 = 0.0176
3. Gain(student) = 0.9537 − 0.7811 = 0.1726
4. Gain(credit) = 0.9537 − 0.9084 = 0.0453
Since age has the largest information gain, it is chosen as the root node. (To be continued.)