The ID3 decision tree algorithm
Calculation of information gain:
Information Entropy:
Information entropy (entropy): assume that the target attribute of the training set S is C, and that C takes m values C1, C2, ..., Cm, whose proportions in S are p1, p2, ..., pm. The information entropy is defined as:

$$Entropy(S) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
Example 3-2: The weather dataset is as follows. How does one obtain the information entropy of the dataset with respect to the target attribute "play ball"?

Day   Outlook    Temperature   Humidity   Wind     Play ball
D1    Sunny      Hot           High       Weak     No
D2    Sunny      Hot           High       Strong   No
D3    Overcast   Hot           High       Weak     Yes
D4    Rain       Mild          High       Weak     Yes
D5    Rain       Cool          Normal     Weak     Yes
D6    Rain       Cool          Normal     Strong   No
D7    Overcast   Cool          Normal     Strong   Yes
D8    Sunny      Mild          High       Weak     No
D9    Sunny      Cool          Normal     Weak     Yes
D10   Rain       Mild          Normal     Weak     Yes
D11   Sunny      Mild          Normal     Strong   Yes
D12   Overcast   Mild          High       Strong   Yes
D13   Overcast   Hot           Normal     Weak     Yes
D14   Rain       Mild          High       Strong   No
Solution: Record the weather dataset as S. The target attribute "play ball" has two values, "yes" and "no"; "yes" accounts for 9/14 of the samples and "no" for 5/14. Therefore, the information entropy of the dataset with respect to the target attribute is:

$$Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940$$
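To make the definition concrete, here is a minimal Python sketch of the entropy calculation (the function name `entropy` and the label list are illustrative, not from the original article):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Information entropy of a list of target-attribute values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# The 14 "play ball" labels from Example 3-2: 9 "yes" and 5 "no".
play_ball = ["yes"] * 9 + ["no"] * 5
print(entropy(play_ball))  # ~0.940
```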
Information Gain:
Information gain is recorded as Gain(S, A), where A is the attribute used to divide the sample set S. The formula is:

$$Gain(S, A) = Entropy(S) - \sum_{i=1}^{k} \frac{|S_i|}{|S|} Entropy(S_i)$$

where S1, S2, ..., Sk are the subsets into which the values of A partition S, |Si| (i = 1, 2, ..., k) is the number of samples in subset Si, and |S| is the number of samples in the sample set S.
Example 3-3: Using the data in the weather dataset in the preceding table, record the dataset as S. Suppose the attribute wind is used to divide S; calculate the information gain of dividing S by wind.
Solution: the attribute wind has two values, "weak" and "strong", so it divides the dataset S into two subsets, denoted S1 (wind = weak) and S2 (wind = strong).
S1 has 8 samples (6 "yes", 2 "no") and S2 has 6 samples (3 "yes", 3 "no"). The entropies of the sample sets S1 and S2 are:

$$Entropy(S_1) = -\frac{6}{8}\log_2\frac{6}{8} - \frac{2}{8}\log_2\frac{2}{8} \approx 0.811$$
$$Entropy(S_2) = -\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = 1.000$$

The weighted entropy after dividing S by the attribute wind is:

$$\frac{8}{14} \times 0.811 + \frac{6}{14} \times 1.000 \approx 0.892$$

Therefore, the information gain obtained from dividing the dataset S by the attribute wind is:

$$Gain(S, wind) = 0.940 - 0.892 = 0.048$$
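The same computation as a Python sketch (the `gain` helper and the dict-based rows are illustrative; the (wind, play ball) pairs are read off the dataset table above):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attribute, target):
    """Information gain of splitting `rows` (a list of dicts) on `attribute`."""
    total = entropy([r[target] for r in rows])
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        total -= len(subset) / len(rows) * entropy(subset)
    return total

# (wind, play ball) pairs from the weather dataset:
# 8 weak (6 yes / 2 no) and 6 strong (3 yes / 3 no).
rows = ([{"wind": "weak", "play": "yes"}] * 6
        + [{"wind": "weak", "play": "no"}] * 2
        + [{"wind": "strong", "play": "yes"}] * 3
        + [{"wind": "strong", "play": "no"}] * 3)
print(round(gain(rows, "wind", "play"), 3))  # ~0.048
```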
The preceding steps show how information gain is obtained; in fact, computing information gain is the main work of the ID3 algorithm.
Tree construction:
A decision tree is constructed like any other tree: nodes are created recursively. What is special about ID3 is that, when choosing an attribute for a node, it always selects the attribute with the largest information gain as the decision node.
For example, in the preceding example, which attribute should be used as the first decision node according to the ID3 algorithm?
Solution: Calculate the information gain of each attribute for the target attribute:
Gain(S, outlook) = 0.246, Gain(S, temperature) = 0.029, Gain(S, humidity) = 0.152, Gain(S, wind) = 0.048
Among them, Gain (S, outlook) is the largest, so select the outlook attribute as the root node.
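A minimal sketch of the recursive construction, assuming the weather dataset from Example 3-2; the identifiers `DATA` and `id3` and the nested-dict tree representation are illustrative, not part of the original article:

```python
from collections import Counter
from math import log2

# The 14-sample weather dataset; each row maps attribute -> value.
DATA = [
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "wind": "strong", "play": "no"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temperature": "mild", "humidity": "high",   "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temperature": "cool", "humidity": "normal", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temperature": "cool", "humidity": "normal", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "wind": "strong", "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "high",   "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temperature": "mild", "humidity": "normal", "wind": "weak",   "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "normal", "wind": "strong", "play": "yes"},
    {"outlook": "overcast", "temperature": "mild", "humidity": "high",   "wind": "strong", "play": "yes"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "normal", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "temperature": "mild", "humidity": "high",   "wind": "strong", "play": "no"},
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attribute, target="play"):
    total = entropy([r[target] for r in rows])
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        total -= len(subset) / len(rows) * entropy(subset)
    return total

def id3(rows, attributes, target="play"):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:      # pure node: make a leaf
        return labels[0]
    if not attributes:             # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with the largest information gain.
    best = max(attributes, key=lambda a: gain(rows, a, target))
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, rest, target)
    return tree

attrs = ["outlook", "temperature", "humidity", "wind"]
for a in attrs:
    print(a, round(gain(DATA, a), 3))  # outlook has the largest gain
print(id3(DATA, attrs))                # outlook ends up at the root
```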
Next, under the sunny branch, which attribute should be selected as the child node?
Note that the dataset used at this point is the subset Ssunny (the samples with outlook = sunny), rather than the original weather dataset.
Evaluate the information gain of temperature, humidity, and wind with respect to the target attribute "play ball" on Ssunny, and take the attribute with the maximum gain as the child node, as the sketch below shows.
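A small self-contained sketch of this step on the five sunny samples (rows taken from the dataset table; the variable names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attribute):
    total = entropy([r["play"] for r in rows])
    for value in {r[attribute] for r in rows}:
        subset = [r["play"] for r in rows if r[attribute] == value]
        total -= len(subset) / len(rows) * entropy(subset)
    return total

# The five samples with outlook = sunny, from the weather dataset table.
SUNNY = [
    {"temperature": "hot",  "humidity": "high",   "wind": "weak",   "play": "no"},
    {"temperature": "hot",  "humidity": "high",   "wind": "strong", "play": "no"},
    {"temperature": "mild", "humidity": "high",   "wind": "weak",   "play": "no"},
    {"temperature": "cool", "humidity": "normal", "wind": "weak",   "play": "yes"},
    {"temperature": "mild", "humidity": "normal", "wind": "strong", "play": "yes"},
]

for a in ("temperature", "humidity", "wind"):
    print(a, round(gain(SUNNY, a), 3))  # humidity has the largest gain here
```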
......
......
The final decision tree is as follows:

outlook = sunny    -> humidity
    humidity = high   -> no
    humidity = normal -> yes
outlook = overcast -> yes
outlook = rain     -> wind
    wind = strong -> no
    wind = weak   -> yes
Advantages and disadvantages of ID3:
Advantages:
The main idea of ID3 is to use information gain as the selection criterion at the non-leaf nodes of the decision tree: each non-leaf node selects the attribute with the largest information gain among the current candidate attributes. This is a greedy algorithm; each non-leaf node obtains the maximum amount of class information about the samples being tested, which minimizes the entropy of the whole decision tree.
Disadvantages:
1. ID3 can only process discrete data, not continuous data. For example, the wind attribute in the preceding section has only two discrete values, strong and weak; if wind were instead a real number from 0 to 100 representing its intensity, ID3 could not handle it.
2. If an attribute contains missing values, ID3 cannot process it.
3. Information gain has an inherent bias: attributes with many values tend to receive larger gains than attributes with few values, so ID3 tends to select attributes with many branches, and an attribute with many branches is not necessarily the optimal choice.