ID3 Decision Tree: ID3 is the most typical and influential decision tree algorithm. The core problem in a decision tree algorithm is attribute selection, and the ID3 decision tree algorithm uses information gain as the criterion for selecting the test attribute.

Assume there are n mutually exclusive events A1, A2, A3, ..., An, exactly one of which occurs, and let P(Ai) denote the probability that Ai occurs. The average amount of information can then be measured as:

Entropy = -Σ_i P(Ai) * log2(P(Ai))    (Formula 1)

The logarithm base can be any number, and different bases correspond to different units of entropy. Base 2 is usually used, and by convention a term with P(Ai) = 0 is taken to be 0.

Entropy(S, A) = Σ_v (|Sv| / |S|) * Entropy(Sv)    (Formula 2)

Taking the decision "go to play badminton" as an example:
A: an attribute, e.g. Outlook
S: the training sample set
Σ_v: the sum over all possible values v of attribute A (sunny, overcast, rainy)
Sv: the subset of S in which attribute A takes value v, e.g. the subset with Outlook = sunny
|Sv|: the number of elements in Sv
|S|: the number of elements in S

Information gain: Gain = (entropy of the decision attribute) - (entropy of the conditional attribute).

Entropy(Sv) gives the entropy of one subset of the conditional attribute. Suppose a subset contains two classes with counts v1 and v2, and let p1 = v1/(v1+v2), p2 = v2/(v1+v2). Then
I(p1, p2) = p1*log2(1/p1) + p2*log2(1/p2) = -(p1*log2(p1) + p2*log2(p2))
and the entropy of the conditional attribute is the weighted average over all subsets:
Entropy(S, A) = Σ_v I(p1, p2) * [(v1 + v2) / total]
that is, the expected value over the subsets.

Note: the entropy of the decision attribute is computed over all of the data, so it does not need to be turned into a weighted expectation again; it already equals the expectation.

The information gain Gain = (entropy of the decision attribute) - (entropy of the conditional attribute) is computed for every attribute. The gains are then compared, the attribute with the maximum information gain is chosen as the branch node, and the same operation is repeated on each branch.
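To make these formulas concrete, here is a minimal Python sketch (the function names and the toy `samples` records are my own, not part of the original badminton example) that computes Entropy(S) and the information gain of a single attribute:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p))."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain = Entropy(S) - sum_v (|Sv|/|S|) * Entropy(Sv) for one attribute."""
    total = len(rows)
    base = entropy([r[target] for r in rows])
    conditional = 0.0
    for v in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == v]
        conditional += (len(subset) / total) * entropy(subset)
    return base - conditional

# Toy "play badminton" style records (illustrative values only).
samples = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "overcast", "play": "yes"},
    {"outlook": "rainy", "play": "yes"},
    {"outlook": "rainy", "play": "no"},
]
print(information_gain(samples, "outlook", "play"))  # gain of splitting on Outlook (≈ 0.171 here)
```

Counting label frequencies with `Counter` keeps zero counts out of the sum, which matches the convention above that a term with P(Ai) = 0 is taken to be 0.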
Baidu Encyclopedia: the ID3 algorithm was first proposed by Quinlan. It is based on information theory, using entropy and information gain as its measures to classify the data. Some basic concepts of information theory:

Definition 1: If there are n messages with equal probability, the probability p of each message is 1/n, and the amount of information carried by one message is -log2(1/n) = log2(n).

Definition 2: If there are n messages with probability distribution P = (p1, p2, ..., pn), the amount of information carried by the distribution is called the entropy of P, written I(P).

Definition 3: If a record set T is partitioned into disjoint classes C1, C2, ..., Ck according to the value of the category attribute, then the amount of information needed to identify which class an element of T belongs to is Info(T) = I(P), where P is the probability distribution of C1, C2, ..., Ck, i.e. P = (|C1|/|T|, ..., |Ck|/|T|).

Definition 4: If we first partition T into sets T1, T2, ..., Tn according to the value of a non-category attribute X, then the amount of information needed to determine the class of an element of T is the weighted average of Info(Ti): Info(X, T) = Σ_{i=1..n} (|Ti|/|T|) * Info(Ti).

Definition 5: Information gain is the difference between two information amounts: the information needed to determine the class of an element of T, and the information still needed after the value of attribute X has been obtained. The formula is: Gain(X, T) = Info(T) - Info(X, T).

The ID3 algorithm computes the information gain of each attribute and selects the attribute with the highest gain as the test attribute for the given set. It creates a node labeled with the selected test attribute, and creates a branch for each value of that attribute to partition the samples.

Note: ID3 is a classical decision tree learning algorithm, proposed by Quinlan in 1979. Its basic idea is to use information entropy as the measure for selecting the attribute at each decision tree node: at each step the most informative attribute is selected, i.e. the attribute that reduces the entropy the most, so as to construct a decision tree whose entropy decreases fastest; the entropy at the leaf nodes is 0. At that point, all instances in the instance set corresponding to a leaf node belong to the same class.

Example. The variables are shown in Table 1: the independent variables are age, occupation and sex, and the dependent variable is the result (how often the person eats at food stalls).

Table 1 Data sheet
Age A      Occupation B   Sex C    Result
20-30      student        male     occasional
30-40      worker         male     often
40-50      teacher        female   never
20-30      worker         female   occasional
60-70      teacher        male     never
40-50      worker         female   never
30-40      teacher        male     occasional
20-30      student        female   never
below 20   -              male     occasional
below 20   worker         female   occasional
20-30      worker         male     often
below 20   student        male     occasional
20-30      teacher        male     occasional
60-70      teacher        female   never
30-40      worker         female   occasional
60-70      worker         male     never
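Table 1 can be encoded directly as data; the sketch below (my own encoding of the table, with the one occupation missing from the source left as None) just counts the result frequencies that the next step starts from:

```python
from collections import Counter

# (age, occupation, sex, result) rows of Table 1; one occupation is unknown in the source.
records = [
    ("20-30", "student", "male", "occasional"), ("30-40", "worker", "male", "often"),
    ("40-50", "teacher", "female", "never"),    ("20-30", "worker", "female", "occasional"),
    ("60-70", "teacher", "male", "never"),      ("40-50", "worker", "female", "never"),
    ("30-40", "teacher", "male", "occasional"), ("20-30", "student", "female", "never"),
    ("below 20", None, "male", "occasional"),   ("below 20", "worker", "female", "occasional"),
    ("20-30", "worker", "male", "often"),       ("below 20", "student", "male", "occasional"),
    ("20-30", "teacher", "male", "occasional"), ("60-70", "teacher", "female", "never"),
    ("30-40", "worker", "female", "occasional"), ("60-70", "worker", "male", "never"),
]

# Frequency of each result option.
counts = Counter(r[3] for r in records)
for result, n in counts.items():
    print(result, n / len(records))
```

The printed proportions match Table 2 below: never 0.375, often 0.125, occasional 0.5.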
Calculation process:

1. First, the frequency of each result option is calculated:

Table 2 Result frequency table
Result      never (p1)   often (p2)   occasional (p3)
Frequency   0.375        0.125        0.5

2. Calculate the expected information of the dependent variable:
E(Result) = -(p1*log2(p1) + p2*log2(p2) + p3*log2(p3))
          = -(0.375*log2(0.375) + 0.125*log2(0.125) + 0.5*log2(0.5))
          = 1.406
Note: here the pi are the frequencies from Table 2.

3. Calculate the expected information of an independent variable (taking age A as an example):
E(A) = Σ_j count(Aj)/count(A) * (-(p1j*log2(p1j) + p2j*log2(p2j) + p3j*log2(p3j)))

3.1 Formula description:
count(Aj): the number of records whose age A takes the j-th option, where j is any one of the five options in Table 3.

Table 3 Age record count table
Option      20-30   below 20   30-40   40-50   60-70
Quantity    5       3          3       2       3

count(A): the total number of age records.
p1j = count(A1j)/count(Aj): among the records with the j-th option of age A, the proportion whose result is "never";
p2j = count(A2j)/count(Aj): among the records with the j-th option of age A, the proportion whose result is "occasional";
p3j = count(A3j)/count(Aj): among the records with the j-th option of age A, the proportion whose result is "often".

3.2 Formula analysis:
The point of the formula is to judge whether an independent variable has a decisive influence on the dependent variable in the decision tree; different choices of variable lead to different results. For example, if old people never go to food stalls, middle-aged people often go and young people occasionally go, then age is clearly the main factor deciding whether someone eats at a stall. Under this assumption, i.e. each age group has a definite effect on the result, take the 3 people aged below 20 in Table 3 as an example and suppose they all chose the "occasional" option. Then:
p2j = count(A2j)/count(Aj) = 1, p1j = count(A1j)/count(Aj) = 0, p3j = count(A3j)/count(Aj) = 0,
so (p1j*log2(p1j) + p2j*log2(p2j) + p3j*log2(p3j)) → 0.

Specifically:
lim(p2j→1) p2j*log2(p2j) = 1*0 = 0
lim(p1j→0) p1j*log2(p1j) = lim(p1j→0) log2(p1j) / (1/p1j) = lim(p1j→0) (-p1j*log2(e)) = 0   (by L'Hôpital's rule)
lim(p3j→0) p3j*log2(p3j) = lim(p3j→0) log2(p3j) / (1/p3j) = lim(p3j→0) (-p3j*log2(e)) = 0
so (p1j*log2(p1j) + p2j*log2(p2j) + p3j*log2(p3j)) → 0 + 0 + 0 = 0.

It can be seen that if each age group has a definite effect on the result, then the unweighted expected information of each age group, -(p1j*log2(p1j) + p2j*log2(p2j) + p3j*log2(p3j)), is very small, so that E(A) is very small or even close to 0.

4. Expected information of the independent variable
4.1 Calculation of E(A). From Table 4 it can be seen that only two age groups have mixed effects on the result (the other three groups are pure, so their terms are 0). They are calculated as follows:
E(30-40) = count(Aj)/count(A) * (-(p1j*log2(p1j) + p2j*log2(p2j) + p3j*log2(p3j)))
         = 3/16 * (-(2/3*log2(2/3) + 1/3*log2(1/3)))
         = 0.172
E(20-30) = count(Aj)/count(A) * (-(p1j*log2(p1j) + p2j*log2(p2j) + p3j*log2(p3j)))
         = 5/16 * (-(1/5*log2(1/5) + 3/5*log2(3/5) + 1/5*log2(1/5)))
         = 0.428
Final calculation:
E(A) = E(30-40) + E(20-30) = 0.172 + 0.428 = 0.6

Table 4 Age information table
Age A      Result
60-70      never
60-70      never
60-70      never
40-50      never
40-50      never
30-40      often
30-40      occasional
30-40      occasional
below 20   occasional
below 20   occasional
below 20   occasional
20-30      never
20-30      often
20-30      occasional
20-30      occasional
20-30      occasional

5. The information gain of the age variable is:
Gain(A) = E(Result) - E(A) = 1.406 - 0.6 = 0.806
Gain(B) and Gain(C) can be calculated in the same way.
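The numbers above can be checked with a short verification script (the helper name `entropy` and the `rows` encoding of Table 4 are my own):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """-(sum of p * log2(p)) over the label frequencies."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

# (age, result) pairs taken from Table 4.
rows = (
    [("60-70", "never")] * 3
    + [("40-50", "never")] * 2
    + [("30-40", "often"), ("30-40", "occasional"), ("30-40", "occasional")]
    + [("below 20", "occasional")] * 3
    + [("20-30", "never"), ("20-30", "often")]
    + [("20-30", "occasional")] * 3
)

results = [r for _, r in rows]
e_result = entropy(results)              # expected information of the dependent variable

# E(A): weighted sum of the entropy of the result within each age group.
e_age = 0.0
for age in set(a for a, _ in rows):
    subset = [r for a, r in rows if a == age]
    e_age += (len(subset) / len(rows)) * entropy(subset)

gain_age = e_result - e_age
print(round(e_result, 3), round(e_age, 3), round(gain_age, 3))   # -> 1.406 0.601 0.805
```

Because the script keeps full precision, it prints E(A) ≈ 0.601 and Gain(A) ≈ 0.805; the article rounds E(30-40) and E(20-30) first, which gives 0.6 and 0.806.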
Note: information gain measures how well a division reduces the degree of disorder; therefore the first division of the decision tree is made on whichever variable has the largest information gain.

6. The division process. If the variables have few options, as in the example above, then assuming the age variable has the largest information gain, the first division is:
40-70 years old: never patronize
below 20: occasionally
20-30 years old: for this data, compute the information gain again by occupation and sex to find further rules
30-40 years old: for this data, compute the information gain again by occupation and sex to find further rules

In the actual division, the rule is based on a splitting threshold:
A. Numeric variables: sort the recorded values from small to large, take each value in turn as a candidate cut point, and compute the heterogeneity statistic of the child nodes it produces. The cut point that reduces the heterogeneity the most is the best split point.
B. Categorical variables: list all possible combinations that divide the values into two subsets, and compute the heterogeneity of the child nodes generated under each combination. Likewise, the combination that minimizes the heterogeneity is the best split point.

Note two questions: must the root node produce exactly two subsets? If it produces three or four subsets, what standard determines how many subsets there should be? My guess is that after splitting into several subsets, the results of the subsets must differ pairwise significantly for splitting to continue, and if a new subset is not significantly different from the result of any existing subset, splitting stops. How is such a threshold statistic constructed?

7. Criteria for stopping the division. Growth stops when any one of the following is met:
(1) the node reaches complete purity;
(2) the depth of the tree reaches the user-specified depth;
(3) the number of samples in the node is less than the user-specified number;
(4) the decrease in the heterogeneity index is smaller than the user-specified amplitude.

Pruning: a complete decision tree may describe the characteristics of the training samples "too precisely" (being affected by noisy data), lose generality, and fail to make good class predictions on new data; this is "over-fitting". The remedy is to remove the divisions that have little effect on the precision of the tree. The cost-complexity method is used, which weighs both the risk of error and the complexity of the tree, so that the smaller both are, the better.
Pruning methods:
A. Pre-pruning (prepruning): the stop-growth strategy above.
B. Post-pruning (postpruning): first let the decision tree grow as fully as possible, and then prune it from the bottom up according to certain rules.
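Putting the pieces together, here is a sketch of a recursive ID3-style builder with the stop-growth (pre-pruning) conditions listed above; the function names, parameter names and default thresholds are my own choices, not prescribed by the text:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # -(sum of p * log2(p)) over the label frequencies.
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr, target):
    # Gain = Entropy(S) - sum_v (|Sv|/|S|) * Entropy(Sv).
    base = entropy([r[target] for r in rows])
    cond = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == v]
        cond += (len(subset) / len(rows)) * entropy(subset)
    return base - cond

def build_id3(rows, attrs, target, depth=0, max_depth=5, min_samples=2, min_gain=1e-6):
    """Recursive ID3 sketch with the pre-pruning stop conditions from the section above."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: pure node, depth limit reached, too few samples, or no attribute left.
    if len(set(labels)) == 1 or depth >= max_depth or len(rows) < min_samples or not attrs:
        return majority
    gains = {a: information_gain(rows, a, target) for a in attrs}
    best = max(gains, key=gains.get)
    if gains[best] < min_gain:        # the decrease in disorder is too small to split further
        return majority
    tree = {best: {}}
    for v in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == v]
        tree[best][v] = build_id3(subset, [a for a in attrs if a != best], target,
                                  depth + 1, max_depth, min_samples, min_gain)
    return tree
```

For Table 1, one would convert the records into dicts with keys "age", "occupation", "sex" and "result" and call build_id3(rows, ["age", "occupation", "sex"], "result"); if, as assumed above, age indeed has the largest information gain, the root split is made on age.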