Decision Tree Classification
The decision tree algorithm uses the branch structure of a tree to perform classification. When choosing a split point, the decision tree always selects the best attribute as the splitting attribute, so that the records falling into each branch are as pure in class as possible. Common attribute selection measures include information gain, gain ratio, and the Gini index.
Information gain is rooted in Shannon's information theory: the attribute R chosen for splitting is the one whose information gain, i.e. the reduction in information before and after splitting on R, is greater than that of any other attribute. Here the information (actually the entropy) is defined as follows:

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
where m is the number of classes C_i in dataset D, p_i is the probability that an arbitrary record in D belongs to class C_i, and p_i is computed as |C_{i,D}| / |D| (the number of records of class C_i in D divided by the total number of records in D). Info(D) is the amount of information required to separate the classes of dataset D from one another.
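As a minimal sketch of this formula (assuming each record is a list whose last element is the class label; the function name info is ours), the entropy Info(D) can be computed in Python like this:

```python
from collections import Counter
import math

def info(dataset):
    """Info(D): entropy of a dataset whose records are lists
    with the class label as the last element."""
    total = len(dataset)
    counts = Counter(record[-1] for record in dataset)
    # Sum of p_i * log2(p_i) over all classes, negated.
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```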
Suppose we choose attribute R as the split attribute. If R takes k distinct values {v1, v2, ..., vk} in dataset D, then D is partitioned by the value of R into k groups {D1, D2, ..., Dk}. After splitting on R, the amount of information required to separate the different classes of dataset D is:

$$\mathrm{Info}_R(D) = \sum_{j=1}^{k} \frac{|D_j|}{|D|} \cdot \mathrm{Info}(D_j)$$
The information gain is then defined as the difference between these two amounts, before and after the split:

$$\mathrm{Gain}(R) = \mathrm{Info}(D) - \mathrm{Info}_R(D)$$
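Continuing the sketch above (split and gain are hypothetical helper names of ours, and attr_index is the column position of attribute R; info() is reused from the previous snippet), these two formulas could be computed as:

```python
def split(dataset, attr_index):
    """Partition D into groups D_1..D_k by the value of attribute R,
    identified here by its column index."""
    groups = {}
    for record in dataset:
        groups.setdefault(record[attr_index], []).append(record)
    return groups

def gain(dataset, attr_index):
    """Gain(R) = Info(D) - Info_R(D)."""
    total = len(dataset)
    info_r = sum(len(subset) / total * info(subset)
                 for subset in split(dataset, attr_index).values())
    return info(dataset) - info_r
```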
The following example uses Python to illustrate constructing a decision tree with the information gain method.
The main steps include the following:
1. Calculate the entropy of the original data set
2. Calculate the information gain of each feature and pick the one with the largest gain as the split point
1) This involves two steps: first, split the data into subsets according to each candidate split point;
2) then compute the information gain of each split point and select the point with the maximum gain.
3. Build the tree structure recursively, applying the same procedure to each split subset
Recursing in this way amounts to finding the root node (the attribute and its corresponding value) for each subset of the data, as sketched below.
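Putting the steps together, here is a hedged, ID3-style sketch that reuses the info, split, and gain helpers defined above; the nested-dict tree representation and the toy dataset are our assumptions for illustration, not necessarily what the original series used:

```python
def build_tree(dataset, attr_indices):
    """Recursively build an ID3-style tree as nested dicts:
    {attr_index: {attr_value: subtree_or_class_label}}."""
    labels = [record[-1] for record in dataset]
    if len(set(labels)) == 1:          # node is pure: return the class
        return labels[0]
    if not attr_indices:               # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute with the largest information gain as the root.
    best = max(attr_indices, key=lambda i: gain(dataset, i))
    remaining = [i for i in attr_indices if i != best]
    return {best: {value: build_tree(subset, remaining)
                   for value, subset in split(dataset, best).items()}}

# Toy usage: columns are [outlook, windy, play?].
data = [
    ['sunny', 'true', 'no'],
    ['sunny', 'false', 'no'],
    ['rainy', 'true', 'no'],
    ['rainy', 'false', 'yes'],
    ['overcast', 'false', 'yes'],
]
print(build_tree(data, [0, 1]))
# -> {0: {'sunny': 'no', 'rainy': {1: {'true': 'no', 'false': 'yes'}},
#         'overcast': 'yes'}}
```

The stopping conditions mirror the recursion described above: a branch ends either when all records in a subset share one class, or when no attributes remain, in which case the majority class is returned.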
Common Machine Learning Algorithms: Principles + Practice, Series 4 (Decision Trees)