Watermelon Book Chapter 4: Decision Trees

Reading notes on Professor Zhou Zhihua's "Machine Learning".

4.1 Basic Flow

A decision tree consists of a root node, several internal nodes, and several leaf nodes. The leaf nodes correspond to decision results, while each of the other nodes corresponds to an attribute test; the samples contained in a node are partitioned into its child nodes according to the outcomes of that test. The root node contains the complete sample set, and the path from the root node to each leaf node corresponds to a sequence of decision tests.

Decision tree generation is a recursive process. In the basic decision tree algorithm, three situations cause the recursion to return: (1) all samples in the current node belong to the same class, so no further partitioning is needed; (2) the current attribute set is empty, or all samples take the same value on every remaining attribute, so the node cannot be partitioned; the node is marked as a leaf and its class is set to the majority class of the samples it contains (this uses the current node's posterior distribution); (3) the sample set contained in the current node is empty, so it cannot be partitioned; the node is marked as a leaf and its class is set to the majority class of the samples in its parent node (this uses the parent node's sample distribution as the prior distribution of the current node).
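As a rough illustration of this recursion (a minimal sketch, not the book's pseudocode verbatim), here is a Python version; samples are assumed to be (attribute-dict, label) pairs, and choose_best_attribute stands in for whichever split criterion from Section 4.2 is used:

    from collections import Counter

    def tree_generate(samples, attributes, choose_best_attribute, parent_majority=None):
        """Minimal sketch of recursive decision-tree generation.
        samples: list of (x, y) pairs, where x is a dict {attribute: value}.
        attributes: dict {attribute: list of possible values}.
        choose_best_attribute: callable picking the split attribute (Section 4.2)."""
        # Case (3): empty node -> leaf labelled with the parent's majority class.
        if not samples:
            return parent_majority
        labels = [y for _, y in samples]
        majority = Counter(labels).most_common(1)[0][0]
        # Case (1): all samples share one class -> leaf with that class.
        if len(set(labels)) == 1:
            return labels[0]
        # Case (2): no attributes left, or all samples identical on the remaining attributes.
        if not attributes or all(
            len({x[a] for x, _ in samples}) == 1 for a in attributes
        ):
            return majority
        # Otherwise: split on the best attribute and recurse on each possible value.
        best = choose_best_attribute(samples, attributes)
        remaining = {a: v for a, v in attributes.items() if a != best}
        node = {best: {}}
        for value in attributes[best]:
            subset = [(x, y) for x, y in samples if x[best] == value]
            node[best][value] = tree_generate(subset, remaining, choose_best_attribute, majority)
        return node

Here choose_best_attribute is where the criteria of Section 4.2 (information gain, gain ratio, or Gini index) plug in.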

4.2 Split Selection

We want the branch nodes of the decision tree to contain samples that belong, as far as possible, to the same class; in other words, we want the "purity" of the nodes to be as high as possible.

Information gain

The "Information entropy" information entropy is the most commonly used metric for measuring the purity of a sample set, assuming that the fraction of the K-class sample in the current sample set D is P_k (k=1,2,..., |y|), the information entropy of D is defined as

The smaller the value of Ent(D), the higher the purity of D.

The "information gain" obtained by dividing the sample set D with attribute a information gain:

In general, the greater the information gain, the greater the "purity improvement" obtained by partitioning with attribute a, so we choose the attribute with the largest information gain as the partitioning attribute.
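A minimal Python sketch of these two quantities, assuming the same (attribute-dict, label) sample representation as in the sketch above:

    import math
    from collections import Counter

    def entropy(labels):
        """Ent(D) = -sum_k p_k * log2(p_k) over the class proportions in `labels`."""
        total = len(labels)
        return -sum(
            (n / total) * math.log2(n / total) for n in Counter(labels).values()
        )

    def information_gain(samples, attribute):
        """Gain(D, a): entropy of D minus the weighted entropy of the subsets D^v.
        samples: list of (x, y) pairs with x a dict {attribute: value}."""
        labels = [y for _, y in samples]
        subsets = {}
        for x, y in samples:
            subsets.setdefault(x[attribute], []).append(y)
        weighted = sum(
            len(sub) / len(samples) * entropy(sub) for sub in subsets.values()
        )
        return entropy(labels) - weighted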

Gain ratio

If the sample ID (a number that is unique to each sample) is used as a candidate attribute, every branch node will contain a single sample, and the purity of these branch nodes is already maximal; yet such a decision tree has no generalization ability and cannot predict new samples effectively. This shows that the information gain criterion has a preference for attributes with a large number of possible values.

The well-known C4.5 decision tree algorithm therefore uses the "gain ratio" to select the optimal partitioning attribute.
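For reference, the gain ratio divides the information gain by the "intrinsic value" IV(a) of the attribute (the definition below follows the book up to notation):

    Gain_ratio(D, a) = Gain(D, a) / IV(a),   IV(a) = - \sum_{v=1}^{V} (|D^v| / |D|) \log_2 (|D^v| / |D|)

IV(a) grows with the number of possible values of a, which is what offsets the information gain criterion's bias toward many-valued attributes.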

The gain-ratio criterion, however, has a preference for attributes with a smaller number of possible values, so the C4.5 algorithm does not simply select the candidate attribute with the largest gain ratio. Instead it uses a heuristic: first identify, among the candidate partitioning attributes, those whose information gain is above average, and then select from these the one with the highest gain ratio.
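A tiny Python sketch of this two-step heuristic, assuming the information gains and gain ratios of the candidate attributes have already been computed (for example with helpers like those sketched above):

    def c45_select(gains, gain_ratios):
        """C4.5's heuristic attribute selection.
        gains / gain_ratios: dicts mapping attribute name -> precomputed value.
        Step 1: keep only attributes whose information gain is at least average.
        Step 2: among those, return the attribute with the highest gain ratio."""
        average_gain = sum(gains.values()) / len(gains)
        candidates = [a for a in gains if gains[a] >= average_gain]
        return max(candidates, key=lambda a: gain_ratios[a])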

Gini index

CART (Classification and Regression Tree) is a well-known decision tree learning algorithm that can be used for both classification and regression. CART selects the partitioning attribute using the "Gini index". The purity of a data set D can be measured by its Gini value:

    Gini(D) = \sum_{k=1}^{|Y|} \sum_{k' \ne k} p_k p_{k'} = 1 - \sum_{k=1}^{|Y|} p_k^2

Gini(D) reflects the probability that two samples drawn at random from D have inconsistent class labels; hence the smaller Gini(D) is, the higher the purity of D.

The Gini index of attribute a is defined as

    Gini_index(D, a) = \sum_{v=1}^{V} (|D^v| / |D|) Gini(D^v)

From the candidate attribute set A, we select the attribute whose Gini index after partitioning is the smallest as the optimal partitioning attribute, i.e.

    a_* = \arg\min_{a \in A} Gini_index(D, a)
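A minimal Python sketch of this Gini-based selection, again assuming samples are (attribute-dict, label) pairs:

    from collections import Counter

    def gini(labels):
        """Gini(D) = 1 - sum_k p_k^2 over the class proportions in `labels`."""
        total = len(labels)
        return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

    def gini_index(samples, attribute):
        """Gini_index(D, a): |D^v|/|D|-weighted Gini value of the subsets D^v."""
        subsets = {}
        for x, y in samples:
            subsets.setdefault(x[attribute], []).append(y)
        return sum(len(sub) / len(samples) * gini(sub) for sub in subsets.values())

    def best_attribute_by_gini(samples, attributes):
        """Pick the attribute with the smallest Gini index (CART's criterion)."""
        return min(attributes, key=lambda a: gini_index(samples, a))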

4.3 Pruning

Pruning is the main means by which decision tree learning algorithms deal with "overfitting": when a tree has too many branches, it may treat peculiarities of the training set as general properties of all data, which leads to overfitting. "Pre-pruning" and "post-pruning" are the two basic pruning strategies. Pre-pruning evaluates each node during tree generation: if partitioning the current node would not improve the generalization performance of the decision tree, the partitioning is stopped and the node is marked as a leaf. Post-pruning first grows a complete decision tree from the training set and then examines the non-leaf nodes from the bottom up: if replacing the subtree rooted at a node with a leaf node would improve the generalization performance of the tree, that subtree is replaced with a leaf.

How do we judge whether generalization performance improves? Randomly split the data into a training set and a validation set, choose the partitioning attribute by one of the criteria in the previous section, and let the validation-set accuracy before and after a split decide whether to keep it. Pre-pruning decides, based on the validation results, whether to continue partitioning; it reduces the risk of overfitting and significantly reduces the training and testing time of the decision tree. However, although the current split may not improve (or may even hurt) generalization performance, later splits built on top of it might improve performance significantly, so pre-pruning carries a risk of underfitting. Post-pruning first grows a complete decision tree from the training set and then decides, from the bottom up, whether to prune each non-leaf node. A post-pruned tree usually keeps more branches than a pre-pruned one; its generalization ability is generally better and its underfitting risk is smaller, but because post-pruning is performed only after a complete tree has been grown and must examine every non-leaf node bottom-up, its training time overhead is much larger than that of unpruned and pre-pruned decision trees.
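A rough, self-contained sketch of the post-pruning decision at a single node (assuming the dict-based tree representation used in the earlier sketches, and a validation set of (x, y) pairs that reach this node):

    from collections import Counter

    def predict(tree, x):
        """Follow attribute tests in a dict-based tree {attr: {value: subtree}} until a label is reached."""
        while isinstance(tree, dict):
            attr = next(iter(tree))
            tree = tree[attr].get(x[attr])   # unseen value -> None, counted as a miss
        return tree

    def maybe_prune(subtree, node_train_labels, node_val_samples):
        """Post-pruning step for one non-leaf node: replace the subtree with a
        majority-class leaf if validation accuracy does not decrease.
        node_train_labels: training labels of the samples that reach this node.
        node_val_samples: validation (x, y) pairs that reach this node."""
        if not node_val_samples:
            return subtree
        leaf_label = Counter(node_train_labels).most_common(1)[0][0]
        leaf_correct = sum(y == leaf_label for _, y in node_val_samples)
        subtree_correct = sum(predict(subtree, x) == y for x, y in node_val_samples)
        # Prune when the single leaf does at least as well as the whole subtree.
        return leaf_label if leaf_correct >= subtree_correct else subtree

Pre-pruning applies the same kind of comparison, but before making the split during tree generation.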

4.4 Continuous and missing values

Continuous attributes are usually handled by bi-partition (dichotomy): the candidate split points are the midpoints of adjacent attribute values, and the attribute's range is split into two intervals at the chosen point. Unlike a discrete attribute, a continuous attribute that has been used as the partitioning attribute of the current node can still be used as a partitioning attribute in its descendant nodes.
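A minimal Python sketch of choosing a bi-partition threshold for one continuous attribute by information gain (the entropy helper is repeated so the snippet is self-contained):

    import math
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

    def best_threshold(values, labels):
        """Return (best_gain, best_threshold) for splitting a continuous attribute.
        Candidate thresholds are midpoints of adjacent sorted attribute values."""
        pairs = sorted(zip(values, labels))
        base = entropy(labels)
        best = (-1.0, None)
        for i in range(len(pairs) - 1):
            if pairs[i][0] == pairs[i + 1][0]:
                continue                      # identical values give no new midpoint
            t = (pairs[i][0] + pairs[i + 1][0]) / 2.0
            left = [y for v, y in pairs if v <= t]
            right = [y for v, y in pairs if v > t]
            gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
            best = max(best, (gain, t))
        return best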

Missing-value handling concerns samples with incomplete attribute values and raises two questions:

(1) How do we select the partitioning attribute when some samples have missing values on the candidate attributes?

(2) Given the partitioning attribute, how do we partition a sample whose value on that attribute is missing?
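For question (1), the idea (as in C4.5, stated here up to notation) is to compute the information gain only on the subset \tilde{D} of samples whose value on attribute a is not missing, and to scale it by the proportion \rho of such samples:

    Gain(D, a) = \rho \times Gain(\tilde{D}, a) = \rho \times ( Ent(\tilde{D}) - \sum_{v=1}^{V} \tilde{r}_v Ent(\tilde{D}^v) )

where \rho is the (weighted) proportion of samples with no missing value on a, and \tilde{r}_v is the (weighted) proportion of those samples that take value a^v.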

For question (2): if sample x's value on the partitioning attribute a is unknown, x is placed into every child node simultaneously, and its weight in the child node corresponding to attribute value a = a^v is adjusted to \tilde{r}_v \cdot w_x (where w_x is the current weight of x).

4.5 Multivariate Decision Trees

In a multivariate decision tree, a non-leaf node no longer tests a single attribute; instead, it tests a linear combination of attributes.
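Concretely, each non-leaf node carries a linear test of the form below, where a_i is the value of a sample on the i-th attribute and the weights w_i and threshold t are learned from the samples at that node:

    \sum_{i=1}^{d} w_i a_i \le t

Such splits correspond to oblique decision boundaries rather than the axis-parallel boundaries produced by single-attribute tests.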

Advantages of decision trees: computation is simple, the results are easy to interpret, samples with missing attribute values can be handled, and irrelevant features can be tolerated.

Disadvantages: prone to overfitting; random forests are often used as a follow-up to reduce overfitting.
