Classical classification algorithm--decision tree


A decision tree is built by a top-down, recursive method. The basic idea is to construct a tree using entropy as the measure of impurity, so that the entropy at each leaf node is zero, i.e. the instances in each leaf node all belong to the same class.

The advantage of a decision tree learning algorithm is that it learns the classification rules by itself: the user does not need much background knowledge, only a properly labeled set of examples. Decision trees are a supervised learning method; they infer a tree-shaped representation of classification rules from a set of disordered, irregular instances.

Building a decision tree

The key to building a decision tree is choosing which attribute to split on at the current node. Depending on the objective function used to score a split, there are three main algorithms for building decision trees (a short comparison sketch follows the list):

ID3 (information gain)
C4.5 (information gain ratio)
CART (Gini index)
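
A minimal sketch, assuming scikit-learn is available: its DecisionTreeClassifier is an optimized CART implementation, where criterion="entropy" gives entropy-based splits in the spirit of ID3/C4.5 and criterion="gini" uses the Gini index. The iris dataset and the train/test split are illustrative choices, not from the original text.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Compare the two splitting criteria on the same data.
    for criterion in ("entropy", "gini"):
        clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
        clf.fit(X_train, y_train)
        print(criterion, clf.score(X_test, y_test))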

Information gain measures how much the information provided by feature A reduces the uncertainty about the class of X. The information gain g(D, A) of feature A on the training dataset D is defined as the difference between the empirical entropy H(D) of the set D and the empirical conditional entropy H(D|A) of D given feature A: g(D, A) = H(D) - H(D|A). The feature with the greatest information gain is selected as the current splitting feature.

In the formula above, the empirical entropy is H(D) = -Σ_{k=1}^{K} (|C_k| / |D|) · log2(|C_k| / |D|), where D is the training dataset, |D| is the number of samples, there are K classes C_k, and |C_k| is the number of samples belonging to class C_k. Note: the empirical entropy is a summation over the class labels.

For the empirical conditional entropy, suppose feature A has n distinct values that partition D into n subsets D_1, ..., D_n, with |D_i| samples in subset D_i, and let D_ik be the set of samples in D_i that belong to class C_k, with |D_ik| samples. Then H(D|A) = Σ_{i=1}^{n} (|D_i| / |D|) · H(D_i) = -Σ_{i=1}^{n} (|D_i| / |D|) Σ_{k=1}^{K} (|D_ik| / |D_i|) · log2(|D_ik| / |D_i|). Note: the empirical conditional entropy is a summation over the values of the feature (attribute).
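
As a concrete illustration of these formulas, here is a minimal Python sketch; the helper names and the toy weather data are illustrative assumptions, not from the original text.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Empirical entropy H(D) = -sum_k (|C_k|/|D|) * log2(|C_k|/|D|)."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def conditional_entropy(feature_values, labels):
        """Empirical conditional entropy H(D|A) = sum_i (|D_i|/|D|) * H(D_i)."""
        n = len(labels)
        h = 0.0
        for v in set(feature_values):
            subset = [y for x, y in zip(feature_values, labels) if x == v]
            h += len(subset) / n * entropy(subset)
        return h

    def information_gain(feature_values, labels):
        """g(D, A) = H(D) - H(D|A)."""
        return entropy(labels) - conditional_entropy(feature_values, labels)

    # Toy example: split on "outlook" to predict whether to play (hypothetical data).
    outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast", "sunny", "rain"]
    play    = ["no",    "no",    "yes",      "yes",  "no",   "yes",      "yes",   "yes"]
    print(information_gain(outlook, play))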

Information gain ratio (used by C4.5): g_R(D, A) = g(D, A) / H_A(D), where H_A(D) = -Σ_{i=1}^{n} (|D_i| / |D|) · log2(|D_i| / |D|) is the entropy of D with respect to the values of feature A (the split information). Dividing by H_A(D) penalizes features with many distinct values.
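
A minimal sketch of the gain ratio, reusing the information_gain helper and the toy data from the previous snippet (all names are illustrative assumptions):

    from collections import Counter
    from math import log2

    def split_information(feature_values):
        """H_A(D): entropy of D with respect to the values of feature A."""
        n = len(feature_values)
        return -sum((c / n) * log2(c / n) for c in Counter(feature_values).values())

    def gain_ratio(feature_values, labels):
        """C4.5 gain ratio: g(D, A) / H_A(D)."""
        h_a = split_information(feature_values)
        return information_gain(feature_values, labels) / h_a if h_a > 0 else 0.0

    print(gain_ratio(outlook, play))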

Gini index (used by CART): for a node with class probabilities p_k, Gini(p) = Σ_{k=1}^{K} p_k (1 - p_k) = 1 - Σ_{k=1}^{K} p_k². For a sample set D, Gini(D) = 1 - Σ_{k=1}^{K} (|C_k| / |D|)². The smaller the Gini index, the purer the node. (The Gini index here is distinct from the Gini coefficient used in economics to measure income gaps.)
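
A minimal sketch of the Gini computation for a candidate split (helper names are illustrative assumptions); CART chooses the split with the smallest weighted Gini index.

    from collections import Counter

    def gini(labels):
        """Gini index of a sample set: 1 - sum_k (|C_k|/|D|)^2."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_after_split(feature_values, labels):
        """Weighted Gini index of the subsets produced by splitting on a feature."""
        n = len(labels)
        total = 0.0
        for v in set(feature_values):
            subset = [y for x, y in zip(feature_values, labels) if x == v]
            total += len(subset) / n * gini(subset)
        return total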

Pruning of decision trees

When a decision tree is grown on data containing noise and outliers, many branches reflect anomalies of the training data rather than real structure. Pruning methods are used to deal with this overfitting problem: they typically use statistical measures to cut off the least reliable branches.

General idea of pruning: start from the complete tree T_0, prune some nodes to obtain T_1, prune again to obtain T_2, and so on until only the root node is left; each of the resulting k+1 trees is then evaluated on a validation dataset, and the tree with the smallest loss function is selected.

Determining the pruning coefficient α:

1. Start from the original loss function C(T) = Σ_t N_t · H_t(T), the sum over the leaves t of T of the empirical entropy H_t(T) weighted by the number of samples N_t in each leaf;

2. The more leaf nodes |T_leaf| a tree has, the more complex it is and the greater the loss should be, so the loss is corrected to C_α(T) = C(T) + α · |T_leaf|:

When α = 0, the unpruned (full) decision tree has the smallest corrected loss;

When α → +∞, the single-node tree consisting of the root alone has the smallest corrected loss.

3. Suppose the subtree rooted at an internal node r is pruned: after pruning, only r itself is retained as a leaf and all of its descendants are deleted;

4. Consider the subtree T_r rooted at r:

The loss function after pruning (r becomes a single leaf): C_α(r) = C(r) + α

The loss function before pruning: C_α(T_r) = C(T_r) + α · |T_r|, where |T_r| is the number of leaves of T_r

5. Setting the two losses equal, C(r) + α = C(T_r) + α · |T_r|, and solving for α gives the pruning coefficient of node r: α = (C(r) - C(T_r)) / (|T_r| - 1).

Pruning algorithm:

1. Compute the pruning coefficient of every internal node;

2. Prune at the node with the smallest pruning coefficient, obtaining a new decision tree;

3. Repeat steps 1 and 2 until the decision tree has only one node;

4. This yields a sequence of decision trees T_0, T_1, ..., T_k;

5. Select the optimal subtree from this sequence using a validation sample set (a code sketch of this procedure follows).
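
This procedure corresponds to the cost-complexity pruning implemented in scikit-learn, which can serve as a minimal sketch of the steps above; the dataset, the train/validation split and the use of accuracy as the selection score are illustrative assumptions.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    # Steps 1-4: the effective alphas are the pruning coefficients; each value
    # of ccp_alpha corresponds to one tree in the nested subtree sequence.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

    # Step 5: pick the subtree that scores best on the validation set.
    best_alpha, best_score = None, -1.0
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
        score = tree.score(X_val, y_val)
        if score > best_score:
            best_alpha, best_score = alpha, score

    print("best alpha:", best_alpha, "validation accuracy:", best_score)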
