ML (4): Decision Tree algorithm

Among classification models, the two most widely used are the decision tree model and the naive Bayesian model (NBC). The decision tree model solves classification problems by constructing a tree: first, the training data set is used to build the decision tree, and once the tree is built, it can classify unknown samples. Using decision tree models for classification problems has many advantages:

    • Decision trees are easy to use and efficient;
    • Rules can easily be derived from a decision tree, and those rules are usually easy to interpret and understand (see the short example after this list);
    • Decision trees scale well to large databases, and the size of the tree is independent of the size of the database;
    • Another great advantage of the decision tree model is that it can be built for data sets with many attributes.
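
As a quick illustration of these points, here is a minimal sketch (assuming scikit-learn is available; the iris data set and the parameter values are only an example, not part of the original article) that trains a decision tree and prints it as readable rules:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Build the tree from a training set, then use it to classify samples.
    iris = load_iris()
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    clf.fit(iris.data, iris.target)

    # The fitted tree converts directly into human-readable if/else rules.
    print(export_text(clf, feature_names=iris.feature_names))
    print(clf.predict(iris.data[:3]))   # classify a few samples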

Compared with the decision tree model, the naive Bayesian model originates from classical mathematical theory, has a solid mathematical foundation and stable classification efficiency. At the same time, the NBC model has few parameters to estimate, is less sensitive to missing data, and the algorithm is simpler. In theory, the NBC model has the smallest error rate compared to other classification methods. In practice this is not always the case, because the NBC model assumes that the attributes are independent of each other, and this assumption often does not hold, which affects how well the NBC model classifies. When the number of attributes is large or the correlation between attributes is strong, the NBC model is less effective than the decision tree model; the NBC model performs best when the correlation between attributes is small.
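
The contrast can be checked empirically with a rough sketch like the one below (assuming scikit-learn; the breast-cancer data set is only an example and the exact scores will vary):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Cross-validated accuracy of naive Bayes vs. a decision tree on the same data.
    for name, model in [("naive Bayes", GaussianNB()),
                        ("decision tree", DecisionTreeClassifier(random_state=0))]:
        scores = cross_val_score(model, X, y, cv=5)
        print(name, scores.mean())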

A decision tree is a predictive model; it represents a mapping between object attributes and object values. Each internal node in the tree represents a test on an attribute, each branch represents a possible value of that attribute, and each leaf node represents the value of the objects described by the path from the root node to that leaf. A decision tree has only a single output; if you want complex output, you can build a separate decision tree for each output. Decision tree models also have some drawbacks, such as difficulty in dealing with missing data, over-fitting, and ignoring correlations between attributes in a data set.
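
A minimal hand-built sketch of this structure (the attributes and values below are made up purely for illustration) shows how a path of attribute tests ends in a class label at a leaf:

    # A tiny hand-built tree: each internal node tests an attribute,
    # each branch is one of its values, each leaf is a class label.
    tree = {
        "outlook": {
            "sunny": {"humidity": {"high": "no", "normal": "yes"}},
            "overcast": "yes",
            "rain": {"wind": {"strong": "no", "weak": "yes"}},
        }
    }

    def classify(node, sample):
        # Walk from the root to a leaf following the sample's attribute values.
        if not isinstance(node, dict):
            return node                      # leaf: the predicted class
        attribute = next(iter(node))         # attribute tested at this node
        branch = node[attribute][sample[attribute]]
        return classify(branch, sample)

    print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # -> "yes"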

A decision tree is a tree structure consisting mainly of a root node, branches and leaf nodes; each branch corresponds to a rule, and the tree is mainly used for classification. The main decision tree algorithms are ID3, C4.5 and CART, the most popular being the C4.5 algorithm. Every decision tree algorithm has to solve two main problems:

    • Which attribute should be selected for splitting?
    • When should the tree stop growing?

The reasons why C4.5 is popular are:

    • It uses the information gain ratio to select the splitting attribute;
    • It prunes the tree while it is being constructed;
    • It can handle continuous data and incomplete data.

Information gain

How do we decide which attribute gives the most appropriate split? This leads to the concept of "entropy". In short, entropy is a measure of disorder: the more disordered something is, the greater its entropy (which is why a desk, left to itself, tends toward mess). The opposite concept is "order", i.e. regularity. The more orderly, the purer, the smaller the entropy; the more disordered, the more impure, the greater the entropy. From a mathematical point of view, the formula for entropy is as follows:
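
    Entropy(t) = -\sum_{i=1}^{c} P(i|t) \log_2 P(i|t)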

where P(i|t) represents the proportion of class i at node t. To evaluate the result of splitting on a chosen attribute, we use the difference between the impurity of the parent node before splitting and the weighted impurity of the child nodes after splitting; this difference is called the information gain, i.e. the difference in entropy. The calculation formula is as follows:
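
    \Delta_{info} = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j)

Here N is the total number of records at the parent node and N(v_j) is the number of records assigned to child node v_j.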

I(parent) is the impurity of the parent node, and k is the number of child nodes produced by the split (one per attribute value). Decision tree induction usually selects the attribute that maximizes the information gain for splitting. But there is one drawback: impurity measures such as entropy and the Gini index tend to favor attributes with a large number of distinct values. Intuitively, a binary split is equivalent to merging several attribute values together, which can also be shown by formula. So how do we solve this problem? There are two ways: one is to restrict attribute tests to binary splits (this is the idea behind CART); the other is to modify the splitting criterion, which is why C4.5 proposed using the information gain ratio, which also takes into account the number of outputs produced by the attribute test. The formula is as follows:
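
    \text{Gain ratio} = \frac{\Delta_{info}}{\text{Split Info}}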

The formula for Split Info is as follows:
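
    \text{Split Info} = -\sum_{j=1}^{k} P(v_j) \log_2 P(v_j)

where P(v_j) is the fraction of records sent to child node v_j. To make these definitions concrete, here is a small sketch in plain Python/NumPy (the function and variable names are made up for illustration) that computes the entropy, information gain and gain ratio of a candidate split:

    import numpy as np

    def entropy(labels):
        # Entropy(t) = -sum_i P(i|t) * log2 P(i|t)
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def gain_ratio(labels, attribute_values):
        # Information gain: impurity of the parent minus the weighted
        # impurity of the child nodes produced by the split.
        n = len(labels)
        parent_impurity = entropy(labels)
        children_impurity = 0.0
        split_info = 0.0
        for v in np.unique(attribute_values):
            mask = attribute_values == v
            weight = mask.sum() / n            # P(v_j)
            children_impurity += weight * entropy(labels[mask])
            split_info -= weight * np.log2(weight)
        gain = parent_impurity - children_impurity
        return gain / split_info if split_info > 0 else 0.0

    # Toy example: split the class labels by one attribute's values.
    labels = np.array(["yes", "yes", "no", "no", "yes"])
    outlook = np.array(["sunny", "rain", "sunny", "rain", "rain"])
    print(gain_ratio(labels, outlook))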

The error of the model

    • In general, we divide the data into a training set and a test set: the model is trained on the training set and then evaluated on the test set. The error of the model on the training data is called the training error; the error of the model on the test data is called the generalization error, which is the expected error of the model on unknown records. A good model should have both a low training error and a low generalization error.
    • One of the most common situations is that the model has a small error on the training set but a large generalization error; this is called overfitting. There are two main reasons why a model overfits:
      1. Noise in the training data;
      2. The sample is not representative.
    • To address overfitting, we often prune the decision tree: first grow a large tree while building the model, and then prune it back based on the support it has in the data and on our understanding of the business (a small illustration follows this list).
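
As a rough sketch of this idea (assuming scikit-learn; the data set, the ccp_alpha value and the exact scores are only illustrative), a fully grown tree tends to fit the training data almost perfectly, while a pruned tree usually generalizes at least as well:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fully grown tree: training error near zero, generalization error larger.
    full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    # Cost-complexity pruned tree: slightly worse on training data, usually
    # comparable or better on unseen data.
    pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

    for name, model in [("full tree", full), ("pruned tree", pruned)]:
        print(name, "train:", model.score(X_train, y_train),
              "test:", model.score(X_test, y_test))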

References:
http://blog.csdn.net/x454045816/article/details/44726921
http://www.zgxue.com/167/1677903.html
