Supervised Learning: Classification Decision Trees (1)



A decision tree is a basic method for classification and regression. The tree structure can be viewed as a set of if-else rules. Its main advantages are that the resulting classifier is readable and classification is fast. Decision tree learning usually involves three steps: feature selection, decision tree generation, and decision tree pruning.


A decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes (each representing a feature or attribute) and leaf nodes (each representing a class, i.e., a decision result). Decision tree learning works on a sample set: the goal is to build a decision tree model from a given training dataset that can correctly classify the test set and new instances. By default, the training set and the test set are assumed to follow the same or a similar probability distribution.
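As a toy illustration of the "set of if-else rules" view (the feature names, threshold, and classes below are made up for this example and are not taken from any dataset in the text), a learned tree can be read as nested conditionals:

def classify(sample):
    # Each internal node tests one feature or attribute; each leaf returns a class.
    if sample["outlook"] == "sunny":      # internal node
        if sample["humidity"] > 75:       # internal node
            return "no"                   # leaf node (class)
        return "yes"                      # leaf node (class)
    return "yes"                          # leaf node (class)

print(classify({"outlook": "sunny", "humidity": 80}))  # -> no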


Selecting the optimal decision tree from all possible decision trees is an NP-complete problem, so a heuristic method is usually used to approximately solve this optimization problem; the resulting decision tree is therefore sub-optimal.

Feature selection means choosing the features that have the strongest ability to classify the training data. The commonly used selection criteria are information gain and the information gain ratio, both of which are defined in terms of information entropy.

1) Information Entropy

Information theory models communication as follows: the messages sent by the source are denoted U, the messages received by the sink are denoted V, and the channel model is a conditional probability matrix P (U | V).

Before actual communication takes place, the sink cannot know which message the source will send; the sink is said to be uncertain about the state of the source. Because this uncertainty exists before communication, it is called the prior uncertainty. The uncertainty that remains after the message is received is called the posterior uncertainty.



Information refers to the elimination of uncertainty.

Shannon borrowed the concept of entropy from thermodynamics and called the average amount of information in a message, with redundancy removed, "information entropy", and gave a mathematical expression for computing it. The greater the uncertainty of a variable, the greater its entropy, and the more information is required to pin the variable down.


Information entropy is the measure of information used in information theory. The more ordered a system is, the lower its information entropy; conversely, the more chaotic a system is, the higher its information entropy. Information entropy can therefore be regarded as a measure of the degree of order of a system.
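As a minimal sketch of this idea in Python (assuming the standard Shannon entropy formula with base-2 logarithms; the label counts below are made up for illustration), the empirical entropy of a set of class labels can be computed as follows:

from collections import Counter
from math import log2

def entropy(labels):
    # H(D) = -sum_k p_k * log2(p_k), where p_k is the frequency of class k in D.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# A made-up label set with 9 positive and 5 negative examples.
labels = ["yes"] * 9 + ["no"] * 5
print(entropy(labels))  # about 0.940 bits: closer to 1 means a more "chaotic" binary split

A perfectly pure set (all labels identical) gives entropy 0, the most "ordered" case.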


2) Information Gain

Information entropy, also known as prior entropy, is the mathematical expectation of the amount of information before a message is sent; posterior entropy is the mathematical expectation of the amount of information, from the sink's point of view, after the message is received. In general, the prior entropy is greater than the posterior entropy, and the difference between them is the so-called information gain, which reflects the degree to which the information eliminates random uncertainty.

The information gain in decision tree learning is equivalent to the mutual information between the classes and a feature in the training set: it measures how much feature A reduces the uncertainty about the classification of the training set. The information gain g(D, A) of feature A on training set D is defined as the difference between the empirical entropy H(D) of set D and the empirical conditional entropy H(D | A) of D given feature A:

g(D, A) = H(D) - H(D | A)
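A minimal, self-contained Python sketch of this definition (the entropy helper repeats the one above; the toy feature values and labels are made up for illustration):

from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    # g(D, A) = H(D) - H(D|A), where H(D|A) is the entropy of the subsets of D
    # induced by the values of feature A, weighted by subset size.
    n = len(labels)
    subsets = defaultdict(list)
    for v, y in zip(feature_values, labels):
        subsets[v].append(y)
    cond_entropy = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - cond_entropy

# Made-up example: a single feature with two values and binary class labels.
feature = ["a", "a", "a", "b", "b", "b"]
labels  = ["yes", "yes", "no", "no", "no", "no"]
print(info_gain(feature, labels))  # about 0.459: uncertainty removed by splitting on this feature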


3) Information Gain Ratio

The information gain value is relative to the training set, and has no absolute significance.

For example, a train can accelerate from 10 m/s to 100 m/s in 9 s, while a motorcycle can accelerate from 1 m/s to 11 m/s in 1 s. Although the motorcycle's speed after acceleration is not as high as the train's, its acceleration capability is just as good as the train's: what matters is the improvement relative to the starting point, not the absolute value.

When the empirical entropy of the training set is large, the information gain value tends to be large as well, and when the empirical entropy is small, the information gain value tends to be small, so the raw gain does not faithfully reflect how much the uncertainty is really reduced. The information gain ratio can be used to correct this problem.

The information gain ratio gainRatio(D, A) of feature A on training set D is defined as the ratio of the information gain g(D, A) to the empirical entropy H(D) of the training dataset D:

gainRatio(D, A) = g(D, A) / H(D)
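A minimal sketch of this ratio, following the definition given here (which divides by H(D); note that some descriptions of C4.5 normalize by the entropy of feature A's own value distribution instead, so treat the denominator as this text's convention; the toy data is made up):

from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    n = len(labels)
    subsets = defaultdict(list)
    for v, y in zip(feature_values, labels):
        subsets[v].append(y)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())

def gain_ratio(feature_values, labels):
    # As defined in the text: information gain divided by the empirical entropy H(D).
    return info_gain(feature_values, labels) / entropy(labels)

feature = ["a", "a", "a", "b", "b", "b"]
labels  = ["yes", "yes", "no", "no", "no", "no"]
print(gain_ratio(feature, labels))  # 0.5: half of the original uncertainty is removed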




ID3 algorithm for Decision Tree Learning

ID3 is one kind of decision tree algorithm. Before studying ID3, it helps to know the idea of Occam's Razor (Ockham's Razor), proposed by William of Ockham (c. 1285-1349), a 14th-century logician and Franciscan friar, who wrote in his commentary on the Sentences: "Do not waste more to do what can be done just as well with less." Put simply: keep it simple. Hence, a smaller decision tree is preferable to a larger one (the "be simple" principle).

ID3 algorithm implementation ideas (a minimal sketch in code follows the list below):

1) Traverse the space of decision trees top-down in a greedy fashion to construct the tree;

2) Evaluate how well each individual feature attribute classifies the training sample set, and select the attribute with the strongest classification ability (the best attribute) as the root node of the decision tree; this attribute is more discriminative than the others.

3) For each value of the root node's attribute that may produce a branch, traverse the sample set again and partition it into the corresponding branches.

4) Recursively repeat steps 2) and 3) on each branch until a leaf node appears, that is, until only a single prediction result remains under the branch.
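A minimal, self-contained sketch of this recursion under simplifying assumptions (discrete feature values, no pruning, majority vote when features run out; the dataset, feature names, and dict-based tree representation are made up for illustration):

from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    n = len(labels)
    subsets = defaultdict(list)
    for row, y in zip(rows, labels):
        subsets[row[feature]].append(y)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())

def id3(rows, labels, features):
    # Leaf: all samples share one class, or no features remain -> majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the feature with the largest information gain as this node.
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    branches = defaultdict(lambda: ([], []))
    for row, y in zip(rows, labels):
        branches[row[best]][0].append(row)
        branches[row[best]][1].append(y)
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    for value, (sub_rows, sub_labels) in branches.items():
        tree[best][value] = id3(sub_rows, sub_labels, remaining)
    return tree

# Toy dataset (made up): decide whether to play based on the weather.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "yes", "no"]
print(id3(rows, labels, ["outlook", "windy"]))
# -> {'outlook': {'sunny': 'yes', 'rainy': {'windy': {'no': 'yes', 'yes': 'no'}}}}

Each nested dict level corresponds to one internal node (the selected best attribute), and each string value is a leaf carrying the predicted class.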

ID3 is equivalent to choosing a probability model using the maximum likelihood method.


C4.5 algorithm: an improvement of ID3

Since C4.5 is an improved ID3 algorithm, what are the improvements made by C4.5 compared to ID3?

1) It uses the information gain ratio to select attributes. ID3 selects the splitting attribute using the information gain of the subtree; many measures can be used to define "information" here, and ID3 uses entropy (an impurity criterion), that is, the change in entropy, while C4.5 uses the information gain ratio. In short, the difference is information gain versus information gain ratio.

2) Pruning is performed during tree construction. When a decision tree is being built, nodes that cover only a few samples are prone to overfitting.

3) It can also handle non-discrete (continuous) data, as well as incomplete data.

Decision trees are used when feature values are discrete; continuous features are usually discretized and treated as discrete features (many articles fail to make this key point about decision trees explicit). In practical applications, decision trees overfit rather seriously, so boosting is generally applied. When classifier performance is poor, the main reason is usually that the features are not discriminative enough rather than that the classifier itself is bad: good features give good classification results even with a weak classifier.
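As a minimal sketch of one common way to discretize a continuous feature (a binary split at the threshold that maximizes information gain, similar in spirit to what C4.5 does; the data below is made up):

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between sorted values; return the split with the largest gain."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best = (0.0, None)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        gain = base - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if gain > best[0]:
            best = (gain, t)
    return best  # (information gain, threshold)

# Made-up continuous feature (e.g. temperature) with binary labels.
temps  = [15, 18, 21, 24, 27, 30]
labels = ["no", "no", "no", "yes", "yes", "yes"]
print(best_threshold(temps, labels))  # -> (1.0, 22.5): a perfect binary split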
Pruning of Decision Trees

The decision tree generation algorithm recursively produces the tree until it cannot continue. The resulting tree is often very accurate on the training set, but not necessarily accurate when classifying unknown test data; this is the symptom of overfitting. Overfitting occurs because learning over-optimizes the correct classification of the training set and builds a decision tree that is too complex. To solve this problem, the decision tree must be simplified. A simple pruning algorithm learned from the book achieves this by minimizing the loss function of the decision tree.
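The loss function itself is not reproduced in this copy of the post; presumably (an assumption based on the standard cost-complexity formulation) it is

C_{\alpha}(T) = \sum_{t=1}^{|T|} N_t H_t(T) + \alpha |T|

where |T| is the number of leaf nodes of tree T, N_t is the number of training samples falling in leaf t, H_t(T) is the empirical entropy of leaf t, and α ≥ 0 trades off fit to the training data against the complexity of the tree.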



Tree pruning algorithm: input (the whole tree T produced by the generation algorithm, and the parameter α); output (the pruned subtree T_α).

1) Compute the empirical entropy of every node.

2) Recursively retract upward from the leaf nodes of the tree. Suppose the loss function values of the whole tree before and after a group of leaf nodes is retracted into its parent node are C_α(T_B) and C_α(T_A) respectively; if C_α(T_A) ≤ C_α(T_B), prune, i.e., make the parent node a new leaf node.

3) Repeat 2) until no further pruning is possible, obtaining the subtree T_α with the minimum loss function.

Can the whole algorithm be implemented with dynamic programming (DP)? (For further consideration.)
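A minimal sketch of the pruning test for a single parent node whose children are all leaves, assuming the loss function written above (the label lists and α values are made up; a full implementation would apply this test bottom-up over the whole tree):

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def should_prune(leaf_label_sets, alpha):
    """Compare the loss of keeping a parent's leaves with collapsing them into one leaf.
    Loss of a set of leaves: sum_t N_t * H_t + alpha * (number of leaves)."""
    loss_keep = sum(len(s) * entropy(s) for s in leaf_label_sets) + alpha * len(leaf_label_sets)
    merged = [y for s in leaf_label_sets for y in s]
    loss_prune = len(merged) * entropy(merged) + alpha * 1
    return loss_prune <= loss_keep

# Two nearly pure leaves under one parent: whether to prune depends on how much
# alpha penalizes the extra leaf.
leaves = [["yes", "yes", "yes", "no"], ["no", "no", "no", "yes"]]
print(should_prune(leaves, alpha=0.1))  # False: keep the split
print(should_prune(leaves, alpha=5.0))  # True: collapse into a single leaf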

Summary: this post covers some basic knowledge of decision tree learning. Code is needed to study it further; implementation is king. Came from: Yu meng_happy Road
