Machine Learning Classic Algorithms and Python Implementations: Decision Trees

Source: Internet
Author: User
Tags: ID3

(i) Understanding Decision Trees

1, The classification principle of decision trees

Surveys show that the decision tree is among the most frequently used data mining algorithms, and its concept is simple. One of the most important reasons decision tree algorithms are so popular is that the user does not need a deep understanding of machine learning, nor does the user have to delve into how the algorithm works. Intuitively, a decision tree classifier is like a flowchart composed of judgment blocks and terminating blocks: a terminating block represents a classification result (a leaf of the tree), while a judgment block represents a test on a feature value (the judgment block has as many branches as the feature has values).

If efficiency is not considered, cascading judgments over all of a sample's features will eventually assign the sample to some terminating block, i.e. a class. In fact, only some of a sample's features are decisive for classification, and constructing a decision tree means finding these decisive features and arranging them, by how decisive they are, into an inverted tree: the most decisive feature becomes the root node, then the decisive features of the sub-datasets in each branch are found recursively, until all the data in a child dataset belong to the same class. Therefore, constructing a decision tree is essentially a recursive process of partitioning datasets by their features, and the first question to address is which feature of the current dataset plays the decisive role in classifying the data.

To find the decisive feature and obtain the best split, we must evaluate every feature contained in the dataset and find the feature that best divides it. After this evaluation, the original dataset is split into several data subsets, which are distributed across all branches of the first decision point. If the data under a branch all belong to the same class, that branch is finished and becomes a leaf node whose class is determined. If the data within a subset are not of the same class, the splitting process is repeated on that subset. The algorithm for splitting a subset is the same as for the original dataset, and it continues until every subset (leaf node) contains data of a single class. For example, consider a decision tree whose target has two classes (meet or not meet) and whose samples each have four features: age, appearance, income, and whether the person is a civil servant.
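The recursive partitioning described above is easy to sketch in Python. The helper below is a minimal illustration (the function name split_dataset and the toy data are mine, not from the package introduced later): it extracts the samples whose chosen feature equals a given value and removes that feature column, which is exactly the operation the recursion repeats on every branch.

def split_dataset(dataset, axis, value):
    """Return the samples whose feature at index `axis` equals `value`,
    with that feature column removed."""
    subset = []
    for sample in dataset:
        if sample[axis] == value:
            subset.append(sample[:axis] + sample[axis + 1:])
    return subset

# Toy usage: split on feature 0 == 1 (the last item of each sample is its class).
data = [[1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no']]
print(split_dataset(data, 0, 1))   # [[1, 'yes'], [0, 'no']]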


2, The learning process of a decision tree

The generation of a decision tree is divided into the following three parts:

  • Feature selection: feature selection means choosing one feature from the many features in the training data as the split criterion of the current node. The many different quantitative criteria for choosing features give rise to different decision tree algorithms.

  • Decision tree generation: based on the selected feature evaluation criterion, child nodes are generated recursively from top to bottom, and the tree stops growing when the dataset can no longer be divided. A recursive structure is the easiest way to understand a tree structure.

  • Pruning: decision trees tend to overfit, so pruning is generally needed to reduce the size of the tree structure and alleviate overfitting. There are two kinds of pruning techniques: pre-pruning and post-pruning.

3, Three decision tree algorithms based on information theory

The guiding principle for partitioning a dataset is to make disordered data more ordered. If a training set has 20 features, which one should the split be based on? This must be judged quantitatively, and there are several quantification methods; one of them is measuring the split with information theory. The information-theoretic decision tree algorithms are ID3, CART, and C4.5, of which C4.5 and CART are derived from ID3.

The ID3 algorithm, invented by Ross Quinlan, is based on Occam's razor: a smaller decision tree is preferable to a larger one (the "be simple" principle). The ID3 algorithm evaluates and selects features by information gain, each time choosing the feature with the largest information gain to build a judgment block. ID3 can be used to divide nominal datasets. It does not prune; to eliminate overfitting, adjacent leaf nodes that do not yield much information gain can be cut and merged (for example, by setting an information gain threshold). Using information gain has a drawback: it is biased toward attributes with many values. In the training set, the more distinct values an attribute has, the more likely it is to be chosen as the splitting attribute, which is sometimes meaningless. ID3 also cannot handle continuously distributed features; hence the C4.5 algorithm.

C4.5 is an improved version of ID3 that inherits ID3's advantages. The C4.5 algorithm uses the information gain ratio to select attributes, which overcomes the bias toward many-valued attributes that arises when selecting attributes by information gain; it prunes during tree construction, can discretize continuous attributes, and can process incomplete data. The classification rules produced by C4.5 are easy to understand and accurate, but the algorithm is inefficient, because the dataset must be scanned and sorted several times during tree construction. Because of these repeated scans, C4.5 is only suitable for datasets that fit in memory.

The full name of the CART algorithm is Classification And Regression Tree. It uses the Gini index as the splitting criterion (choosing the feature with the smallest Gini index) and also includes a post-pruning operation. Although the ID3 and C4.5 algorithms mine as much information as possible when learning from the training set, the decision trees they grow become larger and larger. To simplify the decision tree and improve the efficiency of generating it, a decision tree algorithm that selects the test attribute by the Gini index was proposed.
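For reference, here is a minimal sketch of the Gini index that CART minimizes, computed over a list of class labels (the function name and example are illustrative, not taken from any specific library):

def gini(labels):
    """Gini index of a set of class labels: 1 - sum_k p_k^2."""
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

print(gini(['yes', 'yes', 'no', 'no', 'no']))   # 0.48; 0.0 would mean a pure node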

4, Pros and cons of decision trees

Decision trees are suitable for numerical and nominal data (discrete data whose variable takes values only from a finite target set); they can read a dataset and extract the rules contained in some of its columns. The decision tree model has many advantages for classification: it is not computationally complex, is easy to use, and is efficient; it can handle data with irrelevant features and can easily produce understandable rules, which are usually easy to explain. It also has drawbacks, such as difficulty handling missing data, a tendency to overfit, and ignoring correlations between attributes in the dataset.

    (ii) Mathematical principles of the ID3 algorithm

As mentioned above, C4.5 and CART evolved from ID3, so the ID3 algorithm is described here in detail to lay the groundwork.

1, The information theory basis of the ID3 algorithm

For the information theory basis of decision trees, see also "Decision Tree 1 - Modeling Process".

    (1) Information entropy

Information entropy: in probability theory, information entropy gives us a way of measuring uncertainty; it is used to measure the uncertainty of a random variable, and entropy is the expectation of information. If the things to be classified can be divided into $n$ classes $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$ respectively, then the entropy of $X$ is defined as:

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

From the definition, $0 \le H(X) \le \log_2 n$.

When a random variable takes only two values, that is, the distribution of $X$ is $P(X=1)=p$, $P(X=0)=1-p$, $0 \le p \le 1$, the entropy is $H(X) = -p\log_2 p - (1-p)\log_2(1-p)$.

The higher the entropy, the more mixed the data types are. The implication is that a variable with more possible outcomes (this has nothing to do with the specific values of the variable, only with the number of value types and their probabilities) carries a larger amount of information. Entropy is a very important concept in information theory, and many machine learning algorithms make use of it.
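The entropy formula translates directly into code. Below is a minimal sketch (the name calc_shannon_entropy and the toy dataset are illustrative) that computes the entropy of the class labels of a dataset whose last column is the class:

from math import log

def calc_shannon_entropy(dataset):
    """Shannon entropy of the class labels; each sample's last item is its class."""
    total = len(dataset)
    label_counts = {}
    for sample in dataset:
        label = sample[-1]
        label_counts[label] = label_counts.get(label, 0) + 1
    entropy = 0.0
    for count in label_counts.values():
        p = count / total
        entropy -= p * log(p, 2)
    return entropy

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(calc_shannon_entropy(data))   # ~0.9710 for the 2 'yes' / 3 'no' split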

    (2) Conditional entropy

Suppose there is a pair of random variables $(X, Y)$ with joint probability distribution $P(X=x_i, Y=y_j) = p_{ij}$, $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m$.

The conditional entropy $H(Y \mid X)$ represents the uncertainty of the random variable $Y$ given the random variable $X$. It is defined as the expectation over $X$ of the entropy of the conditional distribution of $Y$ given $X$:

$$H(Y \mid X) = \sum_{i=1}^{n} p_i \, H(Y \mid X = x_i), \qquad p_i = P(X = x_i)$$
    (3) Information gain

Information gain indicates the degree to which the uncertainty of $Y$ is reduced once the information of feature $X$ is known. It is defined as:

$$g(Y, X) = H(Y) - H(Y \mid X)$$

2, Derivation of the ID3 algorithm

(1) Information entropy of the classification system

Suppose the sample space of a classification system is $(D, Y)$, where $D$ is the set of samples (each with $m$ features) and $Y$ ranges over $n$ categories with possible values $C_1, C_2, \ldots, C_n$. The probability of each category is $P(C_1), P(C_2), \ldots, P(C_n)$. The entropy of this classification system is:

$$H(C) = -\sum_{i=1}^{n} P(C_i) \log_2 P(C_i)$$

For a discrete distribution, the probability $P(C_i)$ of category $C_i$ is the number of occurrences of that category divided by the total number of samples. For a continuous distribution, discretization into bins is usually required first.

    (2) Conditional entropy

According to the definition of conditional entropy, the conditional entropy in a classification system is the entropy of the class when a feature $X$ of the samples is fixed. Since feature $X$ may take the values $x_1, x_2, \ldots, x_n$, when computing the conditional entropy each value must be fixed in turn, and the expectation is then taken.

Let $p_i = P(X = x_i)$ be the probability that feature $X$ takes the value $x_i$, and let $H(C \mid X = x_i)$ be the conditional information entropy when the feature is fixed to the value $x_i$. Then $H(C \mid X)$, the conditional entropy of the classification system given feature $X$ (with values $x_1, x_2, \ldots, x_n$), is:

$$H(C \mid X) = \sum_{i=1}^{n} p_i \, H(C \mid X = x_i)$$
If the feature has only two values ($x_1 = 0$, $x_2 = 1$), corresponding to "does not appear" and "appears" (such as the occurrence of a word in text classification), we write the feature as $T$, with $t$ meaning it appears and $\bar{t}$ meaning it does not. Then:

$$H(C \mid T) = P(t)\, H(C \mid t) + P(\bar{t})\, H(C \mid \bar{t})$$

Compared with the general conditional entropy formula, $P(t)$ is the probability that $t$ appears and $P(\bar{t})$ the probability that it does not. Combined with the information entropy formula, we get:

$$H(C \mid T) = -P(t) \sum_{i=1}^{n} P(C_i \mid t) \log_2 P(C_i \mid t) - P(\bar{t}) \sum_{i=1}^{n} P(C_i \mid \bar{t}) \log_2 P(C_i \mid \bar{t})$$

The probability $P(t)$ that feature $t$ appears is the number of samples in which $t$ appears divided by the total number of samples, and $P(C_i \mid t)$, the probability of category $C_i$ given that $t$ appears, is the number of samples that contain $t$ and belong to category $C_i$ divided by the number of samples that contain $t$.

    (3) Information gain

According to the information gain formula, the information gain of feature $X$ in the classification system is: $\text{Gain}(D, X) = H(C) - H(C \mid X)$

Information gain is defined with respect to one feature: it looks at a feature $X$ and measures the amount of information the system has with it and without it; the difference between the two is the information gain that the feature brings to the system. Selecting a feature at each step means computing the information gain obtained by splitting the dataset on each feature, and then choosing the feature with the highest information gain.
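A minimal sketch of ID3's selection step, reusing the calc_shannon_entropy and split_dataset helpers sketched earlier (the function name is illustrative and this is not the API of the package described later):

def choose_best_feature_id3(dataset):
    """Index of the feature with the highest information gain."""
    num_features = len(dataset[0]) - 1             # last column is the class
    base_entropy = calc_shannon_entropy(dataset)   # H(C)
    best_gain, best_feature = 0.0, -1
    for axis in range(num_features):
        values = set(sample[axis] for sample in dataset)
        conditional_entropy = 0.0                  # H(C|X)
        for value in values:
            subset = split_dataset(dataset, axis, value)
            prob = len(subset) / len(dataset)
            conditional_entropy += prob * calc_shannon_entropy(subset)
        gain = base_entropy - conditional_entropy  # Gain(D, X) = H(C) - H(C|X)
        if gain > best_gain:
            best_gain, best_feature = gain, axis
    return best_feature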

For a two-valued feature, the information gain that feature $T$ brings to the system can be written as the difference between the system's original entropy and the conditional entropy after fixing $T$:

$$\text{Gain}(D, T) = H(C) - H(C \mid T)$$
(4) After one round of information gain computation as above, one feature is chosen as the root node of the decision tree. The root node has as many branches as the feature has values, and each branch produces a new data subset $D_k$. The rest of the construction repeats the same procedure recursively on each $D_k$, until every child dataset belongs to a single class.

During decision tree construction, the following can happen: all features have been exhausted as split features, but some subsets are still not pure (the elements in the subset do not all belong to the same class). In this case, since no more information is available, a "majority vote" is generally taken on these subsets: the most frequent class in the subset is used as the node's class, and the node becomes a leaf node.
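The majority vote itself is one line with collections.Counter (a minimal illustrative helper):

from collections import Counter

def majority_class(class_list):
    """Most frequent class label; used when all split features are exhausted."""
    return Counter(class_list).most_common(1)[0][0]

print(majority_class(['no', 'yes', 'no', 'no']))   # 'no'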

(iii) The C4.5 Algorithm

1, Choosing the best feature by the information gain ratio

When information gain is used for the splitting decision, it is biased toward features with many values. To solve this problem, a splitting criterion based on the information gain ratio was developed, namely C4.5. Both C4.5 and ID3 use a greedy algorithm, but the basis of the splitting decision differs.

Therefore, the C4.5 algorithm is identical to ID3 in structure and recursion; the difference is that it chooses the feature with the largest information gain ratio when selecting the decision feature.

The information gain ratio is defined from the gain measure $\text{Gain}(D, X)$ of the ID3 algorithm and a split information measure $\text{SplitInformation}(D, X)$. $\text{SplitInformation}(D, X)$ is the entropy of the feature $X$ itself over the sample space ($X$ takes the values $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$, where $p_k$ is the number of samples taking value $x_k$ divided by the total number of samples):

$$\text{SplitInformation}(D, X) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_n \log_2 p_n$$

$$\text{GainRatio}(D, X) = \frac{\text{Gain}(D, X)}{\text{SplitInformation}(D, X)}$$

Selecting attributes by information gain in ID3 is biased toward attributes with many branches, that is, attributes with many values; in C4.5, dividing by $\text{SplitInformation}(D, X) = H(X)$ weakens this effect.
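The corresponding C4.5 selection step can be sketched by adding the SplitInformation term to the ID3 loop above (again reusing the earlier helpers and the math.log import; names are illustrative):

def choose_best_feature_c45(dataset):
    """Index of the feature with the highest information gain ratio."""
    num_features = len(dataset[0]) - 1
    base_entropy = calc_shannon_entropy(dataset)
    best_ratio, best_feature = 0.0, -1
    for axis in range(num_features):
        values = set(sample[axis] for sample in dataset)
        conditional_entropy, split_info = 0.0, 0.0
        for value in values:
            subset = split_dataset(dataset, axis, value)
            prob = len(subset) / len(dataset)
            conditional_entropy += prob * calc_shannon_entropy(subset)
            split_info -= prob * log(prob, 2)       # SplitInformation(D, X) = H(X)
        if split_info == 0.0:                       # feature takes a single value
            continue
        ratio = (base_entropy - conditional_entropy) / split_info
        if ratio > best_ratio:
            best_ratio, best_feature = ratio, axis
    return best_feature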

2, Handling continuous numerical features

C4.5 can handle both discrete and continuous attributes. When selecting a branching attribute at a node, C4.5 treats discrete attributes the same way as ID3 does. For continuously distributed features, the processing is as follows:

The continuous attribute is converted into a discrete attribute before processing. Although the attribute values are continuous, for a finite sample they are discrete: if there are N sampled values, there are N-1 possible discretizations, each sending values <= v_j to the left subtree and values > v_j to the right subtree, and the gain is computed for each of the N-1 candidate cuts to pick the best one. Furthermore, if the values are sorted first, a cut point needs to be considered only where the class label changes, which significantly reduces the computation. It has been shown that using the information gain when deciding the split point of a continuous feature, but the gain ratio when selecting the attribute, both suppresses the bias toward continuous-valued attributes and still chooses the best splitting feature.

In C4.5, a continuous attribute is processed as follows (a code sketch follows the list below):

1. Sort the values of the feature.

2. Take the midpoint between each pair of adjacent feature values as a candidate split point, divide the dataset into two parts at that point, and compute the information gain (InfoGain) of each candidate split point. An optimization is to evaluate only those values at which the class label changes.

3. Correct the information gain at each split point by subtracting log2(N-1)/|D|, where N is the number of candidate values and |D| the number of samples.

4. Select the split point with the largest corrected information gain as the best split point of the feature.

5. Compute the information gain ratio (Gain Ratio) at the best split point as the gain ratio of the feature.

6. Select the feature with the largest gain ratio as the splitting attribute.
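A minimal sketch of steps 1-4 for a single continuous feature, reusing calc_shannon_entropy and the math.log import from earlier (the function name and the exact placement of the log2(N-1)/|D| correction follow the description above and are illustrative):

def best_split_point(values, labels):
    """Best threshold for one continuous feature by corrected information gain."""
    pairs = sorted(zip(values, labels))
    dataset = [[v, y] for v, y in pairs]            # one-feature dataset
    base_entropy = calc_shannon_entropy(dataset)
    n = len(pairs)
    best_gain, best_threshold = -1.0, None
    for i in range(n - 1):
        if pairs[i][1] == pairs[i + 1][1]:          # class unchanged: no cut needed
            continue
        threshold = (pairs[i][0] + pairs[i + 1][0]) / 2.0
        left = [s for s in dataset if s[0] <= threshold]
        right = [s for s in dataset if s[0] > threshold]
        cond = (len(left) / n) * calc_shannon_entropy(left) \
             + (len(right) / n) * calc_shannon_entropy(right)
        gain = base_entropy - cond - log(n - 1, 2) / n   # corrected InfoGain
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain

print(best_split_point([1.0, 2.0, 3.0, 4.0], ['no', 'no', 'yes', 'yes']))  # (2.5, ...)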

    TBD

3, Pruning

Reference: http://blog.sina.com.cn/s/blog_7399ad1f010153ap.html

Pruning avoids uncontrolled growth of the tree and overfitting of the data. There are two common pruning methods for decision trees: pre-pruning and post-pruning.

(1) Pre-pruning. The most direct pre-pruning method is to set a maximum depth for decision tree growth, so that the tree cannot grow fully. The core problem of pre-pruning is how to specify the maximum depth in advance: if it is set inappropriately, the growth of the tree is too restricted, the rules expressed by the decision tree become too general, and new datasets cannot be well classified or predicted. Besides fixing the maximum depth in advance, another way to implement pre-pruning is to apply a test to the sample set of the current node: if the number of samples in the set is smaller than a pre-specified minimum, stop growing that node and turn it into a leaf node; otherwise continue to expand it.
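The two pre-pruning tests just described amount to a simple stopping rule checked before each recursive split; a minimal sketch with illustrative threshold values:

def should_stop(dataset, depth, max_depth=5, min_samples=2):
    """Pre-pruning: stop growing (and make a leaf) if the depth limit is reached
    or the current node holds fewer samples than the allowed minimum."""
    return depth >= max_depth or len(dataset) < min_samples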

(2) Post-pruning. Post-pruning means growing the decision tree fully first and then, according to certain rules and criteria, cutting off subtrees that are not generally representative and replacing them with leaf nodes, forming a smaller new tree. Among decision tree methods, CART, ID3, and C4.5 mainly adopt post-pruning. Post-pruning is a prune-and-test process; the general rule is: while repeatedly pruning the decision tree, use the original sample set or a new dataset as test data, test the accuracy of the decision tree on the test data, and compute the corresponding error rate; if cutting off a subtree does not degrade the predictive accuracy (or another chosen measure) on the test data, then the subtree is cut off.
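The prune-and-test rule in the paragraph above (collapse a subtree to its majority class whenever that does not hurt accuracy on the test data) can be sketched as follows. The nested-dict tree format {feature name: {value: subtree or label}}, the helper names, and the reuse of the majority_class helper from earlier are illustrative assumptions; C4.5's actual pessimistic pruning criterion is not reproduced here.

def classify(tree, feature_names, sample):
    """Walk a nested-dict tree of the form {feature: {value: subtree-or-label}}."""
    if not isinstance(tree, dict):
        return tree
    feature = next(iter(tree))
    value = sample[feature_names.index(feature)]
    return classify(tree[feature][value], feature_names, sample)

def error_count(tree_or_label, feature_names, test_data):
    """Number of misclassified test samples (each sample's last item is its class)."""
    errors = 0
    for sample in test_data:
        prediction = (classify(tree_or_label, feature_names, sample)
                      if isinstance(tree_or_label, dict) else tree_or_label)
        if prediction != sample[-1]:
            errors += 1
    return errors

def prune(tree, feature_names, test_data):
    """Replace a subtree with its majority class when that does not increase
    the error on the test data (reduced-error style post-pruning)."""
    if not isinstance(tree, dict) or not test_data:
        return tree
    feature = next(iter(tree))
    axis = feature_names.index(feature)
    for value in tree[feature]:
        subset = [s for s in test_data if s[axis] == value]
        tree[feature][value] = prune(tree[feature][value], feature_names, subset)
    leaf = majority_class([s[-1] for s in test_data])
    if error_count(leaf, feature_names, test_data) <= error_count(tree, feature_names, test_data):
        return leaf
    return tree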

The C4.5 algorithm uses post-pruning; the specific pruning condition is TBD.

(iv) Python Implementation of the ID3 and C4.5 Decision Trees

Python does not need a new data type to store a decision tree: a dictionary (dict) stores node information conveniently, and for persistent storage the tree dictionary can be written to a file with pickle or JSON; this package uses JSON. The trees module of the package defines a DecisionTree object that supports both the ID3 and C4.5 algorithms (C4.5's handling of continuous features and post-pruning is not implemented). The attributes of the object are shown in its __init__ function:

import inspect   # required by the stack inspection in __init__ below

class DecisionTree(object):
    def __init__(self, dsDict_ID3=None, dsDict_C45=None, features=None, **args):
        '''Currently supports ID3 and C4.5; the default type is C4.5. CART TBD.'''
        obj_list = inspect.stack()[1][-2]
        self.__name__ = obj_list[0].split('=')[0].strip()
        self.dsDict_ID3 = dsDict_ID3
        self.dsDict_C45 = dsDict_C45
        # self.classLabel = classLabel
        self.features = features
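Since the tree is just a nested dict, the JSON persistence mentioned above can be sketched like this (the tree literal and file name are illustrative; the package's exact dict layout may differ):

import json

# A toy tree in a nested-dict format (illustrative).
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

with open('tree.json', 'w') as f:
    json.dump(tree, f)              # persist the decision tree dictionary

with open('tree.json') as f:
    restored = json.load(f)         # note: JSON turns the integer keys into strings
print(restored)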

The input to the train function of the DecisionTree object is the sample list and the list of sample feature names, where the last item of each sample is its class, in the following format:

dataSet = [[1, 1, 'yes'],
           [1, 1, 'yes'],
           [1, 0, 'no'],
           [0, 1, 'no'],
           [0, 1, 'no']]
labels = ['no surfacing', 'flippers']

In addition, with matplotlib's annotation functionality, the tree can be drawn as a diagram for easy presentation. The DecisionTree object provides a treePlot method, implemented through functions in the treePlotter module.

The testDS module uses a decision tree to determine whether a patient can wear contact lenses; the four features to consider are 'age', 'prescript', 'astigmatic', and 'tearRate'. The decision tree generated with the DecisionTree object is shown in the figure.

    Decision Tree Algorithm Learning package:


(v) Decision tree applications

In data-driven operations, decision tree techniques are mainly used as a typical supporting technology for classification and prediction, with broad application prospects in user segmentation, behavior prediction, and rule extraction.

Reference

C4.5: processing continuous attributes

C4.5 decision tree

Discussing Bayesian classification, EM, and HMM starting from decision tree learning

Decision Tree 2: ID3, C4.5, CART
