Machine Learning (4): Decision Trees

Source: Internet
Author: User
Tags: ID3

1. Decision Tree Introduction

1.1 Decision Tree Overview

A decision tree is a tree-based classification algorithm that extracts a tree-shaped classification model from an unordered set of training samples; the tree consists of decision (judgment) nodes and terminating (leaf) nodes. It is a typical classification algorithm: the data is first processed, an inductive algorithm is used to generate readable rules and a decision tree, and the resulting tree is then used to classify new data. In essence, a decision tree classifies data through a series of rules. Decision trees are a form of supervised learning, so class labels and a data set must be given in advance.

The principle of a decision tree: each decision tree describes a tree structure whose branches classify objects according to their attribute values. A decision tree can be built by recursively partitioning the source data set and testing the data; the tree can also be pruned during this recursive process. The recursion ends when no further split is possible or when all samples on a branch belong to a single class.

Commonly used decision tree algorithms include ID3 and C4.5, which are essentially similar. The ID3 algorithm aims to reduce the depth of the tree but ignores the study of the number of leaves. C4.5 improves on ID3, with significant improvements in missing-value handling, pruning techniques, and the derivation rules for predictor variables, and it is suitable for classification and regression problems.

Figure: a decision tree in the form of a flowchart

1.2 ID3 algorithm

The basic ID3 algorithm learns by constructing a decision tree from top to bottom.

The construction process begins with the question: "Which attribute should be tested at the root node of the tree?"

To answer this question, a statistical test is used to determine how well each instance attribute, taken alone, classifies the training samples.

(1) The attribute with the best classification ability is chosen as the test at the root node of the tree.

(2) A branch is then generated for each possible value of the root attribute, and the training samples are sorted under the appropriate branches (that is, under the branch matching each sample's value for that attribute).

(3) The whole process is then repeated: the training samples associated with each branch node are used to select the best attribute to test at that node.

This amounts to a greedy search for an acceptable decision tree: the algorithm never backtracks to reconsider an earlier choice. A minimal code sketch of this recursion is given below.
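The following is a minimal sketch of that top-down recursion; it is not from the original article. It assumes each sample is a list of attribute values with the class label in the last position, and it takes the attribute-selection rule as a function argument (choose_best_attr, a hypothetical name); in ID3 that rule is the information-gain criterion described in the next section.

def create_tree(data, attr_names, choose_best_attr):
    labels = [row[-1] for row in data]
    # All samples on this branch share one class: return that class as a leaf.
    if labels.count(labels[0]) == len(labels):
        return labels[0]
    # No attributes left to test: return the majority class as a leaf.
    if len(data[0]) == 1:
        return max(set(labels), key=labels.count)
    best = choose_best_attr(data)          # index of the attribute to test at this node
    node = {attr_names[best]: {}}
    for value in set(row[best] for row in data):
        # Samples taking this value, with the tested attribute column removed.
        subset = [row[:best] + row[best + 1:] for row in data if row[best] == value]
        sub_names = attr_names[:best] + attr_names[best + 1:]
        node[attr_names[best]][value] = create_tree(subset, sub_names, choose_best_attr)
    return node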

The core problem of the ID3 algorithm is selecting which attribute to test at each node of the tree.

We want to select the attribute that is most useful for classifying instances. What, then, is a good quantitative measure of an attribute's worth? Here a statistical property called information gain is defined, which measures how well a given attribute separates the training samples. The ID3 algorithm computes the information gain of every attribute, treats the attribute with the highest information gain as the best one, uses it as the splitting criterion, and repeats this process until a decision tree is built that classifies the training samples perfectly.

1.3 Information gain

The principle behind splitting a data set is to make disordered data more ordered, and one way to measure this is with information theory.

The change in information before and after splitting the data set is called the information gain. By computing the information gain obtained from splitting the data set on each feature, the feature that yields the highest information gain is the best one to split on.

Entropy: a measure of the uncertainty of a random variable. Definition: suppose the random variable X can take the values x1, x2, ..., xn, and each possible value xi has probability P(X = xi) = pi (i = 1, ..., n). The entropy of X is then: \(H(X) = -\sum_{i=1}^{n} p(x_{i}) \log_{2} p(x_{i})\)

Generalizing to a sample set D, the random variable X is the class of a sample. Assuming the samples fall into K classes, the probability of class k is \(p_{k} = \frac{\left| C_{k} \right|}{\left| D \right|}\), where |Ck| is the number of samples in class k and |D| is the total number of samples.

For the sample set D, the entropy (empirical entropy) is:

\(H(D) = -\sum_{k=1}^{K} \frac{\left| C_{k} \right|}{\left| D \right|} \log_{2} \frac{\left| C_{k} \right|}{\left| D \right|}\)
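As a quick worked example (the numbers are hypothetical, not from the original article): if D contains 10 samples, 6 of class A and 4 of class B, then \(H(D) = -\frac{6}{10}\log_{2}\frac{6}{10} - \frac{4}{10}\log_{2}\frac{4}{10} \approx 0.971\) bits; a set containing only one class has entropy 0 (perfectly ordered), while a 5/5 split has entropy 1 bit (maximally uncertain for two classes).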

Understanding information gain: for a data set D to be split, its entropy before the split, Entropy(before), is fixed, while the entropy after the split, Entropy(after), depends on how we split. The smaller Entropy(after) is, the lower the uncertainty (i.e. the higher the purity) of the subsets produced by splitting on that feature, and therefore the larger the difference Entropy(before) - Entropy(after); in other words, splitting on feature A yields the information gain \(Gain(D, A) = H(D) - H(D \mid A)\), where H(D|A) is the weighted entropy of the subsets after the split. When building the optimal decision tree, we always want the sets to reach higher purity. This is analogous to gradient descent in optimization: each step moves along the negative gradient because that is the direction in which the loss function decreases fastest. Similarly, when building a decision tree we always want the purity of the subsets to rise as quickly as possible, so at each step we choose the feature that maximizes the information gain to split the current data set D.

# Compute the Shannon entropy of a given data set
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    # Build a dictionary counting the occurrences of every class label
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
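Building on calcShannonEnt above, the following sketch (not part of the original article; the names splitDataSet and chooseBestFeatureToSplit are illustrative) shows one way the information gain of each feature could be computed and the best feature chosen, again assuming each sample is a list of feature values with the class label in the last position.

# Return the samples whose feature at position `axis` equals `value`,
# with that feature column removed
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            retDataSet.append(featVec[:axis] + featVec[axis + 1:])
    return retDataSet

# Pick the feature whose split yields the highest information gain
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1           # last column is the class label
    baseEntropy = calcShannonEnt(dataSet)       # Entropy(before)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        uniqueVals = set(example[i] for example in dataSet)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)   # weighted Entropy(after)
        infoGain = baseEntropy - newEntropy     # information gain for feature i
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

For instance, on the toy data set [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']], calcShannonEnt returns about 0.971 and chooseBestFeatureToSplit returns 0, meaning the first feature gives the largest information gain.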

  
