Decision Trees in Machine Learning

Source: Internet
Author: User
Tags: id3


What is a decision tree

Decision trees are simple but widely used classifiers. By building a decision tree from training data, unknown data can be classified efficiently. Decision trees have two advantages: 1) the model is readable and descriptive, which helps manual analysis; 2) they are efficient: the tree only needs to be built once and can be reused, and the number of comparisons per prediction never exceeds the depth of the tree.

A decision tree is a tree structure (binary or non-binary). Each non-leaf node represents a test on a feature attribute, each branch represents a range of outcomes of that test, and each leaf node holds a class label.

To classify a sample, we follow the tests from the root node down to a leaf node, which gives the decision result.
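As a concrete illustration (mine, not from the original article), here is a minimal Python sketch of a tree node structure and the top-down prediction walk just described; the field names and the tiny example tree are hypothetical.

```python
class Node:
    """A decision tree node: either an internal test node or a leaf."""
    def __init__(self, feature=None, branches=None, label=None):
        self.feature = feature    # feature attribute tested at this node (internal nodes)
        self.branches = branches  # dict: test outcome -> child Node (internal nodes)
        self.label = label        # class label (leaf nodes)

def predict(node, sample):
    """Follow the tests from the root down to a leaf and return its class label."""
    while node.label is None:          # stop once a leaf is reached
        outcome = sample[node.feature] # evaluate this node's test
        node = node.branches[outcome]  # follow the matching branch
    return node.label

# A tiny hypothetical tree that classifies by the 'outlook' attribute
yes_leaf, no_leaf = Node(label="yes"), Node(label="no")
root = Node(feature="outlook",
            branches={"sunny": no_leaf, "overcast": yes_leaf, "rain": yes_leaf})
print(predict(root, {"outlook": "overcast"}))  # -> yes
```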

Construction

The key step in constructing a decision tree is splitting on attributes, which falls into three cases (a small sketch of all three follows the list):

1) The attribute is discrete and a binary tree is not required. Each distinct value of the attribute becomes its own branch.

2) The attribute is discrete and a binary tree is required. A subset of the attribute's values is used as the test, and the two branches are "belongs to this subset" and "does not belong to this subset".

3) The attribute is continuous. A split point split_point is chosen, and the two branches are > split_point and <= split_point.
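The following Python sketch (my illustration, not code from the article) shows how a list of samples, represented as dictionaries, could be partitioned in each of the three cases; the function and attribute names are my own.

```python
from collections import defaultdict

def split_discrete_multiway(samples, attr):
    """Case 1: one branch per distinct value of a discrete attribute."""
    branches = defaultdict(list)
    for s in samples:
        branches[s[attr]].append(s)
    return dict(branches)

def split_discrete_subset(samples, attr, subset):
    """Case 2: binary split on 'value in subset' vs. 'value not in subset'."""
    in_subset = [s for s in samples if s[attr] in subset]
    out_subset = [s for s in samples if s[attr] not in subset]
    return in_subset, out_subset

def split_continuous(samples, attr, split_point):
    """Case 3: binary split on a continuous attribute at split_point."""
    left = [s for s in samples if s[attr] <= split_point]
    right = [s for s in samples if s[attr] > split_point]
    return left, right
```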

The key issue in constructing a decision tree is the attribute selection measure. It is a splitting criterion, a heuristic for partitioning the class-labeled training set D "best" into individual classes, and it determines both the topology of the tree and the choice of split_point.

Splitting algorithms

ID3 Algorithm

From information theory we know that the smaller the expected information after a split, the greater the information gain and the higher the purity. The core idea of the ID3 algorithm is therefore to use information gain as the attribute selection measure and, at each split, choose the attribute with the largest information gain. In fact, more than one decision tree can correctly classify the training set; Quinlan's ID3 algorithm aims to obtain a decision tree with as few nodes as possible.

Let's define a few concepts to use here.

Let D be the set of training tuples partitioned by class label (Y). The entropy of D is then expressed as:

$\mathrm{info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$

where $p_i$ is the probability that class i appears in the whole training set, which can be estimated as the number of tuples belonging to that class divided by the total number of tuples. The practical meaning of entropy is the average amount of information needed to identify the class label of a tuple in D.

Now suppose the training set D is partitioned on attribute A (X) into subsets $D_1, \dots, D_v$, one per value of A. The expected information required to classify D after splitting on A is:

$\mathrm{info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{info}(D_j)$

The information gain is the difference between the two values:

$\mathrm{gain}(A) = \mathrm{info}(D) - \mathrm{info}_A(D)$

Each time a split is needed, the ID3 algorithm computes the information gain of every candidate attribute and chooses the attribute with the largest information gain to split on.
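As a hedged sketch of the formulas above (function and variable names are my own choices), the following Python computes info(D), gain(A) for a discrete attribute, and ID3's attribute choice:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """info(D): -sum(p_i * log2 p_i) over the class distribution of the label list."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(samples, labels, attr):
    """gain(A) = info(D) - info_A(D) for a discrete attribute attr."""
    groups = defaultdict(list)
    for s, y in zip(samples, labels):
        groups[s[attr]].append(y)
    expected = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - expected

def id3_choose_attribute(samples, labels, candidate_attrs):
    """ID3's split rule: pick the attribute with the largest information gain."""
    return max(candidate_attrs, key=lambda a: info_gain(samples, labels, a))
```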

For a continuous feature attribute, ID3 can proceed as follows:

First sort the elements of D by the feature attribute; the midpoint of every pair of adjacent values is then a potential split point. Starting from the first potential split point, split D at each point and compute the expected information of the two resulting sets. The point with the minimum expected information is the best split point for this attribute, and its expected information is used as the attribute's expected information. (The smaller the expected information, the greater the gain.)
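A minimal sketch of this split-point search (my illustration; the entropy helper is repeated so the block stands alone):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_split_point(values, labels):
    """Sort by the continuous attribute, try the midpoint of every adjacent pair,
    and return the point whose two halves have the minimum expected information."""
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for i in range(len(pairs) - 1):
        point = (pairs[i][0] + pairs[i + 1][0]) / 2   # midpoint of adjacent values
        left = [y for v, y in pairs if v <= point]
        right = [y for v, y in pairs if v > point]
        if not left or not right:                     # skip degenerate splits (tied values)
            continue
        expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if expected < best_info:                      # smaller expected info = larger gain
            best_point, best_info = point, expected
    return best_point, best_info
```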

Stop condition

The following can happen during decision tree construction: all attributes have been used as split attributes, but some subsets are still not pure, i.e. the elements in the subset do not all belong to the same class. Since no more information is available, a "majority vote" is generally taken on such subsets: the most frequent class in the subset becomes the node's class, and the node is made a leaf node. In practice a threshold is also often set so that splitting stops once the number of records in the current subset falls below it.
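A small sketch of the majority vote and a possible stop test; the min_samples threshold is an assumed parameter, not a value from the article:

```python
from collections import Counter

def majority_class(labels):
    """Majority vote: the most frequent class label in the subset becomes the leaf's class."""
    return Counter(labels).most_common(1)[0][0]

def should_stop(labels, remaining_attrs, min_samples=5):
    """Stop splitting when the subset is pure, no attributes are left,
    or the subset has fewer records than the threshold."""
    return len(set(labels)) == 1 or not remaining_attrs or len(labels) < min_samples
```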

C4.5 Algorithm

One problem with the ID3 algorithm is that it favors multi-valued attributes. For example, if there is a unique identifier attribute ID, ID3 will choose it as the split attribute (the $\mathrm{info}_A(D)$ term becomes 0, so the gain equals $\mathrm{info}(D)$), which makes the split perfectly pure but almost useless for classification (it can be regarded as overfitting). ID3's successor, C4.5, uses the gain ratio, an extension of information gain, to try to overcome this bias.

The C4.5 algorithm first defines the "split information", which can be expressed as:

$\mathrm{split\_info}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}$

(As you can see, an attribute such as ID gets a very large split_info and is therefore penalized.)

Each symbol has the same meaning as in the ID3 algorithm. The gain ratio is then defined as:

$\mathrm{gain\_ratio}(A) = \frac{\mathrm{gain}(A)}{\mathrm{split\_info}_A(D)}$

C4.5 chooses the attribute with the maximum gain ratio as the splitting attribute; otherwise it is applied much like ID3.
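A sketch of split_info and the gain ratio, repeating the entropy and information-gain helpers from the earlier block so it stands alone (again, the names are mine):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(samples, labels, attr):
    groups = defaultdict(list)
    for s, y in zip(samples, labels):
        groups[s[attr]].append(y)
    expected = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - expected

def split_info(samples, attr):
    """split_info_A(D): entropy of the partition sizes induced by attribute attr."""
    sizes = Counter(s[attr] for s in samples)
    total = len(samples)
    return -sum((n / total) * math.log2(n / total) for n in sizes.values())

def gain_ratio(samples, labels, attr):
    """gain_ratio(A) = gain(A) / split_info_A(D); a unique-ID attribute gets a large
    split_info and is therefore penalized."""
    si = split_info(samples, attr)
    return info_gain(samples, labels, attr) / si if si > 0 else 0.0
```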

CART Algorithm

Classification And Regression Trees (CART) is a binary recursive partitioning technique: the current sample set is always divided into two sub-samples, so every non-leaf node it generates has exactly two branches.

The decision tree generated by the CART algorithm is therefore a simple binary tree. Because the tree is binary, each decision step can only answer "yes" or "no"; even if a feature has multiple values, the data is still divided into two parts (for a discrete feature the split is "equals a value" versus "does not equal that value"; for a continuous feature it is "greater than a value" versus "less than or equal to that value"). The CART algorithm consists of two main steps:

(1) Recursive partitioning of the samples (the construction step)

Let X1, X2, ..., Xn denote the n attributes of a single sample and Y the class it belongs to. The CART algorithm recursively divides the n-dimensional space into non-overlapping rectangles. The partitioning steps are as follows:

(a) Choose an independent variable Xi and a value vi of Xi. vi divides the n-dimensional space into two parts: one part contains all points with Xi <= vi, the other all points with Xi > vi. For a discrete variable there are only two possible outcomes: the attribute equals the value or it does not.

(b) Recurse: for each of the two parts obtained in step (a), select an attribute again and continue dividing, until the whole n-dimensional space has been partitioned.

At partitioning time one question arises: what is the splitting criterion? For a continuous attribute, the candidate split points are the midpoints of adjacent attribute values: if a set of samples has m distinct consecutive values for an attribute, there are m-1 candidate split points, each the mean of two adjacent values. The candidate splits of each attribute are ranked by how much impurity they remove, where the impurity reduction is defined as the impurity before the split minus the weighted sum of the impurities of the nodes after the split (each weighted by its share of the samples). The most commonly used impurity measure is the Gini index. Assuming the samples have C classes in total, the Gini impurity of a node can be defined as:

$\mathrm{Gini} = 1 - \sum_{i=1}^{C} p_i^2$

where $p_i$ is the probability of belonging to class i. When Gini = 0, all samples in the node belong to the same class; when all classes occur in the node with equal probability, Gini reaches its maximum (the node is most impure). The Gini value of a split is $\sum_v P(v)\,\mathrm{Gini}(v)$, the size-weighted sum over the resulting branches, and the split with the smallest value is chosen.
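A minimal Python sketch of the Gini index and the weighted-Gini search over midpoints of one continuous attribute (my illustration; CART implementations differ in detail):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2). Equal to 0 when the node is pure."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def weighted_gini(left_labels, right_labels):
    """Gini of a binary split: the size-weighted sum of the two children's Gini values."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) * gini(left_labels) + len(right_labels) * gini(right_labels)) / n

def best_cart_split(values, labels):
    """Scan the midpoints of adjacent values of one continuous attribute and
    return the split point with the smallest weighted Gini."""
    pairs = sorted(zip(values, labels))
    best_point, best_score = None, float("inf")
    for i in range(len(pairs) - 1):
        point = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for v, y in pairs if v <= point]
        right = [y for v, y in pairs if v > point]
        if not left or not right:      # skip degenerate splits (tied values)
            continue
        score = weighted_gini(left, right)
        if score < best_score:
            best_point, best_score = point, score
    return best_point, best_score
```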

(2) Pruning with validation data (using the cost-complexity pruning method)

Accuracy estimation

Once the decision tree T is built, we need to estimate its prediction accuracy. Intuitively, given n test records of which X are predicted correctly, we can estimate acc = X/n as the accuracy of T. However, this is not very rigorous: because the accuracy is estimated from a sample, there is a good chance of bias. A more rigorous approach is to estimate an accuracy interval, using the statistical notion of a confidence interval.

Let the accuracy p of T be a fixed but unknown value, and let X follow the binomial distribution $X \sim B(n, p)$, i.e. X is the number of successes in n trials, each with success probability p. Then $E(X) = np$ and $\mathrm{Var}(X) = np(1-p)$, and for large n the binomial can be approximated by the normal distribution $X \sim N(np,\, np(1-p))$. For acc = X/n it follows that:

$E(\mathrm{acc}) = E(X/n) = E(X)/n = p$

$\mathrm{Var}(\mathrm{acc}) = \mathrm{Var}(X/n) = \mathrm{Var}(X)/n^2 = p(1-p)/n$

$\mathrm{acc} \sim N(p,\, p(1-p)/n)$

so a confidence interval for p can be derived from the normal distribution.
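A small sketch of the resulting interval, using the normal approximation above; the z value of 1.96 for a 95% confidence level is my assumed default, not a choice made in the article:

```python
import math

def accuracy_confidence_interval(correct, n, z=1.96):
    """Normal-approximation confidence interval for accuracy.
    z = 1.96 corresponds to a 95% confidence level (assumed default)."""
    acc = correct / n
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    return acc - half_width, acc + half_width

# Example: 172 of 200 test records predicted correctly
print(accuracy_confidence_interval(172, 200))   # roughly (0.81, 0.91)
```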

Optimization scenarios

Optimization Scenario 1: Pruning

Overfitting in decision trees is often caused by the tree being too "lush", i.e. having too many nodes, so the foliage needs to be pruned (prune tree). The pruning strategy has a large effect on the accuracy of the tree and is the main way to deal with overfitting caused by noise and outliers in the data. Consider the extreme case: if every leaf node is made to contain only a single data point, all training data can be classified exactly, but the prediction error is likely to be high, because all the noise in the training data has been "accurately fitted" as well, amplifying its effect. There are two main pruning strategies:

Pre-pruning stops splitting early while the decision tree is being built. The conditions for splitting a node must then be set very strictly, which yields a small but usually sub-optimal decision tree. Practice shows that this strategy does not give good results.

Post-pruning starts only after the decision tree has been fully built. Two methods are used: 1) subtree replacement, where a whole subtree is replaced by a single leaf node labeled with the majority class of that subtree; 2) subtree raising, where a subtree is replaced entirely by one of its own subtrees. The drawback of post-pruning is computational efficiency: some nodes are built and then pruned away, which wastes some computation.

There are three common post-pruning algorithms: REP (reduced-error pruning), PEP (pessimistic error pruning), and CCP (cost-complexity pruning).

Optimization Scenario 2: k-fold Cross Validation

First the full decision tree T is built and its number of leaf nodes is recorded as N; let i range over [1, N]. For each i, the tree is built with k-fold cross-validation and pruned back to i nodes, the error rate is computed, and the average error rate over the folds is recorded. The i with the minimum average error rate is taken as the size of the final decision tree, and the original tree is pruned to that size to obtain the optimal decision tree.
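One way to realize this idea in practice is scikit-learn's cost-complexity pruning combined with cross-validation. This is my choice of tooling, not the method the article prescribes, and it selects a pruning strength (ccp_alpha) rather than a literal leaf count:

```python
# Grow a full tree, enumerate its cost-complexity pruning levels, and pick the
# level with the best k-fold cross-validation score.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate pruning strengths from the full tree's cost-complexity path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas

# 5-fold cross-validation score for each candidate pruning level
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
          for a in ccp_alphas]

best_alpha = ccp_alphas[int(np.argmax(scores))]
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(best_alpha, pruned_tree.get_n_leaves())
```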

Optimization Scenario 3: Random Forest (I'll write about it next time)

A random forest uses the training data to build many randomized decision trees, forming a forest. The forest then predicts unknown data by letting every tree vote and selecting the most popular class. Practice shows that this reduces the error rate a further step. The principle behind it can be summed up by the proverb "three cobblers with their wits combined equal Zhuge Liang": the probability that a single tree predicts correctly may not be high, but the probability that the collective prediction is correct is.
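A short scikit-learn comparison (my choice of library and dataset, purely illustrative) of a single tree versus a voting forest:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 voting trees

print("single tree:", cross_val_score(single_tree, X, y, cv=5).mean())
print("forest:     ", cross_val_score(forest, X, y, cv=5).mean())
```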

Reference documents:

Entropy: https://en.wikipedia.org/wiki/Entropy

Information gain ratio: https://en.wikipedia.org/wiki/Information_gain_ratio

Confidence interval: https://en.wikipedia.org/wiki/Confidence_interval

