Decision Tree algorithm

Source: Internet
Author: User
Tags: id3

Use the ID3 algorithm to determine whether a particular day is suitable for playing tennis.

(1) Entropy of the class attribute. Before partitioning, the training data set contains 14 instances: 9 belong to the Yes class (suitable for playing tennis) and 5 belong to the No class (not suitable). The entropy of the class attribute before partitioning is therefore:

Info(D) = -(9/14) * log2(9/14) - (5/14) * log2(5/14) ≈ 0.940

(2) Entropy of the non-class attributes. The Outlook attribute is examined first; as computed in detail below, Info(Outlook) = 0.694.

(3) The information gain for the Outlook attribute is therefore: Gain(Outlook) = Info(D) - Info(Outlook) = 0.940 - 0.694 = 0.246.

(4) The information gain of the other three non-class attributes is calculated in the same way, and the attribute with the largest gain is taken as the split node. Here the largest is Outlook, which becomes the root of the tree.

(5) For the sub-training set on the sunny branch there are two classes: 3 instances belong to the No class and 2 instances belong to the Yes class. The entropy of the class attribute is recomputed on this branch.

(6) The entropy of the three remaining non-class attributes on this branch is computed, the information gain of each attribute is obtained, and Humidity, the attribute with the maximum information gain, is selected.

(7) The remaining branches are handled in the same way.

(8) The data subset corresponding to Cool is entirely No, so No is written directly and no further split is needed. In the data subset corresponding to Mild, Humidity and Windy have the same information gain; because the Yes tuples outnumber the No tuples in this group, Yes is written directly, yielding the complete decision tree.
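To make steps (1) through (8) concrete, here is a minimal, generic sketch of the ID3 loop in Python. It is not the original author's code; the row-dictionary representation and the names `entropy`, `info_gain` and `id3` are illustrative assumptions.

```python
# A minimal ID3 sketch (illustrative, not the author's implementation).
# Rows are dicts, e.g. {"outlook": "sunny", ..., "play": "no"}.
from collections import Counter
from math import log2

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(rows, attr, target):
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attributes, target):
    classes = {r[target] for r in rows}
    if len(classes) == 1:                      # pure subset -> leaf
        return classes.pop()
    if not attributes:                         # no attributes left -> majority class
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for value in {r[best] for r in rows}:      # one branch per attribute value
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return tree
```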

However, using information gain has a drawback: it is biased toward attributes with a large number of distinct values. That is, within a training set, the more distinct values an attribute takes, the more likely it is to be chosen as the splitting attribute. For example, suppose a training set has 10 tuples and an attribute A takes the values 1 through 10, one per tuple. Splitting on A produces 10 partitions, each containing a single tuple, so Info(D_j) = 0 for every partition, expression (2) evaluates to 0, and the information gain (3) of this split is the largest possible. Yet the split is obviously meaningless.

It is for this reason that C4.5, the successor of ID3, uses the concept of the information gain rate (gain ratio). The information gain rate normalizes the information gain by the split information value. The split information is defined similarly to Info(D):

SplitInfo_A(D) = - Σ_{j=1..v} (|D_j| / |D|) * log2(|D_j| / |D|)    (4)

This value represents the information generated by splitting the training data set D into v partitions, corresponding to the v outcomes of a test on attribute A. The information gain rate is defined as:

GainRatio(A) = Gain(A) / SplitInfo_A(D)    (5)

Select the attribute with the maximum gain rate as the split attribute.
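The 10-value example above can be checked numerically. The sketch below assumes, purely for illustration, a 5/5 class split over the 10 tuples; it shows that the ID-like attribute gets the maximum possible information gain, while its gain ratio is much less flattering.

```python
# Why information gain favors many-valued attributes, and how the gain ratio corrects it.
# Assumption for illustration: 10 tuples, classes split 5/5, attribute A takes 10 distinct values.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

info_d = entropy([5, 5])                                  # 1.0
info_a = sum(1 / 10 * entropy([1]) for _ in range(10))    # every partition is pure -> 0.0
gain = info_d - info_a                                    # 1.0, the maximum possible
split_info = entropy([1] * 10)                            # log2(10) ~ 3.32
print(gain, round(gain / split_info, 3))                  # gain ratio ~ 0.301
```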

(3) Gini index

The Gini index is used in CART. It measures the impurity of a data partition or training tuple set D and is defined as:

Gini(D) = 1 - Σ_{i=1..m} p_i^2    (6)

Here p_i is the probability that a tuple in D belongs to class C_i. (This form applies to discrete-valued attributes; the handling of continuous values is described below.) Next, node selection with the information gain rate is worked through in detail:
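As a quick illustration of formula (6), the small sketch below (the helper name is mine, not from the source) computes the Gini index of the same 14-sample set, which has 9 Yes and 5 No tuples.

```python
# Gini index of a class distribution, per formula (6): Gini(D) = 1 - sum(p_i^2)
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# The training set before splitting: 9 "yes" and 5 "no" tuples.
print(round(gini([9, 5]), 3))  # 0.459
```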

The training set above has 4 attributes, i.e. the attribute set A = {outlook, temperature, humidity, windy}, and 2 class labels, i.e. the class label set C = {yes, no}, meaning suitable and not suitable for outdoor sports respectively. This is therefore a binary classification problem.
Data set D contains 14 training samples, of which 9 belong to category "Yes" and 5 to category "No". The information entropy, i.e. the value of formula (1), is then:

Info(D) = -9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.940

The information entropy is then computed for each attribute in the attribute set, as follows:

Info(OUTLOOK) = 5/14 * [-2/5 * log2(2/5) - 3/5 * log2(3/5)] + 4/14 * [-4/4 * log2(4/4) - 0/4 * log2(0/4)] + 5/14 * [-3/5 * log2(3/5) - 2/5 * log2(2/5)] = 0.694
Info(TEMPERATURE) = 4/14 * [-2/4 * log2(2/4) - 2/4 * log2(2/4)] + 6/14 * [-4/6 * log2(4/6) - 2/6 * log2(2/6)] + 4/14 * [-3/4 * log2(3/4) - 1/4 * log2(1/4)] = 0.911
Info(HUMIDITY) = 7/14 * [-3/7 * log2(3/7) - 4/7 * log2(4/7)] + 7/14 * [-6/7 * log2(6/7) - 1/7 * log2(1/7)] = 0.789
Info(WINDY) = 6/14 * [-3/6 * log2(3/6) - 3/6 * log2(3/6)] + 8/14 * [-6/8 * log2(6/8) - 2/8 * log2(2/8)] = 0.892

Based on the above, we can calculate the information gain of each attribute for the choice of the root node, as follows:

Gain(OUTLOOK) = Info(D) - Info(OUTLOOK) = 0.940 - 0.694 = 0.246
Gain(TEMPERATURE) = Info(D) - Info(TEMPERATURE) = 0.940 - 0.911 = 0.029
Gain(HUMIDITY) = Info(D) - Info(HUMIDITY) = 0.940 - 0.789 = 0.151
Gain(WINDY) = Info(D) - Info(WINDY) = 0.940 - 0.892 = 0.048

Next, we calculate the split information metric SplitInfo, denoted here H(V):

    • Outlook attribute

The attribute Outlook has 3 values: sunny has 5 samples, rainy has 5 samples, and overcast has 4 samples. Then:

H(OUTLOOK) = - 5/14 * log2(5/14) - 5/14 * log2(5/14) - 4/14 * log2(4/14) = 1.577406282852345

    • Temperature attribute

The attribute Temperature has 3 values: hot has 4 samples, mild has 6 samples, and cool has 4 samples. Then:

H(TEMPERATURE) = - 4/14 * log2(4/14) - 6/14 * log2(6/14) - 4/14 * log2(4/14) = 1.5566567074628228

    • Humidity attribute

The attribute Humidity has 2 values: normal has 7 samples and high has 7 samples. Then:

H(HUMIDITY) = - 7/14 * log2(7/14) - 7/14 * log2(7/14) = 1.0

    • Windy attribute

The attribute Windy has 2 values: true has 6 samples and false has 8 samples. Then:

H(WINDY) = - 6/14 * log2(6/14) - 8/14 * log2(8/14) = 0.9852281360342516

Based on the results above, we can calculate the information gain rate as follows:

IGR(OUTLOOK) = Gain(OUTLOOK) / H(OUTLOOK) = 0.246 / 1.577406282852345 = 0.15595221261270145
IGR(TEMPERATURE) = Gain(TEMPERATURE) / H(TEMPERATURE) = 0.029 / 1.5566567074628228 = 0.018629669509642094
IGR(HUMIDITY) = Gain(HUMIDITY) / H(HUMIDITY) = 0.151 / 1.0 = 0.151
IGR(WINDY) = Gain(WINDY) / H(WINDY) = 0.048 / 0.9852281360342516 = 0.048719680492692784

Based on the information gain rates obtained, we choose an attribute from the attribute set as the decision tree node and split on it. From the IGR values above, Outlook has the largest information gain rate, so we select it as the first (root) node.
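The whole calculation above can be reproduced with a short script. This is a sketch based on the per-value class counts used in the calculations (e.g. sunny = 2 yes / 3 no); the helper names are mine, not from the source.

```python
# Reproduces Info(D), Info(attribute), Gain, SplitInfo (H) and IGR for the 14-sample set.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# (yes, no) counts per attribute value, as used in the calculations above.
data = {
    "OUTLOOK":     {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)},
    "TEMPERATURE": {"hot": (2, 2), "mild": (4, 2), "cool": (3, 1)},
    "HUMIDITY":    {"high": (3, 4), "normal": (6, 1)},
    "WINDY":       {"true": (3, 3), "false": (6, 2)},
}

total = 14
info_d = entropy([9, 5])                                        # 0.940
for attr, values in data.items():
    info_a = sum(sum(c) / total * entropy(c) for c in values.values())
    gain = info_d - info_a
    split_info = entropy([sum(c) for c in values.values()])     # H(attr)
    igr = gain / split_info
    print(f"{attr}: Info={info_a:.3f} Gain={gain:.3f} H={split_info:.3f} IGR={igr:.3f}")
# Outlook has the largest IGR (~0.156), so it is chosen as the root node.
```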

4. Algorithmic features

4.1 Pruning of decision trees

After a decision tree is created, many of its branches reflect anomalies caused by noise and outliers in the training data. Pruning methods are used to deal with this overfitting problem; they usually use statistical measures to cut off the least reliable branches.

Pruning is generally divided into two approaches: pre-pruning and post-pruning.

Pre-pruning prunes the tree in advance by stopping its construction early (for example, deciding that a node should not be split further, so that a subset of the training tuples is not divided). Once growth stops, the node becomes a leaf, which is labeled with the most frequent class among the tuples in its subset. There are many pre-pruning criteria, for example: (1) stop growing the tree once it reaches a certain height; (2) stop growing when the instances reaching a node have identical feature vectors, even if they do not all belong to the same class; (3) stop growing when the number of instances reaching a node falls below a threshold (the drawback is that special cases with little data cannot be handled); (4) compute the gain in performance of each candidate expansion and stop growing if it is below a threshold. A general disadvantage of pre-pruning is the horizon effect: under a fixed criterion, the current expansion may not look worthwhile even though a further expansion would, so tree growth is stopped prematurely.
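A minimal pre-pruning check along the lines of criteria (1), (3) and (4) might look like the following sketch; the threshold names and values are illustrative assumptions, not prescribed by the text.

```python
# Illustrative pre-pruning test: stop growing when any threshold is hit.
def should_stop(depth, n_samples, best_gain,
                max_depth=5, min_samples=5, min_gain=1e-3):
    return (depth >= max_depth          # criterion (1): tree height limit
            or n_samples < min_samples  # criterion (3): too few instances at the node
            or best_gain < min_gain)    # criterion (4): expansion gain below threshold

print(should_stop(depth=2, n_samples=3, best_gain=0.2))  # True: too few samples
```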

The other, more common method is post-pruning, in which a fully grown tree is pruned by cutting off subtrees: a node's branches are removed and the node is replaced by a leaf, usually labeled with the most frequent class in the subtree. There are two main post-pruning methods:

The first and simplest method is pruning based on misclassification error (reduced-error pruning). The idea is straightforward: the complete decision tree may be overfitted, so we use a separate test data set to correct it. For each subtree rooted at a non-leaf node of the full tree, we try to replace it with a leaf node whose class is the majority class of the training samples covered by the subtree, producing a simplified decision tree. We then compare the performance of the two trees on the test data set. If the simplified tree makes fewer errors on the test set, and the subtree does not contain another subtree with the same property (that is, a subtree whose replacement by a leaf also lowers the error rate on the test set), then the subtree can be replaced by a leaf node. The algorithm traverses all subtrees bottom-up until no subtree can be replaced in a way that improves performance on the test set, at which point it terminates.

The first method is straightforward but requires an additional test data set. Can we avoid this extra data set? Pessimistic pruning was proposed to solve this problem. Pessimistic pruning recursively estimates the misclassification rate of the samples covered by each internal node. After pruning, an internal node becomes a leaf node whose class is determined by the best leaf of the original internal node. The error rate of the node before and after pruning is then compared to decide whether to prune. The procedure is consistent with the first approach; the difference lies in how the error rate of an internal node is estimated before pruning.

If a subtree (with multiple leaf nodes) is replaced by a single leaf node, the misclassification rate on the training set certainly rises, but not necessarily on new data. So we compute the subtree's misclassification on the training data and add an empirical penalty factor. For a leaf node covering N_i samples of which E are misclassified, the error rate of the leaf is (E + 0.5) / N_i, where 0.5 is a penalty factor (a continuity correction). For a subtree with L leaf nodes, the misclassification rate of the subtree is estimated analogously by summing over its leaves, adding 0.5 per leaf. A subtree therefore has many leaves, but because of the added penalty factor its estimated misclassification rate is not necessarily lower. After pruning, the internal node becomes a leaf, and its error count J also gains a penalty factor, becoming J + 0.5. Whether the subtree can be pruned depends on comparing J + 0.5 with the subtree's error estimate plus its standard error. For the sample error rate we can assume a distribution model based on experience, such as a binomial distribution or a normal distribution.

Thus, for a given data set, count a misclassified sample as 1 and a correctly classified sample as 0. The subtree's misclassification rate e_1 can be computed from the training data, and the number of misclassifications made by the subtree then follows a binomial distribution, so we can estimate its mean and standard deviation:

E(subtree) = N * e_1
SD(subtree) = sqrt(N * e_1 * (1 - e_1))

where N is the number of samples reaching the subtree. When the subtree is replaced by a leaf node, the number of misclassifications of the leaf also follows a binomial scheme; with N samples reaching the leaf and error rate e_2 = (J + 0.5) / N, the expected number of errors of the leaf node is

E(leaf) = N * e_2

On the training data alone the subtree always makes no more errors than the leaf that replaces it, but with the corrected error estimates this is no longer guaranteed. We decide to prune when the corrected error count of the leaf does not exceed the subtree's error count plus one standard deviation:

E(leaf) ≤ E(subtree) + SD(subtree)

This condition is the pruning criterion.

In plain terms: check whether the error after pruning would become much larger, i.e. larger than the error before pruning plus its standard deviation. If so, do not prune; otherwise, prune. A concrete example of how the pruning decision is made follows.

For example, consider a sub-decision-tree in which T1, T2, T3, T4, T5 are non-leaf nodes and t6, t7, t8, t9, t10, t11 are leaf nodes. Here the total number of samples is N = 80, of which 55 belong to class A and 25 to class B.

Node   E(subtree)   SD(subtree)   E(subtree) + SD(subtree)   E(leaf)   Prune?
T1     8            2.68          10.68                      25.5      No
T2     5            2.14          7.14                       10.5      No
T3     3            1.60          4.60                       5.5       No
T4     4            1.92          5.92                       4.5       Yes
T5     1            0.95          1.95                       4.5       No

At this point only node T4 satisfies the pruning criterion, so we can prune at T4, i.e. replace T4 directly with a leaf node of class A.
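Using the numbers from the table, the decision for node T1 can be checked in a few lines. The function name is mine, and the sketch follows the worked example's convention of using the raw subtree error count and a +0.5 correction on the leaf.

```python
# Pessimistic-pruning check, following the worked example above.
from math import sqrt

def prune_decision(subtree_errors, n_samples, leaf_errors):
    e = subtree_errors / n_samples
    sd = sqrt(n_samples * e * (1 - e))     # standard deviation of the error count
    e_leaf = leaf_errors + 0.5             # leaf error count with the 0.5 penalty
    return e_leaf <= subtree_errors + sd   # True means: prune

# Node T1: 8 subtree errors over N = 80 samples; as a leaf of class A it would
# misclassify the 25 class-B samples.
print(prune_decision(subtree_errors=8, n_samples=80, leaf_errors=25))  # False: do not prune
```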

However, it is not necessary to use exactly one standard deviation. The method can be extended to pruning based on a confidence interval (CI), which models the error rate e of a leaf node as a random variable following a binomial distribution and takes as threshold the upper bound e_max of a confidence interval such that P(e < e_max) = 1 - CI (the default CI value in the C4.5 algorithm is 0.25); if P(e < e_max) > 1 - CI, then prune. Going one step further, we can approximate e with a normal distribution (as long as N is large enough). Under these constraints, the upper bound of the expected error used by the C4.5 algorithm, e_max (generally the Wilson score interval), is:

e_max = [ e + z^2 / (2N) + z * sqrt( e/N - e^2/N + z^2 / (4N^2) ) ] / ( 1 + z^2 / N )

Here z is chosen according to the desired confidence level, treating the error as a standard normal random variable with zero mean and unit variance, i.e. N(0, 1).
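A sketch of this upper bound: the observed error rate e and the quantile z (about 0.67 for the default 25% confidence level) are the only inputs, and the function name is mine.

```python
# Upper bound e_max of the error rate via the Wilson score interval (C4.5-style).
from math import sqrt

def e_max(e, n, z=0.6745):  # z ~ 0.67 corresponds to the default CI = 0.25
    return (e + z * z / (2 * n)
            + z * sqrt(e / n - e * e / n + z * z / (4 * n * n))) / (1 + z * z / n)

# Example: a leaf that misclassifies 2 of the 6 samples reaching it.
print(round(e_max(2 / 6, 6), 3))
```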

The Wilson score interval is chosen as the upper bound mainly because it retains good properties with small samples or data sets with extreme probabilities. For details see: http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Normal_approximation_interval

4.2 Handling of continuous data

Discretization: continuous attributes are discretized when building the decision tree, in three steps (a small sketch of the candidate thresholds follows this section):

1. Sort the samples (at the root node) or sample subset (at a subtree) in ascending order of the continuous attribute's value.
2. Suppose the attribute takes N distinct values in this set; then there are N - 1 candidate split thresholds, each being the midpoint of two consecutive values in the sorted sequence.
3. Select the best split threshold using the information gain rate.

4.3 Handling of missing values

Missing values: in some cases, the data available for use may be missing values for some attributes. For example, let (X, y) be a training instance in the sample set S, with X = (F1_v, F2_v, ..., Fn_v), but the value Fi_v of attribute Fi is unknown. Processing strategies:

1. One strategy is to assign to the instance the most common value of this attribute among the training instances at node t.
2. A more complex strategy is to assign a probability to each possible value of Fi. For example, for a Boolean attribute Fi, if node t contains 6 instances with Fi_v = 1 and 4 instances with Fi_v = 0, then the probability of Fi_v = 1 is 0.6 and the probability of Fi_v = 0 is 0.4. Accordingly, 60% of instance X is assigned to the Fi_v = 1 branch and 40% to the other branch. These fractional examples are used to compute the information gain; moreover, if a second attribute with missing values must be tested, the samples can be subdivided further in subsequent branches (this is what C4.5 uses).
3. The simplest strategy is to discard such samples.

4.4 Advantages and disadvantages of the C4.5 algorithm

Advantages: the resulting classification rules are easy to understand, and the accuracy is high.
Disadvantage: during tree construction, the data set must be scanned and sorted repeatedly, which makes the algorithm inefficient.
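The candidate-threshold step in 4.2 is easy to sketch; the helper name and the example values are illustrative assumptions.

```python
# Candidate split thresholds for a continuous attribute: the midpoints of
# consecutive distinct values after sorting (N distinct values -> N - 1 candidates).
def candidate_thresholds(values):
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

print(candidate_thresholds([64, 65, 68, 69, 70]))  # [64.5, 66.5, 68.5, 69.5]
```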

Blue whale

The decision tree algorithm classifies and predicts new data by learning from historical data. In simple terms, the decision tree algorithm analyzes historical data with known outcomes to find the characteristics of the data, and on that basis forecasts the outcomes for new data.

A decision tree consists of three main parts: decision nodes, branches, and leaf nodes. The decision node at the top of the tree is the root decision node. Each branch leads to a new decision node, and below the decision nodes are the leaf nodes. Each decision node represents a category or attribute of the data to be classified, and each leaf node represents an outcome. The whole decision process starts from the root decision node and proceeds from top to bottom; at each decision node a different branch is taken depending on how the data is classified, until a result is reached.

Constructing a decision tree is a complex task. Below we introduce the ID3 algorithm and the concept of "information entropy" used in decision trees, and build a simple decision tree by hand to illustrate the process and ideas behind the whole construction.

ID3 algorithm

There are many ways to construct a decision tree, and ID3 is one of them. The ID3 algorithm was first proposed by J. Ross Quinlan at the University of Sydney in 1975 as a classification prediction algorithm, with "information entropy" at its core. The ID3 algorithm regards the attribute with the highest "mutual information" as a good attribute: it computes the "information entropy" of each category or attribute in the historical data to obtain the "mutual information", chooses the category or attribute with the highest "mutual information" as the decision node, and splits into branches according to that attribute's values. This process is repeated until a complete decision tree is created.

The meaning of information entropy and related concepts

Information entropy is an important index in information theory, put forward by Shannon in 1948. Shannon borrowed the concept of entropy from thermodynamics to describe the uncertainty of information, so entropy in information theory is related to thermodynamic entropy. According to Charles H. Bennett's reinterpretation of Maxwell's demon, the destruction of information is an irreversible process, so destroying information complies with the second law of thermodynamics, while generating information introduces negative (thermodynamic) entropy into the system. The sign of information entropy is therefore opposite to that of thermodynamic entropy.

Simply put, information entropy is a measure of information; more precisely, it is an indicator of the uncertainty or degree of disorder of information. The greater the uncertainty of the information, the greater the entropy. The main factor determining this uncertainty is probability. Three entropy-related concepts are used in decision trees: information entropy, conditional entropy, and mutual information. Their meanings and calculation methods are described below.

Information entropy is the index used to measure the uncertainty of information in a single-variable model: the greater the uncertainty, the greater the entropy, and the main factor affecting it is probability. The single-variable model here refers to a single event, and the uncertainty is the possibility of the event having different outcomes. For example, a coin toss has two possible results, heads and tails, and the result of each toss is highly uncertain information: according to our experience or historical data, a fair coin comes up heads or tails with equal probability, 50%, so it is hard to judge whether the next toss will be heads or tails, and the entropy of the coin toss is high. If, however, the historical data tell us that this coin came up heads 99 times in the past 100 trials, the coin is uneven and the probability of heads is high; we can then easily judge the next result, and the entropy is very low, only about 0.08.

We can treat the coin toss as a random variable S, which can take 2 values, heads X1 and tails X2, with probabilities P1 and P2 respectively. To determine the outcome of S we need at least one test, and the number of tests is linked to the number of possible values of S (here 2) through the logarithm: log2(2) = 1 (base 2). So the formula for entropy is:

E(S) = -P1 * log2(P1) - P2 * log2(P2), and in general E(S) = -Σ Pi * log2(Pi)
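A two-line check of this formula reproduces the two coin examples mentioned above (1.0 for a fair coin, about 0.08 for a 99:1 coin); the helper name is mine.

```python
# Entropy of a two-outcome event: E(S) = -P1*log2(P1) - P2*log2(P2)
from math import log2

def entropy(p1, p2):
    return -p1 * log2(p1) - p2 * log2(p2)

print(entropy(0.5, 0.5))              # 1.0  -> fair coin, maximum uncertainty
print(round(entropy(0.99, 0.01), 2))  # 0.08 -> heavily biased coin, low uncertainty
```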

In the coin-toss case we use the probability of the single-variable model itself, i.e. the first 100 tosses of historical data, to reduce the uncertainty of the judgment. For many real-life problems, however, we cannot make a judgment from the event's own probability alone. For example, for weather, we cannot judge tomorrow's weather from the historical probabilities of sunny, rainy, or hazy days the way we do with a coin toss, because there are many kinds of weather and many factors that influence it. Similarly, for users of a website, we cannot determine whether a user will complete a purchase on the next visit from their historical purchase frequency alone: there is uncertainty in the user's purchase behavior, and more information is needed to reduce it, such as the user's historical behavior with respect to advertising creatives, promotions, commodity prices, delivery times, and so on. So we cannot judge and predict with a single-variable model only; we need more information, and we need a two-variable or higher-order model to understand the relationship between the user's purchase behavior and other factors in order to reduce the uncertainty. The metric that measures this relationship is called conditional entropy.

Conditional entropy reduces the uncertainty of a single-variable model by obtaining more information, i.e. by using a two-variable or multivariate model. The more information we know, the lower the uncertainty. For example, using only the single-variable model we cannot judge whether a user will buy this time from the purchase frequency in the user's history, because the uncertainty is too large. After adding information such as promotions and commodity prices, the two-variable model lets us find the connection between the user's purchases and promotions or price changes, and reduce the uncertainty through the probability of purchases occurring together with promotions and the probability of purchase under different promotions.

Conditional entropy is calculated from two kinds of probability: the joint probability P(c) of purchase occurring together with each promotion, and the conditional entropy E(c) of purchase when that promotion occurs. The formula for the conditional entropy E(T, X) is given below; the lower the conditional entropy, the smaller the uncertainty of the two-variable model.

E(T, X) = Σ_c P(c) * E(c)

Mutual information is an indicator used to measure the correlation between two pieces of information: when two pieces of information are fully correlated the mutual information is 1, and when they are unrelated it is 0. In the previous example, we can use mutual information to measure how strongly the user's purchases are correlated with promotions. It is computed as the difference between entropy and conditional entropy: the entropy E(T) of user purchases minus the conditional entropy E(T, X) of purchases given the promotion. The formula is:

IG(T, X) = E(T) - E(T, X)

Entropy, conditional entropy and mutual information are the three key indexes of constructing decision tree. Below we will illustrate the process of creating a decision tree using an example from Wikipedia.

Building a decision tree: an example

This is the historical data of a golf club, recording whether users played golf under different weather conditions. What we want to do is build a decision tree to predict whether a user will come to play golf. Here, whether the user plays is a single-variable model with high uncertainty and high entropy; we cannot judge whether a user will come tomorrow from the frequencies of Yes and No alone. We therefore need to use the weather information to reduce the uncertainty. Four weather attributes are recorded, and we begin the first step of building the decision tree by computing conditional entropy and mutual information: building the root decision node.

Building the root decision node

The way to build the root decision node is to find, among the 4 weather attributes, the one most strongly correlated with playing golf. First, let's look at the entropy of the single-variable model of play golf to see how uncertain it is.

Entropy of the single-variable model

In the single-variable model, predicting play golf from the probabilities in the historical data alone is very uncertain: in the 14 historical records, the probability of playing is 64% and the probability of not playing is 36%, and the entropy reaches 0.940. This is similar to the earlier coin-toss example. Since we cannot change the probabilities in the historical data, we need to use more information to reduce the uncertainty, which means computing the conditional entropy.

Conditional entropy of the two-variable model

Computing the conditional entropy of the two-variable model requires the joint probabilities of play golf with each of the 4 weather conditions, and the conditional probabilities of play golf under the different weather conditions. Let's calculate these two kinds of probability separately.

Joint probabilities

These are the joint probability values of the 4 weather conditions and play golf, calculated separately.

Conditional probabilities

At the same time, we calculate the conditional probabilities of play golf under the 4 different weather conditions. Combining the joint and conditional probabilities gives the conditional entropy between each of the 4 weather conditions and play golf.

Mutual information

Having obtained the entropy of the single-variable model of play golf and the conditional entropy of the two-variable models under the different weather conditions, we can use mutual information to measure which weather attribute is most relevant to play golf.

From the mutual information values it can be seen that, among the 4 weather attributes, Outlook has the largest value, indicating that Outlook is the most relevant to play golf. So we choose Outlook as the root node of the decision tree and build the tree from there.
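To make the joint/conditional-probability route concrete, here is a small sketch computing E(T), E(T, Outlook) and their difference (the mutual information) using the per-value yes/no counts given in the first half of this page (sunny 2/3, overcast 4/0, rainy 3/2); the helper names are mine.

```python
# Mutual information of play golf with Outlook via conditional entropy:
# IG(T, X) = E(T) - E(T, X), with E(T, X) = sum over values c of P(c) * E(T | X = c).
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

outlook = {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)}  # (yes, no) counts
total = 14

e_t = entropy([9, 5])                                               # 0.940
e_t_x = sum(sum(c) / total * entropy(c) for c in outlook.values())  # 0.694
print(round(e_t - e_t_x, 3))                                        # 0.246 -> mutual information
```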

Building the root node

In the whole decision tree, Outlook is the root node because it is the most relevant to play golf. With Outlook as the root node, the decision tree has three branches, corresponding to Outlook's three values: sunny, overcast, and rainy. Under overcast, play golf is always yes, so the leaf node of that branch is yes (see the branch decision nodes built later). For the other two branches we use the same method as before, computing entropy, conditional entropy, and mutual information to select the next decision node.

Building Branch decision nodes

Below we continue building the decision nodes for the sunny, overcast, and rainy branches. First look at the overcast node: it has only one outcome, so it does not need to be split further.

Building Branch Nodes

Outlook Node Overcast Branch

In the overcast branch under the Outlook root node, play golf has only one outcome, yes, so the overcast branch stops splitting and the value of its leaf node is yes.

Outlook Node Sunny Branch

The sunny branch under the Outlook root node forms a separate table. Because Outlook is already the root node of the decision tree, there are only 3 weather attributes left to consider, and we continue by determining the decision node for this table: identify the decision node under the sunny branch from the 3 remaining weather attributes. The method and procedure are the same as before: compute the entropy, conditional entropy, and mutual information, and take the attribute with the largest mutual information as the decision node of the sunny branch.

First, we compute the entropy of the single-variable model of play golf within the sunny branch. From the historical data of play golf itself in this branch, the probability distribution of no and yes is 40% and 60%, and the entropy is 0.971, so the uncertainty is very high. We therefore continue by computing the conditional entropy.

Below are the joint and conditional probabilities of the three remaining weather attributes with play golf. Wind is a bit special here: whenever wind is false, all values of play golf are yes.

The conditional entropies of the three weather attributes with play golf are calculated, and the value for wind is 0.

Mutual information

Calculate the mutual information of the three weather attributes with play golf, i.e. their relevance: the higher the value, the greater the correlation. Wind has the highest mutual information of the three, 0.971, indicating that within the sunny branch wind is the most relevant to play golf. So we choose wind as the decision node of the sunny branch.

Building Branch decision nodes (Windy)

Under the sunny branch of the Outlook root node, wind has the largest mutual information with play golf, so wind becomes the decision node for sunny. Wind has two branches, false and true: when wind is false the result of play golf is yes, and when wind is true the result is no.

Outlook Node Rainy Branch

The Outlook root node also has a rainy branch. Below is the data table for the rainy branch under Outlook, from which we select the decision node. Because Outlook is the root node of the decision tree and wind has become the decision node under the sunny branch, only two weather attributes, temp and humidity, remain to consider.

First calculate the entropy of play golf under the rainy branch. From the historical data, the probabilities of no and yes are 60% and 40%, the entropy is 0.971, and the uncertainty of the single-variable model depends on its own probabilities. We then add the two weather attributes and compute the conditional entropy.

As with the sunny branch, we compute the joint and conditional probabilities of the two weather attributes with play golf. Humidity appears to be highly relevant to play golf.

The conditional entropies of temp and humidity with play golf are calculated; the conditional entropy of humidity with play golf is 0.

Mutual information

Subtracting the conditional entropy of each of the two weather attributes from the entropy of play golf gives the mutual information. Humidity has the largest value, indicating the highest correlation, so humidity is selected as the decision node of the rainy branch.

Building Branch decision nodes (humidity)

Under the rainy branch of Outlook, humidity as a decision node has two branches, high and normal. All samples in the high branch correspond to play golf = no, and all samples in the normal branch correspond to play golf = yes, so the splitting stops.

So far we've built a decision tree with historical data on play golf and weather conditions. Below we look at the relationship between the entire decision tree and the historical data table from a higher dimension.

Data tables and decision Trees

By mapping each decision point in the decision tree back to the original data table, you can see that each decision point corresponds to a data table. Starting from the root decision node, we use entropy calculations to find the weather information most relevant to play golf, establish the decision point and its branches, and repeat this process until the complete decision tree is built.

Using Decision trees for forecasting

At the beginning of the article we said that decision trees are used for classification and prediction. The specific process is as follows. Once the decision tree is built, when new information arrives we use the logic of the existing tree to judge it: the new information is matched against the tree, it proceeds down the corresponding branch at each decision node, and the classification result is obtained at a leaf node. For example, at the start of a new day we can tell from the 4 weather attributes whether a user will play golf. The specific prediction process: first look up the new information at the root decision node Outlook; according to the value of Outlook, enter the sunny branch; in the sunny branch, check the next decision node, windy; the new information has windy = false, so according to the logic of the decision tree the result yes is returned. Therefore, under the forecast weather conditions in the new information, the user is predicted to come and play golf.
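A small sketch of this prediction walk, representing the decision tree built above as nested dictionaries (the representation and function name are my assumptions):

```python
# Walk the decision tree built in this article for a new day's weather.
tree = {
    "outlook": {
        "overcast": "yes",
        "sunny": {"windy": {"false": "yes", "true": "no"}},
        "rainy": {"humidity": {"high": "no", "normal": "yes"}},
    }
}

def predict(node, sample):
    if isinstance(node, str):        # reached a leaf: return the class
        return node
    attribute = next(iter(node))     # the attribute tested at this decision node
    branch = node[attribute][sample[attribute]]
    return predict(branch, sample)

new_day = {"outlook": "sunny", "windy": "false", "humidity": "high", "temp": "mild"}
print(predict(tree, new_day))  # yes -> the user is predicted to come and play golf
```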

Increasing accuracy with a random forest

Decision trees are built from known historical data and probabilities, so the prediction of a single decision tree may not be very accurate. The best way to improve accuracy is to build a random forest. A random forest is created by repeatedly sampling the historical data table at random to generate multiple sampled tables, and building one decision tree for each sampled table. Because the data are put back into the full table after each sample is drawn, the decision trees are independent of one another. The multiple decision trees together form a random forest. When new data arrive, every decision tree in the forest makes its own judgment, and the result with the most votes is taken as the final judgment, which improves the probability of being correct.
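A minimal sketch of this bagging-and-vote idea: since the golf table itself is not reproduced on this page, it uses scikit-learn decision trees on a stand-in numeric dataset, and every name here is an assumption rather than part of the original article.

```python
# Random forest by hand: bootstrap samples (drawn with replacement), one tree per
# sample, and a majority vote over the trees' predictions.
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)        # stand-in dataset
rng = np.random.default_rng(0)

trees = []
for _ in range(25):                      # 25 independent trees
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap: sample rows with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

def forest_predict(sample):
    votes = [int(t.predict([sample])[0]) for t in trees]
    return Counter(votes).most_common(1)[0][0]  # class with the most votes wins

print(forest_predict(X[0]))
```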
